Programmers hate deleting things; weā€™ve all felt that feeling in the pit our stomach when we realise that thing we deleted was really deleted, and on the other hand the security of deleting some unused code, safe in the knowledge that itā€™s still really there in version control. In the sphere of databases, this terror of deleting things leads people to advocate soft deletion: instead of really deleting a record, you add a field which marks the record as deleted, and you treat any record marked in that way as if it were deleted. This is generally a bad idea, and there are a number of better ways of ensuring access to old data.

The main problem with soft deletion is that youā€™re systematically misleading the database. This is most obvious in the case of foreign key constraints. If itā€™s important to your data integrity that some column in one table references a row that exists in another table, a foreign key constraint instructs the database to enforce that by, among other things, preventing the deletion. A similar issue arises with unique constraints: conceptually, uniqueness probably shouldnā€™t consider deleted items (say a user deletes their account, and later wants to sign up again with the same email address), but the database doesnā€™t know that. You also face difficulties with modifying the database structure, as you have to consider, not just how these modification affect live data, but how it would affect historical data, where the change might not make sense. You can mitigate some of these problems by carefully crafting triggers and constraints to take into account the deletion marker, but that adds significant complexity.

Because youā€™re misleading the database, you also encounter problems with querying the data. You can no longer rely on SELECTs to only return live data; you have to remember to check the deletion marker in each WHERE clause (and in each JOIN; I nearly forgot about that while writing this, a reflection of the many times Iā€™ve forgotten about it while implementing this). You can potentially avoid this by creating views that filter out the deleted rows, or you can outsource the problem to an ORM (with something like Hibernateā€™s @Where annotation which can add the deletion marker to its constructed queries). But these are fragile and, especially for the ORM case, leaky abstractions; itā€™s easy to make a mistake and end up querying soft deleted data, or for the ORM to make a mistake and try and query data which it then fails to find because it thinks its deleted.

Of course, the extra complexity introduced by soft deletion isnā€™t necessarily a reason to avoid it; perhaps itā€™s a price worth paying in the noble quest to never lose data. But itā€™s worth thinking about how much soft deletion actually helps us here. The thing is, in the common case of a database thatā€™s the live storage for an application, we remove data all the time ā€” not by deleting it, but by overwriting it. Here, the database represents the current state of the application, not an archive (for something like a data warehouse, which is frequently added to but less frequently mutated, different rules apply, and is one of the cases where my ā€œgenerallyā€ above might not apply). So soft deletion, as well as being complex, might be insufficient. Soft deletion is often touted as a ā€œbest practiceā€ and implemented without much consideration; but we might do better to consider exactly what data permanence we want to achieve, and adopt an approach which more precisely meets those needs.

continue reading on www.cultured.systems

āš ļø This post links to an external website. āš ļø