Good Normalization, Bad Normalization

Journal einhverfr's Journal: Good Normalization, Bad Normalization 4

Journal by einhverfr on Saturday December 01, 2007 @07:44PM

I have recently been in a discussion on the LedgerSMB development list over the value of normalization of our database and where I feel that we need to go. This lead to a fairly heated debate between those of us who work on the database back-end in part because a lot of databases are sometimes normalized in ways which are not practical. My view, which I still hold, is that normalization is always a good thing if done properly, but that sometimes people try to do recipe-book normalization without understanding the process and create horrible messes.

For people who don't know what normalization of a relational database schema involves, this is the process by which data dependencies are analyzed and tables broken down in such a way as to duplicate as little data as possible. As one progresses through the normal forms (First Normal Form, Second Normal Form, Third Normal Form, Boyce/Codd Normal Form, Forth Normal Form, and Fifth Normal Form) one breaks up tables into smaller entities so that single facts are stored in one place only (and therefore are less likely to get out of date). Just as good object-oriented design breaks up data structures along lines on dependencies of program flow (so that proper encapsulation is possible), good relational design breaks up data relations along the lines of dependencies on data (so that central data management is possible).

In my view, relational data modelling is a fairly straight-forward mathematical exercise. And in general, normalizing to third normal form does not cause people to make too many mistakes. However, further normalization is fraught with hazards because many database engineers try to apply some method of arriving at a model of data which does not take into account its own structure. In many cases (key-value modelling, for example) these approaches are foreign to the relational model and hence result in problems. I call these messes "misnormalized databases." In most of these cases, though, the best option is to avoid such ideas and stick with table structures that best represent the structure and dependencies of your data.

When one progresses beyond third normal form, one of the problems that one encounters is the fact that it is not possible to know how normalized a table is without knowing how the table is used and what real-world constraints are on the data (data may exist independent of any given program flow, but it does not exist independent of its use). In general the places where one runs into trouble are:

Tables where the information in them is no longer sufficient to derive any authoritative statement from the record and those records noted via foreign keys specifically in that table.
Tables normalized in a way which makes it more difficult to enforce real-world data constraints (this usually means you are missing real data dependencies).
Tables normalized for the sole purpose of reducing duplicate information without addressing internal data semantics.

Additionally, relational modeling allows for easier enforcement of data constraints in data not subject to prior data structure constraints (accounting data in particular does not allow for all constraints to be specified in relational terms in the current generation of RDBMS's). Hence things like addresses, location data, and customer information are more easily stored and checked than financial data (one area where I see room for improvement in RDMBS's).

In general, normalization is always a win if it is done well. If it is done badly without proper analysis of real world data constraints, you end up with a mess.

This discussion has been archived. No new comments can be posted.