×

Welcome to the Slashdot Beta site -- learn more here. Use the link in the footer or click here to return to the Classic version of Slashdot.

Thank you!

Before you choose to head back to the Classic look of the site, we'd appreciate it if you share your thoughts on the Beta; your feedback is what drives our ongoing development.

Beta is different and we value you taking the time to try it out. Please take a look at the changes we've made in Beta and  learn more about it. Thanks for reading, and for making the site better!

Good Normalization, Bad Normalization

einhverfr (238914) writes | more than 6 years ago

User Journal 4

I have recently been in a discussion on the LedgerSMB development list over the value of normalization of our database and where I feel that we need to go. This lead to a fairly heated debate between those of us who work on the database back-end in part because a lot of databases are sometimes normalized in ways which are not practical. My view, which I still hold, is that normalization is always a good thing if done properly, but that sometimes people try to do recipe-book

I have recently been in a discussion on the LedgerSMB development list over the value of normalization of our database and where I feel that we need to go. This lead to a fairly heated debate between those of us who work on the database back-end in part because a lot of databases are sometimes normalized in ways which are not practical. My view, which I still hold, is that normalization is always a good thing if done properly, but that sometimes people try to do recipe-book normalization without understanding the process and create horrible messes.

For people who don't know what normalization of a relational database schema involves, this is the process by which data dependencies are analyzed and tables broken down in such a way as to duplicate as little data as possible. As one progresses through the normal forms (First Normal Form, Second Normal Form, Third Normal Form, Boyce/Codd Normal Form, Forth Normal Form, and Fifth Normal Form) one breaks up tables into smaller entities so that single facts are stored in one place only (and therefore are less likely to get out of date). Just as good object-oriented design breaks up data structures along lines on dependencies of program flow (so that proper encapsulation is possible), good relational design breaks up data relations along the lines of dependencies on data (so that central data management is possible).

In my view, relational data modelling is a fairly straight-forward mathematical exercise. And in general, normalizing to third normal form does not cause people to make too many mistakes. However, further normalization is fraught with hazards because many database engineers try to apply some method of arriving at a model of data which does not take into account its own structure. In many cases (key-value modelling, for example) these approaches are foreign to the relational model and hence result in problems. I call these messes "misnormalized databases." In most of these cases, though, the best option is to avoid such ideas and stick with table structures that best represent the structure and dependencies of your data.

When one progresses beyond third normal form, one of the problems that one encounters is the fact that it is not possible to know how normalized a table is without knowing how the table is used and what real-world constraints are on the data (data may exist independent of any given program flow, but it does not exist independent of its use). In general the places where one runs into trouble are:

  1. Tables where the information in them is no longer sufficient to derive any authoritative statement from the record and those records noted via foreign keys specifically in that table.
  2. Tables normalized in a way which makes it more difficult to enforce real-world data constraints (this usually means you are missing real data dependencies).
  3. Tables normalized for the sole purpose of reducing duplicate information without addressing internal data semantics.

Additionally, relational modeling allows for easier enforcement of data constraints in data not subject to prior data structure constraints (accounting data in particular does not allow for all constraints to be specified in relational terms in the current generation of RDBMS's). Hence things like addresses, location data, and customer information are more easily stored and checked than financial data (one area where I see room for improvement in RDMBS's).

In general, normalization is always a win if it is done well. If it is done badly without proper analysis of real world data constraints, you end up with a mess.

4 comments

Example beyond 3NF? (1)

cerberusss (660701) | more than 6 years ago | (#21548241)

When one progresses beyond third normal form
Can you give me an example where you think it's appropriate to progress beyond 3NF? Which part of LedgerSMB its database has been normalized further than that?

Oh and keep up the good work :-)

Re:Example beyond 3NF? (1)

einhverfr (238914) | more than 6 years ago | (#21548387)

In the areas of the database I would like to see further normalized. There are some open questions about addresses which have not been addressed or fully reviewed/discussed. In short addresses probably could be more normalized but it is not clear yet how this should be done. My guess is that we will see a gradual normalization in this area as data dependencies become more fully understood and addressed. Part of the problem with addresses is that there are a lot of corner cases which need to be considered before going too far in this direction. We already break out countries (and I expect that at some point we will break out cities and states as well), but zip codes (or the like) pose real problems.

Re:Example beyond 3NF? (1)

cerberusss (660701) | more than 6 years ago | (#21550271)

Heh yeah, addresses. It sounds simple until you try to gather _all) requirements. I've seen the way the Oracle eBusiness suite stores addresses. It's about 20 tables or so. :-) Thanks for the answer by the way.

Re:Example beyond 3NF? (1)

einhverfr (238914) | more than 6 years ago | (#21548443)

Sorry, forgot to give an example of something where one should go beyond 3NF.

The basic issue is that normal forms tend to handle questions of atomic facts-- if a row isn't an atomic fact, the table structure can be broken up into smaller tables.

Here is an example of 3NF which should probably be considered to be separate:

CREATE TABLE street_address (
      address text,
      city text,
      state text,
      country text,
      primary key (address, city, state)
);

In this case, country is dependent on state which is a part of the candidate key. Hence although it is 3NF, it is not BCNF.

Better, would be:
CREATE TABLE country (
      name text,
      id int not null unique, -- secondary key
      iso_code char(2) not null,
      primary key (name)
);

CREATE TABLE state_province (
        name text,
        country_id int references country(id),
        id int not null unique, -- secondary key
        primary key (name, country_id)
);

CREATE TABLE address (
        address text,
        city text,
        state_id int references state_province (id),
        id int not null unique, -- secondary key
        primary key(address, city, state_id)
);

Technically because a given entity (city, state_id) is an atomic fact, it probably should be broken out into another table too.
Check for New Comments
Slashdot Account

Need an Account?

Forgot your password?

Don't worry, we never post anything without your permission.

Submission Text Formatting Tips

We support a small subset of HTML, namely these tags:

  • b
  • i
  • p
  • br
  • a
  • ol
  • ul
  • li
  • dl
  • dt
  • dd
  • em
  • strong
  • tt
  • blockquote
  • div
  • quote
  • ecode

"ecode" can be used for code snippets, for example:

<ecode>    while(1) { do_something(); } </ecode>
Sign up for Slashdot Newsletters
Create a Slashdot Account

Loading...