Sunday, November 17, 2013

Database Design: Data Masking must be a criteria

I started reading the works of C.J.Date on RDBMS back in the late 1970’s. While I have kept up with his later writings, I have found that Scott Ambler books are very good practical books, namely:

  • Agile Database Techniques, Wiley, 2003

  • Refactoring Databases, Evolutionary Database Design, Addison Wesley, 2006.

These books have been invaluable in designing and implementing effective databases but what is missing from all of them are discussion of data masking.

I tend to be of the old rigorous school for green-field database development: create a fully normalized logical database model, ideally pushing the model to a full 6th normal form. After this is done, denormalized to a physical database model that addresses performance and ease of use criteria of the customer. I believe an additional criteria needs to be included in the logical-to-physical data model activity, inclusion of data masking criteria.

Two important criteria to consider are:

  • Strive to have masking possible at a column-atomic level, if there is column-correlation then use adjunct tables for the columns that are correlated in a table.

  • Complete avoidance of natural keys.

Column Level Masking and Correlation

If there are no dependencies on other columns, you are probably in normal form and have no hard-dependencies. This is also a soft dependency which I term column-correlation which I define as:

  • The value in column A results in a subset of values being valid for another column B.

In most cases, the columns will be category columns. For example: Gender and Method of Address are correlated:

  • F –> Ms, Mrs, Dr. Prof.

  • M –> Mr. Dr. Prof.

  • U –> Dr. Prof.

A randomizer will result in M- Ms, or F-Mr (ignoring issues with transgender)

This situation and standard forms in data modeling have a painful co-existence. Traditionally an address will be decomposed into atomic components such as zip code, city, state, county, address line 1, address line 2, etc. Ideally, it would also have a column indicating if this was a postal address or the delivery service address. When it is time to mask  these columns a host of issue arise because these columns are correlated.

A typical issue is sales  tax calculations. Sales tax calculations uses one or more  addresses (destination, shipper, billing) to calculate the sales tax. If you masked one column then the address may be deemed invalid and the call for tax may fail. For example, my home zip code covers two counties and 3 towns/cites. There are a few cases where a community crosses an international border. 123 Main Street may be in the US, 223 Main Street may be in Canada.

My approach is to break out these correlated columns into molecules as separate adjunct tables using surrogate keys. This allows an easy shuffling of the keys, or the substitution of the rows with alternative valid records.

Avoidance of natural keys

Surrogate keys should be used for all referential integrity, foreign keys, primary key and  alternative primary keys. A natural key, such as a two character state (WA,CA, etc) may be tempting to use, but it then means that data masking may not be column-atomic. I have found some shops have been aggressive on this point by requiring all referential integrity to use GUID/Unique Identifiers. This slightly extreme approach has some advantages because often enumerations are saved as integers creating a quasi-natural key that creates unforeseen problems with data masking.

Other Data Modeling Criteria for Data Masking?

Readers may wish to suggest other criteria, there are several more that I know of, but they occur rarely so I will not burden the reader with academic issues.