Sunday, November 17, 2013

Database Design: Data Masking must be a criteria

I started reading the works of C.J.Date on RDBMS back in the late 1970’s. While I have kept up with his later writings, I have found that Scott Ambler books are very good practical books, namely:


  • Agile Database Techniques, Wiley, 2003

  • Refactoring Databases, Evolutionary Database Design, Addison Wesley, 2006.

These books have been invaluable in designing and implementing effective databases but what is missing from all of them are discussion of data masking.

I tend to be of the old rigorous school for green-field database development: create a fully normalized logical database model, ideally pushing the model to a full 6th normal form. After this is done, denormalized to a physical database model that addresses performance and ease of use criteria of the customer. I believe an additional criteria needs to be included in the logical-to-physical data model activity, inclusion of data masking criteria.

Two important criteria to consider are:


  • Strive to have masking possible at a column-atomic level, if there is column-correlation then use adjunct tables for the columns that are correlated in a table.

  • Complete avoidance of natural keys.


Column Level Masking and Correlation


If there are no dependencies on other columns, you are probably in normal form and have no hard-dependencies. This is also a soft dependency which I term column-correlation which I define as:


  • The value in column A results in a subset of values being valid for another column B.

In most cases, the columns will be category columns. For example: Gender and Method of Address are correlated:


  • F –> Ms, Mrs, Dr. Prof.

  • M –> Mr. Dr. Prof.

  • U –> Dr. Prof.

A randomizer will result in M- Ms, or F-Mr (ignoring issues with transgender)

This situation and standard forms in data modeling have a painful co-existence. Traditionally an address will be decomposed into atomic components such as zip code, city, state, county, address line 1, address line 2, etc. Ideally, it would also have a column indicating if this was a postal address or the delivery service address. When it is time to mask  these columns a host of issue arise because these columns are correlated.

A typical issue is sales  tax calculations. Sales tax calculations uses one or more  addresses (destination, shipper, billing) to calculate the sales tax. If you masked one column then the address may be deemed invalid and the call for tax may fail. For example, my home zip code covers two counties and 3 towns/cites. There are a few cases where a community crosses an international border. 123 Main Street may be in the US, 223 Main Street may be in Canada.

My approach is to break out these correlated columns into molecules as separate adjunct tables using surrogate keys. This allows an easy shuffling of the keys, or the substitution of the rows with alternative valid records.


Avoidance of natural keys


Surrogate keys should be used for all referential integrity, foreign keys, primary key and  alternative primary keys. A natural key, such as a two character state (WA,CA, etc) may be tempting to use, but it then means that data masking may not be column-atomic. I have found some shops have been aggressive on this point by requiring all referential integrity to use GUID/Unique Identifiers. This slightly extreme approach has some advantages because often enumerations are saved as integers creating a quasi-natural key that creates unforeseen problems with data masking.


Other Data Modeling Criteria for Data Masking?


Readers may wish to suggest other criteria, there are several more that I know of, but they occur rarely so I will not burden the reader with academic issues.

Tuesday, October 29, 2013

Data Masking: the problem and attitude to the problem

This is the first of a series of posts dealing with data-masking. There are many software providers in this specialized industry. Some of the providers are also database providers attempting to capture this additional add-on market. There is a traditional tendency for these big companies creating add- ons to produce only a basic version lacking key features. These key features will often be provided by a company whose sole business is data-masking. These companies always strive to differentiate themselves by providing more at less cost.

The primary motivators for data-masking are government laws and regulations. A few examples are:

The core reasons for this protection are typically:

  • Protecting an individual's privacy
  • Protecting an organization's privacy. Corporations and most organizations are deemed "persons" in most of the western world.
  • Preventing information being available that may assist inside trading of stock (or equivalent)
  • Preventing unfairness in the marketplace: for example, exposing a firm's customers, what they ordered, and the price actually charged for goods.

For several years I was working for Patchlink, and Lumension Security. I was their representative to the Security Content Automation Protocol and other initiatives sponsored by the National Institute of Standards and Technology (NIST). NIST activities have yet to expand to data masking, but such action is expected in the next few years. NIST has produced only one related paper, "Guide to Protecting the Confidentiality of Personally Identifiable Information (PII)" SP-800-122 (Apr 2010), which is worth reading. If you are the data masking owner in your organization, this is not an optional read;; the contents would have considerable legal weight as "normal or expected practices" because the source is NIST.

A Higher Level of Protecting Information

Often management in corporate America takes a minimalist approach "if it follows the general advise of my data-masking provider, it is good enough" which translate to, "if things goes bad, I want to be absolved of responsibility and have some other party guilty of not doing their job". With the duration of time in most IT jobs being short before moving on to the next position in a different company, this approach is a safe bet for the manager (but may not be a good bet for the company). I am of the temperament of being very pro-active and wish to prevent data exposure ever happening;; be it on my shift, or after my shift.

Looking at best practices for Data Masking is one of the goals of this blog.