The Data Masker
Monday, March 24, 2014
The Price for Non-Compliance
Saturday, March 1, 2014
Masking? Encryption? Confusion of Sorts.
You have probably heard these phrases before. What exactly do they tell us?
Data is sensitive. During its lifetime, it passes through many “hands” and is seen by many
“eyes”. When we want to protect sensitive data from exposure, we need to understand whom we
are protecting against and where the points of exposure are.
We start with the following use case: we inserted data into a system and our data is saved on the
disk. Who do we not want to see our data and why? First, the malicious outsiders.
The very first data we usually protect are the logins and passwords, because these pieces provide the “keys to
the locks”. We secure them with encryption. If we want to be more cautious, we encrypt all the
sensitive data on the disk – to protect against the case of theft or loss of the storage device itself.
This is especially relevant in light of the BYOD – bring your own device to work – trend. Even
the most cautious among us often leave laptops or tablets in cars unattended. The majority of us
are not intelligence agency operatives and are not trained to never leave a trace behind. The best
protection against a malicious outsider is ENCRYPTION.
However, almost 40% of data fraud is committed not by outsiders, but by insiders. These cases
involve people accessing the data across the whole spectrum: from the CxO office with insider
trading cases to unscrupulous or naïve developers. Few developers, of course, are unscrupulous
or naïve. Yet, unfortunately, breaches do happen. All of us are well aware of the latest case of
an “insider threat”: the case of Mr. Snowden. Regardless of his intentions, he demonstrated that
a developer, bound only contractually, has unrestricted access to data and, as such, can present
a threat. Encryption does not protect against insiders. A CxO sees the data naked because
s/he works with it in production. The developer sees encrypted data, be it in production or
non-production. Now guess who has the encryption keys when production data is recreated in
development, as is necessary in many scenarios? Yes, you guessed it right: the developer does!
The only protection against an unscrupulous CxO is legal recourse. However, there is both legal
recourse and technological protection against an unscrupulous or “naïve” developer - DATA
MASKING. Masked data retains the look and functionality of real data. It fits the field size,
passes unit tests and yields realistic numbers in performance testing, as it would in the production
environment.
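To make "retains the look and functionality" concrete, here is a minimal sketch in Python. It is an illustration only, not how any particular masking product works: the function name `mask_value` and the sample value are invented for this example. Each digit is replaced with a random digit and each letter with a random letter of the same case, while separators are kept, so the masked value has the same length and layout as the original.

```python
import random
import string


def mask_value(value: str, rng: random.Random) -> str:
    """Replace every character with a random one of the same class.

    Hypothetical helper for illustration. Nothing is stored that would
    allow the original value to be recovered.
    """
    out = []
    for ch in value:
        if ch.isdigit():
            out.append(rng.choice(string.digits))
        elif ch.isalpha():
            pool = string.ascii_uppercase if ch.isupper() else string.ascii_lowercase
            out.append(rng.choice(pool))
        else:
            out.append(ch)  # keep separators so format checks still pass
    return "".join(out)


# Example: an invented identifier in NNN-NN-NNNN layout.
masked = mask_value("555-12-3456", random.Random())
print(masked)
```

A real tool would also have to respect checksums, uniqueness and referential integrity, which this sketch deliberately ignores.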
With data masking, the only data that has real value is data in the production environment. The
environments outside of production have fake data that has significantly less value on the black
market. Data masking is NOT encryption but rather a "one-way street" to removing sensitive information.
The bottom line: fewer people have access to real data when we use data masking.
Developers do not have access to sensitive information, be it encrypted or not.
In the next blog posts, we will be talking about data at rest vs. data in transit, production data
masking scenarios, as well as how we decide on data sensitivity.
Sunday, November 17, 2013
Database Design: Data Masking must be a criterion
- Agile Database Techniques, Wiley, 2003
- Refactoring Databases, Evolutionary Database Design, Addison Wesley, 2006.
These books have been invaluable in designing and implementing effective databases, but what is missing from all of them is any discussion of data masking.
I tend to be of the old rigorous school for green-field database development: create a fully normalized logical database model, ideally pushing the model to a full 6th normal form. After this is done, denormalize to a physical database model that addresses the customer's performance and ease-of-use criteria. I believe an additional criterion needs to be included in the logical-to-physical data model activity: data masking.
Two important criteria to consider are:
- Strive to make masking possible at a column-atomic level; if columns in a table are correlated, move the correlated columns into adjunct tables.
- Complete avoidance of natural keys.
Column Level Masking and Correlation
If a column has no dependencies on other columns, you are probably in normal form and have no hard dependencies. There is also a soft dependency, which I term column-correlation and define as:
- The value in column A results in a subset of values being valid for another column B.
In most cases, the columns will be category columns. For example: Gender and Method of Address are correlated:
- F –> Ms, Mrs, Dr, Prof
- M –> Mr, Dr, Prof
- U –> Dr, Prof
A randomizer that masks each column independently will produce invalid pairs such as M with Ms, or F with Mr (ignoring issues with transgender individuals).
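The difference between a naive randomizer and a correlation-aware one can be sketched in a few lines of Python. The mapping below comes from the list above; the function names (`mask_independently`, `mask_correlated`, `is_valid`) are made up for this illustration and do not refer to any product.

```python
import random

# Valid methods of address per gender code, as listed above.
VALID_TITLES = {
    "F": ["Ms", "Mrs", "Dr", "Prof"],
    "M": ["Mr", "Dr", "Prof"],
    "U": ["Dr", "Prof"],
}


def is_valid(gender, title):
    """True when the title is allowed for the given gender code."""
    return title in VALID_TITLES[gender]


def mask_independently(rng):
    """Naive approach: pick each column on its own, ignoring the correlation."""
    all_titles = sorted({t for titles in VALID_TITLES.values() for t in titles})
    return rng.choice(sorted(VALID_TITLES)), rng.choice(all_titles)


def mask_correlated(rng):
    """Pick the gender first, then a title from the subset valid for it."""
    gender = rng.choice(sorted(VALID_TITLES))
    return gender, rng.choice(VALID_TITLES[gender])


rng = random.Random()
print(mask_independently(rng))  # may be an invalid pair such as ('M', 'Ms')
print(mask_correlated(rng))     # always a pair that passes is_valid
```

The correlated version treats the two columns as a single unit, which is the idea behind the adjunct tables discussed in this post.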
This situation and standard forms in data modeling have a painful co-existence. Traditionally an address will be decomposed into atomic components such as zip code, city, state, county, address line 1, address line 2, etc. Ideally, it would also have a column indicating whether this is a postal address or a delivery-service address. When it is time to mask these columns, a host of issues arise because these columns are correlated.
A typical issue is sales tax calculation. Sales tax calculations use one or more addresses (destination, shipper, billing) to calculate the sales tax. If you mask one column, the address may be deemed invalid and the call for tax may fail. For example, my home zip code covers two counties and three towns/cities. There are a few cases where a community crosses an international border: 123 Main Street may be in the US, while 223 Main Street may be in Canada.
My approach is to break out these correlated columns into molecules as separate adjunct tables using surrogate keys. This allows an easy shuffling of the keys, or the substitution of the rows with alternative valid records.
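As a rough sketch of the key-shuffling idea in Python, assume a hypothetical `addresses` adjunct table and a `customers` table that refers to it only through a surrogate `address_id`. All table names, column names and values here are invented placeholders. Permuting the surrogate keys among the customer rows leaves every customer pointing at a complete, internally consistent address.

```python
import random

# Adjunct table: each row is one internally consistent address "molecule".
# All values are invented placeholders.
addresses = {
    101: {"line1": "1 Example St", "city": "Townsville", "state": "AA", "zip": "00001"},
    102: {"line1": "2 Sample Ave", "city": "Villageton", "state": "BB", "zip": "00002"},
    103: {"line1": "3 Placeholder Rd", "city": "Hamlet", "state": "CC", "zip": "00003"},
}

# Main table: refers to an address only through the surrogate key.
customers = [
    {"customer_id": 1, "address_id": 101},
    {"customer_id": 2, "address_id": 102},
    {"customer_id": 3, "address_id": 103},
]


def shuffle_address_keys(rows, rng):
    """Return copies of rows with the address_id values permuted among them."""
    keys = [row["address_id"] for row in rows]
    rng.shuffle(keys)
    return [{**row, "address_id": key} for row, key in zip(rows, keys)]


masked = shuffle_address_keys(customers, random.Random())
for row in masked:
    print(row["customer_id"], addresses[row["address_id"]])
```

Note that a plain shuffle can leave some rows pointing at their original address, and the addresses themselves are still real; substituting the adjunct rows with alternative valid records, as mentioned above, addresses the second point.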
Avoidance of natural keys
Surrogate keys should be used for all referential integrity: foreign keys, primary keys and alternate keys. A natural key, such as a two-character state code (WA, CA, etc.), may be tempting to use, but it then means that data masking may not be column-atomic. I have found that some shops are aggressive on this point, requiring all referential integrity to use GUIDs/unique identifiers. This slightly extreme approach has some advantages, because enumerations are often saved as integers, creating a quasi-natural key that causes unforeseen problems with data masking.
Other Data Modeling Criteria for Data Masking?
Readers may wish to suggest other criteria. There are several more that I know of, but they occur rarely, so I will not burden the reader with academic issues.
Tuesday, October 29, 2013
Data Masking: the problem and attitude to the problem
This is the first of a series of posts dealing with data-masking. There are many software providers in this specialized industry. Some of the providers are also database providers attempting to capture this additional add-on market. Big companies creating add-ons traditionally tend to produce only a basic version lacking key features. These key features will often be provided by a company whose sole business is data-masking. These companies always strive to differentiate themselves by providing more at less cost.
The primary motivators for data-masking are government laws and regulations. A few examples are:
- Sarbanes-Oxley
- Payment Card Industry (PCI) Data Security Standard (DSS)
- Health Insurance Portability and Accountability Act (HIPAA)
- EU Data Protection Directive
- General Data Protection Regulation
The core reasons for this protection are typically:
- Protecting an individual's privacy
- Protecting an organization's privacy. Corporations and most organizations are deemed "persons" in most of the western world.
- Preventing the release of information that may assist insider trading of stock (or equivalent)
- Preventing unfairness in the marketplace: for example, exposing a firm's customers, what they ordered, and the price actually charged for goods.
For several years I worked for Patchlink and Lumension Security, where I was their representative to the Security Content Automation Protocol and other initiatives sponsored by the National Institute of Standards and Technology (NIST). NIST activities have yet to expand to data masking, but such action is expected in the next few years. NIST has produced only one related paper, "Guide to Protecting the Confidentiality of Personally Identifiable Information (PII)", SP 800-122 (Apr 2010), which is worth reading. If you are the data masking owner in your organization, this is not an optional read; the contents would carry considerable legal weight as "normal or expected practices" because the source is NIST.
A Higher Level of Protecting Information
Often management in corporate America takes a minimalist approach: "if it follows the general advice of my data-masking provider, it is good enough", which translates to "if things go bad, I want to be absolved of responsibility and have some other party be guilty of not doing their job". With tenure in most IT jobs being short before a move to the next position at a different company, this approach is a safe bet for the manager (but may not be a good bet for the company). My temperament is to be very proactive: I wish to prevent data exposure from ever happening, be it on my shift or after my shift.
Looking at best practices for Data Masking is one of the goals of this blog.