Faking data for security sounds counterintuitive, yet it is an important strategy across data platforms. Formally known as data masking, it produces a functional substitute for a data set in which the values look different from the original. The market for data masking is projected to reach USD 1,268 million by 2028, which makes tool selection a genuine differentiator. Before discussing the tools, let’s look at what data masking is all about and the top techniques.
Data masking creates a ‘copy’ of a data set that retains the original structure but alters the values. It is an important technique for protecting sensitive data by making it unidentifiable and unusable to unauthorized users.
Data masking is also important for ensuring consistency and usability across multiple databases. The purpose of creating a functional substitute is to use it for QA, user training and demonstrations without revealing the actual values.
Importance of data masking
Masking data to protect sensitive information delivers the following benefits:
• It enables organizations to stay compliant with regulations such as GDPR by eliminating the risk of sensitive data exposure, which in turn offers a competitive edge over others.
• It ensures end-to-end security and makes the data useless even if hackers access it.
• It eliminates the risk of exposure when sharing data with third-party applications.
Moreover, organizations engaged in outsourced partnerships run a continuous risk of exposing their data to a third party. With masking, they can proceed with confidence. Among many, the following are the most common types of data where masking is widely applied:
• Protected health information
• Personally Identifiable Information
• Payment Card Information
• Intellectual property
Top Data Masking Techniques
While there are many on the list, I am narrowing it down to the most important ones:
Encryption
As the name suggests, this technique uses an encryption algorithm to mask the data, and only the corresponding key can decrypt it. The data is secure as long as the key is held exclusively by authorized users; if the key falls into the wrong hands, the data can be exposed.
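As a minimal illustration of masking via encryption, the toy sketch below XORs a value against a keystream derived from a key with SHA-256. This is for demonstration only, not production-grade cryptography; real masking tools use vetted ciphers such as AES through a dedicated library.

```python
import hashlib

def _keystream(key: str, length: int) -> bytes:
    """Derive a keystream of the requested length from the key.
    Toy construction for illustration only -- not a vetted cipher."""
    stream = b""
    counter = 0
    while len(stream) < length:
        stream += hashlib.sha256(f"{key}:{counter}".encode()).digest()
        counter += 1
    return stream[:length]

def toy_encrypt(value: str, key: str) -> bytes:
    """Mask a value by XOR-ing it with the keystream."""
    data = value.encode()
    return bytes(b ^ s for b, s in zip(data, _keystream(key, len(data))))

def toy_decrypt(masked: bytes, key: str) -> str:
    """Recover the original value -- only possible with the right key."""
    return bytes(b ^ s for b, s in zip(masked, _keystream(key, len(masked)))).decode()
```

Only holders of the key can reverse the masking; to anyone else the masked bytes carry no usable information.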
Scrambling
Scrambling jumbles the characters, numbers and special characters of a value into a new order that hides the original content.
It’s a simple technique, doesn’t apply to all data types and is not the strongest way to mask sensitive data. For example, an employee ID that reads 12345 might be masked as 23154; it may not be difficult for an attacker to recover the original number.
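A scrambling step can be sketched in a few lines of Python; the seed parameter here is an assumption added to make the example reproducible:

```python
import random

def scramble(value, seed=None):
    """Jumble the characters of a value into a new order
    that hides the original content."""
    rng = random.Random(seed)
    chars = list(value)
    rng.shuffle(chars)
    return "".join(chars)

# The masked ID contains the same digits in a different order,
# which is exactly why scrambling is easy to attack.
masked_id = scramble("12345", seed=7)
```

Note that the masked value still exposes the character set and length of the original, illustrating the weakness described above.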
Nulling out
This technique applies a null value to the targeted data column so that the actual data stays hidden from any unauthorized user. However, it reduces data integrity and makes QA harder.
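Nulling out reduces to replacing every value in the sensitive columns with a null. A minimal sketch over a list of row dictionaries (the column names are hypothetical):

```python
def null_out(rows, sensitive_columns):
    """Replace the values of sensitive columns with None,
    leaving all other columns untouched."""
    return [
        {col: (None if col in sensitive_columns else val)
         for col, val in row.items()}
        for row in rows
    ]

rows = [{"id": 1, "ssn": "123-45-6789"},
        {"id": 2, "ssn": "987-65-4321"}]
masked = null_out(rows, {"ssn"})
# masked -> [{'id': 1, 'ssn': None}, {'id': 2, 'ssn': None}]
```

The nulls make the column useless for any test that depends on realistic values, which is the integrity drawback noted above.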
Substitution
This technique masks the data by substituting its original value with a realistic new one, without impacting the format or structure of the records. It is a simple technique that works well across several data types; for example, masking business partner names using a lookup file. Disguising the original ‘look’ of the data helps protect it from breaches.
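The lookup-file idea from the example can be sketched as a simple mapping; the partner names and stand-in values below are hypothetical:

```python
# Hypothetical lookup table standing in for the "lookup file":
# each real partner name maps to a realistic but fake stand-in.
LOOKUP = {
    "Acme Corp": "Vendor-001",
    "Globex": "Vendor-002",
}

def substitute(partner_name):
    """Swap a real partner name for its stand-in; unknown names
    fall back to a generic placeholder."""
    return LOOKUP.get(partner_name, "Vendor-XXX")
```

The masked records keep a plausible shape, so downstream tests and demos still behave realistically.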
Shuffling
Shuffling is substitution done differently: it shuffles the values within a masked column across records. For example, shuffling business partner names across multiple records. The new data looks accurate yet doesn’t reveal any personal information. The only way to breach this technique is to reverse the shuffling algorithm.
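Shuffling a single column across records can be sketched as follows; the seed parameter is an assumption added for reproducibility:

```python
import random

def shuffle_column(rows, column, seed=None):
    """Reassign one column's values across records, so every row keeps
    a realistic value that no longer belongs to it."""
    values = [row[column] for row in rows]
    random.Random(seed).shuffle(values)
    return [{**row, column: v} for row, v in zip(rows, values)]

rows = [{"partner": "Acme Corp", "amount": 100},
        {"partner": "Globex", "amount": 250},
        {"partner": "Initech", "amount": 75}]
masked = shuffle_column(rows, "partner", seed=42)
```

Every partner name still appears exactly once, so aggregates over the column remain valid, but the name-to-amount pairing is broken.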
Number and date variance
Based on a pre-defined masking policy, this technique alters each value in a data field by increasing or decreasing it. A simple example would be decreasing every date of birth by 100 days. The drawback of this method is that because the same policy applies to all values in a field, the compromise of one value results in the compromise of all values.
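The date-of-birth example maps directly onto a fixed offset; the 100-day shift below mirrors the policy described in the text:

```python
from datetime import date, timedelta

# Pre-defined masking policy: shift every date of birth back by 100 days.
SHIFT = timedelta(days=100)

def mask_dob(dob):
    """Apply the variance policy to a single date of birth."""
    return dob - SHIFT

masked = mask_dob(date(1990, 5, 10))
# masked -> date(1990, 1, 30)
```

Because the same offset applies to every record, recovering the shift from one known record unmasks the entire column, which is the weakness noted above.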
Pseudonymisation
Pseudonymisation is a relatively new term, and technique, introduced with the GDPR guidelines. It requires that the data can’t be used for personal identification, which means removing direct identifiers as well as combinations of indirect identifiers that together could disclose an identity. Encoding identifiers protects user privacy while preserving the credibility of the masked data.
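One common way to encode identifiers is to assign each one a stable pseudonym from a mapping table; the class and naming scheme below are illustrative assumptions, and under GDPR the mapping itself must be stored separately and securely:

```python
import itertools

class Pseudonymiser:
    """Replace direct identifiers with stable pseudonyms.
    The identifier-to-pseudonym mapping must be kept separately
    and securely, or the data becomes re-identifiable."""

    def __init__(self):
        self._mapping = {}
        self._counter = itertools.count(1)

    def pseudonym(self, identifier):
        # The same identifier always yields the same pseudonym,
        # so joins across masked tables still work.
        if identifier not in self._mapping:
            self._mapping[identifier] = f"user-{next(self._counter):04d}"
        return self._mapping[identifier]
```

Stability of the mapping is what preserves the credibility of the masked data: relationships between records survive even though the identities do not.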
Redaction
Redaction replaces sensitive data that is not required for QA or development with generic values. Here, the masked data shares no attributes with the original set.
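A redaction step can be sketched as replacing selected fields with one generic placeholder; the column names and placeholder text are hypothetical:

```python
def redact(rows, columns, placeholder="REDACTED"):
    """Replace sensitive fields with a generic value that shares
    no attributes (length, format, character set) with the original."""
    return [
        {col: (placeholder if col in columns else val)
         for col, val in row.items()}
        for row in rows
    ]

rows = [{"name": "Alice", "card": "4111 1111 1111 1111"}]
masked = redact(rows, {"card"})
# masked -> [{'name': 'Alice', 'card': 'REDACTED'}]
```

Unlike substitution, the result is obviously fake, which is acceptable only when the field is not needed by the consumers of the masked data.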
Averaging
Averaging hides the individual values and exposes only their aggregate or average. A very simple example would be hiding the salaries in an employee details table and showing only their average.
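The salary example can be sketched by overwriting each individual value with the group average; the table layout is a hypothetical stand-in for the employee details table:

```python
def average_salaries(rows):
    """Replace every individual salary with the group average,
    preserving the aggregate while hiding individual values."""
    avg = sum(r["salary"] for r in rows) / len(rows)
    return [{**r, "salary": avg} for r in rows]

rows = [{"name": "A", "salary": 50000},
        {"name": "B", "salary": 70000}]
masked = average_salaries(rows)
# every masked salary becomes 60000.0
```

The total payroll is unchanged, so aggregate reporting stays accurate even though no individual salary is disclosed.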
High-performance data platforms
The degree of security achieved by masking depends directly on the capabilities of the data management platform. That is exactly why many data platforms, especially test data management solutions, pitch masking as an integral component.
For example, Oracle’s Data Masking and Subsetting solution reduces IT costs by provisioning masked data for testing purposes.
Informatica’s dynamic data masking de-identifies data sets and prevents unauthorized access to production environments such as order management and customer support. It hides sensitive user data such as names, ages, accounts and roles.
While we are at it, K2View’s data masking is surely the highlight of 2022. The popular data fabric and data product platform captures data from fragmented sources and organizes it according to product schemas such as the business entity.
The fabric saves masked data for every business entity in an exclusive micro-database. With such an innovative approach, K2View executes dynamic masking for varied use cases such as test data management, legacy application modernization, pipelining and tokenization.
Other popular names include Delphix, DataProf, IBM Infosphere, CA, etc.
As discussed, data masking techniques are mostly simple yet highly effective in ensuring end-to-end security for large data volumes. They allow realistic data to be used for alternative purposes such as testing, demos and training. While the scope of masking goes further, the techniques above provide a starting point. I recommend partnering with a data product platform that provides integrated components such as masking, pipelining and orchestration.