Securing Your Databases Is Good, Securing Your Data Is Better
[Earth had] a problem, which was this: most of the people living on it were unhappy for pretty much all of the time. Many solutions were suggested for this problem, but most of these were largely concerned with the movements of small green pieces of paper, which is odd because on the whole it wasn’t the small green pieces of paper that were unhappy. — Douglas Adams, The Hitchhiker's Guide to the Galaxy
You’ve heard about data breaches and what they do to company (and employee) fortunes, so you’re working hard to secure your database: upgrade, firewall, encryption, auditing, etc. Oh yes, and remember to change the default password. What about access control? Do you have the right policies? It feels like you’re still forgetting something!
As a security professional, I like more security. You can, and should, nay, must, do all of the above; but not just for your databases. You also have to secure your backups, the servers that process this data, the pipelines that move it around, the logs your applications generate (into which data can leak), and so on.
As a security professional, I like more job security. However, you don’t have to be a security professional to see that this is, at best, an indirect path to the most important thing that you were trying to secure — the data itself.
What Is Tokenization?
When securing data, the first step is to encrypt it. If you follow all the best practices (choosing appropriate encryption algorithms, IV construction, chaining modes, key generation, key management, and so on), you’ve essentially replaced your sensitive data with ciphertext: a bunch of bytes that are meaningless to anyone without access to the decryption keys, and that leak very little information. This is a great start!
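For concreteness, here is a minimal sketch of field-level encryption in Python, using AES-GCM from the `cryptography` package. The key handling here is an assumption made for brevity; a real deployment would pull keys from a KMS and handle rotation, storage, and nonce management far more carefully:

```python
# Minimal sketch: field-level encryption with AES-GCM.
# Assumption: key lives in memory here; in practice it comes from a KMS.
import os
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

key = AESGCM.generate_key(bit_length=256)
aesgcm = AESGCM(key)

def encrypt_field(plaintext: str) -> bytes:
    nonce = os.urandom(12)                       # unique nonce per encryption
    return nonce + aesgcm.encrypt(nonce, plaintext.encode(), None)

def decrypt_field(blob: bytes) -> str:
    nonce, ciphertext = blob[:12], blob[12:]
    return aesgcm.decrypt(nonce, ciphertext, None).decode()

stored = encrypt_field("123-45-6789")            # opaque bytes, not an SSN
```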
But there is no lunch without small green pieces of paper. Apart from needing to hire a cryptography nerd, you now have folks complaining that you broke their systems: customer service was using the last four digits of social security numbers (SSNs) to validate users, and now they can’t; the analytics stack was built on software that assumes the email field (which analytics may never read) will always look like an email, and now that stack needs fixing. You could solve this by giving everyone access to the decryption keys, but then you’re not much better off than when you started.
The main point of the encryption-based approach — replacing the sensitive data with “something else” — is exactly right. What you need is to replace the sensitive data with a “something else” that solves these new problems, as well.
What you need is tokenization. With tokenization, just like with encryption, you rip out your sensitive data and replace it with a placeholder: a token. Unlike with encryption, you don’t have to worry about an “adaptive chosen ciphertext attack” (or some cryptographic attack not yet discovered), because a good token generation scheme neutralizes these attacks: the token need not be derived from the data at all, so there is nothing for the attack to invert.
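As a rough illustration, a vault-style tokenization scheme can be as simple as swapping the value for a random string and storing the mapping in a tightly controlled store. The in-memory `vault` dict below is a stand-in for that store, used only to keep the sketch self-contained:

```python
# Sketch of vault-style tokenization. The dict stands in for a real,
# hardened, access-controlled vault.
import secrets

vault: dict[str, str] = {}                       # token -> original value

def tokenize(value: str) -> str:
    token = secrets.token_urlsafe(16)            # random: carries zero information
    vault[token] = value
    return token

def detokenize(token: str) -> str:
    return vault[token]                          # gated by access policy in practice

t = tokenize("alice@example.com")
# No analysis of t can reveal the email, because t is not derived from it.
```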
Tokenization is not encryption, which means that de-tokenization is not decryption! So you don’t have to hand access to your keys to everyone who wants to work on the tokenized data, thanks to two properties:
- Format preservation: Your tokens can be “format preserving”; they can look like email addresses, social security numbers, etc., so that legacy code that expects to see an email address doesn’t crash.
- Partial detokenization: You don’t have to give customer service access to the entire SSN just so they can compare the last four digits. Given the right tokenization solution, you can make sure customer service can detokenize ONLY the last four. This follows the security design principle of “least privilege”: every process or user should be able to access only the information needed to do its job. Both properties are sketched in the code below.
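Here is a hedged sketch of both ideas at once. The function names (`tokenize_ssn`, `detokenize_last_four`) and the role check are hypothetical, invented for illustration; they are not any product’s real API:

```python
# Sketch: a format-preserving SSN token plus partial detokenization.
# Assumptions: in-memory dict as the "vault"; collisions and persistence
# are ignored for brevity.
import secrets

vault: dict[str, str] = {}                       # token -> real SSN

def tokenize_ssn(ssn: str) -> str:
    # The token is random but shaped like an SSN, so existing
    # validation and display code keeps working.
    token = "{:03d}-{:02d}-{:04d}".format(
        secrets.randbelow(1000), secrets.randbelow(100), secrets.randbelow(10000)
    )
    vault[token] = ssn
    return token

def detokenize_last_four(token: str, role: str) -> str:
    # Least privilege: customer service recovers only the last four digits.
    if role != "customer_service":
        raise PermissionError(f"role {role!r} may not detokenize SSN digits")
    return vault[token][-4:]

t = tokenize_ssn("123-45-6789")                  # e.g. "805-12-3391"
print(detokenize_last_four(t, role="customer_service"))   # "6789"
```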
In some cases, you can choose tokens that allow some computations (e.g., finding common users across different datasets) that would otherwise have required access to the sensitive data. These tokens expose just enough information to be useful (e.g., whether or not two records refer to the same user). The key point is that you get to control the exact tradeoff between the security and usability of your data using simple token configuration, not complex cryptography. To learn more about how tokenization works at Skyflow, take a look at our documentation.
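One common way to build such computable tokens is deterministic tokenization: the same input always yields the same token, so datasets can be matched on tokens alone. Below is a minimal sketch using an HMAC; the key name and setup are assumptions for illustration, and real key management is elided:

```python
# Sketch: deterministic tokens that support joins across datasets.
# Assumption: JOIN_KEY would come from a KMS, never be hard-coded.
import hashlib
import hmac

JOIN_KEY = b"a-secret-key-from-your-kms"

def deterministic_token(value: str) -> str:
    # Same input -> same token, so tokens can be compared
    # without ever revealing the underlying value.
    return hmac.new(JOIN_KEY, value.encode(), hashlib.sha256).hexdigest()

dataset_a = {deterministic_token(e) for e in ["a@example.com", "b@example.com"]}
dataset_b = {deterministic_token(e) for e in ["b@example.com", "c@example.com"]}
shared_users = dataset_a & dataset_b             # exposes only the overlap
```

Note the tradeoff this makes explicit: deterministic tokens leak equality (that two records match), which is exactly the small amount of information the use case needs and nothing more.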
This is really just scratching the surface. We haven’t talked about your other problems around compliance, data residency, governance, etc. We’ve also barely discussed the various kinds of tokens and their information-hiding properties.
If you’d like to dig deeper into the pros and cons of encryption vs. tokenization, I encourage you to check out our data vault white paper. If you want to know more about how to use a data vault to solve these problems, please contact us.