AI Data Privacy: A New Product Essential

October 26, 2023

In the ever-evolving tech landscape, where innovation meets the promise of generative AI, one crucial question looms: Can we truly harness the transformative power of AI while safeguarding sensitive data?

As the Chief Product Officer at Skyflow, where we’ve built a privacy platform that isolates, protects, and governs sensitive data, I’ve witnessed advancements in large language model (LLM)-based AI tools like ChatGPT with growing interest. Throughout my career, I have worked closely with developing AI technologies at companies such as Salesforce (as VP of Product Management), Topsy, and others to democratize data and AI. And, like many leaders at global companies, the impact of generative AI on data privacy – and the technical solutions that address it – is always top of mind.

Of course, AI isn’t a new development – in fact, it’s been a part of the technological landscape since the 1950s and continues to play a significant role in our daily lives. In the early days, we mostly worried about issues like bias and fairness. However, with the emergence of generative AI, where training on vast datasets has become commonplace, a new concern has come to the forefront: data privacy for sensitive data like PII.

As product and business leaders, we’ve all been on a journey together since ChatGPT started making headlines in late 2022, moving through a progression of play, setup, and optimization:

  • First, we all started by playing with generative AI to see what we could do with it.
  • More technical business leaders next shifted from play into a setup phase, which involves picking an LLM to work with and putting controls in place.  
  • Finally, we moved into optimization, which is where many companies are now. Optimization today usually involves fine-tuning the model, prompt engineering, updating workflows, and building retrieval indices.

As we move along this journey, it’s critical to protect sensitive data in the datasets used to train these models. The time to protect PII is during the setup phase, which should involve configuring data governance to control who can access which model, mapping the flows of sensitive data, and planning how to keep PII isolated – and out of your LLMs. I often joke that LLM-based AI tools are like a baby with the memory of an elephant. Just like a baby, it’s very trusting and continuously learning; and just like an elephant, it never forgets.
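As an illustration, the kind of setup-phase policy described above could be sketched as simple configuration. The structure, team names, and model names below are hypothetical, not any particular product’s schema:

```python
# Hypothetical setup-phase governance policy: which teams may use which model,
# and which data flows must be de-identified before anything reaches an LLM.
GOVERNANCE_POLICY = {
    "models": {
        "public-llm": {"allowed_teams": ["support", "product"], "pii_allowed": False},
        "internal-llm": {"allowed_teams": ["data-science"], "pii_allowed": False},
    },
    "data_flows": {
        "crm_to_llm": {"deidentify_first": True, "isolate_fields": ["name", "email", "ssn"]},
    },
}

def can_use_model(team: str, model: str) -> bool:
    """Check whether a team is allowed to send prompts to a given model."""
    return team in GOVERNANCE_POLICY["models"].get(model, {}).get("allowed_teams", [])
```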

It’s dangerous to feed sensitive data to LLMs without putting controls in place for several reasons:

  • LLMs don’t have a delete button, which means you can’t delete a piece of sensitive data without retraining the model
  • Sensitive customer PII requires protection as part of compliance with laws like CCPA and GDPR, and similar considerations apply to sensitive data under standards like PCI or SOC2
  • Data privacy is critically important throughout the model lifecycle, including training and inference
  • Internal and business data or IP such as project code names also need protection to avoid disclosing product roadmaps, trade secrets, or sensitive company information, as Samsung discovered when it adopted ChatGPT

Another issue that’s being debated is: what’s more important for effective AI implementations, model size or dataset size? One crucial but often overlooked issue in this debate is the need for flexibility – the freedom to utilize whichever model fits your needs. Flexibility is one major advantage of using datasets devoid of sensitive data. Eliminating privacy concerns from model selection criteria lets you leverage LLMs without constraint. This is especially valuable with the proliferation of new LLMs tailored to niche use cases, which means keeping sensitive data out of LLMs isn’t just an issue of trust and compliance – it also impacts how effectively your business can harness generative AI.

So, how can you realize the promise of generative AI while protecting the privacy of sensitive data, democratizing access to data, and enabling the use of any model? 

By using a data privacy vault to protect sensitive data across all end-to-end workflows for any data, any model, and any CRM or SaaS app.

What is a Data Privacy Vault?

A data privacy vault isolates, protects, governs, and localizes sensitive data.

In simple terms, a data privacy vault acts as a secure repository for sensitive data while letting customers use the data, keeping it safe from unauthorized access, and easing compliance with data privacy regulations.

With a data privacy vault, sensitive data is:

  • Isolated from the point of ingestion or storage: Vault APIs or SDKs collect sensitive data and isolate it in a vault that's separated from your other systems, reducing your compliance scope and your attack surface.
  • Protected through tokenization (or masking and redaction) and encryption: Protection goes beyond swapping sensitive data for tokens or a redacted field and encrypting it in a vault. Effective data protection supports secure workflows, custom logic, and the ability to run analysis on sensitive data without compromising data privacy.
  • Governed using fine-grained access controls: With fine-grained access controls, you can govern sensitive data access so that only the right people get access to the right data at the right time.
  • Localized using regional vault instances: With sensitive data isolated from a company's other systems, you can store it in regional vaults without the need to replicate wholesale infrastructure instances in each region, making data residency compliance cost-effective and scalable.

These capabilities are why over a hundred companies, including giants like Netflix and Apple and Skyflow customer Lenovo, are using data privacy vaults in production today.
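To make this concrete, here is a minimal sketch of how tokenization and fine-grained governance fit together. It uses a toy in-memory "vault" in Python with made-up field names and roles; a real data privacy vault is a hardened, isolated service, not a class inside your application:

```python
import secrets

class DataPrivacyVault:
    """Toy in-memory vault: stores sensitive values, hands out tokens,
    and only detokenizes for roles allowed by a governance policy."""

    def __init__(self, policy):
        self._records = {}     # token -> original sensitive value
        self._policy = policy  # role -> set of fields that role may read in plaintext

    def tokenize(self, field, value):
        # Swap the sensitive value for an opaque token; only the vault keeps the mapping.
        token = f"tok_{field}_{secrets.token_hex(8)}"
        self._records[token] = value
        return token

    def detokenize(self, token, field, role):
        # Fine-grained access control: the caller's role must be allowed to read this field.
        if field not in self._policy.get(role, set()):
            return "[REDACTED]"
        return self._records[token]

# Example usage: a support agent may see names but never SSNs.
policy = {"support_agent": {"name"}, "compliance_officer": {"name", "ssn"}}
vault = DataPrivacyVault(policy)

name_token = vault.tokenize("name", "Jane Smith")
ssn_token = vault.tokenize("ssn", "123-45-6789")

print(vault.detokenize(ssn_token, "ssn", "support_agent"))       # [REDACTED]
print(vault.detokenize(ssn_token, "ssn", "compliance_officer"))  # 123-45-6789
```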

Companies Need an AI Privacy Strategy

The need for architectural solutions to AI data privacy is not exclusive to large enterprises; smaller companies need it too. And, smaller companies can gain a competitive advantage by more rapidly implementing AI data privacy. Even companies that can’t afford to train their own LLMs can use data privacy vaults to de-identify sensitive data so they can use their datasets with public LLMs without having to fine-tune or train their own models.

However, AI data privacy can't be a band-aid solution. Companies need data privacy and security throughout the entire data lifecycle, including the LLM lifecycle – from data collection to model training and inference. 

A data privacy vault should sit between your data sources – training data, fine-tuning data, inference data – and your LLM, machine learning, and other infrastructure. Here's an overview:

  • A data privacy vault detects and protects sensitive data to prevent it from ending up in LLMs or machine learning tools, allowing you to use these tools without data privacy concerns. This detection can be based on identifying sensitive data types by their structure (as with SSNs), or using a company-defined sensitive data dictionary.
  • A vault stores sensitive data and provides pseudonymized stand-ins to LLMs. This means that LLMs only interact with tokens, keeping sensitive data secure. Alternatively, a vault lets you completely redact sensitive data (instead of tokenizing it) to keep it out of LLMs.
  • When LLMs provide output data, a data privacy vault can swap stand-ins for sensitive data elements, like a name or date of birth, safeguarding user data privacy without affecting the user experience.
  • With a vault, you can also create audit logs and delete sensitive data to ease compliance with data privacy laws and standards.
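As a rough illustration of that flow, the sketch below de-identifies a prompt before a model call and re-identifies the response afterward. The regex patterns and the `call_llm` placeholder are assumptions for the example; in practice the vault performs detection (by structure or dictionary) and holds the token mapping:

```python
import re

# Illustrative detection patterns; a real vault would also use a company-defined dictionary.
PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
}

def deidentify(text):
    """Replace detected sensitive values with stand-in tokens; return text plus the mapping."""
    mapping = {}
    for field, pattern in PATTERNS.items():
        for i, match in enumerate(pattern.findall(text)):
            token = f"<{field}_{i}>"
            mapping[token] = match
            text = text.replace(match, token)
    return text, mapping

def reidentify(text, mapping):
    """Swap stand-in tokens back for the original values in the LLM's output."""
    for token, value in mapping.items():
        text = text.replace(token, value)
    return text

def call_llm(prompt):
    # Placeholder for a real model call; the model only ever sees tokens, never raw PII.
    return f"Reply for <email_0>: your refund request has been received."

prompt = "Customer john@example.com (SSN 123-45-6789) asked about a refund."
safe_prompt, mapping = deidentify(prompt)
response = reidentify(call_llm(safe_prompt), mapping)
```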

LLM Data Privacy for Healthcare

To illustrate the practical applications of a data privacy vault, let's consider the healthcare industry, where data privacy is regulated by laws like HIPAA in the US, and similar laws internationally.

Let's say you need to aggregate unstructured data, like doctor's notes in patient healthcare records, to train an LLM for healthcare. To do this, you can detect and de-identify sensitive data with a vault and then feed the de-identified data into the LLM for training purposes.

A HIPAA-compliant data privacy vault makes this easy because all identifying details are kept out of the LLM. After the model is trained, you can continue to use the data privacy vault to prevent sensitive data from entering the model during inference. Sensitive identifying information can be detected and stored in the vault, so only fully redacted placeholders make their way to the LLM.
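For instance, a de-identification pass over a doctor's note might look like the sketch below. The patterns and placeholder labels are purely illustrative; a HIPAA-grade pipeline would rely on the vault's detection and a maintained identifier dictionary rather than a few hand-written regexes:

```python
import re

# Illustrative PHI patterns; the patient name would come from a vault-managed identifier list.
PHI_PATTERNS = {
    "[PATIENT_NAME]": re.compile(r"\bJane Doe\b"),
    "[DATE_OF_BIRTH]": re.compile(r"\b\d{2}/\d{2}/\d{4}\b"),
    "[MRN]": re.compile(r"\bMRN-\d{6}\b"),
}

def redact_note(note):
    """Replace identifying details with placeholders before the note joins a training corpus."""
    for placeholder, pattern in PHI_PATTERNS.items():
        note = pattern.sub(placeholder, note)
    return note

note = "Jane Doe (DOB 04/12/1987, MRN-204981) reports improved sleep since last visit."
training_text = redact_note(note)
# -> "[PATIENT_NAME] (DOB [DATE_OF_BIRTH], [MRN]) reports improved sleep since last visit."
```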

Final Thoughts

We are still in the early days of generative AI. Your product team needs to find ways to harness this technology without risking damage to customer trust or noncompliance with data privacy regulations. You need to define the problems your team hopes to solve with generative AI, instead of reactively adopting AI without proper safeguards just because your competitors are.

The journey of integrating generative AI into your products and services should be guided by a commitment to data privacy, and the most effective data privacy solution available is a data privacy vault. As we navigate this evolving landscape, remember that trust and compliance are as essential as innovation and creativity, and you need all four to unlock the full potential of generative AI while safeguarding sensitive data.

Note: This post was originally published on Mind the Product.
