How to Protect, Secure, and Use Unstructured Data

April 18, 2024

Structured data and relational databases have long been the backbone of analytics and application development. Yet, unstructured data, which makes up approximately 80 to 90% of all data, has remained largely untapped due to lack of proper tooling. With the introduction of data lakes and lakehouses in the past decade, and more recently Large Language Models (LLMs), organizations have begun unlocking the potential of this data.

A key challenge with unstructured data is that it often contains sensitive information, such as customer personally identifiable information (PII), which raises concerns about privacy and data protection. Companies struggle to leverage this data for analytics and AI while safeguarding privacy.

This blog post explores the untapped value of unstructured data and the hurdles in managing it, and introduces Skyflow's solutions for secure file storage, secure operations on files, secure file sharing, and sensitive data detection. With Skyflow, businesses can support analytics and generative AI workflows while ensuring sensitive information is isolated, protected, and governed correctly.

Structured vs Unstructured Data

While structured data is neatly organized and easier to handle, unstructured data's diverse formats introduce complexity in management and analysis.

The Modern Data Stack for Structured and Unstructured Data

Structured data shines in scenarios where accuracy and easy access to records are important, such as managing financial information, tracking inventory, and organizing customer databases. Its precise, tabular format allows for straightforward queries and analyses.

On the other hand, unstructured data is invaluable for applications that demand insights from more complex information, including text, images, and multimedia. It is essential for sentiment analysis and trend forecasting, where the depth and nuance of data provide a richer, more detailed understanding of content and context.

Unstructured data typically requires significant pre-processing to map to a structured schema to facilitate search and analysis. 

Data Lake Pipeline: Mapping Unstructured Data Into a Schema
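As a rough illustration of the pre-processing step described above, the sketch below maps a free-text support message onto a small structured record. The SupportTicket schema, the field names, and the regular expressions are hypothetical and chosen only for this example; real pipelines usually need far more robust extraction.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class SupportTicket:
    """Hypothetical target schema for one unstructured support message."""
    customer_email: Optional[str]
    order_id: Optional[str]
    body: str


# Deliberately simple patterns; they are assumptions for this sketch only.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
ORDER_RE = re.compile(r"\border\s+#?(\d{4,})\b", re.IGNORECASE)


def to_record(raw_text: str) -> SupportTicket:
    """Pull the fields we can find out of free text; leave the rest as None."""
    email = EMAIL_RE.search(raw_text)
    order = ORDER_RE.search(raw_text)
    return SupportTicket(
        customer_email=email.group(0) if email else None,
        order_id=order.group(1) if order else None,
        body=raw_text.strip(),
    )


ticket = to_record("Hi, my order #12345 never arrived. Reach me at jane.doe@example.com.")
```

Note that the extracted record now holds an email address in a well-known column, while the raw body still carries the same address in free text. That split is part of what makes privacy harder for unstructured data.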

Unstructured data also poses unique privacy and security challenges, which are covered in the next section.

Privacy and Security Challenges with Unstructured Data

Unstructured data is inherently difficult to classify, which significantly complicates efforts to maintain security and privacy. With the volume of such data anticipated to reach 175 billion terabytes by 2025, the sheer scale presents an overwhelming challenge. This management complexity is one reason why an estimated 40% of companies are unaware of the whereabouts of their data. Without clear insight into what is being stored and where, ensuring data security, privacy, and compliance becomes an increasingly intractable problem.

Below are the challenges that businesses struggle with when it comes to the safe handling of unstructured data.

Compliance Risks

The challenge of adhering to regulations like GDPR, CCPA, HIPAA, and others is amplified with unstructured data. These compliance frameworks mandate strict control and knowledge of where personal data is stored and how it is used. The scattered nature of unstructured data complicates compliance efforts, as organizations struggle to pinpoint exactly where sensitive data resides. This makes it nearly impossible to fulfill a data subject request or a right to be forgotten request, or to comply with data residency regulations by storing files containing regulated data within the correct region.

Secure File Collection and Storage

Many applications collect sensitive documents and other files through the application’s front end, and these files may contain sensitive and regulated data. Examples include financial statements, healthcare information, images of a passport, and other customer files that you wouldn’t want falling into the wrong hands.

While the files could be stored encrypted within your lake, you first need to collect the data and pass it downstream, likely through multiple services before it arrives at your lake. The more touchpoints, the larger the surface area for a potential data leak.

Companies are left to figure out how to securely collect files containing sensitive data from their front end without risking exposure at any point as the file passes through their system. Each touchpoint is not only a risk, but it also increases your compliance scope. If the document contains financial information such as payment card data, you may need to make sure each part of your data flow is PCI DSS compliant.

Operations on Files with Sensitive Data

Once files are collected, there are many business use cases that involve processing or operations on sensitive data within the files. For example, if you’re collecting pictures of passports or driver’s licenses, you may need to extract the birth date to verify the age of the user. Or you may be working with external consultants who need to review financial statements to analyze overall financial performance. They shouldn't have access to individual employee salaries, so that information needs to be automatically redacted before the files are shared with them.

Performing operations like these means decrypting the files. You need to think carefully about where these operations take place, how secure the environment is, what gets logged, and who has access. Human error is a significant factor in data breaches, with as many as 88% of data breaches being due to an employee mistake. With so many factors involved in managing unstructured data, adhering to regulations, and keeping everything secure, there is a lot to juggle and mistakes happen.

Secure File Sharing

Files collected from customers frequently need to be shared with third-party services or other users under specific conditions. 

For instance, government-issued IDs might be gathered, stored, and then shared with a Know Your Customer (KYC) verification provider to comply with regulatory requirements. Similarly, contracts requiring signatures could be shared with relevant parties, ensuring that access is restricted solely to the intended recipient and only for a necessary duration. Additionally, when handling customer transaction statements from Visa, it's crucial to ingest and process these documents while ensuring that the entire system remains compliant with PCI DSS. 
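To make the idea of access that is restricted to the intended recipient and only for a necessary duration more concrete, here is a minimal sketch of an expiring, recipient-bound share grant built on an HMAC signature. It is an illustrative pattern, not a description of how any particular product implements sharing; SECRET_KEY, the function names, and the message layout are all assumptions.

```python
import hashlib
import hmac

# Hypothetical signing key; in practice this would live in a secrets manager.
SECRET_KEY = b"demo-only-secret"


def sign_share_link(file_id: str, recipient: str, expires_at: int) -> str:
    """Bind a file, a recipient, and an expiry time (Unix seconds) together."""
    message = f"{file_id}|{recipient}|{expires_at}".encode("utf-8")
    return hmac.new(SECRET_KEY, message, hashlib.sha256).hexdigest()


def verify_share_link(file_id: str, recipient: str, expires_at: int,
                      signature: str, now: int) -> bool:
    """Accept the grant only if it has not expired and nothing was altered."""
    if now > expires_at:
        return False
    expected = sign_share_link(file_id, recipient, expires_at)
    # compare_digest avoids leaking information through timing differences.
    return hmac.compare_digest(expected, signature)
```

Because the recipient and expiry are part of the signed message, changing either one invalidates the signature, so a grant issued to a KYC provider cannot be reused by a different party or after it lapses.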

Sensitive Data Detection

Detecting sensitive data in unstructured data is essential for privacy protection and regulatory compliance, but it's a complex task. Part of the challenge is due to the variety of file types, from emails to images, each requiring different detection methods. Additionally, what counts as sensitive can vary, making it hard to identify consistently across all files.

Regulations often demand that certain data within a file be redacted, complicating detection efforts. Safeguarding sensitive information isn't just about limiting file access but also about controlling who can see specific data within those files.

In essence, the task involves not only finding sensitive data among diverse file formats but also understanding its context, ensuring compliance, and managing access at a fine-grained level. Organizations need sophisticated tools that can handle these complexities to protect sensitive data effectively.

How Skyflow Helps

Skyflow is a data privacy vault, which isolates, protects, and governs sensitive data (both structured and unstructured) while facilitating region-specific compliance through data localization. When used properly, all sensitive data is transformed by the vault into non-sensitive de-identified data, taking the existing application infrastructure out of scope for data security and compliance. The de-identified data, in the form of vault-generated tokens, is passed along and stored within traditional application systems like the database and various logging systems, behaving as a reference or pointer to the original data.

For example, in the image below, unstructured data, like documents, audio files, and log data, is first processed by the data privacy vault. The vault detects sensitive data and de-identifies it, replacing the sensitive information with vault-generated tokens. The tokens act like a pointer or reference to the original data but carry no exploitable information. Because of Skyflow’s polymorphic encryption and tokenization, the vault-generated tokens can be created in a variety of ways to support various workflows. This includes consistently generated tokens, random tokens, and format-preserving tokens.

Detecting and De-identifying Sensitive Data in an Unstructured Data Pipeline
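The three token styles mentioned above can be illustrated with a toy, in-memory vault. This is a conceptual sketch only and does not reflect Skyflow's actual implementation or API; the ToyVault class and its method names are hypothetical, and the format-preserving variant here simply substitutes random characters of the same class rather than using a real format-preserving encryption scheme.

```python
import hashlib
import hmac
import secrets
import string


class ToyVault:
    """In-memory stand-in for a vault: maps tokens back to original values."""

    def __init__(self, key: bytes) -> None:
        self._key = key
        self._originals = {}  # token -> original value

    def _remember(self, token: str, value: str) -> str:
        self._originals[token] = value
        return token

    def consistent_token(self, value: str) -> str:
        # Same input always yields the same token, so joins and counts still work.
        digest = hmac.new(self._key, value.encode("utf-8"), hashlib.sha256).hexdigest()
        return self._remember("tok_" + digest[:16], value)

    def random_token(self, value: str) -> str:
        # A fresh token on every call, so tokens cannot be correlated with each other.
        return self._remember("tok_" + secrets.token_hex(8), value)

    def format_preserving_token(self, value: str) -> str:
        # Digits stay digits, letters stay letters, separators are kept as-is.
        # A toy substitution: collisions are possible and are not handled here.
        chars = []
        for ch in value:
            if ch.isdigit():
                chars.append(secrets.choice(string.digits))
            elif ch.isalpha():
                chars.append(secrets.choice(string.ascii_letters))
            else:
                chars.append(ch)
        return self._remember("".join(chars), value)

    def detokenize(self, token: str) -> str:
        return self._originals[token]
```

Consistent tokens suit analytics that need to group or join on a value, random tokens suit cases where even equality should not leak, and format-preserving tokens let downstream systems that validate shape (for example, an NNN-NN-NNNN field) keep working unchanged.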

In the following section, we discuss how Skyflow helps with secure file storage, performing secure operations on files, secure file sharing, and detection of sensitive data.

Privacy-safe Unstructured Data Management with Skyflow

Secure File Storage

Given the privacy challenges associated with sensitive data residing in files, let's explore how Skyflow can address these concerns. Skyflow's Data Privacy Vault securely stores discrete sensitive data such as Social Security Numbers and credit card numbers, and it can also store entire sensitive files directly within the vault.

Once securely stored, Skyflow provides robust governance and access controls, ensuring that the same level of protection is applied to files as with other types of data. Files can be uploaded into the vault through our intuitive Management Console (Studio), via APIs, or using Skyflow SDKs, which are available for a variety of programming languages. Skyflow supports uploading files of any type, with a maximum file size limit of 32MB.

Recognizing that files can serve as potential attack vectors, Skyflow goes a step further by offering built-in support for running antivirus scans on uploaded files during the upload process. If a file is detected to be infected, Skyflow automatically quarantines it, preventing any further access or download.

We’ve seen Skyflow customers use this functionality in several different ways. A frequently repeated pattern is one where a customer collects sensitive files, such as driver's licenses and W-9 forms, as part of its onboarding process through our front-end form elements, called Skyflow Elements, and stores them in the vault. Once the files are in the vault, Skyflow can also render them in the customer's front-end applications so that they can be viewed, edited, or replaced.

Secure Functions on Files

In addition to storing sensitive files securely, there's often a need to extract information from these files or perform computations on them. However, handling such files directly on your infrastructure can expose you to compliance risks, especially when dealing with PII or payment card information (PCI).

With Skyflow Secure Functions, you can develop custom logic to process files and execute it on Skyflow's secure compute platform, effectively removing yourself from compliance scope.

One common use case we've observed involves extracting information from identity cards such as driver's licenses or passports to determine details like age and identification numbers. For instance, a function can use Amazon Textract to securely extract the date of birth from a US driver's license and derive the holder's age from it. You can explore an example of such a function in our Functions Catalog: Age Verification.
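The age check itself is simple once the date of birth has been extracted. The sketch below covers only that final step and leaves out the OCR call entirely; the function names and the assumed MM/DD/YYYY text format are hypothetical, and this is not the code of the catalog function.

```python
from datetime import date, datetime


def age_on(date_of_birth: date, today: date) -> int:
    """Whole years elapsed between date_of_birth and today."""
    years = today.year - date_of_birth.year
    # Subtract one if this year's birthday has not happened yet.
    if (today.month, today.day) < (date_of_birth.month, date_of_birth.day):
        years -= 1
    return years


def is_of_age(dob_text: str, minimum_age: int, today: date) -> bool:
    """dob_text is assumed to be OCR output in MM/DD/YYYY form."""
    dob = datetime.strptime(dob_text, "%m/%d/%Y").date()
    return age_on(dob, today) >= minimum_age
```

Returning only a yes or no answer, rather than the date of birth, means the calling application never has to handle the underlying PII.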

Another compelling use case for Secure Functions is custom redaction of sensitive data within documents stored in the vault. Using optical character recognition (OCR), Skyflow customers have deployed functions to redact Aadhaar card numbers, names, and photos, ensuring compliance and privacy protection.

By leveraging Skyflow's Secure Functions, you can interact directly with sensitive data stored in files while mitigating the risks of handling those files on your own infrastructure, thus avoiding compliance scope.
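As a simplified illustration of text-level redaction after OCR, the sketch below masks 12-digit numbers printed either contiguously or in groups of four, which is how Aadhaar numbers commonly appear. It is an assumption-laden example: the pattern, the placeholder, and the function name are hypothetical, it does no checksum validation, and it will also match any other 12-digit number with the same layout. Redacting names and photos needs different techniques and is not shown.

```python
import re

# 12 digits, optionally split 4-4-4 by spaces or hyphens, and not part of a
# longer digit run (so a 16-digit card number is left alone).
AADHAAR_RE = re.compile(r"(?<!\d)(?<!\d[ -])\d{4}[ -]?\d{4}[ -]?\d{4}(?![ -]?\d)")


def redact_aadhaar(ocr_text: str, placeholder: str = "[REDACTED]") -> str:
    """Replace every Aadhaar-shaped number in OCR output with a placeholder."""
    return AADHAAR_RE.sub(placeholder, ocr_text)
```

For example, redact_aadhaar("Aadhaar No: 2345 6789 0123") returns "Aadhaar No: [REDACTED]".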

Secure File Sharing

Another use case where customers leverage Skyflow Secure Functions is in the creation and sharing of files through third-party services like DocuSign or Dropbox. Oftentimes, businesses encounter scenarios where they need to securely disseminate sensitive files without the sensitive data/files ever traversing their infrastructure.

Consider a real-world example from a Skyflow fintech customer: during customer onboarding, sensitive financial information needs to be stored in a file. Using Skyflow Secure Functions, the customer generates a PDF document integrated with DocuSign eSignature functionality. This document is then sent to the end customers for signature. Once digitally signed, the document is securely stored in the Skyflow vault as the system of record, supporting data privacy and compliance with regulatory standards. A sanitized version of this Secure Function is available in our catalog.

End users can then conveniently access and download the signed contract as needed, with the assurance that the customer's backend operations remain entirely out of the compliance scope. This seamless integration of secure file creation and distribution via third-party services not only enhances operational efficiency but also lets the customer focus on their actual business needs instead of worrying about compliance regulations.

Detecting Sensitive Data

As companies embrace the power of generative AI and LLMs to drive productivity, the need to safeguard data privacy becomes paramount. These technologies rely heavily on ingesting vast amounts of training data, often sourced from unstructured sources such as blogs, social media posts, audio recordings, text files, and logs — each potentially brimming with sensitive information. Before feeding this data into models, it's crucial to ensure it's de-identified to protect privacy.

Skyflow uses advanced AI/ML capabilities designed to process various types of data, including text files (e.g., PDFs, Word documents), audio files (e.g., WAV, MP3), text blobs, image files (e.g., JPEG, PNG, medical images), and videos. Skyflow accurately identifies common and custom PII elements like names, addresses, social security numbers, and credit card numbers, as well as visually sensitive information such as faces, ensuring none of this sensitive data leaks into the training data.
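To show the general shape of a detect-then-de-identify step for text destined for model training, here is a deliberately simple, rule-based sketch. It is not how Skyflow's AI/ML detection works; it uses two regular expressions plus a Luhn checksum to cut down false positives on card numbers, and every name in it is hypothetical. Pattern matching like this misses names, addresses, and anything that depends on context.

```python
import re
from typing import List, Tuple

SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
# 13 to 19 digits, optionally separated by single spaces or hyphens.
CARD_RE = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")


def luhn_valid(number: str) -> bool:
    """Luhn checksum over the digits of number (separators are ignored)."""
    digits = [int(c) for c in number if c.isdigit()]
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def detect(text: str) -> List[Tuple[str, int, int]]:
    """Return (label, start, end) for each finding, ordered by position."""
    findings = [("SSN", m.start(), m.end()) for m in SSN_RE.finditer(text)]
    for m in CARD_RE.finditer(text):
        if luhn_valid(m.group(0)):
            findings.append(("CARD", m.start(), m.end()))
    return sorted(findings, key=lambda f: f[1])


def deidentify(text: str) -> str:
    """Replace each finding with a label, working right to left so offsets stay valid."""
    for label, start, end in reversed(detect(text)):
        text = text[:start] + f"[{label}]" + text[end:]
    return text
```

Replacing each value with a typed label rather than deleting it keeps the sentence structure intact, which helps preserve the context of the data for training.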

For instance, a prominent AI-based healthcare provider relies on Skyflow Secure Pipeline to detect sensitive medical information within unstructured data, such as audio files and text files, before that data is used for training, preventing any inadvertent leakage of patient data into the model. Having a HIPAA-compliant secure pipeline that can detect PII and other sensitive data and replace it without compromising the quality and context of the data was crucial for this customer, and they were able to achieve this with Skyflow.

Additionally, Skyflow offers certification services through third-party validation to ensure the data set is free from sensitive information — an added layer of assurance for compliance.

In addition to the vast training data needed for LLMs, enterprises often grapple with managing unstructured data housed in data warehouses like Databricks, Snowflake, and Redshift. Here, business analysts play a crucial role, utilizing tools to access this data and run queries, yielding critical insights for the business.

Skyflow can help here by seamlessly integrating with these data warehouses using external functions or User Defined Functions. With this integration, sensitive data is automatically tokenized before any analytics are executed, mitigating the risk of exposure and ensuring compliance with data privacy regulations.
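The sketch below illustrates why tokenizing before analytics still leaves the data useful: with consistent tokens, grouping and counting work without exposing the raw values. It stands in for what a warehouse user-defined function might do, but it is not Skyflow's integration. The tokenize function here is a local keyed hash rather than a call to a vault, and TOKEN_KEY, the column names, and the sample rows are made up.

```python
import hashlib
import hmac
from collections import Counter

# Hypothetical key; a real UDF would call out to the vault instead.
TOKEN_KEY = b"demo-only-key"


def tokenize(value: str) -> str:
    """Deterministic stand-in token: equal inputs give equal tokens."""
    digest = hmac.new(TOKEN_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return "tok_" + digest[:16]


def tokenize_columns(rows, sensitive_columns):
    """Return copies of rows with the named columns replaced by tokens."""
    return [
        {col: tokenize(val) if col in sensitive_columns else val
         for col, val in row.items()}
        for row in rows
    ]


orders = [
    {"email": "ana@example.com", "amount": 40},
    {"email": "bo@example.com", "amount": 15},
    {"email": "ana@example.com", "amount": 25},
]
safe_orders = tokenize_columns(orders, {"email"})

# Analysts can still count orders per customer without seeing any email address.
orders_per_customer = Counter(row["email"] for row in safe_orders)
```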

In instances where re-identification of sensitive data is necessary to provide context for analysis, Skyflow's Governance Engine can augment these User Defined Functions by restricting access to sensitive data to a select group of privileged users, effectively minimizing compliance scope while maintaining the integrity of the analysis.
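Conceptually, restricting re-identification to privileged users comes down to a policy check in front of the detokenize step. The following sketch shows that idea with a hard-coded policy table. The roles, column names, and the reveal function are hypothetical and are not part of Skyflow's Governance Engine, which is configured through its own policy model.

```python
from typing import Dict, Set

# Hypothetical policy: which columns each role may see in the clear.
POLICY: Dict[str, Set[str]] = {
    "privacy_officer": {"email", "ssn"},
    "analyst": set(),
}


def reveal(token: str, column: str, role: str, vault: Dict[str, str]) -> str:
    """Return the original value only if the role is allowed to see this column.

    vault is a plain token-to-value mapping standing in for the real vault.
    Unknown roles get no access, and denied requests simply get the token back.
    """
    if column in POLICY.get(role, set()):
        return vault[token]
    return token
```

With this shape, an analyst's query keeps working on tokens, while a privileged user running the same query sees the original values only for the columns the policy allows.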

In essence, Skyflow provides a comprehensive solution for enterprises, safeguarding sensitive data throughout the analytics process within data warehouses, thereby enabling business insights while upholding the highest privacy standards.

Wrap Up

Handling sensitive information within unstructured data, such as PII, presents significant privacy and security challenges. Organizations must navigate these carefully to avoid breaches and comply with data protection regulations.

Skyflow provides solutions for these challenges through secure file storage, operations on files, file sharing, and sensitive data detection. This ensures that sensitive information is properly managed and protected across various applications and workflows.

Skyflow not only secures sensitive data but also allows for its safe usage and sharing through secure functions and integrations, enabling businesses to maintain compliance while leveraging the data for business insights and operations.

By adopting Skyflow, businesses can unlock the value of unstructured data while ensuring that sensitive information remains isolated and secure, paving the way for innovation without compromising on privacy.
