What is File Analysis?


File analysis helps organizations address their increasing data volumes by mapping where their data resides (including file shares, email databases, enterprise file sync and share, records management, enterprise content management, Microsoft SharePoint, and data archives) and identifying who has access to what data.

File analysis solutions analyze, index, search, track, and report on file metadata and content. This enables organizations to view and organize detailed metadata and contextual information, improve PII oversight and information governance, and manage unstructured data more efficiently.

File analysis solutions also protect and secure unstructured data. Organizations can make better decisions about content analysis, while mitigating risk and reducing costs associated with data. These solutions help to ensure data security, lifecycle management, data access governance, mapping, and classification while enabling key data insights and analysis that drive and protect the business. These key capabilities help organizations address digital transformation use cases for risk mitigation, governance and compliance, efficiency and optimization, and data insight.

File Analysis

Why file analysis?

Organizations are under increasing pressure to transform their business. Whether that journey starts with accelerating efforts to move to the cloud, support remote workers, or prepare for data privacy, file analysis solutions can help to optimize data and applications and intelligently identify, secure, and classify data. File analysis solutions can also provide insight across data to ensure compliance and enable smarter data migrations.

File analysis solutions can scale to meet the needs of modern workloads and identify areas where data can be optimized and defensibly deleted, driving down costs, improving efficiency, and ensuring compliance. Projects that deploy file analysis require velocity to keep up with an ever-changing business environment. Speed, scale, and rapid time to value are essential for maximizing value from these solutions.

File analysis solutions provide access to the most common sources of unstructured data (on-premises or in the cloud) to assess risk, identify sensitive and high-value data, and provide actions that protect, secure, and govern the data over its lifecycle.

How does file analysis help with data efficiency and optimization?

Data efficiency and optimization across unstructured data start with understanding what data you have and where it is stored. Through data mapping, you can use file analysis to locate all your data and identify “dark data” that is misplaced, orphaned, duplicated, obsolete, or trivial. Projects that leverage file analysis provide faster return on investment by actively deleting or optimizing data that has no value to the organization.
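As a rough illustration of data mapping, the Python sketch below walks a folder, groups byte-identical files by content hash, and flags files that have not been modified for a long time. The function names, the three-year staleness threshold, and the use of SHA-256 are assumptions made for this example, not features of any particular file analysis product.

```python
import hashlib
import os
import time
from collections import defaultdict


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files are not loaded into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def map_files(root, stale_after_days=365 * 3, now=None):
    """Walk `root` and report duplicate groups and long-unmodified files."""
    now = time.time() if now is None else now
    cutoff = now - stale_after_days * 86400  # assumed threshold, in seconds
    by_hash = defaultdict(list)
    stale = []
    for folder, _dirs, names in os.walk(root):
        for name in names:
            path = os.path.join(folder, name)
            by_hash[sha256_of(path)].append(path)
            if os.path.getmtime(path) < cutoff:
                stale.append(path)
    # Only groups with more than one path are duplicates.
    duplicates = [sorted(paths) for paths in by_hash.values() if len(paths) > 1]
    return {"duplicates": duplicates, "stale": sorted(stale)}
```

A report like this only identifies candidates; whether a duplicate or stale file can be defensibly deleted remains a policy decision.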

How does file analysis help with risk mitigation?

File analysis solutions assist with data risk mitigation by optimizing, protecting, and securing data found during the content analysis stage. This includes:

  • Detecting, managing, and processing PII, PCI, PHI, and IP.
  • Managing the flow of information.
  • Handling sensitive data.
  • Providing identity protection, metadata reporting, identity access rights, data-centric access protection, policy controls, and audit trails.

Once data optimization is complete, data that has no business value no longer consumes storage space. Only data that is highly valuable and actively used by the business remains.

Understanding access and permissions is essential. File analysis solutions that provide remediation tools help ensure that proper controls are applied to data while it’s in active use. Some solutions include additional protections, such as the ability to encrypt data at the endpoint to ensure proper use. And finally, properly deployed file analysis solutions can prevent users from moving or deleting data without understanding its business purpose. File analysis solutions that deploy a “manage-in-place” model minimize the risk of disruption to business users.

How does file analysis help with governance and compliance?

Deploying a file analysis solution can help ensure that the right data is available to the right user at the right time. It helps organizations meet their regulatory, legal, and internal governance and compliance objectives by:

  • Providing metadata governance, legal holds, quarantine, and discovery.
  • Optimizing data volumes.
  • Governing appropriate permissions.
  • Granting role-based access.
  • Identifying high-value assets.
  • Applying data lifecycle policies.

How does file analysis help with PII data and data privacy compliance?

Organizations are in a race to find, protect, and secure personal data (including consumer, citizen, and employee data). This global trend – which includes GDPR (EU), CCPA (California), KVKK (Turkey), PIPEDA (Canada), and POPIA (South Africa) – has brought new attention to file analysis solutions. By leveraging content analysis capabilities and detection techniques, file analysis solutions are ideal for ensuring compliance and assisting in responding to consumer requests or data subject access requests.

Data privacy preparedness is an example of where file analysis solutions shine. It also emphasizes the need for a process in which PII files can be easily identified, indexed, and retrieved.

The end-to-end process should look something like this:

  1. Find repositories and identify files.
  2. Extract all the metadata and content from the file.
  3. Analyze the file content and metadata for specific entities or classify the file based on conceptual content.
  4. Secure the data by applying business rules based on the results of the analysis to ensure appropriate access levels and sensitive data handling (e.g., encryption). You can also apply a category or classification to help manage the lifecycle of the assets.
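The four steps above can be sketched as a small Python pipeline. This is a minimal illustration under stated assumptions: it handles plain-text files in a local folder only, uses a deliberately simple email pattern as the "entity", and records an intended action rather than actually changing permissions or encrypting anything. All names (`FileRecord`, `run_pipeline`, the action labels) are hypothetical.

```python
import os
import re
from dataclasses import dataclass, field

# Hypothetical, deliberately simple entity pattern for illustration.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


@dataclass
class FileRecord:
    path: str
    size: int
    text: str
    entities: list = field(default_factory=list)
    action: str = "none"


def find_files(root):
    """Step 1: locate candidate files in a repository (here, a folder)."""
    for folder, _dirs, names in os.walk(root):
        for name in sorted(names):
            yield os.path.join(folder, name)


def extract(path):
    """Step 2: pull metadata and content (plain text only in this sketch)."""
    with open(path, "r", encoding="utf-8", errors="replace") as handle:
        text = handle.read()
    return FileRecord(path=path, size=os.path.getsize(path), text=text)


def analyze(record):
    """Step 3: look for specific entities in the content."""
    record.entities = EMAIL.findall(record.text)
    return record


def secure(record):
    """Step 4: choose an action from a business rule (assumed labels)."""
    record.action = "restrict-and-encrypt" if record.entities else "none"
    return record


def run_pipeline(root):
    """Run all four steps over every file under `root`."""
    return [secure(analyze(extract(path))) for path in find_files(root)]
```

Keeping each step as a separate function mirrors the list above and makes it straightforward to swap in a richer extractor or analyzer later.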

What is classification or categorization for files?

File analysis solutions use simple classification methods based on metadata tags, keywords, or terms lists. Some solutions leverage conceptual classification of the file content and combine these methods with found documents, images, or data entities to improve the accuracy of the categorization. Other solutions take it a step further with machine learning and guided learning using sample documents, which enable you to define the classifications to be used.

For example, a Human Resources document with health or insurance information can use a data classification policy based on sample data. For other elements, such as age and location, you can apply a risk score and additional permissions to further define the policy.
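The simplest of these methods, classification by terms lists with an added risk score, might look like the following Python sketch. The term lists, weights, category labels, and the `min_hits` threshold are all invented for illustration; real solutions use far richer lists and, as noted above, may rely on conceptual or machine-learned classification instead.

```python
import re

# Hypothetical term lists; a real deployment would maintain far richer ones.
TERM_LISTS = {
    "hr-health": {"insurance", "diagnosis", "medical", "benefits"},
    "finance": {"invoice", "payment", "ledger"},
}
# Hypothetical weights for extra elements that raise the risk score.
RISK_TERMS = {"age": 1, "address": 2, "location": 2}


def classify(text, min_hits=2):
    """Return (category, risk_score) for a piece of text."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    # Count how many terms from each list appear in the text.
    hits = {label: len(words & terms) for label, terms in TERM_LISTS.items()}
    label, count = max(hits.items(), key=lambda item: item[1])
    category = label if count >= min_hits else "unclassified"
    risk = sum(weight for term, weight in RISK_TERMS.items() if term in words)
    return category, risk
```

For instance, `classify("Employee medical insurance claim, age 42, location Berlin")` matches two terms in the `hr-health` list and two risk terms, so it returns `("hr-health", 3)`.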

How does file analysis provide data governance and data preservation?

File analysis solutions provide capabilities to help organizations automatically take action on data, as well as a rich toolset to help govern and preserve data. The solutions typically include the following options, driven by corporate data governance:

  • Delete the data. If there is no need to keep the file, remove it. Is it too old? Is it a duplicate? Does it provide any value to the business? Has the consumer requested that his or her data be destroyed? File analysis solutions maintain an audit trail of both what you did and why you did it.
  • Secure the data. If you need to keep the data, then secure it. Some file analysis solutions can change access controls or encrypt the data. Another option is to move it to a secure location, such as a records management tool, for long-term preservation.
  • Redact the data. You might need to keep some of the data, but not the PII. Some file analysis solutions support redaction to create a clean copy of the original file without the PII content. The original file is then deleted or secured as described above.
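To make the redaction option and its audit trail concrete, here is a minimal Python sketch. It assumes the content is already available as text and treats email addresses as the PII to remove; the pattern, the `[REDACTED]` placeholder, and the audit record fields are assumptions for this example rather than any product's format.

```python
import re
from datetime import datetime, timezone

# Hypothetical, deliberately simple PII pattern (email addresses).
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def redact(text, reason, audit_log, source="<memory>"):
    """Return a clean copy of `text` and record what was done and why."""
    clean, count = EMAIL.subn("[REDACTED]", text)
    audit_log.append({
        "source": source,
        "action": "redact",
        "matches": count,
        "reason": reason,
        "when": datetime.now(timezone.utc).isoformat(),
    })
    return clean
```

The original text is left untouched by this function, so the caller can then delete or secure it as described above.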

What is “manage-in-place”?

Manage-in-place is a key concept of data lifecycle management and governance. It describes how the file analysis solution analyzes the metadata (including location, permissions, and content) where the data resides. The actual object is not moved, copied, or stored in another location or preservation area during the analysis.
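As a simple illustration of the idea, the Python sketch below reads a file's metadata directly from the file system without copying or moving the file. The function name and the returned fields are assumptions for this example, and the permission bits shown are POSIX-style, so they are less meaningful on Windows.

```python
import os
import stat


def describe_in_place(path):
    """Collect metadata for `path` where it resides; nothing is copied."""
    info = os.stat(path)
    return {
        "path": os.path.abspath(path),
        "size_bytes": info.st_size,
        "modified": info.st_mtime,
        # Human-readable permission string such as "-rw-r--r--".
        "mode": stat.filemode(info.st_mode),
        # True when any user on the system may write to the file.
        "world_writable": bool(info.st_mode & stat.S_IWOTH),
    }
```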

Understanding grammars for entities

Two basic types of data discovery grammars (rule sets) are used to describe the entities you are trying to identify: curated and user-generated.

The grammars include:

  • PII: Personally identifiable information, which can differ from region to region (including format, which can cause false positives).
  • PHI: Personal health information, typically associated with the North American health industry.
  • PCI: Payment card industry data, such as credit card information.
  • PSI: Personal security information, such as account details and access keys.

Look for curated and optimized grammars, which can’t be modified by the user. These grammars use context and landmarks for more accurate results and provide a “confidence score” to help you filter out false positives. Context and landmarks can be phrases, single words, or individual characters.

Context is key. File analysis solutions use the proximity of the context to the entity candidate and the strength of that context (based on natural-language processing techniques) to compute confidence scores. You can obtain more granular scores by leveraging comprehensive lists of specific entities, countries, or regions.

Tuning and flexibility. If none of these grammars covers your specific use case, you can use a file analysis solution that allows for creating custom grammars. These grammars are typically defined by using format-descriptive regular expressions (regex) or simple lists.
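A custom grammar of this kind can be approximated in a few lines of Python: a regular expression describes the format, and a list of landmark words near the candidate raises the confidence. The pattern (a US-style Social Security number), the landmark list, the 40-character window, and the two confidence values are all assumptions chosen for illustration; production grammars weigh context far more carefully.

```python
import re

# Hypothetical custom grammar: a format pattern plus context landmarks.
CANDIDATE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
LANDMARKS = ("ssn", "social security")


def find_entities(text, window=40):
    """Return (match, confidence) pairs; nearby landmarks raise confidence."""
    lowered = text.lower()
    results = []
    for match in CANDIDATE.finditer(text):
        start = max(0, match.start() - window)
        context = lowered[start:match.end() + window]
        # Assumed scores: low for a bare format match, high with a landmark.
        confidence = 0.9 if any(mark in context for mark in LANDMARKS) else 0.4
        results.append((match.group(), confidence))
    return results
```

Filtering on the score, for example keeping only results at or above 0.5, is how a confidence value helps suppress matches that merely share the format.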

What are false positives?

By definition, a “false positive” is a test result that incorrectly indicates the presence of a particular condition or attribute. In the case of file analysis solutions, a false positive indicates a pattern, grammar, or keyword match that is incorrectly identified during content analysis. File analysis solutions that simply use pattern or keyword matching typically have higher false positive rates than those with contextually aware content analysis capabilities.
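The Python sketch below shows why bare pattern matching over-reports. It is not the contextual analysis described above; it uses a different, well-known validation step, the Luhn checksum used by payment card numbers, to show how an extra check beyond the format removes a false positive. The 16-digit pattern and function names are assumptions for this example.

```python
import re

# Naive grammar: any standalone run of 16 digits "looks like" a card number.
CARD_LIKE = re.compile(r"\b\d{16}\b")


def luhn_valid(number):
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    for index, char in enumerate(reversed(number)):
        digit = int(char)
        if index % 2 == 1:  # double every second digit from the right
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


def scan(text):
    """Return (pattern-only matches, matches that also pass the checksum)."""
    naive = CARD_LIKE.findall(text)
    checked = [number for number in naive if luhn_valid(number)]
    return naive, checked
```

Given the text `"order 1234567812345678 card 4111111111111111"`, the pattern alone reports both numbers, while the checksum keeps only `4111111111111111`, a widely used test card number; the order number was a false positive.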

Scanned documents and audio recordings

File analysis solutions can analyze text-based documents for risk, but PII can also reside in other forms of data. Performing file analysis on scanned documents, recorded conversations, and video conference recordings is becoming increasingly common. Some file analysis solutions can process these files prior to applying PII discovery techniques.

Scanned paper documents stored as images (inside a PDF file, for example) should be processed with optical character recognition (OCR) to extract the text and, ideally, the associated structural information. Many organizations keep scanned ID documents on record, such as employees’ driver's licenses or passports.

To analyze audio or video recordings, a file analysis solution must first process them with a speech-to-text engine that creates a written transcript for analysis.

Benefits of contextual, AI-driven content analysis:

  • Increases accuracy and detection of sensitive and high-value data.
  • Reduces false positives.
  • Increases efficiency via AI-trained categorization and reduces the manual intervention required to classify data.

Benefits of “manage-in-place” models:

  • Data is easy to find and is where end users expect it to be.
  • Reduces the threat of data loss, productivity loss, and end-user disruption.
  • Increases cost savings and speed by eliminating the need to transfer data across the network or to the cloud in order to analyze it.

The difference between on-premises and SaaS solutions for file analysis

What is a file analysis SaaS solution?

File analysis can be offered via software as a service (SaaS), where the customer consumes services provided by an application service provider for a monthly or annual fee. This approach doesn’t require hardware procurement or traditional perpetual licensing. It relies partially or completely on the SaaS vendor (or a managed service provider in some cases) to provide access to the application in order to conduct content analysis, search, governance actions, and analytics. SaaS provides an easy way to get started with content analysis and offers high scalability, speed, and fast time to value. Depending on the location of the SaaS hosting environment, data residency and data sovereignty concerns might need to be weighed against the commercial benefits of SaaS.

What is a file analysis on-premises solution?

File analysis solutions can also be run on-premises and operated and maintained by in-house teams. This approach requires organizations to provide the infrastructure and personnel and to acquire and manage the file analysis software themselves. On-premises deployment assures organizations that their data is not shared with third parties and does not leave the premises. Typically, on-premises solutions are sold through a perpetual license. More recently, subscription licensing has been used to provide more flexibility in how the software is consumed and billed.

OpenText provides file analysis tools

Voltage File Analysis Suite by OpenText™ is a SaaS file analysis solution that enables organizations to quickly and efficiently reduce information risk; ensure data privacy; and analyze, optimize, and secure employee access to critical data that drive and protect the business. Our solution ensures data lifecycle management and data access governance while mitigating the risk associated with managing sensitive data. It also provides identity and access governance, complete data visibility, reduction in storage costs, and actionable analytics that improve efficiency and data quality. In addition, it supports data privacy compliance while addressing governance for high-value assets (e.g., contracts, intellectual property, and patents) and sensitive data (e.g., PI/PII, PCI, and PHI).

OpenText™ File Reporter inventories network file systems and delivers the detailed file storage intelligence you need to optimize and secure your network for efficiency and compliance. It enables you to identify access risks when you discover and analyze files and associated permissions for data stored across your enterprise. Engineered for enterprise file system reporting, File Reporter gathers data across the millions of files and folders scattered among the various network storage devices that make up your network. Flexible reporting, filtering, and querying options then present the exact findings you need in order to demonstrate compliance or take corrective action.

OpenText™ File Dynamics provides extensive services to address the expanding requirements of network data management. Identity-driven policies automate tasks that are traditionally done manually, resulting in cost savings and the assurance that tasks are being performed properly. Target-driven policies provide protection from unauthorized access, as well as data migration and clean up. File Dynamics also protects against data corruption and downtime through near-line storage backup of high-value targets, enabling quick recovery of files and their associated permissions. File Dynamics delivers the role-based access restrictions, remediation, risk mitigation, and proactive management needed for compliance with data management regulations.

OpenText™ ControlPoint is a file analysis solution that leverages IDOL artificial intelligence for unstructured data analytics. It enables organizations to identify and automatically classify sensitive data (e.g., PII, PCI, PHI); clean up legacy data; and uncover risks hidden in dark data sitting unmanaged in email repositories, file shares, SharePoint sites, and cloud repositories (such as Office 365, Google Drive, and Dropbox). ControlPoint also enables organizations to save on storage costs by reducing redundant, obsolete, and trivial data. This provides better access to valuable information and enforces data preservation by applying policies that assist in data lifecycle management, regulatory compliance, and data security.
