File analysis helps organizations address their increasing data volumes by mapping where their data resides (including file shares, email databases, enterprise file sync and share, records management, enterprise content management, Microsoft SharePoint, and data archives) and identifying who has access to it.
File analysis solutions analyze, index, search, track, and report on file metadata and content. This enables organizations to view and organize detailed metadata and contextual information, improve PII oversight and information governance, and manage unstructured data more efficiently.
File analysis solutions also protect and secure unstructured data. Organizations can make better decisions about content analysis, while mitigating risk and reducing costs associated with data. These solutions help to ensure data security, lifecycle management, data access governance, mapping, and classification while enabling key data insights and analysis that drive and protect the business. These key capabilities help organizations address digital transformation use cases for risk mitigation, governance and compliance, efficiency and optimization, and data insight.
Organizations are under increasing pressure to transform their business. Whether that journey starts with accelerating efforts to move to the cloud, support remote workers, or prepare for data privacy, file analysis solutions can help to optimize data and applications and intelligently identify, secure, and classify data. File analysis solutions can also provide insight across data to ensure compliance and enable smarter data migrations.
File analysis solutions can scale to meet the needs of modern workloads and identify areas where data can be optimized and defensibly deleted – driving down costs, improving efficiency, and ensuring compliance. Projects that deploy file analysis require velocity to keep up with an ever-changing business environment. Speed, scale, and rapid time to value are essential for maximizing value from these solutions.
File analysis solutions provide access to the most common sources of unstructured data (on-premises or in the cloud) to assess risk, identify sensitive and high-value data, and provide actions that protect, secure, and govern the data over its lifecycle.
Data efficiency and optimization across unstructured data starts with understanding what data you have and where it is stored. Through data mapping, you can use file analysis to identify where all your data is located and identify “dark data” that is misplaced, orphaned, duplicated, obsolete, or trivial. Projects that leverage file analysis provide faster return on investment by actively deleting or optimizing data that serves no value to the organization.
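As a rough illustration of one part of data mapping, the sketch below finds duplicated files by comparing content hashes. It is a minimal example under stated assumptions, not how any particular file analysis product works: the function names `file_digest` and `find_duplicates` are hypothetical, and it only covers a locally mounted directory tree.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def file_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def find_duplicates(root: str) -> list[list[Path]]:
    """Group files under `root` whose content is byte-for-byte identical."""
    # Cheap first pass: only files of equal size can be duplicates.
    by_size = defaultdict(list)
    for p in Path(root).rglob("*"):
        if p.is_file():
            by_size[p.stat().st_size].append(p)

    # Second pass: hash only the files that share a size with another file.
    by_hash = defaultdict(list)
    for same_size in by_size.values():
        if len(same_size) < 2:
            continue
        for p in same_size:
            by_hash[file_digest(p)].append(p)

    return [sorted(group) for group in by_hash.values() if len(group) > 1]
```

Grouping by size first avoids hashing files that cannot have a duplicate, which matters when the tree holds millions of files.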
File analysis solutions assist with data risk mitigation by optimizing, protecting, and securing data found during the content analysis stage. This includes:
Once data optimization is complete, data that has no business value no longer consumes storage space. Only data that is highly valuable and actively used by the business remains.
Understanding access and permissions is essential. File analysis solutions that provide remediation tools help ensure that proper controls are applied to data while it’s in active use. Some solutions include additional protections, such as the ability to encrypt data at the endpoint to ensure proper use. And finally, properly deployed file analysis solutions can prevent users from moving or deleting data without understanding its business purpose. File analysis solutions that deploy a “manage-in-place” model minimize the risk of disruption to business users.
Deploying a file analysis solution can help ensure that the right data is available to the right user at the right time. It helps organizations meet their regulatory, legal, and internal governance and compliance objectives by:
Organizations are in a race to find, protect, and secure personal data (including consumer, citizen, and employee data). This global trend – which includes GDPR (EU), CCPA (California), KVKK (Turkey), PIPEDA (Canada), and POPIA (South Africa) – has brought new attention to file analysis solutions. By leveraging content analysis capabilities and detection techniques, file analysis solutions are ideal for ensuring compliance and assisting in responding to consumer requests or data subject access requests.
Data privacy preparedness is an example of where file analysis solutions shine. It also emphasizes the need for a process in which PII files can be easily identified, indexed, and retrieved.
The end-to-end process should look something like this:
File analysis solutions use simple classification methods based on metadata tags, keywords, or terms lists. Some solutions leverage conceptual classification of the file content and combine these methods with found documents, images, or data entities to improve the accuracy of the categorization. Other solutions take it a step further with machine learning and guided learning using sample documents, which enable you to define the classifications to be used.
For example, a Human Resources document with health or insurance information can use a data classification policy based on sample data. For other elements, such as age and location, you can apply a risk score and additional permissions to further define the policy.
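The simplest of these methods, classification by terms list with an attached risk score, can be sketched as follows. The policy names, keywords, risk values, and the `min_hits` threshold are all illustrative assumptions; real solutions define their own policies and typically use far richer signals.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Policy:
    name: str
    keywords: frozenset  # terms list that triggers this policy
    risk: int            # hypothetical risk score, higher means more sensitive


# Illustrative policies only; these are assumptions, not a vendor's defaults.
POLICIES = [
    Policy("HR-Health", frozenset({"insurance", "diagnosis", "medical"}), 8),
    Policy("Finance", frozenset({"invoice", "iban"}), 5),
]


def classify(text: str, policies=POLICIES, min_hits: int = 2):
    """Return (policy name, risk) pairs whose terms list matches the text."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    matches = []
    for policy in policies:
        hits = words & policy.keywords
        # Require several distinct terms to avoid classifying on one stray word.
        if len(hits) >= min_hits:
            matches.append((policy.name, policy.risk))
    # Highest-risk classification first.
    return sorted(matches, key=lambda item: -item[1])
```

For instance, `classify("Employee medical insurance claim form")` matches the hypothetical HR-Health policy, while a text containing only one of its terms matches nothing.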
File analysis solutions provide capabilities to help organizations automatically take action on data, as well as a rich toolset to help govern and preserve data. The solutions typically include the following options, driven by corporate data governance:
What is “manage-in-place”?
Manage-in-place is a key concept of data lifecycle management and governance. It describes how the file analysis solution analyzes the metadata (including location, permissions, and content) where the data resides. The actual object is not moved, copied, or stored in another location or preservation area during the analysis.
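To make the idea concrete, the sketch below gathers basic metadata about a file without moving or copying it. It is only an assumption-laden illustration using the Python standard library on a local file system; the function name `describe_in_place` and the chosen fields are hypothetical.

```python
import os
import stat
from datetime import datetime, timezone


def describe_in_place(path: str) -> dict:
    """Collect metadata for a file where it resides; the file is not touched."""
    st = os.stat(path)  # reads file system metadata only
    return {
        "path": os.path.abspath(path),
        "size_bytes": st.st_size,
        "modified_utc": datetime.fromtimestamp(
            st.st_mtime, tz=timezone.utc
        ).isoformat(),
        # Permission string such as "-rw-r--r--"
        "mode": stat.filemode(st.st_mode),
    }
```

Only the resulting record would be stored in an index; the original object stays in its source repository.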
Understanding grammars for entities
Two basic types of data discovery grammars (rule sets) are used to describe the entities you are trying to identify: curated and user-generated.
The grammars include:
Look for curated and optimized grammars, which can’t be modified by the user. These grammars use context and landmarks for more accurate results and provide a “confidence score” to help you filter out false positives. Context and landmarks can be phrases, single words, or individual characters.
Context is key. In file analysis solutions that use context, both the proximity of the context to the entity candidate and the strength of that context (based on natural-language processing techniques) contribute to the confidence score. You can obtain more granular scores by leveraging comprehensive lists of specific entities, countries, or regions.
Tuning and flexibility. If none of these grammars covers your specific use case, you can use a file analysis solution that allows you to create custom grammars. These grammars are typically defined by using format-descriptive regular expressions (regex) or simple lists.
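A user-generated grammar of this kind can be sketched as a regular expression plus a list of landmarks that raise the confidence score when they appear near a candidate. Everything here is an assumption made for illustration: the pattern (a US Social Security number format), the landmark list, the 40-character window, and the scoring formula are not taken from any specific product.

```python
import re

# Format-descriptive pattern for the candidate entity (assumed example).
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
# Landmarks that make a match more likely to be genuine (assumed list).
LANDMARKS = ("ssn", "social security")
WINDOW = 40  # characters before the candidate that count as context


def find_entities(text: str, base: float = 0.4):
    """Return (match, confidence) pairs; nearby landmarks raise confidence."""
    lowered = text.lower()
    results = []
    for m in SSN_PATTERN.finditer(text):
        context = lowered[max(0, m.start() - WINDOW):m.start()]
        score = base
        for landmark in LANDMARKS:
            pos = context.rfind(landmark)
            if pos != -1:
                # Distance from the end of the landmark to the candidate.
                distance = len(context) - (pos + len(landmark))
                # A closer landmark gives a larger boost, up to +0.5.
                score = max(score, base + 0.5 * (1 - distance / WINDOW))
        results.append((m.group(), round(score, 2)))
    return results
```

A match with no landmark in range keeps the base score, so a reviewer can filter on a threshold to set aside likely false positives.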
What are false positives?
By definition, a “false positive” is a test result that incorrectly indicates the presence of a particular condition or attribute. In the case of file analysis solutions, a false positive indicates a pattern, grammar, or keyword match that is incorrectly identified during content analysis. File analysis solutions that simply use pattern or keyword matching typically have higher false positive rates than those with contextually aware content analysis capabilities.
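The difference between bare pattern matching and a validated match can be shown with payment card numbers. The sketch below is an illustrative assumption rather than a description of any product: it flags every 16-digit run as a candidate, then optionally applies the Luhn checksum, a standard check for card numbers, to discard candidates that cannot be valid.

```python
import re

# Naive pattern: any standalone run of 16 digits is a candidate.
CARD_PATTERN = re.compile(r"\b\d{16}\b")


def luhn_valid(number: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    for i, ch in enumerate(reversed(number)):
        digit = int(ch)
        if i % 2 == 1:  # double every second digit, counting from the right
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


def find_card_numbers(text: str, validate: bool = True):
    """Return candidate card numbers, optionally filtered by checksum."""
    candidates = CARD_PATTERN.findall(text)
    if not validate:
        return candidates
    return [c for c in candidates if luhn_valid(c)]
```

With `validate=False`, an internal 16-digit identifier would be reported alongside a real card number; the checksum removes many such false positives, although it cannot remove all of them.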
Scanned documents and audio recordings
File analysis solutions can analyze text-based documents for risk, but PII can also reside in other forms of data. Performing file analysis on scanned documents, recorded conversations, and video conference recordings is becoming increasingly common. Some file analysis solutions can process these files prior to applying PII discovery techniques.
Scanned paper documents stored as images (inside a PDF file, for example) should be processed with optical character recognition (OCR) to extract the text and, ideally, the associated structural information. Many organizations keep scanned ID documents on record, such as employees’ driver's licenses or passports.
File analysis solutions that support audio or video recordings first process them with a speech-to-text engine, which creates a written transcript for analysis.
Benefits of contextual, AI-driven content analysis:
Benefits of “manage-in-place” models:
What is a file analysis SaaS solution?
File analysis can be offered via software as a service (SaaS), where the customer consumes services provided by a SaaS vendor for a monthly or annual fee. This approach doesn’t require hardware procurement or traditional perpetual licensing. It relies partially or completely on the SaaS vendor (or a managed service provider in some cases) to provide access to the application in order to conduct content analysis, search, governance actions, and analytics. SaaS provides an easy way to get started with content analysis and offers high scalability, speed, and fast time to value. Depending on the location of the SaaS hosting environment, data residency and data sovereignty concerns might need to be weighed against the commercial benefits of SaaS.
What is a file analysis on-premises solution?
File analysis solutions can also be run on-premises and operated and maintained by in-house teams. This approach requires organizations to provide the infrastructure and personnel and to acquire and manage the file analysis software themselves. On-premises deployment assures organizations that their data is not shared with third parties and does not leave the premises. Typically, on-premises solutions are sold through a perpetual license. More recently, subscription licensing has been used to provide more flexibility in how the software is consumed and billed.
OpenText provides file analysis tools
Voltage File Analysis Suite by OpenText™ is a SaaS file analysis solution that enables organizations to quickly and efficiently reduce information risk; ensure data privacy; and analyze, optimize, and secure employee access to critical data that drive and protect the business. Our solution ensures data lifecycle management and data access governance while mitigating the risk associated with managing sensitive data. File Analysis also provides identity and access governance, complete data visibility, reduction in storage costs, actionable analytics that improve efficiency, and data quality. In addition, it supports data privacy compliance while addressing governance for high-value assets (e.g., contracts, intellectual property, and patents) and sensitive data (e.g., PI/PII, PCI, and PHI).
OpenText™ File Reporter inventories network file systems and delivers the detailed file storage intelligence you need to optimize and secure your network for efficiency and compliance. It enables you to identify access risks when you discover and analyze files and associated permissions for data stored across your enterprise. Engineered for enterprise file system reporting, File Reporter gathers data across the millions of files and folders scattered among the various network storage devices that make up your network. Flexible reporting, filtering, and querying options then present the exact findings you need in order to demonstrate compliance or take corrective action.
OpenText™ File Dynamics provides extensive services to address the expanding requirements of network data management. Identity-driven policies automate tasks that are traditionally done manually, resulting in cost savings and the assurance that tasks are being performed properly. Target-driven policies provide protection from unauthorized access, as well as data migration and clean up. File Dynamics also protects against data corruption and downtime through near-line storage backup of high-value targets, enabling quick recovery of files and their associated permissions. File Dynamics delivers the role-based access restrictions, remediation, risk mitigation, and proactive management needed for compliance with data management regulations.
OpenText™ ControlPoint is a file analysis solution that leverages IDOL artificial intelligence for unstructured data analytics. It enables organizations to identify and automatically classify sensitive data (e.g., PII, PCI, PHI); clean up legacy data; and uncover risks hidden in dark data sitting unmanaged in email repositories, file shares, SharePoint sites, and cloud repositories (such as Office 365, Google Drive, and Dropbox). ControlPoint also enables organizations to save on storage costs by reducing redundant, obsolete, and trivial data. This provides better access to valuable information and enforces data preservation by applying policies that assist in data lifecycle management, regulatory compliance, and data security.