#
2.1 Identify Data
#
Purpose of This Section
The company cannot protect data it has not identified.
Most SMEs have more sensitive data than they think. It may be in email inboxes, shared drives, laptops, accounting software, CRM systems, cloud storage, spreadsheets, chat tools, website forms, backup folders, HR systems, vendor portals, and old archives nobody has opened in years.
Sensitive data does not live in one neat system. It spreads over time. Employees copy it into spreadsheets. Sales teams export it from the CRM. Finance teams send it by email. HR stores it in folders. Managers keep copies on laptops. Vendors receive it to do their work. Old data remains long after the business has stopped needing it.
The goal of this section is to create a clear working inventory of the company’s important data.
This is not yet the full data protection plan. That comes later. First, the company needs to know what data exists, where it is stored, who owns it, who can access it, and which systems or vendors touch it.
#
Goals
Organizations should establish and maintain a data inventory, giving priority to sensitive data, so that data can be identified, classified, handled, retained, and disposed of as required. The steps in this section lay out several methods for finding that data.
#
Identify:
- What data the company holds
- Where it lives
- Who owns it
- How sensitive it is
- How it moves
- Where it is copied or backed up
- Whether it is still needed
#
Main steps for identifying data
#
1. Define what counts as company data
Start by defining the data categories in scope. Do not only think about databases. SMEs usually lose visibility because data is scattered across email, shared drives, laptops, SaaS tools, exports, backups, old folders, and personal workarounds.
Include:
Customer data, employee data, supplier data, financial records, contracts, invoices, intellectual property, credentials, API keys, operational documents, business plans, regulated data, confidential files, emails, chat exports, backups, archives, and old project folders.
Answer the key question: What types of data does the company create, receive, process, store, share, or archive?
#
2. Identify all data locations
Next, list where data lives. This should include structured and unstructured repositories.
Common locations:
File servers, SharePoint, OneDrive, Google Drive, Dropbox, Nextcloud, local laptops, email inboxes, CRM, ERP, accounting software, HR systems, ticketing tools, databases, cloud storage buckets, SaaS apps, backup platforms, Git repositories, external drives, NAS devices, and archived folders.
Answer the key question: Where is our sensitive and business-critical data stored today?
This is where most weak inventories fail. They list “customer database” and forget the exported spreadsheet sitting in someone’s Downloads folder.
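One way to catch stray exports like the one described above is to list spreadsheet-style files under a folder tree. The sketch below is a minimal illustration, not a complete discovery tool. The function name `find_export_files` and the suffix list `EXPORT_SUFFIXES` are assumptions made for this example.

```python
import os
from pathlib import Path

# Assumed list of "export-like" file types; adjust to what your teams actually use.
EXPORT_SUFFIXES = {".csv", ".tsv", ".xls", ".xlsx"}


def find_export_files(root):
    """Return sorted (path, size_in_bytes) pairs for export-like files under root."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            path = Path(dirpath) / filename
            if path.suffix.lower() not in EXPORT_SUFFIXES:
                continue
            try:
                found.append((path, path.stat().st_size))
            except OSError:
                # File vanished or is unreadable; skip it rather than abort the walk.
                continue
    return sorted(found)
```

Calling `find_export_files(Path.home() / "Downloads")` on a laptop gives a first list of candidate files to add to the inventory. It only reports file names and sizes. Deciding what each file contains still needs a person or a content scan.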
#
3. Define a simple data classification model
Before scanning or cataloging, define classification levels. Keep it simple enough that employees can use it.
A practical SME model:
- Public — safe for public release.
- Internal — normal business information not meant for public release.
- Confidential — sensitive business, customer, employee, financial, contractual, or operational data.
- Restricted — regulated, high-risk, credential-related, security-sensitive, legal, or highly confidential data.
Answer the key question:
How sensitive is this data if leaked, lost, altered, or exposed to the wrong person?
Do not over-engineer this. A 12-level classification scheme will die in real operations.
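If the inventory is kept in a script or tool rather than a spreadsheet, the four levels above can be represented as an ordered type. This is a sketch under two assumptions that are not part of the model itself: a mixed folder takes the level of its most sensitive item, and an empty folder defaults to Internal. The names `Classification` and `classify_repository` are hypothetical.

```python
from enum import IntEnum


class Classification(IntEnum):
    """The four-level model from the text; a higher value means more sensitive."""
    PUBLIC = 1
    INTERNAL = 2
    CONFIDENTIAL = 3
    RESTRICTED = 4


def classify_repository(item_levels):
    """Return the highest level found among the items in a repository.

    Assumption: "highest level wins", and an empty repository is Internal.
    """
    return max(item_levels, default=Classification.INTERNAL)
```

Because the levels are ordered, comparisons such as `level >= Classification.CONFIDENTIAL` can be used later to select the data that needs more frequent review.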
#
4. Create the data inventory record
For each data set or major data location, capture a standard set of fields.
Minimum useful fields:
Data name, data category, description, location, system or repository, owner, department, users/groups with access, sensitivity level, regulated status, source, format, retention requirement, backup location, third-party sharing, business purpose, last reviewed date, and status.
For example:
- Customer contracts
- Location: SharePoint / Legal / Contracts
- Owner: Legal Manager
- Classification: Confidential
- Contains: customer names, signatures, pricing, service terms
- Access: Legal, Sales leadership, Finance
- Retention: 7 years after contract end
- Backed up: Microsoft 365 backup provider
- Last reviewed: May 2026
Answer the key question:
Can someone understand what this data is, where it lives, who owns it, and why it exists without asking five people?
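The customer contracts example can be written as a structured record. The sketch below covers only the fields shown in that example, not the full minimum field list. The class name `DataInventoryRecord` and the helper `missing_fields` are illustrative assumptions, and the review day is assumed because the example gives only a month.

```python
from dataclasses import dataclass, asdict
from datetime import date
from typing import Optional


@dataclass
class DataInventoryRecord:
    name: str
    location: str
    owner: str
    classification: str
    contains: list[str]
    access: list[str]
    retention: str
    backup_location: str
    last_reviewed: Optional[date]

    def missing_fields(self) -> list[str]:
        """Names of fields left empty, which a reviewer should chase up."""
        return [key for key, value in asdict(self).items() if value in ("", [], None)]


customer_contracts = DataInventoryRecord(
    name="Customer contracts",
    location="SharePoint / Legal / Contracts",
    owner="Legal Manager",
    classification="Confidential",
    contains=["customer names", "signatures", "pricing", "service terms"],
    access=["Legal", "Sales leadership", "Finance"],
    retention="7 years after contract end",
    backup_location="Microsoft 365 backup provider",
    last_reviewed=date(2026, 5, 1),  # text says "May 2026"; the day is assumed
)
```

A record with an empty `missing_fields()` result is one that a new reader can probably understand without asking around. A record with gaps shows exactly which questions to put to the owner.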
#
5. Discover and scan the data
Use a combination of manual review, repository exports, search, metadata tools, and automated scanning.
Methods include:
- Export folder structures from file shares.
- Review SharePoint, OneDrive, Google Drive, and SaaS admin consoles.
- Scan databases and data warehouses.
- Search for sensitive patterns such as ID numbers, payment card numbers, tax IDs, health data, employee records, and credentials.
- Scan Git repositories for secrets.
- Review backup indexes and archive folders.
- Check cloud storage buckets and object stores.
Answer the key question:
What data can we actually prove exists through scanning, system exports, or repository review?
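To make the pattern search concrete, here is a small sketch that looks for two of the patterns listed above: payment card numbers and credentials. It is deliberately simple and will miss many cases. Dedicated scanners such as those named later in this section do far more. The regular expressions, the keyword list, and the function names are assumptions for this example. The Luhn checksum is the standard check used to reject digit runs that cannot be real card numbers.

```python
import re

# 13 to 19 digits, optionally separated by single spaces or hyphens.
CARD_CANDIDATE = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")
# Assumed keyword list for "key = value" style credentials.
CREDENTIAL_ASSIGNMENT = re.compile(
    r"(?i)\b(password|passwd|api[_-]?key|secret)\b\s*[:=]\s*\S+"
)


def luhn_valid(digits):
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    for index, char in enumerate(reversed(digits)):
        value = int(char)
        if index % 2 == 1:  # double every second digit from the right
            value *= 2
            if value > 9:
                value -= 9
        total += value
    return total % 10 == 0


def scan_text(text):
    """Return (finding_type, masked_detail) pairs; raw values are never returned."""
    findings = []
    for match in CARD_CANDIDATE.finditer(text):
        digits = re.sub(r"\D", "", match.group())
        if luhn_valid(digits):
            findings.append(("payment_card", "****" + digits[-4:]))
    for match in CREDENTIAL_ASSIGNMENT.finditer(text):
        findings.append(("credential", match.group(1).lower()))
    return findings
```

The scan returns masked details only, so the scan report does not become one more copy of the sensitive data it found.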
#
6. Catalog and classify the data
After discovery, consolidate findings into your central data inventory or data catalog. Assign each data set a category, classification level, owner, and business purpose.
This step should identify:
Sensitive data, regulated data, confidential business files, intellectual property, credentials, duplicate copies, stale data, orphaned data, and data with unclear ownership.
Answer the key question:
What is this data, how sensitive is it, and who is accountable for it?
#
7. Map data flows and copies
This is the step that turns a data inventory into a useful data map.
Track:
- Where the data comes from
- Where it is stored
- Which systems process it
- Who accesses it
- Which vendors receive it
- Where it is exported
- Where it is backed up
- Where old copies may exist
Answer the key question:
Where does this data move, and where are the copies?
This matters because data risk rarely sits only in the original system. It often sits in exports, reports, attachments, backups, and shared folders.
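A data map can be kept as a simple "sends to" table and walked to list every place a copy may exist. The sketch below uses a made-up set of flows. The system names in `FLOWS` and the function `downstream_locations` are hypothetical and only illustrate the idea.

```python
from collections import deque

# Hypothetical flows: each key sends data to the listed destinations.
FLOWS = {
    "CRM": ["Sales export (spreadsheet)", "Email marketing vendor", "Nightly backup"],
    "Sales export (spreadsheet)": ["Email attachment", "Laptop Downloads folder"],
    "Nightly backup": ["Offsite backup archive"],
}


def downstream_locations(flows, source):
    """Breadth-first walk: every location that can hold a copy of data from source."""
    seen = {source}
    ordered = []
    queue = deque([source])
    while queue:
        current = queue.popleft()
        for destination in flows.get(current, []):
            if destination not in seen:  # also guards against loops in the map
                seen.add(destination)
                ordered.append(destination)
                queue.append(destination)
    return ordered
```

For the sample table, `downstream_locations(FLOWS, "CRM")` lists six locations beyond the CRM itself. That is the practical point of this step: the original system is one entry, and the copies are the rest.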
#
8. Identify unmanaged, stale, duplicate, and high-risk data
Flag data that should be reviewed.
Examples:
- Old employee records with no owner.
- Customer exports stored on desktops.
- Credentials inside spreadsheets or code repositories.
- Financial records in personal cloud folders.
- Old backups with unclear retention.
- Duplicate contract folders.
- Archived data nobody can justify keeping.
- Publicly shared links containing confidential files.
Answer the key question:
Which data exists outside proper ownership, classification, retention, or access control?
This is still part of Identify. You are not yet fixing everything. You are exposing what exists.
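Flagging can be expressed as a handful of rules applied to each inventory record. The sketch below only reports problems and changes nothing, in keeping with this step. The record keys, the two-year staleness threshold, and the choice of Confidential and Restricted as the sensitive levels are all assumptions for the example.

```python
from datetime import date

SENSITIVE_LEVELS = {"Confidential", "Restricted"}  # assumed mapping


def flag_record(record, today, stale_after_days=730):
    """Return the reasons this inventory record needs review (empty list if none)."""
    flags = []
    if not record.get("owner"):
        flags.append("no owner")
    if not record.get("classification"):
        flags.append("unclassified")
    if not record.get("retention"):
        flags.append("no retention period")
    last_used = record.get("last_used")
    if last_used is None or (today - last_used).days > stale_after_days:
        flags.append("stale or unknown last use")
    if record.get("public_link") and record.get("classification") in SENSITIVE_LEVELS:
        flags.append("sensitive data behind a public link")
    return flags
```

Running this over the whole inventory produces a review list that can go to owners for decisions.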
#
9. Validate with business owners
Send the inventory to department owners for confirmation.
Ask:
- Is this data still used?
- Who owns it?
- Who needs access?
- Is the classification correct?
- Is the retention period correct?
- Is it stored in the correct place?
- Is anything missing?
Answer the key question:
Can the business owner confirm that this data record is accurate and still needed?
#
10. Maintain the data inventory
Data identification is not a one-time cleanup. It needs triggers.
Update the inventory when:
- A new system is added.
- A new SaaS tool is approved.
- A new data type is collected.
- A vendor starts receiving company data.
- A database, shared drive, or cloud storage bucket is created.
- A major report/export process is introduced.
- A system is retired.
- A backup location changes.
- A department changes its process.
Minimum schedules:
- Quarterly review for sensitive data.
- Annual review for the full data inventory.
- Immediate update when new high-risk data or systems are introduced.
CIS specifically recommends reviewing and updating the data inventory annually at minimum, with priority on sensitive data. (CIS Controls)
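The minimum schedule can be turned into due dates so that overdue records are easy to list. This sketch assumes that "sensitive" means the Confidential and Restricted levels, and it approximates a quarter as 91 days and a year as 365 days. The function names and record keys are hypothetical.

```python
from datetime import date, timedelta

QUARTERLY_LEVELS = {"Confidential", "Restricted"}  # assumed meaning of "sensitive"


def next_review_date(last_reviewed, classification):
    """Quarterly (91 days) for sensitive data, annual (365 days) for everything else."""
    days = 91 if classification in QUARTERLY_LEVELS else 365
    return last_reviewed + timedelta(days=days)


def overdue_records(records, today):
    """Return the names of records whose next review date has already passed."""
    return [
        record["name"]
        for record in records
        if next_review_date(record["last_reviewed"], record["classification"]) < today
    ]
```

The change triggers listed above still apply between scheduled reviews. A date-based check like this one only catches records that nobody has touched.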
#
Recommended open-source or lower-cost tools
#
Best practical SME stack
#
For a small business with limited IT resources:
- Core record: Spreadsheet, GLPI, Snipe-IT custom fields, OpenMetadata, or DataHub depending on maturity.
- File/document organization: Nextcloud or Microsoft 365/Google Workspace native storage.
- Document archive: Paperless-ngx for scanned contracts, invoices, and archived business documents.
- Sensitive data scanning: Apache Tika + Microsoft Presidio.
- Credential discovery: Gitleaks or TruffleHog.
- Credential management: Bitwarden or Passbolt.
- Cloud data-location discovery: CloudQuery or Steampipe.
- Microsoft/Google environments: Use Purview or Google Vault if already licensed.
#
The simplest realistic stack for an SME would be:
- Spreadsheet or GLPI for the data inventory + Nextcloud or Microsoft/Google storage + Paperless-ngx for archives + Gitleaks for secrets + Presidio/Tika for sensitive-data scanning.
#
For a more technical SME:
- OpenMetadata + CloudQuery + Presidio/Tika + TruffleHog + Bitwarden/Passbolt.
#
Data inventory fields to include
For each data set or data repository, record:
- Data name
- Data category
- Business purpose
- System or storage location
- Data owner
- Technical owner
- Department
- Sensitivity classification
- Regulated data type, if any
- User groups with access
- Source of data
- Downstream systems or exports
- Third-party sharing
- Backup location
- Retention period
- Archive status
- Last reviewed date
- Review owner
- Status: active, archived, stale, duplicate, unknown, pending review, or approved
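For a spreadsheet-based inventory, the field list above can be generated as a CSV header, with a basic check on the Status column. This is a sketch only. The function names are hypothetical, and the column "Regulated data type, if any" is shortened to "Regulated data type" here to keep commas out of the header.

```python
import csv
import io

INVENTORY_FIELDS = [
    "Data name", "Data category", "Business purpose", "System or storage location",
    "Data owner", "Technical owner", "Department", "Sensitivity classification",
    "Regulated data type", "User groups with access", "Source of data",
    "Downstream systems or exports", "Third-party sharing", "Backup location",
    "Retention period", "Archive status", "Last reviewed date", "Review owner", "Status",
]
ALLOWED_STATUS = {
    "active", "archived", "stale", "duplicate", "unknown", "pending review", "approved",
}


def inventory_template():
    """Return CSV text containing only the header row."""
    buffer = io.StringIO()
    csv.writer(buffer).writerow(INVENTORY_FIELDS)
    return buffer.getvalue()


def row_problems(row):
    """Return problems found in one row (a dict keyed by column name)."""
    problems = [f"missing column: {name}" for name in INVENTORY_FIELDS if name not in row]
    status = row.get("Status", "").strip().lower()
    if status not in ALLOWED_STATUS:
        problems.append(f"unknown status: {status!r}")
    return problems
```

Rows read back with `csv.DictReader` can be passed to `row_problems` to catch misspelled status values before they spread through the inventory.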
#
Steps in Summary
Identify Data Process
1. Define data categories and classification levels.
2. Identify all locations where company data may exist.
3. Create a standard data inventory record.
4. Discover data through system exports, repository review, scanning, and owner interviews.
5. Scan for sensitive data, regulated data, credentials, and confidential files.
6. Catalog each data set with owner, location, purpose, sensitivity, access, retention, and backup details.
7. Map data flows, copies, exports, vendor sharing, and backup locations.
8. Flag unmanaged, duplicate, stale, orphaned, overexposed, or unknown data.
9. Validate the inventory with business and technical owners.
10. Maintain the inventory through change triggers and periodic review.
#
Goals in Summary
A company with a mature data inventory knows:
- Where customer data lives
- Where employee data lives
- Where credentials are stored
- Where backups are kept
- Who owns each data set
This visibility is what allows the company's digital assets to be properly protected.