# 7.2 Root Cause and Control Failure Analysis

# Goals

Root cause analysis identifies why the incident happened and why it was able to cause damage.

Control failure analysis identifies which safeguards were missing, weak, misconfigured, ignored, untested, or bypassed.

The goal is not to find one person to blame. Most cybersecurity incidents happen because several small weaknesses line up: an exposed system, a missing patch, weak MFA, excessive access, unclear ownership, poor monitoring, rushed approval, weak vendor controls, or incomplete backup testing.

A useful review asks:

What allowed the incident to start?
What allowed it to spread?
What allowed it to remain unnoticed?
What allowed it to cause business impact?
What control should have prevented, detected, limited, or corrected it?

# Step 1: Start From the Timeline

Use the post-incident timeline as the starting point.

Do not begin with assumptions. Review the actual sequence of events, including when the incident started, when it was detected, which systems were affected, what actions were taken, and when the incident was contained.

The timeline should show where the company had opportunities to prevent, detect, contain, or recover faster.

Why this matters:

Root cause analysis should be based on facts. If the timeline is incomplete, the conclusions drawn from it may be wrong.
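One way to make the timeline review concrete is to measure the gaps between key milestones. The following is a minimal sketch in Python, assuming the timeline is kept as a list of timestamped entries. The entries, the milestone labels (`started`, `detected`, `contained`), and the function names are illustrative assumptions, not part of any specific tool.

```python
from datetime import datetime

# Hypothetical timeline entries: (ISO timestamp, milestone label, description).
timeline = [
    ("2024-03-01T08:15", "started", "Phishing email delivered"),
    ("2024-03-04T14:30", "detected", "Suspicious forwarding rule reported"),
    ("2024-03-04T17:00", "contained", "Account disabled and sessions revoked"),
]


def milestone_time(entries, label):
    """Return the earliest timestamp for a milestone, or None if it is missing."""
    times = [datetime.fromisoformat(ts) for ts, kind, _ in entries if kind == label]
    return min(times) if times else None


def interval_hours(entries, start_label, end_label):
    """Hours between two milestones, or None when either one is not recorded."""
    start = milestone_time(entries, start_label)
    end = milestone_time(entries, end_label)
    if start is None or end is None:
        return None
    return (end - start).total_seconds() / 3600


time_to_detect = interval_hours(timeline, "started", "detected")
time_to_contain = interval_hours(timeline, "detected", "contained")
print(f"Time to detect: {time_to_detect} hours")
print(f"Time to contain: {time_to_contain} hours")
```

A long interval points to where the review should look first. A result of `None` means the milestone is not recorded, which is itself a gap in the timeline.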

# Step 2: Identify the Initial Entry Point

Determine how the incident most likely began.

Common entry points include:

Phishing email
Stolen password
MFA fatigue or MFA approval abuse
Compromised mailbox
Malicious attachment
Exposed remote access
Public RDP or SSH
Unpatched system
Website or CMS vulnerability
Exposed cloud storage
Compromised vendor account
OAuth or SaaS app abuse
API key or secret exposure
Lost or stolen device
Insider misuse

If the entry point is unknown, mark it as unknown and list the most likely possibilities.

Why this matters:

The company cannot reliably prevent repeat incidents if it does not understand how the first access happened.

# Step 3: Identify the Immediate Cause

The immediate cause is the event or weakness directly connected to the incident.

Examples include:

A user entered a password into a fake login page.
An admin account did not have MFA.
A website plugin was unpatched.
A public file-sharing link exposed sensitive data.
A firewall rule exposed a database.
A mailbox forwarding rule sent messages outside the company.
A backup job failed and nobody noticed.
A vendor account had too much access.
A service account key was exposed in a repository.

Why this matters:

The immediate cause explains what happened closest to the incident. It is usually not the full root cause, but it is where analysis begins.

# Step 4: Look for Deeper Root Causes

Ask why the immediate cause was possible.

Use repeated “why” questions until the company finds the process, ownership, technical, or governance issue underneath the event.

Example:

A mailbox was compromised.

Why? The user entered credentials into a phishing page.
Why did that work? MFA was not enforced on that account.
Why was MFA not enforced? The account was excluded during rollout.
Why was it excluded? Exceptions were not tracked or reviewed.
Why was that allowed? There was no access control owner or exception review process.

In this example, “the user clicked” is not the root cause. The deeper root causes include incomplete MFA enforcement, weak exception management, and unclear ownership.

Why this matters:

If the company stops at “someone clicked,” it will miss the control failures that allowed the click to become an incident.
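The repeated "why" chain from the mailbox example can be recorded as structured data so that a reviewer can check whether the analysis went deeper than the triggering event. This is a minimal sketch. The `WhyStep` class and the category labels (`event`, `technical`, `process`, `ownership`, `governance`) are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class WhyStep:
    question: str
    answer: str
    category: str  # assumed labels: event, technical, process, ownership, governance


# The mailbox example from this step, one entry per "why".
chain = [
    WhyStep("Why was the mailbox compromised?",
            "The user entered credentials into a phishing page.", "event"),
    WhyStep("Why did that work?",
            "MFA was not enforced on that account.", "technical"),
    WhyStep("Why was MFA not enforced?",
            "The account was excluded during rollout.", "process"),
    WhyStep("Why was it excluded?",
            "Exceptions were not tracked or reviewed.", "process"),
    WhyStep("Why was that allowed?",
            "There was no access control owner or exception review process.",
            "ownership"),
]


def root_causes(steps):
    """Return the answers below the event level as candidate root causes."""
    return [s.answer for s in steps if s.category != "event"]


def stops_too_early(steps):
    """True when the chain never reaches a process, ownership, or governance issue."""
    deeper = {"process", "ownership", "governance"}
    return not any(s.category in deeper for s in steps)


print(root_causes(chain))
print("Stops too early:", stops_too_early(chain))
```

A chain that ends at the event or at a single technical gap is flagged by `stops_too_early`, which signals that more "why" questions are needed.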

# Step 5: Identify Failed or Missing Preventive Controls

Preventive controls are safeguards that should reduce the chance of an incident happening.

Review whether preventive controls existed and whether they worked.

Examples include:

MFA
Strong password management
Least privilege
Patch management
Secure configuration
Endpoint protection
Email filtering
Web filtering
DNS filtering
Secure remote access
Firewall restrictions
SaaS sharing controls
Vendor access controls
Change approval
Security awareness training
Website hardening
Secrets management

If a preventive control was missing, weak, not enforced, or bypassed, record it.

Why this matters:

Preventive control gaps are often the most direct improvement opportunities.

# Step 6: Identify Failed or Missing Detective Controls

Detective controls are safeguards that should notice suspicious activity.

Review whether the company had enough visibility to detect the incident early.

Examples include:

Sign-in alerts
MFA alerts
Mailbox rule alerts
Admin change alerts
Endpoint alerts
Firewall logs
VPN logs
Backup alerts
Cloud audit logs
SaaS audit logs
Website monitoring
Public exposure checks
Employee reporting path
MSP or vendor notifications
SIEM or log review

Ask:

Was the alert enabled?
Did the alert go to the right person?
Was it reviewed quickly?
Were logs available?
Were earlier warning signs missed?

Why this matters:

Even if the company cannot prevent every incident, it should improve the chance of noticing important warning signs early.

# Step 7: Identify Failed or Missing Response Controls

Response controls help the company act quickly and correctly once an issue is detected.

Review whether response procedures worked.

Examples include:

Incident activation criteria
Response ownership
Escalation path
Evidence handling rules
Containment instructions
Account disablement process
Device isolation process
Vendor escalation path
Legal or insurance contact path
Communication approval process
Decision logging

Ask:

Was it clear who was in charge?
Was response activated quickly?
Were the right people contacted?
Was evidence preserved?
Were containment actions effective?
Were decisions recorded?

Why this matters:

A minor incident can become worse when the company is slow, confused, or unsure who has authority to act.

# Step 8: Identify Failed or Missing Recovery Controls

Recovery controls help the company restore operations safely.

Review whether recovery worked as expected.

Examples include:

Backup coverage
Backup protection
Restore testing
Recovery priorities
Recovery ownership
Known-good restore points
Device rebuild process
Credential rotation
Business validation
Monitoring after restoration
Temporary workaround controls

Ask:

Were backups available?
Were backups protected?
Were restore points clean enough to use?
Were systems restored in the right order?
Were business owners involved?
Was recovery validated before normal use resumed?

Why this matters:

Recovery failures often reveal planning gaps that were not visible before the incident.
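The control reviews in Steps 5 through 8 can be recorded in one consistent structure. The sketch below is an assumption about how that might look: the `ControlReview` class and the sample entries are hypothetical, the control types follow the four steps, and the failure states follow the list in the Goals section (missing, weak, misconfigured, ignored, untested, bypassed).

```python
from collections import Counter
from dataclasses import dataclass

CONTROL_TYPES = {"preventive", "detective", "response", "recovery"}
FAILURE_STATES = {"missing", "weak", "misconfigured", "ignored", "untested", "bypassed"}


@dataclass
class ControlReview:
    control: str
    control_type: str
    state: str  # one of FAILURE_STATES, or "worked"
    note: str = ""

    def __post_init__(self):
        # Reject values outside the agreed vocabulary so reports stay comparable.
        if self.control_type not in CONTROL_TYPES:
            raise ValueError(f"Unknown control type: {self.control_type}")
        if self.state != "worked" and self.state not in FAILURE_STATES:
            raise ValueError(f"Unknown control state: {self.state}")


def failures_by_type(reviews):
    """Count failed or missing controls for each control type."""
    return Counter(r.control_type for r in reviews if r.state in FAILURE_STATES)


# Hypothetical review entries for illustration only.
reviews = [
    ControlReview("MFA", "preventive", "missing", "Account excluded during rollout"),
    ControlReview("Backup alerts", "detective", "ignored", "Sent to an unmonitored mailbox"),
    ControlReview("Account disablement process", "response", "worked"),
    ControlReview("Restore testing", "recovery", "untested"),
]

print(failures_by_type(reviews))
```

Keeping the states to a fixed vocabulary makes it easier to compare incidents over time and to see which control type fails most often.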

# Step 9: Analyze Business Process and Human Process Failures

Some incidents have causes that are not only technical.

Review whether business processes contributed to the incident or impact.

Examples include:

Payment changes approved by email only
No callback verification for bank detail changes
Employees unsure where to report suspicious activity
No process for lost devices
Vendor access granted without review
Shared accounts used for convenience
Former employee access not removed
Manual workarounds that create data risk
No owner for a critical system
Security exceptions not reviewed

Human error should be treated as a signal, not the final answer. Ask what process, training, tool, approval step, or control would have made the error less likely or less damaging.

Why this matters:

If the improvement plan only says “train users better,” the company will often miss the business process changes needed to reduce risk.

# Step 10: Record Confirmed Findings, Unknowns, and Improvement Themes

Document the analysis clearly.

Separate confirmed findings from assumptions.

Use three categories:

Confirmed finding: Supported by evidence.
Likely finding: Reasonable based on available facts, but not fully proven.
Unknown: Still unclear or not enough evidence available.

Then group findings into improvement themes.

Common themes include:

Access control weakness
Patch or configuration weakness
Logging and monitoring gap
Backup or recovery gap
Vendor access weakness
Employee reporting gap
Business approval weakness
Documentation gap
Ownership gap
Security tooling gap
Training or awareness gap

Why this matters:

Clear findings help leadership understand what must change without getting lost in technical detail.
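The three categories and the theme grouping can be enforced in a small data model. This is a sketch under stated assumptions: the `Finding` class, the rule that a confirmed finding must cite evidence, and the sample findings (including the evidence text) are illustrative, not a prescribed format.

```python
from collections import defaultdict
from dataclasses import dataclass
from enum import Enum


class Confidence(Enum):
    CONFIRMED = "Confirmed finding"
    LIKELY = "Likely finding"
    UNKNOWN = "Unknown"


@dataclass
class Finding:
    statement: str
    confidence: Confidence
    theme: str
    evidence: str = ""

    def __post_init__(self):
        # Assumed rule: a finding cannot be marked confirmed without evidence.
        if self.confidence is Confidence.CONFIRMED and not self.evidence:
            raise ValueError("A confirmed finding must cite evidence.")


def group_by_theme(findings):
    """Group findings under their improvement theme."""
    grouped = defaultdict(list)
    for finding in findings:
        grouped[finding.theme].append(finding)
    return dict(grouped)


# Hypothetical findings for illustration only.
findings = [
    Finding("MFA was not enforced on the compromised account.",
            Confidence.CONFIRMED, "Access control weakness",
            evidence="Sign-in policy export"),
    Finding("The MFA exception was not tracked or reviewed.",
            Confidence.LIKELY, "Ownership gap"),
    Finding("Whether other mailboxes were accessed.",
            Confidence.UNKNOWN, "Logging and monitoring gap"),
]

for theme, items in group_by_theme(findings).items():
    print(theme)
    for item in items:
        print(f"  [{item.confidence.value}] {item.statement}")
```

Requiring evidence at the point of entry keeps assumptions from being recorded as confirmed findings.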

# Root Cause Analysis Template

Use a simple template with these fields:

Incident ID
Incident summary
Initial entry point
Immediate cause
Deeper root cause
Affected systems
Affected data
Affected business process
Preventive controls that failed or were missing
Detective controls that failed or were missing
Response controls that failed or were missing
Recovery controls that failed or were missing
Business process gaps
Human process gaps
Confirmed findings
Likely findings
Unknowns
Improvement themes
Recommended corrective actions
Owner
Due date
Leadership approval required
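If the template is kept in a script or ticketing export rather than a document, it could be represented as a simple record. The sketch below mirrors the fields listed above. The class name, field names, types, and the `empty_fields` helper are assumptions for illustration.

```python
from dataclasses import dataclass, field, fields


@dataclass
class RootCauseAnalysis:
    incident_id: str = ""
    incident_summary: str = ""
    initial_entry_point: str = ""  # use "unknown" when not established
    immediate_cause: str = ""
    deeper_root_cause: str = ""
    affected_systems: list = field(default_factory=list)
    affected_data: list = field(default_factory=list)
    affected_business_process: list = field(default_factory=list)
    preventive_control_gaps: list = field(default_factory=list)
    detective_control_gaps: list = field(default_factory=list)
    response_control_gaps: list = field(default_factory=list)
    recovery_control_gaps: list = field(default_factory=list)
    business_process_gaps: list = field(default_factory=list)
    human_process_gaps: list = field(default_factory=list)
    confirmed_findings: list = field(default_factory=list)
    likely_findings: list = field(default_factory=list)
    unknowns: list = field(default_factory=list)
    improvement_themes: list = field(default_factory=list)
    corrective_actions: list = field(default_factory=list)
    owner: str = ""
    due_date: str = ""
    leadership_approval_required: bool = False

    def empty_fields(self):
        """Names of text or list fields that have not been filled in yet."""
        return [f.name for f in fields(self) if getattr(self, f.name) in ("", [])]


# Hypothetical record for illustration only.
record = RootCauseAnalysis(
    incident_id="INC-001",
    initial_entry_point="unknown",
    owner="IT manager",
)
print(record.empty_fields())
```

Listing the empty fields before sign-off gives the reviewer a quick check that no part of the template was skipped.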

# Control Failure Categories

Use these categories to organize findings:

Identity and access
Email security
Endpoint and server protection
Patch and vulnerability management
Systems hardening
Network and remote access
Cloud and SaaS configuration
Website and public exposure
Data protection and sharing
Backup and recovery
Logging and alerting
Employee reporting
Vendor and third-party access
Incident response process
Business continuity process
Governance and ownership

# Examples of Weak vs Useful Root Cause Statements

Weak:

“The user clicked a phishing email.”

Useful:

“The mailbox was compromised after a user entered credentials into a phishing page. MFA was not enforced on the account because it was excluded during rollout. The exception was not tracked, and there was no scheduled review of accounts without MFA.”

Weak:

“The server was hacked because it was vulnerable.”

Useful:

“The public-facing server was compromised through an unpatched application. The patch had not been applied because the system was not included in the asset inventory or monthly vulnerability review.”

Weak:

“The backup failed.”

Useful:

“The backup job had been failing for 18 days. Alerts were sent to an unmonitored mailbox, and no backup owner was assigned to review failed jobs.”

Weak:

“The vendor account was abused.”

Useful:

“The vendor account had persistent access, no MFA, and broader permissions than needed. Vendor access was not reviewed after the project ended.”

# Expected Outputs from This Section

At the end of this section, the company should have:

A likely or confirmed entry point.
An immediate cause.
A deeper root cause or list of root causes.
A list of failed, weak, missing, or bypassed controls.
A clear separation between confirmed findings, likely findings, and unknowns.
A list of business process gaps.
A list of human process gaps.
A set of improvement themes.
Recommended corrective actions ready for prioritization.

# Objective

Do not stop at the first obvious cause.

A company should leave this section able to say:

“We understand what allowed the incident to happen, what allowed it to cause impact, which controls failed, and what must change.”

That is root cause and control failure analysis.