#
7.2 Root Cause and Control Failure Analysis
#
Goals
Root cause analysis identifies why the incident happened or why it was able to cause damage.
Control failure analysis identifies which safeguards were missing, weak, misconfigured, ignored, untested, or bypassed.
The goal is not to find one person to blame. Most cybersecurity incidents happen because several small weaknesses line up: an exposed system, a missing patch, weak MFA, excessive access, unclear ownership, poor monitoring, rushed approval, weak vendor controls, or incomplete backup testing.
A useful review asks:
- What allowed the incident to start?
- What allowed it to spread?
- What allowed it to remain unnoticed?
- What allowed it to cause business impact?
- What control should have prevented, detected, limited, or corrected it?
#
Step 1: Start From the Timeline
Use the post-incident timeline as the starting point.
Do not begin with assumptions. Review the actual sequence of events, including when the incident started, when it was detected, which systems were affected, what actions were taken, and when the incident was contained.
The timeline should show where the company had opportunities to prevent, detect, contain, or recover faster.
Why this matters:
Root cause analysis should be based on facts. If the timeline is incomplete, the conclusions drawn from it may be wrong.
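As a rough illustration of reading the timeline for missed opportunities, the sketch below computes the elapsed time between consecutive milestones. The milestone labels, the dates, and the `milestone_gaps` helper are all hypothetical; a real timeline would come from the post-incident record.

```python
from datetime import datetime, timedelta

# Hypothetical timeline entries: (timestamp, milestone label).
timeline = [
    (datetime(2024, 3, 4, 14, 40), "detected"),
    (datetime(2024, 3, 1, 8, 15), "started"),
    (datetime(2024, 3, 4, 19, 5), "contained"),
]

def milestone_gaps(events):
    """Return the elapsed time between consecutive milestones, in time order."""
    ordered = sorted(events, key=lambda event: event[0])
    gaps = {}
    for (t_prev, prev_label), (t_next, next_label) in zip(ordered, ordered[1:]):
        gaps[f"{prev_label} -> {next_label}"] = t_next - t_prev
    return gaps

gaps = milestone_gaps(timeline)
for span, elapsed in gaps.items():
    print(f"{span}: {elapsed}")
```

A long gap between start and detection points the review toward detective controls; a long gap between detection and containment points toward response controls.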
#
Step 2: Identify the Initial Entry Point
Determine how the incident most likely began.
Common entry points include:
- Phishing email
- Stolen password
- MFA fatigue or MFA approval abuse
- Compromised mailbox
- Malicious attachment
- Exposed remote access
- Public RDP or SSH
- Unpatched system
- Website or CMS vulnerability
- Exposed cloud storage
- Compromised vendor account
- OAuth or SaaS app abuse
- API key or secret exposure
- Lost or stolen device
- Insider misuse
If the entry point is unknown, mark it as unknown and list the most likely possibilities.
Why this matters:
The company cannot reliably prevent repeat incidents if it does not understand how initial access happened.
#
Step 3: Identify the Immediate Cause
The immediate cause is the event or weakness directly connected to the incident.
Examples include:
- A user entered a password into a fake login page.
- An admin account did not have MFA.
- A website plugin was unpatched.
- A public file-sharing link exposed sensitive data.
- A firewall rule exposed a database.
- A mailbox forwarding rule sent messages outside the company.
- A backup job failed and nobody noticed.
- A vendor account had too much access.
- A service account key was exposed in a repository.
Why this matters:
The immediate cause explains what happened closest to the incident. It is usually not the full root cause, but it is where analysis begins.
#
Step 4: Look for Deeper Root Causes
Ask why the immediate cause was possible.
Keep asking “why” until the review reaches the process, ownership, technical, or governance issue underneath the event.
Example:
A mailbox was compromised.
Why? The user entered credentials into a phishing page.
Why did that work? MFA was not enforced on that account.
Why was MFA not enforced? The account was excluded during rollout.
Why was it excluded? Exceptions were not tracked or reviewed.
Why was that allowed? There was no access control owner or exception review process.
In this example, “the user clicked” is not the root cause. The deeper root causes include incomplete MFA enforcement, weak exception management, and unclear ownership.
Why this matters:
If the company stops at “someone clicked,” it will miss the control failures that allowed the click to become an incident.
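The why-chain from the mailbox example can be recorded as structured data so that the immediate cause and the deepest cause are never confused. This is a minimal sketch; the `WhyStep` and `summarize_chain` names are assumptions made for illustration, not part of any standard tool.

```python
from dataclasses import dataclass

@dataclass
class WhyStep:
    question: str
    answer: str

def summarize_chain(event, steps):
    """Summarize a why-chain: the first answer is the immediate cause,
    the last answer is the deepest cause reached so far."""
    if not steps:
        raise ValueError("A why-chain needs at least one step.")
    return {
        "event": event,
        "immediate_cause": steps[0].answer,
        "deepest_cause": steps[-1].answer,
        "depth": len(steps),
    }

# The chain below restates the mailbox example from the text.
chain = [
    WhyStep("Why?", "The user entered credentials into a phishing page."),
    WhyStep("Why did that work?", "MFA was not enforced on that account."),
    WhyStep("Why was MFA not enforced?", "The account was excluded during rollout."),
    WhyStep("Why was it excluded?", "Exceptions were not tracked or reviewed."),
    WhyStep("Why was that allowed?",
            "There was no access control owner or exception review process."),
]

summary = summarize_chain("A mailbox was compromised.", chain)
print(summary["immediate_cause"])
print(summary["deepest_cause"])
```

A chain with a depth of one is a sign that the analysis stopped at the immediate cause.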
#
Step 5: Identify Failed or Missing Preventive Controls
Preventive controls are safeguards that should reduce the chance of an incident happening.
Review whether preventive controls existed and whether they worked.
Examples include:
- MFA
- Strong password management
- Least privilege
- Patch management
- Secure configuration
- Endpoint protection
- Email filtering
- Web filtering
- DNS filtering
- Secure remote access
- Firewall restrictions
- SaaS sharing controls
- Vendor access controls
- Change approval
- Security awareness training
- Website hardening
- Secrets management
If a preventive control was missing, weak, not enforced, or bypassed, record it.
Why this matters:
Preventive control gaps are often the most direct improvement opportunities.
#
Step 6: Identify Failed or Missing Detective Controls
Detective controls are safeguards that should notice suspicious activity.
Review whether the company had enough visibility to detect the incident early.
Examples include:
- Sign-in alerts
- MFA alerts
- Mailbox rule alerts
- Admin change alerts
- Endpoint alerts
- Firewall logs
- VPN logs
- Backup alerts
- Cloud audit logs
- SaaS audit logs
- Website monitoring
- Public exposure checks
- Employee reporting path
- MSP or vendor notifications
- SIEM or log review
Ask:
- Was the alert enabled?
- Did the alert go to the right person?
- Was it reviewed quickly?
- Were logs available?
- Were earlier warning signs missed?
Why this matters:
Even if the company cannot prevent every incident, it should improve the chance of noticing important warning signs early.
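The five detection questions above can be treated as a small checklist. The sketch below is one hypothetical way to do that: each question is paired with the answer that signals a gap, and unanswered questions are reported separately instead of being guessed. The `detection_gaps` helper and the sample answers are illustrative assumptions.

```python
# Each entry pairs a review question with the answer that signals a gap.
# Note that the last question is inverted: "yes" is the bad answer.
DETECTION_QUESTIONS = [
    ("Was the alert enabled?", False),
    ("Did the alert go to the right person?", False),
    ("Was it reviewed quickly?", False),
    ("Were logs available?", False),
    ("Were earlier warning signs missed?", True),
]

def detection_gaps(answers):
    """Split the questions into gaps and unknowns.

    `answers` maps question text to True (yes), False (no), or None (unknown).
    Questions that are absent from `answers` are treated as unknown.
    """
    gaps, unknowns = [], []
    for question, gap_answer in DETECTION_QUESTIONS:
        answer = answers.get(question)
        if answer is None:
            unknowns.append(question)
        elif answer == gap_answer:
            gaps.append(question)
    return gaps, unknowns

# Hypothetical answers for a backup alert that went to an unmonitored mailbox.
answers = {
    "Was the alert enabled?": True,
    "Did the alert go to the right person?": False,
    "Was it reviewed quickly?": False,
    "Were logs available?": True,
}

gaps, unknowns = detection_gaps(answers)
print("Gaps:", gaps)
print("Unknown:", unknowns)
```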
#
Step 7: Identify Failed or Missing Response Controls
Response controls help the company act quickly and correctly once an issue is detected.
Review whether response procedures worked.
Examples include:
- Incident activation criteria
- Response ownership
- Escalation path
- Evidence handling rules
- Containment instructions
- Account disablement process
- Device isolation process
- Vendor escalation path
- Legal or insurance contact path
- Communication approval process
- Decision logging
Ask:
- Was it clear who was in charge?
- Was response activated quickly?
- Were the right people contacted?
- Was evidence preserved?
- Were containment actions effective?
- Were decisions recorded?
Why this matters:
A minor incident can become worse when the company is slow, confused, or unsure who has authority to act.
#
Step 8: Identify Failed or Missing Recovery Controls
Recovery controls help the company restore operations safely.
Review whether recovery worked as expected.
Examples include:
- Backup coverage
- Backup protection
- Restore testing
- Recovery priorities
- Recovery ownership
- Known-good restore points
- Device rebuild process
- Credential rotation
- Business validation
- Monitoring after restoration
- Temporary workaround controls
Ask:
- Were backups available?
- Were backups protected?
- Were restore points clean enough to use?
- Were systems restored in the right order?
- Were business owners involved?
- Was recovery validated before normal use resumed?
Why this matters:
Recovery failures often reveal planning gaps that were not visible before the incident.
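The control reviews in Steps 5 through 8 can be recorded in one structure so that gaps are comparable across control types. The sketch below assumes a hypothetical `ControlGap` record; the status values mirror the failure modes named in the Goals (missing, weak, misconfigured, ignored, untested, bypassed), and the sample entries are illustrative only.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum

class ControlType(Enum):
    PREVENTIVE = "preventive"
    DETECTIVE = "detective"
    RESPONSE = "response"
    RECOVERY = "recovery"

class GapStatus(Enum):
    MISSING = "missing"
    WEAK = "weak"
    MISCONFIGURED = "misconfigured"
    IGNORED = "ignored"
    UNTESTED = "untested"
    BYPASSED = "bypassed"

@dataclass(frozen=True)
class ControlGap:
    control: str
    control_type: ControlType
    status: GapStatus
    evidence: str  # What supports the finding; keep it factual.

# Illustrative entries only, loosely based on examples used in this section.
control_gaps = [
    ControlGap("MFA", ControlType.PREVENTIVE, GapStatus.MISSING,
               "Account was excluded during rollout"),
    ControlGap("Backup alerts", ControlType.DETECTIVE, GapStatus.IGNORED,
               "Alerts were sent to an unmonitored mailbox"),
    ControlGap("Restore testing", ControlType.RECOVERY, GapStatus.UNTESTED,
               "No restore test on record"),
]

# Count gaps per control type to see where the review found the most problems.
gaps_per_type = Counter(gap.control_type for gap in control_gaps)
for control_type in ControlType:
    print(f"{control_type.value}: {gaps_per_type[control_type]}")
```

A control type with zero recorded gaps is worth a second look: it may mean the controls worked, or that nobody reviewed them.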
#
Step 9: Analyze Business Process and Human Process Failures
Not every incident has a purely technical cause.
Review whether business processes contributed to the incident or impact.
Examples include:
- Payment changes approved by email only
- No callback verification for bank detail changes
- Employees unsure where to report suspicious activity
- No process for lost devices
- Vendor access granted without review
- Shared accounts used for convenience
- Former employee access not removed
- Manual workarounds that created data risk
- No owner for a critical system
- Security exceptions not reviewed
Human error should be treated as a signal, not the final answer. Ask what process, training, tool, approval step, or control would have made the error less likely or less damaging.
Why this matters:
If the improvement plan only says “train users better,” the company will often miss the business process changes needed to reduce risk.
#
Step 10: Record Confirmed Findings, Unknowns, and Improvement Themes
Document the analysis clearly.
Separate confirmed findings from assumptions.
Use three categories:
- Confirmed finding: Supported by evidence.
- Likely finding: Reasonable based on available facts, but not fully proven.
- Unknown: Still unclear or not enough evidence available.
Then group findings into improvement themes.
Common themes include:
- Access control weakness
- Patch or configuration weakness
- Logging and monitoring gap
- Backup or recovery gap
- Vendor access weakness
- Employee reporting gap
- Business approval weakness
- Documentation gap
- Ownership gap
- Security tooling gap
- Training or awareness gap
Why this matters:
Clear findings help leadership understand what must change without getting lost in technical detail.
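One way to keep the three categories separate while grouping by theme is sketched below. The `Finding` record and the `group_by_theme` helper are hypothetical names, and the sample findings are illustrative, not taken from a real incident.

```python
from collections import defaultdict
from dataclasses import dataclass
from enum import Enum

class Confidence(Enum):
    CONFIRMED = "confirmed"
    LIKELY = "likely"
    UNKNOWN = "unknown"

@dataclass
class Finding:
    statement: str
    confidence: Confidence
    theme: str

def group_by_theme(findings):
    """Group finding statements by theme, keeping confidence levels separate."""
    themes = defaultdict(lambda: {level: [] for level in Confidence})
    for finding in findings:
        themes[finding.theme][finding.confidence].append(finding.statement)
    return dict(themes)

# Illustrative findings only.
findings = [
    Finding("MFA was not enforced on the compromised account.",
            Confidence.CONFIRMED, "Access control weakness"),
    Finding("Other accounts were excluded from MFA during rollout.",
            Confidence.LIKELY, "Access control weakness"),
    Finding("Whether mailbox rule changes were logged.",
            Confidence.UNKNOWN, "Logging and monitoring gap"),
]

grouped = group_by_theme(findings)
for theme, by_level in grouped.items():
    print(theme)
    for level in Confidence:
        for statement in by_level[level]:
            print(f"  [{level.value}] {statement}")
```

Because every theme carries all three lists, a leadership summary can show confirmed findings first without dropping the likely findings and unknowns.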
#
Root Cause Analysis Template
Use a simple template with these fields:
- Incident ID
- Incident summary
- Initial entry point
- Immediate cause
- Deeper root cause
- Affected systems
- Affected data
- Affected business process
- Preventive controls that failed or were missing
- Detective controls that failed or were missing
- Response controls that failed or were missing
- Recovery controls that failed or were missing
- Business process gaps
- Human process gaps
- Confirmed findings
- Likely findings
- Unknowns
- Improvement themes
- Recommended corrective actions
- Owner
- Due date
- Leadership approval required
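The template fields can also be kept as a structured record, which makes it easy to see what is still unfilled before the review is closed. This is a sketch under assumptions: the `RootCauseRecord` class, its field types, the `empty_fields` helper, and the sample incident ID are all hypothetical.

```python
from dataclasses import dataclass, field, fields
from typing import List, Optional

@dataclass
class RootCauseRecord:
    # Field names follow the template list; the types are assumptions.
    incident_id: str = ""
    incident_summary: str = ""
    initial_entry_point: str = ""
    immediate_cause: str = ""
    deeper_root_cause: str = ""
    affected_systems: List[str] = field(default_factory=list)
    affected_data: List[str] = field(default_factory=list)
    affected_business_process: str = ""
    preventive_control_gaps: List[str] = field(default_factory=list)
    detective_control_gaps: List[str] = field(default_factory=list)
    response_control_gaps: List[str] = field(default_factory=list)
    recovery_control_gaps: List[str] = field(default_factory=list)
    business_process_gaps: List[str] = field(default_factory=list)
    human_process_gaps: List[str] = field(default_factory=list)
    confirmed_findings: List[str] = field(default_factory=list)
    likely_findings: List[str] = field(default_factory=list)
    unknowns: List[str] = field(default_factory=list)
    improvement_themes: List[str] = field(default_factory=list)
    recommended_corrective_actions: List[str] = field(default_factory=list)
    owner: str = ""
    due_date: str = ""
    leadership_approval_required: Optional[bool] = None

def empty_fields(record):
    """Return the names of fields that have not been filled in yet.

    False counts as filled in, since "no approval required" is a valid answer.
    """
    unfilled = []
    for f in fields(record):
        value = getattr(record, f.name)
        if value is None or value == "" or value == []:
            unfilled.append(f.name)
    return unfilled

# Hypothetical, partly completed record.
record = RootCauseRecord(
    incident_id="INC-001",
    incident_summary="Mailbox compromised through phishing",
    initial_entry_point="Phishing email",
    immediate_cause="User entered credentials into a fake login page",
    leadership_approval_required=False,
)

print("Still to complete:", empty_fields(record))
```

An empty list is not always a gap: if the review found no recovery control failures, recording that explicitly (for example as "None identified") is clearer than leaving the field blank.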
#
Control Failure Categories
Use these categories to organize findings:
- Identity and access
- Email security
- Endpoint and server protection
- Patch and vulnerability management
- Systems hardening
- Network and remote access
- Cloud and SaaS configuration
- Website and public exposure
- Data protection and sharing
- Backup and recovery
- Logging and alerting
- Employee reporting
- Vendor and third-party access
- Incident response process
- Business continuity process
- Governance and ownership
#
Examples of Weak vs Useful Root Cause Statements
Weak:
“The user clicked a phishing email.”
Useful:
“The mailbox was compromised after a user entered credentials into a phishing page. MFA was not enforced on the account because it was excluded during rollout. The exception was not tracked, and there was no scheduled review of accounts without MFA.”
Weak:
“The server was hacked because it was vulnerable.”
Useful:
“The public-facing server was compromised through an unpatched application. The patch had not been applied because the system was not included in the asset inventory or monthly vulnerability review.”
Weak:
“The backup failed.”
Useful:
“The backup job had been failing for 18 days. Alerts were sent to an unmonitored mailbox, and no backup owner was assigned to review failed jobs.”
Weak:
“The vendor account was abused.”
Useful:
“The vendor account had persistent access, no MFA, and broader permissions than needed. Vendor access was not reviewed after the project ended.”
#
Expected Outputs from This Section
At the end of this section, the company should have:
- A likely or confirmed entry point.
- An immediate cause.
- A deeper root cause or list of root causes.
- A list of failed, weak, missing, or bypassed controls.
- A clear separation between confirmed findings, likely findings, and unknowns.
- A list of business process gaps.
- A list of human process gaps.
- A set of improvement themes.
- Recommended corrective actions ready for prioritization.
#
Objective
Do not stop at the first obvious cause.
A company should leave this section able to say:
“We understand what allowed the incident to happen, what allowed it to cause impact, which controls failed, and what must change.”
That is root cause and control failure analysis.