What Is Root Cause Analysis (RCA)?
Root cause analysis (RCA) is a structured problem-solving process for identifying the underlying cause(s) of an incident, defect, or failure rather than addressing only its symptoms. A root cause is a causal factor whose removal or mitigation prevents the same class of failure from recurring under similar conditions.
In practice, RCA usually produces:
- a clear problem statement and timeline,
- validated causal chain(s) from symptoms to causes,
- concrete corrective and preventive actions (CAPA),
- follow-up checks to ensure the fix is effective.
When To Use RCA
- Repeated incidents, high-severity outages, safety issues, or costly defects.
- Failures with unclear ownership across teams or components.
- Situations where quick mitigations exist but recurrence risk remains high.
Typical RCA Workflow
- Define the problem: impact, scope, success criteria (what “fixed” means).
- Collect evidence: logs, metrics, traces, configs, release records, and a timeline of human actions (who did what, and when).
- Build a timeline: what changed, when symptoms started, how it propagated.
- Form hypotheses: enumerate plausible causes, then actively try to disprove each one against the evidence.
- Validate causes: reproduce, isolate variables, run experiments, compare baselines.
- Decide actions: mitigations (short-term) + fixes (long-term) + detection improvements.
- Verify and learn: confirm no regression; update runbooks, monitors, and processes.
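The "collect evidence" and "build a timeline" steps can be sketched in code. The following is a minimal illustration, not a prescribed implementation: the `Event` type, the source labels (`deploy`, `config`, `alert`), the helper names, and the sample events are all hypothetical and invented for this example. It merges events from several sources into one chronological list and then picks out the changes that landed shortly before the first symptom.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Event:
    time: datetime
    source: str   # hypothetical labels: "deploy", "config", "alert"
    detail: str


def build_timeline(*streams):
    """Merge event streams from different sources into one chronological list."""
    return sorted((e for s in streams for e in s), key=lambda e: e.time)


def changes_before(timeline, symptom_start, window,
                   change_sources=("deploy", "config")):
    """Return change events that fall within `window` before the first symptom.

    The result is a list of candidates to investigate, not confirmed causes.
    """
    earliest = symptom_start - window
    return [
        e for e in timeline
        if e.source in change_sources and earliest <= e.time <= symptom_start
    ]


# Made-up sample data, for illustration only.
deploys = [Event(datetime(2024, 1, 1, 9, 0), "deploy", "service-a v2 rollout")]
configs = [Event(datetime(2024, 1, 1, 11, 40), "config", "pool size lowered")]
alerts = [Event(datetime(2024, 1, 1, 11, 55), "alert", "p99 latency above threshold")]

timeline = build_timeline(deploys, configs, alerts)
symptom_start = alerts[0].time
candidates = changes_before(timeline, symptom_start, timedelta(hours=1))

for e in candidates:
    print(e.time.isoformat(), e.source, e.detail)
```

With a one-hour window, only the config change is flagged; the earlier deploy falls outside it. Proximity in time only narrows the search, so each candidate still has to pass the "validate causes" step before it is treated as a cause.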
Common Techniques
- 5 Whys (iteratively ask “why?” to uncover deeper causes)
- Ishikawa / fishbone diagram (categorize contributing factors)
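- Fault tree analysis (top-down decomposition of a failure event into contributing events combined through AND/OR logic gates)
- Change/rollback analysis (diff recent changes and test whether a correlated change actually caused the failure, for example by rolling it back)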
- Post-incident review (blameless retrospectives with evidence-based conclusions)
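Of the techniques above, fault tree analysis maps most directly onto code. The sketch below is one simple way to represent such a tree, assuming only AND/OR gates and boolean basic events; the `Node` class, the `occurs` function, and the example tree are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    gate: str = "BASIC"            # "AND", "OR", or "BASIC" (a leaf event)
    children: list = field(default_factory=list)


def occurs(node, observed):
    """Return True if `node` occurs given the set of observed basic events."""
    if node.gate == "BASIC":
        return node.name in observed
    results = [occurs(child, observed) for child in node.children]
    if node.gate == "AND":
        return all(results)
    if node.gate == "OR":
        return any(results)
    raise ValueError(f"unknown gate: {node.gate}")


# Hypothetical tree: checkout fails if the database is unreachable, OR
# if a bad deploy went out AND automatic rollback was disabled.
tree = Node("checkout_failure", "OR", [
    Node("db_unreachable"),
    Node("bad_release", "AND", [
        Node("bad_deploy"),
        Node("rollback_disabled"),
    ]),
])

print(occurs(tree, {"bad_deploy"}))                       # False
print(occurs(tree, {"bad_deploy", "rollback_disabled"}))  # True
print(occurs(tree, {"db_unreachable"}))                   # True
```

Evaluating the tree against different sets of basic events shows which combinations are sufficient to produce the top-level failure. In this made-up example, a bad deploy alone is not enough, which points to the disabled rollback as a contributing factor worth its own corrective action.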
Related Notes
- OpenRCA note
References
- Xu, J., Zhang, Q., Zhong, Z., He, S., Zhang, C., Lin, Q., … & Zhang, Q. (2025). OpenRCA: Can Large Language Models Locate the Root Cause of Software Failures? (ICLR).