What is AIOps?
AIOps (Artificial Intelligence for IT Operations) is a framework or practice that leverages artificial intelligence (AI) and machine learning (ML) to automate, optimize, and enhance IT operations. It focuses on using data-driven insights to monitor, detect, analyze, and resolve issues in complex IT environments.
Key components of AIOps:
- Data Ingestion & Correlation: Collects logs, metrics, traces, and events from diverse sources.
- Noise Reduction & Anomaly Detection: Filters out irrelevant alerts and identifies anomalies in real time.
- Root Cause Analysis (RCA): Uses ML to find the actual cause of incidents.
- Automation & Remediation: Executes automated responses (e.g., scaling servers, restarting services).
- Predictive Analysis: Forecasts potential failures or capacity issues before they occur.
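To make the anomaly detection component concrete, here is a minimal sketch that flags metric samples deviating strongly from a trailing window (a rolling z-score). This is only one simple technique among many; the function name `detect_anomalies`, the window size, the threshold, and the sample CPU readings are all illustrative assumptions, not part of any particular AIOps product.

```python
from statistics import mean, stdev


def detect_anomalies(values, window=5, threshold=3.0):
    """Return indices of samples that deviate from the trailing-window mean
    by more than `threshold` standard deviations (rolling z-score)."""
    anomalies = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        mu = mean(history)
        sigma = stdev(history)
        if sigma == 0:
            # Flat history: treat any change at all as a deviation.
            if values[i] != mu:
                anomalies.append(i)
            continue
        if abs(values[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies


# Hypothetical CPU utilisation samples (percent); index 7 is a spike.
cpu = [41, 43, 42, 40, 44, 42, 43, 95, 42, 41]
print(detect_anomalies(cpu))  # prints [7]
```

In practice the flagged indices would feed the later stages (root cause analysis, automated remediation) rather than being printed.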
Emerging Data Sources in AIOps
Traditional AIOps data sources are all generated automatically by the system.
System-generated data
Logs (all software produces logs)
Metrics (network measurement metrics, wireless measurement metrics, MLflow System Metrics, accuracy metrics for time series forecasting, etc.)
Traces
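As a rough illustration of how these three system-generated data types can be ingested and correlated, the sketch below normalises them into one record type and groups them by service and time bucket. The `Event` class, its field names, the `correlate` helper, and the sample events are hypothetical and chosen only for this example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Event:
    kind: str         # "log", "metric", or "trace"
    service: str      # name of the emitting service
    timestamp: float  # seconds since epoch
    payload: dict     # type-specific content


def correlate(events, window_seconds=60):
    """Group events that share a service and fall in the same time bucket."""
    buckets = defaultdict(list)
    for event in events:
        bucket = int(event.timestamp // window_seconds)
        buckets[(event.service, bucket)].append(event)
    return dict(buckets)


# Hypothetical events from two services within the same minute.
events = [
    Event("log", "checkout", 1000.0, {"message": "timeout calling payment"}),
    Event("metric", "checkout", 1010.0, {"name": "latency_ms", "value": 2300}),
    Event("trace", "checkout", 1015.0, {"span": "POST /pay", "duration_ms": 2250}),
    Event("log", "search", 1012.0, {"message": "cache refreshed"}),
]

groups = correlate(events)
for (service, bucket), grouped in groups.items():
    print(service, bucket, [e.kind for e in grouped])
```

Grouping by a fixed time bucket is the simplest possible choice; real systems often correlate on trace IDs or topology as well.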
Human-generated data
Processing these sources requires Large Language Models (LLMs).
Source Code
Question & Answer (QA)
Incident Reports
Software Information
- Document
- Config
- Architecture
Reference List
- Zhang, L., Jia, T., Jia, M., Wu, Y., Liu, A., Yang, Y., … & Li, Y. (2025). A Survey of AIOps in the Era of Large Language Models. ACM Computing Surveys.