What is AIOps?

AIOps (Artificial Intelligence for IT Operations) is a framework or practice that leverages artificial intelligence (AI) and machine learning (ML) to automate, optimize, and enhance IT operations. It focuses on using data-driven insights to monitor, detect, analyze, and resolve issues in complex IT environments.

Key components of AIOps:

  • Data Ingestion & Correlation: Collects logs, metrics, traces, and events from diverse sources.
  • Noise Reduction & Anomaly Detection: Filters out irrelevant alerts and identifies anomalies in real time.
  • Root Cause Analysis (RCA): Uses ML to find the actual cause of incidents.
  • Automation & Remediation: Executes automated responses (e.g., scaling servers, restarting services).
  • Predictive Analysis: Forecasts potential failures or capacity issues before they occur.
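To make the anomaly detection component more concrete, here is a minimal sketch of one simple technique: a trailing-window z-score check over a metric series. The function name `detect_anomalies`, the window size, the threshold, and the sample CPU values are all assumptions chosen for illustration. Production AIOps platforms typically use more sophisticated models.

```python
from statistics import mean, stdev


def detect_anomalies(values, window=5, threshold=3.0):
    """Return indices of points that deviate from the trailing-window mean
    by more than `threshold` standard deviations (illustrative sketch)."""
    anomalies = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        mu = mean(history)
        sigma = stdev(history)
        if sigma == 0:
            # Flat history: treat any change at all as an anomaly.
            if values[i] != mu:
                anomalies.append(i)
            continue
        if abs(values[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies


# Hypothetical CPU utilization samples (percent) with one spike.
cpu = [41, 40, 42, 41, 43, 42, 41, 95, 42, 41]
print(detect_anomalies(cpu))  # prints [7]
```

Only the spike at index 7 is flagged. The points that follow it are not, because the spike inflates the standard deviation of the trailing window. This is one reason simple z-score checks are usually combined with more robust statistics in practice.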

Emerging Data Sources in AIOps

Traditional data sources are all generated automatically by systems. Emerging data sources add human-generated data.

System-generated data

Logs (almost all software produces logs)

Metrics (e.g., network measurement metrics, wireless measurement metrics, MLflow system metrics, accuracy metrics for time series forecasting)

Traces
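System-generated data usually has to be parsed into structured fields before it can be correlated or fed to a model. The sketch below shows one way to do this for a single log line. The log format (timestamp, level, service name, message), the pattern `LOG_PATTERN`, and the function `parse_log_line` are hypothetical. Real systems emit many different formats.

```python
import re
from datetime import datetime, timezone

# Assumed line format: "<ISO-8601 UTC timestamp> <LEVEL> <service> <message>"
LOG_PATTERN = re.compile(
    r"^(?P<ts>\S+)\s+(?P<level>[A-Z]+)\s+(?P<service>\S+)\s+(?P<message>.*)$"
)


def parse_log_line(line):
    """Parse one log line into a dict, or return None if it does not match."""
    match = LOG_PATTERN.match(line.strip())
    if match is None:
        return None
    fields = match.groupdict()
    try:
        # Convert the timestamp string into a timezone-aware datetime.
        fields["ts"] = datetime.strptime(
            fields["ts"], "%Y-%m-%dT%H:%M:%SZ"
        ).replace(tzinfo=timezone.utc)
    except ValueError:
        return None
    return fields


record = parse_log_line(
    "2025-01-15T10:32:07Z ERROR payment-service Connection timeout after 30s"
)
print(record["level"], record["service"], record["message"])
```

Once logs are in this form, they can be joined with metrics and traces on shared keys such as the timestamp and service name.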

Human-generated data

Processing this data requires large language models (LLMs).

Source Code

Question & Answer (QA)

Incident Reports

Software Information

  • Document
  • Config
  • Architecture

Reference List

  1. Zhang, L., Jia, T., Jia, M., Wu, Y., Liu, A., Yang, Y., … & Li, Y. (2025). A Survey of AIOps in the Era of Large Language Models. ACM Computing Surveys.