Your comprehensive reference for AIOps, network automation, and IT operations terminology. Understand the key concepts driving modern IT infrastructure management.
Artificial Intelligence for IT Operations. A methodology that uses AI and machine learning to automate and enhance IT operations including monitoring, event correlation, anomaly detection, and incident response.
A condition where IT operators become desensitized to alerts due to high volume, leading to missed critical issues. AIOps helps combat alert fatigue through intelligent alert correlation and suppression.
The use of machine learning algorithms to identify patterns in data that deviate from expected behavior, enabling proactive identification of potential issues before they cause outages.
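As a rough illustration of the idea, the sketch below flags values that sit far from the mean of a sample. It uses a plain z-score rather than a trained machine learning model, and the function name, sample data, and threshold are all assumptions chosen for the example.

```python
from statistics import mean, stdev


def find_anomalies(values, threshold=2.0):
    """Return (index, value) pairs whose z-score exceeds the threshold."""
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation
    if sigma == 0:
        return []  # all values identical, nothing deviates
    return [
        (i, v) for i, v in enumerate(values)
        if abs(v - mu) / sigma > threshold
    ]


# Hypothetical latency samples in milliseconds with one obvious spike.
latency_ms = [20, 22, 19, 21, 20, 23, 21, 250, 22, 20]
print(find_anomalies(latency_ms, threshold=2.0))  # [(7, 250)]
```

A single large outlier inflates the standard deviation, so the threshold for small samples usually needs to be lower than the textbook value of 3.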
A server that acts as a single entry point for API calls, handling request routing, composition, and protocol translation. Important for microservices architectures.
The automatic execution of corrective actions in response to detected issues, without requiring human intervention. Also known as self-healing or auto-remediation.
A reference point representing normal system behavior, used by monitoring tools to detect deviations and anomalies. Dynamic baselines adapt to changing patterns over time.
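The exterior gateway routing protocol that exchanges routing and reachability information between autonomous systems, and so determines how traffic is routed across the internet. Critical for enterprise and service provider networks.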
The process of controlling changes to IT infrastructure to minimize risk and ensure stability. Automation can streamline change management by validating changes before deployment.
A repository that stores information about IT assets and their relationships. Essential for understanding infrastructure dependencies and impact analysis.
The gradual divergence of system configurations from their intended state over time. Network automation helps prevent and detect configuration drift.
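Drift detection can be pictured as a comparison between an intended configuration and the one actually running. The following is a minimal sketch under that assumption; the dictionary keys and values are hypothetical and do not correspond to any particular vendor's configuration format.

```python
def detect_drift(intended, actual):
    """Return settings whose actual value differs from the intended value.

    A setting missing on one side is reported with None for that side.
    """
    drift = {}
    for key in sorted(intended.keys() | actual.keys()):
        want = intended.get(key)
        have = actual.get(key)
        if want != have:
            drift[key] = {"intended": want, "actual": have}
    return drift


# Hypothetical intended state versus what is running on the device.
intended = {"hostname": "edge-rtr-01", "ntp_server": "10.0.0.1",
            "snmp_community": "monitoring"}
actual = {"hostname": "edge-rtr-01", "ntp_server": "10.0.0.9",
          "telnet": "enabled"}

for setting, values in detect_drift(intended, actual).items():
    print(setting, values)
```

Here the changed NTP server, the missing SNMP community, and the unexpected telnet setting are all reported, while the matching hostname is not.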
A component of AIOps platforms that analyzes events from multiple sources to identify relationships and reduce alert noise by grouping related events.
A set of practices combining software development and IT operations to shorten development cycles and deliver high-quality software continuously.
A network protocol that automatically assigns IP addresses and other network configuration parameters to devices on a network.
The hierarchical naming system that translates human-readable domain names into IP addresses that computers use to communicate.
Processing data closer to where it is generated rather than in a centralized data center. Reduces latency and bandwidth usage for time-sensitive applications.
The process of analyzing multiple events to identify patterns, relationships, and root causes. A key capability of AIOps platforms.
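One simple way to picture correlation is time-window grouping: events from the same device that arrive close together are treated as one group. The sketch below assumes that rule; real platforms typically also use topology and learned patterns, and the Event fields and 60-second window are illustrative choices.

```python
from dataclasses import dataclass


@dataclass
class Event:
    timestamp: float  # seconds since some reference point
    device: str
    message: str


def correlate(events, window_seconds=60.0):
    """Group events per device when each follows the previous within the window."""
    groups = []
    latest_group = {}  # device -> most recent group for that device
    for event in sorted(events, key=lambda e: e.timestamp):
        group = latest_group.get(event.device)
        if group is not None and event.timestamp - group[-1].timestamp <= window_seconds:
            group.append(event)
        else:
            group = [event]
            groups.append(group)
            latest_group[event.device] = group
    return groups


events = [
    Event(0, "sw1", "link down"),
    Event(5, "sw1", "BGP peer lost"),
    Event(10, "sw2", "high CPU"),
    Event(30, "sw1", "interface flap"),
    Event(500, "sw1", "link down"),
]
for group in correlate(events):
    print(group[0].device, [e.message for e in group])
```

Five raw events collapse into three groups, which is the noise reduction the definition describes.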
The process of routing an incident to higher-level support or management when it cannot be resolved at the current level or requires additional authority.
The process of detecting, isolating, and correcting network faults. A core function of NOC operations that can be automated with AI.
A metric measuring the percentage of support issues resolved during the first interaction with the customer, without requiring escalation or callback.
A network security device that monitors and controls incoming and outgoing network traffic based on predetermined security rules.
A network node that serves as an entry point to another network. Gateways can perform protocol conversion and route traffic between different network segments.
An operational framework that applies DevOps best practices for infrastructure automation, using Git as the single source of truth for declarative infrastructure and applications.
A method of distributing traffic across multiple data centers in different geographic locations to improve performance, availability, and disaster recovery.
A system design approach that ensures a certain level of operational continuity, typically through redundancy and failover mechanisms to minimize downtime.
Software that creates and manages virtual machines, allowing multiple operating systems to run on a single physical host. Examples include VMware ESXi and Microsoft Hyper-V.
Hypertext Transfer Protocol (Secure) - the foundation of data communication on the web. HTTPS adds encryption via TLS, the successor to the now-deprecated SSL, for secure communication.
Managing and provisioning infrastructure through machine-readable configuration files rather than manual processes. Enables version control and automation of infrastructure.
The process of identifying, analyzing, and resolving incidents to restore normal service operation as quickly as possible.
A network management approach where administrators define desired outcomes and the network automatically configures itself to achieve those outcomes.
A set of practices for managing IT services to meet business needs. Includes processes for incident, problem, change, and service level management.
The variation in time delay between data packets arriving over a network. High jitter can cause quality issues in real-time applications like VoIP and video conferencing.
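To make the definition concrete, jitter can be estimated as the average absolute difference between consecutive packet delays. This is one simple estimator chosen for illustration (RTP, for instance, uses a smoothed running estimate instead), and the delay values are made up.

```python
def mean_jitter(delays_ms):
    """Average absolute difference between consecutive packet delays."""
    if len(delays_ms) < 2:
        return 0.0
    diffs = [abs(b - a) for a, b in zip(delays_ms, delays_ms[1:])]
    return sum(diffs) / len(diffs)


# Hypothetical one-way delays in milliseconds for five packets.
delays = [20, 24, 21, 30, 22]
print(mean_jitter(delays))  # 6.0
```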
A lightweight data interchange format that is easy for humans to read and write and easy for machines to parse. Widely used in APIs and configuration files.
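For a sense of what this looks like in practice, the sketch below serializes a small alert record and parses it back using Python's standard json module. The field names are hypothetical.

```python
import json

# Hypothetical alert payload of the kind an API might return.
alert = {
    "device": "core-sw-01",
    "severity": "critical",
    "metrics": {"cpu_percent": 97.5},
    "tags": ["network", "cpu"],
}

encoded = json.dumps(alert, indent=2, sort_keys=True)  # dict -> JSON text
decoded = json.loads(encoded)                          # JSON text -> dict

print(encoded)
print(decoded["metrics"]["cpu_percent"])
```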
A measurable value that demonstrates how effectively an organization is achieving key objectives. In IT operations, KPIs include MTTR, uptime percentage, and ticket resolution rates.
An open-source container orchestration platform that automates deployment, scaling, and management of containerized applications. Often abbreviated as K8s.
A device or software that distributes network traffic across multiple servers to ensure no single server bears too much demand, improving reliability and performance.
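Round-robin is one of the simplest distribution strategies: each new request goes to the next server in a fixed rotation. The sketch below shows that strategy only; the class name and server names are assumptions, and production load balancers usually add health checks and weighting.

```python
from itertools import cycle


class RoundRobinBalancer:
    """Hand out servers in a fixed, repeating order."""

    def __init__(self, servers):
        if not servers:
            raise ValueError("at least one server is required")
        self._rotation = cycle(servers)

    def next_server(self):
        return next(self._rotation)


balancer = RoundRobinBalancer(["app-1", "app-2", "app-3"])
picks = [balancer.next_server() for _ in range(5)]
print(picks)  # ['app-1', 'app-2', 'app-3', 'app-1', 'app-2']
```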
The process of collecting, storing, analyzing, and managing log data from various sources. Essential for troubleshooting and security analysis.
A subset of AI that enables systems to learn from data and improve performance without being explicitly programmed. Powers predictive analytics in AIOps.
The average time it takes to discover a problem or incident. AIOps can dramatically reduce MTTD through automated monitoring and anomaly detection.
The average time it takes to fully resolve an incident from detection to restoration of service. A key metric for measuring operational efficiency.
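Both averages can be computed from incident timestamps. The sketch below assumes each incident records when the fault started, when it was detected, and when service was restored, and it measures resolution time from detection as described above. The records are invented for the example.

```python
from datetime import datetime

# Hypothetical incident records.
incidents = [
    {"started": datetime(2024, 1, 1, 10, 0),
     "detected": datetime(2024, 1, 1, 10, 5),
     "resolved": datetime(2024, 1, 1, 10, 45)},
    {"started": datetime(2024, 1, 2, 14, 0),
     "detected": datetime(2024, 1, 2, 14, 15),
     "resolved": datetime(2024, 1, 2, 15, 15)},
]


def mean_minutes(durations):
    """Average a list of timedeltas and return the result in minutes."""
    durations = list(durations)
    return sum(d.total_seconds() for d in durations) / len(durations) / 60


# Mean time to detect: fault start -> detection.
mttd = mean_minutes(i["detected"] - i["started"] for i in incidents)
# Mean time to resolve: detection -> service restored.
mttr = mean_minutes(i["resolved"] - i["detected"] for i in incidents)

print(f"MTTD: {mttd:.0f} min, MTTR: {mttr:.0f} min")  # MTTD: 10 min, MTTR: 50 min
```

Some teams measure resolution time from the start of the fault rather than from detection, so it is worth stating which convention a reported figure uses.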
A company that remotely manages a customer's IT infrastructure and end-user systems. NOC automation is critical for MSP efficiency.
Network Operations - the practices and processes for managing and maintaining network infrastructure. Modern NetOps increasingly leverages automation and AI.
The use of software to automate network configuration, management, testing, deployment, and operations. Reduces manual effort and human error.
The practice of continuously observing network performance, availability, and health to identify and address issues proactively.
A centralized location where IT teams monitor, manage, and maintain network infrastructure. The hub for network monitoring and incident response.
Software used to monitor and manage computer networks. Examples include SolarWinds, PRTG, and Nagios.
The ability to understand the internal state of a system by examining its outputs. Goes beyond monitoring to include logs, metrics, and traces.
A rotation where team members are available outside normal hours to respond to incidents. Automation reduces on-call burden by handling routine issues automatically.
Automated coordination of multiple systems and services to complete a workflow or process. Essential for complex IT operations.
A link-state interior gateway routing protocol used within a single autonomous system, such as a large enterprise network, to compute the shortest path for data packets.
A documented set of procedures for handling specific types of incidents or performing routine tasks. Automation platforms execute playbooks automatically.
Using data, statistical algorithms, and ML to identify the likelihood of future outcomes. In AIOps, used to predict and prevent outages.
The process of identifying and managing the root causes of incidents to prevent recurrence. Distinct from incident management which focuses on restoration.
Network mechanisms that prioritize certain types of traffic to ensure performance for critical applications. Essential for voice, video, and real-time communications.
The process of managing and prioritizing items waiting to be processed, such as network packets or support tickets. Intelligent queue management improves response times.
A systematic process for identifying the underlying cause of a problem. AI can accelerate RCA by correlating events and analyzing historical patterns.
Representational State Transfer Application Programming Interface - an API that follows REST, an architectural style for building web services in which systems communicate through stateless requests, typically over HTTP.
A compilation of routine procedures and operations. Runbook automation executes these procedures automatically when triggered by events.
A virtual WAN architecture that allows enterprises to leverage any combination of transport services to securely connect users to applications.
The ability of systems to automatically detect and correct faults without human intervention. A key capability of advanced automation platforms.
A contract defining the expected level of service, including metrics like uptime, response time, and resolution time.
A protocol for collecting and organizing information about managed devices on IP networks. Commonly used for network monitoring.
A facility where security analysts monitor, detect, analyze, and respond to cybersecurity incidents. Similar to NOC but focused on security.
A standard for message logging that allows separation of the software that generates messages from the systems that store and analyze them.
The arrangement of elements (links, nodes) in a network. Understanding topology is essential for impact analysis and troubleshooting.
The process of assessing and prioritizing incidents based on urgency and impact. AI can automate triage to ensure critical issues are addressed first.
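A common manual approach that automation can mirror is a priority matrix, where impact and urgency together determine the priority. The sketch below assumes a three-level scale and illustrative cut-offs; the level names, scores, and P1 to P3 labels are not taken from any specific standard.

```python
LEVELS = {"low": 1, "medium": 2, "high": 3}


def score(incident):
    """Combine impact and urgency into a single numeric score."""
    return LEVELS[incident["impact"]] * LEVELS[incident["urgency"]]


def priority(incident):
    """Map the score to a priority label using illustrative cut-offs."""
    s = score(incident)
    if s >= 6:
        return "P1"
    if s >= 3:
        return "P2"
    return "P3"


def triage(incidents):
    """Return incidents ordered from most to least critical."""
    return sorted(incidents, key=score, reverse=True)


# Hypothetical open incidents.
queue = [
    {"id": "INC-1", "impact": "low", "urgency": "medium"},
    {"id": "INC-2", "impact": "high", "urgency": "high"},
    {"id": "INC-3", "impact": "medium", "urgency": "medium"},
]
for incident in triage(queue):
    print(incident["id"], priority(incident))
```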
A connectionless transport protocol that sends data without establishing a connection first. Faster than TCP but without guaranteed delivery. Used for DNS, VoIP, and streaming.
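The connectionless behavior is visible in code: the sender transmits a datagram without any prior handshake. The sketch below sends one datagram over the loopback interface using Python's standard socket module; the message text is arbitrary, and the receive timeout is there because delivery is not guaranteed.

```python
import socket

MESSAGE = b"link-down core-sw-01"  # arbitrary example payload

with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as receiver, \
        socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sender:
    receiver.bind(("127.0.0.1", 0))   # port 0 lets the OS pick a free port
    receiver.settimeout(2.0)          # do not wait forever for a lost datagram
    port = receiver.getsockname()[1]

    # No connect() or handshake: the datagram is simply sent to the address.
    sender.sendto(MESSAGE, ("127.0.0.1", port))

    payload, source = receiver.recvfrom(1024)

print(payload, "from", source)
```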
The amount of time a system or service is operational and available. Often expressed as a percentage (e.g., 99.99% uptime allows roughly 52.6 minutes of downtime per year).
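The conversion from an uptime percentage to allowed downtime is simple arithmetic. The sketch below assumes a 365-day year; the function name is chosen for the example.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a 365-day year


def allowed_downtime_minutes(uptime_percent):
    """Minutes of downtime per year permitted by an uptime percentage."""
    return (1 - uptime_percent / 100) * MINUTES_PER_YEAR


for target in (99.0, 99.9, 99.99):
    print(f"{target}% uptime -> {allowed_downtime_minutes(target):.2f} min/year")
```

For 99.99% this gives about 52.56 minutes per year, and each additional "nine" cuts the allowance by a factor of ten.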
A logical grouping of network devices that appear to be on the same LAN regardless of their physical location.
A technology that creates a secure, encrypted connection over a less secure network, such as the internet.
The design, execution, and automation of processes where tasks, information, or documents are passed from one participant to another according to defined rules.
A security solution that integrates multiple security products into a unified platform for threat detection, investigation, and response across endpoints, networks, and cloud.
A markup language for encoding documents in a format that is both human-readable and machine-readable. Used in many network protocols and configuration files.
A human-readable data serialization format commonly used for configuration files in DevOps and automation tools like Ansible, Kubernetes, and Docker Compose.
A security model that requires strict identity verification for every person and device trying to access resources, regardless of whether they are inside or outside the network perimeter.