Building a DevOps Automation Agent with AI Agents: Complete Guide 2026

Create an autonomous agent that monitors systems and automates infrastructure tasks using LLM agents

Advanced | 35 minutes


Tags: ai-agents, devops-automation-agent, langchain, langraph, automation


Introduction

AI agents that can monitor systems and automate infrastructure tasks are transforming how developers work. This guide shows you how to build a production-ready DevOps Automation Agent using LangGraph + Bash tools.

What We're Building

A DevOps Automation Agent that can:

  • Understand complex requests
  • Break them into sub-tasks
  • Execute tasks autonomously
  • Handle errors and retry
  • Produce consistent, high-quality output

Architecture

    
    User Request
        ↓
    [DevOps Automation Agent Orchestrator]
        ↓
    [Task Planning] → [Tool Selection] → [Execution]
        ↓                                      ↓
    [Validation] ←──────────────────── [Results]
        ↓
    Final Output
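
The flow in the diagram can be sketched in plain Python before any LLM is involved. This is a minimal sketch, not the guide's implementation: `run_pipeline` and the three stand-in callables are hypothetical names, and the retry count is an assumed default.

```python
from typing import Callable, List


def run_pipeline(
    request: str,
    plan: Callable[[str], List[str]],
    execute: Callable[[str], str],
    validate: Callable[[str, str], bool],
    max_retries: int = 2,
) -> List[str]:
    """Plan sub-tasks, execute each one, and re-run any that fail validation."""
    results: List[str] = []
    for sub_task in plan(request):
        for _attempt in range(max_retries + 1):
            output = execute(sub_task)
            if validate(sub_task, output):
                results.append(output)
                break
        else:
            # Every attempt failed validation; record the failure and move on.
            results.append(f"FAILED: {sub_task}")
    return results


# Stand-in callables (hypothetical) so the sketch runs without an LLM.
def fake_plan(request: str) -> List[str]:
    return [part.strip() for part in request.split(" and ")]


def fake_execute(sub_task: str) -> str:
    return f"done: {sub_task}"


def fake_validate(sub_task: str, output: str) -> bool:
    return output.startswith("done:")


if __name__ == "__main__":
    print(run_pipeline("check disk usage and restart nginx",
                       fake_plan, fake_execute, fake_validate))
```

In the LangGraph version, the planning, execution, and validation steps become tools, and the model decides which one to call next.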
    

Implementation with LangGraph

```python
from typing import Annotated, List, TypedDict

from langchain_core.messages import AIMessage, BaseMessage
from langchain_core.tools import tool
from langchain_openai import ChatOpenAI
from langgraph.graph import END, StateGraph
from langgraph.graph.message import add_messages
from langgraph.prebuilt import ToolNode

MAX_ITERATIONS = 8


# State definition
class DevOpsAutomationAgentState(TypedDict):
    messages: Annotated[List[BaseMessage], add_messages]
    task: str
    sub_tasks: List[str]
    completed_tasks: List[str]
    final_output: str | None
    iterations: int


# Define tools for monitoring systems and automating infrastructure tasks
@tool
def analyze_task(task: str) -> str:
    """Break down a complex task into sub-tasks."""
    llm = ChatOpenAI(model="gpt-4o-mini")
    response = llm.invoke(f"Break this into 3-5 specific, actionable sub-tasks: {task}")
    return response.content


@tool
def execute_sub_task(sub_task: str, context: str = "") -> str:
    """Execute a specific sub-task."""
    llm = ChatOpenAI(model="gpt-4o-mini")
    response = llm.invoke(
        f"Context: {context}\n\nExecute this specific task: {sub_task}\nProvide detailed output."
    )
    return response.content


@tool
def validate_output(task: str, output: str) -> str:
    """Validate that the output meets requirements."""
    llm = ChatOpenAI(model="gpt-4o-mini")
    response = llm.invoke(
        f"Task: {task}\n\nOutput to validate: {output}\n\n"
        f"Is this output complete and correct? If not, what's missing?"
    )
    return response.content


tools = [analyze_task, execute_sub_task, validate_output]

# Initialize LLM with tools
llm = ChatOpenAI(model="gpt-4o", temperature=0.3)
llm_with_tools = llm.bind_tools(tools)


# Agent node
def agent_node(state: DevOpsAutomationAgentState):
    iterations = state.get("iterations", 0)
    if iterations > MAX_ITERATIONS:
        # Append a plain AI message so the run ends with a readable answer
        # instead of the last raw tool result.
        notice = "Max iterations reached"
        return {"messages": [AIMessage(content=notice)], "final_output": notice}
    response = llm_with_tools.invoke(state["messages"])
    return {"messages": [response], "iterations": iterations + 1}


# Tool execution node
tool_node = ToolNode(tools)


def should_continue(state: DevOpsAutomationAgentState) -> str:
    # Route to the tool node only when the model requested a tool call.
    last_msg = state["messages"][-1]
    if getattr(last_msg, "tool_calls", None):
        return "tools"
    return "end"


# Build graph
workflow = StateGraph(DevOpsAutomationAgentState)
workflow.add_node("agent", agent_node)
workflow.add_node("tools", tool_node)
workflow.set_entry_point("agent")
workflow.add_conditional_edges("agent", should_continue, {"tools": "tools", "end": END})
workflow.add_edge("tools", "agent")

agent = workflow.compile()
```
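
The introduction mentions Bash tools, but the three tools above only call the LLM. One possible shape for a real command-running tool is sketched below. It is an assumption rather than part of the original implementation: `run_allowed_command` and `DEFAULT_ALLOWED` are hypothetical names, the allow-list contents are placeholders, and the sketch assumes a POSIX environment. It runs commands without a shell and refuses any executable that is not on the allow-list.

```python
import shlex
import subprocess
from typing import FrozenSet

# Hypothetical allow-list; replace with the read-only commands you trust.
DEFAULT_ALLOWED: FrozenSet[str] = frozenset({"df", "uptime", "free"})


def run_allowed_command(
    command: str,
    allowed: FrozenSet[str] = DEFAULT_ALLOWED,
    timeout: float = 10.0,
) -> str:
    """Run a command without a shell if its executable is on the allow-list."""
    try:
        args = shlex.split(command)
    except ValueError as exc:
        return f"Refused: could not parse command ({exc})"
    if not args or args[0] not in allowed:
        name = args[0] if args else "<empty>"
        return f"Refused: {name} is not on the allow-list"
    try:
        # shell=False (the default) avoids shell injection via the arguments.
        completed = subprocess.run(
            args, capture_output=True, text=True, timeout=timeout, check=False
        )
    except subprocess.TimeoutExpired:
        return f"Timed out after {timeout}s"
    except OSError as exc:
        return f"Could not start command: {exc}"
    if completed.returncode != 0:
        return f"Exit code {completed.returncode}: {completed.stderr.strip()}"
    return completed.stdout.strip()
```

To expose it to the agent, the function could be decorated with `@tool` and appended to the `tools` list before `bind_tools` is called. Returning error strings instead of raising lets the model see the failure and choose a different step.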

Using the Agent

```python
from langchain_core.messages import HumanMessage


def run_devops_automation_agent(request: str) -> str:
    """Run the DevOps Automation Agent on a user request."""
    initial_state = {
        "messages": [HumanMessage(content=request)],
        "task": request,
        "sub_tasks": [],
        "completed_tasks": [],
        "final_output": None,
        "iterations": 0,
    }
    result = agent.invoke(initial_state)
    # Extract the final answer
    last_message = result["messages"][-1]
    return last_message.content


# Usage
output = run_devops_automation_agent(
    "Create a comprehensive plan to monitor systems and automate infrastructure tasks"
)
print(output)
```

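Adding Memory with Persistence

```python
import sqlite3

from langchain_core.messages import HumanMessage
from langgraph.checkpoint.sqlite import SqliteSaver  # pip install langgraph-checkpoint-sqlite

# In recent releases SqliteSaver.from_conn_string() is a context manager that
# closes the connection on exit, so an agent compiled inside the `with` block
# cannot be used after it. A long-lived connection avoids that.
conn = sqlite3.connect("./agent_memory.db", check_same_thread=False)
checkpointer = SqliteSaver(conn)
agent_with_memory = workflow.compile(checkpointer=checkpointer)


def run_with_memory(request: str, session_id: str) -> str:
    # The thread_id selects which saved conversation to resume.
    config = {"configurable": {"thread_id": session_id}}
    state = {
        "messages": [HumanMessage(content=request)],
        "task": request,
        "sub_tasks": [],
        "completed_tasks": [],
        "final_output": None,
        "iterations": 0,
    }
    result = agent_with_memory.invoke(state, config=config)
    return result["messages"][-1].content


# First interaction
response1 = run_with_memory(
    "Start monitoring systems and automating infrastructure tasks",
    session_id="session-001",
)

# Follow-up (agent remembers context)
response2 = run_with_memory("Continue from where we left off", session_id="session-001")
```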

Production: FastAPI Service

```python
import asyncio

from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from langchain_core.messages import HumanMessage
from pydantic import BaseModel

app = FastAPI(title="DevOps Automation Agent Service")


class AgentRequest(BaseModel):
    task: str
    session_id: str = "default"
    stream: bool = False


@app.post("/agent/run")
async def run_agent(request: AgentRequest):
    if request.stream:
        async def stream_response():
            initial_state = {
                "messages": [HumanMessage(content=request.task)],
                "task": request.task,
                "sub_tasks": [],
                "completed_tasks": [],
                "final_output": None,
                "iterations": 0,
            }
            # Forward model tokens to the client as they are generated.
            async for event in agent.astream_events(initial_state, version="v2"):
                if event["event"] == "on_chat_model_stream":
                    content = event["data"]["chunk"].content
                    if content:
                        yield content

        return StreamingResponse(stream_response(), media_type="text/plain")

    # The agent call is synchronous; run it in a worker thread so it does not
    # block the event loop.
    result = await asyncio.to_thread(run_devops_automation_agent, request.task)
    return {"result": result, "session_id": request.session_id}


@app.get("/health")
async def health():
    return {"status": "healthy", "agent": "DevOps Automation Agent"}
```
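
A client for the `/agent/run` endpoint might look like the sketch below, which uses only the standard library. The base URL `http://localhost:8000` is an assumption (a common local development address), and `build_agent_request` and `call_agent` are hypothetical helper names. The payload fields mirror `AgentRequest`.

```python
import json
import urllib.request


def build_agent_request(
    base_url: str,
    task: str,
    session_id: str = "default",
    stream: bool = False,
) -> urllib.request.Request:
    """Build a POST request whose JSON body matches the AgentRequest model."""
    payload = json.dumps(
        {"task": task, "session_id": session_id, "stream": stream}
    ).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/agent/run",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def call_agent(base_url: str, task: str, session_id: str = "default",
               timeout: float = 120.0) -> str:
    """Send a non-streaming request and return the "result" field."""
    request = build_agent_request(base_url, task, session_id)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))["result"]


if __name__ == "__main__":
    # Assumes the service is already running locally.
    print(call_agent("http://localhost:8000", "Check disk usage on the web tier"))
```

Agent runs can take a while, so the timeout here is deliberately generous; tune it to the longest task you expect.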

Monitoring Agent Performance

```python
import statistics
import time
from dataclasses import dataclass


@dataclass
class AgentMetrics:
    task: str
    iterations: int
    duration_ms: float
    success: bool
    output_length: int


metrics_store: list[AgentMetrics] = []


def run_with_metrics(task: str) -> tuple[str, AgentMetrics]:
    start = time.perf_counter()
    try:
        result = run_devops_automation_agent(task)
        success = True
    except Exception as e:
        result = f"Error: {e}"
        success = False
    duration = (time.perf_counter() - start) * 1000
    # Note: iterations is a placeholder; in production it would come from the
    # actual agent state.
    metrics = AgentMetrics(
        task=task[:50],
        iterations=3,
        duration_ms=duration,
        success=success,
        output_length=len(result),
    )
    metrics_store.append(metrics)
    return result, metrics


def print_metrics_report():
    if not metrics_store:
        return
    successful = [m for m in metrics_store if m.success]
    durations = [m.duration_ms for m in metrics_store]
    print(f"Total runs: {len(metrics_store)}")
    print(f"Success rate: {len(successful)/len(metrics_store):.1%}")
    print(f"Avg duration: {statistics.mean(durations):.0f}ms")
    print(f"p95 duration: {sorted(durations)[int(len(durations)*0.95)]:.0f}ms")
```
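
The metrics code hard-codes the iteration count and notes that it would come from the actual state in production. One way to do that is to summarize the state dictionary returned by `agent.invoke`. This is a sketch under assumptions: `RunSummary` and `summarize_state` are hypothetical names, and it relies only on the `iterations` and `messages` keys defined in the state above.

```python
from dataclasses import dataclass
from typing import Any, Mapping


@dataclass
class RunSummary:
    iterations: int
    tool_calls: int
    output_length: int


def summarize_state(state: Mapping[str, Any]) -> RunSummary:
    """Derive metrics from a final agent state instead of hard-coding them."""
    messages = state.get("messages", [])
    # Only AI messages carry tool_calls; other message types count as zero.
    tool_calls = sum(len(getattr(m, "tool_calls", None) or []) for m in messages)
    last_content = getattr(messages[-1], "content", "") if messages else ""
    return RunSummary(
        iterations=int(state.get("iterations", 0)),
        tool_calls=tool_calls,
        output_length=len(str(last_content)),
    )
```

To use it, `run_devops_automation_agent` would need to return the whole result dictionary, not just the last message's content.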

Best Practices

  • Limit iterations: Always set a maximum to prevent infinite loops
  • Checkpoint state: Use persistence for long-running tasks
  • Human review: Add approval steps for critical actions
  • Detailed logging: Log every tool call for debugging
  • Graceful failures: Handle errors without crashing the agent
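
Several of these practices fit in a few lines of plain Python. The sketch below illustrates logging, bounded retries with exponential backoff, and an approval gate. `call_with_retries` and `require_approval` are hypothetical helpers, not LangGraph APIs, and the default attempt count and delay are assumed values.

```python
import logging
import time
from typing import Callable, TypeVar

T = TypeVar("T")
logger = logging.getLogger("devops_agent")


def call_with_retries(
    action: Callable[[], T],
    name: str,
    max_attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `action`, logging each attempt and backing off exponentially."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            result = action()
            logger.info("%s succeeded on attempt %d", name, attempt)
            return result
        except Exception as exc:
            logger.warning("%s failed on attempt %d: %s", name, attempt, exc)
            if attempt == max_attempts:
                raise  # Out of attempts; let the caller handle the error.
            sleep(base_delay * 2 ** (attempt - 1))
    raise AssertionError("unreachable")


def require_approval(
    description: str,
    approve: Callable[[str], bool],
    action: Callable[[], T],
) -> T:
    """Run `action` only if the `approve` callback accepts the description."""
    if not approve(description):
        logger.info("Rejected: %s", description)
        raise PermissionError(f"Not approved: {description}")
    return action()
```

The `approve` callback could prompt an operator on the command line or post to a chat channel. Passing `sleep` as a parameter keeps the backoff testable without real delays.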

Conclusion

Building a DevOps Automation Agent with AI agents lets you monitor systems and automate infrastructure tasks autonomously. The LangGraph implementation balances control and flexibility for production use.

Start with a simple proof of concept, add persistence, then scale up as confidence grows.


    *DevOps Automation Agent implementation using LangGraph + Bash tools | May 2026*

Related tools: LangGraph, LangChain, OpenAI