Agent Installation Guide

The Infranexis Resolve agent connects your infrastructure to the RCA engine. It runs inside your environment, collects context when an alert fires, and sends it back over an encrypted WebSocket — giving the RCA engine the full picture of what went wrong.

✓ Open source · 🔒 Outbound only — no inbound ports · Kubernetes & Linux VM · ~10 MB memory when idle

Overview

The agent is a lightweight Python process. It does three things:

  1. Opens a single outbound WebSocket to Infranexis Resolve — no inbound firewall rules needed.
  2. Sits completely idle until an alert fires and the RCA engine requests context.
  3. Collects only the data relevant to that incident (logs, metrics, events), returns it, then goes back to idle.
💡 The agent never runs continuous collection, never stores data locally, and never contacts any third party. It only acts in direct response to a live RCA request from your account.
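The request/idle loop above can be sketched roughly as follows. This is a minimal illustration, not the agent's actual code: the message shapes and field names (`type`, `service`, `fired_at`) are assumptions about the wire protocol, not its real format.

```python
import json

def handle_message(raw):
    """Illustrative dispatch: act only on a live collect_context request.

    Field names here ("type", "service", "fired_at") are assumptions,
    not the agent's real wire protocol.
    """
    msg = json.loads(raw)
    if msg.get("type") != "collect_context":
        return None  # stay idle: the agent only acts on RCA requests
    # In the real agent, this is where logs/metrics/events are gathered
    # for the named service around the fired_at timestamp.
    return {
        "type": "context_response",
        "service": msg["service"],
        "fired_at": msg["fired_at"],
    }
```

Anything other than a context request is ignored, which is what keeps the idle footprint so small.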

How it works

  Your Infrastructure                    Infranexis Resolve
  ─────────────────                      ──────────────────
                                         Alert arrives (PagerDuty /
  resolve-agent ──── WebSocket ────────▶ Prometheus / Grafana / etc.)
       │               (TLS)                      │
       │                                          │ RCA pipeline starts
       │◀──── collect_context request ────────────┘
       │        { service, fired_at }
       │
       ├─ Kubernetes API ──▶ pod logs, events, deployment status
       │
       ├─ Cloud API ────────▶ CloudWatch / Azure Monitor / GCP Logging
       │   (via IAM role)     logs + metrics in ±15 min window
       │
       ├─ System ───────────▶ CPU, memory, journald logs (VM mode)
       │
       └──── context_response ─────────────────▶ AI engine generates RCA
                                                  RCA delivered to Slack /
                                                  Teams / PagerDuty / Email

Deployment model — how many agents do you need?

This is the most important thing to understand before installing. The short answer: one agent per cluster or cloud account is usually enough.

| What you want to collect | Agents needed | Why |
|---|---|---|
| Kubernetes pod logs, events, deployments | 1 per cluster | The agent talks to the Kubernetes API server, which has visibility across all nodes and namespaces in that cluster. |
| AWS CloudWatch Logs & Metrics | 1 per AWS account/region | CloudWatch is a centralised AWS service. The agent calls the CloudWatch API — it can pull logs from Lambda, ECS, RDS, EC2, anything in that account — without being on those machines. |
| Azure Monitor / Log Analytics | 1 per Azure subscription | Log Analytics is centralised. One agent with Managed Identity can query logs and metrics for any resource in that subscription. |
| GCP Cloud Logging & Monitoring | 1 per GCP project | Cloud Logging is centralised. One agent with the right service account can query logs for any resource in that project. |
| Linux host — systemd logs, process metrics | 1 per VM / host | journald and psutil are local to the machine. The agent must run on that host to read them. |
💡 Most common setup: install one agent on your EKS / AKS / GKE cluster. That single agent covers Kubernetes context and the cloud log/metric APIs for the entire account — no per-machine install needed.

Typical deployment examples

  Example A — EKS + CloudWatch  (1 agent total)
  ─────────────────────────────
  EKS cluster
  └── resolve-agent pod
       ├── k8s API  → pod logs, events, deployments (all namespaces)
       └── AWS API  → CloudWatch logs from Lambda, RDS, EC2, ECS
                      CloudWatch Metrics, CloudTrail events
                      (uses IRSA role — no static credentials)


  Example B — Mixed: EKS prod + standalone VM  (2 agents)
  ──────────────────────────────────────────────────────
  EKS cluster
  └── resolve-agent pod  → k8s + CloudWatch (entire AWS account)

  Legacy VM (e.g. MySQL host)
  └── resolve-agent service  → systemd status + journald logs
                                (only needed for this specific host)


  Example C — Multi-cloud  (1 agent per cloud account)
  ─────────────────────────────────────────────────────
  AWS account  → 1 agent (EKS pod with IRSA)
  Azure sub    → 1 agent (AKS pod with Managed Identity)
  GCP project  → 1 agent (GKE pod with Workload Identity)

Installation — Kubernetes

Your personalised install command (with your customer ID embedded) is in Configuration → Intelligence in the dashboard.

  1. Prerequisites
    • kubectl configured and pointing at the target cluster
    • Permission to create a Namespace, ServiceAccount, ClusterRole, and Deployment
  2. Apply the manifest
    kubectl apply -f https://app.infranexis.com/dashboard/install/<YOUR_CUSTOMER_ID>.yaml

    This creates a resolve-agent namespace and deploys the agent as a single-replica Deployment with a read-only ClusterRole.

  3. Verify it's running
    kubectl get pods -n resolve-agent
    # NAME                             READY   STATUS    RESTARTS
    # resolve-agent-7d9f8c-xxxx        1/1     Running   0

    Within 30 seconds the dashboard will show the agent as Connected.

Uninstall

kubectl delete namespace resolve-agent

Installation — Linux / VM

Use this for standalone hosts (bare metal, EC2, VMs) where you want OS-level context: journald logs, systemd service status, process metrics.

  1. Prerequisites
    • Linux with systemd (Ubuntu 20+, Debian 11+, RHEL 8+, Amazon Linux 2)
    • Run as root or with sudo
    • Outbound HTTPS access to app.infranexis.com:443
  2. Run the installer
    curl -fsSL https://app.infranexis.com/dashboard/install/<YOUR_CUSTOMER_ID>.sh | sudo bash

    The installer creates a resolve-agent system user, installs into /opt/resolve-agent, and registers a systemd service that starts on boot.

  3. Verify
    sudo systemctl status resolve-agent

Manage the service

sudo systemctl restart resolve-agent     # restart
sudo journalctl -fu resolve-agent        # follow logs
sudo systemctl disable --now resolve-agent  # stop + disable

Uninstall

sudo systemctl disable --now resolve-agent
sudo rm -rf /opt/resolve-agent /etc/systemd/system/resolve-agent.service
sudo systemctl daemon-reload

Installation — Manual (Python)

⚠️ Manual mode has no auto-restart on failure. Use the systemd installer for production Linux hosts; manual mode is for testing or environments without systemd.

pip install websockets psutil kubernetes

curl -fsSL https://app.infranexis.com/dashboard/agent.py -o agent.py

RCABOT_SERVER_URL=wss://app.infranexis.com/agent/ws \
RCABOT_API_KEY=rbk_live_xxx \
RCABOT_CUSTOMER_ID=your-uuid \
python3 agent.py
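All three RCABOT_* variables above are required in manual mode. A minimal pre-flight check — a hypothetical helper for illustration, not part of agent.py — could look like:

```python
import os

# The three variables manual mode requires (see the command above).
REQUIRED = ("RCABOT_SERVER_URL", "RCABOT_API_KEY", "RCABOT_CUSTOMER_ID")

def missing_env(env):
    """Return the names of required variables that are missing or empty."""
    return [name for name in REQUIRED if not env.get(name)]

# Example: abort early with a clear error before connecting.
# missing = missing_env(os.environ)
# if missing:
#     raise SystemExit("Missing: " + ", ".join(missing))
```

Failing fast here is friendlier than a cryptic connection error after startup.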

Cloud Integration — How it works

For cloud log services (CloudWatch, Azure Monitor, GCP Logging), the agent acts as an authenticated API client — it calls the cloud provider's centralised logging API, not individual machines. This means one agent can pull logs for your entire AWS account, Azure subscription, or GCP project without being installed on every host.

  WITHOUT cloud integration          WITH cloud integration
  ──────────────────────             ──────────────────────
  Alert: Lambda crashed              Alert: Lambda crashed
  Context: k8s events only           Context: k8s events +
  RCA: limited                         CloudWatch Logs ("/aws/lambda/my-fn")
                                         → last 50 error lines before alert
                                       CloudWatch Metrics
                                         → CPU spike at 14:32
                                         → Error count: 47 in 5 min
                                       CloudTrail
                                         → deploy at 14:28 by user: john
                                       RCA: "Deployment at 14:28 introduced
                                            a memory leak causing OOM at 14:32"

The agent uses your cloud's built-in credential mechanism — no static API keys or secrets stored anywhere:

| Cloud | Credential mechanism | Where it runs |
|---|---|---|
| AWS | EC2 Instance Role / EKS IRSA | Attached to the EC2 instance or k8s ServiceAccount |
| Azure | Managed Identity / Workload Identity | Enabled on the VM or AKS pod identity |
| GCP | Workload Identity / Application Default Credentials | GKE ServiceAccount or Compute default SA |

AWS — CloudWatch Logs & Metrics

The agent collects from CloudWatch Logs, CloudWatch Metrics, CloudWatch Alarms, and CloudTrail — all scoped to the 15 minutes around the alert time.

  1. Install boto3 on the agent pod

    Update the agent Deployment to install boto3 at startup, or build a custom image that already includes it. First set the AWS region on the Deployment:

    kubectl set env deployment/resolve-agent -n resolve-agent \
      RCABOT_AWS_REGION=us-east-1

    Then edit the Deployment's command (or add an init step) to run pip install boto3 before starting the agent.

  2. Create an IAM policy
    {
      "Version": "2012-10-17",
      "Statement": [{
        "Sid": "InfranexisResolveReadOnly",
        "Effect": "Allow",
        "Action": [
          "logs:DescribeLogGroups",
          "logs:FilterLogEvents",
          "cloudwatch:DescribeAlarms",
          "cloudwatch:GetMetricStatistics",
          "cloudwatch:ListMetrics",
          "cloudtrail:LookupEvents"
        ],
        "Resource": "*"
      }]
    }
  3. Create an IRSA role and bind it to the agent's ServiceAccount
    # Create role with OIDC trust for your cluster
    eksctl create iamserviceaccount \
      --cluster=<cluster-name> \
      --namespace=resolve-agent \
      --name=resolve-agent \
      --attach-policy-arn=arn:aws:iam::<account-id>:policy/InfranexisResolveReadOnly \
      --approve
  4. (Optional) Pin specific log groups

    By default the agent auto-discovers log groups by matching your service name. You can override this:

    kubectl set env deployment/resolve-agent -n resolve-agent \
      RCABOT_AWS_LOG_GROUPS=/aws/lambda/my-service,/ecs/my-service,/app/prod
  On a Linux VM / EC2 instance:

  1. Install boto3
    sudo /opt/resolve-agent/venv/bin/pip install boto3
  2. Attach the IAM policy to your EC2 instance role

    In the AWS Console → EC2 → Your instance → Security → IAM Role → Add permissions → paste the policy JSON from step 2 of the Kubernetes instructions above.

    Or via CLI:

    aws iam put-role-policy \
      --role-name <your-instance-role> \
      --policy-name InfranexisResolveReadOnly \
      --policy-document file://policy.json
  3. (Optional) Set environment variables

    Edit /etc/systemd/system/resolve-agent.service and add to the [Service] section:

    Environment=RCABOT_AWS_REGION=us-east-1
    Environment=RCABOT_AWS_LOG_GROUPS=/aws/lambda/my-svc,/ecs/my-svc
    sudo systemctl daemon-reload && sudo systemctl restart resolve-agent
🔒 All IAM actions are read-only. The agent cannot create, modify, or delete any AWS resource.
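The ±15 minute scoping described above can be sketched with boto3. filter_log_events and its epoch-millisecond startTime/endTime parameters are the real CloudWatch Logs API; the helper names and the ERROR filter pattern are illustrative assumptions, not the agent's actual implementation.

```python
from datetime import datetime, timedelta, timezone

def incident_window_ms(fired_at, minutes=15):
    """Return (start, end) in epoch milliseconds, ±minutes around the alert."""
    start = fired_at - timedelta(minutes=minutes)
    end = fired_at + timedelta(minutes=minutes)
    return int(start.timestamp() * 1000), int(end.timestamp() * 1000)

def fetch_cloudwatch_errors(log_group, fired_at, region):
    # Requires boto3 and an IAM role allowing logs:FilterLogEvents
    # (see the policy JSON above).
    import boto3
    start, end = incident_window_ms(fired_at)
    client = boto3.client("logs", region_name=region)
    resp = client.filter_log_events(
        logGroupName=log_group,
        startTime=start,
        endTime=end,
        filterPattern="ERROR",  # illustrative: narrow to error lines
    )
    return [e["message"] for e in resp.get("events", [])]
```

Because the window is computed from the alert timestamp, the agent never needs to tail logs continuously.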

Azure — Monitor Logs & Metrics

The agent collects from Log Analytics (AppTraces, AppExceptions, ContainerLog) and Azure Monitor Metrics (CPU, memory, HTTP 5xx). Authentication uses Managed Identity — no client secrets.

  1. Install the Azure SDK on the agent
    # Kubernetes
    kubectl exec -n resolve-agent deploy/resolve-agent -- \
      pip install azure-identity azure-monitor-query
    
    # Linux VM
    sudo /opt/resolve-agent/venv/bin/pip install azure-identity azure-monitor-query
  2. Enable Managed Identity
    • Azure VM: Portal → VM → Identity → System assigned → On
    • AKS: Enable Workload Identity on the cluster, annotate the agent's ServiceAccount with the client ID
  3. Grant the identity read access
    # Log Analytics Reader — on the workspace
    az role assignment create \
      --assignee <managed-identity-object-id> \
      --role "Log Analytics Reader" \
      --scope /subscriptions/<sub>/resourceGroups/<rg>/providers/Microsoft.OperationalInsights/workspaces/<workspace>
    
    # Monitoring Reader — on the resource group
    az role assignment create \
      --assignee <managed-identity-object-id> \
      --role "Monitoring Reader" \
      --scope /subscriptions/<sub>/resourceGroups/<rg>
  4. Set environment variables
    # Log Analytics workspace GUID
    # Found in: Azure Portal → Log Analytics workspace → Settings → Agents
    RCABOT_AZURE_WORKSPACE_ID=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
    
    # Full resource ID to pull metrics from (VM, AKS cluster, or App Service)
    RCABOT_AZURE_RESOURCE_ID=/subscriptions/<sub>/resourceGroups/<rg>/providers/Microsoft.Compute/virtualMachines/<vm-name>

    Apply via kubectl:

    kubectl set env deployment/resolve-agent -n resolve-agent \
      RCABOT_AZURE_WORKSPACE_ID=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx \
      RCABOT_AZURE_RESOURCE_ID=/subscriptions/...
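Once the workspace ID is set, a context query could look roughly like this sketch. LogsQueryClient.query_workspace and DefaultAzureCredential are the real azure-monitor-query / azure-identity APIs; the KQL itself and the 30-minute timespan are illustrative assumptions, not the agent's actual query.

```python
from datetime import timedelta

def build_kql(service, lookback_minutes=15):
    """Illustrative KQL: recent traces/exceptions/logs mentioning the service."""
    return (
        "union AppTraces, AppExceptions, ContainerLog "
        f"| where TimeGenerated > ago({lookback_minutes}m) "
        f'| where * contains "{service}" '
        "| take 50"
    )

def query_azure_logs(workspace_id, service):
    # Requires azure-identity and azure-monitor-query, plus a Managed
    # Identity holding Log Analytics Reader (see the steps above).
    from azure.identity import DefaultAzureCredential
    from azure.monitor.query import LogsQueryClient
    client = LogsQueryClient(DefaultAzureCredential())
    return client.query_workspace(
        workspace_id,
        build_kql(service),
        timespan=timedelta(minutes=30),
    )
```

With Managed Identity, DefaultAzureCredential resolves automatically on the VM or AKS pod — no secrets in the environment.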

GCP — Cloud Logging & Monitoring

The agent collects WARNING+ log entries and Cloud Monitoring metrics (Compute, Cloud Run, Cloud Functions). Uses Workload Identity or Application Default Credentials.

  1. Install the GCP SDK on the agent
    # Kubernetes
    kubectl exec -n resolve-agent deploy/resolve-agent -- \
      pip install google-cloud-logging google-cloud-monitoring
    
    # Linux VM
    sudo /opt/resolve-agent/venv/bin/pip install \
      google-cloud-logging google-cloud-monitoring
  2. Create a GCP service account and grant read-only access
    gcloud iam service-accounts create resolve-agent \
      --project=<project-id>
    
    gcloud projects add-iam-policy-binding <project-id> \
      --member="serviceAccount:resolve-agent@<project-id>.iam.gserviceaccount.com" \
      --role="roles/logging.viewer"
    
    gcloud projects add-iam-policy-binding <project-id> \
      --member="serviceAccount:resolve-agent@<project-id>.iam.gserviceaccount.com" \
      --role="roles/monitoring.viewer"
  3. Bind to the k8s ServiceAccount (GKE Workload Identity)
    gcloud iam service-accounts add-iam-policy-binding \
      resolve-agent@<project-id>.iam.gserviceaccount.com \
      --role=roles/iam.workloadIdentityUser \
      --member="serviceAccount:<project-id>.svc.id.goog[resolve-agent/resolve-agent]"
    
    kubectl annotate serviceaccount resolve-agent \
      -n resolve-agent \
      iam.gke.io/gcp-service-account=resolve-agent@<project-id>.iam.gserviceaccount.com

    On a Compute Engine VM the default service account already has logging/monitoring viewer. No extra steps needed.

  4. Set environment variables
    # Auto-detected from metadata server on GCE/GKE — only needed outside GCP
    RCABOT_GCP_PROJECT_ID=my-project-123
    
    # Specific Cloud Logging log names to search (optional)
    # Auto-matches by service name if blank
    RCABOT_GCP_LOG_NAMES=my-service,stderr,stdout
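A WARNING+ query on GCP could be sketched as follows. Client.list_entries and its filter_ parameter are the real google-cloud-logging API; the filter fields are illustrative assumptions (resource.labels.service_name applies to Cloud Run — other resource types use different labels).

```python
def build_gcp_filter(service, start_iso, end_iso):
    """Illustrative Cloud Logging filter: WARNING+ entries in a time window."""
    return (
        f'severity>=WARNING AND timestamp>="{start_iso}" '
        f'AND timestamp<="{end_iso}" '
        f'AND resource.labels.service_name="{service}"'
    )

def fetch_gcp_logs(project_id, service, start_iso, end_iso):
    # Requires google-cloud-logging and Workload Identity / ADC with
    # roles/logging.viewer (see the steps above).
    from google.cloud import logging as gcp_logging
    client = gcp_logging.Client(project=project_id)
    entries = client.list_entries(
        filter_=build_gcp_filter(service, start_iso, end_iso)
    )
    return [e.payload for e in entries]
```

On GKE with Workload Identity, no credentials file is needed — the client library picks up the bound service account automatically.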

Environment Variables Reference

| Variable | Required | Description |
|---|---|---|
| RCABOT_SERVER_URL | Yes | WebSocket URL, e.g. wss://app.infranexis.com/agent/ws. Set automatically by the installer. |
| RCABOT_API_KEY | Yes | Your API key (rbk_live_xxx). Set automatically by the installer. |
| RCABOT_CUSTOMER_ID | Yes | Your customer UUID. Set automatically by the installer. |
| RCABOT_NAMESPACES | No | Comma-separated k8s namespaces to watch. Default: default. Can also be set per-pipeline in the dashboard. |
| RCABOT_LABELS | No | Comma-separated key=value labels, e.g. env=prod,region=us-east-1. Shown in the dashboard. |
| RCABOT_HOSTNAME | No | Override the hostname shown in the dashboard. Defaults to socket.gethostname(). |
| RCABOT_AWS_REGION | No | AWS region for CloudWatch API calls. Auto-detected from EC2 IMDS if blank. |
| RCABOT_AWS_LOG_GROUPS | No | Comma-separated CloudWatch log group names. Auto-discovered by service name if blank. |
| RCABOT_AZURE_WORKSPACE_ID | No | Log Analytics workspace GUID. Required to enable Azure log collection. |
| RCABOT_AZURE_RESOURCE_ID | No | Full Azure resource ID for metrics. Required to enable Azure metrics collection. |
| RCABOT_GCP_PROJECT_ID | No | GCP project ID. Auto-detected from metadata server on GCE/GKE. |
| RCABOT_GCP_LOG_NAMES | No | Comma-separated Cloud Logging log names. Auto-matches by service name if blank. |
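The comma-separated formats above (RCABOT_LABELS, RCABOT_NAMESPACES) parse naturally in Python. These helpers are hypothetical, shown only to make the expected formats concrete:

```python
def parse_labels(raw):
    """Parse 'env=prod,region=us-east-1' into a dict of label pairs."""
    labels = {}
    for pair in (p.strip() for p in raw.split(",")):
        if not pair:
            continue
        key, _, value = pair.partition("=")
        labels[key] = value
    return labels

def parse_namespaces(raw):
    """Parse 'default,prod' into a list; empty input falls back to ['default']."""
    names = [n.strip() for n in raw.split(",") if n.strip()]
    return names or ["default"]
```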

FAQ

Do I need to install the agent on every machine?

No — for cloud log services. CloudWatch, Azure Monitor, and GCP Logging are centralised APIs. One agent per AWS account / Azure subscription / GCP project is enough to pull logs and metrics for all resources in that account.

Yes — only for OS-level context. If you want systemd service status, journald logs, or process-level CPU/memory from a specific host, the agent must run on that host. This is only needed for legacy VMs or bare-metal machines not covered by your Kubernetes cluster.

Does the agent need inbound firewall rules?

No. The agent opens a single outbound WebSocket to app.infranexis.com:443. Standard HTTPS outbound is all that's required.

What happens if the agent disconnects?

The agent reconnects automatically with exponential backoff (up to 60 seconds between retries). RCAs still run during disconnects — they just won't include live context from that agent. The dashboard shows a clear Disconnected indicator.
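The reconnect schedule (exponential backoff capped at 60 seconds) can be sketched as below. The 1-second base delay is an illustrative assumption — the agent's actual base interval is not documented here.

```python
def backoff_delays(base=1.0, cap=60.0, attempts=8):
    """Exponential backoff: base * 2^n seconds, capped at `cap`."""
    return [min(base * (2 ** n), cap) for n in range(attempts)]
```

With the assumed 1 s base, the delays run 1, 2, 4, 8, 16, 32 s, then hold at the 60 s cap until the connection is restored.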

Can I restrict which namespaces or log groups the agent can see?

Yes. In Configuration → Pipelines, each pipeline has a namespaces field. The agent only queries the namespaces listed there. For CloudWatch, set RCABOT_AWS_LOG_GROUPS to pin specific log groups and prevent queries to others.

Can I run multiple agents (prod and staging)?

Yes. Install one agent per cluster or environment. Each registers separately in the dashboard. Use RCABOT_LABELS=env=prod and RCABOT_LABELS=env=staging to tag them, then route specific pipelines to specific agents.

How much resource does the agent use?

Idle: ~10 MB memory, ~0% CPU. During an RCA collection burst (a few seconds): ~50–80 MB memory. The k8s manifest requests 50m CPU / 64Mi and limits to 200m CPU / 128Mi.

Which Kubernetes permissions does the agent need?

A ClusterRole with read-only verbs (get, list, watch) on: pods, pods/log, events, deployments, replicasets, nodes, namespaces. It cannot create, delete, patch, or modify anything.

Is the agent source code available?

Yes. The agent is plain Python — no compiled binaries. The exact script that runs on your machine is served from your install URL and can be read in full before deployment. There is nothing hidden or obfuscated.

Ready to install?

Your personalised install command is waiting in the dashboard.

Go to dashboard →