Agent Installation Guide

The Infranexis Resolve agent connects your infrastructure to the RCA engine. It runs inside your environment, collects context when an alert fires, and sends it back over an encrypted WebSocket — giving the RCA engine the full picture of what went wrong.

✓ Open source · 🔒 Outbound only — no inbound ports · Kubernetes & Linux VM · ~10 MB memory when idle

Overview

The agent is a lightweight Python process. It does three things:

  1. Opens a single outbound WebSocket to Infranexis Resolve — no inbound firewall rules needed.
  2. Sits completely idle until an alert fires and the RCA engine requests context.
  3. Collects only the data relevant to that incident (logs, metrics, events), returns it, then goes back to idle.
💡 The agent never runs continuous collection, never stores data locally, and never contacts any third party. It only acts in direct response to a live RCA request from your account.
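The request/idle loop above can be sketched roughly as follows. This is a minimal illustration, not the agent's actual code: the message shapes and field names (`type`, `service`, `fired_at`) are assumptions about the wire protocol, not its real format.

```python
import json

def handle_message(raw):
    """Illustrative dispatch: act only on a live collect_context request.

    Field names here ("type", "service", "fired_at") are assumptions,
    not the agent's real wire protocol.
    """
    msg = json.loads(raw)
    if msg.get("type") != "collect_context":
        return None  # stay idle: the agent only acts on RCA requests
    # In the real agent, this is where logs/metrics/events are gathered
    # for the named service around the fired_at timestamp.
    return {
        "type": "context_response",
        "service": msg["service"],
        "fired_at": msg["fired_at"],
    }
```

Anything other than a context request is ignored, which is what keeps the idle footprint so small.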

How it works

  Your Infrastructure                    Infranexis Resolve
  ─────────────────                      ──────────────────
                                         Alert arrives (PagerDuty /
  resolve-agent ──── WebSocket ────────▶ Prometheus / Grafana / etc.)
       │               (TLS)                      │
       │                                          │ RCA pipeline starts
       │◀──── collect_context request ────────────┘
       │        { service, fired_at }
       │
       ├─ Kubernetes API ──▶ pod logs, events, deployment status
       │
       ├─ Cloud API ────────▶ CloudWatch / Azure Monitor / GCP Logging
       │   (via IAM role)     logs + metrics in ±15 min window
       │
       ├─ System ───────────▶ CPU, memory, journald logs (VM mode)
       │
       └──── context_response ─────────────────▶ AI engine generates RCA
                                                  RCA delivered to Slack /
                                                  Teams / PagerDuty / Email

Deployment model — how many agents do you need?

This is the most important thing to understand before installing. The short answer: one agent per cluster or cloud account is usually enough.

| What you want to collect | Agents needed | Why |
|---|---|---|
| Kubernetes pod logs, events, deployments | 1 per cluster | The agent talks to the Kubernetes API server, which has visibility across all nodes and namespaces in that cluster. |
| AWS CloudWatch Logs & Metrics | 1 per AWS account/region | CloudWatch is a centralised AWS service. The agent calls the CloudWatch API — it can pull logs from Lambda, ECS, RDS, EC2, anything in that account — without being on those machines. |
| Azure Monitor / Log Analytics | 1 per Azure subscription | Log Analytics is centralised. One agent with Managed Identity can query logs and metrics for any resource in that subscription. |
| GCP Cloud Logging & Monitoring | 1 per GCP project | Cloud Logging is centralised. One agent with the right service account can query logs for any resource in that project. |
| Linux host — systemd logs, process metrics | 1 per VM / host | journald and psutil are local to the machine. The agent must run on that host to read them. |
💡 Most common setup: install one agent on your EKS / AKS / GKE cluster. That single agent covers Kubernetes context and the cloud log/metric APIs for the entire account — no per-machine install needed.

Typical deployment examples

  Example A — EKS + CloudWatch  (1 agent total)
  ─────────────────────────────
  EKS cluster
  └── resolve-agent pod
       ├── k8s API  → pod logs, events, deployments (all namespaces)
       └── AWS API  → CloudWatch logs from Lambda, RDS, EC2, ECS
                      CloudWatch Metrics, CloudTrail events
                      (uses IRSA role — no static credentials)


  Example B — Mixed: EKS prod + standalone VM  (2 agents)
  ──────────────────────────────────────────────────────
  EKS cluster
  └── resolve-agent pod  → k8s + CloudWatch (entire AWS account)

  Legacy VM (e.g. MySQL host)
  └── resolve-agent service  → systemd status + journald logs
                                (only needed for this specific host)


  Example C — Multi-cloud  (1 agent per cloud account)
  ─────────────────────────────────────────────────────
  AWS account  → 1 agent (EKS pod with IRSA)
  Azure sub    → 1 agent (AKS pod with Managed Identity)
  GCP project  → 1 agent (GKE pod with Workload Identity)

Installation — Kubernetes

Your personalised install command (with your customer ID embedded) is in Configuration → Intelligence in the dashboard.

  1. Prerequisites
    • kubectl configured and pointing at the target cluster
    • Permission to create a Namespace, ServiceAccount, ClusterRole, and Deployment
  2. Apply the manifest
    kubectl apply -f https://app.infranexis.com/dashboard/install/<YOUR_CUSTOMER_ID>.yaml

    This creates a resolve-agent namespace and deploys the agent as a single-replica Deployment with a read-only ClusterRole.

  3. Verify it's running
    kubectl get pods -n resolve-agent
    # NAME                             READY   STATUS    RESTARTS
    # resolve-agent-7d9f8c-xxxx        1/1     Running   0

    Within 30 seconds the dashboard will show the agent as Connected.

Uninstall

kubectl delete namespace resolve-agent

Installation — Linux / VM

Use this for standalone hosts (bare metal, EC2, VMs) where you want OS-level context: journald logs, systemd service status, process metrics.

  1. Prerequisites
    • Linux with systemd (Ubuntu 20+, Debian 11+, RHEL 8+, Amazon Linux 2)
    • Run as root or with sudo
    • Outbound HTTPS access to app.infranexis.com:443
  2. Run the installer
    curl -fsSL https://app.infranexis.com/dashboard/install/<YOUR_CUSTOMER_ID>.sh | sudo bash

    The installer creates a resolve-agent system user, installs into /opt/resolve-agent, and registers a systemd service that starts on boot.

  3. Verify
    sudo systemctl status resolve-agent

Manage the service

sudo systemctl restart resolve-agent     # restart
sudo journalctl -fu resolve-agent        # follow logs
sudo systemctl disable --now resolve-agent  # stop + disable

Uninstall

sudo systemctl disable --now resolve-agent
sudo rm -rf /opt/resolve-agent /etc/systemd/system/resolve-agent.service
sudo systemctl daemon-reload

Installation — Manual (Python)

⚠️ Manual mode has no auto-restart on failure. Use the systemd installer for production Linux hosts; manual mode is for testing or environments without systemd.

pip install websockets psutil kubernetes

curl -fsSL https://app.infranexis.com/dashboard/agent.py -o agent.py

RCABOT_SERVER_URL=wss://app.infranexis.com/agent/ws \
RCABOT_API_KEY=rbk_live_xxx \
RCABOT_CUSTOMER_ID=your-uuid \
python3 agent.py
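All three RCABOT_* variables above are required in manual mode. A minimal pre-flight check — a hypothetical helper for illustration, not part of agent.py — could look like:

```python
import os

# The three variables manual mode requires (see the command above).
REQUIRED = ("RCABOT_SERVER_URL", "RCABOT_API_KEY", "RCABOT_CUSTOMER_ID")

def missing_env(env):
    """Return the names of required variables that are missing or empty."""
    return [name for name in REQUIRED if not env.get(name)]

# Example: abort early with a clear error before connecting.
# missing = missing_env(os.environ)
# if missing:
#     raise SystemExit("Missing: " + ", ".join(missing))
```

Failing fast here is friendlier than a cryptic connection error after startup.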

Cloud Integration — How it works

For cloud log services (CloudWatch, Azure Monitor, GCP Logging), the agent acts as an authenticated API client — it calls the cloud provider's centralised logging API, not individual machines. This means one agent can pull logs for your entire AWS account, Azure subscription, or GCP project without being installed on every host.

  WITHOUT cloud integration          WITH cloud integration
  ──────────────────────             ──────────────────────
  Alert: Lambda crashed              Alert: Lambda crashed
  Context: k8s events only           Context: k8s events +
  RCA: limited                         CloudWatch Logs ("/aws/lambda/my-fn")
                                         → last 50 error lines before alert
                                       CloudWatch Metrics
                                         → CPU spike at 14:32
                                         → Error count: 47 in 5 min
                                       CloudTrail
                                         → deploy at 14:28 by user: john
                                       RCA: "Deployment at 14:28 introduced
                                            a memory leak causing OOM at 14:32"

The agent uses your cloud's built-in credential mechanism — no static API keys or secrets stored anywhere:

| Cloud | Credential mechanism | Where it runs |
|---|---|---|
| AWS | EC2 Instance Role / EKS IRSA | Attached to the EC2 instance or k8s ServiceAccount |
| Azure | Managed Identity / Workload Identity | Enabled on the VM or AKS pod identity |
| GCP | Workload Identity / Application Default Credentials | GKE ServiceAccount or Compute default SA |

AWS — CloudWatch Logs & Metrics

The agent collects from CloudWatch Logs, CloudWatch Metrics, CloudWatch Alarms, and CloudTrail — all scoped to the 15 minutes around the alert time.

  1. Install boto3 on the agent pod

    Update the agent Deployment to install boto3 at startup, or build a custom image that already includes it. First set the AWS region on the Deployment:

    kubectl set env deployment/resolve-agent -n resolve-agent \
      RCABOT_AWS_REGION=us-east-1

    Then edit the Deployment's command (or add an init step) to run pip install boto3 before starting the agent.

  2. Create an IAM policy
    {
      "Version": "2012-10-17",
      "Statement": [{
        "Sid": "InfranexisResolveReadOnly",
        "Effect": "Allow",
        "Action": [
          "logs:DescribeLogGroups",
          "logs:FilterLogEvents",
          "cloudwatch:DescribeAlarms",
          "cloudwatch:GetMetricStatistics",
          "cloudwatch:ListMetrics",
          "cloudtrail:LookupEvents"
        ],
        "Resource": "*"
      }]
    }
  3. Create an IRSA role and bind it to the agent's ServiceAccount
    # Create role with OIDC trust for your cluster
    eksctl create iamserviceaccount \
      --cluster=<cluster-name> \
      --namespace=resolve-agent \
      --name=resolve-agent \
      --attach-policy-arn=arn:aws:iam::<account-id>:policy/InfranexisResolveReadOnly \
      --approve
  4. (Optional) Pin specific log groups

    By default the agent auto-discovers log groups by matching your service name. You can override this:

    kubectl set env deployment/resolve-agent -n resolve-agent \
      RCABOT_AWS_LOG_GROUPS=/aws/lambda/my-service,/ecs/my-service,/app/prod
  On a Linux VM / EC2 instance:

  1. Install boto3
    sudo /opt/resolve-agent/venv/bin/pip install boto3
  2. Attach the IAM policy to your EC2 instance role

    In the AWS Console → EC2 → Your instance → Security → IAM Role → Add permissions → paste the policy JSON from step 2 of the Kubernetes instructions above.

    Or via CLI:

    aws iam put-role-policy \
      --role-name <your-instance-role> \
      --policy-name InfranexisResolveReadOnly \
      --policy-document file://policy.json
  3. (Optional) Set environment variables

    Edit /etc/systemd/system/resolve-agent.service and add to the [Service] section:

    Environment=RCABOT_AWS_REGION=us-east-1
    Environment=RCABOT_AWS_LOG_GROUPS=/aws/lambda/my-svc,/ecs/my-svc
    sudo systemctl daemon-reload && sudo systemctl restart resolve-agent
🔒 All IAM actions are read-only. The agent cannot create, modify, or delete any AWS resource.
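The ±15 minute scoping described above can be sketched with boto3. filter_log_events and its epoch-millisecond startTime/endTime parameters are the real CloudWatch Logs API; the helper names and the ERROR filter pattern are illustrative assumptions, not the agent's actual implementation.

```python
from datetime import datetime, timedelta, timezone

def incident_window_ms(fired_at, minutes=15):
    """Return (start, end) in epoch milliseconds, ±minutes around the alert."""
    start = fired_at - timedelta(minutes=minutes)
    end = fired_at + timedelta(minutes=minutes)
    return int(start.timestamp() * 1000), int(end.timestamp() * 1000)

def fetch_cloudwatch_errors(log_group, fired_at, region):
    # Requires boto3 and an IAM role allowing logs:FilterLogEvents
    # (see the policy JSON above).
    import boto3
    start, end = incident_window_ms(fired_at)
    client = boto3.client("logs", region_name=region)
    resp = client.filter_log_events(
        logGroupName=log_group,
        startTime=start,
        endTime=end,
        filterPattern="ERROR",  # illustrative: narrow to error lines
    )
    return [e["message"] for e in resp.get("events", [])]
```

Because the window is computed from the alert timestamp, the agent never needs to tail logs continuously.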

Azure — Monitor Logs & Metrics

The agent collects from Log Analytics (AppTraces, AppExceptions, ContainerLog) and Azure Monitor Metrics (CPU, memory, HTTP 5xx). Authentication uses Managed Identity — no client secrets.

  1. Install the Azure SDK on the agent
    # Kubernetes
    kubectl exec -n resolve-agent deploy/resolve-agent -- \
      pip install azure-identity azure-monitor-query
    
    # Linux VM
    sudo /opt/resolve-agent/venv/bin/pip install azure-identity azure-monitor-query
  2. Enable Managed Identity
    • Azure VM: Portal → VM → Identity → System assigned → On
    • AKS: Enable Workload Identity on the cluster, annotate the agent's ServiceAccount with the client ID
  3. Grant the identity read access
    # Log Analytics Reader — on the workspace
    az role assignment create \
      --assignee <managed-identity-object-id> \
      --role "Log Analytics Reader" \
      --scope /subscriptions/<sub>/resourceGroups/<rg>/providers/Microsoft.OperationalInsights/workspaces/<workspace>
    
    # Monitoring Reader — on the resource group
    az role assignment create \
      --assignee <managed-identity-object-id> \
      --role "Monitoring Reader" \
      --scope /subscriptions/<sub>/resourceGroups/<rg>
  4. Set environment variables
    # Log Analytics workspace GUID
    # Found in: Azure Portal → Log Analytics workspace → Settings → Agents
    RCABOT_AZURE_WORKSPACE_ID=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
    
    # Full resource ID to pull metrics from (VM, AKS cluster, or App Service)
    RCABOT_AZURE_RESOURCE_ID=/subscriptions/<sub>/resourceGroups/<rg>/providers/Microsoft.Compute/virtualMachines/<vm-name>

    Apply via kubectl:

    kubectl set env deployment/resolve-agent -n resolve-agent \
      RCABOT_AZURE_WORKSPACE_ID=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx \
      RCABOT_AZURE_RESOURCE_ID=/subscriptions/...
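Once the workspace ID is set, a context query could look roughly like this sketch. LogsQueryClient.query_workspace and DefaultAzureCredential are the real azure-monitor-query / azure-identity APIs; the KQL itself and the 30-minute timespan are illustrative assumptions, not the agent's actual query.

```python
from datetime import timedelta

def build_kql(service, lookback_minutes=15):
    """Illustrative KQL: recent traces/exceptions/logs mentioning the service."""
    return (
        "union AppTraces, AppExceptions, ContainerLog "
        f"| where TimeGenerated > ago({lookback_minutes}m) "
        f'| where * contains "{service}" '
        "| take 50"
    )

def query_azure_logs(workspace_id, service):
    # Requires azure-identity and azure-monitor-query, plus a Managed
    # Identity holding Log Analytics Reader (see the steps above).
    from azure.identity import DefaultAzureCredential
    from azure.monitor.query import LogsQueryClient
    client = LogsQueryClient(DefaultAzureCredential())
    return client.query_workspace(
        workspace_id,
        build_kql(service),
        timespan=timedelta(minutes=30),
    )
```

With Managed Identity, DefaultAzureCredential resolves automatically on the VM or AKS pod — no secrets in the environment.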

GCP — Cloud Logging & Monitoring

The agent collects WARNING+ log entries and Cloud Monitoring metrics (Compute, Cloud Run, Cloud Functions). Uses Workload Identity or Application Default Credentials.

  1. Install the GCP SDK on the agent
    # Kubernetes
    kubectl exec -n resolve-agent deploy/resolve-agent -- \
      pip install google-cloud-logging google-cloud-monitoring
    
    # Linux VM
    sudo /opt/resolve-agent/venv/bin/pip install \
      google-cloud-logging google-cloud-monitoring
  2. Create a GCP service account and grant read-only access
    gcloud iam service-accounts create resolve-agent \
      --project=<project-id>
    
    gcloud projects add-iam-policy-binding <project-id> \
      --member="serviceAccount:resolve-agent@<project-id>.iam.gserviceaccount.com" \
      --role="roles/logging.viewer"
    
    gcloud projects add-iam-policy-binding <project-id> \
      --member="serviceAccount:resolve-agent@<project-id>.iam.gserviceaccount.com" \
      --role="roles/monitoring.viewer"
  3. Bind to the k8s ServiceAccount (GKE Workload Identity)
    gcloud iam service-accounts add-iam-policy-binding \
      resolve-agent@<project-id>.iam.gserviceaccount.com \
      --role=roles/iam.workloadIdentityUser \
      --member="serviceAccount:<project-id>.svc.id.goog[resolve-agent/resolve-agent]"
    
    kubectl annotate serviceaccount resolve-agent \
      -n resolve-agent \
      iam.gke.io/gcp-service-account=resolve-agent@<project-id>.iam.gserviceaccount.com

    On a Compute Engine VM the default service account already has logging/monitoring viewer. No extra steps needed.

  4. Set environment variables
    # Auto-detected from metadata server on GCE/GKE — only needed outside GCP
    RCABOT_GCP_PROJECT_ID=my-project-123
    
    # Specific Cloud Logging log names to search (optional)
    # Auto-matches by service name if blank
    RCABOT_GCP_LOG_NAMES=my-service,stderr,stdout
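A WARNING+ query on GCP could be sketched as follows. Client.list_entries and its filter_ parameter are the real google-cloud-logging API; the filter fields are illustrative assumptions (resource.labels.service_name applies to Cloud Run — other resource types use different labels).

```python
def build_gcp_filter(service, start_iso, end_iso):
    """Illustrative Cloud Logging filter: WARNING+ entries in a time window."""
    return (
        f'severity>=WARNING AND timestamp>="{start_iso}" '
        f'AND timestamp<="{end_iso}" '
        f'AND resource.labels.service_name="{service}"'
    )

def fetch_gcp_logs(project_id, service, start_iso, end_iso):
    # Requires google-cloud-logging and Workload Identity / ADC with
    # roles/logging.viewer (see the steps above).
    from google.cloud import logging as gcp_logging
    client = gcp_logging.Client(project=project_id)
    entries = client.list_entries(
        filter_=build_gcp_filter(service, start_iso, end_iso)
    )
    return [e.payload for e in entries]
```

On GKE with Workload Identity, no credentials file is needed — the client library picks up the bound service account automatically.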

Environment Variables Reference

| Variable | Required | Description |
|---|---|---|
| RCABOT_SERVER_URL | Yes | WebSocket URL, e.g. wss://app.infranexis.com/agent/ws. Set automatically by the installer. |
| RCABOT_API_KEY | Yes | Your API key (rbk_live_xxx). Set automatically by the installer. |
| RCABOT_CUSTOMER_ID | Yes | Your customer UUID. Set automatically by the installer. |
| RCABOT_NAMESPACES | No | Comma-separated k8s namespaces to watch. Default: default. Can also be set per-pipeline in the dashboard. |
| RCABOT_LABELS | No | Comma-separated key=value labels, e.g. env=prod,region=us-east-1. Shown in the dashboard. |
| RCABOT_HOSTNAME | No | Override the hostname shown in the dashboard. Defaults to socket.gethostname(). |
| RCABOT_AWS_REGION | No | AWS region for CloudWatch API calls. Auto-detected from EC2 IMDS if blank. |
| RCABOT_AWS_LOG_GROUPS | No | Comma-separated CloudWatch log group names. Auto-discovered by service name if blank. |
| RCABOT_AZURE_WORKSPACE_ID | No | Log Analytics workspace GUID. Required to enable Azure log collection. |
| RCABOT_AZURE_RESOURCE_ID | No | Full Azure resource ID for metrics. Required to enable Azure metrics collection. |
| RCABOT_GCP_PROJECT_ID | No | GCP project ID. Auto-detected from metadata server on GCE/GKE. |
| RCABOT_GCP_LOG_NAMES | No | Comma-separated Cloud Logging log names. Auto-matches by service name if blank. |
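The comma-separated formats above (RCABOT_LABELS, RCABOT_NAMESPACES) parse naturally in Python. These helpers are hypothetical, shown only to make the expected formats concrete:

```python
def parse_labels(raw):
    """Parse 'env=prod,region=us-east-1' into a dict of label pairs."""
    labels = {}
    for pair in (p.strip() for p in raw.split(",")):
        if not pair:
            continue
        key, _, value = pair.partition("=")
        labels[key] = value
    return labels

def parse_namespaces(raw):
    """Parse 'default,prod' into a list; empty input falls back to ['default']."""
    names = [n.strip() for n in raw.split(",") if n.strip()]
    return names or ["default"]
```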

FAQ

Do I need to install the agent on every machine?

No — for cloud log services. CloudWatch, Azure Monitor, and GCP Logging are centralised APIs. One agent per AWS account / Azure subscription / GCP project is enough to pull logs and metrics for all resources in that account.

Yes — only for OS-level context. If you want systemd service status, journald logs, or process-level CPU/memory from a specific host, the agent must run on that host. This is only needed for legacy VMs or bare-metal machines not covered by your Kubernetes cluster.

Does the agent need inbound firewall rules?

No. The agent opens a single outbound WebSocket to app.infranexis.com:443. Standard HTTPS outbound is all that's required.

What happens if the agent disconnects?

The agent reconnects automatically with exponential backoff (up to 60 seconds between retries). RCAs still run during disconnects — they just won't include live context from that agent. The dashboard shows a clear Disconnected indicator.
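The reconnect schedule (exponential backoff capped at 60 seconds) can be sketched as below. The 1-second base delay is an illustrative assumption — the agent's actual base interval is not documented here.

```python
def backoff_delays(base=1.0, cap=60.0, attempts=8):
    """Exponential backoff: base * 2^n seconds, capped at `cap`."""
    return [min(base * (2 ** n), cap) for n in range(attempts)]
```

With the assumed 1 s base, the delays run 1, 2, 4, 8, 16, 32 s, then hold at the 60 s cap until the connection is restored.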

Can I restrict which namespaces or log groups the agent can see?

Yes. In Configuration → Pipelines, each pipeline has a namespaces field. The agent only queries the namespaces listed there. For CloudWatch, set RCABOT_AWS_LOG_GROUPS to pin specific log groups and prevent queries to others.

Can I run multiple agents (prod and staging)?

Yes. Install one agent per cluster or environment. Each registers separately in the dashboard. Use RCABOT_LABELS=env=prod and RCABOT_LABELS=env=staging to tag them, then route specific pipelines to specific agents.

How much resource does the agent use?

Idle: ~10 MB memory, ~0% CPU. During an RCA collection burst (a few seconds): ~50–80 MB memory. The k8s manifest requests 50m CPU / 64Mi and limits to 200m CPU / 128Mi.

Which Kubernetes permissions does the agent need?

A ClusterRole with read-only verbs (get, list, watch) on: pods, pods/log, events, deployments, replicasets, nodes, namespaces. It cannot create, delete, patch, or modify anything.

Is the agent source code available?

Yes. The agent is plain Python — no compiled binaries. The exact script that runs on your machine is served from your install URL and can be read in full before deployment. There is nothing hidden or obfuscated.

Ready to install?

Your personalised install command is waiting in the dashboard.

Go to dashboard →