Agent Installation Guide
The Infranexis Resolve agent connects your infrastructure to the RCA engine. It runs inside your environment, collects context when an alert fires, and sends it back over an encrypted WebSocket — giving the RCA engine the full picture of what went wrong.
Overview
The agent is a lightweight Python process. It does three things:
- Opens a single outbound WebSocket to Infranexis Resolve — no inbound firewall rules needed.
- Sits completely idle until an alert fires and the RCA engine requests context.
- Collects only the data relevant to that incident (logs, metrics, events), returns it, then goes back to idle.
How it works
Your Infrastructure                              Infranexis Resolve
───────────────────                              ──────────────────
                                                 Alert arrives (PagerDuty /
resolve-agent ──── WebSocket ────────▶           Prometheus / Grafana / etc.)
     │              (TLS)                             │
     │                                                │ RCA pipeline starts
     │◀──── collect_context request ──────────────────┘
     │      { service, fired_at }
     │
     ├─ Kubernetes API ──▶ pod logs, events, deployment status
     │
     ├─ Cloud API ────────▶ CloudWatch / Azure Monitor / GCP Logging
     │  (via IAM role)      logs + metrics in ±15 min window
     │
     └─ System ───────────▶ CPU, memory, journald logs (VM mode)
     │
     └──── context_response ─────────────────▶   AI engine generates RCA
                                                 RCA delivered to Slack /
                                                 Teams / PagerDuty / Email
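The request/response exchange in the diagram can be sketched in a few lines of Python. This is only an illustration, not the agent's actual code: the `type` field, the `context` key, and the collector signature are assumptions; only the message names `collect_context` and `context_response` and the `service` / `fired_at` fields come from the diagram above.

```python
import json
from typing import Callable, Dict, Optional

# Hypothetical collector signature: (service, fired_at) -> context data.
Collector = Callable[[str, str], dict]


def handle_message(raw: str, collectors: Dict[str, Collector]) -> Optional[str]:
    """Answer a collect_context request; return None for anything else (stay idle)."""
    msg = json.loads(raw)
    if msg.get("type") != "collect_context":  # "type" is an assumed field name
        return None

    service, fired_at = msg["service"], msg["fired_at"]
    context = {}
    for name, collect in collectors.items():
        try:
            context[name] = collect(service, fired_at)
        except Exception as exc:
            # One failing source should not discard what the others returned.
            context[name] = {"error": str(exc)}

    return json.dumps({
        "type": "context_response",
        "service": service,
        "fired_at": fired_at,
        "context": context,
    })


if __name__ == "__main__":
    fake_collectors = {"kubernetes": lambda svc, ts: {"events": [f"{svc} restarted"]}}
    request = json.dumps({
        "type": "collect_context",
        "service": "checkout",
        "fired_at": "2024-05-01T14:32:00Z",
    })
    print(handle_message(request, fake_collectors))
```

Keeping each source behind its own collector function means a Kubernetes-only agent and a VM-only agent can share the same request handling.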
Deployment model — how many agents do you need?
This is the most important thing to understand before installing. The short answer: one agent per cluster or cloud account is usually enough.
| What you want to collect | Agents needed | Why |
|---|---|---|
| Kubernetes pod logs, events, deployments | 1 per cluster | The agent talks to the Kubernetes API server, which has visibility across all nodes and namespaces in that cluster. |
| AWS CloudWatch Logs & Metrics | 1 per AWS account/region | CloudWatch is a centralised AWS service. The agent calls the CloudWatch API — it can pull logs from Lambda, ECS, RDS, EC2, anything in that account — without being on those machines. |
| Azure Monitor / Log Analytics | 1 per Azure subscription | Log Analytics is centralised. One agent with Managed Identity can query logs and metrics for any resource in that subscription. |
| GCP Cloud Logging & Monitoring | 1 per GCP project | Cloud Logging is centralised. One agent with the right service account can query logs for any resource in that project. |
| Linux host — systemd logs, process metrics | 1 per VM / host | journald and psutil are local to the machine. The agent must run on that host to read them. |
Typical deployment examples
Example A — EKS + CloudWatch (1 agent total)
─────────────────────────────
EKS cluster
└── resolve-agent pod
├── k8s API → pod logs, events, deployments (all namespaces)
└── AWS API → CloudWatch logs from Lambda, RDS, EC2, ECS
CloudWatch Metrics, CloudTrail events
(uses IRSA role — no static credentials)
Example B — Mixed: EKS prod + standalone VM (2 agents)
──────────────────────────────────────────────────────
EKS cluster
└── resolve-agent pod → k8s + CloudWatch (entire AWS account)
Legacy VM (e.g. MySQL host)
└── resolve-agent service → systemd status + journald logs
(only needed for this specific host)
Example C — Multi-cloud (1 agent per cloud account)
─────────────────────────────────────────────────────
AWS account → 1 agent (EKS pod with IRSA)
Azure sub → 1 agent (AKS pod with Managed Identity)
GCP project → 1 agent (GKE pod with Workload Identity)
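The counting rule behind Examples A to C can be written down as a small planning helper. This is a sketch under assumptions: the `Inventory` structure and its field names are hypothetical, and an "account" here stands for whatever unit the table above uses (AWS account/region, Azure subscription, or GCP project).

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Inventory:
    # (cluster name, cloud account the cluster's agent can also query, or None)
    clusters: List[Tuple[str, Optional[str]]] = field(default_factory=list)
    # Cloud accounts whose centralised logs/metrics you want collected.
    cloud_accounts: List[str] = field(default_factory=list)
    # Hosts where OS-level context (journald, systemd, processes) is wanted.
    os_level_hosts: List[str] = field(default_factory=list)


def agents_needed(inv: Inventory) -> int:
    """One agent per cluster, per uncovered cloud account, and per OS-level host."""
    covered = {account for _, account in inv.clusters if account is not None}
    uncovered = {a for a in inv.cloud_accounts if a not in covered}
    return len(inv.clusters) + len(uncovered) + len(inv.os_level_hosts)


if __name__ == "__main__":
    example_a = Inventory(clusters=[("eks-prod", "aws-1")], cloud_accounts=["aws-1"])
    example_b = Inventory(clusters=[("eks-prod", "aws-1")], cloud_accounts=["aws-1"],
                          os_level_hosts=["legacy-mysql"])
    print(agents_needed(example_a), agents_needed(example_b))
```

A cluster agent with a cloud role attached covers its account, so the account is not counted a second time.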
Installation — Kubernetes
Your personalised install command (with your customer ID embedded) is in Configuration → Intelligence in the dashboard.
Prerequisites
- kubectl configured and pointing at the target cluster
- Permission to create a Namespace, ServiceAccount, ClusterRole, and Deployment
Apply the manifest
kubectl apply -f https://app.infranexis.com/dashboard/install/<YOUR_CUSTOMER_ID>.yaml
This creates a resolve-agent namespace and deploys the agent as a single-replica Deployment with a read-only ClusterRole.
Verify it's running
kubectl get pods -n resolve-agent
# NAME                        READY   STATUS    RESTARTS
# resolve-agent-7d9f8c-xxxx   1/1     Running   0
Within 30 seconds the dashboard will show the agent as Connected.
Uninstall
kubectl delete namespace resolve-agent
Installation — Linux / VM
Use this for standalone hosts (bare metal, EC2, VMs) where you want OS-level context: journald logs, systemd service status, process metrics.
Prerequisites
- Linux with systemd (Ubuntu 20.04+, Debian 11+, RHEL 8+, Amazon Linux 2)
- Run as root or with sudo
- Outbound HTTPS access to app.infranexis.com:443
Run the installer
curl -fsSL https://app.infranexis.com/dashboard/install/<YOUR_CUSTOMER_ID>.sh | sudo bash
The installer creates a resolve-agent system user, installs into /opt/resolve-agent, and registers a systemd service that starts on boot.
Verify
sudo systemctl status resolve-agent
Manage the service
sudo systemctl restart resolve-agent        # restart
sudo journalctl -fu resolve-agent           # follow logs
sudo systemctl disable --now resolve-agent  # stop + disable
Uninstall
sudo systemctl disable --now resolve-agent
sudo rm -rf /opt/resolve-agent /etc/systemd/system/resolve-agent.service
sudo systemctl daemon-reload
Installation — Manual (Python)
pip install websockets psutil kubernetes
curl -fsSL https://app.infranexis.com/dashboard/agent.py -o agent.py
RCABOT_SERVER_URL=wss://app.infranexis.com/agent/ws \
RCABOT_API_KEY=rbk_live_xxx \
RCABOT_CUSTOMER_ID=your-uuid \
python3 agent.py
Cloud Integration — How it works
For cloud log services (CloudWatch, Azure Monitor, GCP Logging), the agent acts as an authenticated API client — it calls the cloud provider's centralised logging API, not individual machines. This means one agent can pull logs for your entire AWS account, Azure subscription, or GCP project without being installed on every host.
WITHOUT cloud integration        WITH cloud integration
─────────────────────────        ──────────────────────
Alert: Lambda crashed            Alert: Lambda crashed
Context: k8s events only         Context: k8s events +
RCA: limited                       CloudWatch Logs ("/aws/lambda/my-fn")
                                     → last 50 error lines before alert
                                   CloudWatch Metrics
                                     → CPU spike at 14:32
                                     → Error count: 47 in 5 min
                                   CloudTrail
                                     → deploy at 14:28 by user: john
                                 RCA: "Deployment at 14:28 introduced
                                   a memory leak causing OOM at 14:32"
The agent uses your cloud's built-in credential mechanism, so no static cloud API keys or secrets are stored anywhere:
| Cloud | Credential mechanism | Where it runs |
|---|---|---|
| AWS | EC2 Instance Role / EKS IRSA | Attached to EC2 or k8s ServiceAccount |
| Azure | Managed Identity / Workload Identity | Enabled on VM or AKS pod identity |
| GCP | Workload Identity / Application Default Credentials | GKE ServiceAccount or Compute default SA |
AWS — CloudWatch Logs & Metrics
The agent collects from CloudWatch Logs, CloudWatch Metrics, CloudWatch Alarms, and CloudTrail, all scoped to the ±15-minute window around the alert time.
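As a sketch of how such a window could be derived from the alert's `fired_at` value: the helper names below are hypothetical, and it assumes `fired_at` arrives as an ISO 8601 string, which this guide does not specify. CloudWatch Logs time ranges are expressed in epoch milliseconds, hence the second helper.

```python
from datetime import datetime, timedelta, timezone
from typing import Tuple

HALF_WINDOW = timedelta(minutes=15)  # the ±15 min window from this guide


def collection_window(fired_at: str,
                      half_width: timedelta = HALF_WINDOW) -> Tuple[datetime, datetime]:
    """Return (start, end) surrounding the alert time, always timezone-aware UTC."""
    ts = datetime.fromisoformat(fired_at.replace("Z", "+00:00"))
    if ts.tzinfo is None:
        ts = ts.replace(tzinfo=timezone.utc)  # assumption: naive timestamps are UTC
    ts = ts.astimezone(timezone.utc)
    return ts - half_width, ts + half_width


def to_epoch_millis(dt: datetime) -> int:
    """Convert an aware datetime to milliseconds since the Unix epoch."""
    return int(dt.timestamp() * 1000)


if __name__ == "__main__":
    start, end = collection_window("2024-05-01T14:32:00Z")
    print(start.isoformat(), end.isoformat())
    print(to_epoch_millis(start), to_epoch_millis(end))
```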
EKS
Install boto3 on the agent pod
The agent needs boto3 to call AWS APIs. Either build a custom image that includes it, or patch the existing Deployment. To patch it, first set the AWS region:
kubectl set env deployment/resolve-agent -n resolve-agent \
  RCABOT_AWS_REGION=us-east-1
Then edit the Deployment's init container or command so that pip install boto3 runs before the agent starts.
Create an IAM policy
{
  "Version": "2012-10-17",
  "Statement": [{
    "Sid": "InfranexisResolveReadOnly",
    "Effect": "Allow",
    "Action": [
      "logs:DescribeLogGroups",
      "logs:FilterLogEvents",
      "cloudwatch:DescribeAlarms",
      "cloudwatch:GetMetricStatistics",
      "cloudwatch:ListMetrics",
      "cloudtrail:LookupEvents"
    ],
    "Resource": "*"
  }]
}
Create an IRSA role and bind it to the agent's ServiceAccount
# Create role with OIDC trust for your cluster
eksctl create iamserviceaccount \
  --cluster=<cluster-name> \
  --namespace=resolve-agent \
  --name=resolve-agent \
  --attach-policy-arn=arn:aws:iam::<account-id>:policy/InfranexisResolveReadOnly \
  --approve
(Optional) Pin specific log groups
By default the agent auto-discovers log groups by matching your service name. You can override this:
kubectl set env deployment/resolve-agent -n resolve-agent \
  RCABOT_AWS_LOG_GROUPS=/aws/lambda/my-service,/ecs/my-service,/app/prod
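The interaction between auto-discovery and the override might look like the sketch below. The matching rule is an assumption (a case-insensitive substring match on the service name); this guide does not document the agent's actual rule, and `select_log_groups` is a hypothetical helper.

```python
from typing import Iterable, List, Optional


def select_log_groups(service: str,
                      available: Iterable[str],
                      pinned: Optional[str] = None) -> List[str]:
    """Pick the log groups to query for one incident.

    pinned is the raw RCABOT_AWS_LOG_GROUPS value; when set, discovery is skipped.
    """
    if pinned:
        return [group.strip() for group in pinned.split(",") if group.strip()]
    needle = service.lower()
    # Assumed rule: any log group whose name contains the service name.
    return sorted(group for group in available if needle in group.lower())


if __name__ == "__main__":
    groups = ["/aws/lambda/my-service", "/ecs/other", "/ecs/My-Service-worker"]
    print(select_log_groups("my-service", groups))
    print(select_log_groups("my-service", groups, pinned="/app/prod"))
```

Pinning is the more predictable option when several services share a name prefix, since a substring match would pull in all of them.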
EC2 / Linux VM
Install boto3
sudo /opt/resolve-agent/venv/bin/pip install boto3
Attach the IAM policy to your EC2 instance role
In the AWS Console → EC2 → Your instance → Security → IAM Role → Add permissions → paste the policy JSON from the EKS steps above.
Or via CLI:
aws iam put-role-policy \
  --role-name <your-instance-role> \
  --policy-name InfranexisResolveReadOnly \
  --policy-document file://policy.json
(Optional) Set environment variables
Edit /etc/systemd/system/resolve-agent.service and add to the [Service] section:
Environment=RCABOT_AWS_REGION=us-east-1
Environment=RCABOT_AWS_LOG_GROUPS=/aws/lambda/my-svc,/ecs/my-svc
sudo systemctl daemon-reload && sudo systemctl restart resolve-agent
Azure — Monitor Logs & Metrics
The agent collects from Log Analytics (AppTraces, AppExceptions, ContainerLog) and Azure Monitor Metrics (CPU, memory, HTTP 5xx). Authentication uses Managed Identity — no client secrets.
Install the Azure SDK on the agent
# Kubernetes
kubectl exec -n resolve-agent deploy/resolve-agent -- \
  pip install azure-identity azure-monitor-query
# Linux VM
sudo /opt/resolve-agent/venv/bin/pip install azure-identity azure-monitor-query
Enable Managed Identity
- Azure VM: Portal → VM → Identity → System assigned → On
- AKS: Enable Workload Identity on the cluster, annotate the agent's ServiceAccount with the client ID
Grant the identity read access
# Log Analytics Reader (on the workspace)
az role assignment create \
  --assignee <managed-identity-object-id> \
  --role "Log Analytics Reader" \
  --scope /subscriptions/<sub>/resourceGroups/<rg>/providers/Microsoft.OperationalInsights/workspaces/<workspace>
# Monitoring Reader (on the resource group)
az role assignment create \
  --assignee <managed-identity-object-id> \
  --role "Monitoring Reader" \
  --scope /subscriptions/<sub>/resourceGroups/<rg>
Set environment variables
# Log Analytics workspace GUID
# Found in: Azure Portal → Log Analytics workspace → Settings → Agents
RCABOT_AZURE_WORKSPACE_ID=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
# Full resource ID to pull metrics from (VM, AKS cluster, or App Service)
RCABOT_AZURE_RESOURCE_ID=/subscriptions/<sub>/resourceGroups/<rg>/providers/Microsoft.Compute/virtualMachines/<vm-name>
Apply via kubectl:
kubectl set env deployment/resolve-agent -n resolve-agent \
  RCABOT_AZURE_WORKSPACE_ID=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx \
  RCABOT_AZURE_RESOURCE_ID=/subscriptions/...
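Because both values are easy to paste incorrectly (a leftover placeholder, a truncated resource ID), a quick local sanity check before applying them can save a debugging round. The sketch below is not part of the agent; `check_azure_settings` is a hypothetical helper, and the pattern only handles top-level resources of the shape shown above, not nested child resources.

```python
import re
import uuid

# Matches /subscriptions/<sub>/resourceGroups/<rg>/providers/<namespace>/<type>/<name>
RESOURCE_ID_RE = re.compile(
    r"^/subscriptions/(?P<subscription>[^/]+)"
    r"/resourceGroups/(?P<resource_group>[^/]+)"
    r"/providers/(?P<provider>[^/]+)/(?P<type>[^/]+)/(?P<name>[^/]+)$",
    re.IGNORECASE,
)


def check_azure_settings(workspace_id: str, resource_id: str) -> dict:
    """Validate both values and return the parsed parts of the resource ID."""
    uuid.UUID(workspace_id)  # raises ValueError if this is not a GUID
    match = RESOURCE_ID_RE.match(resource_id)
    if match is None:
        raise ValueError(f"not a full Azure resource ID: {resource_id!r}")
    return match.groupdict()


if __name__ == "__main__":
    parts = check_azure_settings(
        "12345678-1234-1234-1234-123456789abc",
        "/subscriptions/sub1/resourceGroups/rg1/providers/"
        "Microsoft.Compute/virtualMachines/vm1",
    )
    print(parts)
```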
GCP — Cloud Logging & Monitoring
The agent collects WARNING+ log entries and Cloud Monitoring metrics (Compute, Cloud Run, Cloud Functions). Uses Workload Identity or Application Default Credentials.
Install the GCP SDK on the agent
# Kubernetes
kubectl exec -n resolve-agent deploy/resolve-agent -- \
  pip install google-cloud-logging google-cloud-monitoring
# Linux VM
sudo /opt/resolve-agent/venv/bin/pip install \
  google-cloud-logging google-cloud-monitoring
Create a GCP service account and grant read-only access
gcloud iam service-accounts create resolve-agent \
  --project=<project-id>
gcloud projects add-iam-policy-binding <project-id> \
  --member="serviceAccount:resolve-agent@<project-id>.iam.gserviceaccount.com" \
  --role="roles/logging.viewer"
gcloud projects add-iam-policy-binding <project-id> \
  --member="serviceAccount:resolve-agent@<project-id>.iam.gserviceaccount.com" \
  --role="roles/monitoring.viewer"
Bind to the k8s ServiceAccount (GKE Workload Identity)
gcloud iam service-accounts add-iam-policy-binding \
  resolve-agent@<project-id>.iam.gserviceaccount.com \
  --role=roles/iam.workloadIdentityUser \
  --member="serviceAccount:<project-id>.svc.id.goog[resolve-agent/resolve-agent]"
kubectl annotate serviceaccount resolve-agent \
  -n resolve-agent \
  iam.gke.io/gcp-service-account=resolve-agent@<project-id>.iam.gserviceaccount.com
On a Compute Engine VM, the default service account may already have logging and monitoring read access, in which case no extra steps are needed.
Set environment variables
# Auto-detected from metadata server on GCE/GKE; only needed outside GCP
RCABOT_GCP_PROJECT_ID=my-project-123
# Specific Cloud Logging log names to search (optional)
# Auto-matches by service name if blank
RCABOT_GCP_LOG_NAMES=my-service,stderr,stdout
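To make "WARNING+ entries in the window around the alert" concrete, the sketch below builds a filter string in the Cloud Logging query language. It illustrates the kind of query involved and is not the agent's actual implementation; `build_log_filter` is a hypothetical helper, and it assumes simple log names (names containing slashes must be URL-encoded in a real `logName`).

```python
from datetime import datetime, timedelta, timezone
from typing import Sequence


def _rfc3339(dt: datetime) -> str:
    """Format an aware datetime as an RFC 3339 UTC timestamp."""
    return dt.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def build_log_filter(fired_at: datetime,
                     project_id: str,
                     log_names: Sequence[str] = (),
                     minutes: int = 15) -> str:
    """Build a filter for WARNING-or-worse entries around the alert time."""
    start = fired_at - timedelta(minutes=minutes)
    end = fired_at + timedelta(minutes=minutes)
    clauses = [
        "severity>=WARNING",
        f'timestamp>="{_rfc3339(start)}"',
        f'timestamp<="{_rfc3339(end)}"',
    ]
    if log_names:
        names = " OR ".join(f'"projects/{project_id}/logs/{n}"' for n in log_names)
        clauses.append(f"logName=({names})")
    return " AND ".join(clauses)


if __name__ == "__main__":
    alert_time = datetime(2024, 5, 1, 14, 32, tzinfo=timezone.utc)
    print(build_log_filter(alert_time, "my-project-123", ["stderr", "stdout"]))
```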
Environment Variables Reference
| Variable | Required | Description |
|---|---|---|
| RCABOT_SERVER_URL | Yes | WebSocket URL, e.g. wss://app.infranexis.com/agent/ws. Set automatically by the installer. |
| RCABOT_API_KEY | Yes | Your API key (rbk_live_xxx). Set automatically by the installer. |
| RCABOT_CUSTOMER_ID | Yes | Your customer UUID. Set automatically by the installer. |
| RCABOT_NAMESPACES | No | Comma-separated k8s namespaces to watch. Default: default. Can also be set per-pipeline in the dashboard. |
| RCABOT_LABELS | No | Comma-separated key=value labels, e.g. env=prod,region=us-east-1. Shown in the dashboard. |
| RCABOT_HOSTNAME | No | Override the hostname shown in the dashboard. Defaults to socket.gethostname(). |
| RCABOT_AWS_REGION | No | AWS region for CloudWatch API calls. Auto-detected from EC2 IMDS if blank. |
| RCABOT_AWS_LOG_GROUPS | No | Comma-separated CloudWatch log group names. Auto-discovered by service name if blank. |
| RCABOT_AZURE_WORKSPACE_ID | No | Log Analytics workspace GUID. Required to enable Azure log collection. |
| RCABOT_AZURE_RESOURCE_ID | No | Full Azure resource ID for metrics. Required to enable Azure metrics collection. |
| RCABOT_GCP_PROJECT_ID | No | GCP project ID. Auto-detected from metadata server on GCE/GKE. |
| RCABOT_GCP_LOG_NAMES | No | Comma-separated Cloud Logging log names. Auto-matches by service name if blank. |
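A minimal sketch of how a process could read and validate a subset of these variables, written from the table above rather than from the agent's source. The `AgentConfig` class and `load_config` function are hypothetical names; the variable names, the `default` namespace default, and the `wss://` URL form are taken from the table.

```python
import os
from dataclasses import dataclass
from typing import Dict, List, Mapping

REQUIRED = ("RCABOT_SERVER_URL", "RCABOT_API_KEY", "RCABOT_CUSTOMER_ID")


def _csv(value: str) -> List[str]:
    """Split a comma-separated value, dropping blanks and surrounding spaces."""
    return [item.strip() for item in value.split(",") if item.strip()]


@dataclass
class AgentConfig:
    server_url: str
    api_key: str
    customer_id: str
    namespaces: List[str]
    labels: Dict[str, str]
    aws_log_groups: List[str]


def load_config(env: Mapping[str, str] = os.environ) -> AgentConfig:
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise ValueError("missing required variables: " + ", ".join(missing))
    if not env["RCABOT_SERVER_URL"].startswith("wss://"):
        raise ValueError("RCABOT_SERVER_URL should be a wss:// URL")

    # Labels are key=value pairs; entries without "=" are ignored in this sketch.
    pairs = [p.split("=", 1) for p in _csv(env.get("RCABOT_LABELS", "")) if "=" in p]
    return AgentConfig(
        server_url=env["RCABOT_SERVER_URL"],
        api_key=env["RCABOT_API_KEY"],
        customer_id=env["RCABOT_CUSTOMER_ID"],
        namespaces=_csv(env.get("RCABOT_NAMESPACES", "")) or ["default"],
        labels={key.strip(): value.strip() for key, value in pairs},
        aws_log_groups=_csv(env.get("RCABOT_AWS_LOG_GROUPS", "")),
    )


if __name__ == "__main__":
    demo_env = {
        "RCABOT_SERVER_URL": "wss://app.infranexis.com/agent/ws",
        "RCABOT_API_KEY": "rbk_live_xxx",
        "RCABOT_CUSTOMER_ID": "your-uuid",
        "RCABOT_LABELS": "env=prod,region=us-east-1",
    }
    print(load_config(demo_env))
```

Failing fast on a missing required variable gives a clearer error than a WebSocket handshake that is rejected later.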
Security & Privacy
- Agent source code is open source — inspect every line before deploying. The script served by your install URL is the exact code that runs.
- Outbound-only WebSocket to port 443 — no inbound attack surface, no listening ports.
- Kubernetes RBAC is read-only — the agent cannot create, delete, or modify any resource.
- Cloud IAM permissions are read-only — logs:FilterLogEvents, cloudwatch:GetMetricStatistics, monitoring.viewer, etc.
- No static cloud credentials: the agent uses an instance role (AWS), Managed Identity (Azure), or Workload Identity (GCP).
- WebSocket connection authenticated with your unique API key. No other customer can see your data.
- All data in transit encrypted with TLS 1.3.
- Context data is used only to generate the RCA — never sold, shared, or used to train models.
- Idle memory: ~10 MB. Collection burst: ~50–80 MB for a few seconds. No continuous background activity.
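If you want to check the read-only claim yourself before attaching a policy, a rough audit can be scripted. The sketch below flags any allowed action whose name does not start with a read-style verb. The prefix list is a heuristic chosen for this example, not an official AWS classification, and `non_read_only_actions` is a hypothetical helper; the policy text is the one from the AWS section of this guide.

```python
import json
from typing import List

# Heuristic: verbs that conventionally denote read operations in AWS action names.
READ_ONLY_PREFIXES = ("Describe", "Get", "List", "Filter", "Lookup")

POLICY = """
{
  "Version": "2012-10-17",
  "Statement": [{
    "Sid": "InfranexisResolveReadOnly",
    "Effect": "Allow",
    "Action": [
      "logs:DescribeLogGroups", "logs:FilterLogEvents",
      "cloudwatch:DescribeAlarms", "cloudwatch:GetMetricStatistics",
      "cloudwatch:ListMetrics", "cloudtrail:LookupEvents"
    ],
    "Resource": "*"
  }]
}
"""


def non_read_only_actions(policy_json: str) -> List[str]:
    """Return allowed actions that do not look read-only (wildcards included)."""
    statements = json.loads(policy_json)["Statement"]
    if isinstance(statements, dict):
        statements = [statements]
    flagged = []
    for statement in statements:
        if statement.get("Effect") != "Allow":
            continue
        actions = statement.get("Action", [])
        if isinstance(actions, str):
            actions = [actions]
        for action in actions:
            verb = action.split(":", 1)[-1]
            if not verb.startswith(READ_ONLY_PREFIXES):
                flagged.append(action)
    return flagged


if __name__ == "__main__":
    print(non_read_only_actions(POLICY))
```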
FAQ
No — for cloud log services. CloudWatch, Azure Monitor, and GCP Logging are centralised APIs. One agent per AWS account / Azure subscription / GCP project is enough to pull logs and metrics for all resources in that account.
Yes — only for OS-level context. If you want systemd service status, journald logs, or process-level CPU/memory from a specific host, the agent must run on that host. This is only needed for legacy VMs or bare-metal machines not covered by your Kubernetes cluster.
No. The agent opens a single outbound WebSocket to app.infranexis.com:443. Standard HTTPS outbound is all that's required.
The agent reconnects automatically with exponential backoff (up to 60 seconds between retries). RCAs still run during disconnects — they just won't include live context from that agent. The dashboard shows a clear Disconnected indicator.
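The retry schedule described above can be pictured as a capped exponential sequence. In this sketch the 60-second cap comes from the text, while the 1-second starting delay and the doubling factor are assumptions; real implementations often add random jitter as well.

```python
import itertools
from typing import Iterator


def backoff_delays(base: float = 1.0, factor: float = 2.0,
                   cap: float = 60.0) -> Iterator[float]:
    """Yield reconnect delays in seconds: base, base*factor, ... never above cap."""
    delay = base
    while True:
        yield min(delay, cap)
        delay = min(delay * factor, cap)


if __name__ == "__main__":
    # First eight delays a reconnect loop would sleep for between attempts.
    print(list(itertools.islice(backoff_delays(), 8)))
```

A reconnect loop would sleep for each yielded delay after a failed attempt and create a fresh generator once a connection succeeds, so the next outage starts again from the short delay.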
Yes. In Configuration → Pipelines, each pipeline has a namespaces field. The agent only queries the namespaces listed there. For CloudWatch, set RCABOT_AWS_LOG_GROUPS to pin specific log groups and prevent queries to others.
Yes. Install one agent per cluster or environment. Each registers separately in the dashboard. Use RCABOT_LABELS=env=prod and RCABOT_LABELS=env=staging to tag them, then route specific pipelines to specific agents.
Idle: ~10 MB memory, ~0% CPU. During an RCA collection burst (a few seconds): ~50–80 MB memory. The k8s manifest requests 50m CPU / 64Mi and limits to 200m CPU / 128Mi.
A ClusterRole with read-only verbs (get, list, watch) on: pods, pods/log, events, deployments, replicasets, nodes, namespaces. It cannot create, delete, patch, or modify anything.
Yes. The agent is plain Python — no compiled binaries. The exact script that runs on your machine is served from your install URL and can be read in full before deployment. There is nothing hidden or obfuscated.