Architecture

Deployment Model

LogClaw uses a namespace-per-tenant, dedicated-instance model. Every tenant receives its own isolated Kubernetes namespace (logclaw-<tenantId>) containing a full, dedicated copy of every component. There is no shared data plane between tenants. Cluster-scoped operators (Strimzi, Flink Operator, ESO, cert-manager, OpenSearch Operator) are installed once per cluster and watch all tenant namespaces via label selectors. Tenant workloads are provisioned and reconciled through ArgoCD ApplicationSet, which generates one ArgoCD Application per tenant values file committed to gitops/tenants/.

Data Flow

┌─────────────────────────────────────────────────────────────────────┐
│                        Log Sources                                  │
│   Apps, Infrastructure, Cloud Services, CI/CD, Kubernetes Pods      │
└──────────────┬──────────────────────────┬───────────────────────────┘
               │ OTLP/gRPC :4317          │ OTLP/HTTP :4318
               ▼                          ▼
        ┌──────────────────────────────────────┐
        │       logclaw-otel-collector         │
        │   OTLP receiver → enrich → batch    │
        │   (tenant_id injection, batching)    │
        └──────────────────┬───────────────────┘
                           │ Kafka produce (otlp_json, lz4)

        ┌──────────────────────────────────────┐
        │          logclaw-kafka               │
        │   KRaft mode (Strimzi)               │
        │   Topics: raw-logs, enriched-logs    │
        └──────┬──────────────────┬────────────┘
               │                  │
     ┌─────────▼──────┐   ┌──────▼────────────────────────────────┐
     │ logclaw-flink  │   │         logclaw-bridge                │
     │ Stream jobs    │   │  Thread 1: OTLP ETL (flatten→enrich) │
     │ (production)   │   │  Thread 2: Anomaly detection (Z-score)│
     │                │   │  Thread 3: OpenSearch indexer          │
     │                │   │  Thread 4: Request lifecycle engine    │
     └────────────────┘   │           (5-layer trace correlation) │
                          └──────┬────────────────┬───────────────┘
                                 │                │
                    ┌────────────▼──┐    ┌────────▼──────────────┐
                    │  OpenSearch   │    │  logclaw-ticketing    │
                    │  (search +   │    │  AI SRE Agent         │
                    │   analytics) │    │  (PagerDuty, Jira,    │
                    └──────┬───────┘    │   ServiceNow, etc.)   │
                           │            └───────────────────────┘
                    ┌──────▼───────┐
                    │  Dashboard   │◄── logclaw-agent
                    │  (Next.js)   │    (infra health metrics)
                    └──────────────┘

Component Details

OTel Collector — Ingestion Gateway

The OpenTelemetry Collector is the sole entry point for all log data. It accepts OTLP, the CNCF standard supported by Datadog, Splunk, Grafana, AWS, GCP, and Azure, over both gRPC (:4317) and HTTP (:4318). Pipeline: otlp receiver → memory_limiter → resource processor (inject tenant_id) → batch → kafka exporter (otlp_json, lz4 compression)
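
As a concrete illustration, a client can POST logs to the collector's OTLP/HTTP endpoint using only the standard library. This is a minimal sketch: the payload shape (resourceLogs → scopeLogs → logRecords) follows the OTLP JSON encoding, and the endpoint URL assumes a local collector on the default port.

```python
import json
import time
import urllib.request

def build_otlp_log(service: str, message: str, severity: str = "ERROR") -> dict:
    """Build a minimal OTLP/JSON log payload (resourceLogs -> scopeLogs -> logRecords)."""
    return {
        "resourceLogs": [{
            "resource": {"attributes": [
                {"key": "service.name", "value": {"stringValue": service}},
            ]},
            "scopeLogs": [{
                "scope": {"name": "example-client"},
                "logRecords": [{
                    "timeUnixNano": str(time.time_ns()),
                    "severityText": severity,
                    "body": {"stringValue": message},
                }],
            }],
        }]
    }

def send_to_collector(payload: dict, endpoint: str = "http://localhost:4318/v1/logs") -> None:
    """POST the payload to the collector's OTLP/HTTP logs endpoint."""
    req = urllib.request.Request(
        endpoint,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)
```

The collector's resource processor then injects tenant_id before the batch is handed to the Kafka exporter.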

Kafka — Event Bus

Apache Kafka (Strimzi, KRaft mode — no ZooKeeper) provides the durable event bus. Two primary topics:
| Topic | Producer | Consumer | Format |
|---|---|---|---|
| raw-logs | OTel Collector | Bridge / Flink | OTLP JSON |
| enriched-logs | Bridge / Flink | Ticketing Agent | Flat JSON (normalized) |

Bridge — ETL + Intelligence Engine

The Bridge is a Python service running 4 concurrent threads:
| Thread | Role | Details |
|---|---|---|
| OTLP ETL | Flatten OTLP JSON → normalized documents | Unwraps resourceLogs → scopeLogs → logRecords; extracts body, severity, traceId, spanId, timestamps |
| Anomaly Detection | Z-score based anomaly scoring | Sliding window over error rates per service; configurable threshold and window size |
| OpenSearch Indexer | Bulk index enriched documents | Reads from enriched-logs, writes to logclaw-logs-YYYY.MM.dd indices |
| Request Lifecycle | 5-layer trace correlation engine | Groups logs by traceId → builds request timelines → computes blast radius → generates incident context |
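
The ETL thread's unwrapping step can be sketched as follows. This is a minimal illustration of the resourceLogs → scopeLogs → logRecords traversal, not LogClaw's exact implementation; the output field names mirror the table above.

```python
def flatten_otlp(payload: dict) -> list[dict]:
    """Unwrap resourceLogs -> scopeLogs -> logRecords into flat documents."""
    docs = []
    for rl in payload.get("resourceLogs", []):
        # Resource attributes (e.g. service.name) apply to every record beneath them.
        attrs = {a["key"]: a["value"].get("stringValue")
                 for a in rl.get("resource", {}).get("attributes", [])}
        for sl in rl.get("scopeLogs", []):
            for rec in sl.get("logRecords", []):
                docs.append({
                    "service": attrs.get("service.name"),
                    "body": rec.get("body", {}).get("stringValue"),
                    "severity": rec.get("severityText"),
                    "traceId": rec.get("traceId"),
                    "spanId": rec.get("spanId"),
                    "timestamp": rec.get("timeUnixNano"),
                })
    return docs
```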
Bridge vs Flink: The Bridge provides trace correlation, anomaly detection, and OpenSearch indexing in a single lightweight Python service. For high-throughput production workloads, Flink handles stream processing; for dev/demo and early-stage deployments, the Bridge alone is simpler because no Flink Operator is needed. Enable both for maximum capability.
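
The anomaly-detection thread's Z-score scoring can be sketched as a sliding window over per-service error rates. The window size and threshold values here are illustrative defaults, not LogClaw's shipped configuration.

```python
from collections import deque
import statistics

class ZScoreDetector:
    """Sliding-window Z-score over per-service error rates (illustrative sketch)."""

    def __init__(self, window: int = 30, threshold: float = 3.0):
        self.window = deque(maxlen=window)   # recent error-rate samples
        self.threshold = threshold           # |z| at or above this is anomalous

    def observe(self, error_rate: float) -> bool:
        """Score the new sample against the window, then add it to the window."""
        anomalous = False
        if len(self.window) >= 2:
            mean = statistics.fmean(self.window)
            stdev = statistics.stdev(self.window)
            if stdev > 0:
                z = (error_rate - mean) / stdev
                anomalous = abs(z) >= self.threshold
        self.window.append(error_rate)
        return anomalous
```

One detector instance would be kept per service, fed from the per-service error-rate stream.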

OpenSearch — Search & Analytics

OpenSearch provides full-text search, log analytics, and visualization. Deployed with dedicated master and data nodes for production tiers. Index pattern: logclaw-logs-YYYY.MM.dd with automatic ISM (Index State Management) policies.
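
The daily index a record lands in can be derived directly from its OTLP timestamp. A minimal sketch, assuming timestamps are interpreted as UTC:

```python
from datetime import datetime, timezone

def index_for(ts_unix_nano: int) -> str:
    """Map a log record's timeUnixNano to its daily index, e.g. logclaw-logs-2021.01.02."""
    dt = datetime.fromtimestamp(ts_unix_nano / 1e9, tz=timezone.utc)
    return f"logclaw-logs-{dt:%Y.%m.%d}"
```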

Ticketing Agent — AI SRE

The Ticketing Agent consumes anomalies from Kafka, correlates them with trace data, and creates deduplicated incident tickets across 6 platforms:
  • PagerDuty — severity-based routing with auto-acknowledgment
  • Jira — project/issue type mapping with custom fields
  • ServiceNow — CMDB integration with assignment groups
  • OpsGenie — team-based routing with schedules
  • Slack — webhook notifications with thread updates
  • Zammad — in-cluster ticketing (self-hosted option)
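
Deduplication can be implemented with a stable fingerprint per anomaly. The sketch below keys on (service, anomaly type); the field choice is illustrative, and the real agent may key on additional context.

```python
import hashlib

def incident_fingerprint(service: str, anomaly_type: str) -> str:
    """Stable dedup key: repeated anomalies of the same type on the same
    service map to the same fingerprint (field choice is illustrative)."""
    return hashlib.sha256(f"{service}|{anomaly_type}".encode()).hexdigest()[:16]

seen: set[str] = set()

def should_open_ticket(service: str, anomaly_type: str) -> bool:
    """Open a ticket only the first time a fingerprint is seen."""
    fp = incident_fingerprint(service, anomaly_type)
    if fp in seen:
        return False
    seen.add(fp)
    return True
```

In practice the seen set would live in a store with a TTL, so resolved incidents can re-open tickets.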

ML Engine — Model Inference

Feast Feature Store + KServe InferenceService for serving anomaly detection models. Airflow orchestrates retraining DAGs that pull features from Feast, train models, and deploy updated InferenceServices.

Infrastructure Agent — Cluster Health

A Go-based sidecar that collects infrastructure health metrics:
| Collector | Data Source | Metrics |
|---|---|---|
| Kafka | Strimzi CRDs | Consumer lag, broker status, topic health |
| Flink | Flink Operator CRDs | Job state, task manager status |
| OpenSearch | OpenSearch REST API | Cluster health, index stats, node stats |
| ESO | External Secrets CRDs | Secret sync status, last sync time |
Exposes /health, /ready, and /metrics endpoints consumed by the Dashboard.

Dashboard — Web UI

Next.js application providing:
  • Log ingestion — drag-and-drop JSON/NDJSON file upload via OTLP proxy
  • Pipeline monitoring — real-time throughput visualization (Ingest → Stream → Process → Index)
  • Incident management — view, acknowledge, resolve, escalate incidents
  • Anomaly visualization — charts showing anomaly scores and affected services
  • System configuration — runtime config for ticketing platforms, anomaly thresholds, LLM settings

Multi-Cloud Abstraction

LogClaw abstracts provider-specific details through two global configuration surfaces:

Object Storage

  • s3 — AWS S3 or S3-compatible (MinIO, Ceph)
  • gcs — Google Cloud Storage
  • azure — Azure Blob Storage

Secret Management

  • aws — AWS Secrets Manager (ESO)
  • gcp — Google Secret Manager
  • vault — HashiCorp Vault
  • azure — Azure Key Vault

The same Helm chart works across AWS, GCP, Azure, and on-prem clusters. Only the provider, region, and bucket/endpoint fields differ.

Tier Profiles

| Setting | standard | ha | ultra-ha |
|---|---|---|---|
| Kafka brokers | 1 | 3 | 5 |
| Kafka storage / broker | 100 Gi | 1 Ti | 2 Ti |
| OpenSearch masters | 1 | 3 | 3 |
| OpenSearch data nodes | 1 | 3 | 5 |
| OpenSearch disk / node | 100 Gi | 1 Ti | 2 Ti |
| OTel Collector replicas | 1 | 3 | 5 |
| Flink task managers | 1 | 2 | 4 |
| Ticketing agent replicas | 1 | 2 | 3 |
| ML Engine replicas | 1 | 2 | 3 |
| PodDisruptionBudget | No | Yes | Yes |
| TopologySpread | No | Zone | Zone + Node |