LogClaw Architecture — End-to-End Data Flow
Overview
LogClaw is an AI-powered observability platform that ingests OTLP logs, detects anomalies using ML-boosted scoring, automatically creates incidents, and dispatches alerts with root cause analysis.
Service Map
Step-by-Step: 10 Logs Through the Pipeline
Step 1: Ingestion (Auth Proxy → OTel Collector)
Customer sends OTLP logs via HTTP POST.
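The original request example is not shown here; the following is a minimal, illustrative client call. The proxy URL, the `/v1/logs` path, and the payload contents are assumptions, not taken from the LogClaw docs.

```python
# Hypothetical client call; endpoint, path and payload values are illustrative.
import requests

otlp_payload = {
    "resourceLogs": [{
        "resource": {"attributes": [{"key": "service.name",
                                     "value": {"stringValue": "checkout"}}]},
        "scopeLogs": [{
            "logRecords": [{
                "timeUnixNano": "1718000000000000000",
                "severityText": "ERROR",
                "body": {"stringValue": "payment gateway timeout"},
            }]
        }]
    }]
}

resp = requests.post(
    "https://logclaw.example.com/v1/logs",    # assumed proxy URL
    json=otlp_payload,
    headers={"x-api-key": "tenant-api-key"},  # validated by the auth proxy
    timeout=5,
)
resp.raise_for_status()
```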
The Auth Proxy (apps/logclaw-auth-proxy/):
- Extracts the `x-api-key` header
- Queries PostgreSQL (`logclaw_enterprise` DB) to validate the key
- Looks up the tenant ID associated with that API key
- Injects `tenant_id` as a resource attribute into the OTLP payload
- Forwards the modified payload to OTel Collector on port 4318
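A minimal sketch of that hot path, assuming a Flask-style handler; the table and column names, the helper structure, and the collector hostname are assumptions, not the real proxy code.

```python
# Illustrative auth-proxy sketch; schema, framework and hostnames are assumed.
import psycopg2
import requests
from flask import Flask, request, abort

app = Flask(__name__)
conn = psycopg2.connect(dbname="logclaw_enterprise", host="postgres")

@app.post("/v1/logs")
def ingest():
    api_key = request.headers.get("x-api-key")
    if not api_key:
        abort(401)
    with conn.cursor() as cur:
        # Assumed schema: api_keys(key, tenant_id)
        cur.execute("SELECT tenant_id FROM api_keys WHERE key = %s", (api_key,))
        row = cur.fetchone()
    if row is None:
        abort(403)

    payload = request.get_json()
    # Inject tenant_id as a resource attribute on every ResourceLogs entry
    for rl in payload.get("resourceLogs", []):
        rl.setdefault("resource", {}).setdefault("attributes", []).append(
            {"key": "tenant_id", "value": {"stringValue": str(row[0])}}
        )
    # Forward the modified payload to the OTel Collector (OTLP/HTTP, port 4318)
    resp = requests.post("http://otel-collector:4318/v1/logs", json=payload, timeout=5)
    return ("", resp.status_code)
```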
OTel Collector (charts/logclaw-otel-collector/):
- Receives OTLP HTTP on port 4318
- Applies batch processor (groups logs for efficiency)
- Exports to Kafka topic `raw-logs` via the `kafka` exporter
Step 2: Kafka (Message Bus)
Kafka stores logs in topics with configurable retention.
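The specific retention settings are not listed here. The sketch below shows how the four topics referenced elsewhere in this document might be created; the partition counts, retention values, and Strimzi-style bootstrap address are assumptions.

```python
# Illustrative topic setup; retention values and partition counts are assumptions.
# Only the topic names appear elsewhere in this document.
from confluent_kafka.admin import AdminClient, NewTopic

admin = AdminClient({"bootstrap.servers": "logclaw-kafka-bootstrap:9092"})

topics = [
    NewTopic("raw-logs",          num_partitions=6, replication_factor=3,
             config={"retention.ms": str(24 * 3600 * 1000)}),       # 1 day
    NewTopic("enriched-logs",     num_partitions=6, replication_factor=3,
             config={"retention.ms": str(3 * 24 * 3600 * 1000)}),   # 3 days
    NewTopic("anomaly-events",    num_partitions=3, replication_factor=3,
             config={"retention.ms": str(7 * 24 * 3600 * 1000)}),   # 7 days
    NewTopic("dead-letter-queue", num_partitions=1, replication_factor=3),
]

for topic, future in admin.create_topics(topics).items():
    future.result()  # raises if creation failed
```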
Step 3: Flink ETL (Unwrap OTLP → Flat JSON)
What it does: Converts the deeply nested OTLP format into a flat, searchable JSON record. Input: the `raw-logs` topic (nested OTLP). Output: the `enriched-logs` topic (flat JSON).
- Unwrap the 3-level OTLP nesting → flat record
- Generate deterministic `log_id` (UUID5 from trace+span+timestamp) for OpenSearch idempotency
- Normalize severity (`CRITICAL` → `FATAL`)
- Flatten attribute keys (`service.name` → `service_name`) — dots cause nested objects in OpenSearch
- If unparseable → send to `dead-letter-queue`
apps/flink-jobs/logclaw-etl/ (Java) or apps/bridge/ Thread 1 (Python)
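A sketch of this transform in the Python Bridge flavour; the flat field names are assumptions, while the rules (UUID5 `log_id`, severity normalization, dot-to-underscore keys, DLQ on parse failure) come from this document.

```python
# Illustrative ETL transform; output field names are assumed.
import json
import uuid

SEVERITY_MAP = {"CRITICAL": "FATAL"}  # normalize non-standard severities

def flatten_otlp(raw: bytes) -> list[dict]:
    """Unwrap resourceLogs -> scopeLogs -> logRecords into flat dicts."""
    payload = json.loads(raw)
    out = []
    for rl in payload["resourceLogs"]:
        resource_attrs = {
            a["key"].replace(".", "_"): list(a["value"].values())[0]
            for a in rl.get("resource", {}).get("attributes", [])
        }
        for sl in rl["scopeLogs"]:
            for rec in sl["logRecords"]:
                record = {
                    **resource_attrs,
                    "timestamp": rec["timeUnixNano"],
                    "severity": SEVERITY_MAP.get(rec.get("severityText", ""),
                                                 rec.get("severityText", "INFO")),
                    "body": rec.get("body", {}).get("stringValue", ""),
                    "trace_id": rec.get("traceId", ""),
                    "span_id": rec.get("spanId", ""),
                }
                # Deterministic id: re-processing the same log overwrites the
                # same OpenSearch document instead of duplicating it.
                record["log_id"] = str(uuid.uuid5(
                    uuid.NAMESPACE_OID,
                    f'{record["trace_id"]}{record["span_id"]}{record["timestamp"]}',
                ))
                out.append(record)
    return out

def process(raw: bytes, producer) -> None:
    try:
        for record in flatten_otlp(raw):
            producer.produce("enriched-logs", json.dumps(record).encode())
    except (json.JSONDecodeError, KeyError):
        producer.produce("dead-letter-queue", raw)  # unparseable -> DLQ
```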
Step 4: Flink Enrichment (Add ML Features from Feast)
What it does: Reads flat JSON from `enriched-logs`, queries Feast for ML features, and writes the record back with the added feature fields.
apps/flink-jobs/logclaw-enrichment/ (Java) or Bridge Thread 1 (Python)
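A hedged sketch of the per-log lookup via Feast's online store API; the feature-view name, the four feature names, the entity key, and the repo path are placeholders, since the document only says four per-service features exist.

```python
# Illustrative Feast online lookup; feature view, feature names and repo path are assumed.
from feast import FeatureStore

store = FeatureStore(repo_path="feature_repo")

FEATURES = [
    "service_stats:error_rate_1h",      # placeholder names for the
    "service_stats:log_volume_1h",      # four per-service ML features
    "service_stats:p95_latency_ms",
    "service_stats:unique_error_count",
]

def enrich(record: dict) -> dict:
    feats = store.get_online_features(
        features=FEATURES,
        entity_rows=[{"service_name": record["service_name"]}],
    ).to_dict()
    # get_online_features returns one list per feature; take the single row
    for full_name in FEATURES:
        name = full_name.split(":", 1)[1]
        record[name] = feats[name][0]
    return record
```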
Step 5: Flink Anomaly Scorer
Scoring Algorithm:
apps/flink-jobs/logclaw-anomaly-scorer/ (Java) or Bridge Thread 2 (Python)
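The document does not spell out the scoring formula here, so the sketch below is purely illustrative of the scorer's shape: consume `enriched-logs`, blend severity with the Feast features, and emit to `anomaly-events` above a threshold. The weights, threshold, and feature name are assumptions.

```python
# Purely illustrative scorer loop; weights, threshold and feature names are assumed.
# Only the topics (enriched-logs -> anomaly-events) come from this document.
import json
from confluent_kafka import Consumer, Producer

SEVERITY_WEIGHT = {"FATAL": 1.0, "ERROR": 0.7, "WARN": 0.3}
THRESHOLD = 0.8  # assumed cut-off for emitting an anomaly event

consumer = Consumer({"bootstrap.servers": "logclaw-kafka-bootstrap:9092",
                     "group.id": "anomaly-scorer", "auto.offset.reset": "earliest"})
producer = Producer({"bootstrap.servers": "logclaw-kafka-bootstrap:9092"})
consumer.subscribe(["enriched-logs"])

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    log = json.loads(msg.value())
    # Hypothetical blend of severity weight and an ML feature deviation
    score = (0.6 * SEVERITY_WEIGHT.get(log.get("severity"), 0.1)
             + 0.4 * min(log.get("error_rate_1h", 0.0), 1.0))
    if score >= THRESHOLD:
        log["anomaly_score"] = round(score, 3)
        producer.produce("anomaly-events", json.dumps(log).encode())
        producer.poll(0)
```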
Step 6: Bridge Thread 3 (Index + Incidents)
Flink stops at writing to `anomaly-events`. Bridge Thread 3 picks up:
- Index all enriched logs → `logclaw-logs-{tenant}` in OpenSearch
- Index anomaly events → `logclaw-anomalies-{tenant}` in OpenSearch
- Group anomalies by service + 5-min window → if >= 3 → create Incident
apps/bridge/ (Python, also embedded in charts/logclaw-bridge/templates/configmap-app.yaml)
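A sketch of the indexing and grouping rule; the incident index name, the document fields, and the window bookkeeping are assumptions, while the index patterns and the 3-anomalies-per-5-minutes rule come from this document.

```python
# Illustrative Thread-3 logic; incident index name and fields are assumed.
import time
from collections import defaultdict
from opensearchpy import OpenSearch, helpers

os_client = OpenSearch(hosts=["https://opensearch:9200"])
WINDOW_SECONDS = 300
open_windows: dict[tuple, list] = defaultdict(list)

def index_batch(tenant: str, logs: list[dict], anomalies: list[dict]) -> None:
    # Deterministic log_id as _id makes re-indexing idempotent (overwrite, not duplicate)
    actions = [{"_index": f"logclaw-logs-{tenant}", "_id": d["log_id"], "_source": d}
               for d in logs]
    actions += [{"_index": f"logclaw-anomalies-{tenant}", "_id": d["log_id"], "_source": d}
                for d in anomalies]
    helpers.bulk(os_client, actions)

def maybe_open_incident(tenant: str, anomaly: dict) -> None:
    bucket = int(time.time()) // WINDOW_SECONDS
    key = (tenant, anomaly["service_name"], bucket)
    open_windows[key].append(anomaly)
    if len(open_windows[key]) == 3:  # threshold reached exactly once per window
        os_client.index(
            index=f"logclaw-incidents-{tenant}",   # assumed incident index
            body={"status": "open",
                  "service": anomaly["service_name"],
                  "anomaly_count": 3,
                  "created_at": int(time.time())},
        )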
Step 7: Ticketing Agent (Root Cause + Alerts)
Polls OpenSearch for `status=open` incidents, generates a root cause analysis, and dispatches alerts.
apps/ticketing-agent/ (Python, also in configmap)
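A minimal sketch of the polling loop; the incident index pattern, the field names, the polling interval, and the downstream analyze/alert hooks are assumptions.

```python
# Illustrative incident poll; index name, fields and interval are assumed.
import time
from opensearchpy import OpenSearch

os_client = OpenSearch(hosts=["https://opensearch:9200"])

def poll_open_incidents(tenant: str) -> list[dict]:
    resp = os_client.search(
        index=f"logclaw-incidents-{tenant}",          # assumed index pattern
        body={"query": {"term": {"status": "open"}}, "size": 50},
    )
    return [hit["_source"] | {"_id": hit["_id"]} for hit in resp["hits"]["hits"]]

while True:
    for incident in poll_open_incidents("acme"):
        # Placeholder hooks for root-cause analysis and alert dispatch
        print("would analyze and alert for:", incident["_id"])
    time.sleep(60)  # assumed polling interval
```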
Airflow, Feast & Redis: The ML Feature Pipeline
Airflow’s sole purpose in LogClaw is to run periodic feature computation DAGs. It does not orchestrate any other services or ETL — that’s Flink/Bridge’s job.
How Features Get Into Redis
How Flink Enrichment Reads Features (Real-Time)
The Full Feature Cycle
Why Redis and Not OpenSearch Directly?
- Redis: Sub-millisecond per key lookup (0.1ms). Perfect for per-log real-time enrichment.
- OpenSearch: 10-50ms per query. Too slow when enriching thousands of logs per second.
- Feast is the abstraction layer — defines the feature schema, handles Redis read/write, and provides a clean API for the Flink job to call.
What Airflow Runs
| DAG | Schedule | What It Does |
|---|---|---|
| `feature_compute` | Every 1 hour | Queries OpenSearch for per-service aggregates, computes 4 ML features, writes to Redis via `feast materialize` |
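A hedged sketch of what that DAG might look like; the task names, the aggregation details, and the Feast repo path are assumptions, while the DAG name, the hourly schedule, and the OpenSearch → Feast/Redis flow come from the table above.

```python
# Illustrative feature_compute DAG; task names and paths are assumed.
from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.python import PythonOperator

def compute_features(**_):
    # Query OpenSearch for per-service aggregates and write them to the
    # Feast offline store (details omitted; see the enrichment step).
    ...

def materialize_to_redis(**_):
    from feast import FeatureStore
    store = FeatureStore(repo_path="/opt/feast/feature_repo")  # assumed path
    # Push the freshly computed rows into the Redis online store
    store.materialize_incremental(end_date=datetime.utcnow())

with DAG(
    dag_id="feature_compute",
    schedule=timedelta(hours=1),
    start_date=datetime(2024, 1, 1),
    catchup=False,
) as dag:
    compute = PythonOperator(task_id="compute_aggregates",
                             python_callable=compute_features)
    materialize = PythonOperator(task_id="feast_materialize",
                                 python_callable=materialize_to_redis)
    compute >> materialize
```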
Flink vs Bridge
| | Flink | Bridge |
|---|---|---|
| Language | Java (3 JARs) | Python (1 file, 3 threads) |
| Pods | 6+ (3 JM + 3 TM) | 1 |
| Requires | Flink Operator + node pool | Nothing extra |
| Scaling | Horizontal (parallelism) | Vertical only |
| Guarantees | Exactly-once (RocksDB checkpoints) | At-least-once (in-memory) |
| Cost | ~4 CPU, 6Gi RAM | ~200m CPU, 256Mi RAM |
| Use case | Production at scale | Dev, staging, small tenants |
At-least-once delivery is acceptable in practice because indexing is idempotent (deterministic `log_id` → OpenSearch overwrites).
Flink Runtime
Flink is a stream processing engine (not batch like Spring Batch). Jobs run continuously.
Kubernetes & Helm
Chart Structure
How helm upgrade Works
- Helm diffs rendered templates vs cluster state
- Deployments: Rolling update (new ReplicaSet up, old down, zero downtime)
- StatefulSets: Pods updated one at a time (ordered)
- ConfigMaps: Updated in-place; pods restart if Deployment has checksum annotation
- CRDs (Kafka, OpenSearch): Operator reconciles (Strimzi rolls brokers one by one)
- Success: Release revision stored as K8s Secret
- Failure: `helm rollback logclaw <revision>` restores previous state
ConfigMap = Deployment Artifact
For Python services, the source code is embedded in a ConfigMap. The apps/ source AND the ConfigMap must be kept in sync on every change.
Values Per Environment
External Dependencies
| Service | Image | Version |
|---|---|---|
| Kafka | strimzi/kafka | 0.50.0 + Kafka 4.0.0 |
| OpenSearch | opensearchproject/opensearch | 2.14.0 |
| OTel Collector | otel/opentelemetry-collector-contrib | 0.114.0 |
| Feast | feastdev/feature-server | 0.40.0 |
| PostgreSQL | bitnami/postgresql | latest |
| Redis | bitnami/redis | latest |
| Airflow | apache/airflow | 2.9.3 |
| Flink | apache/flink | 1.19 |