Monitoring CleanStart Deployments with Prometheus and Datadog

Observability is critical for CleanStart deployments. This guide covers metrics collection, logging from shell-less containers, health checks, and optional integration with CleanSight (which detects outdated CleanStart images in production and recommends upgrades). You'll also learn to use third-party monitoring solutions (Datadog, New Relic, Dynatrace) and set up container-level security monitoring.

Container Observability Fundamentals

Observability in containerized systems requires three signals:

Signal	Purpose	Collection
Metrics	Quantitative measurements (CPU, memory, requests/sec)	Prometheus scrape, agent collection
Logs	Structured events and errors	stdout/stderr, sidecar agents, journald
Traces	Distributed request flows across services	Instrumentation library + collector

CleanStart containers emit logs on stdout/stderr only: there is no shell and there are no internal log files. Metrics are exposed on a /metrics endpoint in the Prometheus text format, or pushed to a collector.
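To make the "Prometheus format" concrete, the sketch below renders one counter family as the plain text a /metrics endpoint returns. It is illustrative only: render_counter is a hypothetical helper, it skips label-value escaping, and real applications should use an official client library as shown later in this guide.

```python
def render_counter(name, help_text, samples):
    """Render one counter family in the Prometheus text exposition format.

    samples is a list of (labels_dict, value) pairs.
    Simplification: label values are not escaped.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in samples:
        label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
        lines.append(f"{name}{{{label_str}}} {value}")
    return "\n".join(lines) + "\n"


text = render_counter(
    "http_requests_total",
    "Total HTTP requests",
    [({"method": "GET", "status": "200"}, 42)],
)
print(text, end="")
```

Each sample is a single line of the form name{labels} value, which is why the format is easy to scrape and to inspect with any HTTP client.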

Prometheus Metrics for CleanStart Containers

Container Resource Metrics

The kubelet's built-in cAdvisor exposes resource metrics for every container, and Prometheus scrapes them from each node.

Query examples (Prometheus):

    # Cumulative CPU seconds consumed (a counter; wrap it in rate() to get usage)
    container_cpu_usage_seconds_total{pod=~"myapp.*"}

    # Memory usage (current)
    container_memory_usage_bytes{pod=~"myapp.*"}

    # Network in/out
    container_network_receive_bytes_total{pod=~"myapp.*"}
    container_network_transmit_bytes_total{pod=~"myapp.*"}

    # Disk I/O
    container_fs_reads_total{pod=~"myapp.*"}
    container_fs_writes_total{pod=~"myapp.*"}

    # CPU cores used per pod (5-minute average)
    rate(container_cpu_usage_seconds_total{pod=~"myapp.*"}[5m])

    # Memory percentage (of limit)
    (container_memory_usage_bytes{pod=~"myapp.*"} / container_spec_memory_limit_bytes{pod=~"myapp.*"}) * 100
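The rate() and percentage queries above reduce to simple arithmetic over scraped samples. The sketch below approximates them; simple_rate and memory_percent are hypothetical helpers, and the real rate() additionally extrapolates to the edges of the window, which this version omits.

```python
def simple_rate(samples):
    """Approximate PromQL rate(): per-second increase of a counter.

    samples is a list of (timestamp_seconds, counter_value), oldest first.
    """
    if len(samples) < 2:
        return None
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # A drop means the counter reset (for example, a container restart),
        # so the new value is counted from zero.
        increase += cur - prev if cur >= prev else cur
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed


def memory_percent(usage_bytes, limit_bytes):
    """Memory usage as a percentage of the container limit."""
    return usage_bytes / limit_bytes * 100


cpu_samples = [(0, 10.0), (150, 40.0), (300, 70.0)]  # CPU seconds over 5 minutes
cores_used = simple_rate(cpu_samples)
print(cores_used, memory_percent(450 * 1024**2, 500 * 1024**2))
```

A result of 0.2 from simple_rate means the container used one fifth of a CPU core on average over the window.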

Application-Level Metrics

Expose custom metrics in your application:

Python (Prometheus client library):

    from time import time

    from fastapi import FastAPI, Response
    from prometheus_client import (
        CONTENT_TYPE_LATEST, Counter, Gauge, Histogram, generate_latest,
    )

    app = FastAPI()

    # Counters (increment)
    request_count = Counter(
        'http_requests_total',
        'Total HTTP requests',
        ['method', 'endpoint', 'status'])

    # Histograms (measure latency)
    request_duration = Histogram(
        'http_request_duration_seconds',
        'HTTP request latency',
        buckets=[0.01, 0.05, 0.1, 0.5, 1.0, 5.0],
        labelnames=['method', 'endpoint'])

    # Gauges (current value)
    active_connections = Gauge(
        'active_connections',
        'Number of active connections')

    def get_connection_count():
        # Placeholder: replace with your server's real connection count
        return 0

    @app.middleware("http")
    async def metrics_middleware(request, call_next):
        start_time = time()
        response = await call_next(request)

        # Record counter (raw paths as labels can create high cardinality)
        request_count.labels(
            method=request.method,
            endpoint=request.url.path,
            status=response.status_code
        ).inc()

        # Record duration
        request_duration.labels(
            method=request.method,
            endpoint=request.url.path
        ).observe(time() - start_time)

        return response

    @app.get("/metrics")
    async def metrics():
        # generate_latest() returns bytes; send them with the Prometheus content type
        return Response(generate_latest(), media_type=CONTENT_TYPE_LATEST)

    @app.get("/health")
    async def health():
        # Update gauge
        active_connections.set(get_connection_count())
        return {"status": "healthy"}

Node.js (Prometheus client):

    const prometheus = require('prom-client');
    const express = require('express');

    const app = express();

    // Default metrics (CPU, memory, GC)
    prometheus.collectDefaultMetrics();

    // Custom histogram and counter
    const httpRequestDuration = new prometheus.Histogram({
      name: 'http_request_duration_seconds',
      help: 'HTTP request latency',
      labelNames: ['method', 'endpoint', 'status'],
      buckets: [0.01, 0.05, 0.1, 0.5, 1.0]
    });

    const httpRequests = new prometheus.Counter({
      name: 'http_requests_total',
      help: 'Total HTTP requests',
      labelNames: ['method', 'endpoint', 'status']
    });

    app.use((req, res, next) => {
      const start = Date.now();
      res.on('finish', () => {
        const duration = (Date.now() - start) / 1000;
        httpRequestDuration
          .labels(req.method, req.path, res.statusCode)
          .observe(duration);
        httpRequests
          .labels(req.method, req.path, res.statusCode)
          .inc();
      });
      next();
    });

    app.get('/metrics', async (req, res) => {
      res.set('Content-Type', prometheus.register.contentType);
      res.end(await prometheus.register.metrics());
    });

    app.get('/health', (req, res) => {
      res.json({ status: 'healthy' });
    });

    // Port 8080 matches the Service definition used later in this guide
    app.listen(8080);

Go (Prometheus client):

    package main

    import (
    	"net/http"
    	"strconv"
    	"time"

    	"github.com/prometheus/client_golang/prometheus"
    	"github.com/prometheus/client_golang/prometheus/promhttp"
    )

    var (
    	httpDuration = prometheus.NewHistogramVec(
    		prometheus.HistogramOpts{
    			Name:    "http_request_duration_seconds",
    			Help:    "HTTP request latency",
    			Buckets: []float64{0.01, 0.05, 0.1, 0.5, 1.0},
    		},
    		[]string{"method", "endpoint"},
    	)

    	httpRequests = prometheus.NewCounterVec(
    		prometheus.CounterOpts{
    			Name: "http_requests_total",
    			Help: "Total HTTP requests",
    		},
    		[]string{"method", "endpoint", "status"},
    	)
    )

    func init() {
    	prometheus.MustRegister(httpDuration, httpRequests)
    }

    // statusRecorder captures the status code written by the wrapped handler.
    type statusRecorder struct {
    	http.ResponseWriter
    	status int
    }

    func (r *statusRecorder) WriteHeader(code int) {
    	r.status = code
    	r.ResponseWriter.WriteHeader(code)
    }

    func recordMetrics(next http.Handler) http.Handler {
    	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
    		start := time.Now()
    		rec := &statusRecorder{ResponseWriter: w, status: http.StatusOK}
    		next.ServeHTTP(rec, r)
    		duration := time.Since(start).Seconds()
    		httpDuration.WithLabelValues(r.Method, r.URL.Path).Observe(duration)
    		// Count the request with its final status code
    		httpRequests.WithLabelValues(r.Method, r.URL.Path, strconv.Itoa(rec.status)).Inc()
    	})
    }

    func main() {
    	http.Handle("/metrics", promhttp.Handler())
    	http.HandleFunc("/health", func(w http.ResponseWriter, r *http.Request) {
    		w.WriteHeader(http.StatusOK)
    		w.Write([]byte(`{"status":"healthy"}`))
    	})
    	http.ListenAndServe(":8080", recordMetrics(http.DefaultServeMux))
    }

ServiceMonitor for Prometheus Operator

Define how Prometheus discovers and scrapes metrics:

    apiVersion: monitoring.coreos.com/v1
    kind: ServiceMonitor
    metadata:
      name: myapp
      namespace: production
      labels:
        app: myapp
    spec:
      # Selects Services (not pods) by label
      selector:
        matchLabels:
          app: myapp
      endpoints:
      - port: metrics
        interval: 30s
        path: /metrics
        scheme: http
        scrapeTimeout: 10s
        relabelings:
        # Add custom labels
        - sourceLabels: [__meta_kubernetes_pod_name]
          targetLabel: pod
        - sourceLabels: [__meta_kubernetes_pod_namespace]
          targetLabel: namespace
    ---
    # Service with metrics port
    apiVersion: v1
    kind: Service
    metadata:
      name: myapp
      namespace: production
      labels:
        app: myapp  # Must match the ServiceMonitor selector
    spec:
      selector:
        app: myapp
      ports:
      - name: metrics
        port: 8080
        targetPort: 8080
        protocol: TCP

Deploy ServiceMonitor:

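    kubectl apply -f servicemonitor.yaml

    # Verify Prometheus discovers the target.
    # The Prometheus image may not ship curl, so query through a port-forward
    # (run it in a separate terminal; prometheus-operated is the Service the operator creates).
    kubectl port-forward -n monitoring svc/prometheus-operated 9090:9090
    curl localhost:9090/api/v1/targets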

Grafana Dashboard for Metrics

Visualize metrics with Grafana:

    {
      "dashboard": {
        "title": "CleanStart Application Metrics",
        "panels": [
          {
            "title": "Request Rate",
            "targets": [
              { "expr": "rate(http_requests_total[5m])" }
            ],
            "type": "graph"
          },
          {
            "title": "Request Latency (p95)",
            "targets": [
              { "expr": "histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))" }
            ]
          },
          {
            "title": "Container CPU Usage",
            "targets": [
              { "expr": "rate(container_cpu_usage_seconds_total{pod=~\"myapp.*\"}[5m])" }
            ]
          },
          {
            "title": "Container Memory Usage",
            "targets": [
              { "expr": "container_memory_usage_bytes{pod=~\"myapp.*\"} / 1024 / 1024" }
            ]
          }
        ]
      }
    }
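The p95 panel relies on histogram_quantile, which estimates a quantile from cumulative bucket counts by linear interpolation inside the bucket that contains the target rank. The sketch below mirrors that idea under simplifying assumptions (sorted buckets, a lowest bound of 0, a final +Inf bucket); the function is a hypothetical stand-in, not Prometheus's own code.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate quantile q from cumulative histogram buckets.

    buckets is a list of (upper_bound, cumulative_count), sorted by
    upper_bound and ending with (math.inf, total_count).
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound) or count == lower_count:
                # Rank falls in the +Inf bucket (or an empty one):
                # the best available answer is the previous bound.
                return lower_bound
            # Linear interpolation inside the bucket
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound


latency_buckets = [(0.1, 50), (0.5, 90), (1.0, 98), (math.inf, 100)]
p95 = histogram_quantile(0.95, latency_buckets)
print(p95)
```

Because the estimate is interpolated, its accuracy depends on bucket boundaries: choose buckets that bracket the latency thresholds you alert on.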

Logging from Shell-Less Containers

CleanStart containers have no shell or internal logging system. All logs go to stdout/stderr.

Structured JSON Logging

Log structured JSON for easy parsing by log collectors:

Python:

    import json
    import sys
    from datetime import datetime, timezone

    def log(level, message, **extra):
        # UTC timestamp with a trailing "Z", e.g. 2024-01-01T12:00:00.123Z
        now = datetime.now(timezone.utc).isoformat(timespec="milliseconds")
        log_entry = {
            "timestamp": now.replace("+00:00", "Z"),
            "level": level,
            "message": message,
            **extra
        }
        # flush=True so lines reach the container runtime immediately
        print(json.dumps(log_entry), file=sys.stdout, flush=True)

    # Usage
    log("INFO", "Application started", version="1.0.0", environment="production")
    log("ERROR", "Database connection failed", error="timeout", host="db.local")
    log("WARN", "High memory usage", memory_mb=450, threshold_mb=500)
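If the application already uses the standard logging module, the same JSON shape can be produced with a custom formatter instead of a hand-written log() function. This is a minimal sketch: JsonFormatter, make_logger, and the "fields" key passed through extra are illustrative names, not part of any CleanStart tooling.

```python
import io
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Format each log record as a single JSON line."""

    def format(self, record):
        entry = {
            "timestamp": datetime.fromtimestamp(record.created, timezone.utc).isoformat(),
            "level": record.levelname,
            "message": record.getMessage(),
        }
        # Extra structured fields arrive via logger.info(..., extra={"fields": {...}})
        entry.update(getattr(record, "fields", {}))
        if record.exc_info:
            entry["exception"] = self.formatException(record.exc_info)
        return json.dumps(entry)


def make_logger(stream):
    """Build a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger("app")
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


buffer = io.StringIO()  # stand-in for sys.stdout so the output can be inspected
log = make_logger(buffer)
log.info("Application started", extra={"fields": {"version": "1.0.0"}})
print(buffer.getvalue(), end="")
```

In a container, pass sys.stdout to make_logger so the runtime captures every line.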

Node.js (pino logger):

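    const pino = require('pino');

    // pino writes one JSON object per line to stdout by default.
    // Avoid pretty-printing transports in production: they break JSON parsing.
    const logger = pino({
      messageKey: 'message',
      // Emit the level as a string ("info") instead of a number
      formatters: {
        level(label) {
          return { level: label };
        }
      },
      // ISO 8601 timestamp under the "timestamp" key
      timestamp: () => `,"timestamp":"${new Date().toISOString()}"`
    });

    // Usage
    logger.info({ version: '1.0.0' }, 'Application started');
    logger.error({ error: 'timeout', host: 'db.local' }, 'Database connection failed');
    logger.warn({ memory_mb: 450, threshold_mb: 500 }, 'High memory usage');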

Fluentd DaemonSet for Log Collection

Deploy Fluentd to collect logs from all pods:

    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: fluentd-config
      namespace: logging
    data:
      fluent.conf: |
        <source>
          @type tail
          path /var/log/containers/*_production_*.log
          pos_file /var/log/fluentd-containers.log.pos
          tag kubernetes.*
          <parse>
            @type json
            time_key timestamp
            time_format %Y-%m-%dT%H:%M:%S.%NZ
          </parse>
        </source>

        # Enrich records with pod metadata (the plugin's default tag regex is used)
        <filter kubernetes.**>
          @type kubernetes_metadata
          kubernetes_url "#{ENV['FLUENT_FILTER_KUBERNETES_URL'] || 'http://127.0.0.1:8080'}"
        </filter>

        <match kubernetes.var.log.containers.*_production_*.log>
          @type elasticsearch
          @id output_es
          @log_level info
          include_tag_key true
          host elasticsearch
          port 9200
          # With logstash_format, indices are named kubernetes-YYYY.MM.DD
          logstash_format true
          logstash_prefix kubernetes
          <buffer>
            @type file
            path /var/log/fluentd-buffers/kubernetes.system.buffer
            flush_mode interval
            retry_type exponential_backoff
            flush_interval 5s
            retry_forever false
            retry_max_interval 30
            chunk_limit_size "#{ENV['FLUENT_ELASTICSEARCH_BUFFER_CHUNK_LIMIT_SIZE'] || '8M'}"
            queue_limit_length "#{ENV['FLUENT_ELASTICSEARCH_BUFFER_QUEUE_LIMIT_LENGTH'] || '256'}"
            flush_thread_count "#{ENV['FLUENT_ELASTICSEARCH_BUFFER_FLUSH_THREAD_COUNT'] || '1'}"
          </buffer>
        </match>
    ---
    apiVersion: apps/v1
    kind: DaemonSet
    metadata:
      name: fluentd
      namespace: logging
    spec:
      selector:
        matchLabels:
          app: fluentd
      template:
        metadata:
          labels:
            app: fluentd
        spec:
          serviceAccountName: fluentd
          tolerations:
          - key: node-role.kubernetes.io/master
            effect: NoSchedule
          - key: node-role.kubernetes.io/control-plane
            effect: NoSchedule
          containers:
          - name: fluentd
            image: fluent/fluentd-kubernetes-daemonset:v1-debian-elasticsearch
            env:
              - name: FLUENT_ELASTICSEARCH_HOST
                value: "elasticsearch"
              - name: FLUENT_ELASTICSEARCH_PORT
                value: "9200"
              - name: FLUENT_ELASTICSEARCH_SCHEME
                value: "http"
              - name: FLUENTD_SYSTEMD_CONF
                value: disable
            resources:
              limits:
                memory: 512Mi
              requests:
                cpu: 100m
                memory: 256Mi
            volumeMounts:
            - name: varlog
              mountPath: /var/log
            - name: varlibdockercontainers
              mountPath: /var/lib/docker/containers
              readOnly: true
            - name: config
              mountPath: /fluentd/etc/fluent.conf
              subPath: fluent.conf
          volumes:
          - name: varlog
            hostPath:
              path: /var/log
          - name: varlibdockercontainers
            hostPath:
              path: /var/lib/docker/containers
          - name: config
            configMap:
              name: fluentd-config
    ---
    apiVersion: v1
    kind: ServiceAccount
    metadata:
      name: fluentd
      namespace: logging
    ---
    apiVersion: rbac.authorization.k8s.io/v1
    kind: ClusterRole
    metadata:
      name: fluentd
    rules:
    - apiGroups:
      - ""
      resources:
      - pods
      - namespaces
      verbs:
      - get
      - list
      - watch
    ---
    apiVersion: rbac.authorization.k8s.io/v1
    kind: ClusterRoleBinding
    metadata:
      name: fluentd
    roleRef:
      apiGroup: rbac.authorization.k8s.io
      kind: ClusterRole
      name: fluentd
    subjects:
    - kind: ServiceAccount
      name: fluentd
      namespace: logging

Fluent Bit (Lighter Alternative)

For resource-constrained environments:

    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: fluent-bit-config
      namespace: logging
    data:
      fluent-bit.conf: |
        [SERVICE]
            Daemon Off
            Flush 1
            Log_Level info

        [INPUT]
            Name tail
            Path /var/log/containers/*_production_*.log
            Parser json
            Tag kube.*
            Mem_Buf_Limit 5MB
            Skip_Long_Lines On

        [FILTER]
            Name kubernetes
            Match kube.*

        # Fluent Bit's Elasticsearch output plugin is named "es"
        [OUTPUT]
            Name es
            Match *
            Host elasticsearch
            Port 9200
            Logstash_Format On
            Logstash_Prefix kubernetes
            Retry_Limit False
    ---
    apiVersion: apps/v1
    kind: DaemonSet
    metadata:
      name: fluent-bit
      namespace: logging
    spec:
      selector:
        matchLabels:
          app: fluent-bit
      template:
        metadata:
          labels:
            app: fluent-bit
        spec:
          containers:
          - name: fluent-bit
            image: fluent/fluent-bit:latest
            volumeMounts:
            - name: varlog
              mountPath: /var/log
            - name: varlibdockercontainers
              mountPath: /var/lib/docker/containers
              readOnly: true
            - name: config
              mountPath: /fluent-bit/etc/
            resources:
              limits:
                memory: 100Mi
              requests:
                cpu: 50m
                memory: 50Mi
          volumes:
          - name: varlog
            hostPath:
              path: /var/log
          - name: varlibdockercontainers
            hostPath:
              path: /var/lib/docker/containers
          - name: config
            configMap:
              name: fluent-bit-config
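Both collectors select files with the glob /var/log/containers/*_production_*.log. This works because the kubelet names those files pod_namespace_container-id.log, so the middle segment is the namespace. The sketch below checks a few example names against the same pattern; the file names and shortened container IDs are made up for illustration.

```python
from fnmatch import fnmatch

# Same glob as the Path/path setting in the collector configs
PATTERN = "/var/log/containers/*_production_*.log"


def is_collected(path):
    """Return True if the collector's glob would pick up this log file."""
    return fnmatch(path, PATTERN)


# Hypothetical file names in the pod_namespace_container-id.log layout
examples = [
    "/var/log/containers/myapp-7d9f8-abcde_production_myapp-0123abcd.log",
    "/var/log/containers/myapp-7d9f8-abcde_staging_myapp-89abcdef.log",
    "/var/log/containers/coredns-5d78c_kube-system_coredns-4567cdef.log",
]
for path in examples:
    print(path, is_collected(path))
```

To collect another namespace, change the middle segment of the glob, or use * there to collect every namespace.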

Alerting Rules

PrometheusRule Examples

    apiVersion: monitoring.coreos.com/v1
    kind: PrometheusRule
    metadata:
      name: myapp-alerts
      namespace: production
    spec:
      groups:
      - name: cleanstart.rules
        interval: 30s
        rules:
        # Image vulnerability alerts
        - alert: ImageVulnerabilityDetected
          expr: |
            vulnerabilities_found{pod=~"myapp.*"} > 0
          for: 1m
          labels:
            severity: critical
            component: security
          annotations:
            summary: "Vulnerability detected in running image"
            description: "{{ $value }} vulnerabilities found in {{ $labels.pod }}"
            runbook: "https://wiki.example.com/runbooks/image-vulnerability"

        # Container restart alerts (restart counter from kube-state-metrics)
        - alert: ContainerRestartingTooOften
          expr: |
            increase(kube_pod_container_status_restarts_total{pod=~"myapp.*"}[5m]) > 1
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Container {{ $labels.pod }} restarting frequently"
            description: "Pod restarted {{ $value }} times in 5 minutes"

        # Resource exhaustion
        - alert: HighMemoryUsage
          expr: |
            (container_memory_usage_bytes{pod=~"myapp.*"} / container_spec_memory_limit_bytes{pod=~"myapp.*"}) > 0.9
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "High memory usage in {{ $labels.pod }}"
            description: "Memory usage {{ $value | humanizePercentage }}"

        # Fires above 0.8 CPU cores
        - alert: HighCPUUsage
          expr: |
            rate(container_cpu_usage_seconds_total{pod=~"myapp.*"}[5m]) > 0.8
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "High CPU usage in {{ $labels.pod }}"
            description: "CPU usage {{ $value | humanizePercentage }}"

        # Application performance (grouped by job so {{ $labels.job }} is populated)
        - alert: HighErrorRate
          expr: |
            (sum by (job) (rate(http_requests_total{status=~"5.."}[5m])) / sum by (job) (rate(http_requests_total[5m]))) > 0.05
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "High error rate in {{ $labels.job }}"
            description: "Error rate is {{ $value | humanizePercentage }}"

        - alert: HighLatency
          expr: |
            histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 1
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "High request latency"
            description: "p95 latency is {{ $value }}s"

        # Registry/image issues
        - alert: ImagePullFailed
          expr: |
            increase(pod_image_pull_errors_total{pod=~"myapp.*"}[5m]) > 0
          labels:
            severity: critical
          annotations:
            summary: "Failed to pull image for pod {{ $labels.pod }}"
            description: "Image pull failed {{ $value }} times"

        # Cluster health
        - alert: NodeNotReady
          expr: |
            kube_node_status_condition{condition="Ready",status="true"} == 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "Node {{ $labels.node }} is not ready"

        - alert: PersistentVolumeClaimPending
          expr: |
            kube_persistentvolumeclaim_status_phase{phase="Pending"} == 1
          for: 15m
          labels:
            severity: warning
          annotations:
            summary: "PVC {{ $labels.persistentvolumeclaim }} stuck pending"

Deploy alerts:

    kubectl apply -f alerts.yaml

    # Verify alert rules loaded
    kubectl get prometheusrule -n production
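The for: field in each rule means the expression must stay true continuously for that duration before the alert fires; until then the alert is pending, and a single evaluation below the threshold resets the timer. A minimal sketch of that state logic, using a hypothetical alert_state helper and made-up error-rate samples:

```python
def alert_state(samples, threshold, for_seconds):
    """Return 'inactive', 'pending', or 'firing' as of the last evaluation.

    samples is a list of (timestamp_seconds, value), one per rule
    evaluation, oldest first.
    """
    active_since = None
    for ts, value in samples:
        if value > threshold:
            if active_since is None:
                active_since = ts  # condition just became true
        else:
            active_since = None  # any dip resets the timer
    if active_since is None:
        return "inactive"
    if samples[-1][0] - active_since >= for_seconds:
        return "firing"
    return "pending"


# HighErrorRate: threshold 0.05, for: 5m, evaluated every 30s
brief_spike = [(0, 0.01), (30, 0.08), (60, 0.09)]
sustained = [(t, 0.08) for t in range(30, 331, 30)]
print(alert_state(brief_spike, 0.05, 300), alert_state(sustained, 0.05, 300))
```

Short for: values page faster but are noisier; longer values suppress brief spikes at the cost of detection delay.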

Health Check Patterns

Kubernetes Health Probes

CleanStart containers run without a shell, so probes cannot use exec commands that depend on shell scripts. Use httpGet probes (or tcpSocket/grpc probes) instead.

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: myapp
      namespace: production
    spec:
      replicas: 3
      # apps/v1 requires a selector that matches the pod template labels
      selector:
        matchLabels:
          app: myapp
      template:
        metadata:
          labels:
            app: myapp
        spec:
          containers:
          - name: myapp
            image: registry.cleanstart.com/python3:3.12.5-prod
            ports:
            - name: http
              containerPort: 8080
            - name: metrics
              containerPort: 8081

            # Startup probe: runs until it first succeeds; other probes wait for it
            startupProbe:
              httpGet:
                path: /startup
                port: http
              initialDelaySeconds: 5
              periodSeconds: 10
              timeoutSeconds: 5
              failureThreshold: 30  # 30 * 10 = 300 seconds max startup time

            # Readiness probe: ready to serve traffic
            readinessProbe:
              httpGet:
                path: /ready
                port: http
              initialDelaySeconds: 10
              periodSeconds: 5
              timeoutSeconds: 3
              failureThreshold: 2
              # If fails: pod removed from service, but not restarted

            # Liveness probe: container alive
            livenessProbe:
              httpGet:
                path: /alive
                port: http
              initialDelaySeconds: 30
              periodSeconds: 10
              timeoutSeconds: 3
              failureThreshold: 3
              # If fails: container is restarted
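The probe settings translate into time budgets: roughly periodSeconds times failureThreshold of consecutive failures before Kubernetes acts. The sketch below works through the numbers from the manifest above; the helper names are illustrative, and the results are approximations that ignore probe timeouts and scheduling jitter.

```python
def max_startup_seconds(initial_delay, period, failure_threshold):
    """Approximate time a container has to pass its startup probe."""
    return initial_delay + period * failure_threshold


def seconds_until_action(period, failure_threshold):
    """Approximate time of continuous failure before Kubernetes reacts."""
    return period * failure_threshold


startup_budget = max_startup_seconds(initial_delay=5, period=10, failure_threshold=30)
unready_after = seconds_until_action(period=5, failure_threshold=2)
restart_after = seconds_until_action(period=10, failure_threshold=3)

print(f"startup budget:         ~{startup_budget}s")
print(f"removed from Service:   ~{unready_after}s of failed readiness checks")
print(f"container restarted:    ~{restart_after}s of failed liveness checks")
```

Keeping the readiness window shorter than the liveness window, as here, lets a struggling pod stop receiving traffic before it is restarted.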

Health Check Endpoints

Implement health check endpoints in your application:

Python:

    import asyncio

    from fastapi import FastAPI
    from fastapi.responses import JSONResponse

    app = FastAPI()

    # Global state
    initialized = False

    @app.on_event("startup")
    async def startup():
        global initialized
        # Perform heavy initialization
        await asyncio.sleep(2)
        initialized = True

    @app.get("/startup")
    async def startup_probe():
        # Only passes once initialization is complete
        if not initialized:
            return JSONResponse({"status": "initializing"}, status_code=503)
        return {"status": "initialized"}

    @app.get("/ready")
    async def readiness_probe():
        # Check dependencies (db and redis are your application's own clients)
        try:
            db.ping()
            redis.ping()
        except Exception as e:
            # FastAPI needs an explicit response object to set a non-200 status
            return JSONResponse(
                {"status": "not_ready", "reason": str(e)}, status_code=503)
        return {"status": "ready"}

    @app.get("/alive")
    async def liveness_probe():
        # Check basic operation
        return {"status": "alive"}

Node.js:

    const express = require('express');
    const app = express();

    let initialized = false;

    // Startup
    setTimeout(() => {
      // Initialization complete
      initialized = true;
      console.log('Application initialized');
    }, 2000);

    app.get('/startup', (req, res) => {
      if (!initialized) {
        res.status(503).json({ status: 'initializing' });
      } else {
        res.status(200).json({ status: 'initialized' });
      }
    });

    app.get('/ready', (req, res) => {
      // Check dependencies (db and redis are your application's own clients)
      if (!db.connected || !redis.connected) {
        res.status(503).json({
          status: 'not_ready',
          reason: 'dependency_unavailable'
        });
      } else {
        res.status(200).json({ status: 'ready' });
      }
    });

    app.get('/alive', (req, res) => {
      res.status(200).json({ status: 'alive' });
    });
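For services that do not use a web framework, the same three endpoints can be served with the Python standard library alone, which keeps the image free of extra dependencies. This is a sketch under assumptions: STATE, probe_response, and ProbeHandler are illustrative names, and the application is expected to flip the state flags itself.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# The application updates these flags as it starts up and checks dependencies
STATE = {"initialized": False, "dependencies_ok": True}


def probe_response(path, state):
    """Map a probe path to an (http_status, json_body) pair."""
    if path == "/startup":
        ok, name = state["initialized"], "initialized"
    elif path == "/ready":
        ok, name = state["initialized"] and state["dependencies_ok"], "ready"
    elif path == "/alive":
        ok, name = True, "alive"
    else:
        return 404, {"status": "not_found"}
    return (200, {"status": name}) if ok else (503, {"status": "not_" + name})


class ProbeHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        status, body = probe_response(self.path, STATE)
        payload = json.dumps(body).encode()
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def serve(port=8080):
    """Blocking call; run it in a background thread next to the application."""
    HTTPServer(("", port), ProbeHandler).serve_forever()
```

Keeping the decision logic in probe_response, separate from the HTTP handler, makes it straightforward to unit test without opening a socket.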

CleanSight Integration (Optional)

CleanSight is an optional multi-cloud discovery platform whose primary purpose is detecting outdated CleanStart images running in your production environments and recommending upgrades to newer, patched versions. It is not mandatory — organizations can connect their own repositories and monitoring tools instead.

What CleanSight Does

Core function: Find old CleanStart images in production → suggest newer ones.

CleanSight deploys lightweight discovery agents into your cloud environments. These agents scan your Kubernetes clusters, container registries, and container services to build an inventory of every CleanStart image running in production. CleanSight then compares each image against the latest available version from registry.cleanstart.com and flags images that are behind by a patch version or a minor version, as well as images affected by newly disclosed CVEs.

Component	Purpose
Agent Manager	Orchestrates discovery agents across clouds, manages lifecycle
Discovery Agents	Cloud-specific agents for AWS (EKS/ECR/ECS), Azure (AKS/ACR), GCP (GKE/Artifact Registry/Cloud Run)
Image Inventory	Catalogs every CleanStart image running in your clusters
Version Comparison	Compares running images against latest available from `registry.cleanstart.com`
Upgrade Recommendations	Identifies outdated images and recommends specific upgrade targets
SBOM Generation	Syft integration for Software Bill of Materials (supports compliance)
Vulnerability Context	Trivy/Grype scanning shows which CVEs are resolved by upgrading (added as a supplementary capability)
Web Dashboard	Unified view of image freshness across all clusters
REST/gRPC APIs	Programmatic access for CI/CD integration
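The version comparison step can be pictured as parsing image tags and classifying how far behind a running image is. The sketch below is not CleanSight's implementation: it assumes tags follow the name:MAJOR.MINOR.PATCH-variant pattern seen in this guide (for example python3:3.12.5-prod), does not handle registry hosts with a port, and uses hypothetical function names.

```python
import re

TAG_RE = re.compile(
    r"^(?P<name>[^:]+):(?P<major>\d+)\.(?P<minor>\d+)\.(?P<patch>\d+)-(?P<variant>.+)$"
)


def parse_tag(image):
    """Split 'python3:3.12.5-prod' into (name, (3, 12, 5), 'prod')."""
    match = TAG_RE.match(image)
    if match is None:
        raise ValueError(f"unrecognized image tag: {image}")
    version = (int(match["major"]), int(match["minor"]), int(match["patch"]))
    return match["name"], version, match["variant"]


def classify(running, latest):
    """Describe how far the running image is behind the latest one."""
    r_name, r_version, r_variant = parse_tag(running)
    l_name, l_version, l_variant = parse_tag(latest)
    if (r_name, r_variant) != (l_name, l_variant):
        raise ValueError("images are not the same name and variant")
    if r_version >= l_version:
        return "current"
    if r_version[0] != l_version[0]:
        return "major_behind"
    if r_version[1] != l_version[1]:
        return "minor_behind"
    return "patch_behind"


print(classify("python3:3.12.3-prod", "python3:3.12.5-prod"))
```

Comparing versions as integer tuples avoids the string-comparison trap where "3.9" would sort after "3.12".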

When to Use CleanSight

Multiple clusters: You run CleanStart images across several clusters and need visibility into which ones are outdated.
Automated upgrade recommendations: CleanSight tells you exactly which image tag to pull.
Multi-cloud environments: Single pane of glass across AWS, Azure, and GCP.
Compliance requirements: Demonstrate that production images are current and patched.
Not mandatory: Organizations can monitor image freshness through their own tooling, registry webhooks, or CI/CD checks instead.

Deploying CleanSight Agent Manager

Option 1: Docker Compose (Single Machine)

    # docker-compose.yml
    version: '3.8'
    services:
      cleansight-agent-manager:
        image: cleansight/agent-manager:latest
        ports:
          - "9090:9090"      # Web dashboard
          - "50051:50051"    # gRPC API
        environment:
          CLEANSIGHT_REGION: us-west-2
          CLEANSIGHT_LOG_LEVEL: info
          AWS_ACCESS_KEY_ID: ${AWS_ACCESS_KEY_ID}
          AWS_SECRET_ACCESS_KEY: ${AWS_SECRET_ACCESS_KEY}
          AZURE_CLIENT_ID: ${AZURE_CLIENT_ID}
          AZURE_CLIENT_SECRET: ${AZURE_CLIENT_SECRET}
          GCP_PROJECT_ID: ${GCP_PROJECT_ID}
          GCP_SERVICE_ACCOUNT_JSON: ${GCP_SERVICE_ACCOUNT_JSON}
        volumes:
          - ./config:/etc/cleansight
          - cleansight-data:/var/lib/cleansight
        restart: unless-stopped
        healthcheck:
          test: ["CMD", "curl", "-f", "http://localhost:9090/health"]
          interval: 30s
          timeout: 10s
          retries: 3

    volumes:
      cleansight-data:

Start CleanSight:

    docker-compose up -d

    # Wait for initialization
    sleep 10

    # Access dashboard (open is macOS; use xdg-open on Linux)
    open http://localhost:9090

    # Logs
    docker-compose logs -f cleansight-agent-manager

Option 2: Kubernetes

    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: cleansight-config
      namespace: cleansight
    data:
      config.yaml: |
        agent_manager:
          web_port: 9090
          grpc_port: 50051
          log_level: info

        cloud_providers:
          aws:
            enabled: true
            regions: [us-west-2, us-east-1, eu-west-1]
          azure:
            enabled: true
            subscriptions: []  # Auto-discover if empty
          gcp:
            enabled: true
            projects: []

        scanning:
          schedule: "0 */6 * * *"  # Every 6 hours
          sbom_tool: syft
          vulnerability_tool: trivy
          timeout_seconds: 3600
    ---
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: cleansight-agent-manager
      namespace: cleansight
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: cleansight
      template:
        metadata:
          labels:
            app: cleansight
        spec:
          serviceAccountName: cleansight
          containers:
          - name: cleansight
            image: cleansight/agent-manager:latest
            ports:
            - name: web
              containerPort: 9090
            - name: grpc
              containerPort: 50051
            env:
            - name: CLEANSIGHT_CONFIG_PATH
              value: /etc/cleansight/config.yaml
            - name: AWS_ACCESS_KEY_ID
              valueFrom:
                secretKeyRef:
                  name: cloud-credentials
                  key: aws-key
            - name: AWS_SECRET_ACCESS_KEY
              valueFrom:
                secretKeyRef:
                  name: cloud-credentials
                  key: aws-secret
            resources:
              requests:
                memory: "512Mi"
                cpu: "250m"
              limits:
                memory: "2Gi"
                cpu: "1000m"
            livenessProbe:
              httpGet:
                path: /health
                port: web
              initialDelaySeconds: 30
              periodSeconds: 10
            readinessProbe:
              httpGet:
                path: /ready
                port: web
              initialDelaySeconds: 10
              periodSeconds: 5
            volumeMounts:
            - name: config
              mountPath: /etc/cleansight
          volumes:
          - name: config
            configMap:
              name: cleansight-config
    ---
    apiVersion: v1
    kind: Service
    metadata:
      name: cleansight-agent-manager
      namespace: cleansight
    spec:
      selector:
        app: cleansight
      ports:
      - name: web
        port: 9090
        targetPort: web
      - name: grpc
        port: 50051
        targetPort: grpc
      type: ClusterIP
    ---
    apiVersion: v1
    kind: ServiceAccount
    metadata:
      name: cleansight
      namespace: cleansight
    ---
    apiVersion: rbac.authorization.k8s.io/v1
    kind: ClusterRole
    metadata:
      name: cleansight
    rules:
    - apiGroups: [""]
      resources: ["pods"]
      verbs: ["get", "list", "watch"]
    - apiGroups: [""]
      resources: ["pods/log"]
      verbs: ["get"]
    - apiGroups: ["apps"]
      resources: ["deployments", "statefulsets", "daemonsets"]
      verbs: ["get", "list", "watch"]
    - apiGroups: ["batch"]
      resources: ["jobs", "cronjobs"]
      verbs: ["get", "list", "watch"]
    ---
    apiVersion: rbac.authorization.k8s.io/v1
    kind: ClusterRoleBinding
    metadata:
      name: cleansight
    roleRef:
      apiGroup: rbac.authorization.k8s.io
      kind: ClusterRole
      name: cleansight
    subjects:
    - kind: ServiceAccount
      name: cleansight
      namespace: cleansight

Deploy to Kubernetes:

    kubectl create namespace cleansight
    kubectl apply -f cleansight-deployment.yaml

    # Check status
    kubectl get pods -n cleansight
    kubectl logs -f deployment/cleansight-agent-manager -n cleansight

    # Port-forward to access dashboard (blocks; run it in a separate terminal)
    kubectl port-forward -n cleansight svc/cleansight-agent-manager 9090:9090
    open http://localhost:9090

Connecting CleanSight to Your Cluster

Configuration:

Open CleanSight dashboard: http://localhost:9090
Go to Settings → Cloud Providers
For Kubernetes (AWS EKS):
    Cluster Name: myapp-production
    Region: us-west-2
    OIDC Provider: oidc.eks.us-west-2.amazonaws.com/id/ABC123
    Service Account Role: arn:aws:iam::123456789:role/cleansight-eks-role
For Kubernetes (Azure AKS):
    Cluster Name: myapp-production
    Resource Group: myapp-rg
    Subscription: my-subscription
Click Test Connection → Save

CleanSight automatically:

Discovers all running containers and identifies CleanStart images.
Compares each image version against the latest available from registry.cleanstart.com.
Flags outdated images and recommends specific upgrade targets (e.g., "python3:3.12.3-prod → python3:3.12.5-prod").
Shows which CVEs are resolved by upgrading to the recommended version.
Generates SBOMs for compliance and audit trails.
Tracks upgrade adoption across your fleet over time.

Accessing CleanSight Data

Web Dashboard:

The CleanSight web dashboard is accessible at http://localhost:9090 and provides several key sections. The Image Inventory displays all CleanStart images running in your clusters, including version numbers, age of each image, and current status. The Upgrade Recommendations section flags outdated images and provides recommended target versions. The CVE Impact view shows which vulnerabilities are resolved by upgrading to recommended versions, along with severity levels and affected image counts.

The Fleet Overview provides fleet-wide metrics including upgrade adoption rate and image freshness trends across your clusters. The Reports section generates compliance documentation such as SBOMs, vulnerability scan history, and audit trails. Finally, the Settings area allows you to configure cloud provider integrations, upgrade scheduling policies, and notification webhooks for alerting.

REST API:

    # List all CleanStart images discovered across clusters
    curl http://localhost:9090/api/v1/images

    # Get upgrade recommendations for outdated images
    curl http://localhost:9090/api/v1/images/outdated

    # Get specific image details and recommended upgrade target
    curl http://localhost:9090/api/v1/images/sha256:abc123/upgrade-recommendation

    # Get CVEs resolved by upgrading a specific image
    curl http://localhost:9090/api/v1/images/sha256:abc123/cves-resolved

    # Generate SBOM for compliance
    curl http://localhost:9090/api/v1/images/sha256:abc123/sbom -o sbom.json

    # Trigger a discovery scan across connected clusters
    curl -X POST http://localhost:9090/api/v1/scans \
      -H "Content-Type: application/json" \
      -d '{"provider":"aws","region":"us-west-2"}'
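The same calls can be scripted, for example in a CI/CD check. The sketch below uses only the Python standard library and the endpoint paths from the curl examples above; the helper names are illustrative, and the shape of the JSON responses is not documented here, so fetch_outdated simply returns whatever the server sends.

```python
import json
import urllib.request

BASE_URL = "http://localhost:9090"  # dashboard address used in this guide


def build_scan_request(provider, region, base_url=BASE_URL):
    """Build the POST /api/v1/scans request without sending it."""
    body = json.dumps({"provider": provider, "region": region}).encode()
    return urllib.request.Request(
        f"{base_url}/api/v1/scans",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def fetch_outdated(base_url=BASE_URL, timeout=10):
    """GET /api/v1/images/outdated and return the decoded JSON body."""
    url = f"{base_url}/api/v1/images/outdated"
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)


request = build_scan_request("aws", "us-west-2")
print(request.get_method(), request.full_url)
# To send it: urllib.request.urlopen(request, timeout=10)
```

Building the request separately from sending it keeps the script testable without a running CleanSight instance.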

gRPC API (for integrations):

    import grpc
    from cleansight.v1 import container_service_pb2, container_service_pb2_grpc

    channel = grpc.secure_channel('cleansight-agent-manager:50051', grpc.ssl_channel_credentials())
    stub = container_service_pb2_grpc.ContainerServiceStub(channel)

    # List containers
    response = stub.ListContainers(container_service_pb2.ListContainersRequest())
    for container in response.containers:
        print(f"{container.image}: {len(container.vulnerabilities)} vulns")

WebSocket Real-time Events:

    const ws = new WebSocket('ws://localhost:9090/api/v1/events');

    ws.onmessage = (event) => {
      const message = JSON.parse(event.data);
      console.log('Event:', message);
      // {
      //   "type": "vulnerability_found",
      //   "image": "registry.cleanstart.com/python3:3.12.5-prod",
      //   "vulnerability": { "id": "CVE-2024-1234", "severity": "HIGH" }
      // }
    };
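A consumer of this event stream usually filters before notifying anyone. The sketch below decides whether a raw message warrants an alert, based on the event shape shown in the comment above; should_alert is a hypothetical helper, and every severity value other than "HIGH" is an assumption rather than something this guide confirms.

```python
import json

# Assumed ordering; only "HIGH" appears in the sample event above
SEVERITY_ORDER = ["LOW", "MEDIUM", "HIGH", "CRITICAL"]


def should_alert(raw_message, min_severity="HIGH"):
    """Return True for vulnerability events at or above min_severity."""
    event = json.loads(raw_message)
    if event.get("type") != "vulnerability_found":
        return False
    severity = event.get("vulnerability", {}).get("severity", "")
    if severity not in SEVERITY_ORDER:
        return False
    return SEVERITY_ORDER.index(severity) >= SEVERITY_ORDER.index(min_severity)


sample = json.dumps({
    "type": "vulnerability_found",
    "image": "registry.cleanstart.com/python3:3.12.5-prod",
    "vulnerability": {"id": "CVE-2024-1234", "severity": "HIGH"},
})
print(should_alert(sample))
```

Unknown event types and unrecognized severities are ignored rather than raised, so new event kinds do not break the consumer.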

Bring Your Own Monitoring

CleanStart containers are glibc-based Linux images, so third-party monitoring agents work natively.

Datadog Agent

    apiVersion: apps/v1
    kind: DaemonSet
    metadata:
      name: datadog-agent
      namespace: datadog
    spec:
      selector:
        matchLabels:
          app: datadog-agent
      template:
        metadata:
          labels:
            app: datadog-agent
        spec:
          serviceAccountName: datadog-agent
          hostNetwork: true
          hostPID: true
          containers:
          - name: agent
            image: datadog/agent:latest
            env:
            - name: DD_API_KEY
              valueFrom:
                secretKeyRef:
                  name: datadog-api-key
                  key: api-key
            - name: DD_KUBERNETES_KUBELET_HOST
              valueFrom:
                fieldRef:
                  fieldPath: status.hostIP
            - name: DD_APM_ENABLED
              value: "true"
            - name: DD_LOGS_ENABLED
              value: "true"
            resources:
              limits:
                memory: 256Mi
              requests:
                cpu: 100m
                memory: 128Mi
            volumeMounts:
            - name: docker
              mountPath: /var/run/docker.sock
            - name: proc
              mountPath: /host/proc
              readOnly: true
            - name: sys
              mountPath: /host/sys
              readOnly: true
          volumes:
          - name: docker
            hostPath:
              path: /var/run/docker.sock
          - name: proc
            hostPath:
              path: /proc
          - name: sys
            hostPath:
              path: /sys

Application instrumentation (Python):

    from ddtrace import patch_all

    # Patch all supported libraries before they are imported
    patch_all()

    import logging

    from fastapi import FastAPI

    app = FastAPI()

    # Datadog logger
    logging.basicConfig(
        format='{"timestamp": "%(asctime)s", "level": "%(levelname)s", "message": "%(message)s"}')

    # Application code (auto-traced by Datadog)
    @app.get("/api/users/{user_id}")
    def get_user(user_id: int):
        return {"id": user_id, "name": "Alice"}

New Relic

apiVersion: v1
kind: ConfigMap
metadata:
  name: newrelic-config
  namespace: newrelic
data:
  newrelic-config.yaml: |
    integration_name: nri-kubernetes
    instances:
      - name: nri-kubernetes
        command: pod
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: newrelic-agent
  namespace: newrelic
spec:
  selector:
    matchLabels:
      app: newrelic-agent
  template:
    metadata:
      labels:
        app: newrelic-agent
    spec:
      serviceAccountName: newrelic-agent
      hostNetwork: true
      hostPID: true
      containers:
      - name: agent
        image: newrelic/infrastructure-k8s:latest
        env:
        - name: NRIP_LICENSE_KEY
          valueFrom:
            secretKeyRef:
              name: newrelic-license-key
              key: license
        - name: NRIP_CUSTOM_ATTRIBUTES
          value: "environment=production"
        resources:
          limits:
            memory: 512Mi
          requests:
            cpu: 100m
            memory: 256Mi
        volumeMounts:
        - name: docker
          mountPath: /var/run/docker.sock
        - name: proc
          mountPath: /host/proc
          readOnly: true
      volumes:
      - name: docker
        hostPath:
          path: /var/run/docker.sock
      - name: proc
        hostPath:
          path: /proc

Dynatrace

# Install Dynatrace Operator
kubectl create namespace dynatrace
kubectl apply -f https://github.com/Dynatrace/dynatrace-operator/releases/latest/download/dynatrace-operator.yaml

# Create DynatraceOneAgent resource
kubectl apply -f - <<EOF
apiVersion: dynatrace.com/v1beta1
kind: DynatraceOneAgent
metadata:
  name: dynatrace
  namespace: dynatrace
spec:
  apiUrl: https://YOUR-ENVIRONMENT-ID.live.dynatrace.com/api
  tokens:
    apiToken:
      valueFrom:
        secretKeyRef:
          name: dynakube
          key: apiToken
    paasToken:
      valueFrom:
        secretKeyRef:
          name: dynakube
          key: paasToken
EOF

Runtime Security Monitoring

Monitor container behavior at runtime using eBPF-based tools.

Falco (Container Runtime Security)

apiVersion: v1
kind: ConfigMap
metadata:
  name: falco-config
  namespace: falco
data:
  falco.yaml: |
    rules_file:
      - /etc/falco/rules.yaml
      - /etc/falco/rules.d
    plugins: []
    output:
      outputs:
        - json
      syslog_output:
        enabled: true
    file_output:
      enabled: true
      keep_alive: false
      filename: /var/log/falco/falco.log
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: falco
  namespace: falco
spec:
  selector:
    matchLabels:
      app: falco
  template:
    metadata:
      labels:
        app: falco
    spec:
      hostNetwork: true
      hostPID: true
      containers:
      - name: falco
        image: falcosecurity/falco:latest
        securityContext:
          privileged: true
        volumeMounts:
        - name: docker
          mountPath: /var/run/docker.sock
        - name: cgroup
          mountPath: /host/sys/fs/cgroup
          readOnly: true
        - name: proc
          mountPath: /host/proc
          readOnly: true
      volumes:
      - name: docker
        hostPath:
          path: /var/run/docker.sock
      - name: cgroup
        hostPath:
          path: /sys/fs/cgroup
      - name: proc
        hostPath:
          path: /proc

Custom Falco rules for CleanStart:

- rule: Unauthorized Process Execution
  desc: Detect suspicious process execution
  condition: spawned_process and not allowed_process
  output: "Process execution detected (user=%user.name process=%proc.name)"
  priority: WARNING

- rule: Suspicious Network Connection
  desc: Detect unusual outbound connections
  condition: outbound and not trusted_ip
  output: "Network connection (src=%fd.sip dst=%fd.dip port=%fd.dport)"
  priority: WARNING
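Rules like these only help if someone reads the resulting alerts. The sketch below shows one way to filter Falco alerts by priority in Python, assuming Falco is emitting JSON alerts one object per line with the rule, priority, and output keys. The function name filter_alerts and the sample alert values are hypothetical and used only for illustration.

```python
import json

# Falco priority levels, most severe first.
PRIORITIES = ["emergency", "alert", "critical", "error",
              "warning", "notice", "informational", "debug"]


def filter_alerts(lines, min_priority="warning"):
    """Yield parsed alerts whose priority is at least min_priority."""
    threshold = PRIORITIES.index(min_priority.lower())
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            alert = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip non-JSON lines such as startup messages
        if not isinstance(alert, dict):
            continue
        priority = str(alert.get("priority", "")).lower()
        # A lower index means a more severe priority.
        if priority in PRIORITIES and PRIORITIES.index(priority) <= threshold:
            yield alert


# Hypothetical alert lines shaped like the rules above.
sample_lines = [
    '{"rule": "Unauthorized Process Execution", "priority": "Warning", '
    '"output": "Process execution detected (user=root process=nc)"}',
    '{"rule": "Example Debug Rule", "priority": "Debug", "output": "noise"}',
    "plain text line that is not JSON",
]

for alert in filter_alerts(sample_lines):
    print(alert["rule"], "-", alert["output"])
```

In practice the lines argument could be any iterable of strings, for example an open handle on the Falco log file configured above.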

Summary

Observability for CleanStart deployments requires:

Metrics: Prometheus scraping application /metrics endpoint + Kubernetes resource metrics
Logs: Structured JSON to stdout/stderr, collected by Fluentd/Fluent Bit to Elasticsearch
Traces: Distributed tracing with OpenTelemetry (instrumentation libraries + OTLP collector)
Health Checks: Startup, readiness, liveness probes without shell scripts
Alerting: PrometheusRule definitions for critical conditions
Optional CleanSight: Detects outdated CleanStart images in production and recommends upgrades
Third-party Monitoring: Datadog, New Relic, Dynatrace agents work natively on GLIBC
Runtime Security: Falco monitors suspicious container behavior

Complete observability = Metrics + Logs + Traces + Health Checks + Alerting + Optional CleanSight for image freshness tracking.