
Performance Baseline Testing Guide for CleanStart Images

Purpose

Establishing performance baselines for CleanStart base images enables your QA team to detect regressions across upgrades and ensure consistent application behavior. This guide walks you through measuring, documenting, and monitoring key performance metrics so you can confidently deploy image updates to production.

Target audience: QA leads, DevOps engineers, platform teams, performance engineers.

Typical time investment: 2–3 hours for initial baseline, 30 min for regression validation per upgrade.

Why Performance Baselines Matter

Without baselines, you deploy blindly. When an application slows down after an image upgrade, you cannot tell whether the new image itself is slower (startup, memory, or CPU overhead), the application has a memory leak, the workload changed (traffic or data volume), or the infrastructure degraded (noisy neighbors or network latency).

Performance baselines solve this by providing a reference point against which to compare future deployments.

Common Performance Regressions Caught by Baselines

Performance baselines catch several types of regressions:

Regression	Symptom	Caught By	Cost if Missed
Startup time degradation	Deployments take twice as long	Layer 4 (perf) testing	$50K+ in deployment delays
Memory leak	Pods are OOMKilled after 6 hours	Layer 6 (soak) testing	$500K+ in production incidents
GC pause increase	P99 latency spikes every 10 seconds	Runtime metrics	$200K+ in SLA violations
Connection pool exhaustion	Intermittent 503 errors	Soak testing plus monitoring	$100K+ in customer impact
CPU overhead	Twice the CPU required after upgrade	Load test comparison	$50K+ in infrastructure cost
I/O performance degradation	Slower database queries	Database benchmarking	$150K+ in application latency

What to Measure

Effective baselines track metrics across four dimensions: startup performance, runtime performance, resource efficiency, and runtime-specific behavior such as garbage collection.

1. Startup Performance

Why it matters: Deployment time directly impacts DevOps velocity. Slow startups also increase pod restart impact during rolling updates or node failures.

Metrics to Track

Metric	How to Measure	Industry Baseline	CleanStart Target	Threshold (warning)
Cold start time	Time from `docker run` to health check pass	5–10s	<5s	>10% increase from previous
Warm start time	Time from restart (cached layers)	2–5s	<2s	>10% increase
Time to first response	Time to handle first successful request	2–8s	<3s	>15% increase
Readiness probe latency	Time for the `/health` endpoint to respond	1–3s	<1s	>20% increase
Liveness probe latency	Time until the container reports as alive	1–5s	<2s	>20% increase

Measurement Script

#!/bin/bash
# Requires GNU date (for %N) and curl inside the image (for the health check).
# Docker runs the first health check one --health-interval after start, so
# measured times are quantized to roughly that interval (1s here).

IMAGE=${1:?usage: $0 IMAGE [ITERATIONS]}
ITERATIONS=${2:-10}
CONTAINER_NAME="startup-test"

echo "📊 Measuring startup performance for $IMAGE"
echo "🔄 Running $ITERATIONS iterations..."
echo ""

declare -a TIMES

for i in $(seq 1 "$ITERATIONS"); do
  # Remove previous container
  docker rm -f "$CONTAINER_NAME" 2>/dev/null || true

  # Measure time from docker run to health check
  START=$(date +%s%N)

  # Start container
  docker run -d \
    --name "$CONTAINER_NAME" \
    --health-cmd='curl -f http://localhost:8080/health || exit 1' \
    --health-interval=1s \
    --health-timeout=2s \
    --health-retries=3 \
    -p 8080:8080 \
    "$IMAGE" > /dev/null

  # Wait for health check
  while [ "$(docker inspect --format='{{.State.Health.Status}}' "$CONTAINER_NAME" 2>/dev/null)" != "healthy" ]; do
    if [ "$(docker inspect --format='{{.State.Running}}' "$CONTAINER_NAME" 2>/dev/null)" != "true" ]; then
      echo "❌ Container failed to start"
      docker logs "$CONTAINER_NAME"
      exit 1
    fi
    sleep 0.1
  done

  END=$(date +%s%N)

  # Calculate elapsed time in milliseconds
  ELAPSED=$(( (END - START) / 1000000 ))
  TIMES[$i]=$ELAPSED

  printf "  Run %2d: %4dms\n" "$i" "$ELAPSED"

  # Cleanup
  docker rm -f "$CONTAINER_NAME" > /dev/null
done

# Calculate statistics
TOTAL=0
MIN=${TIMES[1]}
MAX=${TIMES[1]}

for time in "${TIMES[@]}"; do
  TOTAL=$((TOTAL + time))
  [ "$time" -lt "$MIN" ] && MIN=$time
  [ "$time" -gt "$MAX" ] && MAX=$time
done

AVERAGE=$((TOTAL / ITERATIONS))
# Upper median: element ITERATIONS/2+1 of the sorted samples
MEDIAN=$(printf '%s\n' "${TIMES[@]}" | sort -n | head -n $((ITERATIONS/2+1)) | tail -n 1)

echo ""
echo "📈 Results:"
printf "  Average: %4dms\n" "$AVERAGE"
printf "  Median:  %4dms\n" "$MEDIAN"
printf "  Min:     %4dms\n" "$MIN"
printf "  Max:     %4dms\n" "$MAX"
echo ""

# Store baseline
mkdir -p baselines
BASELINE_FILE="baselines/startup-${IMAGE//\//-}.json"
cat > "$BASELINE_FILE" << EOF
{
  "image": "$IMAGE",
  "date": "$(date -u +%Y-%m-%dT%H:%M:%SZ)",
  "iterations": $ITERATIONS,
  "average_ms": $AVERAGE,
  "median_ms": $MEDIAN,
  "min_ms": $MIN,
  "max_ms": $MAX,
  "raw_times": [$(IFS=, ; echo "${TIMES[*]}")]
}
EOF

echo "✅ Baseline saved to $BASELINE_FILE"

2. Runtime Performance

Why it matters: Runtime performance directly affects user experience (latency), scalability (throughput), and infrastructure costs (resource utilization).

Metrics to Track

Metric	How to Measure	Tool	Industry Baseline	Warning Threshold
Request latency (p50)	50th percentile response time	wrk, hey, k6	50–100ms	>10% increase
Request latency (p95)	95th percentile response time	wrk, hey, k6	150–300ms	>15% increase
Request latency (p99)	99th percentile response time	wrk, hey, k6	300–800ms	>20% increase
Throughput (RPS)	Requests per second	wrk, k6	App-dependent	>10% decrease
Error rate	% of failed requests under load	wrk, k6	<0.1%	Any increase
CPU utilization (idle)	CPU% when no load	docker stats, top	<5%	>10% increase
CPU utilization (loaded)	CPU% under 1000 RPS	docker stats, cAdvisor	70–90%	>100% (saturated)

HTTP Load Test Script (wrk)

#!/bin/bash
# Note: -R (constant request rate) is a wrk2 option; stock wrk rejects it.
# This script expects a wrk2 build on PATH as "wrk".

IMAGE=${1:?usage: $0 IMAGE [RPS] [DURATION] [CONNECTIONS] [THREADS]}
RPS=${2:-100}          # Requests per second
DURATION=${3:-60}      # Seconds
CONNECTIONS=${4:-100}  # Concurrent connections
THREADS=${5:-4}        # Worker threads

CONTAINER_NAME="load-test-$(date +%s)"
RESULTS_FILE=/tmp/load-test.txt

# Fail early if the load generator is missing
if ! command -v wrk &> /dev/null; then
  echo "wrk not found: install a wrk2 build (needed for -R) and retry" >&2
  exit 1
fi

echo "🔄 Starting HTTP load test for $IMAGE"
echo "   RPS: $RPS | Duration: ${DURATION}s | Connections: $CONNECTIONS"
echo ""

# Start container
docker run -d \
  --name "$CONTAINER_NAME" \
  --cpus=2 \
  --memory=1g \
  -p 8080:8080 \
  "$IMAGE" > /dev/null

sleep 5  # Warmup

# Run load test
echo "Running load test..."
wrk -t"$THREADS" \
    -c"$CONNECTIONS" \
    -d"${DURATION}s" \
    -R"$RPS" \
    --latency \
    http://localhost:8080/ \
    | tee "$RESULTS_FILE"

# Extract results
echo ""
echo "📊 Summary:"
grep "Latency" "$RESULTS_FILE" | head -n 1
grep "Requests/sec" "$RESULTS_FILE"

# Store baseline (jq -Rs emits the raw output as one quoted JSON string,
# so it must not be wrapped in additional quotes)
mkdir -p baselines
BASELINE_FILE="baselines/load-${IMAGE//\//-}.json"
cat > "$BASELINE_FILE" << EOF
{
  "image": "$IMAGE",
  "date": "$(date -u +%Y-%m-%dT%H:%M:%SZ)",
  "rps": $RPS,
  "duration_seconds": $DURATION,
  "threads": $THREADS,
  "connections": $CONNECTIONS,
  "results": $(jq -Rs . < "$RESULTS_FILE")
}
EOF

# Cleanup
docker rm -f "$CONTAINER_NAME" > /dev/null

echo "✅ Results saved to $BASELINE_FILE"
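Storing latencies as numbers makes later comparisons easier than keeping raw tool output. The following is a minimal sketch that converts wrk-style latency strings to milliseconds. The helper name `to_ms` and the sample line are illustrative, and the unit suffixes (us, ms, s) are an assumption about the output of your wrk build, so check them against a real run.

```bash
#!/bin/bash
# Hypothetical helper: convert a latency string such as 635.91us, 1.04ms,
# or 2.10s to milliseconds. Other suffixes are rejected with exit status 1.
to_ms() {
  awk -v v="$1" 'BEGIN {
    n = v + 0                         # awk keeps the leading numeric part
    if      (v ~ /us$/) n = n / 1000
    else if (v ~ /ms$/) n = n
    else if (v ~ /s$/)  n = n * 1000
    else exit 1
    printf "%.3f\n", n
  }'
}

# Illustrative line in the style of a latency distribution row
line='     99%    2.87ms'
p99=$(to_ms "$(echo "$line" | awk '{print $2}')")
echo "p99_ms=$p99"
```

The numeric value can then be written to the baseline JSON as its own field instead of being recovered from free text during comparison.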

3. Resource Efficiency

Why it matters: Resource efficiency directly affects infrastructure costs and container density. When memory is the limiting resource, a 10% reduction in per-container memory lets you fit roughly 11% more containers on each node (1 / 0.9 ≈ 1.11).
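The relationship between per-container memory and density can be checked with a quick calculation. This is a sketch only: the node size (16000 MB) and the per-container footprints are hypothetical numbers, and it assumes memory is the only constraint on scheduling.

```bash
#!/bin/bash
# Hypothetical helper: whole containers that fit on a node when memory is
# the only limiting resource. All figures are illustrative assumptions.
containers_per_node() {
  node_mb=$1
  container_mb=$2
  echo $(( node_mb / container_mb ))   # integer division = whole containers
}

before=$(containers_per_node 16000 100)   # 100 MB per container
after=$(containers_per_node 16000 90)     # footprint reduced by 10%
echo "before=$before after=$after"
```

With these inputs the node holds 160 containers before and 177 after, so the density gain is slightly larger than the memory reduction itself.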

Metrics to Track

Metric	Tool	Target	Warning Threshold
Image size	`docker image inspect`	<500MB	>10% increase
Pulled image size (compressed)	Registry	<100–200MB	>5% increase
Memory (RSS) at idle	ps (for example via docker top)	<100MB	>20% increase
Memory (VSZ) at idle	ps (for example via docker top)	<300MB	>30% increase
Disk I/O (read/write IOPS)	fio, iostat	Baseline	>20% sustained increase
Container density (pods per node)	Kubernetes	Max safe	Any decrease

Resource Measurement Script

#!/bin/bash
# Note: docker stats reports cgroup memory usage, not per-process RSS/VSZ.

IMAGE=${1:?usage: $0 IMAGE [DURATION]}
DURATION=${2:-300}  # 5 minutes

CONTAINER_NAME="resource-test-$(date +%s)"

echo "💾 Measuring resource usage for $IMAGE"

# Get image size (docker image inspect reports bytes; the image must be local)
IMAGE_BYTES=$(docker image inspect --format '{{.Size}}' "$IMAGE")
IMAGE_SIZE="$((IMAGE_BYTES / 1000000))MB"

echo "Image size: $IMAGE_SIZE"

# Start container
docker run -d \
  --name "$CONTAINER_NAME" \
  -p 8080:8080 \
  "$IMAGE" > /dev/null

sleep 3

# Collect resource metrics every 5 seconds
echo "Monitoring for ${DURATION}s..."

STATS_FILE=/tmp/resource-stats.txt
> "$STATS_FILE"

for i in $(seq 0 5 "$DURATION"); do
  docker stats --no-stream "$CONTAINER_NAME" >> "$STATS_FILE"
  sleep 5
done

# Extract peak and average values
# Default docker stats columns: $3 = CPU %, $4 = memory usage (e.g. 45.2MiB)
PEAK_MEMORY=$(grep "$CONTAINER_NAME" "$STATS_FILE" | awk '{print $4}' | sort -h | tail -n 1)
# The average strips unit suffixes, so it assumes every sample uses the same unit
AVG_MEMORY=$(grep "$CONTAINER_NAME" "$STATS_FILE" | awk '{print $4}' | \
             sed 's/[^0-9.]*//g' | \
             awk '{sum+=$1; count++} END {printf "%.0f", sum/count}')
PEAK_CPU=$(grep "$CONTAINER_NAME" "$STATS_FILE" | awk '{print $3}' | sort -n | tail -n 1)

echo ""
echo "📊 Results:"
echo "  Image size: $IMAGE_SIZE"
echo "  Peak memory: $PEAK_MEMORY"
echo "  Avg memory: ${AVG_MEMORY}M"
echo "  Peak CPU: $PEAK_CPU"
echo ""

# Store baseline
mkdir -p baselines
BASELINE_FILE="baselines/resources-${IMAGE//\//-}.json"
cat > "$BASELINE_FILE" << EOF
{
  "image": "$IMAGE",
  "date": "$(date -u +%Y-%m-%dT%H:%M:%SZ)",
  "image_size": "$IMAGE_SIZE",
  "peak_memory": "$PEAK_MEMORY",
  "avg_memory_mb": $AVG_MEMORY,
  "peak_cpu": "$PEAK_CPU"
}
EOF

# Cleanup
docker rm -f "$CONTAINER_NAME" > /dev/null

echo "✅ Results saved to $BASELINE_FILE"

4. Garbage Collection & Runtime-Specific Metrics

Why it matters: For managed-memory languages (Java, Python, Node), GC pauses directly impact P99 latency and user experience.

Java/JVM Metrics

Metric	Tool	Target	Warning Threshold
GC pause (max)	jstat, async-profiler	<100ms	>150ms
Full GC frequency	jstat	<1/minute	>2/minute
Heap utilization	jstat	60–80%	>90% sustained
Startup time	time	<5s	>10% increase
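The jstat-based metrics in the table can be extracted with a small parser. The sketch below assumes output in the `jstat -gcutil` layout, where `O` is old-generation utilization in percent and `FGC` is the cumulative full GC count. The helper name `jstat_field` and the sample values are made up for illustration.

```bash
#!/bin/bash
# Sketch: read jstat -gcutil output on stdin and print one column, looked up
# by header name, from the last sample row. Looking up by name keeps the
# parser working when a JDK version adds or reorders columns.
jstat_field() {
  awk -v name="$1" '
    NR == 1 { for (i = 1; i <= NF; i++) if ($i == name) col = i; next }
    col     { value = $col }
    END     { if (col) print value; else exit 1 }'
}

# Canned sample in the -gcutil layout (values are invented for illustration)
sample='  S0     S1     E      O      M     CCS    YGC     YGCT    FGC    FGCT     GCT
  0.00  96.88  12.50  45.30  94.12  89.77     12    0.134     1    0.052    0.186
  0.00  96.88  58.10  47.95  94.12  89.77     14    0.151     3    0.160    0.311'

echo "old-gen utilization: $(echo "$sample" | jstat_field O)%"
echo "full GC count: $(echo "$sample" | jstat_field FGC)"
```

Against a live JVM, something like `jstat -gcutil <pid> 1000 60 | jstat_field FGC` would report the full GC count after one minute of one-second samples. Dividing the increase in that count by the elapsed minutes gives the full GC frequency used in the table.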

Baseline Establishment Workflow

Step 1: Prepare Test Environment (30 min)

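# 1. Isolate test environment (no other containers/processes)
# WARNING: this removes ALL containers on the host; use a dedicated test machine.
docker ps -aq | xargs -r docker rm -f

# 2. Warm up system (stabilize CPU frequency, page cache)
for i in {1..5}; do
  docker run --rm registry.cleanstart.com/py:3.12 python -c "print('warmup')"
done

# 3. Install benchmarking tools
# ab ships in apache2-utils; jq and bc are used by the scripts in this guide.
# The apt wrk package is stock wrk; the -R rate flag in the load test needs wrk2.
apt-get update
apt-get install -y wrk apache2-utils jq bc

# 4. Create baseline directory
mkdir -p baselines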

Step 2: Run Baseline Tests (1.5 hours)

Run the startup, HTTP load, and resource measurement scripts defined above.

Step 3: Document and Store (15 min)

#!/bin/bash

IMAGE=${1:?usage: $0 IMAGE}
TIMESTAMP=$(date -u +%Y-%m-%d)
REPORT="baselines/report-${IMAGE//\//-}-${TIMESTAMP}.md"

# Create the report skeleton; replace each XXX with values from the baseline JSON files.
# The heredoc delimiter is unquoted so that $IMAGE and $TIMESTAMP expand.
cat > "$REPORT" << EOF
# Performance Baseline Report

## Image
$IMAGE

## Date
$TIMESTAMP

## Metrics

### Startup Performance
- Average: XXX ms
- P95: XXX ms

### Runtime Performance
- P50 latency: XXX ms
- P95 latency: XXX ms
- Throughput: XXX RPS

### Resource Usage
- Image size: XXX MB
- Memory (idle): XXX MB
- Memory (loaded): XXX MB

## Baseline Thresholds
- Startup regression warning: >10%
- Latency regression warning: >15%
- Memory regression warning: >20%
- Throughput regression warning: >10%

## Next Steps
- Compare future image versions against these baselines
- Alert if any metric exceeds warning threshold
- Investigate regressions before production deployment
EOF

# Store in version control
git add baselines/
git commit -m "Add performance baseline for $IMAGE on $TIMESTAMP"
git push

echo "✅ Baseline report saved and committed"

Detecting Regressions: Before vs. After Comparison

When you upgrade an image, for example from 3.12.1 to 3.12.2, generate new baselines and compare them against the old ones.

Automated Regression Detection Script

#!/bin/bash

OLD_IMAGE=${1:?usage: $0 OLD_IMAGE NEW_IMAGE}
NEW_IMAGE=${2:?usage: $0 OLD_IMAGE NEW_IMAGE}
REGRESSION_REPORT="regression-report-$(date +%Y%m%d).md"
FAILED=0

echo "# Regression Test Report" > "$REGRESSION_REPORT"
echo "" >> "$REGRESSION_REPORT"
echo "## Comparison: $OLD_IMAGE → $NEW_IMAGE" >> "$REGRESSION_REPORT"
echo "" >> "$REGRESSION_REPORT"

# Function to compare metrics where lower is better (time, memory, size).
# It is not suitable for throughput, where a decrease is the regression.
compare_metrics() {
  local METRIC=$1
  local OLD_VALUE=$2
  local NEW_VALUE=$3
  local THRESHOLD=$4  # Percentage

  # Skip missing values and a zero baseline (avoids division by zero)
  if [ -z "$OLD_VALUE" ] || [ -z "$NEW_VALUE" ] || [ "$OLD_VALUE" = "0" ]; then
    return
  fi

  # Calculate percent change (multiply before dividing so scale=1 keeps precision)
  CHANGE=$(echo "scale=1; ($NEW_VALUE - $OLD_VALUE) * 100 / $OLD_VALUE" | bc)

  if (( $(echo "$CHANGE <= 0" | bc -l) )); then
    DIRECTION="same or better"
    SYMBOL="✅"
  elif (( $(echo "$CHANGE > $THRESHOLD" | bc -l) )); then
    DIRECTION="regression"
    SYMBOL="❌"
    FAILED=1
  else
    DIRECTION="worse, within threshold"
    SYMBOL="⚠️"
  fi

  echo "$SYMBOL $METRIC: ${OLD_VALUE} → ${NEW_VALUE} (${CHANGE}%, $DIRECTION)" >> "$REGRESSION_REPORT"
}

# Load baseline files
OLD_STARTUP=$(jq '.average_ms' "baselines/startup-${OLD_IMAGE//\//-}.json")
NEW_STARTUP=$(jq '.average_ms' "baselines/startup-${NEW_IMAGE//\//-}.json")
compare_metrics "Startup time (ms)" "$OLD_STARTUP" "$NEW_STARTUP" 10

echo ""
echo "📊 Full regression report: $REGRESSION_REPORT"

# A non-zero exit lets CI fail the job when a threshold is exceeded
exit $FAILED
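Some metrics regress when they go up (latency, memory) and others when they go down (throughput), so a comparison needs to know which direction is bad. The following is a minimal sketch of a direction-aware percent change. The function name `regression_pct` and its `lower`/`higher` argument are hypothetical and not part of the scripts above.

```bash
#!/bin/bash
# Hypothetical helper: percent change oriented so that a positive result
# always means "worse". DIRECTION is "lower" when lower values are better
# (latency, memory) or "higher" when higher values are better (throughput).
regression_pct() {
  awk -v d="$1" -v old="$2" -v new="$3" 'BEGIN {
    change = (new - old) * 100 / old
    if (d == "higher") change = 0 - change
    printf "%.1f\n", change
  }'
}

regression_pct lower 85 92        # p95 latency rose from 85 ms to 92 ms
regression_pct higher 1200 1050   # throughput fell from 1200 to 1050 RPS
```

Both calls print a positive number (8.2 and 12.5), so the same "greater than threshold" check works for every metric once the direction is declared.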

Performance Comparison: CleanStart vs. Alternatives

The following benchmarks compare CleanStart with Alpine, Ubuntu, and Distroless for the py:3.12 image.

Startup Time Comparison

Image	Cold Start	Warm Start	Time to Health
CleanStart py:3.12	1.2s	0.8s	1.0s
Alpine py:3.12	1.8s	1.2s	1.5s
Ubuntu py:3.12	2.5s	1.8s	2.0s
Distroless py:3.12	0.9s	0.6s	0.8s

Note: Distroless is likely faster because it ships no shell or debugging tools. CleanStart starts in about a third less time than Alpine (1.2s vs. 1.8s cold start) while including security features Distroless lacks.

Image Size Comparison

Image	Compressed	Uncompressed	% Smaller than Ubuntu (uncompressed)
CleanStart py:3.12	45 MB	120 MB	82%
Alpine py:3.12	38 MB	105 MB	85%
Ubuntu py:3.12	210 MB	680 MB	n/a
Distroless py:3.12	22 MB	62 MB	91%

Memory (Idle) Comparison

Image	RSS Memory	VSZ Memory	Notes
CleanStart py:3.12	45 MB	180 MB	Clean, minimal
Alpine py:3.12	32 MB	140 MB	Musl reduces RSS
Ubuntu py:3.12	120 MB	420 MB	Full glibc + utils
Distroless py:3.12	28 MB	110 MB	Smallest footprint

Throughput Under Load (1000 RPS, 100 concurrent)

Image	Throughput	P95 Latency	Error Rate
CleanStart py:3.12	1200 RPS	85 ms	0.02%
Alpine py:3.12	1150 RPS	92 ms	0.03%
Ubuntu py:3.12	1050 RPS	120 ms	0.15%
Distroless py:3.12	1210 RPS	82 ms	0.01%

CI/CD Integration: Automated Baseline Testing

GitHub Actions Workflow

name: Performance Baseline

on:
  schedule:
    - cron: '0 2 * * *'  # Daily at 2 AM UTC
  workflow_dispatch:
    inputs:
      image:
        description: 'Image to baseline'
        required: true

env:
  # Scheduled runs have no inputs, so fall back to a default image
  # (example value; adjust to the image you track)
  IMAGE: ${{ inputs.image || 'registry.cleanstart.com/py:3.12' }}

jobs:
  baseline:
    runs-on: ubuntu-latest
    timeout-minutes: 120
    steps:
      - uses: actions/checkout@v4

      - name: Establish baseline
        run: |
          ./scripts/performance-baseline.sh "$IMAGE" 300

      # No continue-on-error here: with it, the failure() step below never runs
      - name: Compare against previous
        run: |
          ./scripts/compare-baselines.sh "$IMAGE"

      - name: Upload results
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: baseline-results
          path: baselines/

      - name: Report regression in job summary
        if: failure()
        run: |
          echo "⚠️ Performance regression detected. Review results." >> $GITHUB_STEP_SUMMARY

Best Practices

1. Test in Production-Like Conditions

Test using the same hardware (CPU type, memory, network), simulating real traffic patterns, and including all external dependencies (database, cache, message queue).

2. Run Multiple Iterations

Run a minimum of 10 iterations per metric to account for system variance. Calculate median and p95 (not just average). Track raw data for future analysis.
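As an illustration of why the median and p95 are more informative than the average, the sketch below computes nearest-rank percentiles over a set of samples. The function name `percentile` and the sample values are hypothetical. Nearest-rank is one of several percentile definitions, so other tools may report slightly different values.

```bash
#!/bin/bash
# Nearest-rank percentile: percentile P v1 v2 ... prints the sample at
# rank ceil(P/100 * N) of the sorted values (P is an integer from 1 to 100).
percentile() {
  p=$1
  shift
  printf '%s\n' "$@" | sort -n | awk -v p="$p" '
    { v[NR] = $1 }
    END {
      rank = int((p * NR + 99) / 100)   # integer ceiling of p * NR / 100
      if (rank < 1) rank = 1
      print v[rank]
    }'
}

# Ten illustrative startup times in ms, including one outlier
samples="1210 1180 1250 1190 1900 1205 1195 1220 1215 1200"
# $samples is intentionally unquoted so each value becomes its own argument
echo "median=$(percentile 50 $samples) p95=$(percentile 95 $samples)"
```

For these samples the median is 1205 ms and the p95 is 1900 ms, while the single outlier pulls the mean up to 1276.5 ms. This is the reason to record all three along with the raw data.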

3. Version Everything

Store baselines alongside application code in Git. Tag baseline commits with version numbers. Use semantic versioning (baseline-v1.2.3).
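One way to keep tag names consistent is to generate them from a validated version string. This is a small sketch: the helper name `baseline_tag` is hypothetical, and it accepts only plain MAJOR.MINOR.PATCH versions without pre-release suffixes.

```bash
#!/bin/bash
# Hypothetical helper: build a baseline tag name from a semantic version.
# Returns non-zero for anything that is not MAJOR.MINOR.PATCH.
baseline_tag() {
  version=$1
  if ! printf '%s\n' "$version" | grep -Eq '^[0-9]+\.[0-9]+\.[0-9]+$'; then
    echo "invalid version: $version" >&2
    return 1
  fi
  echo "baseline-v$version"
}

# Print the git command instead of running it, so the sketch has no side effects
tag=$(baseline_tag 1.2.3) && echo "git tag -a $tag -m 'Performance baseline for 1.2.3'"
```

Running the printed `git tag -a` command after committing the baseline files ties each set of results to a specific version.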

4. Alert on Regressions

Set thresholds (e.g., >15% latency increase = alert), integrate with Slack/PagerDuty, and block production deployments if thresholds are exceeded.
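Blocking a deployment usually comes down to a command that exits non-zero when a threshold is exceeded. The sketch below shows one way to express that gate. The function name `within_threshold` and the example values are assumptions, and it covers only metrics where a higher value is worse.

```bash
#!/bin/bash
# Hypothetical gate: within_threshold OLD NEW LIMIT_PCT
# Exit status 0 = within the limit, 1 = regression beyond the limit,
# 2 = no usable baseline (OLD is zero or negative).
within_threshold() {
  awk -v old="$1" -v new="$2" -v limit="$3" 'BEGIN {
    if (old <= 0) exit 2
    change = (new - old) * 100 / old
    exit ((change > limit) ? 1 : 0)
  }'
}

# p95 latency went from 85 ms to 100 ms (+17.6%) against a 15% limit
if within_threshold 85 100 15; then
  echo "latency OK"
else
  echo "latency regression: block deploy"
fi
```

In a pipeline, calling the gate as the last command of a step (or following it with `|| exit 1`) is enough for most CI systems to stop the deployment.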

5. Document Assumptions

Record hardware specs (CPU model, memory capacity). Note system state (other processes, network conditions). Explain any anomalies or outliers.
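Hardware details are easiest to keep accurate when they are captured automatically alongside each baseline. The following sketch assumes a Linux host, since it reads `/proc`. The function name `host_specs` and the JSON field names are illustrative.

```bash
#!/bin/bash
# Sketch: print basic host specs as one JSON line. Assumes Linux (/proc);
# fields fall back to "unknown" or 0 when a value cannot be read.
host_specs() {
  cpu_model=$(awk -F': *' '/^model name/ {print $2; exit}' /proc/cpuinfo 2>/dev/null)
  cpu_count=$(grep -c '^processor' /proc/cpuinfo 2>/dev/null)
  mem_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo 2>/dev/null)
  printf '{"cpu_model": "%s", "cpu_count": %s, "mem_total_kb": %s, "kernel": "%s"}\n' \
    "${cpu_model:-unknown}" "${cpu_count:-0}" "${mem_kb:-0}" "$(uname -r)"
}

host_specs
```

Redirecting the output to a file under `baselines/` keeps the environment description next to the numbers it explains.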

6. Refresh Baselines Quarterly

Run new baselines on current hardware. Recalibrate thresholds as infrastructure evolves. Archive old baselines for historical comparison.
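To find baselines that are due for a refresh, file modification time is a simple proxy for age. This sketch assumes baselines are stored as JSON files in a `baselines` directory, as in the scripts in this guide. The function name `stale_baselines` and the 90-day default are illustrative.

```bash
#!/bin/bash
# Hypothetical helper: list baseline JSON files last modified more than
# DAYS days ago (default 90, roughly one quarter).
stale_baselines() {
  dir=${1:-baselines}
  days=${2:-90}
  [ -d "$dir" ] || return 0          # nothing to report if the directory is absent
  find "$dir" -name '*.json' -mtime +"$days"
}

stale_baselines baselines 90
```

Note that a fresh clone or checkout resets modification times, so in CI the `date` field stored inside each baseline file is a more reliable source than the file's timestamp.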

Troubleshooting Common Issues

Issue: High Variance Between Runs

Cause: System noise, background processes, CPU frequency scaling

Solution:

# Disable CPU frequency scaling (Linux)
echo "performance" | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor

# Stop background services
sudo systemctl stop snapd bluetooth

# Run test on isolated system or Kubernetes node
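Before tuning the host, it helps to quantify how noisy the runs actually are. One common measure is the coefficient of variation (standard deviation divided by mean). The sketch below computes it. The function name `cv_percent` is hypothetical, and any cutoff you apply (for example, investigating above roughly 5%) is a rule of thumb rather than a fixed standard.

```bash
#!/bin/bash
# Coefficient of variation in percent for a list of samples, using the
# population standard deviation.
cv_percent() {
  printf '%s\n' "$@" | awk '
    { sum += $1; sumsq += $1 * $1; n++ }
    END {
      mean = sum / n
      var = sumsq / n - mean * mean
      if (var < 0) var = 0            # guard against rounding error
      printf "%.1f\n", sqrt(var) / mean * 100
    }'
}

cv_percent 1200 1210 1190 1205 1195   # tight runs
cv_percent 1200 1500 900 1300 1100    # noisy runs
```

The first set prints 0.6 and the second 16.7. Both have the same mean of 1200 ms, which shows why the spread needs to be reported along with the average.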

Issue: Memory Grows Unbounded

Cause: Memory leak in application or runtime

Solution:

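# Check for leak: cap memory and disable swap so unbounded growth ends in an OOM kill
# (the container name leak-test is an example)
docker run -d --name leak-test -m 512m --memory-swap 512m IMAGE

# After the soak period, check whether the kernel killed the container
docker inspect --format '{{.State.OOMKilled}}' leak-test

# If the container is OOMKilled under steady load, the application likely has a leak
# Next steps: raise -Xmx for Java to rule out an undersized heap, check Python objects, profile Node.js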
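A leak shows up as a steady upward trend rather than a single high reading, so fitting a line through periodic memory samples is a useful first check. The sketch below computes a least-squares slope. The function name `mem_slope` and the sample values are illustrative, and it needs at least two samples taken at a fixed interval.

```bash
#!/bin/bash
# Least-squares slope of memory readings (MB) taken at a fixed interval.
# Output is MB gained per sample; values near zero suggest stable memory.
mem_slope() {
  printf '%s\n' "$@" | awk '
    { x = NR - 1; sx += x; sy += $1; sxy += x * $1; sxx += x * x; n++ }
    END { printf "%.2f\n", (n * sxy - sx * sy) / (n * sxx - sx * sx) }'
}

mem_slope 100 101 99 100 101 100     # flat with noise
mem_slope 100 104 109 113 118 122    # steady growth
```

The first series prints 0.03 and the second 4.46. Multiplying the slope by the number of samples per hour gives an approximate growth rate to compare against the container's memory limit.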

Issue: Network Latency Affects Results

Cause: Registry pulls are slow, external dependencies respond slowly

Solution:

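# Pre-warm layer cache
docker pull "$IMAGE"

# Use a local registry mirror. This is a Docker daemon setting, not a docker run flag,
# and it applies to Docker Hub images only:
#   dockerd --registry-mirror=http://local-mirror:5000
# or in /etc/docker/daemon.json:
#   { "registry-mirrors": ["http://local-mirror:5000"] }

# Test only application logic (not I/O)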

What to Read Next

Regression Testing Strategy: Full guide for regression testing across image upgrades.
Image Catalog Reference: Performance characteristics of all available images.
Monitoring & Observability: Production performance monitoring setup.
Troubleshooting: Performance Issues: Solutions for production slowdowns.