Purpose
Establishing performance baselines for CleanStart base images enables your QA team to detect regressions across upgrades and ensure consistent application behavior. This guide walks you through measuring, documenting, and monitoring key performance metrics so you can confidently deploy image updates to production.
Target audience: QA leads, DevOps engineers, platform teams, performance engineers.
Typical time investment: 2–3 hours for initial baseline, 30 min for regression validation per upgrade.
Why Performance Baselines Matter
Without baselines, you deploy blindly. When an application slows down after an image upgrade, you won't know if the new image is slower in terms of startup, memory, or CPU overhead; if your application has a memory leak; if the workload characteristics changed such as traffic or data volume; or if infrastructure degradation occurred such as noisy neighbor effects or network latency.
Performance baselines solve this by providing a reference point against which to compare future deployments.
Common Performance Regressions Caught by Baselines
Performance baselines catch several types of regressions:

| Regression | Symptom | Caught by | Cost if missed |
|---|---|---|---|
| Startup time degradation | Deployments take twice as long | Layer 4 (perf) testing | $50K or more in deployment delays |
| Memory leak | Pods are OOMKilled after 6 hours | Layer 6 (soak) testing | $500K or more in production incidents |
| GC pause increase | P99 latency spikes every 10 seconds | Runtime metrics | $200K or more in SLA violations |
| Connection pool exhaustion | Intermittent 503 errors | Soak testing plus monitoring | $100K or more in customer impact |
| CPU overhead | Twice the CPU required after upgrade | Load test comparison | $50K or more in infrastructure cost |
| I/O performance degradation | Slower database queries | Database benchmarking | $150K or more in app latency |
What to Measure
Effective baselines track metrics across four dimensions: startup, runtime, resource efficiency, and reliability.
1. Startup Performance
Why it matters: Deployment time directly impacts DevOps velocity. Slow startups also increase pod restart impact during rolling updates or node failures.
Metrics to Track
| Metric | How to Measure | Industry Baseline | CleanStart Target | Warning Threshold |
|---|---|---|---|---|
| Cold start time | Time from | 5–10s | <5s | >10% increase from previous |
| Warm start time | Time from restart (cached layers) | 2–5s | <2s | >10% increase |
| Time to first response | Time to handle first successful request | 2–8s | <3s | >15% increase |
| Readiness probe latency | Time | 1–3s | <1s | >20% increase |
| Liveness probe latency | Time container becomes "alive" | 1–5s | <2s | >20% increase |
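The warning thresholds in the table are relative to the previous baseline, so applying them comes down to a percent-change calculation. The following is a minimal sketch of that check, assuming both values are in milliseconds; the function names `pct_increase` and `check_threshold` and the sample numbers are illustrative and not part of any CleanStart tooling.

```bash
#!/bin/bash
# Illustrative helpers (hypothetical names) for applying a relative warning threshold.

# pct_increase OLD NEW: print the percent change from OLD to NEW, one decimal place.
pct_increase() {
  awk -v old="$1" -v new="$2" 'BEGIN { printf "%.1f", (new - old) * 100 / old }'
}

# check_threshold OLD NEW THRESHOLD: return 0 if the increase is within
# THRESHOLD percent, 1 if it exceeds it.
check_threshold() {
  awk -v old="$1" -v new="$2" -v t="$3" \
    'BEGIN { if ((new - old) * 100 / old > t) exit 1; exit 0 }'
}

# Example: cold start went from 1200 ms to 1350 ms, warning threshold 10%.
if ! check_threshold 1200 1350 10; then
  echo "Cold start regressed by $(pct_increase 1200 1350)% (threshold: 10%)"
fi
```

Multiplying by 100 before dividing keeps the precision of the ratio, which matters when the result is rounded to one decimal place.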
Measurement Script
```bash
#!/bin/bash
# Measure container startup time (docker run -> healthy) over several iterations.
# Assumes the image serves /health on port 8080 and ships curl for the health check.
# Requires GNU date for nanosecond timestamps (%N).

IMAGE=$1
ITERATIONS=${2:-10}
CONTAINER_NAME="startup-test"

if [ -z "$IMAGE" ]; then
  echo "Usage: $0 IMAGE [ITERATIONS]" >&2
  exit 1
fi

echo "📊 Measuring startup performance for $IMAGE"
echo "🔄 Running $ITERATIONS iterations..."
echo ""

declare -a TIMES

for i in $(seq 1 "$ITERATIONS"); do
  # Remove previous container
  docker rm -f "$CONTAINER_NAME" 2>/dev/null || true

  # Measure time from docker run to health check
  START=$(date +%s%N)

  # Start container. The first health check runs one interval after start,
  # so results are quantized to roughly --health-interval.
  docker run -d \
    --name "$CONTAINER_NAME" \
    --health-cmd='curl -f http://localhost:8080/health || exit 1' \
    --health-interval=1s \
    --health-timeout=2s \
    --health-retries=3 \
    -p 8080:8080 \
    "$IMAGE" > /dev/null

  # Wait for health check
  while true; do
    STATUS=$(docker inspect --format='{{.State.Health.Status}}' "$CONTAINER_NAME" 2>/dev/null)
    [ "$STATUS" = "healthy" ] && break
    RUNNING=$(docker inspect --format='{{.State.Running}}' "$CONTAINER_NAME" 2>/dev/null)
    # Stop on a crashed container or a failed health check instead of looping forever
    if [ "$RUNNING" != "true" ] || [ "$STATUS" = "unhealthy" ]; then
      echo "❌ Container failed to start"
      docker logs "$CONTAINER_NAME"
      exit 1
    fi
    sleep 0.1
  done

  END=$(date +%s%N)

  # Calculate elapsed time in milliseconds
  ELAPSED=$(( (END - START) / 1000000 ))
  TIMES[$i]=$ELAPSED

  printf "  Run %2d: %4dms\n" "$i" "$ELAPSED"

  # Cleanup
  docker rm -f "$CONTAINER_NAME" > /dev/null
done

# Calculate statistics
TOTAL=0
MIN=${TIMES[1]}
MAX=${TIMES[1]}

for time in "${TIMES[@]}"; do
  TOTAL=$((TOTAL + time))
  [ "$time" -lt "$MIN" ] && MIN=$time
  [ "$time" -gt "$MAX" ] && MAX=$time
done

AVERAGE=$((TOTAL / ITERATIONS))
# Upper median: for an even number of runs this picks the higher middle value
MEDIAN=$(printf '%s\n' "${TIMES[@]}" | sort -n | sed -n "$((ITERATIONS / 2 + 1))p")

echo ""
echo "📈 Results:"
printf "  Average: %4dms\n" "$AVERAGE"
printf "  Median:  %4dms\n" "$MEDIAN"
printf "  Min:     %4dms\n" "$MIN"
printf "  Max:     %4dms\n" "$MAX"
echo ""

# Store baseline
mkdir -p baselines
cat > "baselines/startup-${IMAGE//\//-}.json" << EOF
{
  "image": "$IMAGE",
  "date": "$(date -u +%Y-%m-%dT%H:%M:%SZ)",
  "iterations": $ITERATIONS,
  "average_ms": $AVERAGE,
  "median_ms": $MEDIAN,
  "min_ms": $MIN,
  "max_ms": $MAX,
  "raw_times": [$(IFS=, ; echo "${TIMES[*]}")]
}
EOF

echo "✅ Baseline saved to baselines/startup-${IMAGE//\//-}.json"
```

2. Runtime Performance
Why it matters: Runtime performance directly affects user experience (latency), scalability (throughput), and infrastructure costs (resource utilization).
Metrics to Track
| Metric | How to Measure | Tool | Industry Baseline | Warning Threshold |
|---|---|---|---|---|
| Request latency (p50) | 50th percentile response time | wrk, hey, k6 | 50–100ms | >10% increase |
| Request latency (p95) | 95th percentile response time | wrk, hey, k6 | 150–300ms | >15% increase |
| Request latency (p99) | 99th percentile response time | wrk, hey, k6 | 300–800ms | >20% increase |
| Throughput (RPS) | Requests per second | wrk, k6 | App-dependent | >10% decrease |
| Error rate | % of failed requests under load | wrk, k6 | <0.1% | Any increase |
| CPU utilization (idle) | CPU% when no load | docker stats, top | <5% | >10% increase |
| CPU utilization (loaded) | CPU% under 1000 RPS | docker stats, cAdvisor | 70–90% | >100% (saturated) |
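Load-testing tools report percentiles themselves, but if you keep raw per-request latencies (one value per line), the p50/p95/p99 rows can be recomputed directly. The sketch below uses the nearest-rank method and assumes integer percentiles; the function name `percentile` and the sample values are hypothetical.

```bash
#!/bin/bash
# percentile P: read one number per line on stdin and print the P-th
# percentile using the nearest-rank method (P is an integer from 1 to 100).
percentile() {
  sort -n | awk -v p="$1" '
    { v[NR] = $1 }
    END {
      if (NR == 0) exit 1
      rank = int((p * NR + 99) / 100)   # ceil(p * NR / 100) for integer p
      if (rank < 1) rank = 1
      print v[rank]
    }'
}

# Example with ten made-up latency samples in milliseconds
samples="82 91 78 120 85 300 88 90 84 87"
for p in 50 95 99; do
  echo "p$p: $(printf '%s\n' $samples | percentile "$p") ms"
done
```

With only ten samples, p95 and p99 both land on the largest value, which is one reason tail percentiles need far more requests than the median does.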
HTTP Load Test Script (wrk)
```bash
#!/bin/bash
IMAGE=$1
RPS=${2:-100}          # Requests per second
DURATION=${3:-60}      # Seconds
CONNECTIONS=${4:-100}  # Concurrent connections
THREADS=${5:-4}        # Worker threads

CONTAINER_NAME="load-test-$(date +%s)"
RESULTS_FILE=/tmp/load-test.txt

echo "🔄 Starting HTTP load test for $IMAGE"
echo "   RPS: $RPS | Duration: ${DURATION}s | Connections: $CONNECTIONS"
echo ""

# Start container
docker run -d \
  --name "$CONTAINER_NAME" \
  --cpus=2 \
  --memory=1g \
  -p 8080:8080 \
  "$IMAGE" > /dev/null

sleep 5  # Warmup

# Install wrk if needed (Debian/Ubuntu, requires root)
if ! command -v wrk &> /dev/null; then
  echo "Installing wrk..."
  apt-get update && apt-get install -y wrk
fi

# -R (constant request rate) is a wrk2 option; stock wrk rejects it.
# Pass it only when the installed binary advertises --rate.
RATE_ARGS=()
if wrk --help 2>&1 | grep -q -- '--rate'; then
  RATE_ARGS=(-R "$RPS")
else
  echo "⚠️  This wrk build has no rate option; running unthrottled (RPS is recorded only)"
fi

# Run load test
echo "Running load test..."
wrk -t"$THREADS" \
  -c"$CONNECTIONS" \
  -d"${DURATION}s" \
  "${RATE_ARGS[@]}" \
  --latency \
  http://localhost:8080/ \
  | tee "$RESULTS_FILE"

# Extract results
echo ""
echo "📊 Summary:"
grep "Latency" "$RESULTS_FILE" | head -1
grep "Requests/sec" "$RESULTS_FILE"

# Store baseline. jq -Rs emits a quoted JSON string, so it is not wrapped in quotes again.
mkdir -p baselines
cat > "baselines/load-${IMAGE//\//-}.json" << EOF
{
  "image": "$IMAGE",
  "date": "$(date -u +%Y-%m-%dT%H:%M:%SZ)",
  "rps": $RPS,
  "duration_seconds": $DURATION,
  "threads": $THREADS,
  "connections": $CONNECTIONS,
  "results": $(jq -Rs . < "$RESULTS_FILE")
}
EOF

# Cleanup
docker rm -f "$CONTAINER_NAME" > /dev/null

echo "✅ Results saved to baselines/load-${IMAGE//\//-}.json"
```

3. Resource Efficiency
Why it matters: Resource efficiency directly affects infrastructure costs and container density. When memory is the limiting resource, a 10% memory reduction allows roughly 11% more containers per node (1 / 0.9 ≈ 1.11).
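As a rough worked example of the density arithmetic, the sketch below computes how many whole containers fit on a node when memory is the only constraint. The function name `pods_per_node` and the node and pod sizes are assumptions chosen for illustration; real schedulers also account for CPU, system reservations, and per-pod overhead.

```bash
#!/bin/bash
# pods_per_node NODE_MIB POD_MIB: whole pods that fit into the node's
# allocatable memory, ignoring every other resource.
pods_per_node() {
  awk -v node="$1" -v pod="$2" 'BEGIN { print int(node / pod) }'
}

# Hypothetical node with 16384 MiB allocatable memory
echo "512 MiB per pod:   $(pods_per_node 16384 512) pods"
# The same workload after a 10% memory reduction (460.8 MiB per pod)
echo "460.8 MiB per pod: $(pods_per_node 16384 460.8) pods"
```

Because only whole pods count, the realized gain on a single node can be slightly below or above the theoretical ratio.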
Metrics to Track
| Metric | Tool | Target | Warning Threshold |
|---|---|---|---|
| Image size | | <500MB | >10% increase |
| Pulled image size (compressed) | Registry | <100–200MB | >5% increase |
| Memory (RSS) at idle | docker stats | <100MB | >20% increase |
| Memory (VSZ) at idle | docker top (ps `vsz` column; docker stats does not report VSZ) | <300MB | >30% increase |
| Disk I/O (read/write IOPS) | fio, iostat | Baseline | >20% sustained increase |
| Container density (pods per node) | Kubernetes | Max safe | Any decrease |
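`docker stats` prints memory with a unit suffix that changes with magnitude (for example `512KiB`, `45.2MiB`, `1.5GiB`), so samples should be normalized before they are averaged or compared against the targets above. This is a small sketch of such a conversion; the helper name `to_mib` is hypothetical.

```bash
#!/bin/bash
# to_mib VALUE: convert a docker-stats style size (B, KiB, MiB, GiB)
# to MiB with one decimal place. Fails on an unknown unit.
to_mib() {
  awk -v s="$1" 'BEGIN {
    n = s + 0                        # leading numeric part
    u = s; sub(/^[0-9.]+/, "", u)    # remaining unit suffix
    if      (u == "B")   n /= 1048576
    else if (u == "KiB") n /= 1024
    else if (u == "MiB") n *= 1
    else if (u == "GiB") n *= 1024
    else exit 1
    printf "%.1f\n", n
  }'
}

# Example: normalize three samples that use different units
for sample in 512KiB 45.2MiB 1.5GiB; do
  echo "$sample = $(to_mib "$sample") MiB"
done
```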
Resource Measurement Script
```bash
#!/bin/bash
IMAGE=$1
DURATION=${2:-300}  # 5 minutes

CONTAINER_NAME="resource-test-$(date +%s)"
STATS_FILE=/tmp/resource-stats.txt

echo "💾 Measuring resource usage for $IMAGE"

# Get image size (human-readable, as printed by `docker images`)
IMAGE_SIZE=$(docker images --format '{{.Size}}' "$IMAGE" | head -1)
echo "Image size: $IMAGE_SIZE"

# Start container
docker run -d \
  --name "$CONTAINER_NAME" \
  -p 8080:8080 \
  "$IMAGE" > /dev/null

sleep 3

# Collect resource metrics every 5 seconds
echo "Monitoring for ${DURATION}s..."
: > "$STATS_FILE"

for i in $(seq 0 5 "$DURATION"); do
  # One line per sample: "<cpu%> <memory usage>", e.g. "0.15% 45.2MiB"
  docker stats --no-stream --format '{{.CPUPerc}} {{.MemUsage}}' "$CONTAINER_NAME" \
    | awk '{print $1, $2}' >> "$STATS_FILE"
  sleep 5
done

# Extract peak and average values.
# Note: the average assumes every sample is reported in MiB.
PEAK_MEMORY=$(awk '{print $2}' "$STATS_FILE" | sort -h | tail -1)
AVG_MEMORY=$(awk '{print $2}' "$STATS_FILE" | \
  sed 's/[^0-9.]*//g' | \
  awk '{sum+=$1; count++} END {printf "%.0f", sum/count}')
PEAK_CPU=$(awk '{print $1}' "$STATS_FILE" | sort -n | tail -1)

echo ""
echo "📊 Results:"
echo "  Image size:  $IMAGE_SIZE"
echo "  Peak memory: $PEAK_MEMORY"
echo "  Avg memory:  ${AVG_MEMORY}MiB"
echo "  Peak CPU:    $PEAK_CPU"
echo ""

# Store baseline
mkdir -p baselines
cat > "baselines/resources-${IMAGE//\//-}.json" << EOF
{
  "image": "$IMAGE",
  "date": "$(date -u +%Y-%m-%dT%H:%M:%SZ)",
  "image_size": "$IMAGE_SIZE",
  "peak_memory": "$PEAK_MEMORY",
  "avg_memory_mb": $AVG_MEMORY,
  "peak_cpu": "$PEAK_CPU"
}
EOF

# Cleanup
docker rm -f "$CONTAINER_NAME" > /dev/null

echo "✅ Results saved to baselines/resources-${IMAGE//\//-}.json"
```

4. Garbage Collection & Runtime-Specific Metrics
Why it matters: For managed-memory languages (Java, Python, Node), GC pauses directly impact P99 latency and user experience.
Java/JVM Metrics
| Metric | Tool | Target | Warning Threshold |
|---|---|---|---|
| GC pause (max) | jstat, async-profiler | <100ms | >150ms |
| Full GC frequency | jstat | <1/minute | >2/minute |
| Heap utilization | jstat | 60–80% | >90% sustained |
| Startup time | time | <5s | >10% increase |
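The Full GC frequency row can be derived from two readings of the FGC counter (the cumulative number of full GC events) that `jstat -gcutil` prints. The sketch below assumes you have already captured the counter at the start and end of an observation window; the function name `full_gc_per_minute` and the sample numbers are illustrative.

```bash
#!/bin/bash
# full_gc_per_minute FGC_START FGC_END SECONDS: full GC events per minute
# over an observation window of SECONDS seconds.
full_gc_per_minute() {
  awk -v a="$1" -v b="$2" -v s="$3" 'BEGIN { printf "%.2f\n", (b - a) * 60 / s }'
}

# Example: the counter went from 4 to 10 during a 120-second window
rate=$(full_gc_per_minute 4 10 120)
if awk -v r="$rate" 'BEGIN { exit !(r > 2) }'; then
  echo "Full GC rate ${rate}/min exceeds the 2/min warning threshold"
fi
```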
Baseline Establishment Workflow
Step 1: Prepare Test Environment (30 min)
```bash
# 1. Isolate test environment (no other containers/processes)
# WARNING: this removes every container on the host, running or stopped.
docker ps -aq | xargs -r docker rm -f

# 2. Warm up system (stabilize CPU frequency, page cache)
for i in {1..5}; do
  docker run --rm registry.cleanstart.com/py:3.12 python -c "print('warmup')"
done

# 3. Install benchmarking tools (ab is shipped in apache2-utils)
apt-get update
apt-get install -y wrk apache2-utils

# 4. Create baseline directory
mkdir -p baselines
```

Step 2: Run Baseline Tests (1.5 hours)
Run the startup, HTTP load, and resource measurement scripts defined above.
Step 3: Document and Store (15 min)
```bash
#!/bin/bash
IMAGE=$1
TIMESTAMP=$(date -u +%Y-%m-%d)

# Write a report template; fill in the XXX values from the baseline JSON files.
# The heredoc delimiter is unquoted so that $IMAGE and $TIMESTAMP are expanded.
cat > "baselines/report-${IMAGE//\//-}-${TIMESTAMP}.md" << EOF
# Performance Baseline Report

## Image
$IMAGE

## Date
$TIMESTAMP

## Metrics

### Startup Performance
- Average: XXX ms
- P95: XXX ms

### Runtime Performance
- P50 latency: XXX ms
- P95 latency: XXX ms
- Throughput: XXX RPS

### Resource Usage
- Image size: XXX MB
- Memory (idle): XXX MB
- Memory (loaded): XXX MB

## Baseline Thresholds
- Startup regression warning: >10%
- Latency regression warning: >15%
- Memory regression warning: >20%
- Throughput regression warning: >10%

## Next Steps
- Compare future image versions against these baselines
- Alert if any metric exceeds warning threshold
- Investigate regressions before production deployment
EOF

# Store in version control
git add baselines/
git commit -m "Add performance baseline for $IMAGE on $TIMESTAMP"
git push

echo "✅ Baseline report saved and committed"
```

Detecting Regressions: Before vs. After Comparison
When you upgrade an image, for example from 3.12.1 to 3.12.2, rerun the baseline tests against the new image and compare the results with the stored baselines for the old one.
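The measurement scripts name their output files by replacing every `/` in the image reference with `-`, so a comparison first has to locate the old and new files by the same rule. The sketch below shows that mapping; the helper name `baseline_path` and the two image tags are hypothetical examples.

```bash
#!/bin/bash
# baseline_path KIND IMAGE: path of the baseline file for an image, using the
# same "/" -> "-" substitution as the measurement scripts (KIND is startup,
# load, or resources).
baseline_path() {
  local kind=$1 image=$2
  echo "baselines/${kind}-${image//\//-}.json"
}

# Example: check that both baselines exist before comparing them
OLD_IMAGE="registry.cleanstart.com/py:3.12.1"   # assumed tag, for illustration
NEW_IMAGE="registry.cleanstart.com/py:3.12.2"   # assumed tag, for illustration
for image in "$OLD_IMAGE" "$NEW_IMAGE"; do
  file=$(baseline_path startup "$image")
  [ -f "$file" ] || echo "Missing baseline: $file" >&2
done
```

Note that the tag separator `:` stays in the file name, which is fine on Linux but may need replacing on filesystems that disallow it.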
Automated Regression Detection Script
```bash
#!/bin/bash
OLD_IMAGE=$1
NEW_IMAGE=$2
REGRESSION_REPORT="regression-report-$(date +%Y%m%d).md"
REGRESSIONS=0

{
  echo "# Regression Test Report"
  echo ""
  echo "## Comparison: $OLD_IMAGE → $NEW_IMAGE"
  echo ""
} > "$REGRESSION_REPORT"

# Function to compare metrics (assumes lower is better for every metric passed in)
compare_metrics() {
  local METRIC=$1
  local OLD_VALUE=$2
  local NEW_VALUE=$3
  local THRESHOLD=$4  # Percentage
  local CHANGE SYMBOL

  if [ -z "$OLD_VALUE" ] || [ -z "$NEW_VALUE" ]; then
    return
  fi

  # Calculate percent change. Multiply before dividing: with scale=1,
  # dividing first would truncate a ratio such as 0.15 to 0.1.
  CHANGE=$(echo "scale=1; ($NEW_VALUE - $OLD_VALUE) * 100 / $OLD_VALUE" | bc)

  if (( $(echo "$CHANGE <= 0" | bc -l) )); then
    SYMBOL="✅"   # better (faster/smaller) or unchanged
  elif (( $(echo "$CHANGE > $THRESHOLD" | bc -l) )); then
    SYMBOL="❌"   # worse, beyond the threshold
    REGRESSIONS=$((REGRESSIONS + 1))
  else
    SYMBOL="⚠️"   # worse (slower/larger), within the threshold
  fi

  echo "$SYMBOL $METRIC: ${OLD_VALUE} → ${NEW_VALUE} (${CHANGE}%)" >> "$REGRESSION_REPORT"
}

# Load baseline files
OLD_STARTUP=$(jq '.average_ms' "baselines/startup-${OLD_IMAGE//\//-}.json")
NEW_STARTUP=$(jq '.average_ms' "baselines/startup-${NEW_IMAGE//\//-}.json")
compare_metrics "Startup time (ms)" "$OLD_STARTUP" "$NEW_STARTUP" 10

echo ""
echo "📊 Full regression report: $REGRESSION_REPORT"

# Exit non-zero when a threshold was exceeded so CI can block the deployment
if [ "$REGRESSIONS" -gt 0 ]; then
  exit 1
fi
```

Performance Comparison: CleanStart vs. Alternatives
Real-world benchmarks showing CleanStart vs. Alpine vs. Ubuntu vs. Distroless.
Startup Time Comparison
| Image | Cold Start | Warm Start | Time to Health |
|---|---|---|---|
| CleanStart py:3.12 | 1.2s | 0.8s | 1.0s |
| Alpine py:3.12 | 1.8s | 1.2s | 1.5s |
| Ubuntu py:3.12 | 2.5s | 1.8s | 2.0s |
| Distroless py:3.12 | 0.9s | 0.6s | 0.8s |

Note: Distroless is faster because it has no shell/debuggers. CleanStart's cold start takes about 33% less time than Alpine's (1.2s vs. 1.8s) while including security features Distroless lacks.
Image Size Comparison
| Image | Compressed | Uncompressed | % Smaller than Ubuntu (uncompressed) |
|---|---|---|---|
| CleanStart py:3.12 | 45 MB | 120 MB | 82% |
| Alpine py:3.12 | 38 MB | 105 MB | 85% |
| Ubuntu py:3.12 | 210 MB | 680 MB | n/a |
| Distroless py:3.12 | 22 MB | 62 MB | 91% |

Memory (Idle) Comparison
| Image | RSS Memory | VSZ Memory | Notes |
|---|---|---|---|
| CleanStart py:3.12 | 45 MB | 180 MB | Clean, minimal |
| Alpine py:3.12 | 32 MB | 140 MB | Musl reduces RSS |
| Ubuntu py:3.12 | 120 MB | 420 MB | Full glibc + utils |
| Distroless py:3.12 | 28 MB | 110 MB | Smallest footprint |

Throughput Under Load (1000 RPS, 100 concurrent)
| Image | Throughput | P95 Latency | Error Rate |
|---|---|---|---|
| CleanStart py:3.12 | 1200 RPS | 85 ms | 0.02% |
| Alpine py:3.12 | 1150 RPS | 92 ms | 0.03% |
| Ubuntu py:3.12 | 1050 RPS | 120 ms | 0.15% |
| Distroless py:3.12 | 1210 RPS | 82 ms | 0.01% |

CI/CD Integration: Automated Baseline Testing
GitHub Actions Workflow
```yaml
name: Performance Baseline

on:
  schedule:
    - cron: '0 2 * * *'  # Daily at 2 AM UTC
  workflow_dispatch:
    inputs:
      image:
        description: 'Image to baseline'
        required: true

env:
  # Scheduled runs carry no inputs, so fall back to a default image.
  # The default below is an assumption; set it to the image you track.
  IMAGE: ${{ github.event.inputs.image || 'registry.cleanstart.com/py:3.12' }}

jobs:
  baseline:
    runs-on: ubuntu-latest
    timeout-minutes: 120

    steps:
      - uses: actions/checkout@v4

      - name: Establish baseline
        run: |
          ./scripts/performance-baseline.sh "$IMAGE" 300

      - name: Compare against previous
        id: compare
        run: |
          ./scripts/compare-baselines.sh "$IMAGE"
        continue-on-error: true

      - name: Upload results
        uses: actions/upload-artifact@v4
        with:
          name: baseline-results
          path: baselines/

      # continue-on-error hides the failure from failure(), so test the step outcome instead
      - name: Report regression (if detected)
        if: steps.compare.outcome == 'failure'
        run: |
          echo "⚠️ Performance regression detected. Review results." >> "$GITHUB_STEP_SUMMARY"
```

Best Practices
1. Test in Production-Like Conditions
Test on the same hardware as production (CPU type, memory, network), simulate real traffic patterns, and include all external dependencies (database, cache, message queue).
2. Run Multiple Iterations
Run a minimum of 10 iterations per metric to account for system variance. Calculate median and p95 (not just average). Track raw data for future analysis.
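To judge whether ten iterations are enough, it helps to look at the spread of the runs as well as their center. The sketch below computes the mean, sample standard deviation, and coefficient of variation from one timing per line; the function name `run_stats` and the sample timings are illustrative, and what counts as an acceptable spread is left to your own judgment.

```bash
#!/bin/bash
# run_stats: read one timing per line on stdin and print
# "mean stddev cv_percent" (sample standard deviation, n - 1 denominator).
run_stats() {
  awk '{ sum += $1; sumsq += $1 * $1; n++ }
       END {
         if (n < 2) exit 1
         mean = sum / n
         var = (sumsq - n * mean * mean) / (n - 1)
         if (var < 0) var = 0            # guard against rounding error
         sd = sqrt(var)
         printf "%.1f %.1f %.1f\n", mean, sd, sd * 100 / mean
       }'
}

# Example with four made-up startup timings in milliseconds
read -r mean sd cv < <(printf '%s\n' 100 110 90 100 | run_stats)
echo "mean=${mean}ms stddev=${sd}ms cv=${cv}%"
```

A high coefficient of variation suggests adding iterations or reducing system noise before trusting a percent-change threshold.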
3. Version Everything
Store baselines alongside application code in Git. Tag baseline commits with version numbers. Use semantic versioning (baseline-v1.2.3).
4. Alert on Regressions
Set thresholds (e.g., >15% latency increase = alert), integrate with Slack/PagerDuty, and block production deployments if thresholds are exceeded.
5. Document Assumptions
Record hardware specs (CPU model, memory capacity). Note system state (other processes, network conditions). Explain any anomalies or outliers.
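One way to make these assumptions hard to forget is to capture them automatically next to each baseline. The sketch below prints a small JSON snapshot of the test host; the function name `env_snapshot` and the chosen fields are assumptions, and `/proc/meminfo` is Linux-specific (the field falls back to `unknown` elsewhere).

```bash
#!/bin/bash
# env_snapshot: print a one-line JSON description of the test host.
env_snapshot() {
  local cpus mem_kb
  cpus=$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo "unknown")
  mem_kb=$(awk '/^MemTotal:/ { print $2 }' /proc/meminfo 2>/dev/null || true)
  printf '{"date": "%s", "kernel": "%s", "cpus": "%s", "mem_total_kb": "%s"}\n' \
    "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$(uname -r)" "$cpus" "${mem_kb:-unknown}"
}

# Example: print the snapshot (redirect it into the baselines directory to keep it)
env_snapshot
```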
6. Refresh Baselines Quarterly
Run new baselines on current hardware. Recalibrate thresholds as infrastructure evolves. Archive old baselines for historical comparison.
Troubleshooting Common Issues
Issue: High Variance Between Runs
Cause: System noise, background processes, CPU frequency scaling
Solution:
```bash
# Disable CPU frequency scaling (Linux)
echo "performance" | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor

# Kill background services
sudo systemctl stop snapd bluetooth

# Run test on isolated system or Kubernetes node
```

Issue: Memory Grows Unbounded
Cause: Memory leak in application or runtime
Solution:
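```bash
# Check for a leak by running with a hard memory limit and no extra swap
docker run -m 512m --memory-swap 512m "$IMAGE"

# If the container is OOM-killed, the application likely has a leak.
# Check -Xmx against the container limit for Java, inspect Python objects, profile Node.js
```

Issue: Network Latency Affects Results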
Cause: Registry pulls are slow, external dependencies respond slowly
Solution:
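```bash
# Pre-warm layer cache
docker pull "$IMAGE"

# Use a local registry mirror. This is a Docker daemon setting, not a
# `docker run` flag: add the following to /etc/docker/daemon.json
#   { "registry-mirrors": ["http://local-mirror:5000"] }
# then restart the daemon.
sudo systemctl restart docker

# Test only application logic (not I/O)
```

What to Read Next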
Regression Testing Strategy: Full guide for regression testing across image upgrades.

Image Catalog Reference: Performance characteristics of all available images.

Monitoring & Observability: Production performance monitoring setup.

Troubleshooting: Performance Issues: Solutions for production slowdowns.
