Document Version: 1.0
Last Updated: 2026-03-22
Classification: Internal - CISO/Incident Response Use
Approval: CISO, VP Engineering, General Counsel
Drill Cadence: Quarterly
1. Scope
This playbook defines incident response procedures for supply chain security incidents affecting CleanStart-managed container infrastructure. It addresses eight critical categories of supply chain threats that organizations must be prepared to handle effectively. The first covers compromised base images where a vulnerability or malware has been introduced to an image pulled from the registry.cleanstart.com infrastructure. The second concerns compromised dependencies, which involve security vulnerabilities in application libraries such as npm packages, Python pip modules, Go modules, and similar dependencies included in software bills of materials. The third category addresses compromised build infrastructure, where attackers have modified the build pipeline, injected malicious code, or forged cryptographic signatures. The fourth involves compromised signing keys where Cosign or Sigstore credentials have been stolen or exposed. The fifth covers compromised registry credentials where attackers have pushed malicious images using legitimate digest values. The sixth describes broader supply chain attacks where upstream vendors such as Docker, Kubernetes, or compiler toolchains have been compromised, potentially affecting multiple customers simultaneously. The seventh addresses runtime exploitation where attackers leverage vulnerabilities discovered in deployed containers. The eighth involves compliance breaches where images violating security policies are deployed without proper validation.
Out of scope: This playbook does NOT cover host or operating system security incidents involving virtual machine compromise or Kubernetes node compromise. It excludes network security incidents such as distributed denial-of-service attacks or lateral movement within clusters. Data breaches involving database compromise or credential theft unrelated to container supply chains fall outside this scope. General incident response procedures covered by the master incident response playbook are also excluded from this specific playbook.
2. Incident Severity Classification
P1 - Critical (0-30 minutes to containment)
Criteria:
A P1 Critical incident involves an actively exploited vulnerability in a production image with confirmed evidence of compromise. This classification also applies when signing keys used for image verification have been compromised, giving attackers the ability to forge legitimate-appearing images. Malware detected in a running container represents another P1 scenario. P1 incidents have a blast radius affecting more than 1,000 pods, spanning more than 10 environments, and impacting more than 100 customers. These incidents involve known active attacks with public exploits available or explicit vendor warnings issued.
Examples:
Consider a scenario where the Log4Shell vulnerability exists in a production image and attackers are actively attempting remote code execution. Alternatively, if Cosign signing keys have been leaked and attackers are confirmed to be pushing malicious images, this constitutes P1. Similarly, if Falco runtime monitoring detects a reverse shell being spawned inside a container, immediate P1 escalation is required.
Escalation: CISO, VP Eng, General Counsel, incident response commander, on-call SRE
SLA: Declare incident and begin containment procedures within 15 minutes
P2 - High (1-4 hours to containment)
Criteria:
A P2 High incident involves a vulnerability in a production image that has not yet been exploited but represents a significant exploitable risk. These incidents involve CVSS scores of 9.0 or higher, or represent well-known exploitation techniques. The blast radius for P2 incidents ranges from 100 to 1,000 pods across 2 to 10 environments. Vendor advisories with available proofs of concept or active exploitation reports warrant P2 classification.
Examples:
Scenarios qualifying for P2 include a critical RCE vulnerability (CVE-2024-XXXX) in base images that were deployed 7 days ago. A dependency with a published exploit that is currently deployed in staging or production environments also qualifies as P2. If CleanSight detects an image that is 120 days old and contains critical CVEs, this also merits P2 classification.
Escalation: CISO, VP Eng, on-call SRE, product security lead
SLA: Complete assessment and prepare containment plan within 2 hours
P3 - Medium (1-3 days to remediation)
Criteria:
P3 Medium incidents involve vulnerabilities in production images with CVSS scores ranging from 7.0 to 8.9. These incidents show no evidence of active exploitation. The blast radius affects 10 to 100 pods in one or two environments. Standard SLA applies, meaning high-severity CVEs receive a 7-day remediation window.
Examples:
High-severity vulnerabilities in application dependencies qualify as P3. Base images requiring security patches but showing no zero-day evidence are classified as P3. Policy violations such as detection of unsigned image deployments are also P3 incidents.
Escalation: VP Eng, product security lead, SRE team lead
SLA: Complete verification of impact and prioritization in the development backlog within 24 hours
P4 - Low (Standard backlog)
Criteria:
P4 Low severity incidents involve vulnerabilities in non-critical images with CVSS scores between 4.0 and 6.9. These incidents have no production impact, affecting only staging or development environments. The blast radius remains below 10 pods in a single environment and can be addressed during the regular release cycle.
Examples:
Medium-severity CVEs discovered in test or CI images qualify as P4. Low-severity vulnerabilities in optional libraries are classified as P4. Informational security findings related to configuration hardening also fall into this category.
Escalation: Product security lead, engineering team
SLA: Address using the standard CVE SLA where medium-severity issues receive a 30-day remediation window
3. PHASE 1: Detection and Triage (0-30 minutes)
3.1 Alert Sources
Supply chain security incidents can be detected through both automated monitoring systems and manual external reports. Understanding these alert sources and their reliability helps responders quickly validate incidents during the critical early minutes.
Automated Sources:
Vulnerability scanners such as Trivy and Snyk automatically alert when image scan results identify critical or high-severity findings. CleanSight monitors for outdated images that are more than 30 days old and have newer upstream versions available. Falco eBPF runtime monitoring generates alerts when suspicious behavior occurs, including unexpected process execution, unauthorized file modifications, or unusual capability usage. Image registry webhooks create alerts whenever new images are pushed or existing images are modified. SBOM diff monitoring systems detect unexpected dependency additions or removals that might indicate unauthorized changes. Cosign verification failures alert when image signatures cannot be validated or key rotation becomes necessary.
Manual and External Sources:
Vendor advisories such as OpenSSL or Log4j security notices provide critical advance warning of vulnerabilities. Public CVE disclosures published by NVD and GitHub Security Advisories must be monitored continuously. Security researchers may report vulnerabilities through public channels including Twitter, HackerNews, or direct email. Internal security testing activities including SAST and DAST scans, as well as penetration testing findings, may surface supply chain vulnerabilities. Customers sometimes report suspicious activity through support channels. Law enforcement and threat intelligence agencies may provide warnings in rare situations involving known active attacks.
3.2 Initial Triage Checklist (First 5 minutes)
Upon receipt of an alert, the responder must systematically answer the following critical questions:
[ ] Is the alert valid? The responder must confirm this is not a false positive and not a test system. For scanner alerts, verify the image is actually deployed in production. For Falco alerts, assess whether the process or file modification is suspicious given the application's normal behavior. For vendor advisories, confirm the vulnerability applies to the organization's technology stack.
[ ] What is the affected component? The responder must identify the specific component with high precision. This includes the complete image name and digest, for example registry.cleanstart.com/base-python:3.11.2@sha256:abc.... Record the specific vulnerability identifier such as a CVE number or the package name. If malware is suspected, document the malware signature and suspicious behavior patterns.
[ ] Where is it deployed? Conduct a rapid blast radius assessment to understand the scope. Determine whether the issue affects all clusters or only specific subsets. Identify the environments involved, whether production, staging, development, or all three. Count the affected pods, namespaces, and teams to establish the impact scope.
[ ] How was it detected? Document the detection method to assess alert reliability. For scanner-based alerts, record which tool generated the alert and which vulnerability database version was used. For Falco alerts, note the specific rule that triggered, the timestamp of the alert, and any anomaly score assigned. For vendor-based reports, distinguish between official vendor advisories and unconfirmed community reports.
[ ] Do we have a baseline SBOM? Prepare for the eradication phase by obtaining the software bill of materials from a known-good image version. Compare the current image's SBOM with the baseline to identify unexpected changes and differentiate legitimate updates from potentially malicious modifications.
If any of these answers is unknown, immediate escalation is required.
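The baseline SBOM comparison in the last checklist item can be sketched roughly as follows. This is a minimal illustration, assuming both SBOMs are CycloneDX-style JSON with a top-level `components` list carrying `name` and `version`; the function names and the sample package data are hypothetical and not part of the playbook.

```python
import json


def load_components(sbom_json: str) -> dict[str, str]:
    """Map package name -> version from a CycloneDX-style SBOM (assumed format)."""
    doc = json.loads(sbom_json)
    return {c["name"]: c.get("version", "") for c in doc.get("components", [])}


def diff_sboms(baseline: dict[str, str], current: dict[str, str]) -> dict[str, list]:
    """Report packages added, removed, or changed relative to the baseline."""
    added = sorted(set(current) - set(baseline))
    removed = sorted(set(baseline) - set(current))
    changed = sorted(
        (name, baseline[name], current[name])
        for name in set(baseline) & set(current)
        if baseline[name] != current[name]
    )
    return {"added": added, "removed": removed, "changed": changed}


# Hypothetical sample data for illustration only.
baseline = load_components(
    '{"components": [{"name": "openssl", "version": "3.0.1"},'
    ' {"name": "zlib", "version": "1.2.13"}]}'
)
current = load_components(
    '{"components": [{"name": "openssl", "version": "3.0.2"},'
    ' {"name": "xmrig", "version": "6.21.0"}]}'
)
report = diff_sboms(baseline, current)
print(report)
```

Entries under `added` are the ones a responder would likely review first, since an unexpected new package is harder to explain as a routine update than a version bump.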
3.3 Severity Classification
Incidents must be assigned a severity level using a systematic scoring algorithm based on multiple factors. The CVSS score accounts for 40 percent of the assessment. The blast radius measured in affected pods and customers accounts for 30 percent. The exploitability factor considering proof-of-concept availability and evidence of active exploitation accounts for 20 percent. The business impact assessment including potential data exposure, system availability implications, and compliance violations accounts for the remaining 10 percent.
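As a rough illustration of the weighting described above, the sketch below combines the four factors into a single score. The playbook does not say how each factor is normalized, so the sketch assumes every factor has already been scaled to the range 0.0 to 1.0; the function name, the factor keys, and the sample values are hypothetical.

```python
# Weights taken from the text: CVSS 40%, blast radius 30%,
# exploitability 20%, business impact 10%.
WEIGHTS = {
    "cvss": 0.40,
    "blast_radius": 0.30,
    "exploitability": 0.20,
    "business_impact": 0.10,
}


def weighted_score(factors: dict[str, float]) -> float:
    """Combine factor scores (each assumed normalized to 0.0-1.0) into one score."""
    score = 0.0
    for name, weight in WEIGHTS.items():
        value = factors[name]  # raises KeyError if a factor is missing
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be normalized to 0.0-1.0, got {value}")
        score += weight * value
    return score


# Hypothetical example: CVSS 9.8, moderate blast radius, public exploit,
# limited business impact.
score = weighted_score(
    {
        "cvss": 9.8 / 10.0,
        "blast_radius": 0.5,
        "exploitability": 1.0,
        "business_impact": 0.2,
    }
)
print(round(score, 3))
```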
Scoring algorithm:
A combination of CVSS 9.0-10.0 with any blast radius warrants P1 classification. CVSS 7.0-8.9 with more than 100 affected pods is classified as P1. The same CVSS range with fewer than 100 pods is classified as P2. CVSS 4.0-6.9 with any blast radius is classified as P3. CVSS scores below 4.0 receive P4 classification.
If the severity assessment indicates P1 or P2, escalate to the CISO immediately.
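The threshold rules above can be expressed as a small function. This is a sketch of the stated rules only, not an official implementation; the function names are hypothetical, and a pod count of exactly 100 is treated as P2 here, which is an assumption.

```python
def classify_severity(cvss: float, affected_pods: int) -> str:
    """Apply the CVSS and pod-count thresholds from the scoring rules."""
    if not 0.0 <= cvss <= 10.0:
        raise ValueError(f"CVSS must be between 0.0 and 10.0, got {cvss}")
    if cvss >= 9.0:
        return "P1"  # any blast radius
    if cvss >= 7.0:
        # Assumption: exactly 100 pods falls on the P2 side.
        return "P1" if affected_pods > 100 else "P2"
    if cvss >= 4.0:
        return "P3"  # any blast radius
    return "P4"


def requires_ciso_escalation(severity: str) -> bool:
    """P1 and P2 are escalated to the CISO immediately."""
    return severity in {"P1", "P2"}


severity = classify_severity(cvss=7.5, affected_pods=250)
print(severity, requires_ciso_escalation(severity))
```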
3.4 Blast Radius Determination
Once an incident is confirmed, responders must systematically identify all affected resources to understand the full scope of the incident. The following kubectl commands facilitate rapid blast radius assessment:
    # If a container image is affected (e.g., registry.cleanstart.com/app:v1.2.3)
    kubectl get pods --all-namespaces \
      -o jsonpath='{..image}' | \
      tr ' ' '\n' | grep -E "(app:v1\.2\.3|app@sha256:...)" | sort | uniq

    # Count by cluster
    kubectl get nodes --show-labels | grep <cluster-name>

    # Count by team (using label); jsonpath prints one line, so split before counting
    kubectl get pods --all-namespaces -l team=backend \
      -o jsonpath='{..image}' | tr ' ' '\n' | grep -c <affected-image>

    # Count by environment
    kubectl get pods --all-namespaces -l env=prod \
      -o jsonpath='{..image}' | tr ' ' '\n' | grep -c <affected-image>

    # Get detailed info for targeted remediation
    # (namespace and name live under .metadata; any() avoids duplicate rows for multi-container pods)
    kubectl get pods --all-namespaces -o json | \
      jq '.items[] | select(any(.spec.containers[]; .image | contains("affected-image"))) | {namespace: .metadata.namespace, name: .metadata.name, images: [.spec.containers[].image]}'

Document the findings in the incident ticket with the following information: Total count of affected pods organized by cluster, environment, and team. Estimated number of customers impacted by the incident. List of specific services and applications affected. Assessment of data access privileges, including whether the service has access to personally identifiable information, payment card data, or other sensitive information.
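For the incident ticket, the per-environment and per-team counts can also be derived in one pass from the output of `kubectl get pods --all-namespaces -o json`. The sketch below is illustrative: the function name and sample pods are hypothetical, the `env` and `team` labels are the ones used in the commands above, and init containers are not inspected.

```python
import json
from collections import Counter


def summarize_blast_radius(pod_list_json: str, image_fragment: str) -> dict[str, Counter]:
    """Count pods running a matching image, grouped by namespace, env, and team."""
    summary = {"namespace": Counter(), "env": Counter(), "team": Counter()}
    for pod in json.loads(pod_list_json).get("items", []):
        containers = pod.get("spec", {}).get("containers", [])
        if not any(image_fragment in c.get("image", "") for c in containers):
            continue
        meta = pod.get("metadata", {})
        labels = meta.get("labels") or {}
        summary["namespace"][meta.get("namespace", "unknown")] += 1
        summary["env"][labels.get("env", "unlabeled")] += 1
        summary["team"][labels.get("team", "unlabeled")] += 1
    return summary


# Hypothetical pod list standing in for real kubectl output.
sample = json.dumps({"items": [
    {"metadata": {"namespace": "payments", "name": "api-1",
                  "labels": {"env": "prod", "team": "backend"}},
     "spec": {"containers": [{"image": "registry.cleanstart.com/app:v1.2.3"}]}},
    {"metadata": {"namespace": "payments", "name": "api-2",
                  "labels": {"env": "prod"}},
     "spec": {"containers": [{"image": "registry.cleanstart.com/app:v1.2.3"},
                             {"image": "registry.cleanstart.com/sidecar:v2"}]}},
    {"metadata": {"namespace": "web", "name": "ui-1",
                  "labels": {"env": "staging", "team": "frontend"}},
     "spec": {"containers": [{"image": "registry.cleanstart.com/ui:v4"}]}},
]})
summary = summarize_blast_radius(sample, "app:v1.2.3")
print(summary)
```

Pods without a `team` or `env` label are counted as "unlabeled" rather than dropped, so gaps in labeling show up in the ticket instead of silently shrinking the blast radius.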
4. PHASE 2: Containment (30 min - 2 hours)
4.1 Immediate Containment (First 30 minutes)
Goal: Stop the incident from spreading further and prevent active exploitation.
Step 1: Declare Incident
[ ] Create an incident ticket in Jira or SNOW with all critical information. Include the severity level (P1-P4) and a brief description of the incident. Document affected resources including image digest, affected pods, and impacted clusters. Set the status to "Containment in progress". Establish update frequency at every 15 minutes for P1 incidents, every hour for P2 incidents, and every 3 hours for P3 incidents.
[ ] Notify all relevant stakeholders through Slack, email, or in-person command post. For all incidents, notify the CISO, VP Engineering, the incident response commander, the product security lead, and the on-call SRE. For P1 incidents specifically, escalate immediately to General Counsel, and inform CEO/COO through the CISO.
Step 2: Prevent New Deployments (Network + Policy)
Option A: Block at admission control (fastest, non-disruptive to existing pods)
[ ] Update the OPA/Gatekeeper policy to block the affected image by its digest.
    # Example Gatekeeper rule
    violation[{"msg": m}] {
      image := input.review.object.spec.containers[_].image
      image == "registry.cleanstart.com/app@sha256:badhash..."
      m := "Deployment blocked: image contains critical vulnerability (incident #12345)"
    }
[ ] Apply the policy update immediately, which takes effect for all new deployments.
[ ] Verify the policy is working by attempting to deploy the affected image; the deployment should be rejected with the blocking message.
Option B: Block at registry (if the image is actively spreading)
[ ] Tag the compromised image as quarantined, for example by adding a security:quarantine tag.
[ ] Configure registry webhooks to reject any pull attempts for quarantined images.
[ ] Notify all teams via the security Slack channel of the quarantine action.
Option C: Take registry offline (only in rare circumstances with evidence of active compromise)
[ ] Coordinate immediately with VP Engineering and the CISO before taking this action.
[ ] Redirect all image pull requests to an alternative registry if available.
[ ] Document the business justification for this extreme measure in the incident ticket.
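The example rule in Option A compares the whole image string, so a reference that carries both a tag and the same digest would not match it. The sketch below shows one way to reason about matching on the digest alone. It is an illustration in Python rather than Rego; the function names and the all-zero digest are hypothetical, and tag-only references cannot be matched this way without first resolving them against the registry.

```python
from typing import Optional


def image_digest(image_ref: str) -> Optional[str]:
    """Return the digest pinned in an image reference, or None if it is tag-only."""
    _, sep, digest = image_ref.partition("@")
    return digest if sep and digest else None


def is_blocked(image_ref: str, blocked_digests: set[str]) -> bool:
    """True when the reference pins a digest that is on the blocklist."""
    digest = image_digest(image_ref)
    return digest is not None and digest in blocked_digests


# Hypothetical placeholder digest, not a real image.
BAD_DIGEST = "sha256:" + "0" * 64
BLOCKED = {BAD_DIGEST}

print(is_blocked("registry.cleanstart.com/app@" + BAD_DIGEST, BLOCKED))
print(is_blocked("registry.cleanstart.com/app:v1.2.3@" + BAD_DIGEST, BLOCKED))
print(is_blocked("registry.cleanstart.com/app:v1.2.3", BLOCKED))
```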
Step 3: Isolate Affected Pods (Stopping exploitation)
For P1 incidents with evidence of active exploitation or malware:
[ ] Apply network isolation policies to restrict outbound network access from affected pods. Example NetworkPolicy:
    apiVersion: networking.k8s.io/v1
    kind: NetworkPolicy
    metadata:
      name: incident-isolation-12345
      namespace: production
    spec:
      podSelector:
        matchLabels:
          app: vulnerable-app
      policyTypes:
      - Egress
      egress:
      - to:
        - namespaceSelector:
            matchLabels:
              name: kube-system
        ports:
        # DNS normally uses UDP; TCP is kept for large responses
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53
[ ] Apply the policy immediately and verify with kubectl get networkpolicies.
[ ] Pods remain running to preserve forensic evidence but cannot contact external systems.
For P2 incidents with confirmed exploitation attempts:
[ ] Apply the same network isolation as P1 incidents.
For P3-P4 incidents with vulnerabilities but no exploitation evidence:
[ ] Do NOT apply isolation; proceed directly to the remediation phase.
[ ] Reduce Falco alert verbosity for these pods to prevent alert fatigue.
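When several workloads need isolating under the same incident, generating the manifest per workload avoids copy-and-paste errors. The sketch below builds an equivalent policy as a dictionary and prints it as JSON, which `kubectl apply -f` also accepts. The function name and parameters are hypothetical; the policy allows only DNS on port 53 (UDP and TCP) to namespaces labeled `name: kube-system`, and that label is assumed to exist on the namespace.

```python
import json


def build_isolation_policy(incident_id: str, namespace: str, app_label: str) -> dict:
    """Build an egress-only NetworkPolicy that permits DNS to kube-system and nothing else."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {
            "name": f"incident-isolation-{incident_id}",
            "namespace": namespace,
        },
        "spec": {
            "podSelector": {"matchLabels": {"app": app_label}},
            "policyTypes": ["Egress"],
            "egress": [
                {
                    "to": [
                        {"namespaceSelector": {"matchLabels": {"name": "kube-system"}}}
                    ],
                    "ports": [
                        {"protocol": "UDP", "port": 53},
                        {"protocol": "TCP", "port": 53},
                    ],
                }
            ],
        },
    }


policy = build_isolation_policy("12345", "production", "vulnerable-app")
print(json.dumps(policy, indent=2))
```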
Step 4: Enable Forensic Collection (Evidence preservation)
[ ] Increase Falco monitoring sensitivity for affected pods to capture detailed system call information and file operations.
    kubectl set env -n falco ds/falco \
      FALCO_RULE_FILE=rules-incident-12345.yaml
[ ] Capture complete container logs before the pods are rolled back or terminated.
    kubectl logs <affected-pod> -n <namespace> > /tmp/pod-logs-incident-12345.txt
    kubectl logs <affected-pod> -n <namespace> --previous >> /tmp/pod-logs-incident-12345.txt
[ ] Snapshot the container filesystem if P1 or malware is suspected.
    kubectl debug <affected-pod> -n <namespace> -it --image=ubuntu --target=<container-name> -- bash
    # Inside the debug container: with --target the process namespace is shared, so the
    # target's filesystem is reachable at /proc/1/root (assuming PID 1 is the target's
    # main process and the debug user is allowed to read it). Archiving / alone would
    # capture only the debug container's own filesystem.
    tar czf /tmp/container-fs-snapshot.tar.gz --exclude=./proc --exclude=./sys -C /proc/1/root . 2>/dev/null
[ ] Collect the complete Falco alert timeline for the past 24 hours.
    grep "affected-pod" /var/log/falco/alerts.log > /tmp/falco-incident-12345.txt
[ ] Tag the affected pod to prevent garbage collection and accidental termination.
    kubectl label pod <affected-pod> -n <namespace> \
      incident=12345 \
      do-not-delete=true
[ ] Store all evidence in a secure location outside the running cluster. Copy logs to the central audit system. Retain all evidence for a minimum of 90 days, or longer if a data breach is suspected.
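The grep in the Falco collection step selects by pod name but not by time. A time-bounded selection could look like the sketch below. It assumes Falco is writing JSON lines with a `time` field and an `output_fields` object containing `k8s.pod.name`; if the deployment logs plain text, this does not apply. The function names, pod names, and sample alerts are hypothetical.

```python
import json
from datetime import datetime, timedelta, timezone


def parse_falco_time(value: str) -> datetime:
    """Parse the leading 'YYYY-MM-DDTHH:MM:SS' as UTC; sub-second digits are dropped."""
    return datetime.strptime(value[:19], "%Y-%m-%dT%H:%M:%S").replace(tzinfo=timezone.utc)


def recent_alerts_for_pod(lines, pod_name, now, window=timedelta(hours=24)):
    """Return alerts for one pod within the window ending at `now`, oldest first."""
    selected = []
    for line in lines:
        try:
            alert = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not JSON
        if not isinstance(alert, dict):
            continue
        fields = alert.get("output_fields") or {}
        if fields.get("k8s.pod.name") != pod_name:
            continue
        try:
            when = parse_falco_time(alert.get("time", ""))
        except ValueError:
            continue  # skip alerts without a parseable timestamp
        if now - window <= when <= now:
            selected.append(alert)
    return sorted(selected, key=lambda a: a["time"])


# Hypothetical alert lines for illustration.
log_lines = [
    '{"time": "2026-03-21T10:00:00.000000000Z", "rule": "Terminal shell in container",'
    ' "output_fields": {"k8s.pod.name": "app-7d9f"}}',
    '{"time": "2026-03-22T09:30:00.123456789Z", "rule": "Outbound connection",'
    ' "output_fields": {"k8s.pod.name": "app-7d9f"}}',
    '{"time": "2026-03-22T09:45:00.000000000Z", "rule": "Outbound connection",'
    ' "output_fields": {"k8s.pod.name": "other-pod"}}',
    "not json",
]
now = datetime(2026, 3, 22, 12, 0, 0, tzinfo=timezone.utc)
alerts = recent_alerts_for_pod(log_lines, "app-7d9f", now)
print([a["rule"] for a in alerts])
```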
4.2 Short-Term Containment (30 min - 2 hours)
Goal: Return the system to a known-good state while preserving investigation capability.
Step 1: Identify Last Known Good Image
[ ] Review deployment history to identify a previous version without the vulnerability.
    kubectl rollout history deployment <app> -n <namespace>
    # Note revision X with known-good image
[ ] Verify the last-known-good image signature and integrity.
    cosign verify registry.cleanstart.com/app:v1.2.2 --certificate-identity-regexp=...
    # Should succeed with valid signature
[ ] Scan the last-known-good image comprehensively for vulnerabilities.
    trivy image registry.cleanstart.com/app:v1.2.2 --severity CRITICAL
    # Must be clean or have only acceptable findings
Step 2: Trigger Rollback (Canary approach)
For P1-P2 incidents where confidence in the incident is high:
[ ] Roll back 10 percent of traffic to the last-known-good image as a canary.
    # Note: both commands below change the whole Deployment; a percentage-based canary
    # needs a separate traffic-splitting or partial-rollout mechanism.
    kubectl rollout undo deployment <app> -n <namespace> --to-revision=X
    # Or:
    kubectl set image deployment/<app> \
      app=registry.cleanstart.com/app:v1.2.2 \
      -n <namespace> \
      --record
[ ] Wait 5 minutes and monitor Falco alerts, application logs, and system metrics for normalization.
[ ] If the canary is stable, roll back 50 percent of traffic and wait another 5 minutes.
[ ] If still stable, complete the rollback to 100 percent.
[ ] If any issues are detected during rollback, abort immediately and escalate.
For P3-P4 incidents where low confidence or planned remediation is appropriate:
[ ] Do NOT perform rollback; proceed directly to the eradication phase.
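The staged sequence for P1-P2 incidents (10 percent, 50 percent, then 100 percent, with a 5-minute soak between stages and an abort on any problem) can be sketched as a small control loop. How a given percentage of traffic is actually shifted depends on the platform, so `set_rollback_percent` and `is_healthy` are hypothetical callbacks supplied by the caller.

```python
import time
from typing import Callable, Sequence


def staged_rollback(
    set_rollback_percent: Callable[[int], None],
    is_healthy: Callable[[], bool],
    stages: Sequence[int] = (10, 50, 100),
    soak_seconds: int = 300,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Advance through the stages; return False as soon as a health check fails."""
    for index, percent in enumerate(stages):
        set_rollback_percent(percent)
        if index == len(stages) - 1:
            break  # final stage reached; no further soak is described
        sleep(soak_seconds)
        if not is_healthy():
            return False  # abort and escalate
    return True


# Dry run with stand-in callbacks and no real waiting.
applied = []
completed = staged_rollback(applied.append, lambda: True, sleep=lambda seconds: None)
print(applied, completed)
```

Injecting `sleep` keeps the function usable in a drill or a unit test without waiting the full soak period.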
Step 3: Update Deployments (Prevent re-creation)
[ ] Update the deployment YAML to reference the last-known-good image version.
    spec:
      template:
        spec:
          containers:
          - name: app
            image: registry.cleanstart.com/app:v1.2.2
[ ] Apply the updated deployment to the cluster.
    kubectl apply -f deployment.yaml
[ ] Update all deployment managers such as Helm or ArgoCD to sync the new version.
Step 4: Monitor Stabilization (Continuous until clear)
[ ] For P1-P2 incidents, establish a real-time monitoring dashboard tracking the following metrics: Falco alert rate should drop to near-zero. Pod restart rate should return to normal baseline values. CPU and memory usage should normalize. Network traffic patterns, especially outbound connections to suspicious IP addresses, should cease. Application error logs should return to baseline.
[ ] Execute monitoring checks every 10 minutes for P1 and every 30 minutes for P2.
    # Falco alerts: count matches over the last check interval
    # (following the log with -f would never let the count finish)
    kubectl logs -n falco -l app=falco --since=10m --tail=-1 | grep -c "incident-12345"

    # Pod status
    kubectl get pods -n <namespace> -l app=<app> -o wide

    # Error logs
    kubectl logs <affected-pod> -n <namespace> --tail=20 | grep -i "error\|exception"
5. PHASE 3: Eradication (2 hours - 24 hours)
5.1 Root Cause Analysis
Determine the source of the vulnerability or compromise to prevent recurrence. The investigation path depends on the incident type.
If base image vulnerability:
[ ] Identify which base image was affected, for example library/ubuntu:22.04 from Docker Hub.
[ ] Determine which package contained the vulnerability, such as openssl 3.0.1.
[ ] Establish when the package was added to the base image.
[ ] Investigate whether existing security scanners missed the finding.
Action:
[ ] Pull the updated base image from registry.cleanstart.com containing the patched package version.
[ ] Verify the updated SBOM includes the corrected package version.
[ ] Rebuild the application image using the new base image.
If application dependency vulnerability:
[ ] Identify the specific package affected, such as an npm, pip, or Go module.
[ ] Determine which version contains the vulnerability.
[ ] Establish when the vulnerable package was introduced to the application.
[ ] Investigate why the pre-deployment security scan did not detect the vulnerability.
Action:
[ ] Update the application dependency to a patched version.
[ ] Execute the full test suite to ensure compatibility.
[ ] Build a new application image.
[ ] Verify the SBOM reflects the updated package version.
If build infrastructure compromise:
[ ] Determine whether the build system infrastructure itself was attacked, including the CI/CD server or container registry.
[ ] Establish whether signing keys were exposed or compromised.
[ ] Assess whether attackers injected code into the build pipeline.
[ ] Identify all images that were signed during the compromise window.
Action:
[ ] Secure and isolate the build system immediately.
[ ] Rotate all signing keys immediately.
[ ] Rebuild all images that were created during the compromise window.
[ ] Re-sign the rebuilt images using the new keys.
[ ] Notify all customers about the re-signed images.
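Identifying the images signed during the compromise window amounts to a time-range filter over signing records. The sketch below assumes those records have already been exported as (image reference, signing time) pairs, for example from registry audit logs; that export step, the function name, and the sample dates are assumptions for illustration.

```python
from datetime import datetime, timezone


def images_in_window(signing_records, window_start: datetime, window_end: datetime) -> list[str]:
    """Return distinct image references signed inside the window (bounds inclusive)."""
    return sorted(
        {ref for ref, signed_at in signing_records if window_start <= signed_at <= window_end}
    )


# Hypothetical signing records.
records = [
    ("registry.cleanstart.com/app:v1.2.3", datetime(2026, 3, 10, tzinfo=timezone.utc)),
    ("registry.cleanstart.com/app:v1.2.2", datetime(2026, 2, 1, tzinfo=timezone.utc)),
    ("registry.cleanstart.com/base-python:3.11.2", datetime(2026, 3, 12, tzinfo=timezone.utc)),
]
suspect = images_in_window(
    records,
    window_start=datetime(2026, 3, 9, tzinfo=timezone.utc),
    window_end=datetime(2026, 3, 15, tzinfo=timezone.utc),
)
print(suspect)
```

The resulting list is the input to the rebuild and re-sign actions; when the exact start of the compromise is uncertain, widening the window is the safer choice.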
If runtime exploitation:
[ ] Determine whether the vulnerability was actively exploited based on Falco alerts and logs.
[ ] Assess whether the attacker achieved persistence through backdoors or file modifications.
[ ] Evaluate whether sensitive data was accessed or exfiltrated.
[ ] Investigate whether lateral movement occurred beyond the initially compromised pod.
Action:
[ ] Conduct thorough forensic investigation of all affected containers.
[ ] Create a timeline of attacker actions from logs and Falco alerts.
[ ] Determine the scope of data exposure.
[ ] Notify customers if their personal information, payment data, or sensitive information was accessed.
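The attacker timeline in the actions above is essentially a merge of events from several sources ordered by time. A minimal sketch follows, assuming events from each source have already been parsed into a common shape; the `Event` type, the function name, and the sample events are hypothetical.

```python
from datetime import datetime, timezone
from typing import NamedTuple


class Event(NamedTuple):
    when: datetime
    source: str
    detail: str


def build_timeline(*sources: list[Event]) -> list[Event]:
    """Merge events from all sources into one list ordered by timestamp."""
    return sorted((event for src in sources for event in src), key=lambda e: e.when)


# Hypothetical events for illustration.
falco_events = [
    Event(datetime(2026, 3, 22, 9, 31, tzinfo=timezone.utc), "falco", "shell spawned in container"),
    Event(datetime(2026, 3, 22, 9, 40, tzinfo=timezone.utc), "falco", "outbound connection"),
]
app_log_events = [
    Event(datetime(2026, 3, 22, 9, 30, tzinfo=timezone.utc), "app-log", "unexpected request payload"),
    Event(datetime(2026, 3, 22, 9, 35, tzinfo=timezone.utc), "app-log", "config file modified"),
]
timeline = build_timeline(falco_events, app_log_events)
for event in timeline:
    print(event.when.isoformat(), event.source, event.detail)
```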
5.2 Build Patched Image
If base image update: