The Detection Problem
A package maintainer's account is compromised. The attacker publishes a malicious version. The code review passes because it's "from the trusted maintainer."
How do you detect this?
Stylometry, the study of writing patterns, can reveal when the person behind an account has changed.
Just like forensic analysts can identify a person by their handwriting, CleanStart analyzes software development patterns to identify when a maintainer is not who they claim to be.
The 32-Dimensional Behavioral Fingerprint
CleanStart creates a unique fingerprint for each maintainer across 32 behavioral dimensions:
Behavioral Dimensions (32 total):

Coding Style (7 dimensions):
1. Function length distribution (avg lines per function)
2. Variable naming conventions (camelCase, snake_case, etc.)
3. Comment frequency (lines per comment)
4. Error handling patterns (try/catch vs result types)
5. Import organization (alphabetical, grouped, etc.)
6. Indentation preference (2 spaces, 4 spaces, tabs)
7. Line length distribution (average characters per line)

Commit Patterns (8 dimensions):
8. Commits per day (frequency)
9. Commits per hour (timezone preference)
10. Time between commits (consistency)
11. Commit message length (average characters)
12. Commit message structure (format patterns)
13. Commits per feature (how many commits per PR)
14. Merge vs rebase preference
15. Weekend commit activity (does maintainer work weekends)

Workflow Patterns (6 dimensions):
16. Code review turnaround time (hours to review)
17. Pull request title format (naming convention)
18. Pull request description length
19. Release cadence (frequency of releases)
20. Release timing (day of week, time of day)
21. Issue response time (hours to respond)

API Usage Patterns (6 dimensions):
22. Function call frequency distribution
23. Error type preferences (specific exception types used)
24. Library usage patterns (which libraries favored)
25. API deprecation handling (old vs new APIs)
26. Configuration option usage
27. Memory allocation patterns

Temporal Patterns (5 dimensions):
28. Timezone offset (UTC offset of commit times)
29. Active hours (when maintainer typically works)
30. Vacation patterns (when inactive)
31. Response delay distribution (time between request and response)
32. Seasonal activity patterns (more active at certain times)

Building the Fingerprint
CleanStart analyzes a maintainer's historical commits (the most recent 1,000 by default in the pseudocode below):
    # Pseudocode: building a maintainer fingerprint.
    # Helper functions (fetch_commits, detect_indentation, ...) are placeholders,
    # and each commit is assumed to expose .timestamp and .message.
    def build_fingerprint(maintainer_id, last_n_commits=1000):
        commits = fetch_commits(maintainer_id, last_n_commits)
        commit_messages = [c.message for c in commits]
        timestamps = [c.timestamp for c in commits]
        # Days covered by the sample; at least 1 to avoid dividing by zero
        days_active = max((max(timestamps) - min(timestamps)).days, 1)

        fingerprint = {
            'coding_style': {
                'avg_function_length': calculate_function_lengths(commits),
                'naming_convention': detect_naming_pattern(commits),
                'comment_frequency': count_comments(commits),
                'indentation': detect_indentation(commits),
                # ... 3 more dimensions
            },
            'commit_patterns': {
                'commits_per_day': len(commits) / days_active,
                'commits_per_hour': distribution_by_hour(commits),
                'time_between_commits': gap_analysis(commits),
                'commit_message_length': avg_length(commit_messages),
                # ... 4 more dimensions
            },
            'timezone': extract_timezone(commits),
            'active_hours': extract_active_hours(commits),
            # ... 15 more dimensions
        }
        return fingerprint

The result is a 32-dimensional profile that characterizes the maintainer's typical behavior.
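To make the idea concrete, here is a much-reduced, runnable sketch that summarizes three numeric dimensions as a mean and a standard deviation. The `Commit` record, its field names, and `build_numeric_fingerprint` are hypothetical names chosen for this illustration; they are not CleanStart's API, and a real fingerprint would cover all 32 dimensions.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from statistics import mean, pstdev


@dataclass
class Commit:
    """Hypothetical commit record; the field names are assumptions."""
    timestamp: datetime   # timezone-aware commit time
    message: str          # full commit message
    lines_changed: int    # added plus removed lines


def build_numeric_fingerprint(commits):
    """Summarize each numeric dimension as (mean, population std dev)."""
    if not commits:
        raise ValueError("at least one commit is required")
    features = {
        "commit_hour_utc": [c.timestamp.astimezone(timezone.utc).hour for c in commits],
        "message_length": [len(c.message) for c in commits],
        "lines_changed": [c.lines_changed for c in commits],
    }
    return {name: (mean(values), pstdev(values)) for name, values in features.items()}


# Made-up sample history, for illustration only
history = [
    Commit(datetime(2024, 3, 4, 18, 0, tzinfo=timezone.utc), "Add login form", 40),
    Commit(datetime(2024, 3, 5, 20, 0, tzinfo=timezone.utc), "Fix token refresh bug", 10),
    Commit(datetime(2024, 3, 6, 22, 0, tzinfo=timezone.utc), "Update docs", 25),
]
fingerprint = build_numeric_fingerprint(history)
print(fingerprint["commit_hour_utc"])
```

One caveat with this simplification: hours of the day wrap around midnight, so a plain arithmetic mean of commit hours is misleading for maintainers whose activity straddles 00:00 UTC. A circular statistic or an hourly histogram would be more appropriate there.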
Detecting Anomalies
When a new commit arrives, it's compared to the fingerprint:
    import numpy as np

    # Pseudocode: feature helpers and DIMENSIONS (the 32 dimension names) are placeholders.
    def detect_anomaly(commit, maintainer_fingerprint):
        commit_characteristics = {
            'function_length': avg_function_length(commit),
            'variable_names': naming_style(commit),
            'comment_frequency': comments(commit) / max(lines_changed(commit), 1),
            'timezone': extract_tz(commit.timestamp),
            'commit_time': commit.timestamp.hour,
            # ... 27 more dimensions
        }

        # Per-dimension z-score: how many historical standard deviations the
        # commit lies from the maintainer's historical mean. Assumes the
        # fingerprint is flattened to one (mean, std) pair per dimension and
        # that categorical dimensions are numerically encoded.
        z_scores = []
        for dimension in DIMENSIONS:
            historical_mean, historical_std = maintainer_fingerprint[dimension]
            current_value = commit_characteristics[dimension]
            if historical_std > 0:
                z_scores.append(abs(current_value - historical_mean) / historical_std)
            else:
                z_scores.append(0.0)

        # Aggregate across dimensions (a simple mean; other aggregates are possible)
        anomaly_score = float(np.mean(z_scores))

        if anomaly_score > 3.0:  # more than 3 std devs from normal on average
            return {
                'anomaly_detected': True,
                'confidence': calculate_confidence(anomaly_score),
                'deviations': identify_deviating_dimensions(z_scores)
            }
        return {'anomaly_detected': False}

Real-World Examples
Example 1: Timezone Shift
A maintainer (John, based in San Francisco, UTC-8) always commits between 9am and 5pm local time.
Normal fingerprint:
- Timezone: UTC-8
- Active hours: 17:00-01:00 UTC
- Commits per hour distribution: Peak at 18:00 UTC
Suspicious commit arrives:
- Timestamp: 08:00 UTC (midnight in San Francisco, outside John's active hours)
- Code written in an unfamiliar Python style
- Variable names use Greek letters (John uses English)
- Commit message mentions Chinese holidays
- Anomaly score: 4.2 (highly suspicious)
Investigation: John's account was compromised by an attacker operating from Shanghai (UTC+8), where 08:00 UTC is 4pm local time. The changes don't match John's usual patterns.
Result: Commit rejected, account locked, John notified.
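The timezone reasoning in this example comes down to two small computations: whether a commit's UTC hour falls inside an active window that may wrap past midnight, and what local hour that commit implies for a given UTC offset. The sketch below illustrates both; the function names are hypothetical, and only whole-hour offsets are handled.

```python
def in_active_window(hour_utc, start_hour, end_hour):
    """Return True if hour_utc lies in [start_hour, end_hour), allowing wrap past midnight."""
    if start_hour <= end_hour:
        return start_hour <= hour_utc < end_hour
    # Window wraps midnight, e.g. 17:00-01:00 UTC
    return hour_utc >= start_hour or hour_utc < end_hour


def local_hour(hour_utc, utc_offset_hours):
    """Convert a UTC hour to the local hour for a whole-hour UTC offset."""
    return (hour_utc + utc_offset_hours) % 24


# John's window from the example: 17:00-01:00 UTC (9am-5pm at UTC-8)
print(in_active_window(18, 17, 1))  # True: 18:00 UTC is 10am at UTC-8
print(in_active_window(8, 17, 1))   # False: 08:00 UTC is midnight at UTC-8
print(local_hour(8, -8))            # 0  (midnight at UTC-8)
print(local_hour(8, 8))             # 16 (4pm at UTC+8)
```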
Example 2: Function Complexity Spike
A maintainer's functions average 25 lines. New commit has functions averaging 180 lines.
Normal fingerprint:
- Avg function length: 25 lines
- Max function length: 65 lines
- Code organization: Small, focused functions
Suspicious commit arrives:
- Avg function length: 180 lines
- Functions contain complex business logic
- Comments are sparse
Analysis:
- Function length deviation: 155 lines above normal (7 std devs)
- Anomaly score: 6.1
Investigation: The attacker attempted to hide an exploit in large, complex functions to evade review.
Result: Commit flagged for human review, likely rejected.
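The "7 std devs" figure in this example is a z-score: the deviation from the historical mean divided by the historical standard deviation. The example does not state the standard deviation, so the value of 22.0 lines used below is an assumption chosen so the arithmetic lands near seven; `z_score` is a hypothetical helper name.

```python
def z_score(observed, historical_mean, historical_std):
    """Number of historical standard deviations between observed and the mean."""
    if historical_std <= 0:
        raise ValueError("historical_std must be positive")
    return (observed - historical_mean) / historical_std


# Mean and observed value come from the example; the std dev of 22.0 is assumed.
z = z_score(observed=180, historical_mean=25, historical_std=22.0)
print(round(z, 2))  # 7.05
```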
Example 3: Message Format Change
A maintainer's commit messages follow a specific pattern:
Normal pattern:
    [FEATURE] Add user authentication
    - Implement JWT token generation
    - Add password hashing
    - Update test coverage to 95%
Suspicious commit:
    update dependencies
(Single line, no explanation, no detail)
Analysis:
- Message length: 19 characters (normally 200+)
- Structure: No sub-bullets (normally has 3-5)
- Capitalization: lowercase (normally a capitalized summary after a [TAG] prefix)
- Detail level: Minimal (normally extensive)
- Anomaly score: 5.7
Investigation: Maintainer's account compromised. Attacker changed configuration files to inject malicious code.
Result: Commit rejected, further analysis triggered.
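The message-format signals in this example (length, sub-bullets, tag prefix, capitalization) are easy to extract mechanically. The following is a minimal sketch of such a feature extractor; `message_features` and its feature names are hypothetical, and the regular expression assumes tags of the form `[FEATURE]`.

```python
import re


def message_features(message):
    """Extract simple structural features from a commit message."""
    lines = message.splitlines()
    subject = lines[0] if lines else ""
    return {
        "length": len(message),
        # Body lines that start with "-" are counted as sub-bullets
        "bullet_count": sum(1 for line in lines[1:] if line.lstrip().startswith("-")),
        # An upper-case tag in square brackets at the start of the subject line
        "has_tag_prefix": bool(re.match(r"\[[A-Z]+\]", subject)),
        "starts_lowercase": subject[:1].islower(),
    }


normal = (
    "[FEATURE] Add user authentication\n"
    "- Implement JWT token generation\n"
    "- Add password hashing\n"
    "- Update test coverage to 95%"
)
suspicious = "update dependencies"

print(message_features(normal))
print(message_features(suspicious))
```

Each feature can then be compared against the maintainer's historical values in the same way as any other dimension.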
Confidence Scoring
Anomalies aren't binary. CleanStart calculates confidence:
Confidence grows with the anomaly score. CleanStart maps score ranges to confidence bands:
- Score 0-1.0: Normal (0% confidence in anomaly)
- Score 1.0-2.0: Slightly unusual (< 40% confidence)
- Score 2.0-3.0: Moderately unusual (40-70% confidence)
- Score 3.0-5.0: Highly suspicious (70-95% confidence)
- Score > 5.0: Extremely suspicious (95%+ confidence)
If confidence > 70%:
→ Automatic code review flag
→ Human review required before merge
If confidence > 90%:
→ Automatic account suspension
→ Security team investigation
→ Credentials reset required

Deviations Report
When an anomaly is detected, CleanStart reports which dimensions deviated:
Anomaly Detected: High Confidence (92%)
Deviating Dimensions:
1. Timezone: Expected UTC-8, observed UTC+8 (16 hours offset)
2. Active hours: Expected 17:00-01:00 UTC, observed 00:00-08:00 UTC
3. Function length: Expected avg 25 lines, observed avg 180 lines
4. Comment frequency: Expected 1:5 ratio, observed 1:50 ratio
5. Variable naming: Expected camelCase, observed snake_case+unicode
6. Commit frequency: Expected 5-10/day, observed 1/day
7. Commit message structure: Expected formatted, observed minimal
Estimated Account Compromise Probability: 94%
Recommended Actions:
1. Suspend account pending investigation
2. Review commits from past 7 days
3. Reset maintainer credentials
4. Require MFA re-authentication
5. Notify maintainer of suspicious activity

Machine Learning Enhancement
CleanStart uses ML models trained on historical maintainer data:
Training data:
- Thousands of maintainer behavioral profiles
- Known anomalies (stolen accounts)
- Ground truth (was it actually compromised?)
Model: Random Forest Classifier
Input: 32-dimensional behavioral feature vector
Output: Probability of compromise

Machine learning approaches provide benefits over pure statistical analysis. They capture non-linear relationships, learn subtle patterns that correlate with attacks, adapt to new attack patterns, and reduce false positives through contextual analysis.

Use Case: The colors.js Attack (Revisited)
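In January 2022, the colors.js package shipped a sabotaged release. That release was published deliberately by the package's own maintainer, not through stolen credentials, so the scenario below is hypothetical: it shows what CleanStart would flag if the same release had come from a hijacked maintainer account.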
What CleanStart would detect:
Historical profile: colors.js maintainer (illustrative values)
- Timezone: UTC-5 (US Eastern)
- Active hours: 20:00-04:00 UTC
- Commit frequency: 1-3 per week
- Function length: avg 15 lines
- Commit messages: Detailed
- Variables: English names
Malicious commit arrives:
- Timestamp: 12:00 UTC (outside active hours)
- Commits: 1 commit (normal)
- BUT: Function contains injection code (unusual)
- BUT: No commit message explanation (unusual)
- Timezone inferred: UTC+0 or UTC+1 (Europe)
- Anomaly score: 6.8 (extremely suspicious)
Detection Result:
- Commit flagged before reaching production
- Account suspension triggered
- Damage contained to single version

Handling Legitimate Changes
Not all anomalies are attacks. A maintainer's behavior can change for legitimate reasons: a relocation that shifts their timezone, a new work schedule that shifts their active hours, a coding style that evolves as they learn new patterns, or a change in team composition that brings in commits from different people. CleanStart handles this through feedback loops that let the system learn and adapt:
Anomaly detected → Flag for review → Maintainer confirms
Scenario 1: Legitimate change
- Maintainer: "I moved to Europe, timezone changed"
- System: Updates historical profile with new baseline
- Future commits: Evaluated against new profile
Scenario 2: Attack
- System: "Account compromised, reset credentials"
- Maintainer: Confirms attack
- System: Learns attack pattern, updates ML model

Privacy and Consent
Stylometry analysis raises privacy concerns. CleanStart addresses them as follows:
- Consent: Maintainers opt in to stylometry analysis
- Transparency: Results shared with maintainers
- Limited data retention: Profiles deleted after 1 year of inactivity
- Aggregation: Never shares patterns of individual maintainers
- Security: Profiles encrypted at rest
False Positive Handling
Stylometry has inherent false positive rates:
Vacation period:
- Maintainer on vacation (no commits)
- Returns to work, commit patterns different
- System flags as anomaly
- Resolution: Maintainer confirms vacation
- System: Learns and adjusts thresholds
Seasonal changes:
- School starts, maintainer has less time
- Commit frequency drops 50%
- System flags as anomaly
- Resolution: Expected seasonal change
- System: Adjusts baseline for recurring periods
New team member:
- Company hires new maintainer for project
- New person has different coding style
- System flags as anomaly
- Resolution: Confirm new team member
- System: Creates profile for new person

Integration with Supply Chain
Stylometry feeds into the broader supply chain verification:
When a commit arrives, it passes through four detection layers in sequence. Layer 1 verifies source code integrity via signature verification. Layer 2 detects behavioral anomalies using maintainer stylometry. Layer 3 executes the package in a behavioral sandbox and monitors activity. Layer 4 performs runtime verification in production. If any layer flags an anomaly, the commit is rejected. Stylometry is therefore one of four detection layers, providing defense in depth.
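The sequential, reject-on-first-flag behavior described above can be sketched as a short pipeline. Everything here is illustrative: `run_layers`, the commit dictionary keys, and the placeholder checks are assumptions standing in for the real layers, which would be far richer.

```python
from typing import Callable, List, Optional, Tuple

# Each layer takes a commit and returns None if it passes,
# or a short reason string if it flags the commit.
Layer = Tuple[str, Callable[[dict], Optional[str]]]


def run_layers(commit: dict, layers: List[Layer]) -> dict:
    """Run the layers in order and stop at the first one that flags the commit."""
    for name, check in layers:
        reason = check(commit)
        if reason is not None:
            return {"accepted": False, "layer": name, "reason": reason}
    return {"accepted": True, "layer": None, "reason": None}


# Placeholder checks for the four layers (hypothetical keys and thresholds)
layers = [
    ("signature", lambda c: None if c.get("signature_valid") else "invalid signature"),
    ("stylometry", lambda c: "anomaly score above 3.0" if c.get("anomaly_score", 0) > 3.0 else None),
    ("sandbox", lambda c: "unexpected network access" if c.get("sandbox_network_calls", 0) else None),
    ("runtime", lambda c: None),  # stand-in: runtime verification happens in production
]

result = run_layers({"signature_valid": True, "anomaly_score": 4.2}, layers)
print(result)
```

Stopping at the first flag keeps the cheaper checks (signature, stylometry) ahead of the more expensive sandbox run, which is one reasonable ordering for a pipeline like this.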
Limitations
Stylometry can't detect:
- Subtle malice: An attacker who perfectly mimics the maintainer's style
- Insider threats: A team member with legitimate access
- Social engineering: A maintainer who is persuaded to commit malicious code voluntarily
- Permission escalation: Access gained through an org admin rather than the maintainer account
Solution: Layered approach using all four detection layers, not just stylometry.
The Competitive Advantage
Organizations using stylometry:
- Detect account compromise immediately (not days later)
- Prevent supply chain attacks before distribution
- Maintain developer productivity (legitimate changes approved)
- Reduce false alarms (context-aware detection)
CleanStart's implementation of maintainer stylometry is the first production-grade system that brings this capability to open-source security.
It represents the evolution from "trust by default" to "verify and understand" — understanding the behavioral patterns that characterize legitimate maintainers.
