Scoring
Codemetry uses a deterministic heuristic scoring algorithm to convert normalized signals into a single score (0-100) and label (bad/medium/good).
The Algorithm
Starting Point
Every day starts with a base score of 70 (in the “medium” range).
Penalties
The scorer applies penalties based on percentile thresholds:
| Condition | Penalty | Reasoning |
|---|---|---|
| Churn at p95+ | -20 | Extreme activity, high risk |
| Churn at p90-p95 | -12 | Elevated activity |
| Scatter at p90+ | -10 | Highly dispersed changes |
| Fix density at p95+ | -25 | Many follow-up fixes needed |
| Fix density at p90-p95 | -15 | Elevated follow-up fixes |
| Any revert commits | -15 | Something went wrong |
| WIP ratio ≥ 0.3 | -8 | High proportion of incomplete work |
Rewards
The scorer applies rewards for positive patterns:
| Condition | Reward | Reasoning |
|---|---|---|
| Churn ≤p25 AND fix density ≤p25 | +5 | Low activity with stable quality |
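The penalty and reward tables above can be sketched as a single function. This is a minimal illustration, not Codemetry's actual implementation: the `DaySignals` container, its field names, and `raw_score` are hypothetical, and the sketch assumes that the p90-p95 band excludes p95 itself and that only the highest matching band applies per signal.

```python
from dataclasses import dataclass

BASE_SCORE = 70  # starting point stated in the docs


@dataclass
class DaySignals:
    """Hypothetical container; field names are illustrative, not Codemetry's API."""
    churn_pct: float        # percentile rank of churn, 0-100
    scatter_pct: float      # percentile rank of scatter, 0-100
    fix_density_pct: float  # percentile rank of fix density, 0-100
    revert_count: int       # number of revert commits that day
    wip_ratio: float        # share of WIP commits, 0.0-1.0


def raw_score(s: DaySignals) -> int:
    """Apply the penalty and reward tables to the base score, before clamping."""
    score = BASE_SCORE

    # Churn: assumed that only the highest matching band applies.
    if s.churn_pct >= 95:
        score -= 20
    elif s.churn_pct >= 90:
        score -= 12

    if s.scatter_pct >= 90:
        score -= 10

    # Fix density: same banding assumption as churn.
    if s.fix_density_pct >= 95:
        score -= 25
    elif s.fix_density_pct >= 90:
        score -= 15

    if s.revert_count > 0:
        score -= 15

    if s.wip_ratio >= 0.3:
        score -= 8

    # The single reward: low churn together with low fix density.
    if s.churn_pct <= 25 and s.fix_density_pct <= 25:
        score += 5

    return score
```

Under these assumptions, a day with no penalties that also earns the reward reaches 75, while a day that triggers every penalty at its highest band drops to -8 before clamping.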
Clamping
The final score is clamped to the range 0-100.
Label Mapping
| Score Range | Label |
|---|---|
| 75-100 | good |
| 45-74 | medium |
| 0-44 | bad |
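Clamping and label mapping are simple enough to sketch directly from the table above. The function names `clamp` and `label_for` are illustrative assumptions, not part of Codemetry's API.

```python
def clamp(score: int, low: int = 0, high: int = 100) -> int:
    """Restrict a raw score to the documented 0-100 range."""
    return max(low, min(high, score))


def label_for(score: int) -> str:
    """Map a score to its label using the documented ranges."""
    score = clamp(score)
    if score >= 75:
        return "good"
    if score >= 45:
        return "medium"
    return "bad"
```

For example, `label_for(70)` gives `"medium"` (the base score), and a raw score below zero is clamped to 0 and labeled `"bad"`.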
Worked Example
Let’s walk through a scoring calculation:
Day’s normalized signals:
- Churn: 92nd percentile
- Scatter: 75th percentile
- Fix density: 96th percentile
- Reverts: 1
- WIP ratio: 0.1
Calculation:
    Starting score:        70
    Churn at p90-95:      -12
    Scatter below p90:      0
    Fix density at p95+:  -25
    Revert detected:      -15
    WIP ratio < 0.3:        0
                         ────
    Subtotal:              18

    Clamp to 0-100:        18 ✓
    Label (0-44 = bad):    bad

Result: Score 18, Label “bad”
Why This Algorithm?
Deterministic
The same inputs always produce the same outputs. There is no randomness and no machine learning variance, so you can verify and predict results.
Percentile-Based
Using percentiles instead of absolute values means:
- “High” churn is relative to your repo
- Different projects have different norms
- Fair comparison across repositories
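To illustrate why percentiles make the signal relative, here is a rough sketch of a percentile rank computed against a repository's own history. How Codemetry actually derives its percentiles is not described here, so the `percentile_rank` function and its "share of historical values at or below today's value" definition are assumptions chosen for illustration.

```python
from bisect import bisect_right


def percentile_rank(history: list[float], value: float) -> float:
    """Percentage of historical values that are <= value (assumed definition)."""
    if not history:
        raise ValueError("history must not be empty")
    ordered = sorted(history)
    return 100.0 * bisect_right(ordered, value) / len(ordered)


# The same absolute churn (500 changed lines) ranks very differently
# depending on what is normal for each repository.
quiet_repo = [50, 80, 120, 200, 300]
busy_repo = [400, 600, 900, 1500, 2000]

quiet_rank = percentile_rank(quiet_repo, 500)  # 100.0: extreme for this repo
busy_rank = percentile_rank(busy_repo, 500)    # 20.0: unremarkable for this repo
```

With this definition, 500 changed lines would trigger the churn penalty in the quiet repository but not in the busy one.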
Additive Penalties
Multiple problems compound:
- High churn alone = medium score
- High churn + reverts = low score
- High churn + reverts + high fix density = very low score
This reflects reality: multiple warning signs are more concerning than isolated ones.
Conservative Rewards
The algorithm intentionally penalizes far more readily than it rewards:
- Easy to lose points (many penalty conditions)
- Hard to gain points (one small reward)
This makes “good” scores meaningful—they represent genuinely stable periods.
Interpreting Scores
Score Ranges
| Range | Interpretation |
|---|---|
| 80-100 | Stable, low-risk period |
| 65-79 | Normal development activity |
| 50-64 | Some elevated signals, worth monitoring |
| 35-49 | Multiple warning signs, consider review |
| 0-34 | Significant strain indicators |
Context Matters
Always consider:
- Confounders: `large_refactor_suspected` changes interpretation
- Confidence: Low confidence = take score with a grain of salt
- Reasons: The “why” behind the score
- Patterns: One bad day happens; three in a row is a pattern
Reasons
The scorer generates human-readable reasons for the top 6 contributors to the score:
    "reasons": [
      {
        "signal_key": "change.churn",
        "direction": "negative",
        "magnitude": 20.0,
        "summary": "Churn at p95+ percentile"
      },
      {
        "signal_key": "followup.fix_density",
        "direction": "negative",
        "magnitude": 15.0,
        "summary": "High follow-up fix density"
      }
    ]

Reasons are sorted by magnitude (impact on score), so the first reason is the biggest contributor.
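The ordering and the top-6 cutoff can be sketched as follows. The dictionary keys mirror the JSON example; the `top_reasons` function itself is a hypothetical helper, not Codemetry's API.

```python
def top_reasons(reasons: list[dict], limit: int = 6) -> list[dict]:
    """Return the largest contributors first, keeping at most `limit` entries."""
    # sorted() is stable, so reasons with equal magnitude keep their input order.
    return sorted(reasons, key=lambda r: r["magnitude"], reverse=True)[:limit]


reasons = [
    {"signal_key": "followup.fix_density", "direction": "negative", "magnitude": 15.0},
    {"signal_key": "change.churn", "direction": "negative", "magnitude": 20.0},
]

ordered = top_reasons(reasons)
# ordered[0] is the churn reason, since 20.0 is the largest magnitude.
```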
Customization
The scoring algorithm is not currently configurable. If you need different thresholds or weights, you would need to:
- Create a custom scorer class
- Inject it into the Analyzer
This is intentionally limited in V1 to maintain consistency. Custom scoring may be exposed in future versions.