# Confidence & Confounders
Not all scores are equally reliable. Codemetry provides two mechanisms to help interpret results: confidence scores and confounders.
## Confidence
Confidence is a number from 0.0 to 1.0 indicating how reliable the score is.
### How Confidence is Calculated
Base confidence: 0.6
Increases:
| Condition | Adjustment |
|---|---|
| 3+ commits in window | +0.1 |
| Follow-up provider ran successfully | +0.1 |
Decreases:
| Condition | Adjustment |
|---|---|
| 0-1 commits in window | -0.2 |
| Key provider skipped | -0.1 per provider |
The result is clamped to the range 0.0 to 1.0.
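The adjustment rules above can be sketched as a small function. This is a minimal illustration of the described arithmetic, not Codemetry's actual API; the function and parameter names are assumptions.

```python
def estimate_confidence(commit_count, follow_up_ok, skipped_providers):
    """Sketch of the confidence adjustments described above.

    Assumes the tabled rules: base 0.6, +0.1 for 3+ commits, +0.1 if the
    follow-up provider ran, -0.2 for 0-1 commits, -0.1 per skipped key
    provider. Names are illustrative, not Codemetry's real interface.
    """
    confidence = 0.6  # base confidence
    if commit_count >= 3:
        confidence += 0.1      # enough commits for a reliable signal
    elif commit_count <= 1:
        confidence -= 0.2      # too few commits in the window
    if follow_up_ok:
        confidence += 0.1      # follow-up provider ran successfully
    confidence -= 0.1 * len(skipped_providers)  # per skipped provider
    # Clamp to the valid range
    return max(0.0, min(1.0, confidence))
```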
### Interpreting Confidence
| Confidence | Interpretation |
|---|---|
| 0.8+ | High confidence; reliable score |
| 0.6-0.8 | Moderate confidence; reasonable estimate |
| 0.4-0.6 | Low confidence; limited data |
| Below 0.4 | Very low confidence; treat with skepticism |
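The bands in the table map directly to a lookup. A hypothetical helper (the function name and label strings are illustrative, not part of Codemetry):

```python
def confidence_label(confidence):
    """Map a confidence value to the interpretation bands above.

    Band boundaries follow the table; labels are illustrative.
    """
    if confidence >= 0.8:
        return "high"       # reliable score
    if confidence >= 0.6:
        return "moderate"   # reasonable estimate
    if confidence >= 0.4:
        return "low"        # limited data
    return "very low"       # treat with skepticism
```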
### Common Low-Confidence Scenarios
Weekend/holiday with few commits:
- Only 1-2 commits
- Base score holds (70) but confidence drops
- Don’t overinterpret these days
Filtered analysis:
- `--author="Jane"` may show days with 0 commits from Jane
- Low confidence reflects the limited filtered data
Provider failures:
- Git errors, permission issues
- Provider adds a `provider_skipped:*` confounder
- Confidence reduced accordingly
## Confounders
Confounders are contextual flags that help you interpret the score:
### Available Confounders
| Confounder | When Added | Meaning |
|---|---|---|
| `large_refactor_suspected` | Churn p95+ but fix density ≤p50 | High activity without many follow-up fixes; likely intentional restructuring |
| `formatting_or_rename_suspected` | High churn + many files touched + low follow-up | Many files changed with few corrections; likely formatting, renaming, or bulk changes |
| `ai_unavailable` | AI enabled but couldn’t run | Missing API key, network error, or API failure |
| `provider_skipped:<id>` | Provider threw an error | Specific provider failed; signals from that provider are missing |
### `large_refactor_suspected`
This is the most important confounder for interpretation.
Triggered when:
- Churn is very high (95th+ percentile)
- But follow-up fix density is low (≤50th percentile)
Interpretation:
- High churn usually means risk
- But if the changes didn’t need fixing, they were probably intentional
- This could be:
- Planned refactoring
- Major feature completion
- Codebase reorganization
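The trigger reduces to a two-part predicate. A sketch under assumed names (the percentile thresholds would come from Codemetry's own baseline; everything here is illustrative):

```python
def large_refactor_suspected(churn, fix_density, churn_p95, fix_p50):
    """Sketch of the trigger described above: churn at or above the
    95th percentile while follow-up fix density stays at or below the
    50th. Parameter names are hypothetical."""
    return churn >= churn_p95 and fix_density <= fix_p50
```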
### `formatting_or_rename_suspected`
Triggered when:
- High churn
- Many files touched
- Low follow-up fix rate
Interpretation:
- Bulk changes across many files
- But not generating follow-up fixes
- Likely:
- Code formatting (prettier, php-cs-fixer)
- Mass renames/refactors
- Automated changes
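This confounder combines three conditions. A minimal sketch, with hypothetical threshold parameters (Codemetry derives its own cutoffs):

```python
def formatting_or_rename_suspected(churn, files_touched, fix_rate,
                                   churn_hi, files_hi, fix_lo):
    """Sketch of the three-part condition above: high churn, many
    files touched, and a low follow-up fix rate."""
    return (churn >= churn_hi
            and files_touched >= files_hi
            and fix_rate <= fix_lo)
```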
### `ai_unavailable`
Triggered when:
- The `--ai=1` flag was used
- But the AI engine couldn’t complete
Causes:
- No API key configured
- Invalid API key
- Network timeout
- API rate limiting
Impact:
- Analysis continues without AI summaries
- Heuristic score unchanged
- Just indicates AI feature didn’t work
### `provider_skipped:<id>`
Format: `provider_skipped:change_shape`, `provider_skipped:follow_up_fix`, etc.
Triggered when:
- A signal provider throws an exception
- Codemetry catches it and continues
Causes:
- Git command failures
- Permission issues
- Corrupt repository state
Impact:
- Signals from that provider are missing
- Score based on available signals only
- Confidence reduced by 0.1 per skipped provider
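The catch-and-continue behaviour described above can be sketched as follows. The structure and names are assumptions for illustration, not Codemetry's internals:

```python
def run_providers(providers, signals, confounders):
    """Illustrative sketch of skip-and-continue: a provider that
    raises is recorded as a confounder and the rest keep running.

    providers: dict of provider id -> zero-argument callable.
    Returns the list of skipped provider ids.
    """
    skipped = []
    for provider_id, provider in providers.items():
        try:
            signals[provider_id] = provider()
        except Exception:
            # Record the failure and continue with remaining providers
            confounders.append(f"provider_skipped:{provider_id}")
            skipped.append(provider_id)
    return skipped
```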
## Using Confidence and Confounders Together
### Decision Matrix
| Score | Confidence | Confounders | Action |
|---|---|---|---|
| Bad | High | None | Investigate; likely real issues |
| Bad | High | `large_refactor_suspected` | Check if the refactor was planned |
| Bad | Low | None | Note it, but don’t overreact |
| Medium | High | None | Monitor; normal development |
| Good | High | None | All clear |
| Any | Low | `provider_skipped:*` | Check Git/system issues |
### Example Interpretation
```json
{
  "mood_score": 28,
  "confidence": 0.85,
  "confounders": ["large_refactor_suspected"]
}
```

Reading this:
- Score 28 = “bad” label
- Confidence 0.85 = reliable score
- Confounder = possibly intentional
Conclusion: This was a high-activity day, but the `large_refactor_suspected` flag suggests it may have been planned work. Check the commits: if they are a coordinated refactoring effort, the “bad” score is expected and acceptable.