
Confidence & Confounders

Not all scores are equally reliable. Codemetry provides two mechanisms to help interpret results: confidence scores and confounders.

Confidence

Confidence is a number from 0.0 to 1.0 indicating how reliable the score is.

How Confidence is Calculated

Base confidence: 0.6

Increases:

| Condition | Adjustment |
|---|---|
| 3+ commits in window | +0.1 |
| Follow-up provider ran successfully | +0.1 |

Decreases:

| Condition | Adjustment |
|---|---|
| 0-1 commits in window | -0.2 |
| Key provider skipped | -0.1 per provider |

Exactly 2 commits triggers neither commit-count adjustment. The result is clamped to the range 0.0 to 1.0.
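The adjustments above can be expressed as a short function. This is a minimal sketch, assuming the three inputs (commit count in the window, whether the follow-up provider ran, and how many key providers were skipped) are already known; the name `compute_confidence` and its parameters are hypothetical, not Codemetry's own implementation. Under these rules alone, the highest reachable value is 0.8.

```python
BASE_CONFIDENCE = 0.6


def compute_confidence(commit_count: int, follow_up_ran: bool, skipped_key_providers: int) -> float:
    """Apply the documented adjustments to the 0.6 base and clamp to [0.0, 1.0]."""
    confidence = BASE_CONFIDENCE
    if commit_count >= 3:
        confidence += 0.1  # 3+ commits in window
    elif commit_count <= 1:
        confidence -= 0.2  # 0-1 commits in window
    # Exactly 2 commits: no commit-count adjustment.
    if follow_up_ran:
        confidence += 0.1  # follow-up provider ran successfully
    confidence -= 0.1 * skipped_key_providers  # -0.1 per skipped key provider
    clamped = min(1.0, max(0.0, confidence))
    # Round so float drift (e.g. 0.7999999999999999) cannot cross a band boundary.
    return round(clamped, 2)
```

For example, a weekend day with a single commit where the follow-up provider still ran gives 0.6 - 0.2 + 0.1 = 0.5. Rounding to two decimals matters because plain float addition of 0.6 + 0.1 + 0.1 lands just below 0.8.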

Interpreting Confidence

| Confidence | Interpretation |
|---|---|
| 0.8 and above | High confidence; reliable score |
| 0.6 to below 0.8 | Moderate confidence; reasonable estimate |
| 0.4 to below 0.6 | Low confidence; limited data |
| Below 0.4 | Very low confidence; treat with skepticism |
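The bands can be written as an ordered lookup. This sketch assumes each band includes its lower bound, so exactly 0.8 counts as high and exactly 0.6 as moderate; the function name `interpret_confidence` is hypothetical.

```python
def interpret_confidence(confidence: float) -> str:
    """Map a confidence value to its band; lower bounds are inclusive (assumption)."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError(f"confidence must be in [0.0, 1.0], got {confidence}")
    if confidence >= 0.8:
        return "High confidence; reliable score"
    if confidence >= 0.6:
        return "Moderate confidence; reasonable estimate"
    if confidence >= 0.4:
        return "Low confidence; limited data"
    return "Very low confidence; treat with skepticism"
```

Checking from the top band downward means each comparison only needs a lower bound.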

Common Low-Confidence Scenarios

Weekend/holiday with few commits:

  • Only 1-2 commits
  • The score stays at its base value of 70, but confidence drops
  • Don’t overinterpret these days

Filtered analysis:

  • --author="Jane" may show days with 0 commits from Jane
  • Low confidence reflects how little data remains after filtering

Provider failures:

  • Git errors, permission issues
  • Codemetry adds a provider_skipped:<id> confounder for each failed provider
  • Confidence is reduced by 0.1 per skipped key provider

Confounders

Confounders are contextual flags attached to a result that help you interpret its score:

Available Confounders

| Confounder | When Added | Meaning |
|---|---|---|
| large_refactor_suspected | Churn ≥ p95 but fix density ≤ p50 | High activity without many follow-up fixes; likely intentional restructuring |
| formatting_or_rename_suspected | High churn + many files touched + low follow-up | Many files changed with few corrections; likely formatting, renaming, or bulk changes |
| ai_unavailable | AI enabled but couldn’t run | Missing API key, network error, or API failure |
| provider_skipped:<id> | Provider threw an error | Specific provider failed; signals from that provider are missing |
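Confounders are plain string flags, as in the JSON example on this page, so turning them into readable notes is a dictionary lookup plus one prefix check for the parameterized provider_skipped:<id> form. A minimal sketch; the names `CONFOUNDER_MEANINGS` and `describe_confounder` are hypothetical.

```python
CONFOUNDER_MEANINGS = {
    "large_refactor_suspected": "High activity without many follow-up fixes; likely intentional restructuring",
    "formatting_or_rename_suspected": "Many files changed with few corrections; likely formatting, renaming, or bulk changes",
    "ai_unavailable": "AI enabled but couldn't run (missing API key, network error, or API failure)",
}

SKIPPED_PREFIX = "provider_skipped:"


def describe_confounder(flag: str) -> str:
    """Return a human-readable meaning for a confounder flag string."""
    if flag.startswith(SKIPPED_PREFIX):
        # The part after the prefix is the provider id, e.g. "change_shape".
        provider_id = flag[len(SKIPPED_PREFIX):]
        return f"Provider '{provider_id}' failed; its signals are missing"
    # Unknown flags are reported rather than silently dropped.
    return CONFOUNDER_MEANINGS.get(flag, f"Unknown confounder: {flag}")


for flag in ["large_refactor_suspected", "provider_skipped:follow_up_fix"]:
    print(f"{flag}: {describe_confounder(flag)}")
```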

large_refactor_suspected

This is the most important confounder for interpretation.

Triggered when:

  • Churn is very high (95th+ percentile)
  • But follow-up fix density is low (≤50th percentile)

Interpretation:

  • High churn usually means risk
  • But if the changes didn’t need fixing, they were probably intentional
  • This could be:
    • Planned refactoring
    • Major feature completion
    • Codebase reorganization
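The trigger conditions for large_refactor_suspected compare the day's values against their own history. This sketch assumes a simple "share of historical values at or below today's value" percentile rank; Codemetry's actual baseline and percentile method may differ, and both function names are hypothetical.

```python
from typing import Sequence


def percentile_rank(value: float, history: Sequence[float]) -> float:
    """Share of historical values at or below `value`, as a percentage (0-100)."""
    if not history:
        raise ValueError("history must not be empty")
    at_or_below = sum(1 for h in history if h <= value)
    return 100.0 * at_or_below / len(history)


def large_refactor_suspected(
    churn: float,
    churn_history: Sequence[float],
    fix_density: float,
    fix_density_history: Sequence[float],
) -> bool:
    """Churn at p95 or above while follow-up fix density stays at or below p50."""
    return (
        percentile_rank(churn, churn_history) >= 95.0
        and percentile_rank(fix_density, fix_density_history) <= 50.0
    )
```

An interpolated percentile definition would give slightly different answers near the p95 and p50 edges, especially with short histories.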

formatting_or_rename_suspected

Triggered when:

  • High churn
  • Many files touched
  • Low follow-up fix rate

Interpretation:

  • Bulk changes across many files
  • But not generating follow-up fixes
  • Likely:
    • Code formatting (Prettier, php-cs-fixer)
    • Mass renames/refactors
    • Automated changes
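This page describes the formatting_or_rename_suspected trigger only qualitatively ("high", "many", "low"), so any numeric version needs assumed cutoffs. In the sketch below, the values 90 and 50 are hypothetical placeholders, the inputs are assumed to be precomputed percentile ranks, and `DayPercentiles` is an illustrative type.

```python
from dataclasses import dataclass

# HYPOTHETICAL cutoffs: the docs say "high", "many", and "low" without numbers.
HIGH_PCT = 90.0
LOW_PCT = 50.0


@dataclass
class DayPercentiles:
    """Percentile ranks (0-100) of one day's signals against a baseline."""
    churn: float
    files_touched: float
    follow_up_fix_rate: float


def formatting_or_rename_suspected(
    day: DayPercentiles, high: float = HIGH_PCT, low: float = LOW_PCT
) -> bool:
    """High churn spread over many files, with few follow-up fixes afterwards."""
    return day.churn >= high and day.files_touched >= high and day.follow_up_fix_rate <= low
```

In this sketch, the files-touched condition is what separates a sweeping mechanical change from a high-churn change concentrated in a few files.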

ai_unavailable

Triggered when:

  • The --ai=1 flag was used
  • But the AI engine couldn’t complete

Causes:

  • No API key configured
  • Invalid API key
  • Network timeout
  • API rate limiting

Impact:

  • Analysis continues without AI summaries
  • Heuristic score unchanged
  • The flag only indicates that the AI feature didn’t run
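The impact described above amounts to an optional step with a fallback: try the AI call, and on any failure record ai_unavailable while leaving the heuristic score alone. A minimal sketch; `AnalysisResult`, its `ai_summary` field, and `attach_ai_summary` are hypothetical, and only `mood_score` and the confounder name come from this page.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class AnalysisResult:
    """Hypothetical result shape; only mood_score and confounders mirror the docs."""
    mood_score: int
    confounders: List[str] = field(default_factory=list)
    ai_summary: Optional[str] = None


def attach_ai_summary(
    result: AnalysisResult, ai_enabled: bool, summarize: Callable[[], str]
) -> AnalysisResult:
    """Try the AI step; on any failure, flag ai_unavailable and keep the heuristic score."""
    if not ai_enabled:
        return result
    try:
        result.ai_summary = summarize()
    except Exception:
        # Missing/invalid key, timeout, rate limiting: all collapse into one flag.
        result.confounders.append("ai_unavailable")
    return result
```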

provider_skipped:<id>

Format: provider_skipped:change_shape, provider_skipped:follow_up_fix, etc.

Triggered when:

  • A signal provider throws an exception
  • Codemetry catches it and continues

Causes:

  • Git command failures
  • Permission issues
  • Corrupt repository state

Impact:

  • Signals from that provider are missing
  • Score based on available signals only
  • Confidence reduced by 0.1 per skipped provider
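The catch-and-continue behavior might look like the following. The provider callables and the signal name `churn` are hypothetical stand-ins; only the provider_skipped:<id> format and the 0.1 penalty come from this page. For simplicity, the sketch treats every skipped provider as a key provider.

```python
from typing import Callable, Dict, List, Tuple

SignalProvider = Callable[[], Dict[str, float]]


def run_providers(providers: Dict[str, SignalProvider]) -> Tuple[Dict[str, float], List[str]]:
    """Collect signals from each provider, flagging any provider that raises."""
    signals: Dict[str, float] = {}
    confounders: List[str] = []
    for provider_id, collect in providers.items():
        try:
            signals.update(collect())
        except Exception:
            # Catch, record, and continue with the remaining providers.
            confounders.append(f"provider_skipped:{provider_id}")
    return signals, confounders


def skipped_provider_penalty(confounders: List[str]) -> float:
    """0.1 confidence penalty per provider_skipped:* flag."""
    skipped = sum(1 for flag in confounders if flag.startswith("provider_skipped:"))
    return round(0.1 * skipped, 2)
```

Because `collect()` raises before `update` runs, a failing provider contributes no partial signals.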

Using Confidence and Confounders Together

Decision Matrix

| Score | Confidence | Confounders | Action |
|---|---|---|---|
| Bad | High | None | Investigate; likely real issues |
| Bad | High | large_refactor_suspected | Check if refactor was planned |
| Bad | Low | None | Note it, but don’t overreact |
| Medium | High | None | Monitor; normal development |
| Good | High | None | All clear |
| Any | Low | provider_skipped:* | Check Git/system issues |
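The matrix can be encoded as an ordered set of rules. This sketch assumes the score label ("bad", "medium", "good") and a two-level confidence ("high", "low") are already derived, since this page does not say how the four confidence bands collapse into High/Low. The provider_skipped:* row is checked first because it applies to any score. The function name and the fallback message are hypothetical.

```python
from typing import Iterable


def recommended_action(label: str, confidence: str, confounders: Iterable[str]) -> str:
    """Look up the decision-matrix action; label and confidence are lowercase strings."""
    flags = set(confounders)
    # The provider_skipped:* row applies to any score, so check it first.
    if confidence == "low" and any(f.startswith("provider_skipped:") for f in flags):
        return "Check Git/system issues"
    if label == "bad" and confidence == "high" and "large_refactor_suspected" in flags:
        return "Check if refactor was planned"
    if not flags:
        rows = {
            ("bad", "high"): "Investigate; likely real issues",
            ("bad", "low"): "Note it, but don't overreact",
            ("medium", "high"): "Monitor; normal development",
            ("good", "high"): "All clear",
        }
        if (label, confidence) in rows:
            return rows[(label, confidence)]
    return "No matching row; review manually"  # hypothetical fallback
```

Applied to the example on this page (score 28 labeled "bad", confidence 0.85 treated as high, large_refactor_suspected present), `recommended_action("bad", "high", ["large_refactor_suspected"])` returns "Check if refactor was planned".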

Example Interpretation

{
  "mood_score": 28,
  "confidence": 0.85,
  "confounders": ["large_refactor_suspected"]
}

Reading this:

  • Score 28 = “bad” label
  • Confidence 0.85 = reliable score
  • Confounder = possibly intentional

Conclusion: This was a high-activity day, but the large_refactor_suspected flag suggests it may have been planned work. Check the commits—if they’re a coordinated refactoring effort, the “bad” score is expected and acceptable.