πŸ” CodeSheriff Bug Classifier

A fine-tuned CodeBERT model that classifies Python code snippets into five bug categories. Built as the classification engine inside CodeSheriff, an AI system that automatically reviews GitHub pull requests.

Base model: microsoft/codebert-base · Task: 5-class sequence classification · Language: Python


Labels

ID  Label                   Example
0   Clean                   Well-formed code, no issues
1   Null Reference Risk     result.fetchone().name without a None check
2   Type Mismatch           "Error: " + error_code where error_code is an int
3   Security Vulnerability  "SELECT * FROM users WHERE id = " + user_id
4   Logic Flaw              for i in range(len(items) + 1)

Usage

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

tokenizer = AutoTokenizer.from_pretrained("jayansh21/codesheriff-bug-classifier")
model = AutoModelForSequenceClassification.from_pretrained("jayansh21/codesheriff-bug-classifier")

LABELS = {
    0: "Clean",
    1: "Null Reference Risk",
    2: "Type Mismatch",
    3: "Security Vulnerability",
    4: "Logic Flaw"
}

code = """
def get_user(uid):
    query = "SELECT * FROM users WHERE id=" + uid
    return db.execute(query)
"""

inputs = tokenizer(code, return_tensors="pt", truncation=True, max_length=512)
with torch.no_grad():
    logits = model(**inputs).logits

probs = torch.softmax(logits, dim=-1)
pred = logits.argmax(dim=-1).item()
confidence = probs[0][pred].item()

print(f"{LABELS[pred]} ({confidence:.1%})")
# Security Vulnerability (99.3%)
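
In a PR-review setting you usually act only on confident, non-Clean predictions. A minimal post-processing sketch on top of the logits above (the 0.85 threshold and the `flag_prediction` helper are illustrative assumptions, not part of the released model):

```python
from typing import Optional

import torch

# Illustrative post-processing: only flag a snippet when the top class is a
# bug class AND the model is confident. The 0.85 threshold is an assumption,
# not a value from this model card.
CONFIDENCE_THRESHOLD = 0.85

def flag_prediction(logits: torch.Tensor, labels: dict) -> Optional[str]:
    """Return a human-readable flag, or None for Clean / low-confidence output."""
    probs = torch.softmax(logits, dim=-1)
    pred = int(probs.argmax(dim=-1).item())
    confidence = probs[0][pred].item()
    if pred == 0 or confidence < CONFIDENCE_THRESHOLD:
        return None  # clean, or too uncertain to act on
    return f"{labels[pred]} ({confidence:.1%})"
```

In a pipeline this keeps noisy low-confidence predictions out of PR comments; the right threshold depends on how costly false positives are for your reviewers.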

Training

Dataset: CodeSearchNet Python split with heuristic labeling, augmented with seed templates for underrepresented classes. Final training set: 4,600 balanced samples across all five classes. Stratified 80/10/10 train/val/test split.
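
The exact labeling heuristics are not published; conceptually they are lexical pattern matches like the examples in the label table. A toy sketch of the idea (the regexes and the `heuristic_label` helper are made up for illustration, not the actual labeling pipeline):

```python
import re

# Illustrative heuristics only -- the real labeling rules are not published.
# Each rule maps a lexical pattern to one of the five class IDs.
HEURISTICS = [
    (3, re.compile(r'["\'](SELECT|INSERT|UPDATE|DELETE)\b.*["\']\s*\+')),  # SQL built by string concat
    (1, re.compile(r'\.fetchone\(\)\.\w+')),             # attribute access on a possibly-None result
    (4, re.compile(r'range\(len\([^)]*\)\s*\+\s*1\)')),  # off-by-one loop bound
]

def heuristic_label(snippet: str) -> int:
    """Return the first matching class ID, or 0 (Clean) if no rule fires."""
    for label, pattern in HEURISTICS:
        if pattern.search(snippet):
            return label
    return 0
```

Heuristic labels like these are cheap but noisy, which is why the Limitations section warns that the training data was pattern-matched rather than expert-annotated.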

Key hyperparameters:

Parameter             Value
Epochs                4
Effective batch size  16 (8 × 2 grad accum)
Learning rate         2e-5
Optimizer             AdamW + linear warmup
Max token length      512
Class weighting       Yes (balanced)
Hardware              NVIDIA RTX 3050 (4 GB)
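
"Balanced" class weighting typically means sklearn-style inverse-frequency weights passed into the cross-entropy loss (for example via a custom `compute_loss` override in a Hugging Face `Trainer`). A minimal sketch, with illustrative per-class counts rather than the actual training distribution:

```python
import torch
from torch import nn

# "Balanced" weights in the sklearn sense: n_samples / (n_classes * count_c).
# The per-class counts below are illustrative, not the real training counts.
counts = torch.tensor([2000.0, 800.0, 600.0, 600.0, 600.0])
n_samples, n_classes = counts.sum(), len(counts)
weights = n_samples / (n_classes * counts)

# Rare classes get weight > 1, common classes get weight < 1, so the loss
# penalizes mistakes on underrepresented classes more heavily.
loss_fn = nn.CrossEntropyLoss(weight=weights)

logits = torch.randn(4, 5)           # (batch, num_labels)
labels = torch.tensor([0, 1, 3, 4])
loss = loss_fn(logits, labels)
```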

Evaluation

Test set: 840 samples (stratified).

Class                   Precision  Recall  F1    Support
Clean                   0.92       0.88    0.90  450
Null Reference Risk     0.63       0.78    0.70  120
Type Mismatch           0.96       0.95    0.95  75
Security Vulnerability  0.99       0.92    0.95  75
Logic Flaw              0.96       0.97    0.97  120
Macro avg               0.89       0.90    0.89  840

Confusion matrix:

                 Clean  NullRef  TypeMis  SecVuln  Logic
Actual Clean   [  394      52        1        1      2  ]
Actual NullRef [   23      93        1        0      3  ]
Actual TypeMis [    3       1       71        0      0  ]
Actual SecVuln [    4       1        1       69      0  ]
Actual Logic   [    3       0        0        0    117  ]

Logic Flaw and Security Vulnerability are the strongest classes: both have clear lexical patterns. Null Reference Risk is the weakest (precision 0.63) because null-risk code is structurally close to clean code. Most of its misclassifications are false positives rather than missed bugs.
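
The per-class numbers in the table above can be recomputed directly from the confusion matrix (rows are actual classes, columns are predictions), which makes for a quick sanity check:

```python
# Recompute precision/recall from the confusion matrix above
# (rows = actual class, columns = predicted class).
cm = [
    [394, 52,  1,  1,   2],  # Clean
    [ 23, 93,  1,  0,   3],  # Null Reference Risk
    [  3,  1, 71,  0,   0],  # Type Mismatch
    [  4,  1,  1, 69,   0],  # Security Vulnerability
    [  3,  0,  0,  0, 117],  # Logic Flaw
]

def precision(cm, c):
    """Correct predictions of class c over all predictions of class c."""
    return cm[c][c] / sum(row[c] for row in cm)

def recall(cm, c):
    """Correct predictions of class c over all actual samples of class c."""
    return cm[c][c] / sum(cm[c])

# Null Reference Risk: precision 93/147 ~= 0.63, recall 93/120 ~= 0.78,
# matching the reported table.
```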


Limitations

  • Python only: not trained on other languages
  • Function-level input: works best on 5–50 line snippets
  • Heuristic labels: training data was pattern-matched, not expert-annotated
  • Not a SAST replacement: a probabilistic classifier, not a sound static-analysis tool
