πŸ” CodeSheriff Bug Classifier

A fine-tuned CodeBERT model that classifies Python code snippets into five bug categories. Built as the classification engine inside CodeSheriff, an AI system that automatically reviews GitHub pull requests.

Base model: microsoft/codebert-base · Task: 5-class sequence classification · Language: Python


Labels

ID  Label                   Example
0   Clean                   Well-formed code, no issues
1   Null Reference Risk     result.fetchone().name without a None check
2   Type Mismatch           "Error: " + error_code where error_code is an int
3   Security Vulnerability  "SELECT * FROM users WHERE id = " + user_id
4   Logic Flaw              for i in range(len(items) + 1)

Usage

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

tokenizer = AutoTokenizer.from_pretrained("jayansh21/codesheriff-bug-classifier")
model = AutoModelForSequenceClassification.from_pretrained("jayansh21/codesheriff-bug-classifier")

LABELS = {
    0: "Clean",
    1: "Null Reference Risk",
    2: "Type Mismatch",
    3: "Security Vulnerability",
    4: "Logic Flaw"
}

code = """
def get_user(uid):
    query = "SELECT * FROM users WHERE id=" + uid
    return db.execute(query)
"""

inputs = tokenizer(code, return_tensors="pt", truncation=True, max_length=512)
with torch.no_grad():
    logits = model(**inputs).logits

probs = torch.softmax(logits, dim=-1)
pred = logits.argmax(dim=-1).item()
confidence = probs[0][pred].item()

print(f"{LABELS[pred]} ({confidence:.1%})")
# Security Vulnerability (99.3%)
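
In a PR-review setting you usually act only on confident, non-Clean predictions. A minimal post-processing sketch on top of the logits above (the 0.85 threshold and the `flag_prediction` helper are illustrative assumptions, not part of the released model):

```python
from typing import Optional

import torch

# Illustrative post-processing: only flag a snippet when the top class is a
# bug class AND the model is confident. The 0.85 threshold is an assumption,
# not a value from this model card.
CONFIDENCE_THRESHOLD = 0.85

def flag_prediction(logits: torch.Tensor, labels: dict) -> Optional[str]:
    """Return a human-readable flag, or None for Clean / low-confidence output."""
    probs = torch.softmax(logits, dim=-1)
    pred = int(probs.argmax(dim=-1).item())
    confidence = probs[0][pred].item()
    if pred == 0 or confidence < CONFIDENCE_THRESHOLD:
        return None  # clean, or too uncertain to act on
    return f"{labels[pred]} ({confidence:.1%})"
```

In a pipeline this keeps noisy low-confidence predictions out of PR comments; the right threshold depends on how costly false positives are for your reviewers.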

Training

Dataset: CodeSearchNet Python split with heuristic labeling, augmented with seed templates for underrepresented classes. Final training set: 4,600 balanced samples across all five classes. Stratified 80/10/10 train/val/test split.
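
The exact labeling heuristics are not published; conceptually they are lexical pattern matches like the examples in the label table. A toy sketch of the idea (the regexes and the `heuristic_label` helper are made up for illustration, not the actual labeling pipeline):

```python
import re

# Illustrative heuristics only -- the real labeling rules are not published.
# Each rule maps a lexical pattern to one of the five class IDs.
HEURISTICS = [
    (3, re.compile(r'["\'](SELECT|INSERT|UPDATE|DELETE)\b.*["\']\s*\+')),  # SQL built by string concat
    (1, re.compile(r'\.fetchone\(\)\.\w+')),             # attribute access on a possibly-None result
    (4, re.compile(r'range\(len\([^)]*\)\s*\+\s*1\)')),  # off-by-one loop bound
]

def heuristic_label(snippet: str) -> int:
    """Return the first matching class ID, or 0 (Clean) if no rule fires."""
    for label, pattern in HEURISTICS:
        if pattern.search(snippet):
            return label
    return 0
```

Heuristic labels like these are cheap but noisy, which is why the Limitations section warns that the training data was pattern-matched rather than expert-annotated.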

Key hyperparameters:

Parameter             Value
Epochs                4
Effective batch size  16 (8 × 2 grad accum)
Learning rate         2e-5
Optimizer             AdamW + linear warmup
Max token length      512
Class weighting       Yes (balanced)
Hardware              NVIDIA RTX 3050 (4 GB)
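
"Balanced" class weighting typically means sklearn-style inverse-frequency weights passed into the cross-entropy loss (for example via a custom `compute_loss` override in a Hugging Face `Trainer`). A minimal sketch, with illustrative per-class counts rather than the actual training distribution:

```python
import torch
from torch import nn

# "Balanced" weights in the sklearn sense: n_samples / (n_classes * count_c).
# The per-class counts below are illustrative, not the real training counts.
counts = torch.tensor([2000.0, 800.0, 600.0, 600.0, 600.0])
n_samples, n_classes = counts.sum(), len(counts)
weights = n_samples / (n_classes * counts)

# Rare classes get weight > 1, common classes get weight < 1, so the loss
# penalizes mistakes on underrepresented classes more heavily.
loss_fn = nn.CrossEntropyLoss(weight=weights)

logits = torch.randn(4, 5)           # (batch, num_labels)
labels = torch.tensor([0, 1, 3, 4])
loss = loss_fn(logits, labels)
```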

Evaluation

Test set: 840 samples (stratified).

Class                   Precision  Recall  F1    Support
Clean                   0.92       0.88    0.90  450
Null Reference Risk     0.63       0.78    0.70  120
Type Mismatch           0.96       0.95    0.95  75
Security Vulnerability  0.99       0.92    0.95  75
Logic Flaw              0.96       0.97    0.97  120
Macro avg               0.89       0.90    0.89  840

Confusion matrix:

                 Clean  NullRef  TypeMis  SecVuln  Logic
Actual Clean   [  394      52        1        1      2  ]
Actual NullRef [   23      93        1        0      3  ]
Actual TypeMis [    3       1       71        0      0  ]
Actual SecVuln [    4       1        1       69      0  ]
Actual Logic   [    3       0        0        0    117  ]

Logic Flaw and Security Vulnerability are the strongest classes: both have clear lexical patterns. Null Reference Risk is the weakest (precision 0.63) because null-risk code is structurally close to clean code. Most of its misclassifications are false positives rather than missed bugs.
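
The per-class numbers in the table above can be recomputed directly from the confusion matrix (rows are actual classes, columns are predictions), which makes for a quick sanity check:

```python
# Recompute precision/recall from the confusion matrix above
# (rows = actual class, columns = predicted class).
cm = [
    [394, 52,  1,  1,   2],  # Clean
    [ 23, 93,  1,  0,   3],  # Null Reference Risk
    [  3,  1, 71,  0,   0],  # Type Mismatch
    [  4,  1,  1, 69,   0],  # Security Vulnerability
    [  3,  0,  0,  0, 117],  # Logic Flaw
]

def precision(cm, c):
    """Correct predictions of class c over all predictions of class c."""
    return cm[c][c] / sum(row[c] for row in cm)

def recall(cm, c):
    """Correct predictions of class c over all actual samples of class c."""
    return cm[c][c] / sum(cm[c])

# Null Reference Risk: precision 93/147 ~= 0.63, recall 93/120 ~= 0.78,
# matching the reported table.
```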


Limitations

  • Python only: not trained on other languages
  • Function-level input: works best on 5–50 line snippets
  • Heuristic labels: training data was pattern-matched, not expert-annotated
  • Not a SAST replacement: a probabilistic classifier, not a sound static-analysis tool
