---
title: Code Comment Classification Api
emoji: 📚
colorFrom: indigo
colorTo: yellow
sdk: docker
pinned: false
license: mit
short_description: Multi-label classification of code-comment sentences
---

## Overview

CodeCommentClassification is an end-to-end pipeline that classifies comment sentences into language-specific categories and aggregates the results at file/PR level, so reviewers can focus on rationale, usage notes, deprecations, examples, and other high-value signals. The project aims to surpass the NLBSE’26 baselines and provides reproducible training, evaluation, and inference.

The full documentation is available here: https://se4ai2526-uniba.github.io/TheClouds/

#### Core choices:
- Task: multi-label text classification at sentence level
- Scope: three languages with per-language models (Java, Python, Pharo)
- Usage: batch predictions on submissions (pre-review), summaries per file/PR
- Human-in-the-loop: reviewer confirmations/overrides feed threshold recalibration

#### Current model type:
CodeBERT, a bimodal transformer pretrained on code and natural language, generates contextual embeddings for comment sentences, which are then used for multi-label classification (e.g., Java: Macro F1 0.7457, Micro F1 0.8364; Python: Macro F1 0.6385). The current (best) models are automatically downloaded from MLflow for each language.

Model cards are available here: https://huggingface.co/spaces/seai2526-uniba-TheClouds/Code-Comment-Classification-Api/tree/main/models/model_cards

#### API:
The API module runs as a secure FastAPI web service in a Python 3.11 Docker container and automatically syncs the latest champion models from MLflow at startup. It exposes endpoints for dynamic model listing and for core prediction. The prediction endpoint accepts a code comment together with a language and a model type, and returns multi-label classifications from SetFit, Random Forest, or Transformer models via lazy-loaded predictors.

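
As an illustration of the prediction flow described above, here is a minimal client sketch that uses only the Python standard library. The schema of the request payload is not given in this README, so the JSON field names (`text`, `language`, `model_type`), the example values, and the base URL are assumptions; check the generated `/docs` page for the actual schema before relying on them.

```python
import json
import urllib.request

# Hypothetical base URL; replace it with the address where the container is exposed.
BASE_URL = "http://localhost:7860"


def build_predict_request(text, language, model_type, base_url=BASE_URL):
    """Build (but do not send) a POST request for the /predict endpoint.

    The field names below are assumptions, not the confirmed request schema.
    """
    payload = {"text": text, "language": language, "model_type": model_type}
    return urllib.request.Request(
        url=f"{base_url}/predict",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(text, language, model_type, base_url=BASE_URL, timeout=30.0):
    """Send the request and return the decoded JSON response."""
    request = build_predict_request(text, language, model_type, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `predict("This method is deprecated.", "java", "transformer")` against a running instance would return whatever JSON the service produces. Request construction is kept separate from sending so the payload can be inspected without network access.
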

Endpoints:
- / (GET): Root health check returning a welcome message pointing to /docs.
- /privacy (GET): Static privacy notice confirming no data persistence.
- /status (GET): Simple running status indicator.
- /models (GET): Scans MODELS_DIR to list available language/model_type pairs dynamically.
- /predict (POST): Core inference endpoint; validates PredictRequest payload, loads predictor on-demand, runs classification on input text, and returns predictions list or detailed errors.

## main.py