
HUST RAG — Student Regulations Q&A System

A Retrieval-Augmented Generation (RAG) system that helps students query academic regulations and policies at Hanoi University of Science and Technology (HUST). The system processes Markdown-based regulation documents, stores them in a vector database, and uses a hybrid retrieval pipeline with reranking to provide accurate, context-grounded answers through a conversational chat interface.


✨ Key Features

  • Hybrid Search — Combines vector similarity search (ChromaDB) with BM25 keyword matching for both semantic and lexical retrieval
  • Reranking — Uses Qwen3-Reranker-8B via the SiliconFlow API to re-score and sort retrieved documents by relevance
  • Small-to-Big Retrieval — Summarizes large tables with an LLM, embeds the summary for search, and returns the full original table at query time
  • 4 Retrieval Modes — vector_only, bm25_only, hybrid, hybrid_rerank — configurable per query
  • Incremental Data Build — Hash-based change detection ensures only modified files are re-processed when rebuilding the database
  • Streaming Chat UI — Gradio-based conversational interface with real-time response streaming
  • RAGAS Evaluation — Built-in evaluation pipeline using the RAGAS framework with metrics such as faithfulness, relevancy, precision, recall, and ROUGE scores

πŸ—οΈ System Architecture

┌──────────────────────────────────────────────────────────────────────┐
│                        User Query (Gradio UI)                        │
└───────────────────────────────────┬──────────────────────────────────┘
                                    │
                                    ▼
┌──────────────────────────────────────────────────────────────────────┐
│                          Retrieval Pipeline                          │
│                                                                      │
│  ┌───────────────┐   ┌──────────────┐   ┌──────────────────────┐     │
│  │ Vector Search │ + │ BM25 Search  │ → │ Ensemble (weighted)  │     │
│  │  (ChromaDB)   │   │ (rank-bm25)  │   │ vector:0.5 + bm25:0.5│     │
│  └───────────────┘   └──────────────┘   └──────────┬───────────┘     │
│                                                    │                 │
│                                                    ▼                 │
│                                          ┌───────────────────┐       │
│                                          │ Qwen3-Reranker    │       │
│                                          │ (SiliconFlow API) │       │
│                                          └─────────┬─────────┘       │
│                                                    │                 │
│                                    Small-to-Big:   │                 │
│                                    summary hit →   │                 │
│                                    swap w/ parent  │                 │
└────────────────────────────────────────────────────┬─────────────────┘
                                                     │
                                                     ▼
┌──────────────────────────────────────────────────────────────────────┐
│                        Context Builder + LLM                         │
│                                                                      │
│  Context (top-k docs + metadata) → Prompt → LLM (Groq API)           │
│                                           → Streaming Response       │
└──────────────────────────────────────────────────────────────────────┘

📁 Project Structure

DoAn/
├── core/                          # Core application modules
│   ├── rag/                       # RAG engine
│   │   ├── chunk.py               # Markdown chunking with table extraction & Small-to-Big
│   │   ├── embedding_model.py     # Qwen3-Embedding wrapper (SiliconFlow API)
│   │   ├── vector_store.py        # ChromaDB wrapper with parent node storage
│   │   ├── retrieval.py           # Hybrid retriever + SiliconFlow reranker
│   │   └── generator.py           # Context builder & prompt construction
│   ├── gradio/                    # Chat interfaces
│   │   ├── user_gradio.py         # Main Gradio app (production + debug modes)
│   │   └── gradio_rag.py          # Debug mode launcher (thin wrapper)
│   └── hash_file/                 # File hashing utilities
│       └── hash_file.py           # SHA-256 hash processor for change detection
│
├── scripts/                       # Workflow scripts
│   ├── run_app.py                 # Application entry point (data check + env check + launch)
│   ├── build_data.py              # Build/update ChromaDB from markdown files
│   ├── download_data.py           # Download data from HuggingFace
│   └── run_eval.py                # Run RAGAS evaluation
│
├── evaluation/                    # Evaluation pipeline
│   ├── eval_utils.py              # Shared utilities (RAG init, answer generation)
│   └── ragas_eval.py              # RAGAS evaluation with multiple metrics
│
├── test/                          # Unit tests
│   ├── conftest.py                # Shared fixtures and sample data
│   ├── test_chunk.py              # Chunking logic tests
│   ├── test_embedding.py          # Embedding model tests
│   ├── test_vector_store.py       # Vector store tests
│   ├── test_retrieval.py          # Retrieval pipeline tests
│   ├── test_generator.py          # Generator/context builder tests
│   └── ...
│
├── data/                          # Data directory (downloaded from HuggingFace)
│   ├── data_process/              # Processed markdown files
│   └── chroma/                    # ChromaDB persistence directory
│
├── requirements.txt               # Python dependencies
├── setup.sh                       # Linux/Mac setup script
├── setup.bat                      # Windows setup script
└── .env                           # API keys (not tracked in git)

🚀 Getting Started

Prerequisites

  • Python 3.10+
  • API Keys:
    • SiliconFlow — for embedding (Qwen3-Embedding-4B) and reranking (Qwen3-Reranker-8B)
    • Groq — for LLM generation (Qwen3-32B)

Quick Setup (Recommended)

Run the automated setup script, which creates a virtual environment, installs dependencies, downloads data, and creates the .env file:

# Linux / macOS
bash setup.sh

# Windows
setup.bat

Then edit .env with your API keys:

SILICONFLOW_API_KEY=your_siliconflow_key
GROQ_API_KEY=your_groq_key

Manual Setup

# 1. Create and activate virtual environment
python3 -m venv venv
source venv/bin/activate        # Linux/Mac
# venv\Scripts\activate         # Windows

# 2. Install dependencies
pip install -r requirements.txt

# 3. Download data from HuggingFace
python scripts/download_data.py

# 4. Create .env file with your API keys
echo "SILICONFLOW_API_KEY=your_key" > .env
echo "GROQ_API_KEY=your_key" >> .env

Running the Application

source venv/bin/activate        # Linux/Mac
python scripts/run_app.py

Access the chat interface at: http://127.0.0.1:7860


📖 Usage Guide

Chat Interface

The Gradio chat interface supports natural language questions about HUST student regulations. Example questions:

  • Sinh viên vi phạm quy chế thi thì bị xử lý như thế nào? — Exam violation penalties
  • Điều kiện để đổi ngành là gì? — Major transfer requirements
  • Làm thế nào để đăng ký hoãn thi? — Exam postponement registration

Debug Mode

To launch the debug interface that shows retrieved documents and relevance scores:

python core/gradio/gradio_rag.py

Building/Updating the Database

When you add, modify, or delete markdown files in data/data_process/, rebuild the database:

# Incremental update (only changed files)
python scripts/build_data.py

# Force full rebuild
python scripts/build_data.py --force

# Skip orphan deletion
python scripts/build_data.py --no-delete

The build script will:

  1. Detect changed files via SHA-256 hash comparison
  2. Delete chunks from removed files
  3. Re-chunk and re-embed only modified files
  4. Automatically invalidate the BM25 cache
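The hash-based change detection in step 1 can be sketched as follows. This is a minimal illustration, not the actual hash_file.py implementation; the manifest filename and return shape are assumptions:

```python
import hashlib
import json
from pathlib import Path

def file_sha256(path: Path) -> str:
    """Hash a file's bytes, so only real content edits count as changes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()

def detect_changes(md_dir: Path, manifest_path: Path):
    """Compare current file hashes against a stored hash manifest.

    Returns (changed, removed): files that must be re-chunked/re-embedded,
    and files whose chunks should be deleted from the vector store.
    """
    old = json.loads(manifest_path.read_text()) if manifest_path.exists() else {}
    new = {str(p): file_sha256(p) for p in sorted(md_dir.glob("*.md"))}
    changed = [p for p, h in new.items() if old.get(p) != h]
    removed = [p for p in old if p not in new]
    manifest_path.write_text(json.dumps(new, indent=2))
    return changed, removed
```

On the first run every file is "changed"; afterwards only files whose bytes differ from the manifest are re-processed.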

🔧 Core Components

Chunking (core/rag/chunk.py)

Processes Markdown documents into searchable chunks:

  • YAML Frontmatter Extraction — Parses metadata (document type, year, cohort, program) into chunk metadata
  • Heading-based Splitting — Uses MarkdownNodeParser to split by headings, preserving document structure
  • Table Extraction & Splitting — Extracts Markdown tables and splits large tables into chunks of 15 rows
  • Small-to-Big Pattern — Summarizes tables with an LLM → embeds the summary → links to the parent (full table)
  • Small Chunk Merging — Merges chunks smaller than 200 characters with adjacent chunks
  • Metadata Enrichment — Extracts course names and codes from content using regex patterns

Configuration:

CHUNK_SIZE = 1500          # Maximum chunk size in characters
CHUNK_OVERLAP = 150        # Overlap between consecutive chunks
MIN_CHUNK_SIZE = 200       # Minimum chunk size (smaller chunks get merged)
TABLE_ROWS_PER_CHUNK = 15  # Maximum rows per table chunk
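The table-splitting step can be sketched like this. It is a simplified illustration (the real chunk.py also attaches summaries and metadata); the header and separator rows are repeated so every chunk stays a valid Markdown table on its own:

```python
TABLE_ROWS_PER_CHUNK = 15

def split_markdown_table(table_md: str, rows_per_chunk: int = TABLE_ROWS_PER_CHUNK):
    """Split a Markdown table into chunks of at most `rows_per_chunk` data rows.

    The header and separator lines are prepended to every chunk, so each
    chunk remains a self-describing, renderable Markdown table.
    """
    lines = [ln for ln in table_md.strip().splitlines() if ln.strip()]
    header, separator, data = lines[0], lines[1], lines[2:]
    return [
        "\n".join([header, separator, *data[i:i + rows_per_chunk]])
        for i in range(0, len(data), rows_per_chunk)
    ]
```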

Embedding (core/rag/embedding_model.py)

  • Model: Qwen3-Embedding-4B via SiliconFlow API
  • Dimensions: 2048
  • Batch processing with configurable batch size (default: 16)
  • Rate limit handling with exponential backoff retry
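The batching and backoff behavior can be sketched as below. This is an illustrative wrapper, not the actual embedding_model.py; `client.embed(batch)` is a hypothetical stand-in for the SiliconFlow embedding call:

```python
import time

def embed_with_retry(client, texts, batch_size=16, max_retries=5, base_delay=1.0):
    """Embed texts in batches, retrying each batch with exponential backoff.

    `client.embed(batch)` is assumed to return one vector per input text;
    on a rate-limit error the batch is retried after 1s, 2s, 4s, ...
    """
    vectors = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        for attempt in range(max_retries):
            try:
                vectors.extend(client.embed(batch))
                break
            except Exception:
                if attempt == max_retries - 1:
                    raise
                time.sleep(base_delay * 2 ** attempt)  # exponential backoff
    return vectors
```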

Vector Store (core/rag/vector_store.py)

  • Backend: ChromaDB with LangChain integration
  • Parent node storage: Separate JSON file for Small-to-Big parent nodes (not embedded)
  • Content-based document IDs: SHA-256 hash of (source_file, header_path, chunk_index, content)
  • Metadata flattening: Converts complex metadata types to ChromaDB-compatible formats
  • Batch operations: add_documents() and upsert_documents() with configurable batch size
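The content-based ID scheme can be sketched as follows (a minimal sketch; the exact field separator and encoding in vector_store.py may differ). Hashing the identity tuple together with the content makes upserts idempotent while guaranteeing that any content edit yields a new ID:

```python
import hashlib

def make_doc_id(source_file: str, header_path: str, chunk_index: int, content: str) -> str:
    """Derive a stable ChromaDB document ID from chunk identity + content.

    Identical chunks always map to the same ID (safe re-upserts), while
    any edit to the content produces a different ID.
    """
    key = "\x1f".join([source_file, header_path, str(chunk_index), content])
    return hashlib.sha256(key.encode("utf-8")).hexdigest()
```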

Retrieval (core/rag/retrieval.py)

  • vector_only — Pure vector similarity search via ChromaDB
  • bm25_only — Pure keyword matching via BM25 (with lazy-load and disk caching)
  • hybrid — Ensemble of vector + BM25 with configurable weights (default: 0.5/0.5)
  • hybrid_rerank — Hybrid search followed by Qwen3-Reranker-8B reranking (default)

Small-to-Big at retrieval time: When a table summary node is retrieved, it is automatically swapped with the full parent table before returning results to the user.
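The swap step can be sketched like this. It is illustrative only: the dict-based document shape and the `parent_id` metadata key are assumptions standing in for the real parent-node store:

```python
def swap_summaries_for_parents(docs, parent_store):
    """Replace table-summary hits with their full parent tables.

    `parent_store` stands in for the JSON parent-node store: a dict mapping
    parent_id -> full table text. Result order and metadata are preserved.
    """
    out = []
    for doc in docs:
        parent_id = doc.get("metadata", {}).get("parent_id")
        if parent_id and parent_id in parent_store:
            out.append({**doc, "content": parent_store[parent_id]})
        else:
            out.append(doc)
    return out
```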

Configuration:

rerank_model = "Qwen/Qwen3-Reranker-8B"  # Reranker model
initial_k = 25                           # Documents fetched before reranking
top_k = 5                                # Final documents returned
vector_weight = 0.5                      # Weight for vector search
bm25_weight = 0.5                        # Weight for BM25 search
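One common way to combine two ranked lists with weights (and the scheme LangChain's EnsembleRetriever is based on) is weighted Reciprocal Rank Fusion. A minimal sketch, assuming each retriever returns an ordered list of document IDs:

```python
def weighted_rrf(vector_hits, bm25_hits, vector_weight=0.5, bm25_weight=0.5, k=60):
    """Fuse two ranked ID lists with weighted Reciprocal Rank Fusion.

    Each retriever contributes weight / (k + rank) per document; documents
    retrieved by both lists accumulate score from both, so agreement between
    semantic and lexical search pushes a document toward the top.
    """
    scores = {}
    for weight, hits in ((vector_weight, vector_hits), (bm25_weight, bm25_hits)):
        for rank, doc_id in enumerate(hits, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```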

Generator (core/rag/generator.py)

  • Builds rich context strings with metadata (source, document type, year, cohort, program, faculty)
  • Constructs prompts with a Vietnamese system prompt that enforces context-grounded answers
  • RAGContextBuilder combines retrieval and context preparation into a single step
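The context-building step can be sketched as below. The metadata keys (`source`, `doc_type`, `year`) and the Vietnamese field labels are illustrative assumptions, not the exact format used by generator.py:

```python
def build_context(docs):
    """Format retrieved chunks into a numbered context string with metadata.

    Each chunk gets a header citing its source and document metadata, so
    the LLM can ground and attribute its answer.
    """
    blocks = []
    for i, doc in enumerate(docs, start=1):
        meta = doc.get("metadata", {})
        header = f"[{i}] Nguồn: {meta.get('source', 'N/A')}"
        if meta.get("doc_type"):
            header += f" | Loại: {meta['doc_type']}"
        if meta.get("year"):
            header += f" | Năm: {meta['year']}"
        blocks.append(f"{header}\n{doc['content']}")
    return "\n\n---\n\n".join(blocks)
```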

📊 Evaluation

The project includes a RAGAS-based evaluation pipeline.

Running Evaluation

# Evaluate with default settings (10 samples, hybrid_rerank)
python scripts/run_eval.py

# Custom sample size and mode
python scripts/run_eval.py --samples 50 --mode hybrid_rerank

# Run all retrieval modes for comparison
python scripts/run_eval.py --samples 20 --mode all

Metrics

  • Faithfulness — How well the answer is grounded in the retrieved context
  • Answer Relevancy — How relevant the answer is to the question
  • Context Precision — How precise the retrieved contexts are
  • Context Recall — How well the retrieved contexts cover the ground truth
  • ROUGE-1 / ROUGE-2 / ROUGE-L — N-gram overlap with ground truth answers
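For intuition, ROUGE-1 F1 is just the harmonic mean of unigram precision and recall between the generated and reference answers. A minimal sketch (illustrative only; the pipeline uses RAGAS's own scorers):

```python
from collections import Counter

def rouge1_f1(candidate: str, reference: str) -> float:
    """Unigram-overlap ROUGE-1 F1 between a generated and a reference answer."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```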

Results

Benchmark on HUST student regulation Q&A dataset (200 samples):

Metric              vector_only   bm25_only   hybrid   hybrid_rerank
Answer Relevancy    0.749         0.635       0.832    0.872
Context Precision   0.678         0.538       0.795    0.861
Context Recall      0.815         0.732       0.849    0.872
Faithfulness        0.912         0.938       0.942    0.937
ROUGE-1             0.557         0.533       0.576    0.598
ROUGE-2             0.408         0.385       0.421    0.439
ROUGE-L             0.526         0.508       0.545    0.567

Key takeaways:

  • hybrid_rerank achieves the best scores in 6 out of 7 metrics, confirming it as the optimal default retrieval mode.
  • Faithfulness is consistently high (>0.91 across all modes), meaning the LLM reliably grounds its answers in the provided context with minimal hallucination.
  • Reranking significantly boosts Context Precision (+60% over BM25-only, +8% over hybrid), demonstrating the value of Qwen3-Reranker in filtering irrelevant documents.
  • Hybrid search substantially outperforms single-mode retrieval, validating the ensemble approach of combining semantic (vector) and lexical (BM25) search.

Results are saved to evaluation/results/ as both JSON and CSV files with timestamps.


🧪 Testing

# Run all tests
pytest test/ -v

# Run specific test module
pytest test/test_chunk.py -v
pytest test/test_retrieval.py -v

# Run with coverage
pytest test/ --cov=core --cov-report=term-missing

🛠️ Technology Stack

  • Embedding — Qwen3-Embedding-4B (SiliconFlow API)
  • Reranking — Qwen3-Reranker-8B (SiliconFlow API)
  • LLM — Qwen3-32B (Groq API)
  • Vector Database — ChromaDB
  • Keyword Search — BM25 (rank-bm25)
  • Framework — LangChain + LlamaIndex (chunking)
  • UI — Gradio
  • Evaluation — RAGAS
  • Language — Python 3.10+

📦 Data

The processed data is hosted on HuggingFace: hungnha/do_an_tot_nghiep

Manual download:

huggingface-cli download hungnha/do_an_tot_nghiep --repo-type dataset --local-dir ./data

The data directory contains:

  • data_process/ — Processed Markdown regulation documents
  • chroma/ — ChromaDB persistence files (vector index + parent nodes)
  • data.csv — Evaluation dataset (questions + ground-truth answers)

📄 License

This project is developed as an undergraduate thesis at Hanoi University of Science and Technology (HUST).
