# HUST RAG – Student Regulations Q&A System
A Retrieval-Augmented Generation (RAG) system that helps students query academic regulations and policies at Hanoi University of Science and Technology (HUST). The system processes Markdown-based regulation documents, stores them in a vector database, and uses a hybrid retrieval pipeline with reranking to provide accurate, context-grounded answers through a conversational chat interface.
## ✨ Key Features
- **Hybrid Search** – Combines vector similarity search (ChromaDB) with BM25 keyword matching for both semantic and lexical retrieval
- **Reranking** – Uses Qwen3-Reranker-8B via the SiliconFlow API to re-score and sort retrieved documents by relevance
- **Small-to-Big Retrieval** – Summarizes large tables with an LLM, embeds the summary for search, and returns the full original table at query time
- **4 Retrieval Modes** – `vector_only`, `bm25_only`, `hybrid`, and `hybrid_rerank`, configurable per query
- **Incremental Data Build** – Hash-based change detection ensures only modified files are re-processed when rebuilding the database
- **Streaming Chat UI** – Gradio-based conversational interface with real-time response streaming
- **RAGAS Evaluation** – Built-in evaluation pipeline using the RAGAS framework with metrics such as faithfulness, relevancy, precision, recall, and ROUGE
## 🏗️ System Architecture
```
┌──────────────────────────────────────────────────────────────────────┐
│                        User Query (Gradio UI)                        │
└──────────────────────────────────┬───────────────────────────────────┘
                                   │
                                   ▼
┌──────────────────────────────────────────────────────────────────────┐
│                          Retrieval Pipeline                          │
│                                                                      │
│  ┌───────────────┐   ┌──────────────┐    ┌────────────────────────┐  │
│  │ Vector Search │ + │ BM25 Search  │ ─► │  Ensemble (weighted)   │  │
│  │  (ChromaDB)   │   │ (rank-bm25)  │    │ vector:0.5 + bm25:0.5  │  │
│  └───────────────┘   └──────────────┘    └───────────┬────────────┘  │
│                                                      │               │
│                                                      ▼               │
│                                          ┌────────────────────┐      │
│                                          │   Qwen3-Reranker   │      │
│                                          │ (SiliconFlow API)  │      │
│                                          └─────────┬──────────┘      │
│                                                    │                 │
│                    Small-to-Big:                   │                 │
│                    summary hit →                   │                 │
│                    swap w/ parent                  │                 │
└────────────────────────────────────────────────────┬─────────────────┘
                                                     │
                                                     ▼
┌──────────────────────────────────────────────────────────────────────┐
│                        Context Builder + LLM                         │
│                                                                      │
│   Context (top-k docs + metadata) → Prompt → LLM (Groq API)          │
│                                       → Streaming Response           │
└──────────────────────────────────────────────────────────────────────┘
```
## 📁 Project Structure

```
DoAn/
├── core/                       # Core application modules
│   ├── rag/                    # RAG engine
│   │   ├── chunk.py            # Markdown chunking with table extraction & Small-to-Big
│   │   ├── embedding_model.py  # Qwen3-Embedding wrapper (SiliconFlow API)
│   │   ├── vector_store.py     # ChromaDB wrapper with parent node storage
│   │   ├── retrieval.py        # Hybrid retriever + SiliconFlow reranker
│   │   └── generator.py        # Context builder & prompt construction
│   ├── gradio/                 # Chat interfaces
│   │   ├── user_gradio.py      # Main Gradio app (production + debug modes)
│   │   └── gradio_rag.py       # Debug mode launcher (thin wrapper)
│   └── hash_file/              # File hashing utilities
│       └── hash_file.py        # SHA-256 hash processor for change detection
│
├── scripts/                    # Workflow scripts
│   ├── run_app.py              # Application entry point (data check + env check + launch)
│   ├── build_data.py           # Build/update ChromaDB from markdown files
│   ├── download_data.py        # Download data from HuggingFace
│   └── run_eval.py             # Run RAGAS evaluation
│
├── evaluation/                 # Evaluation pipeline
│   ├── eval_utils.py           # Shared utilities (RAG init, answer generation)
│   └── ragas_eval.py           # RAGAS evaluation with multiple metrics
│
├── test/                       # Unit tests
│   ├── conftest.py             # Shared fixtures and sample data
│   ├── test_chunk.py           # Chunking logic tests
│   ├── test_embedding.py       # Embedding model tests
│   ├── test_vector_store.py    # Vector store tests
│   ├── test_retrieval.py       # Retrieval pipeline tests
│   ├── test_generator.py       # Generator/context builder tests
│   └── ...
│
├── data/                       # Data directory (downloaded from HuggingFace)
│   ├── data_process/           # Processed markdown files
│   └── chroma/                 # ChromaDB persistence directory
│
├── requirements.txt            # Python dependencies
├── setup.sh                    # Linux/Mac setup script
├── setup.bat                   # Windows setup script
└── .env                        # API keys (not tracked in git)
```
## 🚀 Getting Started

### Prerequisites

- Python 3.10+
- API keys:
  - **SiliconFlow** – for embedding (Qwen3-Embedding-4B) and reranking (Qwen3-Reranker-8B)
  - **Groq** – for LLM generation (Qwen3-32B)
### Quick Setup (Recommended)

Run the automated setup script, which creates a virtual environment, installs dependencies, downloads the data, and creates the `.env` file:

```bash
# Linux / macOS
bash setup.sh

# Windows
setup.bat
```

Then edit `.env` with your API keys:

```
SILICONFLOW_API_KEY=your_siliconflow_key
GROQ_API_KEY=your_groq_key
```
### Manual Setup

```bash
# 1. Create and activate virtual environment
python3 -m venv venv
source venv/bin/activate      # Linux/Mac
# venv\Scripts\activate       # Windows

# 2. Install dependencies
pip install -r requirements.txt

# 3. Download data from HuggingFace
python scripts/download_data.py

# 4. Create .env file with your API keys
echo "SILICONFLOW_API_KEY=your_key" > .env
echo "GROQ_API_KEY=your_key" >> .env
```
### Running the Application

```bash
source venv/bin/activate   # Linux/Mac
python scripts/run_app.py
```

Access the chat interface at http://127.0.0.1:7860.
## 📖 Usage Guide

### Chat Interface
The Gradio chat interface supports natural language questions about HUST student regulations. Example questions:
| Question | Topic |
|---|---|
| Sinh viên vi phạm quy chế thi thì bị xử lý như thế nào? | Exam violation penalties |
| Điều kiện để đổi ngành là gì? | Major transfer requirements |
| Làm thế nào để đăng ký hoãn thi? | Exam postponement registration |
### Debug Mode

To launch the debug interface, which shows retrieved documents and relevance scores:

```bash
python core/gradio/gradio_rag.py
```
### Building/Updating the Database

When you add, modify, or delete markdown files in `data/data_process/`, rebuild the database:

```bash
# Incremental update (only changed files)
python scripts/build_data.py

# Force full rebuild
python scripts/build_data.py --force

# Skip orphan deletion
python scripts/build_data.py --no-delete
```
The build script will:
- Detect changed files via SHA-256 hash comparison
- Delete chunks from removed files
- Re-chunk and re-embed only modified files
- Automatically invalidate the BM25 cache
## 🔧 Core Components

### Chunking (`core/rag/chunk.py`)
Processes Markdown documents into searchable chunks:
| Feature | Description |
|---|---|
| YAML Frontmatter Extraction | Parses metadata (document type, year, cohort, program) into chunk metadata |
| Heading-based Splitting | Uses MarkdownNodeParser to split by headings, preserving document structure |
| Table Extraction & Splitting | Extracts Markdown tables, splits large tables into chunks of 15 rows |
| Small-to-Big Pattern | Summarizes tables with LLM → embeds summary → links to parent (full table) |
| Small Chunk Merging | Merges chunks smaller than 200 characters with adjacent chunks |
| Metadata Enrichment | Extracts course names and codes from content using regex patterns |
Configuration:

```python
CHUNK_SIZE = 1500            # Maximum chunk size in characters
CHUNK_OVERLAP = 150          # Overlap between consecutive chunks
MIN_CHUNK_SIZE = 200         # Minimum chunk size (smaller chunks get merged)
TABLE_ROWS_PER_CHUNK = 15    # Maximum rows per table chunk
```
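The table-splitting behavior can be illustrated with a small sketch. This is a hypothetical stand-in for the module's actual logic; the function name `split_markdown_table` is an assumption, but it shows the key design choice: each chunk repeats the header and separator rows so every piece remains a valid, self-describing Markdown table.

```python
def split_markdown_table(table_md: str, rows_per_chunk: int = 15) -> list[str]:
    """Split a Markdown table into chunks of at most `rows_per_chunk` data rows,
    repeating the header and separator in every chunk."""
    lines = [ln for ln in table_md.strip().splitlines() if ln.strip()]
    header, separator, rows = lines[0], lines[1], lines[2:]
    return [
        "\n".join([header, separator] + rows[i:i + rows_per_chunk])
        for i in range(0, len(rows), rows_per_chunk)
    ]
```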
### Embedding (`core/rag/embedding_model.py`)
- Model: Qwen3-Embedding-4B via SiliconFlow API
- Dimensions: 2048
- Batch processing with configurable batch size (default: 16)
- Rate limit handling with exponential backoff retry
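The retry behavior can be sketched as a small wrapper. This is an illustrative pattern, not the module's actual implementation; the name `with_backoff` and the retry/delay parameters are assumptions. The delay doubles on each failed attempt, with a little jitter to avoid synchronized retries.

```python
import random
import time

def with_backoff(call, max_retries: int = 5, base_delay: float = 1.0):
    """Retry a callable with exponential backoff plus jitter, e.g. when the
    embedding endpoint returns a rate-limit error. Re-raises after the
    final attempt fails."""
    for attempt in range(max_retries):
        try:
            return call()
        except Exception:
            if attempt == max_retries - 1:
                raise
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.1))
```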
### Vector Store (`core/rag/vector_store.py`)
- Backend: ChromaDB with LangChain integration
- Parent node storage: Separate JSON file for Small-to-Big parent nodes (not embedded)
- Content-based document IDs: SHA-256 hash of (source_file, header_path, chunk_index, content)
- Metadata flattening: Converts complex metadata types to ChromaDB-compatible formats
- Batch operations: `add_documents()` and `upsert_documents()` with configurable batch size
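The content-based ID scheme can be sketched as below. This is a minimal illustration of the idea described above, assuming a hypothetical `make_doc_id` helper: because the ID is a pure function of the chunk's identity and content, re-running the build upserts unchanged chunks in place rather than duplicating them, while any edit produces a new ID.

```python
import hashlib

def make_doc_id(source_file: str, header_path: str,
                chunk_index: int, content: str) -> str:
    """Derive a deterministic document ID from SHA-256 of
    (source_file, header_path, chunk_index, content)."""
    key = f"{source_file}|{header_path}|{chunk_index}|{content}"
    return hashlib.sha256(key.encode("utf-8")).hexdigest()
```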
### Retrieval (`core/rag/retrieval.py`)

| Mode | Description |
|---|---|
| `vector_only` | Pure vector similarity search via ChromaDB |
| `bm25_only` | Pure keyword matching via BM25 (with lazy-load and disk caching) |
| `hybrid` | Ensemble of vector + BM25 with configurable weights (default: 0.5/0.5) |
| `hybrid_rerank` | Hybrid search followed by Qwen3-Reranker-8B reranking (default) |
Small-to-Big at retrieval time: When a table summary node is retrieved, it is automatically swapped with the full parent table before returning results to the user.
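The swap can be sketched as a post-retrieval pass. This is an illustrative sketch only: the function name `expand_small_to_big`, the plain-dict document shape, and the `parent_id` metadata key are assumptions about how the summary node links to its stored parent.

```python
def expand_small_to_big(retrieved: list[dict], parent_store: dict) -> list[dict]:
    """Replace any table-summary chunk with its full parent table so the LLM
    sees the complete data, not just the embedded summary."""
    expanded = []
    for doc in retrieved:
        parent_id = doc.get("metadata", {}).get("parent_id")
        if parent_id and parent_id in parent_store:
            expanded.append(parent_store[parent_id])  # full original table
        else:
            expanded.append(doc)                      # ordinary chunk, keep as-is
    return expanded
```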
Configuration:

```python
rerank_model = "Qwen/Qwen3-Reranker-8B"   # Reranker model
initial_k = 25                            # Documents fetched before reranking
top_k = 5                                 # Final documents returned
vector_weight = 0.5                       # Weight for vector search
bm25_weight = 0.5                         # Weight for BM25 search
```
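The weighted ensemble idea can be sketched with a small fusion function. This is not the project's actual code, and the exact fusion used by the underlying ensemble retriever may differ; the sketch uses weighted Reciprocal Rank Fusion (score = weight / (c + rank), summed over lists), a common way to combine ranked lists without comparing raw scores across retrievers.

```python
def ensemble_rank(vector_hits: list[str], bm25_hits: list[str],
                  vector_weight: float = 0.5, bm25_weight: float = 0.5,
                  c: int = 60) -> list[str]:
    """Fuse two ranked ID lists with weighted Reciprocal Rank Fusion.
    Documents returned by both retrievers accumulate score from each list,
    so agreement between semantic and lexical search is rewarded."""
    scores: dict[str, float] = {}
    for weight, hits in ((vector_weight, vector_hits), (bm25_weight, bm25_hits)):
        for rank, doc_id in enumerate(hits, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (c + rank)
    return sorted(scores, key=scores.get, reverse=True)
```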
### Generator (`core/rag/generator.py`)
- Builds rich context strings with metadata (source, document type, year, cohort, program, faculty)
- Constructs prompts with a Vietnamese system prompt that enforces context-grounded answers
- `RAGContextBuilder` combines retrieval and context preparation into a single step
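The context-building step can be sketched as follows. This is an assumed shape, not the module's actual code: the function name `build_context`, the dict-based document format, and the metadata keys are illustrative, but it shows the pattern of prefixing each chunk with its source metadata so the answer can be traced back to a document.

```python
def build_context(docs: list[dict]) -> str:
    """Join retrieved chunks into one context string, labelling each chunk
    with its source metadata before it is placed into the prompt."""
    blocks = []
    for i, doc in enumerate(docs, start=1):
        meta = doc.get("metadata", {})
        header = f"[{i}] Source: {meta.get('source', 'unknown')}"
        if meta.get("doc_type"):
            header += f" | Type: {meta['doc_type']}"
        if meta.get("year"):
            header += f" | Year: {meta['year']}"
        blocks.append(f"{header}\n{doc['content']}")
    return "\n\n---\n\n".join(blocks)
```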
## 📊 Evaluation

The project includes a RAGAS-based evaluation pipeline.

### Running Evaluation
```bash
# Evaluate with default settings (10 samples, hybrid_rerank)
python scripts/run_eval.py

# Custom sample size and mode
python scripts/run_eval.py --samples 50 --mode hybrid_rerank

# Run all retrieval modes for comparison
python scripts/run_eval.py --samples 20 --mode all
```
### Metrics
| Metric | Description |
|---|---|
| Faithfulness | How well the answer is grounded in the retrieved context |
| Answer Relevancy | How relevant the answer is to the question |
| Context Precision | How precise the retrieved contexts are |
| Context Recall | How well the retrieved contexts cover the ground truth |
| ROUGE-1 / ROUGE-2 / ROUGE-L | N-gram overlap with ground truth answers |
### Results
Benchmark on HUST student regulation Q&A dataset (200 samples):
| Metric | vector_only | bm25_only | hybrid | hybrid_rerank |
|---|---|---|---|---|
| Answer Relevancy | 0.749 | 0.635 | 0.832 | 0.872 |
| Context Precision | 0.678 | 0.538 | 0.795 | 0.861 |
| Context Recall | 0.815 | 0.732 | 0.849 | 0.872 |
| Faithfulness | 0.912 | 0.938 | 0.942 | 0.937 |
| ROUGE-1 | 0.557 | 0.533 | 0.576 | 0.598 |
| ROUGE-2 | 0.408 | 0.385 | 0.421 | 0.439 |
| ROUGE-L | 0.526 | 0.508 | 0.545 | 0.567 |
Key takeaways:
- `hybrid_rerank` achieves the best scores in 6 of 7 metrics, confirming it as the optimal default retrieval mode.
- Faithfulness is consistently high (>0.91 across all modes), meaning the LLM reliably grounds its answers in the provided context with minimal hallucination.
- Reranking significantly boosts Context Precision (+60% over BM25-only, +8% over hybrid), demonstrating the value of Qwen3-Reranker in filtering irrelevant documents.
- Hybrid search substantially outperforms single-mode retrieval, validating the ensemble approach of combining semantic (vector) and lexical (BM25) search.
Results are saved to `evaluation/results/` as both JSON and CSV files with timestamps.
## 🧪 Testing

```bash
# Run all tests
pytest test/ -v

# Run specific test modules
pytest test/test_chunk.py -v
pytest test/test_retrieval.py -v

# Run with coverage
pytest test/ --cov=core --cov-report=term-missing
```
## 🛠️ Technology Stack
| Category | Technology |
|---|---|
| Embedding | Qwen3-Embedding-4B (SiliconFlow API) |
| Reranking | Qwen3-Reranker-8B (SiliconFlow API) |
| LLM | Qwen3-32B (Groq API) |
| Vector Database | ChromaDB |
| Keyword Search | BM25 (rank-bm25) |
| Framework | LangChain + LlamaIndex (chunking) |
| UI | Gradio |
| Evaluation | RAGAS |
| Language | Python 3.10+ |
## 📦 Data

The processed data is hosted on HuggingFace: `hungnha/do_an_tot_nghiep`

Manual download:

```bash
huggingface-cli download hungnha/do_an_tot_nghiep --repo-type dataset --local-dir ./data
```

The `data/` directory contains:
- `data_process/` – Processed Markdown regulation documents
- `chroma/` – ChromaDB persistence files (vector index + parent nodes)
- `data.csv` – Evaluation dataset (questions + ground truth answers)
## 📄 License
This project is developed as an undergraduate thesis at Hanoi University of Science and Technology (HUST).