
HUST RAG — Student Regulations Q&A System

A Retrieval-Augmented Generation (RAG) system that helps students query academic regulations and policies at Hanoi University of Science and Technology (HUST). The system processes Markdown-based regulation documents, stores them in a vector database, and uses a hybrid retrieval pipeline with reranking to provide accurate, context-grounded answers through a conversational chat interface.


✨ Key Features

  • Hybrid Search — Combines vector similarity search (ChromaDB) with BM25 keyword matching for both semantic and lexical retrieval
  • Reranking — Uses Qwen3-Reranker-8B via the SiliconFlow API to re-score and sort retrieved documents by relevance
  • Small-to-Big Retrieval — Summarizes large tables with an LLM, embeds the summary for search, and returns the full original table at query time
  • 4 Retrieval Modes — vector_only, bm25_only, hybrid, hybrid_rerank — configurable per query
  • Incremental Data Build — Hash-based change detection ensures only modified files are re-processed when rebuilding the database
  • Streaming Chat UI — Gradio-based conversational interface with real-time response streaming
  • RAGAS Evaluation — Built-in evaluation pipeline using the RAGAS framework with metrics such as faithfulness, relevancy, precision, recall, and ROUGE scores

πŸ—οΈ System Architecture

┌──────────────────────────────────────────────────────────────────────┐
│                        User Query (Gradio UI)                        │
└───────────────────────────────────┬──────────────────────────────────┘
                                    │
                                    ▼
┌──────────────────────────────────────────────────────────────────────┐
│                          Retrieval Pipeline                          │
│                                                                      │
│  ┌───────────────┐   ┌──────────────┐   ┌──────────────────────┐     │
│  │ Vector Search │ + │ BM25 Search  │ → │ Ensemble (weighted)  │     │
│  │  (ChromaDB)   │   │ (rank-bm25)  │   │ vector:0.5 + bm25:0.5│     │
│  └───────────────┘   └──────────────┘   └──────────┬───────────┘     │
│                                                    │                 │
│                                                    ▼                 │
│                                          ┌───────────────────┐       │
│                                          │ Qwen3-Reranker    │       │
│                                          │ (SiliconFlow API) │       │
│                                          └─────────┬─────────┘       │
│                                                    │                 │
│                                    Small-to-Big:   │                 │
│                                    summary hit →   │                 │
│                                    swap w/ parent  │                 │
└────────────────────────────────────────────────────┬─────────────────┘
                                                     │
                                                     ▼
┌──────────────────────────────────────────────────────────────────────┐
│                        Context Builder + LLM                         │
│                                                                      │
│  Context (top-k docs + metadata) → Prompt → LLM (Groq API)           │
│                                           → Streaming Response       │
└──────────────────────────────────────────────────────────────────────┘

📁 Project Structure

DoAn/
├── core/                          # Core application modules
│   ├── rag/                       # RAG engine
│   │   ├── chunk.py               # Markdown chunking with table extraction & Small-to-Big
│   │   ├── embedding_model.py     # Qwen3-Embedding wrapper (SiliconFlow API)
│   │   ├── vector_store.py        # ChromaDB wrapper with parent node storage
│   │   ├── retrieval.py           # Hybrid retriever + SiliconFlow reranker
│   │   └── generator.py           # Context builder & prompt construction
│   ├── gradio/                    # Chat interfaces
│   │   ├── user_gradio.py         # Main Gradio app (production + debug modes)
│   │   └── gradio_rag.py          # Debug mode launcher (thin wrapper)
│   └── hash_file/                 # File hashing utilities
│       └── hash_file.py           # SHA-256 hash processor for change detection
│
├── scripts/                       # Workflow scripts
│   ├── run_app.py                 # Application entry point (data check + env check + launch)
│   ├── build_data.py              # Build/update ChromaDB from markdown files
│   ├── download_data.py           # Download data from HuggingFace
│   └── run_eval.py                # Run RAGAS evaluation
│
├── evaluation/                    # Evaluation pipeline
│   ├── eval_utils.py              # Shared utilities (RAG init, answer generation)
│   └── ragas_eval.py              # RAGAS evaluation with multiple metrics
│
├── test/                          # Unit tests
│   ├── conftest.py                # Shared fixtures and sample data
│   ├── test_chunk.py              # Chunking logic tests
│   ├── test_embedding.py          # Embedding model tests
│   ├── test_vector_store.py       # Vector store tests
│   ├── test_retrieval.py          # Retrieval pipeline tests
│   ├── test_generator.py          # Generator/context builder tests
│   └── ...
│
├── data/                          # Data directory (downloaded from HuggingFace)
│   ├── data_process/              # Processed markdown files
│   └── chroma/                    # ChromaDB persistence directory
│
├── requirements.txt               # Python dependencies
├── setup.sh                       # Linux/Mac setup script
├── setup.bat                      # Windows setup script
└── .env                           # API keys (not tracked in git)

🚀 Getting Started

Prerequisites

  • Python 3.10+
  • API Keys:
    • SiliconFlow — for embedding (Qwen3-Embedding-4B) and reranking (Qwen3-Reranker-8B)
    • Groq — for LLM generation (Qwen3-32B)

Quick Setup (Recommended)

Run the automated setup script, which creates a virtual environment, installs dependencies, downloads data, and creates the .env file:

# Linux / macOS
bash setup.sh

# Windows
setup.bat

Then edit .env with your API keys:

SILICONFLOW_API_KEY=your_siliconflow_key
GROQ_API_KEY=your_groq_key

Manual Setup

# 1. Create and activate virtual environment
python3 -m venv venv
source venv/bin/activate        # Linux/Mac
# venv\Scripts\activate         # Windows

# 2. Install dependencies
pip install -r requirements.txt

# 3. Download data from HuggingFace
python scripts/download_data.py

# 4. Create .env file with your API keys
echo "SILICONFLOW_API_KEY=your_key" > .env
echo "GROQ_API_KEY=your_key" >> .env

Running the Application

source venv/bin/activate        # Linux/Mac
python scripts/run_app.py

Access the chat interface at: http://127.0.0.1:7860


📖 Usage Guide

Chat Interface

The Gradio chat interface supports natural language questions about HUST student regulations. Example questions:

  • Sinh viên vi phạm quy chế thi thì bị xử lý như thế nào? — Exam violation penalties
  • Điều kiện để đổi ngành là gì? — Major transfer requirements
  • Làm thế nào để đăng ký hoãn thi? — Exam postponement registration

Debug Mode

To launch the debug interface that shows retrieved documents and relevance scores:

python core/gradio/gradio_rag.py

Building/Updating the Database

When you add, modify, or delete markdown files in data/data_process/, rebuild the database:

# Incremental update (only changed files)
python scripts/build_data.py

# Force full rebuild
python scripts/build_data.py --force

# Skip orphan deletion
python scripts/build_data.py --no-delete

The build script will:

  1. Detect changed files via SHA-256 hash comparison
  2. Delete chunks from removed files
  3. Re-chunk and re-embed only modified files
  4. Automatically invalidate the BM25 cache
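The hash-based change detection in step 1 can be sketched as follows. This is a minimal illustration, not the actual hash_file.py implementation; the manifest filename and return shape are assumptions:

```python
import hashlib
import json
from pathlib import Path

def file_sha256(path: Path) -> str:
    """Hash a file's bytes, so only real content edits count as changes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()

def detect_changes(md_dir: Path, manifest_path: Path):
    """Compare current file hashes against a stored hash manifest.

    Returns (changed, removed): files that must be re-chunked/re-embedded,
    and files whose chunks should be deleted from the vector store.
    """
    old = json.loads(manifest_path.read_text()) if manifest_path.exists() else {}
    new = {str(p): file_sha256(p) for p in sorted(md_dir.glob("*.md"))}
    changed = [p for p, h in new.items() if old.get(p) != h]
    removed = [p for p in old if p not in new]
    manifest_path.write_text(json.dumps(new, indent=2))
    return changed, removed
```

On the first run every file is "changed"; afterwards only files whose bytes differ from the manifest are re-processed.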

🔧 Core Components

Chunking (core/rag/chunk.py)

Processes Markdown documents into searchable chunks:

  • YAML Frontmatter Extraction — Parses metadata (document type, year, cohort, program) into chunk metadata
  • Heading-based Splitting — Uses MarkdownNodeParser to split by headings, preserving document structure
  • Table Extraction & Splitting — Extracts Markdown tables and splits large tables into chunks of 15 rows
  • Small-to-Big Pattern — Summarizes tables with an LLM → embeds the summary → links to the parent (full table)
  • Small Chunk Merging — Merges chunks smaller than 200 characters with adjacent chunks
  • Metadata Enrichment — Extracts course names and codes from content using regex patterns

Configuration:

CHUNK_SIZE = 1500          # Maximum chunk size in characters
CHUNK_OVERLAP = 150        # Overlap between consecutive chunks
MIN_CHUNK_SIZE = 200       # Minimum chunk size (smaller chunks get merged)
TABLE_ROWS_PER_CHUNK = 15  # Maximum rows per table chunk
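The table-splitting step can be sketched like this. It is a simplified illustration (the real chunk.py also attaches summaries and metadata); the header and separator rows are repeated so every chunk stays a valid Markdown table on its own:

```python
TABLE_ROWS_PER_CHUNK = 15

def split_markdown_table(table_md: str, rows_per_chunk: int = TABLE_ROWS_PER_CHUNK):
    """Split a Markdown table into chunks of at most `rows_per_chunk` data rows.

    The header and separator lines are prepended to every chunk, so each
    chunk remains a self-describing, renderable Markdown table.
    """
    lines = [ln for ln in table_md.strip().splitlines() if ln.strip()]
    header, separator, data = lines[0], lines[1], lines[2:]
    return [
        "\n".join([header, separator, *data[i:i + rows_per_chunk]])
        for i in range(0, len(data), rows_per_chunk)
    ]
```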

Embedding (core/rag/embedding_model.py)

  • Model: Qwen3-Embedding-4B via SiliconFlow API
  • Dimensions: 2048
  • Batch processing with configurable batch size (default: 16)
  • Rate limit handling with exponential backoff retry
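The batching and backoff behavior can be sketched as below. This is an illustrative wrapper, not the actual embedding_model.py; `client.embed(batch)` is a hypothetical stand-in for the SiliconFlow embedding call:

```python
import time

def embed_with_retry(client, texts, batch_size=16, max_retries=5, base_delay=1.0):
    """Embed texts in batches, retrying each batch with exponential backoff.

    `client.embed(batch)` is assumed to return one vector per input text;
    on a rate-limit error the batch is retried after 1s, 2s, 4s, ...
    """
    vectors = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        for attempt in range(max_retries):
            try:
                vectors.extend(client.embed(batch))
                break
            except Exception:
                if attempt == max_retries - 1:
                    raise
                time.sleep(base_delay * 2 ** attempt)  # exponential backoff
    return vectors
```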

Vector Store (core/rag/vector_store.py)

  • Backend: ChromaDB with LangChain integration
  • Parent node storage: Separate JSON file for Small-to-Big parent nodes (not embedded)
  • Content-based document IDs: SHA-256 hash of (source_file, header_path, chunk_index, content)
  • Metadata flattening: Converts complex metadata types to ChromaDB-compatible formats
  • Batch operations: add_documents() and upsert_documents() with configurable batch size
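The content-based ID scheme can be sketched as follows (a minimal sketch; the exact field separator and encoding in vector_store.py may differ). Hashing the identity tuple together with the content makes upserts idempotent while guaranteeing that any content edit yields a new ID:

```python
import hashlib

def make_doc_id(source_file: str, header_path: str, chunk_index: int, content: str) -> str:
    """Derive a stable ChromaDB document ID from chunk identity + content.

    Identical chunks always map to the same ID (safe re-upserts), while
    any edit to the content produces a different ID.
    """
    key = "\x1f".join([source_file, header_path, str(chunk_index), content])
    return hashlib.sha256(key.encode("utf-8")).hexdigest()
```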

Retrieval (core/rag/retrieval.py)

  • vector_only — Pure vector similarity search via ChromaDB
  • bm25_only — Pure keyword matching via BM25 (with lazy-load and disk caching)
  • hybrid — Ensemble of vector + BM25 with configurable weights (default: 0.5/0.5)
  • hybrid_rerank — Hybrid search followed by Qwen3-Reranker-8B reranking (default)

Small-to-Big at retrieval time: When a table summary node is retrieved, it is automatically swapped with the full parent table before returning results to the user.
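The swap step can be sketched like this. It is illustrative only: the dict-based document shape and the `parent_id` metadata key are assumptions standing in for the real parent-node store:

```python
def swap_summaries_for_parents(docs, parent_store):
    """Replace table-summary hits with their full parent tables.

    `parent_store` stands in for the JSON parent-node store: a dict mapping
    parent_id -> full table text. Result order and metadata are preserved.
    """
    out = []
    for doc in docs:
        parent_id = doc.get("metadata", {}).get("parent_id")
        if parent_id and parent_id in parent_store:
            out.append({**doc, "content": parent_store[parent_id]})
        else:
            out.append(doc)
    return out
```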

Configuration:

rerank_model = "Qwen/Qwen3-Reranker-8B"  # Reranker model
initial_k = 25                           # Documents fetched before reranking
top_k = 5                                # Final documents returned
vector_weight = 0.5                      # Weight for vector search
bm25_weight = 0.5                        # Weight for BM25 search
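One common way to combine two ranked lists with weights (and the scheme LangChain's EnsembleRetriever is based on) is weighted Reciprocal Rank Fusion. A minimal sketch, assuming each retriever returns an ordered list of document IDs:

```python
def weighted_rrf(vector_hits, bm25_hits, vector_weight=0.5, bm25_weight=0.5, k=60):
    """Fuse two ranked ID lists with weighted Reciprocal Rank Fusion.

    Each retriever contributes weight / (k + rank) per document; documents
    retrieved by both lists accumulate score from both, so agreement between
    semantic and lexical search pushes a document toward the top.
    """
    scores = {}
    for weight, hits in ((vector_weight, vector_hits), (bm25_weight, bm25_hits)):
        for rank, doc_id in enumerate(hits, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```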

Generator (core/rag/generator.py)

  • Builds rich context strings with metadata (source, document type, year, cohort, program, faculty)
  • Constructs prompts with a Vietnamese system prompt that enforces context-grounded answers
  • RAGContextBuilder combines retrieval and context preparation into a single step
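The context-building step can be sketched as below. The metadata keys (`source`, `doc_type`, `year`) and the Vietnamese field labels are illustrative assumptions, not the exact format used by generator.py:

```python
def build_context(docs):
    """Format retrieved chunks into a numbered context string with metadata.

    Each chunk gets a header citing its source and document metadata, so
    the LLM can ground and attribute its answer.
    """
    blocks = []
    for i, doc in enumerate(docs, start=1):
        meta = doc.get("metadata", {})
        header = f"[{i}] Nguồn: {meta.get('source', 'N/A')}"
        if meta.get("doc_type"):
            header += f" | Loại: {meta['doc_type']}"
        if meta.get("year"):
            header += f" | Năm: {meta['year']}"
        blocks.append(f"{header}\n{doc['content']}")
    return "\n\n---\n\n".join(blocks)
```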

📊 Evaluation

The project includes a RAGAS-based evaluation pipeline.

Running Evaluation

# Evaluate with default settings (10 samples, hybrid_rerank)
python scripts/run_eval.py

# Custom sample size and mode
python scripts/run_eval.py --samples 50 --mode hybrid_rerank

# Run all retrieval modes for comparison
python scripts/run_eval.py --samples 20 --mode all

Metrics

  • Faithfulness — How well the answer is grounded in the retrieved context
  • Answer Relevancy — How relevant the answer is to the question
  • Context Precision — How precise the retrieved contexts are
  • Context Recall — How well the retrieved contexts cover the ground truth
  • ROUGE-1 / ROUGE-2 / ROUGE-L — N-gram overlap with ground truth answers
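For intuition, ROUGE-1 F1 is just the harmonic mean of unigram precision and recall between the generated and reference answers. A minimal sketch (illustrative only; the pipeline uses RAGAS's own scorers):

```python
from collections import Counter

def rouge1_f1(candidate: str, reference: str) -> float:
    """Unigram-overlap ROUGE-1 F1 between a generated and a reference answer."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```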

Results

Benchmark on HUST student regulation Q&A dataset (200 samples):

Metric              vector_only   bm25_only   hybrid   hybrid_rerank
Answer Relevancy    0.749         0.635       0.832    0.872
Context Precision   0.678         0.538       0.795    0.861
Context Recall      0.815         0.732       0.849    0.872
Faithfulness        0.912         0.938       0.942    0.937
ROUGE-1             0.557         0.533       0.576    0.598
ROUGE-2             0.408         0.385       0.421    0.439
ROUGE-L             0.526         0.508       0.545    0.567

Key takeaways:

  • hybrid_rerank achieves the best scores in 6 out of 7 metrics, confirming it as the optimal default retrieval mode.
  • Faithfulness is consistently high (>0.91 across all modes), meaning the LLM reliably grounds its answers in the provided context with minimal hallucination.
  • Reranking significantly boosts Context Precision (+60% over BM25-only, +8% over hybrid), demonstrating the value of Qwen3-Reranker in filtering irrelevant documents.
  • Hybrid search substantially outperforms single-mode retrieval, validating the ensemble approach of combining semantic (vector) and lexical (BM25) search.

Results are saved to evaluation/results/ as both JSON and CSV files with timestamps.


🧪 Testing

# Run all tests
pytest test/ -v

# Run specific test module
pytest test/test_chunk.py -v
pytest test/test_retrieval.py -v

# Run with coverage
pytest test/ --cov=core --cov-report=term-missing

🛠️ Technology Stack

  • Embedding — Qwen3-Embedding-4B (SiliconFlow API)
  • Reranking — Qwen3-Reranker-8B (SiliconFlow API)
  • LLM — Qwen3-32B (Groq API)
  • Vector Database — ChromaDB
  • Keyword Search — BM25 (rank-bm25)
  • Framework — LangChain + LlamaIndex (chunking)
  • UI — Gradio
  • Evaluation — RAGAS
  • Language — Python 3.10+

📦 Data

The processed data is hosted on HuggingFace: hungnha/do_an_tot_nghiep

Manual download:

huggingface-cli download hungnha/do_an_tot_nghiep --repo-type dataset --local-dir ./data

The data directory contains:

  • data_process/ — Processed Markdown regulation documents
  • chroma/ — ChromaDB persistence files (vector index + parent nodes)
  • data.csv — Evaluation dataset (questions + ground-truth answers)

📄 License

This project is developed as an undergraduate thesis at Hanoi University of Science and Technology (HUST).
