+---
+license: mit
+language:
+- en
+base_model:
+- inclusionAI/Ring-flash-linear-2.0
+pipeline_tag: text-generation
+---
+# Quantized Ring-Linear-2.0
+## Introduction
+To enable deployment of [Ring-Linear-2.0](https://github.com/inclusionAI/Ring-V2/blob/main/hybrid_linear/README.md) on memory-constrained devices, we release quantized weights in the GPTQ INT4 format. We also evaluate online FP8 quantization of the `Ring-Linear-2.0` models; its accuracy closely approaches that of BF16.
+## Model Downloads
+|       **Model**        | **Maximum Supported Length** |                                                                             **Download**                                                                             |
+|:----------------------:| :----------------: |:--------------------------------------------------------------------------------------------------------------------------------------------------------------------:|
+| Ring-flash-linear-2.0-GPTQ-int4  |        128k         |  [🤗 HuggingFace](https://huggingface.co/inclusionAI/Ring-flash-linear-2.0-GPTQ-int4) <br>[🤖 ModelScope](https://www.modelscope.cn/models/inclusionAI/Ring-flash-linear-2.0-GPTQ-int4)  |
+| Ring-mini-linear-2.0-GPTQ-int4   |        512k         |  [🤗 HuggingFace](https://huggingface.co/inclusionAI/Ring-mini-linear-2.0-GPTQ-int4) <br>[🤖 ModelScope](https://www.modelscope.cn/models/inclusionAI/Ring-mini-linear-2.0-GPTQ-int4)  |
+## Quickstart
+### 🚀 vLLM
+#### Environment Preparation
+Support for these models has not yet been submitted to the vLLM community as a Pull Request (PR), so prepare the environment with the steps below. First install the pinned PyTorch packages:
+```shell
+pip install torch==2.7.0 torchvision==0.22.0
+```
+Then install our prebuilt vLLM wheel (judging by its filename, built as vLLM 0.8.5 for CUDA 12.8, CPython 3.10, and Linux x86_64):
+```shell
+pip install https://media.githubusercontent.com/media/inclusionAI/Ring-V2/refs/heads/main/hybrid_linear/whls/vllm-0.8.5%2Bcuda12_8_gcc10_2_1-cp310-cp310-linux_x86_64.whl --no-deps --force-reinstall
+```
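The wheel's filename tags (`cp310-cp310-linux_x86_64`) suggest it targets CPython 3.10 on 64-bit Linux. A minimal pre-install check could look like the sketch below; `wheel_compatible` is a hypothetical helper name, not part of vLLM or pip, and the check does not cover the CUDA or GCC requirements.

```python
import platform
import sys


def wheel_compatible(py_version, system, machine):
    """Return True if the interpreter matches the cp310 / linux_x86_64 wheel tags."""
    return (
        tuple(py_version[:2]) == (3, 10)
        and system == "Linux"
        and machine == "x86_64"
    )


# Check the interpreter that would run `pip install`.
ok = wheel_compatible(sys.version_info, platform.system(), platform.machine())
print("wheel tags match this interpreter" if ok else "wheel tags do not match this interpreter")
```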
+#### Offline Inference
+```python
+from transformers import AutoTokenizer
+from vllm import LLM, SamplingParams
+
+model_id = "inclusionAI/Ring-mini-linear-2.0-GPTQ-int4"
+tokenizer = AutoTokenizer.from_pretrained(model_id)
+sampling_params = SamplingParams(temperature=0.6, top_p=1.0, max_tokens=16384)
+# Prefix caching is disabled here, matching the serving command in this README.
+llm = LLM(model=model_id, dtype="auto", enable_prefix_caching=False, max_num_seqs=128)
+
+# Wrap the prompt in the model's chat template before generation.
+prompt = "Give me a short introduction to large language models."
+messages = [
+    {"role": "user", "content": prompt}
+]
+text = tokenizer.apply_chat_template(
+    messages,
+    tokenize=False,
+    add_generation_prompt=True
+)
+outputs = llm.generate([text], sampling_params)
+# Each RequestOutput holds one or more completions; print the first one's text.
+print(outputs[0].outputs[0].text)
+```
+#### Online Inference
+```shell
+vllm serve inclusionAI/Ring-mini-linear-2.0-GPTQ-int4 \
+              --tensor-parallel-size 2 \
+              --pipeline-parallel-size 1 \
+              --gpu-memory-utilization 0.90 \
+              --max-num-seqs 512 \
+              --no-enable-prefix-caching
+```
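Once the server is running, it can be queried through vLLM's OpenAI-compatible HTTP API. The sketch below uses only the Python standard library. It assumes the server listens on vLLM's default address (`http://localhost:8000`) and reuses the sampling settings from the offline example; `build_chat_request` and `chat` are illustrative helper names, not part of vLLM.

```python
import json
import urllib.request

# Assumption: default host/port of `vllm serve`; adjust if --host/--port were set.
BASE_URL = "http://localhost:8000"
MODEL = "inclusionAI/Ring-mini-linear-2.0-GPTQ-int4"


def build_chat_request(prompt, temperature=0.6, top_p=1.0, max_tokens=16384):
    """Build the JSON payload for the OpenAI-compatible chat endpoint."""
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
        "top_p": top_p,
        "max_tokens": max_tokens,
    }


def chat(prompt, base_url=BASE_URL, timeout=600):
    """Send one chat request and return the generated text."""
    payload = json.dumps(build_chat_request(prompt)).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    # The first choice carries the assistant message.
    return body["choices"][0]["message"]["content"]
```

With the server up, `chat("Give me a short introduction to large language models.")` returns the model's reply as a string.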
+## Evaluation
+We evaluate the INT4 and FP8 quantized models on several benchmarks. FP8 quantization is applied online via the `quantization="fp8"` argument in SGLang or vLLM.
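For vLLM, online FP8 quantization could be requested roughly as follows. This is a sketch, not the exact evaluation setup: it assumes the BF16 checkpoint `inclusionAI/Ring-flash-linear-2.0` is the starting point and copies the other engine settings from the Quickstart. `fp8_engine_kwargs` and `load_fp8_engine` are hypothetical helper names.

```python
def fp8_engine_kwargs(model_id):
    """Keyword arguments for vllm.LLM that request online FP8 quantization."""
    return {
        "model": model_id,
        "quantization": "fp8",           # quantize the weights to FP8 at load time
        "dtype": "auto",
        "enable_prefix_caching": False,  # mirrors the Quickstart settings
    }


def load_fp8_engine(model_id="inclusionAI/Ring-flash-linear-2.0"):
    """Create a vLLM engine with FP8 quantization (requires vLLM and a GPU)."""
    # Imported lazily so the kwargs helper can be used without vLLM installed.
    from vllm import LLM

    return LLM(**fp8_engine_kwargs(model_id))
```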
+### Ring-mini-linear-2.0
+|  **Dataset**  | **BF16** | **FP8** | **GPTQ-Int4** |
+|:-------------:|:--------:|:-------:|:-------------:|
+|    AIME25     |  73.65   |  72.40  |     66.56     |
+|    AIME24     |  79.95   |  79.53  |     74.95     |
+| LiveCodeBench |  59.53   |  58.42  |     56.29     |
+|     GPQA      |  65.69   |  66.79  |     62.53     |
+### Ring-flash-linear-2.0
+|  **Dataset**  | **BF16** | **FP8** | **GPTQ-Int4** |
+|:-------------:|:--------:|:-------:|:-------------:|
+|    AIME25     |  85.10   |  84.22  |     82.88     |
+| LiveCodeBench |  69.82   |  69.44  |     66.14     |
+|     GPQA      |  72.85   |  72.95  |     71.72     |
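To summarize the tables, the average score change relative to BF16 can be computed directly from the reported numbers. The sketch below copies the scores from the two tables above and takes an unweighted mean across benchmarks, which is an illustrative choice rather than a metric used in the evaluation itself.

```python
# Scores copied from the two tables above: (BF16, FP8, GPTQ-Int4).
SCORES = {
    "Ring-mini-linear-2.0": {
        "AIME25": (73.65, 72.40, 66.56),
        "AIME24": (79.95, 79.53, 74.95),
        "LiveCodeBench": (59.53, 58.42, 56.29),
        "GPQA": (65.69, 66.79, 62.53),
    },
    "Ring-flash-linear-2.0": {
        "AIME25": (85.10, 84.22, 82.88),
        "LiveCodeBench": (69.82, 69.44, 66.14),
        "GPQA": (72.85, 72.95, 71.72),
    },
}


def mean_delta(model, column):
    """Average score change vs. BF16; column 1 is FP8, column 2 is GPTQ-Int4."""
    rows = SCORES[model].values()
    return sum(row[column] - row[0] for row in rows) / len(rows)


for model in SCORES:
    fp8 = mean_delta(model, 1)
    int4 = mean_delta(model, 2)
    print(f"{model}: FP8 {fp8:+.2f}, GPTQ-Int4 {int4:+.2f}")
```

On these benchmarks the averages come to about -0.42 (FP8) and -4.62 (GPTQ-Int4) points for Ring-mini-linear-2.0, and about -0.39 and -2.34 for Ring-flash-linear-2.0. The two models are averaged over different benchmark sets (no AIME24 result is listed for Ring-flash-linear-2.0), so the averages are not directly comparable across models.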
+## License
+This code repository is licensed under [the MIT License](https://github.com/inclusionAI/Ring-V2/blob/master/LICENSE).
+## Citation
+If you find our work helpful, please consider citing it.