how to run the model on vllm

#2 · opened by reneho

Thanks for this quant, much appreciated. The discussion linked below gives some guidance on how to run the MiniMax-M2.1-NVFP4 quant, which is another version of this model, and it works well there. I tried to run your model with the same settings and noticed the warning and error below:

https://huggingface.co/lukealonso/MiniMax-M2.1-NVFP4/discussions/1

vllm.model_executor.layers.quantization.compressed_tensors.compressed_tensors_moe:WARNING] w1_weight_global_scale must match w3_weight_global_scale. Accuracy may be affected.

WARNING - autotuner.py:485 - flashinfer.jit: [Autotuner]: Skipping tactic <flashinfer.fused_moe.core.get_cutlass_fused_moe_module..MoERunner object at 0x74f87299aea0> 14, due to failure while profiling: [TensorRT-LLM][ERROR] Assertion failed: Failed to initialize cutlass TMA WS grouped gemm. Error: Error Internal (/workspace/csrc/nv_internal/tensorrt_llm/cutlass_instantiations/120/gemm_grouped/120/cutlass_kernel_file_gemm_grouped_sm120_M128_BS_group2.generated.cu:60)
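For context, the launch I'm attempting looks roughly like this (a sketch only: the model repo id, context length, and memory settings below are placeholders, and the extra environment variables from the linked discussion are omitted, so this is not a known-good config):

```bash
# Sketch of the launch being attempted on 2x RTX Pro 6000.
# <model-repo> is a placeholder for this quant's Hugging Face repo id;
# context length and memory fraction are illustrative values, not tuned.
vllm serve <model-repo> \
  --tensor-parallel-size 2 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.90
```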

Were you running it on a B200 or a pair of RTX Pro 6000 Blackwells? I spent hours trying to get a 2x Pro 6000 system running with all the same environment variables/settings from that discussion and couldn't get it to run. These quants were made with different methods: ModelOpt for that other one, and llm-compressor/compressed-tensors for this one. I was going to put together a solid reproduction script and pass it over to the vLLM devs to see if they could tell what was going on.

Aah, ok. I am running on 2x Pro 6000 with a custom-built vLLM 0.14.0.rc1 image. Happy to push that to Docker Hub if that would help.
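In the meantime, a run with that image looks roughly like this (illustrative only: the image tag and model path are placeholders for my local build and the downloaded quant, not published artifacts):

```bash
# Illustrative sketch: "custom-vllm:0.14.0.rc1" and /models/<this-quant>
# are placeholders, not a published image or path.
docker run --rm --gpus all --ipc=host -p 8000:8000 \
  -v /models:/models \
  custom-vllm:0.14.0.rc1 \
  vllm serve /models/<this-quant> \
    --tensor-parallel-size 2
```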
