How do you run this?

#1 opened by mtcl

What framework do you use for it?

QuantTrio org

vLLM, as indicated in the README.md.
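For reference, a minimal sketch of loading it through vLLM's Python API; the flag values here (tensor parallel size, context length) are assumptions, and the serve command in the repo README is the authoritative way to run it:

```python
# Minimal sketch of running QuantTrio/GLM-4.7-AWQ with vLLM's Python API.
# tensor_parallel_size and max_model_len below are illustrative assumptions;
# follow the repo README for the recommended launch command.
from vllm import LLM, SamplingParams

llm = LLM(
    model="QuantTrio/GLM-4.7-AWQ",  # this repo's model id
    tensor_parallel_size=2,         # set to the number of GPUs you have
    max_model_len=32768,            # lower this if the KV cache does not fit
)

outputs = llm.generate(
    ["Briefly explain what AWQ quantization does."],
    SamplingParams(max_tokens=128),
)
print(outputs[0].outputs[0].text)
```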

I have 2 NVIDIA 6000 Pros. Would it be possible to create a slightly smaller quant so I can run this with vLLM?

QuantTrio org

Unless 2-bit/3-bit fused MoE kernels get implemented,
QuantTrio/GLM-4.7-AWQ is the smallest quant that can be run with vLLM / SGLang.

To go below this threshold, one can look at:
(1) low-bit GGUF quantization, served with llama.cpp or similar (see the sketch after this list)
(2) MoE pruning from Cerebras
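As a sketch of option (1), a low-bit GGUF could be loaded locally with llama-cpp-python; the model_path below is a hypothetical placeholder, since an actual low-bit GGUF export of this model would have to exist first:

```python
# Sketch of option (1): serving a hypothetical low-bit GGUF via llama-cpp-python.
# The model_path is a placeholder -- substitute a real GGUF file.
from llama_cpp import Llama

llm = Llama(
    model_path="glm-4.7-q3_k_m.gguf",  # hypothetical low-bit GGUF file
    n_gpu_layers=-1,                   # offload all layers to GPU
    n_ctx=8192,                        # context window; adjust to fit VRAM
)

result = llm("Briefly introduce yourself.", max_tokens=64)
print(result["choices"][0]["text"])
```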
