How do you run this?
#1
opened by mtcl
What framework do you use for it?
vLLM, as indicated in the README.md.
I have 2 NVIDIA 6000 Pros. Would it be possible to create a slightly smaller quant to run this on vLLM?
Unless 2-bit/3-bit fused MoE kernels are implemented,
QuantTrio/GLM-4.7-AWQ is the smallest quant that can be run with vLLM / SGLang.
To go below this threshold, one can look at
(1) low-bit GGUF quantization, served with llama.cpp or similar
(2) MoE pruning from Cerebras
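For reference, serving the AWQ quant across two GPUs with vLLM's OpenAI-compatible server might look like the sketch below. The flags shown are standard vLLM options, but the context length is an illustrative value; check `vllm serve --help` for what your installed version supports.

```shell
# Sketch: serve the AWQ quant with tensor parallelism over 2 GPUs.
# --max-model-len is an assumed value; lower it if you hit OOM.
vllm serve QuantTrio/GLM-4.7-AWQ \
    --tensor-parallel-size 2 \
    --max-model-len 32768
```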