How do you run this?
#1
opened by mtcl
What framework do you use for it?
vLLM, as indicated in the README.md.
I have 2 NVIDIA 6000 Pros. Would it be possible to create a slightly smaller quant to run this on vLLM?
Unless 2-bit/3-bit fused MoE kernels are implemented,
QuantTrio/GLM-4.7-AWQ is the smallest quant that can be run with vLLM / SGLang.
To go below this threshold, one can look at
(1) low-bit GGUF quantization, served with llama.cpp or similar
(2) MoE pruning from Cerebras
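For reference, serving the AWQ quant across two GPUs with vLLM's OpenAI-compatible server might look like the sketch below. The flags shown are standard vLLM options, but the context length is an illustrative value; check `vllm serve --help` for what your installed version supports.

```shell
# Sketch: serve the AWQ quant with tensor parallelism over 2 GPUs.
# --max-model-len is an assumed value; lower it if you hit OOM.
vllm serve QuantTrio/GLM-4.7-AWQ \
    --tensor-parallel-size 2 \
    --max-model-len 32768
```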