ExLlamaV2
Optimized CUDA kernels for fast inference on quantized large language models. Delivers significantly higher throughput than standard implementations by using GPU kernels specialized for Llama-family and compatible architectures, and reduces memory requirements through efficient quantization formats such as EXL2 and GPTQ. Well suited to production systems where speed is critical and to batch or high-throughput inference workloads. Open source and actively maintained.