ExLlamaV2
Optimized CUDA kernels for fast inference on quantized large language models. Delivers significantly higher throughput than standard implementations by using GPU kernels specialized for Llama-family and compatible architectures, and reduces memory requirements through efficient quantization formats such as EXL2 and GPTQ. Well suited to production systems where speed is critical and to batch or high-throughput inference workloads. Open source and actively maintained.