Now that you have explored tools for a self-hosted AI chat interface, this tutorial picks up where that exploration left off.

How to Choose the Right Local AI Model

Why Model Selection Matters

One of the first things people run into when setting up local AI is the model list. There are dozens of options with names like Llama-3.1-8B-Q4_K_M, Mistral-7B-Instruct-v0.3, Phi-3-mini, and Qwen2.5-14B. Understanding what these names mean helps you pick the right model the first time instead of downloading several that do not work well on your hardware.

This tutorial explains the main model families, what quantization means, how to match a model to your hardware, and which models are worth starting with for different tasks.


The Main Model Families

Llama (Meta)

Llama models are developed by Meta and released for public use. The Llama 3 series is one of the most widely used local model families. Llama 3.2 comes in small 1B and 3B sizes, and Llama 3.1 in a popular 8B size, making the family accessible on most hardware. Larger Llama models (70B) are capable but require a high-end machine.

Best for: general-purpose chat, coding help, summarization, writing assistance.

Mistral

Mistral models come from Mistral AI, a French company known for producing highly efficient models. A 7B Mistral model often performs above its weight class compared to similarly sized models from other families. The Mistral Nemo (12B) and Mistral Small models are strong options for more demanding tasks.

Best for: efficient general use, coding, instruction following.

Gemma (Google)

Gemma is Google's family of open models. Gemma 2 (2B and 9B) is well-optimized and runs fast on lower-end hardware. It tends to be conversational and follows instructions clearly.

Best for: low-resource setups, fast responses, friendly conversational tone.

Phi (Microsoft)

Phi models are small but surprisingly capable. Microsoft designed them to pack strong reasoning performance into very small sizes (2B to 4B parameters). Phi-3 Mini can run on almost any modern computer and handles reasoning and coding tasks better than you would expect for its size.

Best for: devices with limited RAM, quick tasks, reasoning and math.

Qwen (Alibaba)

Qwen models are strong all-rounders, particularly good at coding and multilingual tasks. Qwen 2.5 in the 7B to 14B range performs well and runs efficiently. If you need to work in Chinese or other languages alongside English, Qwen is worth considering.

Best for: multilingual use, coding, general-purpose tasks.

DeepSeek Coder

DeepSeek Coder is specifically designed for code generation and understanding. If your main use case is writing, explaining, or debugging code, this family consistently outperforms general-purpose models at similar sizes.

Best for: coding, code review, technical documentation.


Understanding Quantization

Quantization reduces a model's size by lowering the numeric precision of its parameters. A full-precision model (F16 or F32) stores each parameter at high precision, making the file large but accurate. Quantized models store each parameter at lower precision to save memory and speed up inference, at a small cost in quality.

Common quantization levels you will see:

Q4_K_M: 4-bit quantization, medium variant. This is the most popular choice for most users. It reduces the model size by roughly 60 to 70 percent compared to F16 with only a small loss in quality. A 7B model becomes around 4 to 5 GB.

Q8_0: 8-bit quantization. Higher quality than Q4, closer to the original model performance. Requires roughly twice the RAM of Q4. Good if you have 32 GB or more.

F16: Full 16-bit precision. Best quality, but requires the most RAM. Generally only practical if you have a lot of VRAM or unified memory.

Q2_K or Q3_K: Very aggressive compression. Smaller files, but quality drops noticeably. Use these only if you need to fit a model onto very limited hardware.

For most people starting out: use Q4_K_M. It gives you the best balance of quality and resource use.
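The size figures above follow from simple arithmetic: bytes ≈ parameters × bits per weight ÷ 8. Real model files run somewhat larger because some tensors are kept at higher precision and the file carries metadata, which is why a 7B Q4 model lands closer to 4 to 5 GB than the raw 3.5 GB the formula gives. A quick sketch:

```shell
# Rough on-disk size of a quantized model:
#   bytes ≈ parameters × bits_per_weight / 8
# (actual files are a little larger due to mixed-precision
# tensors and metadata)
params=7000000000   # a 7B-parameter model
bits=4              # Q4 quantization
echo "$params $bits" | awk '{ printf "~%.1f GB\n", $1 * $2 / 8 / 1e9 }'
# prints: ~3.5 GB
```

The same arithmetic explains why Q8 needs roughly twice the memory of Q4: doubling the bits per weight doubles the size.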


Matching a Model to Your Hardware

Use this as a reference when choosing what to download:

Available RAM | Recommended Model Size | Good Starting Options
8 GB          | 3B to 7B (Q4)          | Llama 3.2 3B, Phi-3 Mini, Gemma 2 2B
16 GB         | 7B to 13B (Q4 or Q8)   | Llama 3.1 8B, Mistral 7B, Gemma 2 9B
32 GB         | 13B to 30B             | Mistral Nemo 12B, Qwen 2.5 14B
64 GB or more | 70B                    | Llama 3.1 70B, Qwen 2.5 72B

For Apple Silicon Macs: use the RAM figure from System Information as your guide. An M2 MacBook Air with 16 GB handles 7B Q4 and Q8 models smoothly, and can run 13B models at reasonable speed.

For NVIDIA GPU users: the model needs to fit in VRAM for GPU acceleration. If the model is too large, Ollama and LM Studio will split it between VRAM and RAM, which works but is slower.
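The table above can be condensed into a small helper that maps available RAM to a model size band. This is a sketch, not an official tool; the example names are taken from the table and the thresholds are the same ones listed there:

```shell
# Map available RAM (in GB) to a recommended model size band.
# Thresholds and example models mirror the table above.
suggest_model() {
  if   [ "$1" -ge 64 ]; then echo "70B (e.g. Llama 3.1 70B)"
  elif [ "$1" -ge 32 ]; then echo "13B to 30B (e.g. Qwen 2.5 14B)"
  elif [ "$1" -ge 16 ]; then echo "7B to 13B (e.g. Llama 3.1 8B)"
  else                       echo "3B to 7B at Q4 (e.g. Phi-3 Mini)"
  fi
}

suggest_model 16   # prints: 7B to 13B (e.g. Llama 3.1 8B)
```

On Linux you can feed it your actual memory with something like `suggest_model "$(free -g | awk '/^Mem:/ {print $2}')"`.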


Instruction Models vs Base Models

You will see model names ending in words like Instruct, Chat, or IT (instruction-tuned). These are versions of the model that have been fine-tuned to follow instructions and have conversations. Always use an instruct or chat variant for practical use. Base models are the raw foundation model and are primarily useful for researchers and developers doing further fine-tuning.

In Ollama, almost all available models are instruct-tuned by default. In LM Studio, filter by "Chat" when browsing.


Recommended First Models by Use Case

For general chat and writing: Llama 3.1 8B Instruct (Q4_K_M)

For coding: DeepSeek Coder 6.7B Instruct, or Qwen 2.5 Coder 7B

For very low-RAM hardware: Phi-3 Mini 3.8B (Q4) or Gemma 2 2B

For best quality on a 16 GB machine: Mistral Nemo 12B or Gemma 2 9B

For multilingual tasks: Qwen 2.5 7B or 14B


How to Download Models in Ollama

Once you have Ollama installed, downloading a model is a single command:

ollama pull llama3.1

Ollama defaults to the Q4_K_M quantization for most models. To pull a specific quantization:

ollama pull llama3.1:8b-instruct-q8_0

To see what you have downloaded:

ollama list
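The pull and list commands combine naturally into a small script that downloads a model only when it is not already present. This is a sketch that assumes the ollama CLI is on your PATH, and skips gracefully when it is not:

```shell
# Pull a model only if "ollama list" does not already show it.
# Assumes the ollama CLI is installed; returns early when it is not.
ensure_model() {
  model="$1"
  if ! command -v ollama >/dev/null 2>&1; then
    echo "ollama not found on PATH" >&2
    return 1
  fi
  if ollama list | grep -q "$model"; then
    echo "$model is already downloaded"
  else
    ollama pull "$model"
  fi
}

ensure_model llama3.1 || true
```

A pattern like this is handy in setup scripts, since `ollama pull` re-checks a model that is already current but still takes time to verify it.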

In LM Studio, use the search bar in the Discover tab to browse and download models from Hugging Face. Filter by your hardware profile using the green/yellow/red indicators that show compatibility.


A Note on Model Updates

The local AI ecosystem moves quickly. New models are released regularly and older recommendations become outdated. Before downloading, it is worth checking the Ollama model library at ollama.com/library or the LM Studio model search to see what is currently popular and well-reviewed. Community ratings and download counts are a practical guide to which models are performing well right now.

In the next step, you will explore the best AI tools for deploying a self-hosted AI stack. Browse the options, pick one that fits your workflow, and try it before continuing.
