LLM 101 · part 3

[LLM 101] How to Choose an AI Model: Gemma vs Llama vs Qwen vs Mistral (2026)

2026-04-10 · updated 2026-04-14 · 11 min read · #llm #model-selection #ollama #beginner

TL;DR

Choosing a model is like buying a car: check your parking space (memory) first, then decide sedan or truck (model size), then pick a brand (Gemma, Llama, Qwen). One formula: parameters (B) × 0.6 ≈ required GB of memory. 8 GB RAM → 7B, 16 GB → 14B, 32 GB → 30B.

Plain-Language Version: How to pick the right AI model from hundreds of options

If you open Ollama's model library, you'll see hundreds of AI models with names that get longer by the day: Gemma 4 E4B, Llama 3.3 70B, Qwen3-Coder 235B-A22B. Each one claims to be great, but can your computer even run it? And if it can, which one is right for you?

It's exactly like buying a car. You wouldn't go test-drive a Ferrari before checking if it fits in your garage. Same with AI models: first figure out what your computer can handle, then pick the best one within that range.

Last article, we covered the four different "body types" of AI models. This one teaches you how to narrow down hundreds of models to the right one in three steps.


Preface

You walk into a dealership with 500 cars. Each one has a spec sheet: horsepower, torque, fuel economy, 0-60 time. You're not a race car driver. You just need something for your commute.

How do you choose?

Most people don't start with horsepower. You start with: how big is my parking space? What's my budget? Am I mostly on highways or city streets?

Choosing an AI model works the same way. Don't let the spec sheets intimidate you. Three steps is all you need.


Step 1: How big is your parking space? (Memory)

This is the most important step, because the best model in the world is useless if it can't run on your machine.

AI models need to be loaded into memory to run. Bigger model = more memory needed. "Memory" here means:

  • If you have a dedicated GPU (NVIDIA RTX series) → look at the GPU's own memory (called VRAM — separate from your computer's RAM, typically 8-24 GB)
  • If you're on a Mac → look at your RAM (Mac memory is shared between CPU and GPU, so most of it can be used for models, minus what macOS itself needs)
  • If you're on a regular laptop → look at RAM, minus what the OS uses (usually 60-70% available)

Quick estimation formula

The B in model names stands for Billion parameters. To estimate memory usage:

Parameters (B) × 0.6 ≈ Required GB of memory

This assumes you download the compressed version (we'll explain "quantization" later — for now just remember this number). Some examples:

| Model size | Memory needed (approx.) | What computer |
| --- | --- | --- |
| 1-3B | 1-2 GB | Phone, low-end laptop |
| 7B | 4-5 GB | 8 GB laptop |
| 14B | 8-9 GB | 16 GB laptop or Mac |
| 30B | 18-20 GB | 32 GB Mac or 24 GB GPU |
| 70B | 40-45 GB | 64 GB+ or pro GPU |
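
As a rough sketch, the rule of thumb above fits in a few lines of Python. The function names are made up for illustration, and the 0.6 factor is this article's estimate for compressed (4-bit) weights only, so treat the results as ballpark figures.

```python
GB_PER_BILLION_Q4 = 0.6  # this article's rule of thumb for compressed (Q4) models


def estimate_memory_gb(params_b: float, gb_per_billion: float = GB_PER_BILLION_Q4) -> float:
    """Approximate GB of memory needed for a model with `params_b` billion parameters."""
    return params_b * gb_per_billion


def largest_model_that_fits(memory_gb: float, gb_per_billion: float = GB_PER_BILLION_Q4) -> float:
    """Invert the formula: the largest parameter count (in billions) that fits."""
    return memory_gb / gb_per_billion


# Print the estimate for the sizes listed in the table.
for size_b in (3, 7, 14, 30, 70):
    print(f"{size_b}B -> ~{estimate_memory_gb(size_b):.1f} GB")
```

The inverse function is the one you will use most: plug in your free memory and it tells you the largest size worth looking at.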

How much memory does your computer have? That's your parking space. Once you know this, move to the next step.


Step 2: Sedan or truck? (Size vs. quality)

After confirming your memory limit, you'll find several models that "fit in the garage." Now comes the tradeoff:

Bigger model = smarter, but slower

A 30B model usually answers much better than a 7B — better comprehension, cleaner logic, less hallucination. But it's also much slower.

According to the Chatbot Arena human blind-test rankings, doubling a model's size typically improves quality by 10-20%. But speed may drop by more than half.

How fast is "fast enough"?

When AI answers a question, text appears word by word rather than all at once. This speed is measured in "words per second" (technically tok/s, tokens per second; one token is roughly three-quarters of an English word or about one Chinese character, so treating a token as a word is close enough for this article).

A useful reference: your reading speed is about 4-5 words per second. If AI outputs slower than you can read, you'll feel it "lagging."

According to BentoML's inference performance research and community benchmarks, speed tiers feel like this:

| Words/sec | How it feels | Car analogy |
| --- | --- | --- |
| < 5 | Slower than you can read, want to quit | Gridlock traffic |
| 5-12 | Noticeably waiting, tempted to switch tabs | City traffic, barely tolerable |
| 12-30 | Some waiting but acceptable | Regular roads, fine |
| 30-50 | Comfortable, feels like chatting | Highway, recommended target |
| 50-80 | Fast, streaming text feels natural | Express lane |
| 80+ | Can't tell the difference anymore | Airplane, but you just need to commute |
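
If you measure your own speed, the tiers in the table above can be looked up with a small helper. This is only a sketch: the function name is hypothetical, and treating each lower bound as inclusive (so exactly 30 counts as "comfortable") is an assumption, since the table's ranges share their endpoints.

```python
from bisect import bisect_right

# Lower bounds of each tier in tokens/sec, copied from the table above.
_THRESHOLDS = [5, 12, 30, 50, 80]
_LABELS = [
    "Slower than you can read",
    "Noticeably waiting",
    "Some waiting but acceptable",
    "Comfortable, feels like chatting",
    "Fast, streaming text feels natural",
    "Can't tell the difference anymore",
]


def speed_feel(tokens_per_sec: float) -> str:
    """Map a measured generation speed to the tier it falls into."""
    # bisect_right counts how many thresholds are <= the measured speed,
    # which is exactly the index of the matching label.
    return _LABELS[bisect_right(_THRESHOLDS, tokens_per_sec)]


print(speed_feel(35))
```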

How fast is ChatGPT? For reference: ChatGPT runs at about 50 words/sec, Claude at about 46, and Gemini can hit 220. On your own hardware, aiming for 30+ words/sec gives a similar experience to ChatGPT.

How much does speed actually vary?

I benchmarked different model sizes on my own machines to give you a feel:

| Computer | Model size | Speed (words/sec) | Feel |
| --- | --- | --- | --- |
| High-end GPU (RTX 5090) | 3B (small) | 310 | Instant |
| High-end GPU (RTX 5090) | 8B (medium-small) | 202 | Instant |
| Desktop workstation (128 GB) | 8B (compressed) | 50 | Comfortable chat |
| Desktop workstation (128 GB) | 31B (large) | 7 | Waiting, barely usable |
| MacBook Pro (32 GB) | 31B (tuned) | 13 | Acceptable |
| MacBook Pro (32 GB) | 31B (default settings) | 1.5 | Unusable |

Notice the last two rows: same computer, same model, but different software settings caused a nearly 9x speed difference (13 vs. 1.5 words/sec). Settings matter, but that's an advanced topic (details here if you're curious).

Want to estimate speed yourself? Two formulas

Thanks to reader marqd114 for suggesting this addition.

AI runs models in two phases, each with a different bottleneck:

  • Generating (outputting the answer, word by word) — bottlenecked by "how fast stuff moves," i.e. memory bandwidth
  • Reading your prompt (processing the input) — bottlenecked by "how fast it can think," i.e. compute power

What you're usually waiting for is generation, so start here:

Generation speed (decode):

words/sec ≈ memory bandwidth (GB/s) ÷ model size (GB)

| Hardware | Bandwidth | Running 14B Q4 (~9 GB) | Running 31B Q4 (~19 GB) |
| --- | --- | --- | --- |
| MacBook Pro M1 Max | 400 GB/s | ~44 words/sec | ~21 words/sec |
| RTX 5090 | 1,792 GB/s | ~199 words/sec | ~94 words/sec |
| DGX Spark (GB10) | 273 GB/s | ~30 words/sec | ~14 words/sec |

These are theoretical ceilings — real-world speed is typically 60-80% of this. But it's already useful for comparing "roughly how fast will this model run on my machine."
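
The generation formula works because producing each token requires reading essentially all of the model's weights from memory once, so bandwidth divided by model size caps the token rate. A minimal sketch, with hypothetical function names and the 60-80% real-world factor taken from the paragraph above:

```python
def decode_ceiling(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Theoretical upper bound on generation speed in tokens/sec."""
    return bandwidth_gb_s / model_size_gb


def decode_realistic_range(
    bandwidth_gb_s: float,
    model_size_gb: float,
    low: float = 0.6,   # real-world speed is typically 60-80% of the ceiling
    high: float = 0.8,
) -> tuple[float, float]:
    """Likely real-world range, applying the 60-80% efficiency factor."""
    ceiling = decode_ceiling(bandwidth_gb_s, model_size_gb)
    return ceiling * low, ceiling * high


# MacBook Pro M1 Max (400 GB/s) running a 14B Q4 model (~9 GB).
ceiling = decode_ceiling(400, 9)
low, high = decode_realistic_range(400, 9)
print(f"ceiling ~{ceiling:.0f} tok/s, expect roughly {low:.0f}-{high:.0f} tok/s")
```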

Prompt processing speed (prefill):

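words/sec ≈ 500 × compute (TFLOPS) × utilization ÷ parameters (B)

This means: stronger GPU and smaller model = faster prompt reading. Utilization is typically 40-60%. The 500 comes from each token costing roughly 2 floating-point operations per parameter: 10^12 ÷ (2 × 10^9) = 500.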

| Hardware | Compute (FP16) | Utilization | Reading 14B | Reading 70B |
| --- | --- | --- | --- | --- |
| MacBook Pro M1 Max | 10.4 TFLOPS | ~50% | ~185 words/sec | ~37 words/sec |
| RTX 5090 | 209 TFLOPS | ~50% | ~3,732 words/sec | ~746 words/sec |
| 10× A100 | 3,120 TFLOPS | ~46% | ~51,171 words/sec | ~10,234 words/sec |
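
The prefill formula can be written out with its units spelled out, which shows where the constant comes from. This sketch assumes the common approximation of about 2 floating-point operations per parameter per token; the function name is hypothetical.

```python
def prefill_tokens_per_sec(tflops: float, utilization: float, params_b: float) -> float:
    """Estimate prompt-processing speed from raw compute.

    Assumption: one token costs ~2 FLOPs per model parameter.
    """
    usable_flops_per_sec = tflops * 1e12 * utilization
    flops_per_token = 2 * params_b * 1e9
    return usable_flops_per_sec / flops_per_token


# MacBook Pro M1 Max: 10.4 TFLOPS at ~50% utilization, reading with a 14B model.
print(f"{prefill_tokens_per_sec(10.4, 0.5, 14):.0f} tok/s")
```

Dividing 1e12 by 2e9 gives the 500 in the short form of the formula, so both versions produce the same number.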

Prompt processing is usually much faster than generation, so the "waiting" you feel is mostly the generation phase.

Note: If the model is a MoE architecture (names containing A4B, A3B, etc.), it only activates a fraction of its parameters per token, so actual speed will be faster than the formula suggests. For example, Qwen3.5-35B MoE hits 47 words/sec on DGX Spark because only 3B of active parameters are used per token.
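
For MoE models the name carries two numbers: the total parameter count, which determines how much memory you need because every expert has to be loaded, and the active count after the "A", which is what drives speed. The helper below is a sketch for pulling both out of a name. It only understands the `235B-A22B` and plain `14B` patterns, and the function and class names are made up for illustration.

```python
import re
from typing import NamedTuple, Optional


class ModelSize(NamedTuple):
    total_b: float   # all parameters, in billions (drives memory use)
    active_b: float  # parameters used per token, in billions (drives speed)


# Matches "14B", "30b-a3b", "235B-A22B"; other naming schemes are not handled.
_SIZE_PATTERN = re.compile(r"(\d+(?:\.\d+)?)B(?:-A(\d+(?:\.\d+)?)B)?", re.IGNORECASE)


def parse_size(name: str) -> Optional[ModelSize]:
    """Return (total, active) billions of parameters, or None if no size is found."""
    match = _SIZE_PATTERN.search(name)
    if match is None:
        return None
    total = float(match.group(1))
    # Dense models have no "-A..B" suffix, so every parameter is active.
    active = float(match.group(2)) if match.group(2) else total
    return ModelSize(total, active)


print(parse_size("Qwen3-Coder 235B-A22B"))
print(parse_size("Llama 3.3 70B"))
```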

Key takeaway: when choosing a model, don't just look at size — look at how fast your computer can run it. A medium model at 40 words/sec feels much better than a large model at 5 words/sec.

So how to choose?

  • Daily chat, translation, writing → Pick the biggest model your memory allows. Quality > speed when you're asking one question at a time.
  • Coding, debugging → Go bigger. Code quality differences between small and large models are stark — 7B often writes code that runs but has logic bugs.
  • Real-time chat, quick Q&A → Pick medium size. The sweet spot for speed/quality balance is usually 14-30B.
  • Very limited resources (8 GB or less) → Pick 7B, no hesitation. Today's 7B models are smarter than 70B models from two years ago.

Step 3: Pick a brand (Gemma, Llama, Qwen)

Once you've decided on a size, you'll find several brands at that size. Like deciding between Toyota, Honda, and Tesla after settling on a mid-size sedan.

The major AI model brands in 2026:

| Brand | Made by | Best at | Chinese | Main sizes (on Ollama) |
| --- | --- | --- | --- | --- |
| Gemma 3 | Google | Multilingual, multimodal | Good | 1B, 4B, 12B, 27B |
| Gemma 4 | Google | Newest, edge + workstation tiers | Good | E2B, E4B, 26B-A4B (MoE), 31B |
| Llama 3.x | Meta | English, biggest community | Fair | 1B, 3B (3.2); 8B, 70B, 405B (3.1/3.3) |
| Qwen3 | Alibaba | Chinese, coding, reasoning | Best | 0.6B, 1.7B, 4B, 8B, 14B, 30B-A3B (MoE), 32B, 235B-A22B (MoE) |
| Mistral | Mistral AI | Efficiency, European languages | Fair | 7B (Mixtral variants are separate entries) |
| Phi-4 | Microsoft | Small-model champion | Fair | 14B |

Recommendations:

  • Mostly Chinese → Qwen3 or Gemma 3
  • Mostly English → Llama 3.x or Gemma 3
  • Coding → Qwen3 or Qwen3-Coder
  • Very limited memory (≤ 8 GB) → Gemma 3 4B or Llama 3.2 3B
  • Not sure → Start with Qwen3 14B or Gemma 3 12B — best value for the Chinese-speaking world
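
The recommendations above can be read as a small decision function. This is a sketch only: the function name is hypothetical, and the order in which the rules are checked (memory limit first, then coding, then language) is an assumption, since the list itself does not rank them.

```python
def recommend(language: str, task: str, memory_gb: float) -> list[str]:
    """Return the brands suggested in the list above for a given situation."""
    if memory_gb <= 8:
        # Very limited memory overrides everything else.
        return ["Gemma 3 4B", "Llama 3.2 3B"]
    if task == "coding":
        return ["Qwen3", "Qwen3-Coder"]
    if language == "chinese":
        return ["Qwen3", "Gemma 3"]
    if language == "english":
        return ["Llama 3.x", "Gemma 3"]
    # "Not sure" fallback.
    return ["Qwen3 14B", "Gemma 3 12B"]


print(recommend(language="chinese", task="chat", memory_gb=16))
```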

Quantization: same car, different compression

You may have noticed the same model comes in many "versions": Q4_K_M, Q8_0, FP16. These are quantization levels — like the difference between MP3 128kbps, 320kbps, and FLAC for music files.

| Level | Size | Quality | Plain English |
| --- | --- | --- | --- |
| FP16 | Largest | Best | FLAC: lossless, but huge |
| Q8 | ~Half of FP16 | Nearly lossless | MP3 320kbps: can't tell the difference |
| Q4 | ~1/3 of FP16 (≈3.3x smaller) | Slight loss | MP3 128kbps: good enough for daily use |
| Q2 | Smallest | Noticeable loss | 64kbps: you can hear it |

Pick Q4 for daily use. Real Q4 files are roughly one-third the size of the FP16 original (the theoretical "4x smaller" doesn't quite hold because of embeddings and metadata), with quality loss that most users can't feel. The memory formula above already assumes Q4.

Want better quality? Go Q8. But memory usage grows to roughly 1.7x that of Q4 (about 1 GB per billion parameters instead of 0.6), so make sure your "parking space" can fit it.
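
To see what each level costs in memory, here is a sketch that extends the memory formula to other quantization levels. The per-billion figures are approximations: FP16 stores 2 bytes per parameter, Q8 is taken as half of that, and Q4 uses this article's 0.6 factor. Q2 is left out because the article gives no number for it, and the function names are hypothetical.

```python
# Approximate GB of weights per billion parameters at each level.
GB_PER_BILLION = {"FP16": 2.0, "Q8": 1.0, "Q4": 0.6}


def model_size_gb(params_b: float, level: str) -> float:
    """Approximate size of the model weights at a given quantization level."""
    return params_b * GB_PER_BILLION[level]


def best_level_that_fits(params_b: float, usable_gb: float):
    """Pick the highest-quality level whose weights fit in `usable_gb`, else None."""
    for level in ("FP16", "Q8", "Q4"):  # ordered from best quality to smallest size
        if model_size_gb(params_b, level) <= usable_gb:
            return level
    return None


for level in GB_PER_BILLION:
    print(f"14B at {level}: ~{model_size_gb(14, level):.1f} GB")
```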


Worked example: I have a 16 GB Mac. What should I pick?

Walk through the steps:

  1. Memory: 16 GB, minus OS leaves about 10-12 GB usable
  2. Model size limit: 10 GB ÷ 0.6 ≈ 16.7B at most; leaving headroom for context, look in the 14-16B range
  3. Use case: Daily chat + occasional coding
  4. Language: Mostly Chinese

Recommendation: qwen3:14b (Q4 version, ~9 GB)
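
The first two steps of this walk-through boil down to a small calculation. In the sketch below, the 0.65 usable fraction is an assumed midpoint of the 60-70% figure from Step 1, the size list is Qwen3's non-MoE sizes from the brand table, and the function name is made up for illustration.

```python
def pick_size(total_gb: float, usable_fraction: float, available_sizes_b, gb_per_billion: float = 0.6):
    """Return the largest available model size (in billions) that fits, or None."""
    usable_gb = total_gb * usable_fraction
    fitting = [size for size in available_sizes_b if size * gb_per_billion <= usable_gb]
    return max(fitting) if fitting else None


QWEN3_DENSE_SIZES_B = [0.6, 1.7, 4, 8, 14, 32]

# 16 GB Mac with ~65% of memory usable for the model (assumption).
print(pick_size(16, 0.65, QWEN3_DENSE_SIZES_B))
```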

One command in Ollama:

ollama run qwen3:14b

Don't like it? Try Gemma 3:

ollama run gemma3:12b

Try both, ask the same question, compare answers. Keep whichever you prefer.
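
If you would rather compare the two side by side from a script, Ollama also exposes a local HTTP API. The sketch below assumes Ollama is running on its default port (11434) and that both models are already downloaded. It uses the `/api/generate` endpoint and reads the `eval_count` and `eval_duration` (nanoseconds) fields of the reply to compute generation speed; the helper names are hypothetical.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # default local endpoint (assumption)


def tokens_per_second(eval_count: int, eval_duration_ns: int) -> float:
    """Generation speed from Ollama's token count and duration in nanoseconds."""
    return eval_count / (eval_duration_ns / 1e9)


def ask(model: str, prompt: str) -> tuple[str, float]:
    """Send one prompt to a local model; return (answer, tokens/sec)."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    request = urllib.request.Request(
        OLLAMA_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        result = json.load(response)
    speed = tokens_per_second(result["eval_count"], result["eval_duration"])
    return result["response"], speed


def compare(models: list[str], prompt: str) -> None:
    """Ask every model the same question and print answers with their speed."""
    for model in models:
        answer, speed = ask(model, prompt)
        print(f"--- {model} ({speed:.1f} tok/s) ---\n{answer}\n")
```

Calling `compare(["qwen3:14b", "gemma3:12b"], "Explain quantization in two sentences.")` prints both answers along with the speed each model reached on your machine.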


One sentence

Check your parking space (memory), pick the car size (parameters), then choose the brand. If unsure, start with Qwen3 14B or Gemma 3 12B.

Next up: what is quantization? What's the actual difference between Q4, Q8, and FP16?

This is Part 3 of the "LLM 101" series. Previous: Four AI model architectures.

FAQ

How do I know how big an AI model is?
Look for the B number in the name: 7B means 7 billion parameters. More parameters = smarter but more memory. 7B needs about 4-5 GB, 70B needs about 40-45 GB. Quick formula: parameters (B) × 0.6 ≈ GB needed for the model weights at 4-bit quantization (leave extra headroom for context).
What if my computer doesn't have enough memory?
Three options: 1) Pick a smaller model (7B or 14B), 2) Use a quantized version (Q4 is roughly one-third the size of the original — about 3.3x smaller in real files), 3) If you have a dedicated GPU, use its VRAM instead. Rough ceilings: 8 GB RAM → 7B, 16 GB → 14B, 32 GB → 30B. These are conservative — exact fit depends on free memory and context length.
Which is better — Gemma, Llama, or Qwen?
Each has strengths. Google's Gemma is balanced and multilingual, Meta's Llama has the largest English ecosystem, Alibaba's Qwen is best for Chinese and strong at coding. At the same size, try Qwen (best Chinese) or Gemma (most balanced) first.
What do Q4, Q8, FP16 mean in model names?
These are quantization levels — compression ratios for model files. FP16 is original quality (largest, slowest), Q8 is light compression (nearly lossless), Q4 is heavy compression (~3.3x smaller in real files, slight quality drop). Think MP3 128kbps vs 320kbps vs FLAC. Q4 is fine for daily use.