~/blog/llm-101-how-to-choose-a-model

LLM 101 · part 3

[LLM 101] So Many Models — Which One Should You Download?


TL;DR

Choosing a model is like buying a car: check your parking space (memory) first, then decide sedan or truck (model size), then pick a brand (Gemma, Llama, Qwen). One formula: parameters (B) × 0.6 ≈ required GB of memory. 8 GB RAM → 7B, 16 GB → 14B, 32 GB → 30B.

Plain-Language Version: How to pick the right AI model from hundreds of options

If you open Ollama's model library, you'll see hundreds of AI models with names that get longer by the day: Gemma 4 E4B, Llama 3.3 70B, Qwen3-Coder 235B-A22B. Each one claims to be great, but can your computer even run it? And if it can, which one is right for you?

It's exactly like buying a car. You wouldn't go test-drive a Ferrari before checking if it fits in your garage. Same with AI models: first figure out what your computer can handle, then pick the best one within that range.

Last article, we covered the four different "body types" of AI models. This one teaches you how to narrow down hundreds of models to the right one in three steps.


Preface

You walk into a dealership with 500 cars. Each one has a spec sheet: horsepower, torque, fuel economy, 0-60 time. You're not a race car driver. You just need something for your commute.

How do you choose?

Most people don't start with horsepower. You start with: how big is my parking space? What's my budget? Am I mostly on highways or city streets?

Choosing an AI model works the same way. Don't let the spec sheets intimidate you. Three steps is all you need.


Step 1: How big is your parking space? (Memory)

This is the most important step, because the best model in the world is useless if it can't run on your machine.

AI models need to be loaded into memory to run. Bigger model = more memory needed. "Memory" here means:

  • If you have a dedicated GPU (NVIDIA RTX series) → look at the GPU's own memory (called VRAM — separate from your computer's RAM, typically 8-24 GB)
  • If you're on a Mac → look at your RAM (Mac memory is shared between CPU and GPU, so all of it can be used for models)
  • If you're on a regular laptop → look at RAM, minus what the OS uses (usually 60-70% available)

Quick estimation formula

The B in model names stands for Billion parameters. To estimate memory usage:

Parameters (B) × 0.6 ≈ Required GB of memory

This assumes you download the compressed version (we'll explain "quantization" later — for now just remember this number). Some examples:

Model size   Memory needed (approx.)   What computer
1-3B         1-2 GB                    Phone, low-end laptop
7B           4-5 GB                    8 GB laptop
14B          8-9 GB                    16 GB laptop or Mac
30B          18-20 GB                  32 GB Mac or 24 GB GPU
70B          40-45 GB                  64 GB+ or pro GPU

How much memory does your computer have? That's your parking space. Once you know this, move to the next step.
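The formula is easy to sanity-check in code. A back-of-the-envelope sketch in Python (the 0.6 bytes-per-parameter factor is this section's Q4 rule of thumb, not a precise measurement — real usage varies with context length and runtime overhead):

```python
def required_gb(params_b: float) -> float:
    """Estimate memory (GB) needed to run a model at 4-bit quantization.

    The 0.6 factor bundles ~0.5 bytes/param of Q4 weights plus
    headroom for the KV cache and runtime buffers.
    """
    return params_b * 0.6

def max_params_b(available_gb: float) -> float:
    """Invert the formula: roughly the largest model size that fits."""
    return available_gb / 0.6

print(f"14B model needs ~{required_gb(14):.1f} GB")   # ~8.4 GB
print(f"12 GB free fits up to ~{max_params_b(12):.0f}B")
```

Run it with your own numbers: plug in your free memory and see which row of the table above you land in.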


Step 2: Sedan or truck? (Size vs. quality)

After confirming your memory limit, you'll find several models that "fit in the garage." Now comes the tradeoff:

Bigger model = smarter, but slower

A 30B model usually answers much better than a 7B — better comprehension, cleaner logic, less hallucination. But it's also much slower.

According to the Chatbot Arena human blind-test rankings, doubling a model's size typically improves quality by 10-20%. But speed may drop by more than half.

How fast is "fast enough"?

When AI answers a question, text appears word by word — it doesn't show up all at once. This speed is measured in "words per second" (technically called tok/s — tokens per second, where one token is roughly one word or Chinese character).

A useful reference: your reading speed is about 4-5 words per second. If AI outputs slower than you can read, you'll feel it "lagging."

According to BentoML's inference performance research and community benchmarks, speed tiers feel like this:

Words/sec   How it feels                                 Car analogy
< 5         Slower than you can read, want to quit       Gridlock traffic
5-12        Noticeably waiting, tempted to switch tabs   City traffic, barely tolerable
12-30       Some waiting but acceptable                  Regular roads, fine
30-50       Comfortable, feels like chatting             Highway, recommended target
50-80       Fast, streaming text feels natural           Express lane
80+         Can't tell the difference anymore            Airplane, but you just need to commute

How fast is ChatGPT? For reference: ChatGPT runs at about 50 words/sec, Claude at about 46, and Gemini can hit 220. On your own hardware, aiming for 30+ words/sec gives a similar experience to ChatGPT.
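To see why 5 words/sec feels painful and 30+ feels fine, compare how long a full answer takes to stream. A small sketch (the 300-token answer length is an assumption for illustration — roughly a few paragraphs):

```python
def wait_seconds(answer_tokens: int, tokens_per_sec: float) -> float:
    """Total time to stream an answer at a given generation speed."""
    return answer_tokens / tokens_per_sec

# A typical ~300-token answer at different speeds:
for speed in (5, 30, 50):
    print(f"{speed:>3} tok/s -> {wait_seconds(300, speed):.0f} s")
# 5 tok/s -> 60 s, 30 tok/s -> 10 s, 50 tok/s -> 6 s
```

A minute of watching text crawl versus ten seconds of comfortable streaming: that gap is the whole reason speed belongs in your decision.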

How much does speed actually vary?

I benchmarked different model sizes on my own machines to give you a feel:

Computer                      Model size               Speed (words/sec)   Feel
High-end GPU (RTX 5090)       3B (small)               310                 Instant
High-end GPU (RTX 5090)       8B (medium-small)        202                 Instant
Desktop workstation (128GB)   8B (compressed)          50                  Comfortable chat
Desktop workstation (128GB)   31B (large)              7                   Waiting, barely usable
MacBook Pro (32GB)            31B (tuned)              13                  Acceptable
MacBook Pro (32GB)            31B (default settings)   1.5                 Unusable

Notice the last two rows — same computer, same model, but different software settings caused an 8x speed difference. Settings matter, but that's an advanced topic (details here if you're curious).

Key takeaway: when choosing a model, don't just look at size — look at how fast your computer can run it. A medium model at 40 words/sec feels much better than a large model at 5 words/sec.

So how to choose?

  • Daily chat, translation, writing → Pick the biggest model your memory allows. Quality > speed when you're asking one question at a time.
  • Coding, debugging → Go bigger. Code quality differences between small and large models are stark — 7B often writes code that runs but has logic bugs.
  • Real-time chat, quick Q&A → Pick medium size. The sweet spot for speed/quality balance is usually 14-30B.
  • Very limited resources (8 GB or less) → Pick 7B, no hesitation. Today's 7B models are smarter than 70B models from two years ago.
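The rules of thumb above can be condensed into a toy decision function. This is only an illustration — `pick_size` and its thresholds are invented for this sketch, not a real recommender:

```python
def pick_size(available_gb: float, use_case: str) -> str:
    """Toy selector mirroring the rules of thumb above (illustrative only)."""
    ceiling_b = available_gb / 0.6            # biggest size that fits in memory
    if ceiling_b < 10:
        return "7B"                           # tight memory: 7B, no hesitation
    if use_case in ("chat", "coding"):        # quality first: go as big as fits
        return "30B" if ceiling_b >= 30 else "14B"
    return "14B"                              # real-time Q&A: medium sweet spot

print(pick_size(10, "chat"))     # ~10 GB free on a 16 GB Mac -> 14B
print(pick_size(24, "coding"))   # 24 GB GPU -> 30B
```

The point is the order of checks: memory ceiling first, then use case — never the other way around.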

Step 3: Pick a brand (Gemma, Llama, Qwen)

Once you've decided on a size, you'll find several brands at that size. Like deciding between Toyota, Honda, and Tesla after settling on a mid-size sedan.

The major AI model brands in 2026:

Brand     Made by      Best at                                Chinese   Sizes
Gemma     Google       Multilingual, reasoning, balanced      Good      1B, 4B, 12B, 27B
Llama     Meta         English ecosystem, largest community   Fair      1B, 3B, 8B, 70B, 405B
Qwen      Alibaba      Chinese, coding                        Best      0.6B, 1.8B, 7B, 14B, 32B, 72B
Mistral   Mistral AI   Efficiency, European languages         Fair      7B, 8x7B, 8x22B
Phi       Microsoft    Small model champion                   Fair      3B, 14B

Recommendations:

  • Mostly Chinese → Qwen or Gemma
  • Mostly English → Llama or Gemma
  • Coding → Qwen-Coder or Gemma
  • Very limited memory (≤ 8 GB) → Phi-3 or Gemma 4B
  • Not sure → Start with Qwen 14B or Gemma 12B — best value for the Chinese-speaking world

Quantization: same car, different compression

You may have noticed the same model comes in many "versions": Q4_K_M, Q8_0, FP16. These are quantization levels — like the difference between MP3 128kbps, 320kbps, and FLAC for music files.

Level   Size               Quality           Plain English
FP16    Largest            Best              FLAC — lossless, but huge
Q8      ~Half of FP16      Nearly lossless   MP3 320kbps — can't tell the difference
Q4      ~Quarter of FP16   Slight loss       MP3 128kbps — good enough for daily use
Q2      Smallest           Noticeable loss   64kbps — you can hear it

Pick Q4 for daily use. It's one-quarter the size of the original with quality loss that most users can't feel. The memory formula above already assumes Q4.

Want better quality? Go Q8. But memory usage doubles — make sure your "parking space" can fit it.
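The size relationships are simple multiplication. A sketch using rough bytes-per-parameter values (the numbers in `QUANT_BYTES` are approximations for illustration — actual downloads vary by a gigabyte or two):

```python
# Approximate bytes per parameter at each quantization level.
QUANT_BYTES = {"FP16": 2.0, "Q8": 1.0, "Q4": 0.5}

def file_size_gb(params_b: float, level: str) -> float:
    """Rough download/memory footprint of the model weights alone."""
    return params_b * QUANT_BYTES[level]

for level in ("FP16", "Q8", "Q4"):
    print(f"14B at {level}: ~{file_size_gb(14, level):.0f} GB")
# FP16 ~28 GB, Q8 ~14 GB, Q4 ~7 GB
```

Same 14B "car", three compression levels — and only the Q4 version fits in a 16 GB machine's garage.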


Worked example: I have a 16 GB Mac. What should I pick?

Walk through the steps:

  1. Memory: 16 GB, minus OS leaves about 10-12 GB usable
  2. Model size limit: 10 GB ÷ 0.6 ≈ 16B, so a 14B-class model fits with headroom
  3. Use case: Daily chat + occasional coding
  4. Language: Mostly Chinese

Recommendation: qwen3:14b (Q4 version, ~9 GB)

One command in Ollama:

ollama run qwen3:14b

Don't like it? Try Gemma:

ollama run gemma4:12b

Try both, ask the same question, compare answers. Keep whichever you prefer.


One sentence

Check your parking space (memory), pick the car size (parameters), then choose the brand. If unsure, start with Qwen 14B or Gemma 12B.

Next up: what is quantization? What's the actual difference between Q4, Q8, and FP16?

This is Part 3 of the "LLM 101" series. Previous: Four AI model architectures.

FAQ

How do I know how big an AI model is?
Look for the B number in the name — e.g. 7B means 7 billion parameters. More parameters = smarter but more memory. 7B needs about 4-5 GB, 70B needs about 40 GB. Quick formula: parameters (B) × 0.6 = required GB of memory (at 4-bit quantization).
What if my computer doesn't have enough memory?
Three options: 1) Pick a smaller model (7B or 14B), 2) Use a quantized version (Q4 is 4x smaller than original), 3) If you have a dedicated GPU, use its VRAM instead. 8 GB RAM → 7B max, 16 GB → 14B, 32 GB → 30B.
Which is better — Gemma, Llama, or Qwen?
Each has strengths. Google's Gemma is balanced and multilingual, Meta's Llama has the largest English ecosystem, Alibaba's Qwen is best for Chinese and strong at coding. At the same size, try Qwen (best Chinese) or Gemma (most balanced) first.
What do Q4, Q8, FP16 mean in model names?
These are quantization levels — compression ratios for model files. FP16 is original quality (largest, slowest), Q8 is light compression (nearly lossless), Q4 is heavy compression (4x smaller, slight quality drop). Think MP3 128kbps vs 320kbps vs FLAC. Q4 is fine for daily use.