LLM 101 · part 3
[LLM 101] So Many Models — Which One Should You Download?
❯ cat --toc
- Plain-Language Version: How to pick the right AI model from hundreds of options
- Preface
- Step 1: How big is your parking space? (Memory)
- Quick estimation formula
- Step 2: Sedan or truck? (Size vs. quality)
- How fast is "fast enough"?
- How much does speed actually vary?
- Step 3: Pick a brand (Gemma, Llama, Qwen)
- Quantization: same car, different compression
- Worked example: I have a 16 GB Mac. What should I pick?
- One sentence
TL;DR
Choosing a model is like buying a car: check your parking space (memory) first, then decide sedan or truck (model size), then pick a brand (Gemma, Llama, Qwen). One formula: parameters (B) × 0.6 ≈ required GB of memory. 8 GB RAM → 7B, 16 GB → 14B, 32 GB → 30B.
Plain-Language Version: How to pick the right AI model from hundreds of options
If you open Ollama's model library, you'll see hundreds of AI models with names that get longer by the day: Gemma 4 E4B, Llama 3.3 70B, Qwen3-Coder 235B-A22B. Each one claims to be great, but can your computer even run it? And if it can, which one is right for you?
It's exactly like buying a car. You wouldn't go test-drive a Ferrari before checking if it fits in your garage. Same with AI models: first figure out what your computer can handle, then pick the best one within that range.
Last article, we covered the four different "body types" of AI models. This one teaches you how to narrow down hundreds of models to the right one in three steps.
Preface
You walk into a dealership with 500 cars. Each one has a spec sheet: horsepower, torque, fuel economy, 0-60 time. You're not a race car driver. You just need something for your commute.
How do you choose?
Most people don't start with horsepower. You start with: how big is my parking space? What's my budget? Am I mostly on highways or city streets?
Choosing an AI model works the same way. Don't let the spec sheets intimidate you. Three steps is all you need.
Step 1: How big is your parking space? (Memory)
This is the most important step, because the best model in the world is useless if it can't run on your machine.
AI models need to be loaded into memory to run. Bigger model = more memory needed. "Memory" here means:
- If you have a dedicated GPU (NVIDIA RTX series) → look at the GPU's own memory (called VRAM — separate from your computer's RAM, typically 8-24 GB)
- If you're on a Mac → look at your RAM (Mac memory is shared between CPU and GPU, so most of it can go to models — just leave a few GB for macOS itself)
- If you're on a regular laptop → look at RAM, minus what the OS uses (usually 60-70% available)
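The three rules of thumb above can be sketched as a tiny helper. This is an illustrative estimate only — the function name and the percentages are assumptions matching the text, not any real API:

```python
# Sketch: estimate the memory budget available for model weights.
# Percentages mirror the rules of thumb above and are rough assumptions.

def usable_memory_gb(hardware: str, total_gb: float) -> float:
    """Rough memory available for a model, by hardware type."""
    if hardware == "dedicated_gpu":  # NVIDIA RTX etc.: VRAM is all yours
        return total_gb
    if hardware == "mac":            # unified memory: leave some for macOS
        return total_gb * 0.7
    # regular laptop: OS and apps eat into RAM (~60-70% left over)
    return total_gb * 0.65

print(usable_memory_gb("mac", 16))           # roughly 11 GB
print(usable_memory_gb("dedicated_gpu", 24)) # 24 GB
```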
Quick estimation formula
The B in model names stands for Billion parameters. To estimate memory usage:
Parameters (B) × 0.6 ≈ Required GB of memory
This assumes you download the compressed version (we'll explain "quantization" later — for now just remember this number). Some examples:
| Model size | Memory needed (approx.) | What computer |
|---|---|---|
| 1-3B | 1-2 GB | Phone, low-end laptop |
| 7B | 4-5 GB | 8 GB laptop |
| 14B | 8-9 GB | 16 GB laptop or Mac |
| 30B | 18-20 GB | 32 GB Mac or 24 GB GPU |
| 70B | 40-45 GB | 64 GB+ or pro GPU |
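The formula and table above boil down to two one-liners. A minimal sketch (the 0.6 constant assumes a Q4 download, as the text explains; the function names are made up for illustration):

```python
# parameters (B) x 0.6 ~ required GB, assuming a Q4-quantized download
GB_PER_BILLION_PARAMS_Q4 = 0.6

def memory_needed_gb(params_b: float) -> float:
    """Approximate memory needed for a model with params_b billion parameters."""
    return params_b * GB_PER_BILLION_PARAMS_Q4

def max_model_size_b(available_gb: float) -> float:
    """Largest model (billions of parameters) that fits in available_gb."""
    return available_gb / GB_PER_BILLION_PARAMS_Q4

print(memory_needed_gb(14))   # ~8.4 GB -> fits a 16 GB machine
print(max_model_size_b(10))   # ~16.7 -> a 14B model fits comfortably
```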
How much memory does your computer have? That's your parking space. Once you know this, move to the next step.
Step 2: Sedan or truck? (Size vs. quality)
After confirming your memory limit, you'll find several models that "fit in the garage." Now comes the tradeoff:
Bigger model = smarter, but slower
A 30B model usually answers much better than a 7B — better comprehension, cleaner logic, less hallucination. But it's also much slower.
According to the Chatbot Arena human blind-test rankings, doubling a model's size typically improves quality by 10-20%. But speed may drop by more than half.
How fast is "fast enough"?
When AI answers a question, text appears word by word — it doesn't show up all at once. This speed is measured in "words per second" (technically called tok/s — tokens per second, where one token is roughly one word or Chinese character).
A useful reference: your reading speed is about 4-5 words per second. If AI outputs slower than you can read, you'll feel it "lagging."
According to BentoML's inference performance research and community benchmarks, speed tiers feel like this:
| Words/sec | How it feels | Car analogy |
|---|---|---|
| < 5 | Slower than you can read, want to quit | Gridlock traffic |
| 5-12 | Noticeably waiting, tempted to switch tabs | City traffic, barely tolerable |
| 12-30 | Some waiting but acceptable | Regular roads, fine |
| 30-50 | Comfortable, feels like chatting | Highway, recommended target |
| 50-80 | Fast, streaming text feels natural | Express lane |
| 80+ | Can't tell the difference anymore | Airplane, but you just need to commute |
How fast is ChatGPT? For reference: ChatGPT runs at about 50 words/sec, Claude at about 46, and Gemini can hit 220. On your own hardware, aiming for 30+ words/sec gives a similar experience to ChatGPT.
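To get an intuition for these tiers, you can compute how long a typical answer takes at each speed. A back-of-envelope sketch, assuming a ~300-word answer and the ~4.5 words/sec reading speed cited above:

```python
# How long does a ~300-word answer take at each speed tier,
# compared to how long it takes you to read it?
ANSWER_WORDS = 300
READING_SPEED = 4.5  # words/sec, from the reference above

for tps in (5, 12, 30, 50):
    gen_time = ANSWER_WORDS / tps           # seconds for the model to finish
    read_time = ANSWER_WORDS / READING_SPEED
    feel = "you wait on it" if gen_time > read_time else "it stays ahead of you"
    print(f"{tps:>3} words/s: {gen_time:5.1f}s to generate ({feel})")
```

At 30 words/sec the model finishes a 300-word answer in 10 seconds, well under the ~67 seconds it takes to read it — which is why 30+ feels like chatting.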
How much does speed actually vary?
I benchmarked different model sizes on my own machines to give you a feel:
| Computer | Model size | Speed (words/sec) | Feel |
|---|---|---|---|
| High-end GPU (RTX 5090) | 3B (small) | 310 | Instant |
| High-end GPU (RTX 5090) | 8B (medium-small) | 202 | Instant |
| Desktop workstation (128GB) | 8B (compressed) | 50 | Comfortable chat |
| Desktop workstation (128GB) | 31B (large) | 7 | Waiting, barely usable |
| MacBook Pro (32GB) | 31B (tuned) | 13 | Acceptable |
| MacBook Pro (32GB) | 31B (default settings) | 1.5 | Unusable |
Notice the last two rows — same computer, same model, but different software settings caused an 8x speed difference. Settings matter, but that's an advanced topic for another day.
Key takeaway: when choosing a model, don't just look at size — look at how fast your computer can run it. A medium model at 40 words/sec feels much better than a large model at 5 words/sec.
So how do you choose?
- Daily chat, translation, writing → Pick the biggest model your memory allows. Quality > speed when you're asking one question at a time.
- Coding, debugging → Go bigger. Code quality differences between small and large models are stark — 7B often writes code that runs but has logic bugs.
- Real-time chat, quick Q&A → Pick medium size. The sweet spot for speed/quality balance is usually 14-30B.
- Very limited resources (8 GB or less) → Pick 7B, no hesitation. Today's 7B models are smarter than 70B models from two years ago.
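The decision rules above can be condensed into a small lookup. This is a sketch of the article's heuristics, not a real tool — the function name, size brackets, and use-case labels are all illustrative:

```python
# Sketch of the decision rules above: memory budget first, then use case.
# Size brackets follow the Step 1 formula (parameters x 0.6 ~ GB).

def recommend_size_b(usable_gb: float, use_case: str) -> int:
    """Pick a parameter count (billions) from memory budget and use case."""
    fits = usable_gb / 0.6            # largest size that fits (Step 1 formula)
    if fits < 10:
        return 7                      # very limited resources -> 7B, no hesitation
    if use_case == "realtime":        # quick Q&A: 14-30B sweet spot
        sweet = [s for s in (14, 30) if s <= fits]
        return sweet[-1] if sweet else 7
    # chat / writing / coding: biggest common size that fits
    return max(s for s in (7, 14, 30, 70) if s <= fits)

print(recommend_size_b(10, "chat"))      # 16 GB Mac -> 14
print(recommend_size_b(20, "realtime"))  # 32 GB Mac -> 30
```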
Step 3: Pick a brand (Gemma, Llama, Qwen)
Once you've decided on a size, you'll find several brands at that size. Like deciding between Toyota, Honda, and Tesla after settling on a mid-size sedan.
The major AI model brands in 2026:
| Brand | Made by | Best at | Chinese support | Sizes |
|---|---|---|---|---|
| Gemma | Google | Multilingual, reasoning, balanced | Good | 1B, 4B, 12B, 27B |
| Llama | Meta | English ecosystem, largest community | Fair | 1B, 3B, 8B, 70B, 405B |
| Qwen | Alibaba | Chinese, coding | Best | 0.6B, 1.8B, 7B, 14B, 32B, 72B |
| Mistral | Mistral AI | Efficiency, European languages | Fair | 7B, 8x7B, 8x22B |
| Phi | Microsoft | Small model champion | Fair | 3B, 14B |
Recommendations:
- Mostly Chinese → Qwen or Gemma
- Mostly English → Llama or Gemma
- Coding → Qwen-Coder or Gemma
- Very limited memory (≤ 8 GB) → Phi-3 or Gemma 4B
- Not sure → Start with Qwen 14B or Gemma 12B — best value for the Chinese-speaking world
Quantization: same car, different compression
You may have noticed the same model comes in many "versions": Q4_K_M, Q8_0, FP16. These are quantization levels — like the difference between MP3 128kbps, 320kbps, and FLAC for music files.
| Level | Size | Quality | Plain English |
|---|---|---|---|
| FP16 | Largest | Best | FLAC — lossless, but huge |
| Q8 | ~Half of FP16 | Nearly lossless | MP3 320kbps — can't tell the difference |
| Q4 | ~Quarter of FP16 | Slight loss | MP3 128kbps — good enough for daily use |
| Q2 | Smallest | Noticeable loss | 64kbps — you can hear it |
Pick Q4 for daily use. It's one-quarter the size of the original with quality loss that most users can't feel. The memory formula above already assumes Q4.
Want better quality? Go Q8. But memory usage doubles — make sure your "parking space" can fit it.
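The size differences between quantization levels come straight from bytes per parameter: FP16 stores ~2 bytes per parameter, Q8 ~1 byte, Q4 ~0.6 (which is where the Step 1 formula comes from). A rough sketch — real GGUF file sizes vary a bit, and the Q2 figure is an assumption:

```python
# Approximate bytes per parameter at each quantization level.
# FP16 = 16 bits = 2 bytes; Q8 ~ 1 byte; Q4 ~ 0.6 (with overhead); Q2 is a rough guess.
BYTES_PER_PARAM = {"FP16": 2.0, "Q8": 1.0, "Q4": 0.6, "Q2": 0.35}

def file_size_gb(params_b: float, level: str) -> float:
    """Estimated model size in GB for params_b billion parameters."""
    return params_b * BYTES_PER_PARAM[level]

for level in ("FP16", "Q8", "Q4", "Q2"):
    print(f"14B at {level}: ~{file_size_gb(14, level):.1f} GB")
```

For a 14B model this gives roughly 28 GB at FP16 versus ~8.4 GB at Q4 — the "one-quarter the size" the text mentions.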
Worked example: I have a 16 GB Mac. What should I pick?
Walk through the steps:
- Memory: 16 GB, minus OS leaves about 10-12 GB usable
- Model size limit: 10-12 GB ÷ 0.6 ≈ 16-20B, so a 14B model fits with room to spare
- Use case: Daily chat + occasional coding
- Language: Mostly Chinese
Recommendation: qwen3:14b (Q4 version, ~9 GB)
One command in Ollama:
ollama run qwen3:14b
Don't like it? Try Gemma:
ollama run gemma4:12b
Try both, ask the same question, compare answers. Keep whichever you prefer.
One sentence
Check your parking space (memory), pick the car size (parameters), then choose the brand. If unsure, start with Qwen 14B or Gemma 12B.
Next up: what is quantization? What's the actual difference between Q4, Q8, and FP16?
This is Part 3 of the "LLM 101" series. Previous: Four AI model architectures.
FAQ
- How do I know how big an AI model is?
- Look for the B number in the name — e.g. 7B means 7 billion parameters. More parameters = smarter but more memory. 7B needs about 4-5 GB, 70B needs about 40 GB. Quick formula: parameters (B) × 0.6 = required GB of memory (at 4-bit quantization).
- What if my computer doesn't have enough memory?
- Three options: 1) Pick a smaller model (7B or 14B), 2) Use a quantized version (Q4 is 4x smaller than original), 3) If you have a dedicated GPU, use its VRAM instead. 8 GB RAM → 7B max, 16 GB → 14B, 32 GB → 30B.
- Which is better — Gemma, Llama, or Qwen?
- Each has strengths. Google's Gemma is balanced and multilingual, Meta's Llama has the largest English ecosystem, Alibaba's Qwen is best for Chinese and strong at coding. At the same size, try Qwen (best Chinese) or Gemma (most balanced) first.
- What do Q4, Q8, FP16 mean in model names?
- These are quantization levels — compression ratios for model files. FP16 is original quality (largest, slowest), Q8 is light compression (nearly lossless), Q4 is heavy compression (4x smaller, slight quality drop). Think MP3 128kbps vs 320kbps vs FLAC. Q4 is fine for daily use.