~/blog/llm-101-dense-moe-ple-ssm-architectures

LLM 101 · part 2

[LLM 101] Dense, MoE, PLE, SSM — Four AI Model Architectures Explained Simply

2026-04-08 · 9 min read · #dense #moe #ple #ssm

TL;DR

Four main AI model architectures: Dense (everyone works, stable but slow), MoE (expert rotation, big but fast), PLE (dictionary on every floor, efficient lookups), SSM (speed reader, no slowdown on long conversations). Knowing the architecture tells you more than the parameter count.

Plain-Language Version: Why Do AI Models Have Different "Body Types"?

You've probably noticed AI model names getting weirder. Gemma 4 E2B, Qwen3-Coder 235B-A22B, Mamba — what do all these letters and numbers mean?

They describe the model's "body type" — its architecture. Just like athletes have different builds (sprinters vs marathon runners), AI models have different designs, each suited for different tasks.

This article explains the four most common AI model architectures in plain language. After reading it, the next time you see a model's spec sheet, it won't just be a wall of confusing numbers.


Preface

When buying a phone, you don't just look at "how many GB of RAM" and decide. You also want to know: what's the processor architecture? Power-efficient or performance-oriented? Can it run games?

Picking an AI model works the same way. "120B parameters" sounds impressive, but if it's a Dense model, your laptop probably can't run it. Meanwhile, a "30B parameter" MoE model might be both light and fast.

Architecture determines personality.


Dense — Everyone Works

Dense is the most intuitive architecture. Every single parameter in the model is used for every single response.

What's it like?

Like a 100-person company where every question — no matter how simple — gets all 100 people working on it. Even if you just ask "what's the weather today," all 100 people get involved.

Strengths

  • Consistent quality. All parameters participate in every computation. Nothing gets skipped
  • Simple architecture. The oldest and most mature design. Every tool supports it
  • Predictable. Bigger model = proportionally slower. No surprises

Weaknesses

  • Slow. 100 people working means 100 people's worth of computation. No shortcuts
  • Memory hungry. Every parameter needs to be loaded into memory. A 31B model takes 31B worth of space
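The memory math behind that last point is simple enough to sketch. A back-of-the-envelope calculator — the bytes-per-parameter figures are typical values for fp16 weights and 4-bit quantization, not measurements from any particular runtime:

```python
def dense_memory_gb(params_billion: float, bytes_per_param: float) -> float:
    """Rough memory needed just to hold a Dense model's weights.
    Ignores activations, KV cache, and runtime overhead."""
    return params_billion * 1e9 * bytes_per_param / 1024**3

# fp16 (2 bytes/param) vs 4-bit quantization (0.5 bytes/param)
print(round(dense_memory_gb(31, 2.0), 1))   # ≈ 57.7 GB
print(round(dense_memory_gb(31, 0.5), 1))   # ≈ 14.4 GB
```

Even quantized to 4 bits, a 31B Dense model wants ~14 GB just for weights — which is why it chokes on consumer hardware.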

Notable models

  • Llama 3 8B / 70B — Meta's classic
  • Gemma 4 31B — Google's Dense model. On my DGX Spark it only managed 7 tok/s — too large for the hardware

How to spot it

Names with just one number: Llama-3-8B, Gemma-4-31B. No -A suffix.


MoE — Expert Rotation

MoE (Mixture of Experts) is the most popular "cheat code" in current AI — the model is huge, but only a fraction activates each time.

What's it like?

Like a company with 260 employees, but only 10 relevant specialists get dispatched per task. Finance question? Send the finance team. Legal question? Send the lawyers. The other 250 people stay on standby.

The company has the knowledge of 260 people, but the workload of 10.

Strengths

  • Big and fast. Knowledge of a giant model, computation cost of a small one
  • Lower compute burden. Though the full model is large, each inference only activates a small fraction
  • Great for generalist models. Different experts can specialize in different domains

Weaknesses

  • Still a large download. All 260 people need to be loaded into memory on standby, even if only 10 are working. Storage and download requirements don't shrink
  • Routing overhead. Deciding "who should handle this" takes computation too
  • Uneven expert quality. Some domains may not have a well-trained expert, causing inconsistent quality
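The routing step can be sketched in a few lines. This is a toy top-k gate — the scores and expert count are made up, and in a real MoE the router is a learned layer inside the network, not a standalone function:

```python
import math

def route(router_scores, k=2):
    """Pick the top-k experts for one token and softmax-normalize
    their gate weights. router_scores: one logit per expert."""
    top = sorted(range(len(router_scores)), key=lambda i: -router_scores[i])[:k]
    exps = [math.exp(router_scores[i]) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]

# 8 experts on staff, only 2 get dispatched for this token
scores = [0.1, 2.0, -1.0, 0.5, 1.8, 0.0, -0.5, 0.3]
print(route(scores, k=2))  # experts 1 and 4, with blend weights summing to 1
```

The "routing overhead" weakness is visible here too: scoring every expert and sorting is extra work the Dense model never does.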

Notable models

  • Qwen3-Coder 235B-A22B — 235B total, only 22B active. Excellent at coding
  • DeepSeek-V3 685B-A37B — Massive model, 685B parameters but only uses 37B
  • Gemma 4 26B-A4B — Google's compact MoE. 26B total, 4B active, hit 52 tok/s on DGX Spark

How to spot it

Two numbers connected by -A: 235B-A22B = 235B total, 22B active. The number after A tells you how fast it actually runs.


PLE — Dictionary on Every Floor

PLE (Per-Layer Embedding) is a newer architecture, currently used mainly by Google's Gemma 4 family.

What's it like?

Imagine a 42-story office building. In traditional architectures, no matter which floor you're working on, you need to go down to the lobby to look up a word in the big dictionary. The lobby has one massive dictionary (262,144 words) shared by all floors.

PLE's approach: every floor gets its own dictionary. 42 floors, 42 dictionaries. Working on the 17th floor? Use the 17th floor dictionary. No elevator trip needed.

Strengths

  • Fast lookups. No need to go downstairs — each layer resolves locally
  • Low actual computation. Those 42 dictionaries are just lookup tables, not math. Gemma 4 E4B is nominally 8B parameters, but the real compute path is only about 4B

Weaknesses

  • Model file is bigger than it "should" be. 42 dictionaries take space. E4B's dictionaries alone are 5.4 GB — nearly a third of the total model size
  • Newer architecture. Not all tools support it perfectly yet
  • Only Gemma 4 uses it. Small ecosystem compared to Dense and MoE
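The trade-off — local lookups at the cost of multiplied storage — can be made concrete with a toy sketch. All dimensions here are tiny illustrative numbers, not Gemma's real ones:

```python
VOCAB, DIM, LAYERS = 1000, 8, 4

# Traditional: one big table shared by every layer (the lobby dictionary)
shared_params = VOCAB * DIM

# PLE: every layer carries its own table (a dictionary per floor)
per_layer_tables = [
    {tok: [float(layer * VOCAB + tok)] * DIM for tok in range(VOCAB)}
    for layer in range(LAYERS)
]
ple_params = LAYERS * VOCAB * DIM

def ple_lookup(layer, token_id):
    """A lookup is a local read on the layer's own table — pure indexing,
    no matrix math and no trip to a shared table."""
    return per_layer_tables[layer][token_id]

print(shared_params, ple_params)  # 8000 vs 32000
```

Four floors, four times the dictionary storage — but each lookup stays a cheap local read. That is exactly why the model file is bigger than the compute-parameter count suggests.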

Notable models

  • Gemma 4 E2B — 2B compute parameters, 7.2 GB. Hit 81 tok/s on M1 Max
  • Gemma 4 E4B — 4B compute parameters, 9.6 GB

How to spot it

E prefix in the name: E2B = 2B compute parameters, E4B = 4B. The E stands for effective parameters — the number describes the compute path, not total storage, so don't compare it directly with Dense models' parameter counts.


SSM / Hybrid — The Speed Reader

SSM (State Space Model) is the newest and most different architecture. Traditional models use Transformers (attention mechanism). SSM uses a fundamentally different approach.

What's it like?

A traditional Transformer is like someone who flips back through previous pages every time you ask about something. "What was chapter 3 about?" — they literally re-read chapter 3. The longer the conversation, the more pages to flip, the slower it gets.

SSM is like a speed reader who compresses everything into running notes as they go. You ask about chapter 3? They check their notes. No flipping back. No matter how long the conversation gets, checking notes takes the same amount of time.

Strengths

  • No slowdown on long conversations. This is SSM's biggest selling point. Traditional models get slower as conversations grow; SSM barely changes
  • Stable memory usage. Regardless of conversation length, the SSM "notebook" stays a fixed size

Weaknesses

  • No advantage on short conversations. When the conversation is brief, "flipping back" and "checking notes" are roughly the same speed
  • Might miss details. Compressing an entire book into notes inevitably loses some specifics. For tasks requiring exact recall, SSM may underperform Transformers
  • Very new technology. Tool support and ecosystem are still developing
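The "running notes" idea is literally a recurrence. A minimal one-dimensional state-space scan — the coefficients are arbitrary here, whereas real SSMs learn them and keep a large state vector, but the key property is the same:

```python
def ssm_scan(xs, a=0.9, b=0.1, c=1.0):
    """Minimal 1-D state-space recurrence: the 'running notes'.
    The state h stays one number no matter how long xs gets."""
    h = 0.0
    ys = []
    for x in xs:
        h = a * h + b * x   # fold the new input into the fixed-size state
        ys.append(c * h)    # read the output from the notes
    return ys, h

ys_short, _ = ssm_scan([1.0] * 10)
ys_long, h = ssm_scan([1.0] * 10_000)
# After 10,000 steps the state is still a single float.
# A Transformer's KV cache would be holding 10,000 entries by now.
```

Each step costs the same regardless of position — that is the mechanical reason long conversations don't slow down. The lossiness is visible too: everything the model has read is squeezed into `h`.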

The reality: Hybrid architectures

Pure SSM can miss details in some scenarios, so the more common approach today is "hybrid" — part SSM for speed reading, part Transformer for close reading. Like someone who usually checks their notes, but goes back to the original text for key passages.

Notable models

  • Qwen3.5-35B-A3B — SSM + MoE hybrid. On DGX Spark: 56 tok/s on short prompts, still 56 tok/s at 8K tokens — virtually no degradation
  • Qwen3-Coder-Next 79.7B — SSM + MoE hybrid, 512 experts with only 10 active
  • Mamba series — The original pure SSM models, pioneers in academia

How to spot it

Names containing Mamba, DeltaNet, or SSM. Hybrid architectures may not be explicitly labeled — check the model card's architecture description. Quick test: if the spec sheet says "similar speed on short and long context," there's probably SSM inside.


All Four in One Table

|                    | Dense                       | MoE                       | PLE                             | SSM / Hybrid                   |
|--------------------|-----------------------------|---------------------------|---------------------------------|--------------------------------|
| Analogy            | Everyone works              | Expert rotation           | Dictionary per floor            | Speed reader                   |
| Speed              | Slow (more params = slower) | Fast (only uses a subset) | Medium (fast lookups, big file) | Fast (especially on long chats) |
| Model size         | Big = slow                  | Big but runs light        | Bigger than it looks            | Medium                         |
| Long conversations | Gets slower                 | Gets slower               | Gets slower                     | Almost no slowdown             |
| Maturity           | ★★★★★                       | ★★★★☆                     | ★★☆☆☆                           | ★★★☆☆                          |
| Example            | Llama 3                     | DeepSeek-V3               | Gemma 4 E2B                     | Qwen3.5                        |

In Practice: Reading Model Names

Next time you see a model on Ollama or HuggingFace, try this decision tree:

Does the name have -A + a number? (e.g., 235B-A22B) → MoE. Look at the number after A to gauge real speed.

Does the name have an E prefix + B? (e.g., E2B, E4B) → PLE. Currently only the Gemma 4 family.

Does the name include Mamba, DeltaNet, or SSM? → SSM / Hybrid. Great for long conversation scenarios.

None of the above? Just one number + B? (e.g., 8B, 70B) → Dense. The most traditional, most stable, but also most resource-hungry.
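The decision tree above can be wired into a quick heuristic. A sketch — the regexes encode only the naming conventions listed here, and real model cards can (and sometimes do) break them, so treat this as a first guess, not ground truth:

```python
import re

def guess_architecture(name: str) -> str:
    """Heuristic architecture guess from a model name alone."""
    n = name.lower()
    # SSM / Hybrid: keyword in the name
    if any(k in n for k in ("mamba", "deltanet", "ssm")):
        return "SSM / Hybrid"
    # MoE: total-B followed by -A + active-B, e.g. 235B-A22B
    if re.search(r"\d+(?:\.\d+)?b-a\d+(?:\.\d+)?b", n):
        return "MoE"
    # PLE: E prefix on the parameter count, e.g. E2B, E4B
    if re.search(r"\be\d+b\b", n):
        return "PLE"
    # Dense: a lone parameter count, e.g. 8B, 70B
    if re.search(r"\d+(?:\.\d+)?b\b", n):
        return "Dense"
    return "Unknown"

print(guess_architecture("Qwen3-Coder-235B-A22B"))  # MoE
print(guess_architecture("Gemma-4-E4B"))            # PLE
print(guess_architecture("Llama-3-8B"))             # Dense
print(guess_architecture("Mamba-2.8B"))             # SSM / Hybrid
```

Note the check order matters: the SSM keywords go first because hybrids like Qwen3.5-35B-A3B also carry a MoE-style suffix.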


What Was Gained

What cost the most time

Deciding whether to explain Transformer attention mechanisms. Ultimately decided against it — for the purpose of "choosing a model," knowing that "traditional models get slower with longer conversations" is enough without explaining why.

A thinking framework you can take with you

When reading spec sheets, don't just look at "how many B parameters." Look at three things:

  1. Architecture — Dense / MoE / PLE / SSM?
  2. Active parameters (the number after A in MoE) — this is the actual compute burden
  3. Your use case — Short chats? Long conversations? Multiple simultaneous users?

The pattern that applies everywhere

There's no best architecture, only the best architecture for your scenario. Nobody says "sprinters are better than marathon runners" — it depends on what race you're running.


What's Next

FAQ

What is the difference between Dense and MoE models?
Dense models use every parameter for every response — like a 100-person company where everyone works on every task. MoE models only activate a subset of experts each time — like a 260-person company that sends 10 relevant specialists per task. MoE models are larger but run faster.
What is PLE architecture in AI models?
Per-Layer Embedding gives each decoder layer its own vocabulary lookup table. Instead of one shared dictionary on the ground floor, each floor of the building has its own. Google's Gemma 4 E2B and E4B use this architecture.
What are the advantages of SSM models?
SSM (State Space Model) maintains a running summary instead of re-reading the entire conversation. As conversations get longer, SSM stays fast while traditional models slow down proportionally.
How can I tell a model's architecture from its name?
Look for number patterns: '30B-A3B' means MoE (30B total, 3B active). E2B/E4B means PLE. Names with Mamba or DeltaNet are SSM. A single number like '8B' or '70B' with no other suffix is usually Dense.