Junyang Lin was the technical lead of Alibaba’s Qwen project. He announced he was stepping down on March 3, 2026. He now lists himself as an independent researcher on his personal site.
In a talk titled ‘Qwen: Towards a Generalist Model / Agent,‘ he walks through the Qwen family. It ends on a single line: “Training models -> training agents.” He later expanded that line into an detailed post as an independent researcher. This article reads the talk and the detailed post together.
What Lin’s Talk Actually Covers
The talk is a tour of the Qwen model family, not a single release. It moves through QwQ-32B, Qwen2.5-Max, Qwen3, Qwen2.5-VL, and Qwen2.5-Omni. Each stop shows benchmark charts against contemporaries. The named baselines include DeepSeek-R1, Grok 3 Beta, Gemini 2.5 Pro, and OpenAI’s o-series.
The Qwen3 stop carries the most detail. Lin highlights hybrid thinking modes: a thinking mode for step-by-step reasoning, and a non-thinking mode for near-instant responses. He adds dynamic thinking budgets, so callers can cap how much the model reasons. Qwen3 expanded multilingual support from 29 to 119 languages and dialects.
The presentation lists many model types and sizes from 0.6B to 235B parameters. It also lists quantized formats including GGUF, GPTQ, AWQ, and MLX, all under Apache 2.0. Two demos follow: a Web Dev demo and a Deep Research demo. The closing “Future work” slide points at agents. It lists more pretraining, RL with environment feedback, longer context, and more modalities. The last key mention is the “training models -> training agents.”
Qwen3 Architecture, As Shown in the Talk
The talk includes the Qwen3 architecture tables, reproduced below.
| Model | Layers | Heads (Q/KV) | Tie Embedding / Experts (Total/Act.) | Context |
|---|---|---|---|---|
| Qwen3-0.6B | 28 | 16 / 8 | Tie: Yes | 32K |
| Qwen3-1.7B | 28 | 16 / 8 | Tie: Yes | 32K |
| Qwen3-4B | 36 | 32 / 8 | Tie: Yes | 32K |
| Qwen3-8B | 36 | 32 / 8 | Tie: No | 128K |
| Qwen3-14B | 40 | 40 / 8 | Tie: No | 128K |
| Qwen3-32B | 64 | 64 / 8 | Tie: No | 128K |
| Qwen3-30B-A3B | 48 | 32 / 4 | Experts: 128 / 8 | 128K |
| Qwen3-235B-A22B | 94 | 64 / 4 | Experts: 128 / 8 | 128K |
The small dense models tie input and output embeddings and use a 32K context. The larger dense and MoE models drop tying and extend context to 128K. The two MoE models activate 8 of 128 experts per token.
Hybrid Thinking, and Why Merging is Hard
Lin presents hybrid thinking as a clean feature. The post explains why it was hard to build. Lin writes that thinking mode and instruct mode pull in opposite directions.
A strong instruct model is rewarded for directness, brevity, and low latency. A strong thinking model is rewarded for spending more tokens on hard problems. Merge the two carelessly, and both degrade. The thinking behavior gets bloated, and the instruct behavior gets less crisp.
Qwen3 tried the merge with a four-stage post-training pipeline. That pipeline included a long-CoT cold start, reasoning RL, and a “thinking mode fusion” step. Later in 2025, the 2507 line shipped separate Instruct and Thinking variants instead. Lin frames this as a data problem more than a model problem.
Anthropic took the opposite route, and Lin calls it a useful corrective. Claude 3.7 Sonnet shipped as a hybrid model with a user-set thinking budget. Claude 4 let reasoning interleave with tool use, aimed at coding and long-running tasks. His point: a longer reasoning trace does not make a model smarter. Thinking should be shaped by the target workload, not by the benchmark.
Interactive Explainer
Reasoning Thinking → Agentic Thinking
Two ways a model can “think.” One deliberates, then answers. The other thinks in order to act, looping with an environment. Pick a task and step through both.
Reasoning model
Agentic system
Test-time scaling: more thinking budget, more accuracy
MathVision accuracy vs. max thinking length, as shown in Junyang Lin’s talk. Drag to set the thinking budget.

