Baidu Releases Unlimited OCR, a 3B Model That Keeps the KV Cache Flat for Long-Document Parsing

Most end-to-end OCR models slow down as output grows. Each generated token adds to the KV cache. Memory rises and generation drags. Parsing dozens of pages becomes impractical. Baidu’s Unlimited OCR addresses this directly. It swaps the decoder’s attention for a design that keeps memory constant.

TL;DR

Unlimited OCR is a 3B-parameter Mixture-of-Experts model, with only 500M parameters active.
It replaces decoder attention with Reference Sliding Window Attention (R-SWA), keeping the KV cache constant.
The model parses dozens of pages in one forward pass under a 32K maximum length.
It scores 93.23 on OmniDocBench v1.5, beating the DeepSeek OCR baseline by 6.22 points.
It builds on DeepSeek OCR through continued training rather than a from-scratch run.

What is Unlimited OCR?

Unlimited OCR takes DeepSeek OCR as its baseline. It keeps the DeepEncoder and the Mixture-of-Experts decoder. The MoE design holds 3B total parameters but activates only 500M at inference.

The DeepEncoder is the compression engine. It cascades a SAM-ViT under window attention with a CLIP-ViT under global attention. At the bridge, it applies 16× token compression. A 1024×1024 PDF image becomes just 256 visual tokens. Fewer input tokens mean a smaller prefill.
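The token arithmetic above can be checked in a few lines of Python. This is a rough sketch, not the DeepEncoder itself: `visual_tokens` is a hypothetical helper, and the 16-pixel patch size is an assumption, chosen because it is the value that makes the quoted figures agree (4,096 patches compressed 16× gives 256 tokens).

```python
def visual_tokens(height: int, width: int, patch: int = 16, compression: int = 16) -> int:
    """Estimate visual tokens after patchification and token compression.

    patch=16 is an assumption (not stated in the article); compression=16
    is the 16x figure quoted for the DeepEncoder bridge.
    """
    if height % patch or width % patch:
        raise ValueError("image sides must be multiples of the patch size")
    patches = (height // patch) * (width // patch)  # tokens before compression
    return patches // compression


if __name__ == "__main__":
    print(visual_tokens(1024, 1024))      # one Base-mode page
    print(8 * visual_tokens(1024, 1024))  # eight Base-mode pages of prefill
```

Under these assumptions, prefill cost scales with page count at 256 tokens per page, which is what makes multi-page input feasible within a fixed context budget.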

DeepEncoder natively supports five resolution modes, and Unlimited OCR keeps two. ‘Base’ mode runs at 1024×1024 for multi-page work. ‘Gundam’ mode uses dynamic resolution for single pages.

How R-SWA Keeps the Cache Constant

The core contribution is Reference Sliding Window Attention (R-SWA). Standard Multi-Head Attention (MHA) stores a key and value for every token. As output length T grows, the cache grows with it: C_MHA(T) = L_m + T, where L_m is the number of reference tokens (visual tokens plus prompt). Memory and latency climb without bound.

R-SWA breaks that link. Each generated token attends to all reference tokens, meaning the visual tokens and the prompt. It also attends to the preceding n output tokens, where n defaults to 128. Everything older is evicted. The cache becomes a fixed-size queue holding at most L_m + n entries.
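The eviction rule can be illustrated with a toy cache. This is a minimal sketch under stated assumptions, not the model's implementation: `RSWACache` is a hypothetical class, entries are plain Python objects standing in for key/value pairs, and a `collections.deque` with `maxlen` plays the role of the fixed queue.

```python
from collections import deque


class RSWACache:
    """Toy KV cache: permanent reference entries plus a bounded output window."""

    def __init__(self, reference, window: int = 128):
        self.reference = list(reference)    # visual + prompt entries, never evicted
        self.window = deque(maxlen=window)  # last n output entries; oldest drops off

    def append(self, entry) -> None:
        """Store the newest output token's entry, evicting the oldest if full."""
        self.window.append(entry)

    def visible(self) -> list:
        """Entries the next generated token may attend to."""
        return self.reference + list(self.window)

    def __len__(self) -> int:
        return len(self.reference) + len(self.window)


if __name__ == "__main__":
    cache = RSWACache(reference=[f"ref{i}" for i in range(4)], window=3)
    for t in range(10):
        cache.append(f"out{t}")
    print(len(cache))       # 4 reference + 3 windowed entries
    print(cache.visible())  # all references, then only the 3 newest outputs
```

However many tokens are appended, the cache length never exceeds the reference count plus the window width.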

The size is C_R-SWA(T) = L_m + min(n, T) ≤ L_m + n, which is bounded by a constant. As T grows far beyond n, the cache ratio ρ(T) = C_R-SWA(T) / C_MHA(T) trends toward zero. Memory and per-step latency therefore stay flat.
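The two cache formulas are simple enough to evaluate directly. The sketch below assumes eight prefilled pages at 256 visual tokens each and ignores prompt tokens for simplicity; the function names are illustrative, and the ratio is computed as the R-SWA cache size divided by the MHA cache size.

```python
def cache_mha(l_m: int, t: int) -> int:
    """C_MHA(T) = L_m + T: every reference and output token is cached."""
    return l_m + t


def cache_rswa(l_m: int, t: int, n: int = 128) -> int:
    """C_R-SWA(T) = L_m + min(n, T): only the last n output tokens are kept."""
    return l_m + min(n, t)


def cache_ratio(l_m: int, t: int, n: int = 128) -> float:
    """rho(T) = C_R-SWA(T) / C_MHA(T)."""
    return cache_rswa(l_m, t, n) / cache_mha(l_m, t)


if __name__ == "__main__":
    l_m = 8 * 256  # eight pages at 256 visual tokens each (prompt tokens ignored)
    for t in (128, 8_000, 120_000):
        print(t, cache_mha(l_m, t), cache_rswa(l_m, t), round(cache_ratio(l_m, t), 3))
```

At T = 8,000 this gives 10,048 entries for MHA against 2,176 for R-SWA, a ratio of about 0.217. These are counts of cached key/value entries implied by the formulas, not measured memory or latency.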

The research team compares this to soft forgetting. A person copying a book glances at the source and the last few words. They do not re-read everything transcribed so far. Visual tokens never undergo state updates. That avoids the progressive blurring seen in linear attention. An interactive simulator accompanying the article lets readers vary T and watch both caches respond.

[Interactive demo: Unlimited OCR, R-SWA vs MHA KV-Cache Simulator. Sliders set the pages prefilled (reference tokens L_m, estimated at 256 tokens per 1024×1024 page), the output tokens generated (T), and the sliding window width (n). With the defaults of 8 pages, T = 8,000, and n = 128, the MHA cache holds 2,048 + 8,000 = 10,048 key/value entries and the R-SWA cache holds 2,048 + 128 = 2,176, a cache ratio ρ of 0.217, or 78% memory saved versus standard MHA. The numbers illustrate the cache formulas in the Unlimited OCR technical report (arXiv:2606.23050), not a benchmark run.]
