LiteResearcher: A Scalable Agentic RL Training Framework for Deep Research Agent.

A 4B deep research agent that beats 30B open-source baselines and matches Claude-4.5-Sonnet / GPT-5-high — trained with zero marginal RL API cost.

Wanli Li¹,²,*, Bince Qu¹,²,*, Bo Pan¹, Jianyu Zhang¹, Zheng Liu³,†, Pan Zhang², Wei Chen¹, Bo Zhang¹,²,†

¹ Zhejiang University • ² Simplex AI • ³ The Hong Kong Polytechnic University

* Equal contribution. Work done during internship at Simplex AI. † Corresponding authors

✉ wanli_li@zju.edu.cn, tonyzhang@simplexai.com

Resources: Paper, Code, Model (RL), Model (SFT), Dataset, Corpus (32M), Trajectories

TL;DR

One sentence.

LiteResearcher-4B is a 4B deep research agent trained with zero marginal RL API cost, outperforming 30B open-source deep research agents and matching frontier systems such as Claude-4.5-Sonnet and GPT-5-high. Its RL stage runs entirely in a local search/browse environment, enabling 73.2M tool calls without live search or browse API consumption.

Headline numbers

The shape of the result.

71.3 / 78.0 on GAIA / Xbench-DS
open-source SOTA; beats 30B agents

+15.7 GAIA points from RL
SFT 55.6 → RL 71.3; AgentCPM +3.8

73.2M local RL tool calls
$0 marginal cost vs. $59K–$243K live web
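
As a rough sanity check on these figures, the short Python sketch below converts the quoted live-web estimate into an implied average price per 1,000 tool calls and recomputes the GAIA gain. It assumes the $59K–$243K range covers the same 73.2M calls. How that estimate was derived (for example, the split between search and browse pricing) is not stated here, so the per-call prices are implied averages only.

```python
# Figures quoted in the headline numbers above.
TOOL_CALLS = 73.2e6        # local RL tool calls
LIVE_COST_LOW = 59_000     # USD, low end of the live-web estimate
LIVE_COST_HIGH = 243_000   # USD, high end of the live-web estimate
GAIA_SFT, GAIA_RL = 55.6, 71.3


def usd_per_thousand_calls(total_usd: float, calls: float) -> float:
    """Average price per 1,000 tool calls implied by a total bill."""
    return total_usd / calls * 1_000


low = usd_per_thousand_calls(LIVE_COST_LOW, TOOL_CALLS)
high = usd_per_thousand_calls(LIVE_COST_HIGH, TOOL_CALLS)
gain = round(GAIA_RL - GAIA_SFT, 1)

print(f"implied live-web price: ${low:.2f} to ${high:.2f} per 1K calls")
print(f"GAIA gain from RL: +{gain} points")
```

Under that assumption, the range works out to roughly $0.81 to $3.32 per 1,000 calls, and the GAIA gain matches the quoted +15.7 points.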

Figure: Performance of LiteResearcher. Left: accuracy comparison on the Xbench DeepSearch benchmark across models of various scales. Right: average rollout latency and cost per turn.

Abstract

The full story.

Reinforcement Learning (RL) has emerged as a powerful training paradigm for LLM-based agents. However, scaling agentic RL for deep research remains constrained by two coupled challenges: hand-crafted synthetic data fails to elicit genuine real-world search capabilities, and dependence on real-world search during RL training introduces instability and high cost. Together, these limit the scalability of agentic RL.

LiteResearcher is a training framework that makes agentic RL scalable and low-cost. By constructing a lite virtual world that mirrors real-world search dynamics, we enable a continuously improving training recipe that empowers a small search agent to outperform large-scale open-source and commercial models (e.g., Tongyi DeepResearch and Claude-4.5-Sonnet). The RL stage runs entirely in a local search/browse environment, removing external API consumption during RL while preserving realistic tool-use dynamics. On widely used benchmarks such as GAIA and Xbench-DS, LiteResearcher-4B achieves open-source state-of-the-art results of 71.3% and 78.0%, respectively, demonstrating that scalable RL training is essential for deep research agents.

Method

Three pillars.

LiteResearcher constructs a virtual world that mirrors the architecture of the real web but runs in isolation from it. The framework consists of three key components:

(1) Co-constructed Training Data & Corpus: We scale up information sources (32M+ webpages, 1M+ domains) and identify five atomic search capabilities — direct retrieval, aggregation, enumeration, cross-verification, and statistics — to generate diverse, realistic training tasks.

(2) Stable Local Tool Environment: A local search engine (BGE-M3 + Milvus, ~0.15s/query) and a local browse tool (PostgreSQL, ~0.17s/page) serve all 73.2M training-time tool calls locally, with no external API consumption during RL and zero marginal tool cost.

(3) Difficulty-Aware Curriculum RL: Multi-stage training that progressively increases task difficulty and context length, retaining only partially solvable instances to maintain a consistent training signal.
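
The "partially solvable" filter in component (3) can be illustrated with a minimal Python sketch. It assumes each task has been attempted several times by the current policy and that each attempt is judged correct or incorrect. The `Task` class, the `keep_partially_solvable` helper, and the default thresholds are hypothetical names and values chosen for illustration, not the framework's actual implementation. The intuition is that a task every rollout solves, or every rollout fails, gives all rollouts the same reward and so cannot separate good trajectories from bad ones.

```python
from dataclasses import dataclass


@dataclass
class Task:
    task_id: str
    outcomes: list[bool]  # one entry per sampled rollout: True if judged correct

    @property
    def pass_rate(self) -> float:
        return sum(self.outcomes) / len(self.outcomes)


def keep_partially_solvable(tasks, low=0.0, high=1.0):
    """Keep tasks whose pass rate lies strictly between low and high.

    Tasks with no rollouts are dropped, since their pass rate is undefined.
    """
    return [t for t in tasks if t.outcomes and low < t.pass_rate < high]


tasks = [
    Task("always-solved", [True] * 8),
    Task("never-solved", [False] * 8),
    Task("sometimes-solved", [True, False, False, True, False, False, False, True]),
]
kept = keep_partially_solvable(tasks)
print([t.task_id for t in kept])
```

Only `sometimes-solved` survives the filter. In a multi-stage setup, one plausible design is to re-run this filter with the latest checkpoint before each stage, so that tasks the agent has mastered drop out and harder ones take their place. Whether LiteResearcher does exactly this is not specified here.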

Figure: Overview of the LiteResearcher training framework.

Quick Start

Build the local environment.

The local search/browse environment (BGE-M3 + Milvus hybrid search) is open-sourced under Environment/, and the ~32M-record corpus is on HuggingFace. Stand up your own local search tool in four steps:

# 0. Install (also needs local BGE-M3 weights + a Redis instance)
pip install -r requirements.txt

# 1. Start Milvus (edit milvus_config/.env first); the subshell keeps later steps in the current directory
(cd milvus_config && docker-compose up -d)

# 2. Download the released 32M corpus and decompress
huggingface-cli download simplex-ai-inc/LiteResearcher-Corpus \
  serper_test_text.jsonl.zst --repo-type dataset --local-dir ./corpus
zstd -d ./corpus/serper_test_text.jsonl.zst

# 3. Build the index (configure paths/collection in config.py)
python data.py

# 4. Serve hybrid search at :8018
cd server && REDIS_HOST=127.0.0.1 EMBED_WORKERS=1 bash start.sh

Full setup, API reference, and optional PostgreSQL full-text fetch are documented in the Environment README.
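
Before configuring paths and the collection in config.py, it can help to look at a few records from the decompressed corpus. The Python sketch below reads only the first few lines of the JSON Lines file produced in step 2. The helper name `peek_jsonl` is hypothetical, and because the record schema is not described here, the example prints whatever field names it finds rather than assuming any.

```python
import json
from itertools import islice
from pathlib import Path


def peek_jsonl(path, n=3):
    """Return the first n non-empty records of a JSON Lines file without reading the rest."""
    with Path(path).open(encoding="utf-8") as fh:
        lines = (line for line in fh if line.strip())
        return [json.loads(line) for line in islice(lines, n)]


if __name__ == "__main__":
    corpus = Path("./corpus/serper_test_text.jsonl")  # file produced by step 2
    if corpus.exists():
        for record in peek_jsonl(corpus):
            # Field names are not documented here, so list them instead of assuming them.
            print(sorted(record) if isinstance(record, dict) else type(record).__name__)
    else:
        print(f"{corpus} not found; run step 2 first")
```

Since the file is streamed line by line, this stays cheap even on the full ~32M-record corpus.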

Main Results.

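LiteResearcher-4B posts the best open-source results on GAIA-Text, Webwalker, Seal-0, and Xbench-DS, ahead of open-source models up to 8× larger, and matches or exceeds several proprietary systems on those benchmarks, while remaining a low-cost 4B agent trained with fully local RL tool calls. On the two BrowseComp benchmarks, HLE, and Frames, larger open-source models still lead.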

Models	GAIA-Text	BrowseComp	BrowseComp-ZH	HLE	Frames	Webwalker	Seal-0	Xbench-DS
Commercial Models
Claude-4-Sonnet	68.3	12.2	29.1	20.3	80.7	61.7	-	64.6
Claude-4.5-Sonnet	71.2	19.6	40.8	24.5	85.0	-	53.4	66.0
DeepSeek-V3.2	63.5	67.6	65.0	40.8	80.2	-	38.5	71.0
DeepSeek-V3.1	63.1	30.0	49.2	29.8	83.7	61.2	-	71.0
Minimax-M2	75.7	44.0	48.5	31.8	-	-	-	72.0
OpenAI-GPT-5-high	76.4	54.9	65.0	35.2	-	-	51.4	77.8
GLM-4.6	71.9	45.1	49.5	30.4	-	-	-	70.0
Kimi-Researcher	-	-	-	26.9	78.8	-	36.0	69.0
Kimi-K2-0905	60.2	7.4	22.2	21.7	58.1	-	25.2	61.0
Open-Source Models
Mirothinker 8B	66.4	31.1	40.2	21.5	80.6	60.6	40.4	60.6
Tongyi DeepResearch 30B	70.9	43.4	46.7	32.9	90.6	72.2	-	75.0
ASearcher QWQ v2 32B	58.7	-	-	-	74.5	-	-	51.1
WebSailor 30B	53.2	-	-	-	-	-	-	53.3
WebDancer 32B (QwQ)	51.5	3.8	18.0	-	-	47.9	-	38.3
WebExplorer 8B	50.0	15.7	32.0	17.3	75.7	62.7	-	53.7
DeepMiner 32B	58.7	33.5	40.1	-	-	-	-	62.0
AFM-RL 32B	55.3	11.1	-	18.0	-	63.0	-	-
SFR-DeepResearch 20B	66.0	-	-	28.7	82.8	-	-	-
AgentCPM-Explore 4B	63.9	24.1	29.1	19.1	82.7	68.1	40.5	70.0
LiteResearcher-4B	71.3	27.5*	32.5*	22.0	83.1	72.7	41.8	78.0

Best open-source results in bold. Results with * use a 64k context window with a memory mechanism.

Training

Curriculum keeps signal alive.

Our difficulty-aware curriculum prevents training saturation: after Stage 1 plateaus, Stage 2 with adjusted difficulty yields a further +3.6% in GAIA accuracy, demonstrating the importance of progressive curriculum design.

Figure: Training dynamics across stages. GAIA accuracy across training stages, showing continued improvement with curriculum learning.

Trajectories

Watch it think.

A few of our trajectories — pick one and watch LiteResearcher-4B think, search, and answer.

Washington county-seat population diff (2020 census)

BibTeX.

@article{li2026literesearcher,
  title={LiteResearcher: A Scalable Agentic RL Training Framework for Deep Research Agent},
  author={Li, Wanli and Qu, Bince and Pan, Bo and Zhang, Jianyu and Liu, Zheng and Zhang, Pan and Chen, Wei and Zhang, Bo},
  journal={arXiv preprint arXiv:2604.17931},
  year={2026}
}