Is Grep All You Need? The Harness Matters More Than the Search

Plain grep beats vector-based semantic search for LLM agents on long-memory question answering, sometimes by more than 20 points. The more useful finding underneath is that the agent harness, including how results get delivered to the model, matters as much as which retrieval method you pick.

That comes from a new PwC study, “Is Grep All You Need? How Agent Harnesses Reshape Agentic Search.” The team ran 116 questions from the LongMemEval benchmark across 4 harnesses, 5 models, and 2 retrieval methods, then measured what actually moved accuracy.

On the Chronos harness with inline delivery, lexical grep hit 83.6 to 93.1% accuracy across the five models; vector search landed at 62.9 to 83.6%
Biggest single gap: Gemini 3.1 Flash-Lite on the Chronos harness scored 86.2% with grep vs 62.9% with vector, a 23-point swing
The same model, Claude Opus 4.6, scored 93.1% on the Chronos harness vs 76.7% on Claude Code, using the identical corpus and retrieval method
Switching result delivery from inline to file-based could “invert or erase the lexical advantage without any change to the corpus”

Grep vs Vector, Head to Head

The headline comparison is inline delivery on the Chronos harness, where retrieval results are dropped straight into the model’s context. Across every model tested, lexical search won.

Grep vs Vector Accuracy (Chronos harness, inline delivery; LongMemEval, 116 questions)

Model                    Grep     Vector
Claude Opus 4.6          93.1%    83.6%
GPT-5.4                  89.7%    81.9%
Gemini 3.1 Pro           91.4%    82.8%
Gemini 3.1 Flash-Lite    86.2%    62.9%
Claude Haiku 4.5         83.6%    76.7%
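
The per-model gaps follow directly from the figures above. A short Python sketch of that arithmetic, using only the reported Chronos inline numbers (the dictionary layout and variable names are illustrative choices, not anything from the study):

```python
# Accuracy (%) on the Chronos harness with inline delivery, as reported above.
# Tuple order: (grep, vector).
scores = {
    "Claude Opus 4.6": (93.1, 83.6),
    "GPT-5.4": (89.7, 81.9),
    "Gemini 3.1 Pro": (91.4, 82.8),
    "Gemini 3.1 Flash-Lite": (86.2, 62.9),
    "Claude Haiku 4.5": (83.6, 76.7),
}

# Gap in percentage points, rounded to one decimal to hide float noise.
gaps = {model: round(grep - vector, 1) for model, (grep, vector) in scores.items()}

# Print models from largest to smallest grep advantage.
for model, gap in sorted(gaps.items(), key=lambda item: item[1], reverse=True):
    print(f"{model:<24}{gap:>5.1f} points")
```

Gemini 3.1 Flash-Lite comes out on top at 23.3 points, and every gap is positive, which matches the claim that grep won for all five models in this configuration.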

The intuition is simple. LongMemEval questions tend to hinge on specific entities (names, dates, numbers) that appear verbatim in the conversation history. Exact string matching nails those. Semantic embeddings, by smoothing everything into a similarity space, can pull back plausible-but-wrong passages and miss the literal token the answer depends on.
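
To make the lexical side concrete, here is a minimal sketch of grep-style search over a conversation history, written in Python with the standard `re` module. The history, the `grep_history` helper, and the queries are all invented for illustration; this is not the study's tooling, just the general shape of exact matching on a literal token.

```python
import re

# Hypothetical conversation history; these turns are invented for illustration.
history = [
    "user: I booked the dentist for March 14.",
    "user: My sister Priya is visiting next month.",
    "assistant: Noted, I'll remind you about the appointment.",
    "user: The flight to Lisbon costs 420 euros.",
]

def grep_history(pattern, turns, ignore_case=True):
    """Return (0-based index, turn) pairs whose text matches the regex."""
    flags = re.IGNORECASE if ignore_case else 0
    compiled = re.compile(pattern, flags)
    return [(i, turn) for i, turn in enumerate(turns) if compiled.search(turn)]

# A question about the Lisbon flight only needs the literal entity name.
hits = grep_history(r"lisbon", history)
print(hits)
```

If the entity is in the history verbatim, the match is exact and nothing else comes back. If the question paraphrases it (say, "the Portugal trip"), this approach returns nothing, which is the kind of case the caveats below are about.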

The Harness Is the Hidden Variable

Here is the part that should change how teams think about retrieval. Hold the model and the corpus fixed, swap only the harness, and accuracy moves by double digits.

Same model, same corpus, different harness (Claude Opus 4.6, grep, inline delivery)

Harness        Accuracy
Chronos        93.1%
Claude Code    76.7%

A 16-point gap from the wrapper alone. The harness controls prompting, tool framing, and how much of the context window the retrieved results consume, and those decisions rival the retrieval method itself.

The cleanest demonstration is delivery mode. Inline delivery dumps results directly into the conversation, where they compete with the system prompt and history for context space. File-based delivery writes results to disk and makes the agent fetch them with extra tool calls, decoupling result size from context pressure. Switching from inline to file-based was enough to erase or even reverse grep’s lead in several configurations, with no change to the underlying data. The Codex CLI run with GPT-5.4 is the starkest: grep dropped from 93.1% inline to 55.2% file-based.
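
The difference between the two delivery modes can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the function names, the `search_results.txt` filename, and the fake result set are hypothetical, and none of the harnesses in the study are claimed to work exactly this way.

```python
import tempfile
from pathlib import Path

def deliver_inline(results):
    """Inline delivery: the full result text becomes the tool output."""
    return "\n".join(results)

def deliver_file_based(results, directory):
    """File-based delivery: write results to disk, return only a short pointer."""
    path = Path(directory) / "search_results.txt"  # hypothetical filename
    path.write_text("\n".join(results), encoding="utf-8")
    return f"{len(results)} results written to {path}"

# Hypothetical retrieval output: 200 matching lines of conversation history.
results = [f"turn {i}: ... matched text ..." for i in range(200)]

inline_message = deliver_inline(results)
with tempfile.TemporaryDirectory() as tmp:
    file_message = deliver_file_based(results, tmp)
    # The agent would need a further tool call to read the file back.
    fetched = Path(tmp, "search_results.txt").read_text(encoding="utf-8")

# Compare how many characters each mode places in the context up front.
print(len(inline_message), len(file_message))
```

The inline message grows with the number of hits, while the file-based message stays a single short line. The cost is that the model only sees the content if it chooses to make the follow-up read.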

Caveats

The result is scoped to long-term conversational memory QA, where answers live as exact tokens in the history. The authors are explicit that it may not transfer to scientific synthesis, paraphrased retrieval, or tasks where semantic generalization is the whole point, all of which are settings that play to vector search’s strengths. A second scaling experiment that injected more noise showed vector search occasionally pulling ahead, along with non-monotonic accuracy curves, so “grep always wins” is the wrong takeaway.

The evaluation is also a 116-question subset graded by an LLM (GPT-4o), with some Codex vector configurations still incomplete, and the authors note they cannot isolate exactly which query patterns drive the gaps without trace-level analysis.
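
The subset size also sets the granularity of every number quoted here. A quick Python sketch of the arithmetic follows; the question counts it prints are inferred by rounding the reported percentages and are not figures taken from the study.

```python
TOTAL_QUESTIONS = 116

def correct_answers(accuracy_percent, total=TOTAL_QUESTIONS):
    """Convert a reported accuracy percentage back to a whole number of questions."""
    return round(accuracy_percent / 100 * total)

# How many percentage points a single question is worth on this subset.
points_per_question = 100 / TOTAL_QUESTIONS

print(f"one question = {points_per_question:.2f} points")
print(correct_answers(93.1), correct_answers(83.6), correct_answers(62.9))
```

Each question is worth a little under one percentage point, so a gap of a point or two between configurations can come down to one or two answers.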

Still, the practical lesson lands: before reaching for an embedding index, try grep, and pay as much attention to how results reach the model as to how they are retrieved. The wrapper around your agent may be doing more for you, or against you, than the retrieval algorithm inside it.