Web search adds real-time information to your AI. Most apps don't need it. The ones that do, need it badly. The mistake teams make is bolting it on by default, then paying for latency and tokens they didn't need to spend. The opposite mistake is rarer but worse: refusing to add web search when your users are asking about today's news.
This chapter is about the line between those two failure modes, and how to pick a provider once you've decided you're on the right side of it.
When web search is necessary
A few use cases genuinely require live web data:
- Real-time information. Stock prices, news, sports scores, weather. Anything where yesterday's answer is wrong today.
- Open-domain queries beyond your knowledge base (KB). A general-purpose assistant that takes any question can't pre-index the entire internet.
- Research tasks. When the answer isn't in any pre-existing dataset, the agent has to go find it.
- Fact-checking against current sources. Even if your KB has the topic, citations to live URLs matter for trust.
If your product description includes the words "current," "latest," or "today," you probably need web search.
When web search is over-engineering
The opposite cases are equally clear:
- Your KB already covers 90%+ of expected queries. A docs assistant for a specific product, a customer support bot, an internal HR tool. The corpus is closed.
- Cost or latency is a hard constraint. Each search call adds 500ms to 2s of latency and $0.001 to $0.01 per query. At a million queries a month, even $0.005 per search is $5,000 before you've paid for a single token. For a high-volume app, that math gets ugly fast.
- The use case is bounded. "Help users navigate our product" doesn't need the open web.
A simple decision rule
Track your query log for a week. Tag each query as "answerable from KB" or "needs external info." If fewer than 10% of queries fall into the second bucket, web search is premature. Spend that engineering effort on better retrieval instead.
If 30%+ of queries need external info, you have a real case. Anywhere in between, build a fallback: try the KB first and escalate to the web only when retrieval scores are low, as sketched below.
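A minimal sketch of that fallback. `kb_search`, `web_search`, and the threshold are all hypothetical stand-ins: swap in your retriever, your provider call, and a cutoff tuned against your own eval set.

```python
RETRIEVAL_THRESHOLD = 0.75  # hypothetical cutoff; tune against your eval set

def kb_search(query):
    # Hypothetical: query your vector store, return (results, top_score).
    ...

def web_search(query):
    # Hypothetical: call your chosen provider, return [{title, url, snippet}].
    ...

def retrieve(query):
    """KB first; escalate to the web only when retrieval confidence is low."""
    results, top_score = kb_search(query)
    if top_score >= RETRIEVAL_THRESHOLD:
        return results
    return web_search(query)
```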
Search provider comparison
The market has consolidated around a handful of usable APIs. Here's the honest landscape:
| Provider | Strength | Pricing | Output |
|---|---|---|---|
| Brave Search API | Privacy-focused, good general web coverage | $3 to $9 per 1k queries | Snippets + URLs |
| Tavily | LLM-optimized, returns clean snippets | Free tier; paid $0.005 per query | LLM-ready summaries |
| Exa (formerly Metaphor) | Neural (semantic) search, strong for research | $10 per 1k searches | Snippets + content |
| SerpAPI | Google-grade results, structured | $50/mo for 5k searches | Full SERP, incl. People Also Ask |
| You.com / Perplexity API | Conversational answers | $5 per 1k queries | Pre-summarized answers |
A rough rule of thumb: Tavily for general LLM apps where you want clean inputs, Exa for research agents that need semantic match, SerpAPI when you specifically need Google's ranking, Brave when privacy or independence from Google matters, Perplexity/You.com when you want the search engine to do the summarization for you.
The agent loop pattern
The basic pattern is simple: search, then condition the model on the results.
```python
from openai import OpenAI

client = OpenAI()

def web_search(query):
    # Call your provider here; return a list of {title, url, snippet}.
    return [...]

def research_agent(question):
    results = web_search(question)
    # Include the URL with each snippet so the model can actually cite it.
    context = "\n\n".join(f"{r['title']} ({r['url']}): {r['snippet']}" for r in results)
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Answer based only on the provided search results. Cite the URL after each claim."},
            {"role": "user", "content": f"Search results:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return response.choices[0].message.content
```

That's the floor: single search, single answer, cited.
Multi-step search agents
One search often isn't enough. The first query returns surface-level results; the agent needs to read them, identify a gap, and search again with a refined query. This is where tool calling earns its keep (covered in Chapter 1.6). The model sees the initial results, decides "I need more on X," and emits another web_search tool call before composing its answer.
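A minimal sketch of that loop, reusing the `web_search` helper from the previous example. The tool schema and the `max_searches` cap are assumptions; the cap matters for reasons covered under cost management below.

```python
import json
from openai import OpenAI

client = OpenAI()

# Tool schema the model can call; web_search is the helper defined earlier.
TOOLS = [{
    "type": "function",
    "function": {
        "name": "web_search",
        "description": "Search the web. Call again with a refined query if results are thin.",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
}]

def multi_step_agent(question, max_searches=3):
    messages = [
        {"role": "system", "content": "Use web_search as needed, then answer. Cite a URL after each claim."},
        {"role": "user", "content": question},
    ]
    for _ in range(max_searches):
        response = client.chat.completions.create(
            model="gpt-4o-mini", messages=messages, tools=TOOLS
        )
        msg = response.choices[0].message
        if not msg.tool_calls:
            return msg.content  # the model decided it has enough to answer
        messages.append(msg)  # keep the tool call in the transcript
        for call in msg.tool_calls:
            query = json.loads(call.function.arguments)["query"]
            messages.append({
                "role": "tool",
                "tool_call_id": call.id,
                "content": json.dumps(web_search(query)),
            })
    # Cap reached: force a final answer from what's already gathered.
    final = client.chat.completions.create(model="gpt-4o-mini", messages=messages)
    return final.choices[0].message.content
```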
Multi-step loops are powerful, and also where things get expensive and weird. An agent that runs five searches per question costs five times as much and has five times as many chances to go off the rails. Respan tracing is built for exactly this shape of debugging: every search query, every result set, every model decision shows up in a single timeline so you can see why the agent went down a rabbit hole on query three.
The hallucination risk specific to web search
Here's the failure mode nobody talks about until it bites them: the LLM can ignore the search results and answer from its training data anyway. You hand it five fresh URLs about today's market close, and it confidently quotes a number from 2024 because the prompt didn't make it clear that the search results are the source of truth.
The defenses:
- Strict prompting. "Answer ONLY based on the provided search results. If the results don't contain the answer, say so."
- Citation enforcement. Require a URL after every factual claim. Reject responses without citations; a sketch follows this list.
- Claim-level verification. An evaluator checks that each cited claim actually appears in the linked source. This is the only defense that catches subtle drift.
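The first two defenses are cheap to mechanize. Here's a rough sketch of citation enforcement, assuming the model cites bare URLs inline; the claim-level evaluator is heavier and usually needs a second model pass.

```python
import re

URL_PATTERN = re.compile(r"https?://\S+")

def enforce_citations(answer, allowed_urls):
    """Reject answers that cite nothing, or cite URLs absent from the results."""
    cited = {u.rstrip(".,;)") for u in URL_PATTERN.findall(answer)}
    if not cited:
        return False  # no citations at all: reject and retry
    return cited <= set(allowed_urls)  # every cited URL must come from the results
```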
Cost management
Web search is the most expensive part of most LLM apps once it's in the loop. A few tactics:
- Cache aggressively. Queries repeat more than you'd think. A simple Redis layer keyed on the normalized query saves real money; see the sketch after this list.
- Tier your providers. First pass with a cheap provider (Brave or Tavily free tier), escalate to premium (SerpAPI, Exa) only when the first pass returns weak results.
- Limit fan-out. Cap the number of searches per user request. If your agent wants to do ten searches, it probably has a planning problem, not an information problem.
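A sketch combining the first two tactics. It assumes a local Redis; `cheap_search` and `premium_search` are hypothetical wrappers around whichever providers you tiered, and both the TTL and the weak-results test are knobs to tune.

```python
import hashlib
import json
import redis

cache = redis.Redis()  # assumes a local Redis instance
CACHE_TTL = 3600       # seconds; use shorter TTLs for time-sensitive domains

def normalize(query):
    return " ".join(query.lower().split())

def cached_tiered_search(query):
    key = "search:" + hashlib.sha256(normalize(query).encode()).hexdigest()
    if (hit := cache.get(key)) is not None:
        return json.loads(hit)  # cache hit: no provider call at all
    results = cheap_search(query)        # hypothetical: Brave or Tavily free tier
    if len(results) < 3:                 # crude weak-results test; tune for your data
        results = premium_search(query)  # hypothetical: SerpAPI or Exa
    cache.set(key, json.dumps(results), ex=CACHE_TTL)
    return results
```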
Wrap-up
Web search is the right answer for some apps and badly wrong for others. Decide based on your query log, not on what's fashionable. Pick a provider that matches your output shape: clean snippets for LLM consumption, structured SERPs for ranking work, neural search for semantic research. And whatever you build, instrument it, because multi-step agents fail in ways you can't reason about from logs alone.
Next up: model selection. When does a small local model beat a frontier API, and when is the opposite true?
