The smallest useful AI product is a function that takes a question and returns text from a model. Once you can write one and understand every line, the rest of this chapter is incremental.
A complete LLM call, line by line
Here is one call to OpenAI's GPT-4o-mini in Python:
```python
from openai import OpenAI

client = OpenAI(api_key="YOUR_OPENAI_KEY")  # or OpenAI() if OPENAI_API_KEY is set in your environment

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "You are a helpful customer support agent."},
        {"role": "user", "content": "Where is my order?"},
    ],
)
print(response.choices[0].message.content)
```

What each piece does:
- `from openai import OpenAI`: imports the OpenAI Python SDK.
- `client = OpenAI(api_key=...)`: creates an HTTP client that knows how to talk to OpenAI's servers. The API key authenticates you to your account.
- `client.chat.completions.create(...)`: the method that sends one prompt and gets one response back.
- `model="gpt-4o-mini"`: which model to use. Different models have different quality, speed, and price.
- `messages=[...]`: the conversation. A list of objects, each with a `role` and `content`.
  - `role: "system"` is instructions to the model about how to behave. The user does not see system messages.
  - `role: "user"` is what the user actually said.
  - `role: "assistant"` would be a previous response from the model (used in multi-turn conversations; see the sketch below).
- `response.choices[0].message.content`: the text the model produced.
That is one LLM call. Run it, and you get a string of text back.
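The `assistant` role is how multi-turn conversations work: the API is stateless, so each call must resend the prior turns. A minimal sketch, assuming the same support-agent setup (the order-number dialogue is invented):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Prior turns are replayed verbatim; the model only "remembers" what you resend.
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "You are a helpful customer support agent."},
        {"role": "user", "content": "Where is my order?"},
        {"role": "assistant", "content": "Happy to help! Could you share your order number?"},
        {"role": "user", "content": "It's 48213."},
    ],
)
print(response.choices[0].message.content)
```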
What is happening inside the model
You do not need to know the internals to build with LLMs. This mental model is enough:
- The model reads your input tokens (the system + user messages) and predicts the next token, then the next, until it decides to stop.
- A token is roughly a chunk of text. "Hello" is one token. "antidisestablishmentarianism" is several. On average, English text is ~0.75 words per token.
- The model has a context window, the maximum tokens it can read in one call. GPT-4o has a 128,000-token window. Once you exceed it, the API errors out and you have to drop or summarize older content.
- You pay per input token (what you send) and per output token (what the model produces). Output tokens are usually 3-5x more expensive.
When you see a slow response, it is usually because the model is generating a lot of output tokens, one at a time. Output token count is the lever that controls both cost and latency.
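Both counts are reported on every response via `response.usage`, and you can estimate input tokens before a call with OpenAI's `tiktoken` library. A rough sketch (the per-million-token prices below are placeholders; check the current pricing page):

```python
import tiktoken

# GPT-4o-family models use the o200k_base encoding; adjust for other model families.
enc = tiktoken.get_encoding("o200k_base")

prompt = "You are a helpful customer support agent.\nWhere is my order?"
input_tokens = len(enc.encode(prompt))

# Hypothetical prices in $ per 1M tokens; output tokens cost several times more.
PRICE_IN, PRICE_OUT = 0.15, 0.60
expected_output_tokens = 300  # your main cost and latency lever

cost = (input_tokens * PRICE_IN + expected_output_tokens * PRICE_OUT) / 1_000_000
print(f"{input_tokens} input tokens, estimated ${cost:.6f} per call")
```

After a real call, `response.usage.prompt_tokens` and `response.usage.completion_tokens` give the exact counts.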
What is a "prompt"
In the example above, the prompt is the entire messages list. People use the word "prompt" to mean two slightly different things:
- The whole input you send to the model (system + user + any prior turns)
- Specifically, the system message that defines behavior
Both usages are common, and it is usually clear from context which one is meant.
A well-written system message often has:
- A role description ("You are a customer support agent for TechCo")
- Constraints ("Only answer based on the provided context. If you do not know, say so.")
- Format requirements ("Respond in valid JSON with fields `answer` and `confidence`.")
- Examples (a few input/output pairs that show the expected behavior)
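Put together, a system message covering all four might look like this sketch (TechCo is the running example above; the shipping Q&A pair is invented):

```python
SYSTEM_MESSAGE = """You are a customer support agent for TechCo.

Only answer based on the provided context. If you do not know, say so.

Respond in valid JSON with fields "answer" and "confidence".

Example:
User: Do you ship to Canada?
Assistant: {"answer": "Yes, TechCo ships to Canada.", "confidence": "high"}
"""
```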
Prompt design is its own skill, covered in section 3.
Why one call is not a product
The example above works: a customer asks a question, the model answers, and v0 ships. Now imagine three months later:
- Six prompts, four files, no version control. A product manager asks "which prompt is currently in production?" Nobody is sure.
- A new prompt drops quality 20%. You want to roll back. There is nothing to roll back to.
- A customer screenshots a wrong refund quote. You cannot reproduce what the agent said because you have no logs.
- GPT-5 is cheaper than GPT-4o-mini. Switching means rewriting every call site.
- You added retrieval, then a tool call, then a verifier. Each is a separate file. When something breaks, debugging is a four-tab grep.
- Compliance asks for an audit trail. You have nothing.
Each of these is a layer the next sections add. Read them in order.
Try it yourself
Before moving on, install the OpenAI SDK and run the example:
```bash
pip install openai
export OPENAI_API_KEY=sk-...
```

Save the Python file as `first_call.py` and run `python first_call.py`. Change the system and user messages, run it again, and watch the output change. Try `gpt-4o` instead of `gpt-4o-mini` and see the quality difference (and the cost difference in your OpenAI dashboard).
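To make the comparison concrete, a small loop runs the same prompt through both models and prints token counts alongside each answer (a sketch; judge quality by reading the outputs):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

for model in ["gpt-4o-mini", "gpt-4o"]:
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "You are a helpful customer support agent."},
            {"role": "user", "content": "Where is my order?"},
        ],
    )
    usage = response.usage
    print(f"--- {model}: {usage.prompt_tokens} in, {usage.completion_tokens} out")
    print(response.choices[0].message.content)
```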
Next: calling LLMs in production
The next section, Calling LLMs in production, shows the gateway pattern: a single URL between your application and every LLM provider, so you get logging, fallbacks, caching, and cost caps for free.
Or jump to:
- Designing and managing prompts
- Workflows and tracing
- Measuring quality with evals
- Agents and tool use
Or back to the Chapter 1 hub.
