The smallest useful AI product is a function that takes a question and returns text from a model. Once you can write one and understand every line, the rest of this chapter is incremental.
A complete LLM call, line by line
Here is one call to OpenAI's GPT-4o-mini in Python:
```python
from openai import OpenAI

client = OpenAI(api_key="YOUR_OPENAI_KEY")  # or OpenAI() if OPENAI_API_KEY is set in your environment

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "You are a helpful customer support agent."},
        {"role": "user", "content": "Where is my order?"},
    ],
)
print(response.choices[0].message.content)
```

What each piece does:
- `from openai import OpenAI`: imports the OpenAI Python SDK.
- `client = OpenAI(api_key=...)`: creates an HTTP client that knows how to talk to OpenAI's servers. The API key authenticates you to your account.
- `client.chat.completions.create(...)`: the method that sends one prompt and gets one response back.
- `model="gpt-4o-mini"`: which model to use. Different models have different quality, speed, and price.
- `messages=[...]`: the conversation. A list of objects, each with a `role` and `content`.
  - `role: "system"` is instructions to the model about how to behave. The user does not see system messages.
  - `role: "user"` is what the user actually said.
  - `role: "assistant"` would be a previous response from the model (used in multi-turn conversations; see the sketch below).
- `response.choices[0].message.content`: the text the model produced.
That is one LLM call. Run it, and you get a string of text back.
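The `assistant` role is how multi-turn conversations work: the API is stateless, so each call must resend the prior turns. A minimal sketch, assuming the same support-agent setup (the order-number dialogue is invented):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Prior turns are replayed verbatim; the model only "remembers" what you resend.
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "You are a helpful customer support agent."},
        {"role": "user", "content": "Where is my order?"},
        {"role": "assistant", "content": "Happy to help! Could you share your order number?"},
        {"role": "user", "content": "It's 48213."},
    ],
)
print(response.choices[0].message.content)
```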
What is happening inside the model
You do not need to know the internals to build with LLMs. This mental model is enough:
- The model reads your input tokens (the system + user messages) and predicts the next token, then the next, until it decides to stop.
- A token is roughly a chunk of text. "Hello" is one token. "antidisestablishmentarianism" is several. On average, English text is ~0.75 words per token.
- The model has a context window, the maximum tokens it can read in one call. GPT-4o has a 128,000-token window. Once you exceed it, the API errors out and you have to drop or summarize older content.
- You pay per input token (what you send) and per output token (what the model produces). Output tokens are usually 3-5x more expensive.
When you see a slow response, it is usually because the model is generating a lot of output tokens, one at a time. Output token count is the lever that controls both cost and latency.
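Both counts are reported on every response via `response.usage`, and you can estimate input tokens before a call with OpenAI's `tiktoken` library. A rough sketch (the per-million-token prices below are placeholders; check the current pricing page):

```python
import tiktoken

# GPT-4o-family models use the o200k_base encoding; adjust for other model families.
enc = tiktoken.get_encoding("o200k_base")

prompt = "You are a helpful customer support agent.\nWhere is my order?"
input_tokens = len(enc.encode(prompt))

# Hypothetical prices in $ per 1M tokens; output tokens cost several times more.
PRICE_IN, PRICE_OUT = 0.15, 0.60
expected_output_tokens = 300  # your main cost and latency lever

cost = (input_tokens * PRICE_IN + expected_output_tokens * PRICE_OUT) / 1_000_000
print(f"{input_tokens} input tokens, estimated ${cost:.6f} per call")
```

After a real call, `response.usage.prompt_tokens` and `response.usage.completion_tokens` give the exact counts.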
What is a "prompt"
In the example above, the prompt is the entire messages list. People use the word "prompt" to mean two slightly different things:
- The whole input you send to the model (system + user + any prior turns)
- Specifically, the system message that defines behavior
Both usages are common, and it is usually clear from context which one is meant.
A well-written system message often has:
- A role description ("You are a customer support agent for TechCo")
- Constraints ("Only answer based on the provided context. If you do not know, say so.")
- Format requirements ("Respond in valid JSON with fields `answer` and `confidence`.")
- Examples (a few input/output pairs that show the expected behavior)
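Put together, a system message covering all four might look like this sketch (TechCo is the running example above; the shipping Q&A pair is invented):

```python
SYSTEM_MESSAGE = """You are a customer support agent for TechCo.

Only answer based on the provided context. If you do not know, say so.

Respond in valid JSON with fields "answer" and "confidence".

Example:
User: Do you ship to Canada?
Assistant: {"answer": "Yes, TechCo ships to Canada.", "confidence": "high"}
"""
```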
Prompt design is its own skill, covered in section 3.
Why one call is not a product
The example above works: a customer asks a question, the model answers, and v0 ships. Now imagine three months later:
- Six prompts, four files, no version control. A product manager asks "which prompt is currently in production?" Nobody is sure.
- A new prompt drops quality 20%. You want to roll back. There is nothing to roll back to.
- A customer screenshots a wrong refund quote. You cannot reproduce what the agent said because you have no logs.
- GPT-5 is cheaper than GPT-4o-mini. Switching means rewriting every call site.
- You added retrieval, then a tool call, then a verifier. Each is a separate file. When something breaks, debugging is a four-tab grep.
- Compliance asks for an audit trail. You have nothing.
Each of these is a layer the next sections add. Read them in order.
Try it yourself
Before moving on, install the OpenAI SDK and run the example:
```bash
pip install openai
export OPENAI_API_KEY=sk-...
```

Save the Python file as `first_call.py` and run `python first_call.py`. Change the system and user messages, run it again, and watch the output change. Try `gpt-4o` instead of `gpt-4o-mini` and see the quality difference (and the cost difference in your OpenAI dashboard).
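To make the comparison concrete, a small loop runs the same prompt through both models and prints token counts alongside each answer (a sketch; judge quality by reading the outputs):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

for model in ["gpt-4o-mini", "gpt-4o"]:
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "You are a helpful customer support agent."},
            {"role": "user", "content": "Where is my order?"},
        ],
    )
    usage = response.usage
    print(f"--- {model}: {usage.prompt_tokens} in, {usage.completion_tokens} out")
    print(response.choices[0].message.content)
```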
Next: calling LLMs in production
The next section, Calling LLMs in production, shows the gateway pattern: a single URL between your application and every LLM provider, so you get logging, fallbacks, caching, and cost caps for free.
Or jump to:
- Designing and managing prompts
- Workflows and tracing
- Measuring quality with evals
- Agents and tool use
Or back to the Chapter 1 hub.
