# Responses API
The Responses API is the recommended endpoint for generating model responses. It provides a modern, flexible interface compatible with OpenAI’s API format, supporting streaming, tool calling, and agent routing.
Recommended: Use the Responses API for all new integrations. It offers the most complete feature set and best developer experience.
## Create Response

Generates a model response for the given input.

`POST /v1/responses`

### Request Body
```json
{
  "model": "agent:assistant",
  "input": "What can you help me with today?"
}
```
### Limits and Timeouts

You can override server limits via metadata:
```json
{
  "model": "claude-opus-4-5-20251101",
  "input": "Summarize the last three games",
  "metadata": {
    "tool_limits": { "max_tool_calls": 8 },
    "timeout_ms": 120000
  }
}
```
Resolution order:

- `metadata.tool_limits.max_tool_calls` → request `max_tool_calls` → server config defaults
- `metadata.timeout_ms` → server config defaults
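The precedence above can be sketched as a small resolver. The function and parameter names here are illustrative, not part of the API; this only demonstrates the documented ordering:

```python
def resolve_limits(body: dict, server_max_tool_calls: int, server_timeout_ms: int) -> tuple[int, int]:
    """Apply the documented precedence: metadata overrides beat top-level
    request fields, which beat server config defaults (illustrative sketch)."""
    meta = body.get("metadata") or {}
    tool_limits = meta.get("tool_limits") or {}
    # metadata.tool_limits.max_tool_calls → request max_tool_calls → server default
    max_tool_calls = tool_limits.get(
        "max_tool_calls", body.get("max_tool_calls", server_max_tool_calls)
    )
    # metadata.timeout_ms → server default
    timeout_ms = meta.get("timeout_ms", server_timeout_ms)
    return max_tool_calls, timeout_ms
```
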
### Metadata Extensions

The metadata object supports additional fields for agent routing:

- `metadata.tool_headers` → per-request header overrides for agent MCP tools
- `metadata.prompt_vars` → simple `{{key}}` substitutions in the agent system prompt
#### Per-request tool headers (agent tools)
```json
{
  "model": "agent:assistant",
  "input": "Check the request id",
  "metadata": {
    "tool_headers": {
      "get-request-id": {
        "trace_id": "abc123",
        "request_id": "req-456"
      }
    }
  }
}
```
#### System prompt variables (agent only)
```json
{
  "model": "agent:assistant",
  "input": "What can you do?",
  "metadata": {
    "prompt_vars": {
      "user_id": "u_123",
      "tenant": "acme"
    }
  }
}
```
If the agent system prompt contains `{{user_id}}` or `{{tenant}}`, they are replaced with the provided values.
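The substitution behaves like a plain template fill. A minimal sketch of the semantics (the server performs the real substitution; this helper, and the assumption that unknown placeholders are left intact, are illustrative):

```python
import re

def apply_prompt_vars(system_prompt: str, prompt_vars: dict) -> str:
    """Replace {{key}} placeholders with values from metadata.prompt_vars.
    Placeholders without a matching key are left as-is (an assumption here)."""
    return re.sub(
        r"\{\{(\w+)\}\}",
        lambda m: str(prompt_vars.get(m.group(1), m.group(0))),
        system_prompt,
    )
```
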
### Parameters

| Field | Type | Required | Description |
|---|---|---|---|
| model | string | Yes | Model ID (see Supported Models), `agent:{agent_name}` for agent routing, or `agent:{agent_name}:{model_override}` to override an agent's model |
| input | string/array | No | Text input or array of input items |
| instructions | string | No | System prompt / developer message |
| stream | boolean | No | Enable streaming responses (default: false) |
| max_output_tokens | integer | No | Maximum tokens to generate |
| store | boolean | No | Store the response for later retrieval |
| metadata | object | No | Request metadata (see Metadata Extensions) |
| previous_response_id | string | No | Chain responses in a conversation |
| reasoning | object | No | Enable extended thinking/reasoning (see Reasoning) |
| tools | array | No | Tools the model may call (function or MCP tools) |
| tool_choice | string/object | No | How the model selects tools |
| parallel_tool_calls | boolean | No | Allow parallel tool calls |
| max_tool_calls | integer | No | Maximum number of tool calls (fallback if not provided in metadata) |
## Reasoning

The reasoning parameter enables extended thinking capabilities for supported models. When enabled, the model performs additional reasoning steps before generating its response, which can improve quality for complex tasks.

### Basic Usage
```json
{
  "model": "claude-sonnet-4-5-20250929",
  "input": "Solve this step by step: If a train travels 120 miles in 2 hours, then stops for 30 minutes, then travels another 90 miles in 1.5 hours, what is the average speed for the entire journey?",
  "reasoning": {
    "effort": "medium"
  }
}
```
### Reasoning Parameters

| Field | Type | Required | Description |
|---|---|---|---|
| effort | string | Yes | Reasoning intensity: "none", "low", "medium", or "high" |
### Effort Levels

| Level | Description |
|---|---|
| none | Disable reasoning (supported by OpenAI gpt-5.0+) |
| low | Light reasoning, suitable for simpler problems |
| medium | Balanced reasoning for most tasks |
| high | Maximum reasoning depth for complex problems |
Note: If the `reasoning` parameter is omitted, the provider's default behavior is used. For OpenAI gpt-5.1+, the default is `"none"` (no reasoning).
### Supported Models

Reasoning is supported on models with the reasoning capability:
- Anthropic: Claude Sonnet 3.7+, Claude Sonnet 4+, Claude Opus 4+
- Google: Gemini 2.5 Pro, Gemini 2.5 Flash, Gemini 3.0 Pro
- OpenAI: GPT-5.x, o-series models (o1, o3, o4)
Note: GPT-4.x models do not support the `reasoning` parameter and will return an error if it is provided. For other unsupported models, the `reasoning` parameter is silently ignored.
### Provider-Specific Behavior

Different providers implement reasoning differently:
| Provider | none | low | medium | high | Default (not specified) |
|---|---|---|---|---|---|
| OpenAI (gpt-5.0+) | No reasoning | Minimal reasoning | Balanced reasoning | Maximum reasoning | none (gpt-5.1) |
| gpt-oss (local) | Maps to low | Low thinking | Medium thinking | High thinking | medium |
| Anthropic | Disables thinking | ~1K token budget | ~8K token budget | ~24K token budget | No thinking |
| Google | Maps to low | Low budget | Medium budget | High budget | Provider default |
Note: gpt-oss models don't support fully disabling reasoning - `"none"` maps to `"low"` (minimal reasoning).
### Example with Streaming
```json
{
  "model": "claude-sonnet-4-5-20250929",
  "input": "Explain the proof of the Pythagorean theorem",
  "reasoning": {
    "effort": "high"
  },
  "stream": true
}
```
When streaming with reasoning enabled, you’ll receive response.reasoning_summary_text.delta events containing
the model’s reasoning process, followed by the regular response content.
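A client can accumulate the two delta types separately while consuming the stream. A sketch over already-parsed event payloads, assuming each event carries its name in a `type` field as OpenAI-style SDK streaming events do (the helper name is illustrative):

```python
def split_stream(events) -> tuple[str, str]:
    """Accumulate reasoning-summary deltas separately from answer deltas."""
    reasoning_parts, answer_parts = [], []
    for ev in events:
        if ev.get("type") == "response.reasoning_summary_text.delta":
            reasoning_parts.append(ev["delta"])
        elif ev.get("type") == "response.output_text.delta":
            answer_parts.append(ev["delta"])
    return "".join(reasoning_parts), "".join(answer_parts)
```
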
## Direct Model Calls

You can call models directly by specifying the model ID and optionally including MCP tools inline:
```json
{
  "model": "gpt-5.2",
  "input": [
    {
      "role": "user",
      "content": "Roll 2d4+1"
    }
  ],
  "tools": [
    {
      "type": "mcp",
      "server_label": "dmcp",
      "server_description": "A Dungeons and Dragons MCP server to assist with dice rolling.",
      "server_url": "https://dmcp-server.deno.dev/sse",
      "require_approval": "never"
    }
  ]
}
```
### MCP Tool Parameters

| Field | Type | Required | Description |
|---|---|---|---|
| type | string | Yes | Must be "mcp" |
| server_label | string | Yes | Identifier for the MCP server |
| server_description | string | No | Description of what the server provides |
| server_url | string | Yes | URL of the MCP server (SSE endpoint) |
| require_approval | string | No | Approval mode: "never", "always", or "auto" |
This approach is useful when you want to:
- Use a specific model without agent configuration
- Dynamically specify MCP tools per request
- Test new tools without modifying agent config
## Agent Routing

The recommended way to use the Responses API is through agent routing. Use the model field to route requests to configured agents:
```json
{
  "model": "agent:assistant",
  "input": "Help me with my task"
}
```
This routes to the agent named “assistant” and uses its configured model, system prompt, and tool access.
Benefits of agent routing:
- Pre-configured system prompts
- Automatic MCP tool access
- Centralized agent management
- No need to specify model or instructions per request
### Model Override

You can override an agent's configured model while still using its system prompt and tools by appending the model name:

`agent:{agent_name}:{model_override}`
Examples:
```json
// Use agent's default model
{
  "model": "agent:assistant",
  "input": "Hello!"
}

// Override with Claude
{
  "model": "agent:assistant:claude-haiku-4-5-20251001",
  "input": "Hello!"
}

// Override with gpt-5.2
{
  "model": "agent:assistant:gpt-5.2",
  "input": "Hello!"
}
```
This is useful when you want to:
- Test an agent’s prompts and tools with different models
- Use a faster/cheaper model for simple tasks
- Use a more capable model for complex tasks
- A/B test model performance with the same agent configuration
## Response Format

### Non-Streaming Response
```json
{
  "id": "resp_abc123",
  "object": "response",
  "created_at": 1705312200,
  "status": "completed",
  "model": "claude-sonnet-4-5-20250929",
  "output": [
    {
      "type": "message",
      "id": "msg_xyz789",
      "status": "completed",
      "role": "assistant",
      "content": [
        {
          "type": "output_text",
          "text": "The capital of France is Paris."
        }
      ]
    }
  ],
  "usage": {
    "input_tokens": 25,
    "output_tokens": 12,
    "total_tokens": 37
  }
}
```
### Response with Reasoning

When reasoning is enabled, the response includes a reasoning output item before the message:
```json
{
  "id": "resp_abc123",
  "object": "response",
  "created_at": 1705312200,
  "status": "completed",
  "model": "claude-sonnet-4-5-20250929",
  "output": [
    {
      "type": "reasoning",
      "id": "reasoning_def456",
      "status": "completed",
      "summary": [
        {
          "type": "summary_text",
          "text": "To solve this problem, I need to calculate the total distance and total time..."
        }
      ]
    },
    {
      "type": "message",
      "id": "msg_xyz789",
      "status": "completed",
      "role": "assistant",
      "content": [
        {
          "type": "output_text",
          "text": "The average speed for the entire journey is 42 mph."
        }
      ]
    }
  ],
  "usage": {
    "input_tokens": 45,
    "output_tokens": 156,
    "total_tokens": 201
  }
}
```
### Response Fields

| Field | Type | Description |
|---|---|---|
| id | string | Unique response identifier |
| object | string | Always "response" |
| created_at | integer | Unix timestamp of creation |
| status | string | One of: completed, failed, in_progress, cancelled |
| model | string | Model used for generation |
| output | array | Array of output items (messages, reasoning, function calls) |
| usage | object | Token usage statistics |
| error | object | Error details if status is "failed" |
### Output Item Types

| Type | Description |
|---|---|
| message | Assistant's response message with text content |
| reasoning | Model's reasoning/thinking process (when reasoning enabled) |
| function_call | A tool/function call made by the model |
| function_call_output | Result from a tool/function call |
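A consumer typically walks the output array and dispatches on these types. A minimal sketch that extracts only the assistant's text, skipping reasoning and tool-call items (field names follow the response examples above; the helper name is illustrative):

```python
def extract_text(response: dict) -> str:
    """Concatenate output_text parts from message items, ignoring
    reasoning, function_call, and function_call_output items."""
    parts = []
    for item in response.get("output", []):
        if item.get("type") == "message":
            for part in item.get("content", []):
                if part.get("type") == "output_text":
                    parts.append(part["text"])
    return "".join(parts)
```
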
## Streaming

When `stream: true`, the endpoint returns Server-Sent Events (SSE):

```bash
curl -X POST http://localhost:8080/v1/responses \
  -H "Content-Type: application/json" \
  -H "Accept: text/event-stream" \
  -d '{
    "model": "agent:assistant",
    "input": "Tell me a story",
    "stream": true
  }'
```
### Event Types

```
event: response.created
data: {"id":"resp_abc123","object":"response","status":"in_progress",...}

event: response.output_item.added
data: {"type":"message","id":"msg_xyz789","role":"assistant",...}

event: response.content_part.added
data: {"type":"output_text","text":""}

event: response.output_text.delta
data: {"delta":"Once upon"}

event: response.output_text.delta
data: {"delta":" a time..."}

event: response.output_text.done
data: {"text":"Once upon a time..."}
```
### Reasoning Events (when reasoning is enabled)

When reasoning is enabled, additional events are sent before the main response content:

```
event: response.reasoning_summary_part.added
data: {"item_id":"reasoning_abc","output_index":0,"summary_index":0,"part":{"type":"summary_text","text":""}}

event: response.reasoning_summary_text.delta
data: {"item_id":"reasoning_abc","output_index":0,"summary_index":0,"delta":"Let me think through this..."}

event: response.reasoning_summary_text.delta
data: {"item_id":"reasoning_abc","output_index":0,"summary_index":0,"delta":" First, I need to consider..."}

event: response.reasoning_summary_text.done
data: {"item_id":"reasoning_abc","output_index":0,"summary_index":0,"text":"Let me think through this... First, I need to consider..."}

event: response.output_item.done
data: {"type":"message","id":"msg_xyz789","status":"completed",...}

event: response.completed
data: {"id":"resp_abc123","status":"completed","usage":{...}}
```
### Event Sequence
Standard sequence:
1. `response.created` - Response object created
2. `response.output_item.added` - New output item (message or function call)
3. `response.content_part.added` - New content part added
4. `response.output_text.delta` - Text chunk (repeated)
5. `response.output_text.done` - Text content complete
6. `response.output_item.done` - Output item complete
7. `response.completed` - Full response complete
With reasoning enabled, reasoning events appear after `response.created` and before the main content:
1. `response.created`
2. `response.reasoning_summary_part.added` - Reasoning output started
3. `response.reasoning_summary_text.delta` - Reasoning text chunk (repeated)
4. `response.reasoning_summary_text.done` - Reasoning complete
5. `response.output_item.added` - Main response content begins
6. ... (standard content events)
7. `response.completed`
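When consuming the raw SSE stream without an SDK (e.g. from the curl example above), the `event:`/`data:` lines can be paired up with a small parser. This sketch assumes each event carries exactly one data line, which matches the examples in this document:

```python
import json

def parse_sse(raw: str) -> list[tuple[str, dict]]:
    """Pair each 'event:' line with its following 'data:' line and
    decode the JSON payload (single-data-line events only)."""
    events = []
    name = None
    for line in raw.splitlines():
        if line.startswith("event:"):
            name = line[len("event:"):].strip()
        elif line.startswith("data:") and name is not None:
            events.append((name, json.loads(line[len("data:"):].strip())))
            name = None
    return events
```
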
---
## Conversation Chaining
Chain multiple responses together using `previous_response_id` to maintain conversation context:
```bash
# First message
curl -X POST http://localhost:8080/v1/responses \
-H "Content-Type: application/json" \
-d '{
"model": "agent:assistant",
"input": "What is machine learning?"
}'
# Response includes "id": "resp_abc123"
# Follow-up message
curl -X POST http://localhost:8080/v1/responses \
-H "Content-Type: application/json" \
-d '{
"model": "agent:assistant",
"previous_response_id": "resp_abc123",
"input": "Can you give me a specific example?"
}'
```

## Examples

### Chat with an Agent

```bash
curl -X POST http://localhost:8080/v1/responses \
  -H "Content-Type: application/json" \
  -d '{
    "model": "agent:assistant",
    "input": "Hello! What can you help me with?"
  }'
```

### Agent with Model Override

```bash
curl -X POST http://localhost:8080/v1/responses \
  -H "Content-Type: application/json" \
  -d '{
    "model": "agent:assistant:gpt-5.2",
    "input": "Hello! What can you help me with?"
  }'
```

### Streaming Chat

```bash
curl -X POST http://localhost:8080/v1/responses \
  -H "Content-Type: application/json" \
  -H "Accept: text/event-stream" \
  -d '{
    "model": "agent:assistant",
    "input": "Explain how APIs work",
    "stream": true
  }'
```

### Multi-turn Conversation

```bash
# Ask a question
curl -X POST http://localhost:8080/v1/responses \
  -H "Content-Type: application/json" \
  -d '{
    "model": "agent:researcher",
    "input": "What are the main causes of climate change?"
  }'

# Follow up (using the response ID from above)
curl -X POST http://localhost:8080/v1/responses \
  -H "Content-Type: application/json" \
  -d '{
    "model": "agent:researcher",
    "previous_response_id": "resp_abc123",
    "input": "What solutions are being proposed?"
  }'
```

### Direct Model with MCP Tools

```bash
curl -X POST http://localhost:8080/v1/responses \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer <token>" \
  -d '{
    "model": "gpt-5.2",
    "input": [
      {
        "role": "user",
        "content": "Roll 2d4+1 for damage"
      }
    ],
    "tools": [
      {
        "type": "mcp",
        "server_label": "dmcp",
        "server_description": "A D&D MCP server for dice rolling",
        "server_url": "https://dmcp-server.deno.dev/sse",
        "require_approval": "never",
        "headers": {
          "X-API-Key": "your-api-key"
        },
        "tool_call_values": {
          "player_id": "player_123"
        }
      }
    ]
  }'
```
## Using the OpenAI SDK

The Responses API is compatible with the OpenAI SDK:

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="not-used",  # Archia uses Basic auth
    default_headers={"Authorization": "Basic <credentials>"}
)

response = client.responses.create(
    model="agent:assistant",
    input="What's the weather like today?"
)

print(response.output[0].content[0].text)
```

```typescript
import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "http://localhost:8080/v1",
  apiKey: "not-used",
  defaultHeaders: { Authorization: "Basic <credentials>" },
});

const response = await client.responses.create({
  model: "agent:assistant",
  input: "What's the weather like today?",
});

console.log(response.output[0].content[0].text);
```
## Langfuse Integration

Langfuse provides observability for LLM applications. You can trace Archia API calls to monitor performance, debug issues, and analyze usage.
### Python with Langfuse

```python
from openai import OpenAI
from langfuse import Langfuse

# Initialize clients
client = OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="not-used",
    default_headers={"Authorization": "Basic <credentials>"}
)

langfuse = Langfuse(
    public_key="pk-lf-...",
    secret_key="sk-lf-...",
    host="http://localhost:3000"
)

# Create a trace
trace = langfuse.trace(
    name="chat-with-agent",
    input={"prompt": "Hello!"},
    tags=["archia", "assistant"]
)

# Create a generation span
generation = trace.generation(
    name="responses-api-call",
    model="agent:assistant",
    input="Hello!"
)

# Make the API call
response = client.responses.create(
    model="agent:assistant",
    input="Hello!"
)

# Extract output and complete the trace
output_text = response.output[0].content[0].text
generation.end(
    output=output_text,
    usage={
        "input": response.usage.input_tokens,
        "output": response.usage.output_tokens,
        "total": response.usage.total_tokens
    }
)

# Flush traces
langfuse.flush()
```
### TypeScript with Langfuse

```typescript
import OpenAI from "openai";
import Langfuse from "langfuse";

const client = new OpenAI({
  baseURL: "http://localhost:8080/v1",
  apiKey: "not-used",
  defaultHeaders: { Authorization: "Basic <credentials>" },
});

const langfuse = new Langfuse({
  publicKey: "pk-lf-...",
  secretKey: "sk-lf-...",
  baseUrl: "http://localhost:3000",
});

// Create a trace
const trace = langfuse.trace({
  name: "chat-with-agent",
  input: { prompt: "Hello!" },
  tags: ["archia", "assistant"],
});

// Create a generation span
const generation = trace.generation({
  name: "responses-api-call",
  model: "agent:assistant",
  input: "Hello!",
});

// Make the API call
const response = await client.responses.create({
  model: "agent:assistant",
  input: "Hello!",
});

// Extract output and complete the trace
const outputText = response.output[0].content[0].text;
generation.end({
  output: outputText,
  usage: {
    input: response.usage.input_tokens,
    output: response.usage.output_tokens,
    total: response.usage.total_tokens,
  },
});

// Flush traces
await langfuse.flushAsync();
```
### Python with Langfuse Annotations

Using the `@observe` decorator for automatic tracing:

```python
from openai import OpenAI
from langfuse import Langfuse
from langfuse.decorators import observe

# Initialize clients
client = OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="not-used",
    default_headers={"Authorization": "Basic <credentials>"}
)

langfuse = Langfuse(
    public_key="pk-lf-...",
    secret_key="sk-lf-...",
    host="http://localhost:3000"
)

@observe(name="chat_with_agent")
def chat_with_agent(prompt: str, agent: str = "assistant") -> str:
    """Chat with an agent and return the response."""
    response = client.responses.create(
        model=f"agent:{agent}",
        input=prompt
    )
    output_text = response.output[0].content[0].text
    return output_text

@observe(name="multi_turn_conversation")
def multi_turn_conversation(messages: list[dict]) -> str:
    """Have a multi-turn conversation with an agent."""
    previous_response_id = None
    for msg in messages:
        if previous_response_id:
            response = client.responses.create(
                model="agent:assistant",
                previous_response_id=previous_response_id,
                input=msg["content"]
            )
        else:
            response = client.responses.create(
                model="agent:assistant",
                input=msg["content"]
            )
        previous_response_id = response.id
    return response.output[0].content[0].text

@observe(name="direct_model_call_with_tools")
def direct_model_call_with_tools(prompt: str) -> str:
    """Call a model directly with MCP tools."""
    response = client.responses.create(
        model="gpt-5.2",
        input=[{"role": "user", "content": prompt}],
        tools=[
            {
                "type": "mcp",
                "server_label": "dmcp",
                "server_description": "A Dungeons and Dragons MCP server",
                "server_url": "https://dmcp-server.deno.dev/sse",
                "require_approval": "never"
            }
        ]
    )
    return response.output[0].content[0].text

# Usage examples
if __name__ == "__main__":
    # Simple chat
    result = chat_with_agent("What is machine learning?")
    print(result)

    # Multi-turn conversation
    messages = [
        {"role": "user", "content": "What is machine learning?"},
        {"role": "user", "content": "Can you give me a specific example?"}
    ]
    result = multi_turn_conversation(messages)
    print(result)

    # Direct model call with tools
    result = direct_model_call_with_tools("Roll 2d4+1 for damage")
    print(result)

    # Flush traces to Langfuse
    langfuse.flush()
```
The `@observe` decorator automatically:
- Creates a trace for each function call
- Captures input and output
- Measures execution time
- Logs any errors that occur
- Tracks nested function calls as child spans
### What Langfuse Captures
| Field | Description |
|---|---|
| Model | Agent name (e.g., agent:assistant) |
| Input | The prompt sent to the API |
| Output | The response text |
| Usage | Token counts (input, output, total) |
| Tags | Filterable tags for organizing traces |
| Latency | Request duration |
| Metadata | Custom context and attributes |
For complete examples, see the `poc/shottracker/langfuse/` directory, which includes full Python and TypeScript implementations.
## Error Handling

### Error Response

```json
{
  "id": "resp_abc123",
  "status": "failed",
  "error": {
    "error_type": "invalid_request",
    "message": "Agent 'unknown-agent' not found"
  }
}
```
### Common Errors

| Error Type | Description |
|---|---|
| invalid_request | Malformed request or invalid parameters |
| agent_not_found | Agent routing failed - agent doesn't exist |
| rate_limit_exceeded | Too many requests |
| context_length_exceeded | Input too long for model |
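The error examples above report failures on the response object itself, via `status` and the `error` field. A minimal client-side guard might look like this (the helper and the choice of exception type are illustrative):

```python
def check_response(resp: dict) -> dict:
    """Raise if the response failed, surfacing error_type and message;
    otherwise return the response unchanged."""
    if resp.get("status") == "failed":
        err = resp.get("error") or {}
        raise RuntimeError(
            f"{err.get('error_type', 'unknown')}: {err.get('message', '')}"
        )
    return resp
```
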
## Next Steps

- Agents API → Manage agent configurations
- Tools API → Configure MCP tools
- Agent Configuration → Set up agents for routing