
Responses API

The Responses API is the recommended endpoint for generating model responses. It provides a modern, flexible interface compatible with OpenAI’s API format, supporting streaming, tool calling, and agent routing.

Recommended: Use the Responses API for all new integrations. It offers the most complete feature set and best developer experience.


Create Response

Generates a model response for the given input.

POST /v1/responses

Request Body

{
  "model": "agent:assistant",
  "input": "What can you help me with today?"
}

Limits and Timeouts

You can override server limits via metadata:

{
  "model": "claude-opus-4-5-20251101",
  "input": "Summarize the last three games",
  "metadata": {
    "tool_limits": { "max_tool_calls": 8 },
    "timeout_ms": 120000
  }
}

Resolution order:

  • metadata.tool_limits.max_tool_calls → request max_tool_calls → server config defaults
  • metadata.timeout_ms → server config defaults
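As a sketch, a client can mirror this precedence to predict the effective limit. The resolution itself happens server-side; the default value below is illustrative, not the real server configuration:

```python
def resolve_max_tool_calls(request: dict, server_default: int = 4) -> int:
    """Mirror the documented precedence:
    metadata.tool_limits.max_tool_calls -> top-level max_tool_calls -> server default.
    The server_default of 4 is illustrative, not the actual server config value."""
    limits = (request.get("metadata") or {}).get("tool_limits") or {}
    if "max_tool_calls" in limits:
        return limits["max_tool_calls"]
    if "max_tool_calls" in request:
        return request["max_tool_calls"]
    return server_default
```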

Metadata Extensions

The metadata object supports additional fields for agent routing:

  • metadata.tool_headers → per-request header overrides for agent MCP tools
  • metadata.prompt_vars → simple {{key}} substitutions in the agent system prompt

Per-request tool headers (agent tools)

{
  "model": "agent:assistant",
  "input": "Check the request id",
  "metadata": {
    "tool_headers": {
      "get-request-id": {
        "trace_id": "abc123",
        "request_id": "req-456"
      }
    }
  }
}

System prompt variables (agent only)

{
  "model": "agent:assistant",
  "input": "What can you do?",
  "metadata": {
    "prompt_vars": {
      "user_id": "u_123",
      "tenant": "acme"
    }
  }
}

If the agent system prompt contains {{user_id}} or {{tenant}}, they are replaced with the provided values.
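A minimal sketch of the substitution, assuming simple verbatim replacement. The behavior for placeholders without a matching key is not specified above; this sketch leaves them untouched:

```python
import re

def substitute_prompt_vars(system_prompt: str, prompt_vars: dict) -> str:
    """Replace {{key}} placeholders with values from metadata.prompt_vars.
    Placeholders with no matching key are left as-is (an assumption;
    the server's behavior for unknown keys may differ)."""
    return re.sub(
        r"\{\{(\w+)\}\}",
        lambda m: str(prompt_vars.get(m.group(1), m.group(0))),
        system_prompt,
    )
```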

Parameters

| Field | Type | Required | Description |
|---|---|---|---|
| model | string | Yes | Model ID (see Supported Models), agent:{agent_name} for agent routing, or agent:{agent_name}:{model_override} to override an agent's model |
| input | string/array | No | Text input or array of input items |
| instructions | string | No | System prompt / developer message |
| stream | boolean | No | Enable streaming responses (default: false) |
| max_output_tokens | integer | No | Maximum tokens to generate |
| store | boolean | No | Store the response for later retrieval |
| metadata | object | No | Request metadata (see Metadata Extensions) |
| previous_response_id | string | No | Chain responses in a conversation |
| reasoning | object | No | Enable extended thinking/reasoning (see Reasoning) |
| tools | array | No | Tools the model may call (function or MCP tools) |
| tool_choice | string/object | No | How the model selects tools |
| parallel_tool_calls | boolean | No | Allow parallel tool calls |
| max_tool_calls | integer | No | Maximum number of tool calls (fallback if not provided in metadata) |

Reasoning

The reasoning parameter enables extended thinking capabilities for supported models. When enabled, the model will perform additional reasoning steps before generating its response, which can improve quality for complex tasks.

Basic Usage

{
  "model": "claude-sonnet-4-5-20250929",
  "input": "Solve this step by step: If a train travels 120 miles in 2 hours, then stops for 30 minutes, then travels another 90 miles in 1.5 hours, what is the average speed for the entire journey?",
  "reasoning": {
    "effort": "medium"
  }
}

Reasoning Parameters

| Field | Type | Required | Description |
|---|---|---|---|
| effort | string | Yes | Reasoning intensity: "none", "low", "medium", or "high" |

Effort Levels

| Level | Description |
|---|---|
| none | Disable reasoning (supported by OpenAI gpt-5.0+) |
| low | Light reasoning, suitable for simpler problems |
| medium | Balanced reasoning for most tasks |
| high | Maximum reasoning depth for complex problems |

Note: If the reasoning parameter is omitted, the provider’s default behavior is used. For OpenAI gpt-5.1+, the default is "none" (no reasoning).

Supported Models

Reasoning is supported on models with the reasoning capability:

  • Anthropic: Claude Sonnet 3.7+, Claude Sonnet 4+, Claude Opus 4+
  • Google: Gemini 2.5 Pro, Gemini 2.5 Flash, Gemini 3.0 Pro
  • OpenAI: GPT-5.x, o-series models (o1, o3, o4)

Note: GPT-4.x models do not support the reasoning parameter and will return an error if it’s provided; for other unsupported models, the reasoning parameter is silently ignored.

Provider-Specific Behavior

Different providers implement reasoning differently:

| Provider | none | low | medium | high | Default (not specified) |
|---|---|---|---|---|---|
| OpenAI (gpt-5.0+) | No reasoning | Minimal reasoning | Balanced reasoning | Maximum reasoning | none (gpt-5.1) |
| gpt-oss (local) | Maps to low | Low thinking | Medium thinking | High thinking | medium |
| Anthropic | Disables thinking | ~1K token budget | ~8K token budget | ~24K token budget | No thinking |
| Google | Maps to low | Low budget | Medium budget | High budget | Provider default |

Note: gpt-oss models don’t support fully disabling reasoning; "none" maps to "low" (minimal reasoning).

Example with Streaming

{
  "model": "claude-sonnet-4-5-20250929",
  "input": "Explain the proof of the Pythagorean theorem",
  "reasoning": {
    "effort": "high"
  },
  "stream": true
}

When streaming with reasoning enabled, you’ll receive response.reasoning_summary_text.delta events containing the model’s reasoning process, followed by the regular response content.


Direct Model Calls

You can call models directly by specifying the model ID and optionally including MCP tools inline:

{
  "model": "gpt-5.2",
  "input": [
    {
      "role": "user",
      "content": "Roll 2d4+1"
    }
  ],
  "tools": [
    {
      "type": "mcp",
      "server_label": "dmcp",
      "server_description": "A Dungeons and Dragons MCP server to assist with dice rolling.",
      "server_url": "https://dmcp-server.deno.dev/sse",
      "require_approval": "never"
    }
  ]
}

MCP Tool Parameters

| Field | Type | Required | Description |
|---|---|---|---|
| type | string | Yes | Must be "mcp" |
| server_label | string | Yes | Identifier for the MCP server |
| server_description | string | No | Description of what the server provides |
| server_url | string | Yes | URL of the MCP server (SSE endpoint) |
| require_approval | string | No | Approval mode: "never", "always", or "auto" |

This approach is useful when you want to:

  • Use a specific model without agent configuration
  • Dynamically specify MCP tools per request
  • Test new tools without modifying agent config

Agent Routing

The recommended way to use the Responses API is through agent routing. Use the model field to route requests to configured agents:

{
  "model": "agent:assistant",
  "input": "Help me with my task"
}

This routes to the agent named “assistant” and uses its configured model, system prompt, and tool access.

Benefits of agent routing:

  • Pre-configured system prompts
  • Automatic MCP tool access
  • Centralized agent management
  • No need to specify model or instructions per request

Model Override

You can override an agent’s configured model while still using its system prompt and tools by appending the model name:

agent:{agent_name}:{model_override}

Examples:

// Use agent's default model
{
  "model": "agent:assistant",
  "input": "Hello!"
}

// Override with Claude
{
  "model": "agent:assistant:claude-haiku-4-5-20251001",
  "input": "Hello!"
}

// Override with gpt-5.2
{
  "model": "agent:assistant:gpt-5.2",
  "input": "Hello!"
}

This is useful when you want to:

  • Test an agent’s prompts and tools with different models
  • Use a faster/cheaper model for simple tasks
  • Use a more capable model for complex tasks
  • A/B test model performance with the same agent configuration

Response Format

Non-Streaming Response

{
  "id": "resp_abc123",
  "object": "response",
  "created_at": 1705312200,
  "status": "completed",
  "model": "claude-sonnet-4-5-20250929",
  "output": [
    {
      "type": "message",
      "id": "msg_xyz789",
      "status": "completed",
      "role": "assistant",
      "content": [
        {
          "type": "output_text",
          "text": "The capital of France is Paris."
        }
      ]
    }
  ],
  "usage": {
    "input_tokens": 25,
    "output_tokens": 12,
    "total_tokens": 37
  }
}

Response with Reasoning

When reasoning is enabled, the response includes a reasoning output item before the message:

{
  "id": "resp_abc123",
  "object": "response",
  "created_at": 1705312200,
  "status": "completed",
  "model": "claude-sonnet-4-5-20250929",
  "output": [
    {
      "type": "reasoning",
      "id": "reasoning_def456",
      "status": "completed",
      "summary": [
        {
          "type": "summary_text",
          "text": "To solve this problem, I need to calculate the total distance and total time..."
        }
      ]
    },
    {
      "type": "message",
      "id": "msg_xyz789",
      "status": "completed",
      "role": "assistant",
      "content": [
        {
          "type": "output_text",
          "text": "The average speed for the entire journey is 42 mph."
        }
      ]
    }
  ],
  "usage": {
    "input_tokens": 45,
    "output_tokens": 156,
    "total_tokens": 201
  }
}

Response Fields

| Field | Type | Description |
|---|---|---|
| id | string | Unique response identifier |
| object | string | Always "response" |
| created_at | integer | Unix timestamp of creation |
| status | string | One of: completed, failed, in_progress, cancelled |
| model | string | Model used for generation |
| output | array | Array of output items (messages, reasoning, function calls) |
| usage | object | Token usage statistics |
| error | object | Error details if status is "failed" |

Output Item Types

| Type | Description |
|---|---|
| message | Assistant’s response message with text content |
| reasoning | Model’s reasoning/thinking process (when reasoning enabled) |
| function_call | A tool/function call made by the model |
| function_call_output | Result from a tool/function call |
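Because output can contain reasoning and tool-call items alongside messages, indexing output[0] is only safe when reasoning and tools are off. A more defensive extraction sketch:

```python
def extract_output_text(response: dict) -> str:
    """Concatenate output_text parts from message items, skipping
    reasoning and function-call items in the output array."""
    parts = []
    for item in response.get("output", []):
        if item.get("type") != "message":
            continue  # skip reasoning, function_call, function_call_output
        for part in item.get("content", []):
            if part.get("type") == "output_text":
                parts.append(part["text"])
    return "".join(parts)
```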

Streaming

When stream: true, the endpoint returns Server-Sent Events (SSE):

curl -X POST http://localhost:8080/v1/responses \
  -H "Content-Type: application/json" \
  -H "Accept: text/event-stream" \
  -d '{
    "model": "agent:assistant",
    "input": "Tell me a story",
    "stream": true
  }'

Event Types

event: response.created
data: {"id":"resp_abc123","object":"response","status":"in_progress",...}

event: response.output_item.added
data: {"type":"message","id":"msg_xyz789","role":"assistant",...}

event: response.content_part.added
data: {"type":"output_text","text":""}

event: response.output_text.delta
data: {"delta":"Once upon"}

event: response.output_text.delta
data: {"delta":" a time..."}

event: response.output_text.done
data: {"text":"Once upon a time..."}

Reasoning Events (when reasoning is enabled)

When reasoning is enabled, additional events are sent before the main response content:

event: response.reasoning_summary_part.added
data: {"item_id":"reasoning_abc","output_index":0,"summary_index":0,"part":{"type":"summary_text","text":""}}

event: response.reasoning_summary_text.delta
data: {"item_id":"reasoning_abc","output_index":0,"summary_index":0,"delta":"Let me think through this..."}

event: response.reasoning_summary_text.delta
data: {"item_id":"reasoning_abc","output_index":0,"summary_index":0,"delta":" First, I need to consider..."}

event: response.reasoning_summary_text.done
data: {"item_id":"reasoning_abc","output_index":0,"summary_index":0,"text":"Let me think through this... First, I need to consider..."}

event: response.output_item.done
data: {"type":"message","id":"msg_xyz789","status":"completed",...}

event: response.completed
data: {"id":"resp_abc123","status":"completed","usage":{...}}


### Event Sequence

Standard sequence:

1. `response.created` - Response object created
2. `response.output_item.added` - New output item (message or function call)
3. `response.content_part.added` - New content part added
4. `response.output_text.delta` - Text chunk (repeated)
5. `response.output_text.done` - Text content complete
6. `response.output_item.done` - Output item complete
7. `response.completed` - Full response complete

With reasoning enabled, reasoning events appear after `response.created` and before the main content:

1. `response.created`
2. `response.reasoning_summary_part.added` - Reasoning output started
3. `response.reasoning_summary_text.delta` - Reasoning text chunk (repeated)
4. `response.reasoning_summary_text.done` - Reasoning complete
5. `response.output_item.added` - Main response content begins
6. ... (standard content events)
7. `response.completed`
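These sequences can be assembled client-side with a small parser. The sketch below accumulates delta events from raw SSE lines and ignores event types it doesn't recognize:

```python
import json

def collect_stream(sse_lines: list[str]) -> tuple[str, str]:
    """Accumulate reasoning and output text from raw SSE lines.
    Returns (reasoning_text, output_text); other events are ignored."""
    reasoning, text = [], []
    event = None
    for line in sse_lines:
        if line.startswith("event: "):
            event = line[len("event: "):].strip()
        elif line.startswith("data: ") and event:
            data = json.loads(line[len("data: "):])
            if event == "response.output_text.delta":
                text.append(data["delta"])
            elif event == "response.reasoning_summary_text.delta":
                reasoning.append(data["delta"])
    return "".join(reasoning), "".join(text)
```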

---

## Conversation Chaining

Chain multiple responses together using `previous_response_id` to maintain conversation context:

```bash
# First message
curl -X POST http://localhost:8080/v1/responses \
  -H "Content-Type: application/json" \
  -d '{
    "model": "agent:assistant",
    "input": "What is machine learning?"
  }'
# Response includes "id": "resp_abc123"

# Follow-up message
curl -X POST http://localhost:8080/v1/responses \
  -H "Content-Type: application/json" \
  -d '{
    "model": "agent:assistant",
    "previous_response_id": "resp_abc123",
    "input": "Can you give me a specific example?"
  }'
```

Examples

Chat with an Agent

curl -X POST http://localhost:8080/v1/responses \
  -H "Content-Type: application/json" \
  -d '{
    "model": "agent:assistant",
    "input": "Hello! What can you help me with?"
  }'

Agent with Model Override

curl -X POST http://localhost:8080/v1/responses \
  -H "Content-Type: application/json" \
  -d '{
    "model": "agent:assistant:gpt-5.2",
    "input": "Hello! What can you help me with?"
  }'

Streaming Chat

curl -X POST http://localhost:8080/v1/responses \
  -H "Content-Type: application/json" \
  -H "Accept: text/event-stream" \
  -d '{
    "model": "agent:assistant",
    "input": "Explain how APIs work",
    "stream": true
  }'

Multi-turn Conversation

# Ask a question
curl -X POST http://localhost:8080/v1/responses \
  -H "Content-Type: application/json" \
  -d '{
    "model": "agent:researcher",
    "input": "What are the main causes of climate change?"
  }'

# Follow up (using the response ID from above)
curl -X POST http://localhost:8080/v1/responses \
  -H "Content-Type: application/json" \
  -d '{
    "model": "agent:researcher",
    "previous_response_id": "resp_abc123",
    "input": "What solutions are being proposed?"
  }'

Direct Model with MCP Tools

curl -X POST http://localhost:8080/v1/responses \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer <token>" \
  -d '{
    "model": "gpt-5.2",
    "input": [
      {
        "role": "user",
        "content": "Roll 2d4+1 for damage"
      }
    ],
    "tools": [
      {
        "type": "mcp",
        "server_label": "dmcp",
        "server_description": "A D&D MCP server for dice rolling",
        "server_url": "https://dmcp-server.deno.dev/sse",
        "require_approval": "never",
        "headers": {
          "X-API-Key": "your-api-key"
        },
        "tool_call_values": {
          "player_id": "player_123"
        }
      }
    ]
  }'

Using the OpenAI SDK

The Responses API is compatible with the OpenAI SDK:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="not-used",  # Archia uses Basic auth
    default_headers={"Authorization": "Basic <credentials>"}
)

response = client.responses.create(
    model="agent:assistant",
    input="What's the weather like today?"
)

print(response.output[0].content[0].text)

And the equivalent in TypeScript:

import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "http://localhost:8080/v1",
  apiKey: "not-used",
  defaultHeaders: { Authorization: "Basic <credentials>" },
});

const response = await client.responses.create({
  model: "agent:assistant",
  input: "What's the weather like today?",
});

console.log(response.output[0].content[0].text);

Langfuse Integration

Langfuse provides observability for LLM applications. You can trace Archia API calls to monitor performance, debug issues, and analyze usage.

Python with Langfuse

from openai import OpenAI
from langfuse import Langfuse

# Initialize clients
client = OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="not-used",
    default_headers={"Authorization": "Basic <credentials>"}
)

langfuse = Langfuse(
    public_key="pk-lf-...",
    secret_key="sk-lf-...",
    host="http://localhost:3000"
)

# Create a trace
trace = langfuse.trace(
    name="chat-with-agent",
    input={"prompt": "Hello!"},
    tags=["archia", "assistant"]
)

# Create a generation span
generation = trace.generation(
    name="responses-api-call",
    model="agent:assistant",
    input="Hello!"
)

# Make the API call
response = client.responses.create(
    model="agent:assistant",
    input="Hello!"
)

# Extract output and complete the trace
output_text = response.output[0].content[0].text
generation.end(
    output=output_text,
    usage={
        "input": response.usage.input_tokens,
        "output": response.usage.output_tokens,
        "total": response.usage.total_tokens
    }
)

# Flush traces
langfuse.flush()

TypeScript with Langfuse

import OpenAI from "openai";
import Langfuse from "langfuse";

const client = new OpenAI({
  baseURL: "http://localhost:8080/v1",
  apiKey: "not-used",
  defaultHeaders: { Authorization: "Basic <credentials>" },
});

const langfuse = new Langfuse({
  publicKey: "pk-lf-...",
  secretKey: "sk-lf-...",
  baseUrl: "http://localhost:3000",
});

// Create a trace
const trace = langfuse.trace({
  name: "chat-with-agent",
  input: { prompt: "Hello!" },
  tags: ["archia", "assistant"],
});

// Create a generation span
const generation = trace.generation({
  name: "responses-api-call",
  model: "agent:assistant",
  input: "Hello!",
});

// Make the API call
const response = await client.responses.create({
  model: "agent:assistant",
  input: "Hello!",
});

// Extract output and complete the trace
const outputText = response.output[0].content[0].text;
generation.end({
  output: outputText,
  usage: {
    input: response.usage.input_tokens,
    output: response.usage.output_tokens,
    total: response.usage.total_tokens,
  },
});

// Flush traces
await langfuse.flushAsync();

Python with Langfuse Annotations

Using the @observe decorator for automatic tracing:

from openai import OpenAI
from langfuse import Langfuse
from langfuse.decorators import observe

# Initialize clients
client = OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="not-used",
    default_headers={"Authorization": "Basic <credentials>"}
)

langfuse = Langfuse(
    public_key="pk-lf-...",
    secret_key="sk-lf-...",
    host="http://localhost:3000"
)

@observe(name="chat_with_agent")
def chat_with_agent(prompt: str, agent: str = "assistant") -> str:
    """Chat with an agent and return the response."""
    response = client.responses.create(
        model=f"agent:{agent}",
        input=prompt
    )
    
    output_text = response.output[0].content[0].text
    return output_text

@observe(name="multi_turn_conversation")
def multi_turn_conversation(messages: list[dict]) -> str:
    """Have a multi-turn conversation with an agent."""
    previous_response_id = None
    
    for msg in messages:
        if previous_response_id:
            response = client.responses.create(
                model="agent:assistant",
                previous_response_id=previous_response_id,
                input=msg["content"]
            )
        else:
            response = client.responses.create(
                model="agent:assistant",
                input=msg["content"]
            )
        
        previous_response_id = response.id
    
    return response.output[0].content[0].text

@observe(name="direct_model_call_with_tools")
def direct_model_call_with_tools(prompt: str) -> str:
    """Call a model directly with MCP tools."""
    response = client.responses.create(
        model="gpt-5.2",
        input=[{"role": "user", "content": prompt}],
        tools=[
            {
                "type": "mcp",
                "server_label": "dmcp",
                "server_description": "A Dungeons and Dragons MCP server",
                "server_url": "https://dmcp-server.deno.dev/sse",
                "require_approval": "never"
            }
        ]
    )
    
    return response.output[0].content[0].text

# Usage examples
if __name__ == "__main__":
    # Simple chat
    result = chat_with_agent("What is machine learning?")
    print(result)
    
    # Multi-turn conversation
    messages = [
        {"role": "user", "content": "What is machine learning?"},
        {"role": "user", "content": "Can you give me a specific example?"}
    ]
    result = multi_turn_conversation(messages)
    print(result)
    
    # Direct model call with tools
    result = direct_model_call_with_tools("Roll 2d4+1 for damage")
    print(result)
    
    # Flush traces to Langfuse
    langfuse.flush()

The @observe decorator automatically:

  • Creates a trace for each function call
  • Captures input and output
  • Measures execution time
  • Logs any errors that occur
  • Tracks nested function calls as child spans

What Langfuse Captures

| Field | Description |
|---|---|
| Model | Agent name (e.g., agent:assistant) |
| Input | The prompt sent to the API |
| Output | The response text |
| Usage | Token counts (input, output, total) |
| Tags | Filterable tags for organizing traces |
| Latency | Request duration |
| Metadata | Custom context and attributes |

For complete examples, see the poc/shottracker/langfuse/ directory which includes full Python and TypeScript implementations.


Error Handling

Error Response

{
  "id": "resp_abc123",
  "status": "failed",
  "error": {
    "error_type": "invalid_request",
    "message": "Agent 'unknown-agent' not found"
  }
}

Common Errors

| Error Type | Description |
|---|---|
| invalid_request | Malformed request or invalid parameters |
| agent_not_found | Agent routing failed; the agent doesn’t exist |
| rate_limit_exceeded | Too many requests |
| context_length_exceeded | Input too long for model |
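A client-side check based on the error shape above. This is a sketch: the field names follow the documented example, and the exception type is an arbitrary choice:

```python
def raise_for_response_error(response: dict) -> dict:
    """Pass completed responses through; raise when status == "failed",
    using the documented error.error_type and error.message fields."""
    if response.get("status") == "failed":
        err = response.get("error") or {}
        raise RuntimeError(
            f"{err.get('error_type', 'unknown')}: {err.get('message', '')}"
        )
    return response
```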

Next Steps