
Responses API

The Responses API is the recommended endpoint for generating model responses. It provides a modern, flexible interface compatible with OpenAI’s API format, supporting streaming, tool calling, and agent routing.

Recommended: Use the Responses API for all new integrations. It offers the most complete feature set and best developer experience.


Create Response

Generates a model response for the given input.

POST /v1/responses

Request Body

{
  "model": "agent:assistant",
  "input": "What can you help me with today?"
}

Limits and Timeouts

You can override server limits via metadata:

{
  "model": "claude-opus-4-5-20251101",
  "input": "Summarize the last three games",
  "metadata": {
    "tool_limits": { "max_tool_calls": 8 },
    "timeout_ms": 120000
  }
}

Resolution order:

  • metadata.tool_limits.max_tool_calls → request max_tool_calls → server config defaults
  • metadata.timeout_ms → server config defaults
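As a sketch, a client can mirror this precedence to predict the effective limit. The resolution itself happens server-side; the default value below is illustrative, not the real server configuration:

```python
def resolve_max_tool_calls(request: dict, server_default: int = 4) -> int:
    """Mirror the documented precedence:
    metadata.tool_limits.max_tool_calls -> top-level max_tool_calls -> server default.
    The server_default of 4 is illustrative, not the actual server config value."""
    limits = (request.get("metadata") or {}).get("tool_limits") or {}
    if "max_tool_calls" in limits:
        return limits["max_tool_calls"]
    if "max_tool_calls" in request:
        return request["max_tool_calls"]
    return server_default
```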

Metadata Extensions

The metadata object supports additional fields for agent routing:

  • metadata.tool_headers → per-request header overrides for agent MCP tools
  • metadata.prompt_vars → simple {{key}} substitutions in the agent system prompt

Per-request tool headers (agent tools)

{
  "model": "agent:assistant",
  "input": "Check the request id",
  "metadata": {
    "tool_headers": {
      "get-request-id": {
        "trace_id": "abc123",
        "request_id": "req-456"
      }
    }
  }
}

System prompt variables (agent only)

{
  "model": "agent:assistant",
  "input": "What can you do?",
  "metadata": {
    "prompt_vars": {
      "user_id": "u_123",
      "tenant": "acme"
    }
  }
}

If the agent system prompt contains {{user_id}} or {{tenant}}, they are replaced with the provided values.
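A minimal sketch of the substitution, assuming simple verbatim replacement. The behavior for placeholders without a matching key is not specified above; this sketch leaves them untouched:

```python
import re

def substitute_prompt_vars(system_prompt: str, prompt_vars: dict) -> str:
    """Replace {{key}} placeholders with values from metadata.prompt_vars.
    Placeholders with no matching key are left as-is (an assumption;
    the server's behavior for unknown keys may differ)."""
    return re.sub(
        r"\{\{(\w+)\}\}",
        lambda m: str(prompt_vars.get(m.group(1), m.group(0))),
        system_prompt,
    )
```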

Parameters

| Field | Type | Required | Description |
|---|---|---|---|
| model | string | Yes | Model ID (see Supported Models), agent:{agent_name} for agent routing, or agent:{agent_name}:{model_override} to override an agent's model |
| input | string/array | No | Text input or array of input items |
| instructions | string | No | System prompt / developer message |
| stream | boolean | No | Enable streaming responses (default: false) |
| max_output_tokens | integer | No | Maximum tokens to generate |
| store | boolean | No | Store the response for later retrieval |
| metadata | object | No | Request metadata (see Metadata Extensions) |
| previous_response_id | string | No | Chain responses in a conversation |
| reasoning | object | No | Enable extended thinking/reasoning (see Reasoning) |
| tools | array | No | Tools the model may call (function or MCP tools) |
| tool_choice | string/object | No | How the model selects tools |
| parallel_tool_calls | boolean | No | Allow parallel tool calls |
| max_tool_calls | integer | No | Maximum number of tool calls (fallback if not provided in metadata) |

Reasoning

The reasoning parameter enables extended thinking capabilities for supported models. When enabled, the model will perform additional reasoning steps before generating its response, which can improve quality for complex tasks.

Basic Usage

{
  "model": "claude-sonnet-4-5-20250929",
  "input": "Solve this step by step: If a train travels 120 miles in 2 hours, then stops for 30 minutes, then travels another 90 miles in 1.5 hours, what is the average speed for the entire journey?",
  "reasoning": {
    "effort": "medium"
  }
}

Reasoning Parameters

| Field | Type | Required | Description |
|---|---|---|---|
| effort | string | Yes | Reasoning intensity: "none", "low", "medium", or "high" |

Effort Levels

| Level | Description |
|---|---|
| none | Disable reasoning (supported by OpenAI gpt-5.0+) |
| low | Light reasoning, suitable for simpler problems |
| medium | Balanced reasoning for most tasks |
| high | Maximum reasoning depth for complex problems |

Note: If the reasoning parameter is omitted, the provider’s default behavior is used. For OpenAI gpt-5.1+, the default is "none" (no reasoning).

Supported Models

Reasoning is supported on models with the reasoning capability:

  • Anthropic: Claude Sonnet 3.7+, Claude Sonnet 4+, Claude Opus 4+
  • Google: Gemini 2.5 Pro, Gemini 2.5 Flash, Gemini 3.0 Pro
  • OpenAI: GPT-5.x, o-series models (o1, o3, o4)

Note: GPT-4.x models do not support the reasoning parameter and will return an error if it’s provided; for other unsupported models, the reasoning parameter is silently ignored.

Provider-Specific Behavior

Different providers implement reasoning differently:

| Provider | none | low | medium | high | Default (not specified) |
|---|---|---|---|---|---|
| OpenAI (gpt-5.0+) | No reasoning | Minimal reasoning | Balanced reasoning | Maximum reasoning | none (gpt-5.1) |
| gpt-oss (local) | Maps to low | Low thinking | Medium thinking | High thinking | medium |
| Anthropic | Disables thinking | ~1K token budget | ~8K token budget | ~24K token budget | No thinking |
| Google | Maps to low | Low budget | Medium budget | High budget | Provider default |

Note: gpt-oss models don’t support fully disabling reasoning; "none" maps to "low" (minimal reasoning).

Example with Streaming

{
  "model": "claude-sonnet-4-5-20250929",
  "input": "Explain the proof of the Pythagorean theorem",
  "reasoning": {
    "effort": "high"
  },
  "stream": true
}

When streaming with reasoning enabled, you’ll receive response.reasoning_summary_text.delta events containing the model’s reasoning process, followed by the regular response content.


Direct Model Calls

You can call models directly by specifying the model ID and optionally including MCP tools inline:

{
  "model": "gpt-5.2",
  "input": [
    {
      "role": "user",
      "content": "Roll 2d4+1"
    }
  ],
  "tools": [
    {
      "type": "mcp",
      "server_label": "dmcp",
      "server_description": "A Dungeons and Dragons MCP server to assist with dice rolling.",
      "server_url": "https://dmcp-server.deno.dev/sse",
      "require_approval": "never"
    }
  ]
}

MCP Tool Parameters

| Field | Type | Required | Description |
|---|---|---|---|
| type | string | Yes | Must be "mcp" |
| server_label | string | Yes | Identifier for the MCP server |
| server_description | string | No | Description of what the server provides |
| server_url | string | Yes | URL of the MCP server (SSE endpoint) |
| require_approval | string | No | Approval mode: "never", "always", or "auto" |

This approach is useful when you want to:

  • Use a specific model without agent configuration
  • Dynamically specify MCP tools per request
  • Test new tools without modifying agent config

Agent Routing

The recommended way to use the Responses API is through agent routing. Use the model field to route requests to configured agents:

{
  "model": "agent:assistant",
  "input": "Help me with my task"
}

This routes to the agent named “assistant” and uses its configured model, system prompt, and tool access.

Benefits of agent routing:

  • Pre-configured system prompts
  • Automatic MCP tool access
  • Centralized agent management
  • No need to specify model or instructions per request

Model Override

You can override an agent’s configured model while still using its system prompt and tools by appending the model name:

agent:{agent_name}:{model_override}

Examples:

// Use agent's default model
{
  "model": "agent:assistant",
  "input": "Hello!"
}

// Override with Claude
{
  "model": "agent:assistant:claude-haiku-4-5-20251001",
  "input": "Hello!"
}

// Override with gpt-5.2
{
  "model": "agent:assistant:gpt-5.2",
  "input": "Hello!"
}

This is useful when you want to:

  • Test an agent’s prompts and tools with different models
  • Use a faster/cheaper model for simple tasks
  • Use a more capable model for complex tasks
  • A/B test model performance with the same agent configuration

Response Format

Non-Streaming Response

{
  "id": "resp_abc123",
  "object": "response",
  "created_at": 1705312200,
  "status": "completed",
  "model": "claude-sonnet-4-5-20250929",
  "output": [
    {
      "type": "message",
      "id": "msg_xyz789",
      "status": "completed",
      "role": "assistant",
      "content": [
        {
          "type": "output_text",
          "text": "The capital of France is Paris."
        }
      ]
    }
  ],
  "usage": {
    "input_tokens": 25,
    "output_tokens": 12,
    "total_tokens": 37
  }
}

Response with Reasoning

When reasoning is enabled, the response includes a reasoning output item before the message:

{
  "id": "resp_abc123",
  "object": "response",
  "created_at": 1705312200,
  "status": "completed",
  "model": "claude-sonnet-4-5-20250929",
  "output": [
    {
      "type": "reasoning",
      "id": "reasoning_def456",
      "status": "completed",
      "summary": [
        {
          "type": "summary_text",
          "text": "To solve this problem, I need to calculate the total distance and total time..."
        }
      ]
    },
    {
      "type": "message",
      "id": "msg_xyz789",
      "status": "completed",
      "role": "assistant",
      "content": [
        {
          "type": "output_text",
          "text": "The average speed for the entire journey is 42 mph."
        }
      ]
    }
  ],
  "usage": {
    "input_tokens": 45,
    "output_tokens": 156,
    "total_tokens": 201
  }
}

Response Fields

| Field | Type | Description |
|---|---|---|
| id | string | Unique response identifier |
| object | string | Always "response" |
| created_at | integer | Unix timestamp of creation |
| status | string | One of: completed, failed, in_progress, cancelled |
| model | string | Model used for generation |
| output | array | Array of output items (messages, reasoning, function calls) |
| usage | object | Token usage statistics |
| error | object | Error details if status is "failed" |

Output Item Types

| Type | Description |
|---|---|
| message | Assistant’s response message with text content |
| reasoning | Model’s reasoning/thinking process (when reasoning enabled) |
| function_call | A tool/function call made by the model |
| function_call_output | Result from a tool/function call |
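Because output can contain reasoning and tool-call items alongside messages, indexing output[0] is only safe when reasoning and tools are off. A more defensive extraction sketch:

```python
def extract_output_text(response: dict) -> str:
    """Concatenate output_text parts from message items, skipping
    reasoning and function-call items in the output array."""
    parts = []
    for item in response.get("output", []):
        if item.get("type") != "message":
            continue  # skip reasoning, function_call, function_call_output
        for part in item.get("content", []):
            if part.get("type") == "output_text":
                parts.append(part["text"])
    return "".join(parts)
```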

Streaming

When stream: true, the endpoint returns Server-Sent Events (SSE):

curl -X POST http://localhost:8080/v1/responses \
  -H "Content-Type: application/json" \
  -H "Accept: text/event-stream" \
  -d '{
    "model": "agent:assistant",
    "input": "Tell me a story",
    "stream": true
  }'

Event Types

event: response.created
data: {"id":"resp_abc123","object":"response","status":"in_progress",...}

event: response.output_item.added
data: {"type":"message","id":"msg_xyz789","role":"assistant",...}

event: response.content_part.added
data: {"type":"output_text","text":""}

event: response.output_text.delta
data: {"delta":"Once upon"}

event: response.output_text.delta
data: {"delta":" a time..."}

event: response.output_text.done
data: {"text":"Once upon a time..."}

Reasoning Events (when reasoning is enabled)

When reasoning is enabled, additional events are sent before the main response content:

event: response.reasoning_summary_part.added
data: {"item_id":"reasoning_abc","output_index":0,"summary_index":0,"part":{"type":"summary_text","text":""}}

event: response.reasoning_summary_text.delta
data: {"item_id":"reasoning_abc","output_index":0,"summary_index":0,"delta":"Let me think through this..."}

event: response.reasoning_summary_text.delta
data: {"item_id":"reasoning_abc","output_index":0,"summary_index":0,"delta":" First, I need to consider..."}

event: response.reasoning_summary_text.done
data: {"item_id":"reasoning_abc","output_index":0,"summary_index":0,"text":"Let me think through this... First, I need to consider..."}

event: response.output_item.done
data: {"type":"message","id":"msg_xyz789","status":"completed",...}

event: response.completed
data: {"id":"resp_abc123","status":"completed","usage":{...}}


### Event Sequence

Standard sequence:

1. `response.created` - Response object created
2. `response.output_item.added` - New output item (message or function call)
3. `response.content_part.added` - New content part added
4. `response.output_text.delta` - Text chunk (repeated)
5. `response.output_text.done` - Text content complete
6. `response.output_item.done` - Output item complete
7. `response.completed` - Full response complete

With reasoning enabled, reasoning events appear after `response.created` and before the main content:

1. `response.created`
2. `response.reasoning_summary_part.added` - Reasoning output started
3. `response.reasoning_summary_text.delta` - Reasoning text chunk (repeated)
4. `response.reasoning_summary_text.done` - Reasoning complete
5. `response.output_item.added` - Main response content begins
6. ... (standard content events)
7. `response.completed`
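These sequences can be assembled client-side with a small parser. The sketch below accumulates delta events from raw SSE lines and ignores event types it doesn't recognize:

```python
import json

def collect_stream(sse_lines: list[str]) -> tuple[str, str]:
    """Accumulate reasoning and output text from raw SSE lines.
    Returns (reasoning_text, output_text); other events are ignored."""
    reasoning, text = [], []
    event = None
    for line in sse_lines:
        if line.startswith("event: "):
            event = line[len("event: "):].strip()
        elif line.startswith("data: ") and event:
            data = json.loads(line[len("data: "):])
            if event == "response.output_text.delta":
                text.append(data["delta"])
            elif event == "response.reasoning_summary_text.delta":
                reasoning.append(data["delta"])
    return "".join(reasoning), "".join(text)
```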

---

## Conversation Chaining

Chain multiple responses together using `previous_response_id` to maintain conversation context:

```bash
# First message
curl -X POST http://localhost:8080/v1/responses \
  -H "Content-Type: application/json" \
  -d '{
    "model": "agent:assistant",
    "input": "What is machine learning?"
  }'
# Response includes "id": "resp_abc123"

# Follow-up message
curl -X POST http://localhost:8080/v1/responses \
  -H "Content-Type: application/json" \
  -d '{
    "model": "agent:assistant",
    "previous_response_id": "resp_abc123",
    "input": "Can you give me a specific example?"
  }'
```

Examples

Chat with an Agent

curl -X POST http://localhost:8080/v1/responses \
  -H "Content-Type: application/json" \
  -d '{
    "model": "agent:assistant",
    "input": "Hello! What can you help me with?"
  }'

Agent with Model Override

curl -X POST http://localhost:8080/v1/responses \
  -H "Content-Type: application/json" \
  -d '{
    "model": "agent:assistant:gpt-5.2",
    "input": "Hello! What can you help me with?"
  }'

Streaming Chat

curl -X POST http://localhost:8080/v1/responses \
  -H "Content-Type: application/json" \
  -H "Accept: text/event-stream" \
  -d '{
    "model": "agent:assistant",
    "input": "Explain how APIs work",
    "stream": true
  }'

Multi-turn Conversation

# Ask a question
curl -X POST http://localhost:8080/v1/responses \
  -H "Content-Type: application/json" \
  -d '{
    "model": "agent:researcher",
    "input": "What are the main causes of climate change?"
  }'

# Follow up (using the response ID from above)
curl -X POST http://localhost:8080/v1/responses \
  -H "Content-Type: application/json" \
  -d '{
    "model": "agent:researcher",
    "previous_response_id": "resp_abc123",
    "input": "What solutions are being proposed?"
  }'

Direct Model with MCP Tools

curl -X POST http://localhost:8080/v1/responses \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer <token>" \
  -d '{
    "model": "gpt-5.2",
    "input": [
      {
        "role": "user",
        "content": "Roll 2d4+1 for damage"
      }
    ],
    "tools": [
      {
        "type": "mcp",
        "server_label": "dmcp",
        "server_description": "A D&D MCP server for dice rolling",
        "server_url": "https://dmcp-server.deno.dev/sse",
        "require_approval": "never",
        "headers": {
          "X-API-Key": "your-api-key"
        },
        "tool_call_values": {
          "player_id": "player_123"
        }
      }
    ]
  }'

Using the OpenAI SDK

The Responses API is compatible with the OpenAI SDK:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="not-used",  # Archia uses Basic auth
    default_headers={"Authorization": "Basic <credentials>"}
)

response = client.responses.create(
    model="agent:assistant",
    input="What's the weather like today?"
)

print(response.output[0].content[0].text)

And the equivalent in TypeScript:

import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "http://localhost:8080/v1",
  apiKey: "not-used",
  defaultHeaders: { Authorization: "Basic <credentials>" },
});

const response = await client.responses.create({
  model: "agent:assistant",
  input: "What's the weather like today?",
});

console.log(response.output[0].content[0].text);

Langfuse Integration

Langfuse provides observability for LLM applications. You can trace Archia API calls to monitor performance, debug issues, and analyze usage.

Python with Langfuse

from openai import OpenAI
from langfuse import Langfuse

# Initialize clients
client = OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="not-used",
    default_headers={"Authorization": "Basic <credentials>"}
)

langfuse = Langfuse(
    public_key="pk-lf-...",
    secret_key="sk-lf-...",
    host="http://localhost:3000"
)

# Create a trace
trace = langfuse.trace(
    name="chat-with-agent",
    input={"prompt": "Hello!"},
    tags=["archia", "assistant"]
)

# Create a generation span
generation = trace.generation(
    name="responses-api-call",
    model="agent:assistant",
    input="Hello!"
)

# Make the API call
response = client.responses.create(
    model="agent:assistant",
    input="Hello!"
)

# Extract output and complete the trace
output_text = response.output[0].content[0].text
generation.end(
    output=output_text,
    usage={
        "input": response.usage.input_tokens,
        "output": response.usage.output_tokens,
        "total": response.usage.total_tokens
    }
)

# Flush traces
langfuse.flush()

TypeScript with Langfuse

import OpenAI from "openai";
import Langfuse from "langfuse";

const client = new OpenAI({
  baseURL: "http://localhost:8080/v1",
  apiKey: "not-used",
  defaultHeaders: { Authorization: "Basic <credentials>" },
});

const langfuse = new Langfuse({
  publicKey: "pk-lf-...",
  secretKey: "sk-lf-...",
  baseUrl: "http://localhost:3000",
});

// Create a trace
const trace = langfuse.trace({
  name: "chat-with-agent",
  input: { prompt: "Hello!" },
  tags: ["archia", "assistant"],
});

// Create a generation span
const generation = trace.generation({
  name: "responses-api-call",
  model: "agent:assistant",
  input: "Hello!",
});

// Make the API call
const response = await client.responses.create({
  model: "agent:assistant",
  input: "Hello!",
});

// Extract output and complete the trace
const outputText = response.output[0].content[0].text;
generation.end({
  output: outputText,
  usage: {
    input: response.usage.input_tokens,
    output: response.usage.output_tokens,
    total: response.usage.total_tokens,
  },
});

// Flush traces
await langfuse.flushAsync();

Python with Langfuse Annotations

Using the @observe decorator for automatic tracing:

from openai import OpenAI
from langfuse import Langfuse
from langfuse.decorators import observe

# Initialize clients
client = OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="not-used",
    default_headers={"Authorization": "Basic <credentials>"}
)

langfuse = Langfuse(
    public_key="pk-lf-...",
    secret_key="sk-lf-...",
    host="http://localhost:3000"
)

@observe(name="chat_with_agent")
def chat_with_agent(prompt: str, agent: str = "assistant") -> str:
    """Chat with an agent and return the response."""
    response = client.responses.create(
        model=f"agent:{agent}",
        input=prompt
    )
    
    output_text = response.output[0].content[0].text
    return output_text

@observe(name="multi_turn_conversation")
def multi_turn_conversation(messages: list[dict]) -> str:
    """Have a multi-turn conversation with an agent."""
    previous_response_id = None
    
    for msg in messages:
        if previous_response_id:
            response = client.responses.create(
                model="agent:assistant",
                previous_response_id=previous_response_id,
                input=msg["content"]
            )
        else:
            response = client.responses.create(
                model="agent:assistant",
                input=msg["content"]
            )
        
        previous_response_id = response.id
    
    return response.output[0].content[0].text

@observe(name="direct_model_call_with_tools")
def direct_model_call_with_tools(prompt: str) -> str:
    """Call a model directly with MCP tools."""
    response = client.responses.create(
        model="gpt-5.2",
        input=[{"role": "user", "content": prompt}],
        tools=[
            {
                "type": "mcp",
                "server_label": "dmcp",
                "server_description": "A Dungeons and Dragons MCP server",
                "server_url": "https://dmcp-server.deno.dev/sse",
                "require_approval": "never"
            }
        ]
    )
    
    return response.output[0].content[0].text

# Usage examples
if __name__ == "__main__":
    # Simple chat
    result = chat_with_agent("What is machine learning?")
    print(result)
    
    # Multi-turn conversation
    messages = [
        {"role": "user", "content": "What is machine learning?"},
        {"role": "user", "content": "Can you give me a specific example?"}
    ]
    result = multi_turn_conversation(messages)
    print(result)
    
    # Direct model call with tools
    result = direct_model_call_with_tools("Roll 2d4+1 for damage")
    print(result)
    
    # Flush traces to Langfuse
    langfuse.flush()

The @observe decorator automatically:

  • Creates a trace for each function call
  • Captures input and output
  • Measures execution time
  • Logs any errors that occur
  • Tracks nested function calls as child spans

What Langfuse Captures

| Field | Description |
|---|---|
| Model | Agent name (e.g., agent:assistant) |
| Input | The prompt sent to the API |
| Output | The response text |
| Usage | Token counts (input, output, total) |
| Tags | Filterable tags for organizing traces |
| Latency | Request duration |
| Metadata | Custom context and attributes |

For complete examples, see the poc/shottracker/langfuse/ directory which includes full Python and TypeScript implementations.


Error Handling

Error Response

{
  "id": "resp_abc123",
  "status": "failed",
  "error": {
    "error_type": "invalid_request",
    "message": "Agent 'unknown-agent' not found"
  }
}

Common Errors

| Error Type | Description |
|---|---|
| invalid_request | Malformed request or invalid parameters |
| agent_not_found | Agent routing failed; the agent doesn’t exist |
| rate_limit_exceeded | Too many requests |
| context_length_exceeded | Input too long for model |
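A client-side check based on the error shape above. This is a sketch: the field names follow the documented example, and the exception type is an arbitrary choice:

```python
def raise_for_response_error(response: dict) -> dict:
    """Pass completed responses through; raise when status == "failed",
    using the documented error.error_type and error.message fields."""
    if response.get("status") == "failed":
        err = response.get("error") or {}
        raise RuntimeError(
            f"{err.get('error_type', 'unknown')}: {err.get('message', '')}"
        )
    return response
```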

Next Steps