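We have upgraded the OpenAI-compatible API with deeper adaptations for the Claude model family. You can now control thinking and caching more precisely and conveniently. Interleaved thinking in multi-turn conversations is handled automatically, so it works without any extra parameters. The API also supports enabling beta features provided by Anthropic.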
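1. Model Thinking (Extended Thinking)
1.1 Benefits of interleaved thinking
Without interleaved thinking, the model thinks only once, at the start of an assistant turn. After it receives a tool result it generates the reply directly and produces no new thinking block:
User → [Thinking] → Tool Call → Tool Result → Response
With interleaved thinking enabled, the model inserts a new thinking block each time it receives a tool result, forming a chain of reasoning:
User → [Thinking] → Tool Call → Tool Result → [Thinking] → Response
                                              ↑ Interleaved Thinking
This allows the model to:
- Reason again over the tool result instead of splicing it straight into the output
- Chain its reasoning across multiple tool calls, so that each decision builds on the analysis from the previous step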
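1.2 Enabling thinking
Four methods are supported; use any one of them:
| Method | Example | Description |
|---|---|---|
| reasoning_effort | "reasoning_effort": "low" | Standard OpenAI parameter, placed at the top level of the request body |
| reasoning.effort | "reasoning": {"effort": "low"} | Equivalent to the previous method, placed inside the reasoning object |
| reasoning.max_tokens | "reasoning": {"max_tokens": 1024} | Sets the exact maximum number of thinking tokens |
| -think suffix on the model name | "model": "claude-sonnet-4-5-think" | The simplest method; no extra parameters needed |
Priority (when several methods are used together):
reasoning_effort > reasoning.max_tokens > reasoning.effort > -think suffix
Allowed effort values: minimal / low / medium / high / xhigh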
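The priority order can be sketched as a small client-side resolver. This is only an illustration of the rule stated above, not the gateway's implementation; the function name `resolve_thinking` and its return shape are hypothetical.

```python
from typing import Any, Optional, Tuple


def resolve_thinking(body: dict) -> Optional[Tuple[str, Any]]:
    """Return (source, value) for the thinking setting that wins, or None.

    Follows the documented order:
    reasoning_effort > reasoning.max_tokens > reasoning.effort > -think suffix.
    """
    reasoning = body.get("reasoning") or {}
    if "reasoning_effort" in body:
        return ("reasoning_effort", body["reasoning_effort"])
    if "max_tokens" in reasoning:
        return ("reasoning.max_tokens", reasoning["max_tokens"])
    if "effort" in reasoning:
        return ("reasoning.effort", reasoning["effort"])
    if str(body.get("model", "")).endswith("-think"):
        return ("-think suffix", None)
    return None  # thinking not requested


example = {
    "model": "claude-sonnet-4-5-think",
    "reasoning": {"effort": "high", "max_tokens": 1024},
}
print(resolve_thinking(example))  # ('reasoning.max_tokens', 1024)
```

Running a request body through a check like this before sending it makes it easier to see which setting takes effect when several are present.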
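1.3 Thinking output
The response message gains two additional fields:
- reasoning_content: the thinking content as a string, convenient for direct display
- reasoning_details: the complete structured thinking information. It must be passed back unmodified in multi-turn conversations, and its internal structure may differ between providers
Non-streaming example (irrelevant fields omitted):
{
  "choices": [{
    "message": {
      "role": "assistant",
      "content": "Hello! How can I help you today?",
      "reasoning_content": "The user is just saying hello...",
      "reasoning_details": {
        "type": "thinking",
        "thinking": "The user is just saying hello...",
        "signature": "Er8CCkYI..."
      }
    }
  }]
}
When streaming, the thinking content is delivered chunk by chunk through delta.reasoning_content and delta.reasoning_details. The full stream-assembly logic is shown in the complete example below.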
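The two fields serve different purposes: one is for display, the other is for sending back. A minimal sketch of separating them, using the non-streaming response body shown above; the helper name `split_message` is hypothetical.

```python
import json

# Response body copied from the non-streaming example (irrelevant fields omitted).
raw = """
{
  "choices": [{
    "message": {
      "role": "assistant",
      "content": "Hello! How can I help you today?",
      "reasoning_content": "The user is just saying hello...",
      "reasoning_details": {
        "type": "thinking",
        "thinking": "The user is just saying hello...",
        "signature": "Er8CCkYI..."
      }
    }
  }]
}
"""


def split_message(message: dict) -> tuple:
    """Return (display_text, details).

    display_text is safe to show in a UI; details should be stored and
    sent back unmodified on the next turn. Either may be None when the
    model did not think.
    """
    return message.get("reasoning_content"), message.get("reasoning_details")


message = json.loads(raw)["choices"][0]["message"]
display_text, details = split_message(message)
print(display_text)
```

Treating `details` as an opaque value avoids depending on its internal structure, which may differ between providers.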
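1.4 Preserving thinking across turns (interleaved thinking is built in; no extra parameters needed)
To let the model carry its reasoning across turns, place the reasoning_details returned by the previous turn, unmodified, into the assistant message of the next turn:
messages = [
    {"role": "user", "content": "What's the weather like in Boston?"},
    {
        "role": "assistant",
        "content": response.choices[0].message.content,
        "tool_calls": response.choices[0].message.tool_calls,
        "reasoning_details": response.choices[0].message.reasoning_details,
    },
    {
        "role": "tool",
        "tool_call_id": "toolu_xxx",
        "content": '{"temperature": 45, "condition": "rainy"}',
    }
]
When CaMeL AI detects historical thinking information in a request, it automatically enables Interleaved Thinking so that the model keeps reasoning in depth after receiving tool results, with no extra parameters needed.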
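Because interleaved thinking depends on reasoning_details being present in the history, a client-side check before sending can catch turns where it was dropped. The sketch below is an assumption about what is useful to verify; it does not reproduce CaMeL AI's actual detection logic, and both function names are hypothetical.

```python
def has_reasoning_history(messages: list) -> bool:
    """True if any assistant message carries reasoning_details."""
    return any(
        m.get("role") == "assistant" and bool(m.get("reasoning_details"))
        for m in messages
    )


def turns_missing_reasoning(messages: list) -> list:
    """Indexes of assistant tool-call turns that lack reasoning_details."""
    return [
        i
        for i, m in enumerate(messages)
        if m.get("role") == "assistant"
        and m.get("tool_calls")
        and not m.get("reasoning_details")
    ]


history = [
    {"role": "user", "content": "What's the weather like in Boston?"},
    {
        "role": "assistant",
        "content": None,
        "tool_calls": [{"id": "toolu_xxx", "type": "function",
                        "function": {"name": "get_weather", "arguments": "{}"}}],
        "reasoning_details": {"type": "thinking", "thinking": "...", "signature": "..."},
    },
    {"role": "tool", "tool_call_id": "toolu_xxx", "content": "{}"},
]
print(has_reasoning_history(history), turns_missing_reasoning(history))
```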
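1.5 Complete examples
The two examples below demonstrate the full multi-turn tool call + interleaved thinking flow: the user asks a question → the model thinks and calls a tool → the tool result is injected (preserving reasoning_details) → the model thinks again and gives the final reply.
Non-streaming · interleaved thinking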
import os
import json
from openai import OpenAI

client = OpenAI(
    base_url="https://api.kr777.top",
    api_key=os.environ.get("CAMELAI_API_KEY", "sk-***"),
)

# ── Tool definition ───────────────────────────────────────────
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get current weather for a location",
        "parameters": {
            "type": "object",
            "properties": {"location": {"type": "string", "description": "City name"}},
            "required": ["location"]
        }
    }
}]

# ── Mock tool execution ───────────────────────────────────────
WEATHER_DB = {
    "boston": {"temperature": "45°F (7°C)", "condition": "rainy", "humidity": "85%", "wind": "15 mph NE"},
    "tokyo": {"temperature": "72°F (22°C)", "condition": "sunny", "humidity": "45%", "wind": "5 mph S"},
}

def execute_tool(name: str, args: dict) -> str:
    if name == "get_weather":
        key = next((k for k in WEATHER_DB if k in args.get("location", "").lower()), None)
        return json.dumps(WEATHER_DB.get(key, {"temperature": "65°F", "condition": "clear"}))
    return "{}"

# ── Multi-turn conversation loop ─────────────────────────────
messages = [
    {"role": "user", "content": "What's the weather like in Boston? Then recommend what to wear."}
]
turn = 0
while True:
    turn += 1
    print(f"\n── Turn {turn} ──")
    response = client.chat.completions.create(
        model="claude-sonnet-4-5",
        messages=messages,
        tools=tools,
        extra_body={"reasoning": {"max_tokens": 2000}},
    )
    msg = response.choices[0].message
    # The reasoning fields are not declared on the SDK's message model,
    # so read them defensively in case a response omits them
    reasoning_content = getattr(msg, "reasoning_content", None)
    reasoning_details = getattr(msg, "reasoning_details", None)

    # Print thinking process
    if reasoning_content:
        label = "Interleaved Thinking" if turn > 1 else "Thinking"
        print(f"[{label}] {reasoning_content}")
    # Print response content
    if msg.content:
        print(f"[Response] {msg.content}")
    # Print tool calls
    if msg.tool_calls:
        for tc in msg.tool_calls:
            print(f"[Tool Call: {tc.function.name}] {tc.function.arguments}")

    # Build assistant message, preserve reasoning_details (critical!)
    assistant_msg = {"role": "assistant", "content": msg.content}
    if msg.tool_calls:
        assistant_msg["tool_calls"] = msg.tool_calls
    if reasoning_details:
        assistant_msg["reasoning_details"] = reasoning_details  # pass back unmodified
    messages.append(assistant_msg)

    # No tool_calls means conversation is done
    if not msg.tool_calls:
        break

    # Execute tools and append results to messages
    for tc in msg.tool_calls:
        args = json.loads(tc.function.arguments)
        result = execute_tool(tc.function.name, args)
        print(f"[Tool Result: {tc.function.name}] {result}")
        messages.append({"role": "tool", "tool_call_id": tc.id, "content": result})
Streaming · interleaved thinking
import os
import sys
import json
from openai import OpenAI

client = OpenAI(
    base_url="https://api.kr777.top",
    api_key=os.environ.get("CAMELAI_API_KEY", "sk-***"),
)

# ── Tool definition & mock execution ─────────────────────────
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get current weather for a location",
        "parameters": {
            "type": "object",
            "properties": {"location": {"type": "string", "description": "City name"}},
            "required": ["location"]
        }
    }
}]

WEATHER_DB = {
    "boston": {"temperature": "45°F (7°C)", "condition": "rainy", "humidity": "85%", "wind": "15 mph NE"},
    "tokyo": {"temperature": "72°F (22°C)", "condition": "sunny", "humidity": "45%", "wind": "5 mph S"},
}

def execute_tool(name: str, args: dict) -> str:
    if name == "get_weather":
        key = next((k for k in WEATHER_DB if k in args.get("location", "").lower()), None)
        return json.dumps(WEATHER_DB.get(key, {"temperature": "65°F", "condition": "clear"}))
    return "{}"

# ── Stream response collector ────────────────────────────────
def stream_and_collect(turn: int, **kwargs):
    """Stream response, print thinking/content in real-time, accumulate reasoning_details/tool_calls."""
    rd = {}        # accumulated reasoning_details
    content = ""   # accumulated response text
    tc_map = {}    # accumulated tool_calls (by index)
    cur = "none"   # current output section: none / thinking / content

    stream = client.chat.completions.create(stream=True, **kwargs)
    for chunk in stream:
        if not chunk.choices:
            continue
        delta = chunk.choices[0].delta

        # ── Handle thinking ──
        rd_delta = getattr(delta, "reasoning_details", None)
        if rd_delta and isinstance(rd_delta, dict):
            for k, v in rd_delta.items():
                if k == "type":
                    rd[k] = v
                elif isinstance(v, str):
                    rd[k] = rd.get(k, "") + v
                elif v is not None:
                    rd[k] = v
            # Print thinking chunks in real-time
            thinking_chunk = rd_delta.get("thinking", "")
            if thinking_chunk:
                if cur != "thinking":
                    cur = "thinking"
                    label = "Interleaved Thinking" if turn > 1 else "Thinking"
                    sys.stdout.write(f"\n[{label}] ")
                sys.stdout.write(thinking_chunk)
                sys.stdout.flush()

        # ── Handle content ──
        if delta.content:
            if cur != "content":
                if cur == "thinking":
                    sys.stdout.write("\n")
                cur = "content"
                sys.stdout.write("\n[Response] ")
            sys.stdout.write(delta.content)
            sys.stdout.flush()
            content += delta.content

        # ── Handle tool_calls ──
        for tc in delta.tool_calls or []:
            i = tc.index
            if i not in tc_map:
                tc_map[i] = {"id": "", "type": "function",
                             "function": {"name": "", "arguments": ""}}
            if tc.id:
                tc_map[i]["id"] = tc.id
            if tc.function:
                tc_map[i]["function"]["name"] += tc.function.name or ""
                tc_map[i]["function"]["arguments"] += tc.function.arguments or ""

    # End current output section
    if cur in ("thinking", "content"):
        sys.stdout.write("\n")
    tool_calls = [tc_map[i] for i in sorted(tc_map)] if tc_map else None
    return {
        "content": content or None,
        "reasoning_details": rd or None,
        "tool_calls": tool_calls,
    }

# ── Multi-turn conversation loop ─────────────────────────────
messages = [
    {"role": "user", "content": "What's the weather like in Boston? Then recommend what to wear."}
]
turn = 0
while True:
    turn += 1
    print(f"\n── Turn {turn} ──")
    result = stream_and_collect(
        turn,
        model="claude-sonnet-4-5",
        messages=messages,
        tools=tools,
        extra_body={"reasoning": {"max_tokens": 2000}},
    )
    # Print tool calls
    if result["tool_calls"]:
        for tc in result["tool_calls"]:
            print(f"[Tool Call: {tc['function']['name']}] {tc['function']['arguments']}")

    # Build assistant message, preserve reasoning_details (critical!)
    assistant_msg = {"role": "assistant", "content": result["content"]}
    if result["tool_calls"]:
        assistant_msg["tool_calls"] = result["tool_calls"]
    if result["reasoning_details"]:
        assistant_msg["reasoning_details"] = result["reasoning_details"]  # pass back unmodified
    messages.append(assistant_msg)

    # No tool_calls means conversation is done
    if not result["tool_calls"]:
        break

    # Execute tools and append results to messages
    for tc in result["tool_calls"]:
        args = json.loads(tc["function"]["arguments"])
        tool_result = execute_tool(tc["function"]["name"], args)
        print(f"[Tool Result: {tc['function']['name']}] {tool_result}")
        messages.append({"role": "tool", "tool_call_id": tc["id"], "content": tool_result})
1.6 Thinking-effort mapping rules
effort mode:
- Opus 4.6 / Sonnet 4.6 and later: mapped to Anthropic's native Adaptive Thinking effort levels
- Other models: budget_tokens is computed with the formula:
budget_tokens = max(min(max_tokens × effort_ratio, 128000), 1024)
| effort | effort_ratio |
|---|---|
| xhigh | 0.95 |
| high | 0.80 |
| medium | 0.50 |
| low | 0.20 |
| minimal | 0.10 |
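The formula and ratio table can be worked through in a few lines. This is a sketch of the documented formula only; rounding is not specified in the text, so flooring to an integer is an assumption, and the ratios are stored as whole percentages to avoid floating-point drift.

```python
# effort_ratio from the table above, expressed as whole percentages
EFFORT_PERCENT = {"xhigh": 95, "high": 80, "medium": 50, "low": 20, "minimal": 10}


def budget_tokens(max_tokens: int, effort: str) -> int:
    """budget_tokens = max(min(max_tokens * effort_ratio, 128000), 1024).

    Assumption: the product is floored to an integer.
    """
    scaled = max_tokens * EFFORT_PERCENT[effort] // 100
    return max(min(scaled, 128000), 1024)


for effort in ("minimal", "low", "medium", "high", "xhigh"):
    print(effort, budget_tokens(16000, effort))
```

With a small max_tokens the lower clamp dominates: at max_tokens = 4096 and effort low, the scaled value is 819, so the budget becomes 1024.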
Adaptive Thinking effort mapping:
| Requested effort | Opus 4.6 | Sonnet 4.6 |
|---|---|---|
| xhigh | max | high |
| high | high | high |
| medium | medium | medium |
| low | low | low |
| minimal | low | low |
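The mapping table translates directly into a lookup. In this sketch the family keys "opus-4.6" and "sonnet-4.6" are hypothetical labels chosen for illustration, not model identifiers accepted by the API.

```python
# Rows of the Adaptive Thinking mapping table, keyed by a hypothetical family label
ADAPTIVE_EFFORT = {
    "opus-4.6": {
        "xhigh": "max", "high": "high", "medium": "medium",
        "low": "low", "minimal": "low",
    },
    "sonnet-4.6": {
        "xhigh": "high", "high": "high", "medium": "medium",
        "low": "low", "minimal": "low",
    },
}


def adaptive_effort(family: str, effort: str) -> str:
    """Return the native effort level for a requested effort value."""
    return ADAPTIVE_EFFORT[family][effort]


print(adaptive_effort("opus-4.6", "xhigh"))    # max
print(adaptive_effort("sonnet-4.6", "xhigh"))  # high
```

The table shows that xhigh only reaches max on Opus 4.6, and that minimal is folded into low on both families.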
max_tokens mode: the value is assigned directly to Anthropic's budget_tokens.
-think suffix: Opus/Sonnet 4.6+ use Adaptive Thinking (effort=medium); other models set budget_tokens = min(10240, max_tokens - 1), where max_tokens defaults to 4096.
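For models that do not use Adaptive Thinking, the -think suffix rule is simple arithmetic. A minimal sketch, assuming the default of 4096 applies whenever the request omits max_tokens; the function name is hypothetical.

```python
from typing import Optional

DEFAULT_MAX_TOKENS = 4096  # default stated in the rule above


def think_suffix_budget(max_tokens: Optional[int] = None) -> int:
    """budget_tokens = min(10240, max_tokens - 1) for the -think suffix."""
    effective = DEFAULT_MAX_TOKENS if max_tokens is None else max_tokens
    return min(10240, effective - 1)


print(think_suffix_budget())       # 4095
print(think_suffix_budget(32000))  # 10240
```

With the default max_tokens, almost the entire output budget can go to thinking, so setting a larger max_tokens leaves more room for the visible reply.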
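2. Prompt Caching
You can use Prompt Caching when calling Claude models through the Chat API. Setting cache_control breakpoints in messages caches large, repeatedly used text (character cards, RAG data, book chapters, and so on), so later requests hit the cache and pay the lower cache-read rate.
Official Claude documentation: Prompt Caching
2.1 Cache pricing
| Operation | Price multiplier (relative to the base input price) |
|---|---|
| Cache write (5-minute TTL) | 1.25x |
| Cache write (1-hour TTL) | 2x |
| Cache read | 0.1x |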
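The multipliers can be applied to token counts to estimate input cost. The sketch below uses a placeholder base price of 3.0 per million tokens, which is a hypothetical figure and not a quoted price, together with the token counts from the usage examples in section 2.5.

```python
# Multipliers from the pricing table above
CACHE_WRITE_5M = 1.25
CACHE_WRITE_1H = 2.0
CACHE_READ = 0.1


def input_cost(base_price_per_mtok: float, uncached: int = 0,
               write_5m: int = 0, write_1h: int = 0, read: int = 0) -> float:
    """Input cost for one request, given the base price per million tokens."""
    weighted = (
        uncached
        + write_5m * CACHE_WRITE_5M
        + write_1h * CACHE_WRITE_1H
        + read * CACHE_READ
    )
    return base_price_per_mtok * weighted / 1_000_000


BASE = 3.0  # hypothetical placeholder price per million input tokens

first = input_cost(BASE, uncached=22, write_5m=6266)   # cache creation
later = input_cost(BASE, uncached=22, read=6266)       # cache hit
no_cache = input_cost(BASE, uncached=22 + 6266)        # same prompt, no caching
print(first, later, no_cache)
```

The first request costs more than the uncached equivalent because of the write surcharge; every later hit is billed at a tenth of the base rate for the cached portion.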
2.2 Supported models and minimum cacheable length
| Model | Minimum cacheable tokens |
|---|---|
| Claude Opus 4.6 / Opus 4.5 | 4096 |
| Claude Sonnet 4.6 / Sonnet 4.5 / Opus 4.1 / Opus 4 / Sonnet 4 / Sonnet 3.7 (deprecated) | 1024 |
| Claude Haiku 4.5 | 4096 |
| Claude Haiku 3.5 (deprecated) / Haiku 3 | 2048 |
Breakpoint limit: each request may contain at most 4 cache_control breakpoints.
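The breakpoint limit can be checked on the client before a request is sent. This sketch assumes breakpoints appear only on content parts inside messages and at the top level of tool objects, which are the placements this document describes; the function names are hypothetical.

```python
MAX_BREAKPOINTS = 4  # per-request limit stated above


def count_breakpoints(body: dict) -> int:
    """Count cache_control markers in message content parts and tools."""
    count = 0
    for message in body.get("messages", []):
        content = message.get("content")
        if isinstance(content, list):  # plain-string content cannot carry a breakpoint
            count += sum(
                1 for part in content
                if isinstance(part, dict) and "cache_control" in part
            )
    count += sum(1 for tool in body.get("tools", []) if "cache_control" in tool)
    return count


def check_breakpoints(body: dict) -> int:
    """Return the breakpoint count, raising if it exceeds the limit."""
    found = count_breakpoints(body)
    if found > MAX_BREAKPOINTS:
        raise ValueError(f"{found} cache_control breakpoints, limit is {MAX_BREAKPOINTS}")
    return found


body = {
    "messages": [
        {"role": "system", "content": [
            {"type": "text", "text": "(long context)",
             "cache_control": {"type": "ephemeral"}},
        ]},
        {"role": "user", "content": "Hello"},
    ],
    "tools": [{"type": "function", "function": {"name": "get_weather"},
               "cache_control": {"type": "ephemeral"}}],
}
print(check_breakpoints(body))  # 2
```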
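2.3 Cache TTL
| TTL | Syntax | Use case |
|---|---|---|
| 5 minutes (default) | "cache_control": {"type": "ephemeral"} | Short sessions, regular requests |
| 1 hour | "cache_control": {"type": "ephemeral", "ttl": "1h"} | Long sessions, avoiding repeated cache writes |
The 1-hour TTL costs more to write, but in long sessions it can lower the total cost by reducing repeated writes. All providers (including Anthropic, Amazon Bedrock, and Google Vertex AI) support the 1-hour TTL for Claude 4.5 and later models.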
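The trade-off between the two TTLs can be illustrated with a simplified model. It assumes requests arrive at a fixed interval, that a cache hit refreshes the TTL, and that an expired entry is rewritten in full; these are simplifying assumptions for illustration, not guarantees made by this document. Costs are expressed as multiples of the cached prefix's base input price.

```python
WRITE = {"5m": 1.25, "1h": 2.0}   # write multipliers from the pricing table
TTL_MINUTES = {"5m": 5, "1h": 60}
READ = 0.1                        # read multiplier


def session_cost(requests: int, gap_minutes: float, ttl: str) -> float:
    """Cost of the cached prefix over a session, in multiples of its base price."""
    if requests <= 0:
        return 0.0
    if gap_minutes < TTL_MINUTES[ttl]:
        # The cache stays warm: one write, then reads only.
        return WRITE[ttl] + (requests - 1) * READ
    # The cache expires between requests: every request pays a write.
    return requests * WRITE[ttl]


# Six requests spaced 10 minutes apart
print(session_cost(6, 10, "5m"), session_cost(6, 10, "1h"))
# Six requests spaced 2 minutes apart
print(session_cost(6, 2, "5m"), session_cost(6, 2, "1h"))
```

Under these assumptions the 1-hour TTL pays off as soon as gaps between requests exceed 5 minutes, while for rapid back-to-back requests the default TTL remains cheaper.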
2.4 Usage
Cache breakpoints can be set with the cache_control field in system messages, user messages (including images), and tools. The examples below show only the key structure and omit the long body text.
System message caching (default 5-minute TTL):
{
  "model": "claude-opus-4-5",
  "messages": [
    {
      "role": "system",
      "content": [
        {"type": "text", "text": "You are an AI assistant"},
        {
          "type": "text",
          "text": "(long context)",
          "cache_control": {"type": "ephemeral"}
        }
      ]
    },
    {
      "role": "user",
      "content": [{"type": "text", "text": "Hello"}]
    }
  ]
}
User message caching (1-hour TTL):
{
  "model": "claude-opus-4-5",
  "messages": [
    {
      "role": "system",
      "content": [{"type": "text", "text": "You are an AI assistant"}]
    },
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "(long context)",
          "cache_control": {"type": "ephemeral", "ttl": "1h"}
        },
        {"type": "text", "text": "Hello"}
      ]
    }
  ]
}
Image message caching:
{
  "role": "user",
  "content": [
    {
      "type": "image_url",
      "image_url": {"detail": "auto", "url": "data:image/jpeg;base64,/9j/4AAQ..."},
      "cache_control": {"type": "ephemeral"}
    },
    {"type": "text", "text": "What's this?"}
  ]
}
Tool definition caching:
Place cache_control at the top level of the tool object (alongside type and function):
{
  "tools": [{
    "type": "function",
    "function": {
      "name": "get_weather",
      "description": "Get current weather for a location",
      "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"]
      }
    },
    "cache_control": {"type": "ephemeral", "ttl": "1h"}
  }]
}
2.5 Checking cache status
The usage object in the response includes claude_cache_tokens_details, which records the cache details:
First request (cache creation):
{
  "usage": {
    "prompt_tokens": 22,
    "completion_tokens": 890,
    "total_tokens": 912,
    "claude_cache_tokens_details": {
      "cache_creation_input_tokens": 6266,
      "cache_read_input_tokens": 0,
      "cache_write_5_minutes_input_tokens": 6266,
      "cache_write_1_hour_input_tokens": 0
    }
  }
}
Subsequent request (cache hit):
{
  "usage": {
    "prompt_tokens": 22,
    "completion_tokens": 810,
    "total_tokens": 832,
    "prompt_tokens_details": {
      "cached_tokens": 6266
    },
    "claude_cache_tokens_details": {
      "cache_creation_input_tokens": 0,
      "cache_read_input_tokens": 6266,
      "cache_write_5_minutes_input_tokens": 0,
      "cache_write_1_hour_input_tokens": 0
    }
  }
}
| Field | Meaning |
|---|---|
| cache_creation_input_tokens | Tokens newly written to the cache by this request |
| cache_read_input_tokens | Tokens read from the cache by this request |
| cache_write_5_minutes_input_tokens | Portion of the written tokens that went to the 5-minute TTL cache |
| cache_write_1_hour_input_tokens | Portion of the written tokens that went to the 1-hour TTL cache |
| prompt_tokens_details.cached_tokens | On a cache hit, the cached token count in OpenAI-compatible format |
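These fields are enough to compute a cache hit ratio per request. The sketch below assumes prompt_tokens excludes cached tokens, which is what the two sample payloads suggest (22 prompt tokens next to 6266 cached tokens) but is not stated explicitly; the function name is hypothetical.

```python
def cache_summary(usage: dict) -> dict:
    """Summarize cache activity from a response's usage object."""
    details = usage.get("claude_cache_tokens_details") or {}
    written = details.get("cache_creation_input_tokens", 0)
    read = details.get("cache_read_input_tokens", 0)
    # Assumption: prompt_tokens counts only the uncached part of the input.
    total_input = usage.get("prompt_tokens", 0) + written + read
    hit_ratio = read / total_input if total_input else 0.0
    return {"written": written, "read": read, "hit_ratio": hit_ratio}


first_usage = {
    "prompt_tokens": 22,
    "claude_cache_tokens_details": {
        "cache_creation_input_tokens": 6266,
        "cache_read_input_tokens": 0,
    },
}
later_usage = {
    "prompt_tokens": 22,
    "claude_cache_tokens_details": {
        "cache_creation_input_tokens": 0,
        "cache_read_input_tokens": 6266,
    },
}
print(cache_summary(first_usage))
print(cache_summary(later_usage))
```

A ratio that stays at zero across repeated requests usually means the cached prefix changed between calls or fell below the minimum cacheable length.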
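3. Passing anthropic-beta in request headers
You can enable beta features of Claude models through the anthropic-beta HTTP header. CaMeL AI passes this header through to the Anthropic API.
Usage
Add anthropic-beta to the request headers, with the identifier of the beta feature as its value: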
curl "https://api.kr777.top/chat/completions" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer sk-***" \
  -H "anthropic-beta: context-1m-2025-08-07" \
  -d '{
    "model": "claude-opus-4-5",
    "messages": [
      {
        "role": "system",
        "content": [
          {"type": "text", "text": "You are an AI assistant"},
          {
            "type": "text",
            "text": "(long context)",
            "cache_control": {"type": "ephemeral"}
          }
        ]
      },
      {"role": "user", "content": [{"type": "text", "text": "hello"}]}
    ]
  }'
See the Anthropic API documentation for the available beta identifiers.
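The same header can be sent from the OpenAI Python SDK used in the earlier examples, through the per-request extra_headers argument. The sketch below is a minimal illustration: the helper names are hypothetical, and joining several identifiers with commas follows Anthropic's header convention, which this gateway is assumed to pass through unchanged.

```python
def beta_headers(*flags: str) -> dict:
    """Build the anthropic-beta header from one or more beta identifiers."""
    if not flags:
        raise ValueError("at least one beta identifier is required")
    return {"anthropic-beta": ",".join(flags)}


def create_with_beta(client, flags, **kwargs):
    """Call chat.completions.create with the anthropic-beta header attached.

    `client` is expected to be an openai.OpenAI instance configured with
    the gateway base_url, as in the examples in section 1.5.
    """
    return client.chat.completions.create(
        extra_headers=beta_headers(*flags),
        **kwargs,
    )


print(beta_headers("context-1m-2025-08-07"))
```

A call would then look like create_with_beta(client, ["context-1m-2025-08-07"], model="claude-opus-4-5", messages=messages). Passing default_headers when constructing the client is an alternative when every request should carry the same beta identifier.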