MCP Sampling Explained: How Servers Query LLMs During Tool Execution

MCP sampling lets servers request LLM completions through the client — no API keys required. Here's the full technical breakdown, with the schema, a Python example, and security rules.

Mohammed Kafeel

Machine Learning Researcher

June 24, 2026 · 11 min read

Most MCP tutorials stop at tools and resources. They skip the feature that makes servers genuinely intelligent: sampling.

MCP sampling is the protocol mechanism that lets an MCP server request an LLM completion from the client's AI model - mid-execution, inside a running tool function - without ever holding its own API key. It's what turns a passive tool server into an active reasoning agent.

This guide covers everything: the exact execution flow, the sampling/createMessage JSON schema, a working Python example, the nested agentic loop pattern, security rules, and real use cases. If you're building MCP servers that need to think, this is the page.


What Is MCP Sampling?

MCP sampling is a protocol feature that allows an MCP server to request an LLM completion from the client's AI model, mid-execution, without needing its own API credentials.

The Model Context Protocol (MCP) is an open standard created by Anthropic - specifically by engineers David Soria Parra and Justin Spahr-Summers - and launched in November 2024. As of June 2026, the current spec revision is 2025-11-25, which superseded the 2025-06-18 revision. (New to the protocol? Start with what MCP is.)

MCP's core promise is secure, two-way connections between data sources and AI-powered tools. Sampling is the feature that closes the loop: instead of the server being a dumb executor, it can think by delegating reasoning back to the LLM. (Sampling sits alongside the better-known MCP primitives - tools, resources, and prompts.)

Here's the one-sentence mental model: the server asks the client "hey, what does your LLM think about this?" - and the client answers.

The protocol uses JSON-RPC 2.0 for all communication. The sampling request method is sampling/createMessage. The server sends it; the client receives it, calls its LLM, and returns the result.


How Does the Execution Flow Work?

MCP sampling happens in three distinct phases: server initiation, client processing, and result return. The server's tool function suspends while the client handles the LLM call.

Here's the full 13-step flow, broken into phases:

Phase A - Server Initiation (Inside the Tool Function)

  1. A tool function runs on the server as part of an agent workflow.
  2. The tool logic determines it needs LLM help - analysis, a decision, text generation.
  3. The server invokes the sampling method: ctx.sample(messages, tools, tool_choice).
  4. The server packages the request into an MCP JSON-RPC message: sampling/createMessage.

Phase B - Client Processing & LLM Invocation

  5. The message travels over the configured transport (stdio or Streamable HTTP).
  6. The client receives the request and triggers its user-defined sampling_handler().
  7. The client invokes the actual LLM - GPT-4, Claude, LLaMA, Mistral, or any other model it has access to.
  8. The LLM generates a response: a text completion or a tool-use request block.
  9. Optional human-in-the-loop step: the client may show the completion (including any requested tool call) to the user for approval before proceeding.

Phase C - Result Return & Server Continuation

  10. The client packages the LLM completion into the JSON-RPC response to sampling/createMessage.
  11. The response travels back to the server.
  12. The server's suspended tool coroutine resumes execution.
  13. The tool function uses the LLM-generated result to continue its work.

The key architectural insight: the server's coroutine is suspended at step 3 and only resumes at step 12. From the server's perspective, it's a simple await. All the complexity lives in the client.
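To make the suspend-and-resume behavior concrete, here's a minimal sketch that simulates it with plain asyncio. Nothing here uses a real MCP SDK: fake_client, tool, and the queue-based "transport" are hypothetical stand-ins, and the echo reply replaces an actual LLM call.

```python
import asyncio


async def fake_client(inbox: asyncio.Queue) -> None:
    """Stand-in for the MCP client: take one request, 'call' an LLM, reply."""
    request, reply = await inbox.get()
    await asyncio.sleep(0.01)  # pretend the LLM call takes time
    prompt = request["params"]["messages"][0]["content"]["text"]
    reply.set_result(
        {"role": "assistant", "content": {"type": "text", "text": f"echo: {prompt}"}}
    )


async def tool(inbox: asyncio.Queue, log: list) -> str:
    """Stand-in for a server tool function that needs LLM help mid-run."""
    log.append("tool started")
    reply = asyncio.get_running_loop().create_future()
    request = {
        "method": "sampling/createMessage",
        "params": {
            "messages": [{"role": "user", "content": {"type": "text", "text": "hi"}}],
            "maxTokens": 50,
        },
    }
    await inbox.put((request, reply))
    log.append("tool suspended")  # recorded just before the coroutine parks
    result = await reply          # parked here until the client answers
    log.append("tool resumed")
    return result["content"]["text"]


async def main() -> tuple:
    inbox: asyncio.Queue = asyncio.Queue()
    log: list = []
    client_task = asyncio.create_task(fake_client(inbox))
    text = await tool(inbox, log)
    await client_task
    return text, log
```

Running asyncio.run(main()) returns the echoed text plus a log showing the tool starting, suspending, and resuming in that order. From the tool's side the whole exchange is one await, which is the point of the design.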


The sampling/createMessage Request Schema

The sampling/createMessage request is a standard JSON-RPC 2.0 message with a params object covering messages, model preferences, and generation controls.

Before a client can accept sampling requests, it must declare support during the MCP initialization handshake:

{
  "capabilities": {
    "sampling": {}
  }
}

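For plain sampling, that empty object is intentional - it's a capability flag, not a configuration block. The 2025-11-25 revision also lets clients advertise sampling sub-features inside it, such as tool-use support.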
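On the server side, it's worth checking for that flag before a tool relies on sampling. Here's a small sketch, assuming the server has kept the params object from the client's initialize request as a plain dict; the function name is hypothetical.

```python
def client_supports_sampling(initialize_params: dict) -> bool:
    """Return True when the client declared the sampling capability.

    Only the presence of the key matters; its value is usually an empty object.
    """
    capabilities = initialize_params.get("capabilities", {})
    return "sampling" in capabilities
```

A tool can use this to fall back to rule-based logic, or return a clear error, when the connected client never offered sampling.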

Full Request Example

{
  "jsonrpc": "2.0",
  "id": 1,
  "method": "sampling/createMessage",
  "params": {
    "messages": [
      {
        "role": "user",
        "content": {
          "type": "text",
          "text": "What is the capital of France?"
        }
      }
    ],
    "modelPreferences": {
      "hints": [{ "name": "claude-3-sonnet" }],
      "intelligencePriority": 0.8,
      "speedPriority": 0.5
    },
    "systemPrompt": "You are a helpful assistant.",
    "maxTokens": 100
  }
}

Field-by-Field Breakdown

messages (required, array) The conversation history. Each message has a role (user or assistant) and content. Content supports three types:

  • text - plain string
  • image - base64-encoded data + mimeType (e.g., image/jpeg)
  • audio - base64-encoded data + mimeType (e.g., audio/wav)

modelPreferences (optional, object) This is how servers suggest a model without mandating one. It has two sub-components:

  • hints: an array of model name suggestions (e.g., "claude-3-sonnet"). Treated as substrings - the client may map "claude-3-sonnet" to gemini-1.5-pro if that's what it has. Hints are advisory, not binding.
  • Priority scores (all 0–1 floats):
    • costPriority - higher means prefer cheaper models
    • speedPriority - higher means prefer lower-latency models
    • intelligencePriority - higher means prefer more capable models
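Here's one way a client might turn those preferences into a concrete model choice. This is a sketch, not the spec's algorithm: the available catalog, its 0-to-1 scores, and the weighted-sum fallback are all assumptions made for illustration. The only behavior taken from the description above is that hints are substring matches and the client has the final say.

```python
def pick_model(available: dict, preferences: dict) -> str:
    """Choose a model name from `available` using hints first, then priorities.

    `available` maps model name -> {"cost": x, "speed": y, "intelligence": z},
    each score in 0..1 where higher is better (cheaper, faster, more capable).
    """
    # 1. Hints are substrings, tried in order; the first match wins.
    for hint in preferences.get("hints", []):
        wanted = hint.get("name", "")
        for name in available:
            if wanted and wanted in name:
                return name

    # 2. No hint matched: weight each model's scores by the stated priorities.
    weights = {
        "cost": preferences.get("costPriority", 0.0),
        "speed": preferences.get("speedPriority", 0.0),
        "intelligence": preferences.get("intelligencePriority", 0.0),
    }

    def score(name: str) -> float:
        return sum(weights[key] * available[name][key] for key in weights)

    return max(available, key=score)


# Hypothetical catalog of models this client can reach.
catalog = {
    "claude-3-sonnet-20240307": {"cost": 0.5, "speed": 0.5, "intelligence": 0.7},
    "small-fast-model": {"cost": 0.9, "speed": 0.9, "intelligence": 0.3},
}
```

With this catalog, a "claude-3-sonnet" hint resolves to the dated Claude entry, while an unmatched hint combined with a high speedPriority falls through to small-fast-model.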

systemPrompt (optional, string) The system instruction passed to the LLM. The client may modify or override this.

maxTokens (required, number) Maximum tokens to generate. This is the one required generation parameter.

temperature (optional, number) Controls randomness. Lower = more deterministic. The accepted range depends on the model the client ends up using.

stopSequences (optional, array of strings) The LLM stops generating when it hits one of these strings.

includeContext (optional) One of "none", "thisServer", or "allServers". Controls whether MCP context is injected into the prompt.

metadata (optional, object) Provider-specific pass-through data.
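The field rules above are easy to enforce before a request ever leaves the server. Here's a sketch of a hypothetical helper that assembles the params object and rejects values the breakdown rules out. The non-empty-messages check is a defensive assumption of this sketch, not something quoted from the spec.

```python
ALLOWED_CONTEXT = {"none", "thisServer", "allServers"}
PRIORITY_KEYS = ("costPriority", "speedPriority", "intelligencePriority")


def build_sampling_params(messages, max_tokens, *, system_prompt=None,
                          model_preferences=None, temperature=None,
                          stop_sequences=None, include_context=None) -> dict:
    """Assemble sampling/createMessage params, validating the documented fields."""
    if not messages:
        raise ValueError("messages must contain at least one message")
    for message in messages:
        if message.get("role") not in ("user", "assistant"):
            raise ValueError(f"unsupported role: {message.get('role')!r}")
    if not isinstance(max_tokens, int) or max_tokens <= 0:
        raise ValueError("maxTokens is required and must be a positive integer")
    for key in PRIORITY_KEYS:
        value = (model_preferences or {}).get(key)
        if value is not None and not 0 <= value <= 1:
            raise ValueError(f"{key} must be between 0 and 1")
    if include_context is not None and include_context not in ALLOWED_CONTEXT:
        raise ValueError(f"includeContext must be one of {sorted(ALLOWED_CONTEXT)}")

    params = {"messages": messages, "maxTokens": max_tokens}
    optional = {
        "systemPrompt": system_prompt,
        "modelPreferences": model_preferences,
        "temperature": temperature,
        "stopSequences": stop_sequences,
        "includeContext": include_context,
    }
    # Leave out anything the caller did not set, so the client applies its defaults.
    params.update({key: value for key, value in optional.items() if value is not None})
    return params
```

Failing fast on the server keeps malformed requests from reaching the client, where the only feedback would be a JSON-RPC error.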


The sampling/createMessage Response Schema

The response is a JSON-RPC 2.0 result object containing the LLM's output, the actual model used, and the reason generation stopped.

{
  "jsonrpc": "2.0",
  "id": 1,
  "result": {
    "role": "assistant",
    "content": {
      "type": "text",
      "text": "The capital of France is Paris."
    },
    "model": "claude-3-sonnet-20240307",
    "stopReason": "endTurn"
  }
}

Response Fields

| Field            | Type   | Description                                |
|------------------|--------|--------------------------------------------|
| role             | string | Always "assistant"                         |
| content.type     | string | "text", "image", or "audio"                |
| content.text     | string | The generated text (for text type)         |
| content.data     | string | Base64-encoded data (for image/audio)      |
| content.mimeType | string | MIME type for binary content               |
| model            | string | The actual model used (not the hint)       |
| stopReason       | string | "endTurn", "stopSequence", or "maxTokens"  |

The model field is particularly useful for debugging and auditing - you can see exactly which model the client chose, even if it differed from your hint.

If the user rejects the sampling request, the client returns a JSON-RPC error instead:

{
  "jsonrpc": "2.0",
  "id": 1,
  "error": {
    "code": -1,
    "message": "User rejected sampling request"
  }
}

Your server code needs to handle this gracefully.
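What "gracefully" can look like, as a sketch: the helpers below work on the raw JSON-RPC response as a dict, which is an assumption for illustration (an SDK would normally surface the rejection as an exception of its own). SamplingError and both function names are hypothetical.

```python
class SamplingError(Exception):
    """Raised when the client answers with a JSON-RPC error instead of a result."""

    def __init__(self, code, message):
        super().__init__(f"sampling failed ({code}): {message}")
        self.code = code
        self.message = message


def read_sampling_response(response: dict) -> tuple:
    """Return (text, truncated) from a sampling response, or raise SamplingError."""
    if "error" in response:
        error = response["error"]
        raise SamplingError(error.get("code"), error.get("message", ""))
    result = response["result"]
    content = result["content"]
    if content.get("type") != "text":
        raise ValueError(f"expected text content, got {content.get('type')!r}")
    # stopReason "maxTokens" means the model was cut off mid-answer.
    truncated = result.get("stopReason") == "maxTokens"
    return content["text"], truncated


def analyze_or_fallback(response: dict, fallback: str) -> str:
    """Use the LLM's text when available, otherwise a non-LLM fallback."""
    try:
        text, _ = read_sampling_response(response)
        return text
    except SamplingError:
        return fallback
```

The fallback path matters most for tools that can still do something useful without the LLM, such as returning the unmodified input with a note that analysis was declined.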


Python Code Example: Sampling Inside a Tool

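Here's a minimal example of an MCP server tool handler that uses sampling to request LLM-powered code analysis - no API key on the server side. It targets the official Python SDK's low-level Server class; a full server would also register a list_tools handler and start a transport.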

from mcp import types
from mcp.server import Server

app = Server("code-analyzer-server")

@app.call_tool()
async def analyze_code(name: str, arguments: dict) -> list[types.TextContent]:
    code = arguments.get("code", "")

    system_prompt = (
        "You are an expert Python developer. "
        "Review the provided code, identify issues or inefficiencies, "
        "and return ONLY the improved, refactored code snippet."
    )

    # The low-level Server exposes the active session via the request context
    session = app.request_context.session

    # Server sends sampling request to client's LLM
    result = await session.create_message(
        messages=[
            types.SamplingMessage(
                role="user",
                content=types.TextContent(
                    type="text",
                    text=f"Analyze this code:\n{code}",
                ),
            )
        ],
        system_prompt=system_prompt,
        max_tokens=1000,
    )

    # The client may return non-text content, so check before reading .text
    text = result.content.text if result.content.type == "text" else ""

    # Tool handlers return a list of content blocks
    return [types.TextContent(type="text", text=text)]

Walk through what's happening:

  1. @app.call_tool() registers analyze_code as the server's handler for incoming tool calls.
  2. The function receives code from the tool arguments.
  3. await session.create_message(...) is the sampling call - it suspends the coroutine and sends sampling/createMessage to the client.
  4. The client calls its LLM (GPT-4, Claude, whatever it's configured with) and returns the result.
  5. result.content.text contains the LLM's refactored code, which the tool returns as its output.

The server never touches an API key. It doesn't choose which LLM runs; it only learns the model's name from the response. Everything else it gets back is the generated content.


Sampling vs. Direct LLM Calls: What's the Difference?

Sampling delegates LLM calls to the client; direct calls require the server to manage its own API credentials, model selection, and billing. They're architecturally opposite approaches.

| Dimension         | MCP Sampling                            | Direct LLM Call (from server)                |
|-------------------|-----------------------------------------|----------------------------------------------|
| API key location  | Client owns it                          | Server needs its own key                     |
| Model selection   | Client decides (with server hints)      | Server decides                               |
| Human oversight   | Built-in (client can pause for approval)| None by default                              |
| Model flexibility | Works with any LLM the client uses      | Locked to server's chosen provider           |
| Audit trail       | Every call goes through client          | Server-side only                             |
| Server complexity | Low - server stays lightweight          | High - server manages auth, retries, billing |
| Nested tool loops | Native support                          | Requires custom implementation               |
| Cost attribution  | Client's account                        | Server's account                             |

The bottom line: use sampling when you want your server to be portable, auditable, and model-agnostic. Use direct calls only when the server genuinely needs to own its LLM relationship - for example, a dedicated inference service with specialized fine-tuned models.


Human-in-the-Loop: Why the Client Sits in the Middle

The client's position between server and LLM isn't just architectural - it's a deliberate safety mechanism. The MCP spec states that there SHOULD always be a human in the loop with the ability to deny sampling requests.

This is the human-in-the-loop MCP design, and it matters for agentic workflows where a single user action might trigger dozens of LLM calls across multiple tools. (We cover the broader approval patterns in human-in-the-loop MCP workflows.)

Here's what the spec says applications SHOULD do:

  • Provide UI that makes reviewing sampling requests easy and intuitive
  • Allow users to view and edit prompts before they're sent to the LLM
  • Present generated responses for review before delivering them to the server

In practice, this means a well-built MCP client might show a dialog: "The code-analyzer server wants to send this prompt to Claude. Approve?" The user can read the prompt, edit it, approve it, or reject it entirely. (Where sampling pulls reasoning from the LLM, MCP elicitation is the mirror primitive that pulls structured input from the user.)

If the user rejects it, the client sends back a JSON-RPC error ("User rejected sampling request"), and the server's tool function needs to handle that case.

Why does this matter? In a complex MCP agentic workflow, a server might make 20 LLM calls in a single task. Without human-in-the-loop controls, a malicious or buggy server could exfiltrate data, generate harmful content, or rack up enormous API costs - all without the user knowing. The client-as-gatekeeper pattern prevents this.
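The gatekeeper pattern fits in a few lines. In this sketch, approve stands in for whatever review UI the client shows and call_llm stands in for the real model call; both are hypothetical callables, not SDK functions. approve returns the (possibly edited) params, or None when the user rejects.

```python
REJECTION = {"code": -1, "message": "User rejected sampling request"}


def gate_sampling_request(request: dict, approve, call_llm) -> dict:
    """Run one sampling request through a human approval step.

    approve(params)  -> edited params dict, or None to reject
    call_llm(params) -> result dict (role, content, model, stopReason)
    """
    params = approve(request["params"])
    if params is None:
        # Nothing reaches the LLM; the server only sees the error.
        return {"jsonrpc": "2.0", "id": request["id"], "error": dict(REJECTION)}
    result = call_llm(params)
    return {"jsonrpc": "2.0", "id": request["id"], "result": result}
```

Because approve returns params rather than a yes/no, the same hook covers both approving as-is and editing the prompt before it is sent.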


The Nested Agentic Loop Pattern

MCP sampling supports full agentic loops where the server sends a sampling request, the LLM requests tool calls, the server executes them, and the cycle repeats until the LLM stops.

This is where Model Context Protocol sampling gets genuinely powerful. Here's the loop:

  1. Server sends a sampling/createMessage request with tool definitions included.
  2. LLM returns a ToolUseContent block - it wants to call a tool, not generate text yet.
  3. Server executes the requested tool and creates a ToolResultContent block.
  4. Server sends a new sampling/createMessage with the tool result appended to the message history.
  5. LLM processes the tool result and either requests another tool or returns a final text response.
  6. Loop ends when the LLM returns text with no further tool calls.

This is the same pattern as OpenAI's function calling or Anthropic's tool use - but mediated through the MCP protocol, with the client controlling every LLM invocation. (Scale this across several agents and you're into multi-agent workflows, where a supervisor coordinates many such loops.)

Practical example: a research agent tool that:

  • Calls a web_search tool to find relevant pages
  • Asks the LLM to identify which pages are worth reading
  • Calls a fetch_page tool for each selected URL
  • Asks the LLM to synthesize the content into a summary
  • Returns the final summary to the user

Each LLM call in that chain is a separate sampling/createMessage request. The client sees every one of them. The user can inspect any of them.
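The loop itself is short once the round trip is abstracted away. Here's a sketch in which sample is a hypothetical callable standing in for one sampling/createMessage round trip, and the tool_use / tool_result dict shapes are simplified assumptions rather than the exact spec types. scripted_llm is a fake model used only to drive the demo.

```python
def run_agent_loop(sample, tools: dict, prompt: str, max_rounds: int = 5) -> str:
    """Alternate between LLM turns and tool executions until the LLM returns text."""
    messages = [{"role": "user", "content": {"type": "text", "text": prompt}}]
    for _ in range(max_rounds):
        reply = sample(messages)  # one sampling round trip via the client
        content = reply["content"]
        messages.append({"role": "assistant", "content": content})
        if content["type"] != "tool_use":
            return content["text"]  # plain text ends the loop
        # The LLM asked for a tool: run it and feed the result back.
        output = tools[content["name"]](**content["input"])
        messages.append({
            "role": "user",
            "content": {"type": "tool_result", "toolUseId": content["id"],
                        "text": str(output)},
        })
    raise RuntimeError("LLM did not finish within max_rounds")


def scripted_llm(messages: list) -> dict:
    """Fake LLM: request one search, then summarize the tool result."""
    last = messages[-1]["content"]
    if last["type"] == "tool_result":
        return {"role": "assistant",
                "content": {"type": "text", "text": f"Found: {last['text']}"}}
    return {"role": "assistant",
            "content": {"type": "tool_use", "id": "call-1", "name": "web_search",
                        "input": {"query": "mcp sampling"}}}


demo_tools = {"web_search": lambda query: f"3 pages about {query}"}
summary = run_agent_loop(scripted_llm, demo_tools, "Research MCP sampling")
```

The max_rounds cap is a design choice worth keeping in real servers: without it, a model that keeps requesting tools would loop, and bill the client, indefinitely.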


Security Considerations

MCP sampling has a clear security model: clients validate and gate all LLM access; servers treat the client as a trusted intermediary and never bypass it.

The spec is explicit. Here are the rules, with their RFC-style obligation levels:

Client obligations:

  • SHOULD validate the content of sampling requests before forwarding them to the LLM
  • SHOULD respect model preference hints while retaining final control over model selection
  • SHOULD implement rate limiting on sampling requests to prevent abuse
  • SHOULD present sampling requests to users for review when appropriate
  • MUST handle sensitive data appropriately

Server obligations:

  • SHOULD NOT include sensitive data in sampling requests unless strictly necessary
  • SHOULD treat model preferences as suggestions, not requirements
  • MUST handle error responses (including user rejections) gracefully

The big picture: the client is the security boundary. A server can request an LLM call, but it can never force one. The client decides whether to proceed, which model to use, and whether to show the user first.

This architecture means a compromised or malicious server cannot silently exfiltrate data through LLM calls - every call goes through the client, which can log, rate-limit, or block it.
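Rate limiting is the easiest of these controls to sketch. Below is a sliding-window limiter a client could consult before forwarding each sampling request. The class name, the per-server keying, and the injectable clock are choices made for this example, not anything the spec prescribes.

```python
import time
from collections import deque


class SamplingRateLimiter:
    """Allow at most `max_calls` sampling requests per server per time window."""

    def __init__(self, max_calls: int, window_seconds: float, clock=time.monotonic):
        self._max_calls = max_calls
        self._window = window_seconds
        self._clock = clock  # injectable so tests can control time
        self._calls = {}     # server name -> deque of request timestamps

    def allow(self, server: str) -> bool:
        now = self._clock()
        calls = self._calls.setdefault(server, deque())
        # Drop timestamps that have aged out of the window.
        while calls and now - calls[0] >= self._window:
            calls.popleft()
        if len(calls) >= self._max_calls:
            return False
        calls.append(now)
        return True
```

A client would call allow(server_name) when a sampling/createMessage request arrives and return an error to the server instead of invoking the LLM when it comes back False.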


Real-World Use Cases

MCP sampling is the right tool whenever a server-side workflow needs dynamic reasoning that can't be pre-programmed. Here are five concrete scenarios:

1. Code Analysis & Refactoring

A code-analyzer server (like the Python example above) receives raw code, sends it to the LLM via sampling, and returns improved code. The server doesn't need to know anything about Python best practices - the LLM does.

2. Research Agent with Web Summarization

A research agent fetches web pages via HTTP tools, then uses sampling to ask the LLM: "Which of these three pages is most relevant to the user's question?" The LLM's judgment drives the next fetch.

3. Data Pipeline Classification

A data pipeline server processes thousands of records. For ambiguous entries that rule-based logic can't classify, it fires a sampling/createMessage to get the LLM's classification. No separate ML infrastructure is needed, and only the ambiguous records incur an LLM call.
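A sketch of that rules-first, LLM-second pattern: ask_llm is a hypothetical callable that would wrap a sampling request and return the model's text, and the labels and rule shown in the usage note are made up for illustration.

```python
def classify(record: str, rules, ask_llm, labels: set) -> tuple:
    """Return (label, source), where source records how the label was produced."""
    # Cheap path: the first matching rule wins and no LLM call is made.
    for predicate, label in rules:
        if predicate(record):
            return label, "rule"
    # Ambiguous record: delegate to the LLM through sampling.
    prompt = (f"Classify this record as one of {sorted(labels)}. "
              f"Reply with the label only.\n{record}")
    answer = ask_llm(prompt).strip().lower()
    if answer not in labels:
        return "unknown", "llm-invalid"  # never trust free-form output blindly
    return answer, "llm"
```

For example, with rules = [(lambda r: "refund" in r.lower(), "billing")] and labels = {"billing", "technical"}, a refund request is labeled by the rule, while "the app crashes on launch" goes to the LLM. Keeping the source alongside the label makes it easy to audit how many records needed sampling.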

4. Customer Support Draft Generation

A support server retrieves relevant knowledge base entries via a vector search tool, then uses sampling to ask the LLM to draft a response. The human agent reviews the draft before sending. Classic human-in-the-loop.

5. Document Processing & Structured Extraction

A document server receives PDFs, extracts text, and uses sampling to ask the LLM to output structured JSON (invoice fields, contract clauses, medical codes). The server validates the JSON schema and returns clean data.
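The validation step deserves a concrete shape, since LLM output is just text until proven otherwise. Here's a sketch using only the standard library; the invoice field names are hypothetical, and a production server would more likely lean on a schema library.

```python
import json


def parse_invoice(llm_text: str,
                  required=("invoice_number", "total", "currency")) -> dict:
    """Parse LLM output as JSON and check the fields the pipeline depends on."""
    try:
        data = json.loads(llm_text)
    except json.JSONDecodeError as exc:
        raise ValueError(f"LLM did not return valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = [field for field in required if field not in data]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    total = data["total"]
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(total, bool) or not isinstance(total, (int, float)):
        raise ValueError("total must be a number")
    return data
```

When validation fails, the server can retry the sampling request with the error message appended to the prompt, or surface the failure to the caller instead of returning malformed data.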

In every case, the pattern is the same: the server handles I/O and orchestration; the LLM handles reasoning; the client handles trust.


FAQ

What is MCP sampling in simple terms?

MCP sampling is the feature that lets an MCP server ask "what does the LLM think?" during a tool execution - without the server needing its own API key. The server sends a sampling/createMessage request to the client, the client calls its LLM, and the result comes back to the server.

Does the MCP server need an LLM API key to use sampling?

No. That's the whole point. The server never touches an API key. The client owns all LLM credentials and makes the actual API call. The server just sends a JSON-RPC request and waits for the response.

What is sampling/createMessage and how does it work?

sampling/createMessage is the JSON-RPC 2.0 method that triggers an LLM completion request. The server sends it with a messages array, optional modelPreferences, a systemPrompt, and a maxTokens limit. The client receives it, optionally shows it to the user for approval, calls the LLM, and returns the result in the matching JSON-RPC response.

Can the server choose which LLM model to use?

Not directly. The server can suggest a model via modelPreferences.hints (e.g., "claude-3-sonnet") and express priorities for cost, speed, and intelligence. But the client makes the final model selection. If the client doesn't have Claude, it might map the hint to a comparable model like gemini-1.5-pro.

What does "human-in-the-loop MCP" mean in practice?

It means the client can pause a sampling request and show it to the user before sending it to the LLM. The user can read the prompt, edit it, approve it, or reject it. If rejected, the client returns an error to the server. This is a core safety mechanism for agentic workflows where servers might make many LLM calls.

What happens if the LLM requests a tool call during sampling?

The server receives a ToolUseContent block in the response instead of text. It executes the requested tool, appends the result to the message history, and sends a new sampling/createMessage request. This loop continues until the LLM returns a plain text response - that's the nested agentic loop pattern.

