MCP Evals

Evaluate MCP tools and workflows with Evalite and the AI SDK MCP client.

Overview

Evals help you verify that your MCP tools are called correctly by an LLM. This guide shows how to run tool-call evaluations with Evalite using the AI SDK MCP client.

The approach stays library-agnostic; Evalite is just the example runner. You can adapt the patterns to other evaluation frameworks.

For a real-world example, check out the nuxt.com MCP evals.

Prerequisites

  • An MCP server running locally (e.g., pnpm dev with the module enabled)
  • A model provider API key (AI Gateway, OpenAI, etc.)

Setup

Install dependencies

Install Evalite, Vitest, and the AI SDK packages:

pnpm add -D evalite vitest @ai-sdk/mcp ai

Add eval scripts

Add the following scripts to your package.json:

package.json
{
  "scripts": {
    "eval": "evalite",
    "eval:ui": "evalite watch"
  }
}

Configure environment variables

Create a .env file with your AI provider key and MCP endpoint:

.env
# AI provider (AI Gateway example)
AI_GATEWAY_API_KEY=your_key

# MCP endpoint exposed by your dev server
MCP_URL=http://localhost:3000/mcp

Write your first eval

Create an eval file in your test/ directory:

test/mcp.eval.ts
import { experimental_createMCPClient as createMCPClient } from '@ai-sdk/mcp'
import { generateText } from 'ai'
import { evalite } from 'evalite'
import { toolCallAccuracy } from 'evalite/scorers'

// AI Gateway model format: provider/model-name
const model = 'openai/gpt-4o-mini'
const MCP_URL = process.env.MCP_URL ?? 'http://localhost:3000/mcp'

evalite('BMI Calculator', {
  data: async () => [
    {
      input: 'Calculate BMI for someone who weighs 70kg and is 1.75m tall',
      expected: [{ toolName: 'calculate-bmi', input: { weightKg: 70, heightM: 1.75 } }],
    },
  ],
  task: async (input) => {
    const mcp = await createMCPClient({ transport: { type: 'http', url: MCP_URL } })
    try {
      const result = await generateText({
        model,
        prompt: input,
        tools: await mcp.tools(),
      })
      return result.toolCalls ?? []
    }
    finally {
      await mcp.close()
    }
  },
  scorers: [({ output, expected }) => toolCallAccuracy({ actualCalls: output, expectedCalls: expected })],
})

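The eval above assumes your server exposes a calculate-bmi tool that accepts weightKg and heightM. For reference, a minimal version of that tool written against the official MCP TypeScript SDK could look like the sketch below; your module may provide its own helper for registering tools, so treat the file location, function name, and registration call as assumptions.

server/mcp/tools/calculate-bmi.ts
import type { McpServer } from '@modelcontextprotocol/sdk/server/mcp.js'
import { z } from 'zod'

// Hypothetical sketch: registers a calculate-bmi tool whose name and input schema
// match the expected tool call in the eval above
export function registerCalculateBmi(server: McpServer) {
  server.tool(
    'calculate-bmi',
    { weightKg: z.number(), heightM: z.number() },
    async ({ weightKg, heightM }) => ({
      content: [{ type: 'text', text: String(weightKg / (heightM * heightM)) }],
    }),
  )
}
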
Running Evals

Make sure your MCP server is running first:

pnpm dev

Then run your evals in a separate terminal:

pnpm eval

Or launch the Evalite UI for a visual interface:

pnpm eval:ui

The UI is available at http://localhost:3006 and shows traces, scores, inputs, and outputs for each eval.

Project Structure

We recommend placing eval files in a test/ directory at your project root:

your-project/
├── server/
│   └── mcp/
│       ├── tools/
│       │   └── calculate-bmi.ts
│       ├── resources/
│       └── prompts/
├── test/
│   └── mcp.eval.ts          # Your MCP eval tests
├── nuxt.config.ts
└── package.json

Evalite looks for files with the .eval.ts extension by default.

Writing Effective Evals

Testing Tool Selection

Verify the model picks the correct tool:

test/mcp.eval.ts
evalite('Tool Selection', {
  data: async () => [
    {
      input: 'List all available documentation pages',
      expected: [{ toolName: 'list-pages' }],
    },
    {
      input: 'Show me the installation guide',
      expected: [{ toolName: 'get-page', input: { path: '/getting-started/installation' } }],
    },
  ],
  task: async (input) => {
    const mcp = await createMCPClient({ transport: { type: 'http', url: MCP_URL } })
    try {
      const result = await generateText({
        model,
        prompt: input,
        tools: await mcp.tools(),
      })
      return result.toolCalls ?? []
    }
    finally {
      await mcp.close()
    }
  },
  scorers: [({ output, expected }) => toolCallAccuracy({ actualCalls: output, expectedCalls: expected })],
})

Testing Multi-Step Workflows

For workflows that require multiple tool calls, increase maxSteps:

test/mcp.eval.ts
evalite('Multi-Step Workflows', {
  data: async () => [
    {
      input: 'Find the installation page and show me its content',
      expected: [
        { toolName: 'list-pages' },
        { toolName: 'get-page', input: { path: '/getting-started/installation' } },
      ],
    },
  ],
  task: async (input) => {
    const mcp = await createMCPClient({ transport: { type: 'http', url: MCP_URL } })
    try {
      const result = await generateText({
        model,
        prompt: input,
        tools: await mcp.tools(),
        maxSteps: 5, // Allow multiple tool calls
      })
      return result.toolCalls ?? []
    }
    finally {
      await mcp.close()
    }
  },
  scorers: [({ output, expected }) => toolCallAccuracy({ actualCalls: output, expectedCalls: expected })],
})

Grouping Related Evals

Organize evals by feature or tool category:

test/mcp.eval.ts
// Documentation tools
evalite('Documentation Tools', {
  data: async () => [
    { input: 'List all docs', expected: [{ toolName: 'list-pages' }] },
    { input: 'Get the intro page', expected: [{ toolName: 'get-page' }] },
  ],
  // ...
})

// API tools
evalite('API Tools', {
  data: async () => [
    { input: 'Fetch user data', expected: [{ toolName: 'get-user' }] },
    { input: 'Create a new post', expected: [{ toolName: 'create-post' }] },
  ],
  // ...
})

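The task body is identical across these evals. To keep each eval focused on its data and scorers, you could extract the MCP client setup into a shared helper; the file name and function name below are just suggestions, reusing the same calls shown above:

test/utils.ts
import { experimental_createMCPClient as createMCPClient } from '@ai-sdk/mcp'
import { generateText } from 'ai'

const model = 'openai/gpt-4o-mini'
const MCP_URL = process.env.MCP_URL ?? 'http://localhost:3000/mcp'

// Connects to the MCP server, exposes its tools to the model,
// and returns the resulting tool calls
export async function runWithMcpTools(input: string, maxSteps = 1) {
  const mcp = await createMCPClient({ transport: { type: 'http', url: MCP_URL } })
  try {
    const result = await generateText({
      model,
      prompt: input,
      tools: await mcp.tools(),
      maxSteps,
    })
    return result.toolCalls ?? []
  }
  finally {
    await mcp.close()
  }
}

Each eval's task then becomes a one-liner, e.g. task: input => runWithMcpTools(input, 5) for the multi-step workflow.
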
Tips

  • Keep prompts specific so the model chooses the intended tool
  • Use realistic inputs that match how users phrase requests
  • Start with happy-path cases before adding edge cases
  • Test parameter extraction by including specific values in your prompts (see the example after this list)
  • Run evals before deploying to catch regressions in tool behavior
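
For example, a prompt that embeds concrete values lets the scorer check that the model extracted them into the tool input. This is a hypothetical eval for the BMI tool shown earlier, reusing the runWithMcpTools helper sketched above:

test/mcp.eval.ts
// Hypothetical eval: the prompt contains specific values so the scorer can verify
// they end up in the tool input
evalite('Parameter Extraction', {
  data: async () => [
    {
      input: 'What is the BMI of someone who weighs 82kg and is 1.68m tall?',
      expected: [{ toolName: 'calculate-bmi', input: { weightKg: 82, heightM: 1.68 } }],
    },
  ],
  task: input => runWithMcpTools(input),
  scorers: [({ output, expected }) => toolCallAccuracy({ actualCalls: output, expectedCalls: expected })],
})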