diff --git a/README.md b/README.md index bccf9b2..03158e8 100644 --- a/README.md +++ b/README.md @@ -1,618 +1,386 @@ # WebMCP 🧪 -_Enabling web apps to provide JavaScript-based tools that can be accessed by AI agents and assistive technologies to create collaborative, human-in-the-loop workflows._ +WebMCP lets developers expose web application functionality—either JavaScript functions or HTML `<form>` elements—as "tools" with natural language descriptions and structured schemas, designed for AI agent ingestion. These tools can be invoked by AI agents, including those built into the browser, hosted in iframes, or running in extensions to actuate web content that was traditionally designed for human interaction. -> First published August 13, 2025 -> -> Brandon Walderman <brwalder@microsoft.com>
-> Leo Lee <leo.lee@microsoft.com>
-> Andrew Nolan <annolan@microsoft.com>
-> David Bokan <bokan@google.com>
-> Khushal Sagar <khushalsagar@google.com>
-> Hannah Van Opstal <hvanopstal@google.com> - -## TL;DR - -We propose a new JavaScript interface that allows web developers to expose their web application functionality as "tools" - JavaScript functions with natural language descriptions and structured schemas that can be invoked by AI agents, browser assistants, and assistive technologies. Web pages that use WebMCP can be thought of as [Model Context Protocol (MCP)](https://modelcontextprotocol.io/introduction) servers that implement tools in client-side script instead of on the backend. WebMCP enables collaborative workflows where users and agents work together within the same web interface, leveraging existing application logic while maintaining shared context and user control. - -For the technical details of the proposal, code examples, API shape, etc. see [proposal.md](./docs/proposal.md). - -## Terminology Used - -###### Agent -An autonomous assistant that can understand a user's goals and take actions on the user's behalf to achieve them. Today, -these are typically implemented by large language model (LLM) based AI platforms, interacting with users via text-based -chat interfaces. - -###### Browser's Agent -An autonomous assistant as described above but provided by or through the browser. This could be an agent built directly -into the browser or hosted by it, for example, via an extension or plug-in. - -###### AI Platform -Providers of agentic assistants such as OpenAI's ChatGPT, Anthropic's Claude, or Google's Gemini. - -###### Backend Integration -A form of API integration between an AI platform and a third-party service in which the AI platform can talk directly to -the service's backend servers without a UI or running code in the client. For example, the AI platform communicating with -an MCP server provided by the service. - -###### Actuation -An agent interacting with a web page by simulating user input such as clicking, scrolling, typing, etc. 
## Background and Motivation -The web platform's ubiquity and popularity have made it the world's gateway to information and capabilities. Its ability to support complex, interactive applications beyond static content, has empowered developers to build rich user experiences and applications. These user experiences rely on visual layouts, mouse and touch interactions, and visual cues to communicate functionality and state. - -As AI agents become more prevalent, the potential for even greater user value is within reach. AI platforms such as Copilot, ChatGPT, Claude, and Gemini are increasingly able to interact with external services to perform actions such as checking local weather, finding flight and hotel information, and providing driving directions. These functions are provided by external services that extend the AI model’s capabilities. These extensions, or “tools”, can be used by an AI to provide domain-specific functionality that the AI cannot achieve on its own. Existing tools integrate with each AI platform via bespoke “integrations” - each service registers itself with the chosen platform(s) and the platform communicates with the service via an API (MCP, OpenAPI, etc). In this document, we call this style of tool a “backend integration”; users make use of the tools/services by chatting with an AI, the AI platform communicates with the service on the user's behalf. +The web platform is the world's largest gateway to information and capabilities. Today, user experiences rely on visual layouts, mouse and touch interactions, and visual cues to communicate functionality and state, but as AI agents become prevalent, the potential for even greater user value is within reach. The motivation of WebMCP is to provide a lightweight way to adapt web content for use by AI agents. -Much of the challenges faced by assistive technologies also apply to AI agents that struggle to navigate existing human-first interfaces when agent-first "tools" are not available. 
Even when agents succeed, simple operations often require multiple steps and can be slow or unreliable. -The web needs web developer involvement to thrive. What if web developers could easily provide their site's capabilities to the agentic web to engage with their users? We propose WebMCP, a JavaScript API that allows developers to define tools for their webpage. These tools allow for code reuse with frontend code, maintain a single interface for users and agents, and simplify auth and state where users and agents are interacting in the same user interface. Such an API would also be a boon for accessibility tools, enabling them to offer users higher-level actions to perform on a page. This would mark a significant step forward in making the web more inclusive and actionable for everyone. +### Backend Integrations vs. In-browser WebMCP Tools -AI agents can integrate in the backend via protocols like MCP in order to fulfill a user's task. For a web developer to expose their site's functionality this way, they need to write a server, usually in Python or NodeJS, instead of frontend JS which may be more familiar. +AI platforms such as Copilot, ChatGPT, Claude, and Gemini are increasingly able to interact with external services to perform actions such as checking local weather, finding flight and hotel information, and providing driving directions. This is facilitated by "tools" that external services provide to extend the AI model’s capabilities, and give the AI domain-specific functionality that it cannot obtain on its own. -There are several advantages to using the web to connect agents to services: +External tools integrate with each AI platform via bespoke **backend integrations**, such as [Model Context Protocol](https://modelcontextprotocol.io/) or [OpenAPI](https://www.openapis.org/). A service registers its tools with an AI platform, and the platform communicates directly with the service's backend servers via an API. 
In this document, we call this style of tool a “backend integration”; users make use of the tools by chatting with an AI, and the AI platform communicates with the service on the user's behalf. -* **Businesses near-universally already offer their services via the web.** - - WebMCP allows them to leverage their existing business logic and UI, providing a quick, simple, and incremental - way to integrate with agents. They don't have to re-architect their product to fit the API shape of a given agent. - This is especially true when the logic is already heavily client-side. - - -* **Enables visually rich, cooperative interplay between a user, web page, and agent with shared context.** - - Users often start with a vague goal which is refined over time. Consider a user browsing for a high-value purchase. - The user may prefer to start their journey on a specific page, ask their agent to perform some of the more tedious - actions ("find me some options for a dress that's appropriate for a summer wedding, preferably red or orange, short - or no sleeves and no embellishments"), and then take back over to browse among the agent-selected options. - -* **Allows authors to serve humans and agents from one source** +Backend integrations work well for server-side actions, but they pose significant challenges for interactive web applications: - The human-use web is not going away. Integrating agents into it prevents fragmentation of their service and allows - them to keep ownership of their interface, branding and connection with their users. +- **UI Disintermediation & Context Loss**: Backend integrations take place directly between the agent and the service, bypassing the service's web UI / browser experience. +- **Replication of State & Auth**: Web developers must replicate the user's state, active context, and authentication credentials on a separate server. 
+- **Developer Burden**: Exposing a site's client-side capabilities requires writing a dedicated backend server, rather than reusing familiar client-side JavaScript. -WebMCP is a proposal for a web API that enables web pages to provide agent-specific paths in their UI. With WebMCP, agent-service interaction takes place _via app-controlled UI_, providing a shared context available to app, agent, and user. In contrast to backend integrations, WebMCP tools are available to an agent only once it has loaded a page and they execute on the client. Page content and actuation remain available to the agent (and the user) but the agent also has access to tools which it can use to achieve its goal more directly. +**WebMCP** introduces a client-side alternative. It allows web developers to define tools directly in the browser page's script. This enables visually rich, cooperative interplay between a user, a web page, and an agent with shared context. Page UI and content remain available to the agent for actuation, but the agent can use WebMCP tools to achieve the user's goals more directly, reliably, and quickly, as the tools are in a format more suited to the agent. +#### WebMCP In-browser tool flow ![A diagram showing an agent communicating with a third-party service via WebMCP running in a live web page](./content/explainer_webmcp.png) -In contrast, in a backend integration, the agent-service interaction takes place directly, without an associated UI. If -a UI is required it must be provided by the agent itself or somehow connected to an existing UI manually: - -![A diagram showing an agent communicating with a third-party service directl via MCP](./content/explainer_mcp.png) - -## Goals - -- **Enable human-in-the-loop workflows**: Support cooperative scenarios where users work directly through delegating tasks to AI agents or assistive technologies while maintaining visibility and control over the web page(s). 
-- **Simplify AI agent integration**: Enable AI agents to be more reliable and helpful by interacting with web sites through well-defined JavaScript tools instead of through UI actuation. -- **Minimize developer burden**: Any task that a user can accomplish through a page's UI can be made into a tool by re-using much of the page's existing JavaScript code. -- **Improve accessibility**: Provide a standardized way for assistive technologies to access web application functionality beyond what's available through traditional accessibility trees which are not widely implemented. - -## Non-Goals - -- **Headless browsing scenarios**: While it may be possible to use this API for headless or server-to-server interactions where no human is present to observe progress, this is not a current goal. Headless scenarios create many questions like the launching of browsers and profile considerations. -- **Autonomous agent workflows**: The API is not intended for fully autonomous agents operating without human oversight, or where a browser UI is not required. This task is likely better suited to existing protocols like [A2A](https://a2aproject.github.io/A2A/latest/). -- **Replacement of backend integrations**: WebMCP works with existing protocols like MCP and is not a replacement of existing protocols. -- **Replace human interfaces**: The human web interface remains primary; agent tools augment rather than replace user interaction. -- **Enable / influence discoverability of sites to agents** - -## Use Cases - -The use cases for WebMCP are ones in which the user is collaborating with the agent, rather than completely -delegating their goal to it. They can also be helpful where interfaces are highly specific or complicated. - -### Example - Creative - -_Jen wants to create an invitation to her upcoming yard sale so she uses her browser to navigate to -`http://easely.example`, her favorite graphic design platform. 
However, she's rather new to it and sometimes struggles -to find all the functionality needed for her task in the app's extensive menus. She creates a "yard sale flyer" design -and opens up a "templates" panel to look for a premade design she likes. There's so many templates and she's not sure -which to choose from so she asks her browser agent for help._ - -**Jen**: Show me templates that are spring themed and that prominently feature the date and time. They should be on a -white background so I don't have to print in color. - -_The current document has registered a WebMCP tool that the agent notices may be relevant to this query:_ - -```js -/** - * Filters the list of templates based on a description. - * - * description - A visual description of the types of templates to show, in natural language (English). - */ - filterTemplates(description) -``` - -_The agent invokes the tool: `filterTemplate("spring themed, date and time displayed prominently, white background")`. -The UI updates to show a filtered list matching this description._ - -**Agent**: Ok, the remaining templates should now match your description. - -_Jen picks a template and gets to work._ - -_The agent notices a new tool was registered when the design was loaded:_ - -```js -/** - * Makes changes to the current design based on instructions. Possible actions include modifications to text - * and font; insertion, deletion, transformation of images; placement and scale of elements. The instructions - * should be limited a single task. Here are some examples: - - * editDesign("Change the title's font color to red"); - * editDesign("Rotate each picture in the background a bit to give the design a less symmetrical feel"); - * editDesign("Add a text field at the bottom of the design that reads 'example text'"); - * - * instructions - A description of how the design should be changed, in natural language (English). 
- */ - editDesign(instructions) -``` - -_With all the context of Jen's prompts, page state, and this editDesign tool, the agent is able to make helpful -suggestions on next steps:_ - -**Agent**: Would you like me to make the time/date font larger? - -**Jen**: Sure. Could you also swap out the clipart for something more yard-sale themed? - -**Agent**: Sure, let me do that for you. - -**Jen**: Please fill in the time and place using my home address. The time should be in my e-mail in a message from my -husband. - -**Agent**: Ok, I've found it - I'll fill in the flyer with Aug 5-8, 2025 from 10am-3pm | 123 Queen Street West. - -_Jen is almost happy with the current design but think the heading could be better_ - -**Jen**: Help me come up with a more attention grabbing headline for the call to action and title. - -**Agent**: Of course! Here are some more attention-grabbing headlines for your yard sale flyer, broken down by title and -call to action: - -To Create Excitement: - * Yard Sale Extravaganza! - * The Ultimate Clear-Out Sale - * Mega Garage & Yard Sale - -... - -**Jen**: Lets use "Yard Sale Extravaganza!" as the title. Create copies of this page with each of the call to action -suggestions. - -_The agent takes this action using a sequence of tool calls which might look something like:_ - -* `EditDesign("Change the title text to 'Yard Sale Extravaganza!'")` -* `EditDesign("Change the call-to-action text to 'The hunt is on!'")` -* `AddPage("DUPLICATE")` -* `EditDesign("Change the call-to-action text to 'Ready, set, shop!'")` -* `AddPage("DUPLICATE")` -* `EditDesign("Change the call-to-action text to 'Come for the bargains, stay for the cookies'")` - -_Jen now has 3 versions of the same yard sale flyer. Easely implements these WebMCP tools using AI-based techniques on -their backend to allow a natural language interface. 
Additionally, the UI presents these changes to Jen as an easily -reversible batch of "uncommitted" changes, allowing her to easily review the agent's actions and make changes or undo as -necessary. While the site could also implement a chat interface to expose this functionality with their own agent, the -browser's agent provides a seamless journey by using tools across multiple sites/services. For example, pulling up -information from the user's email service._ - -**Agent**: Done! I've created three variations of the original design, each with a unique call to action. - -_Jen is now happy with these flyers. Normally she'd print to PDF and then take the file to a print shop. However, Easely -has a new print service that Jen doesn't know about and doesn't notice in the UI. However, the agent knows the page has -an `orderPrints` tool:_ - -```js -/** - * Orders the current design for printing and shiping to the user. - * - * copies - A number between 0 and 1000 indicating how many copies of the design to print. Required. - * page_size - The paper type to use. Available options are [Legal, Letter, A4, A5]. Default is "Letter". - * page_finish - What kind of paper finish to use. Available options are [Regular, Glossy Photo, Matte Photo]. - * Default is "Regular" - */ -orderPrints(copies, page_size, page_finish); -``` - -_The agent understands the user's intent and so surfaces a small chip in it's UI:_ - -**Agent**: `` - -_Jen is delighted she saved a trip to the store and clicks the button_. - -**Agent**: How many copies would you like? I'll request 8.5x11 sized regular paper but there are other options available. - -**Jen**: Please print 10 copies. - -**Agent**: Done! The order is ready for your review. 
- -_The site navigates to the checkout page where Jen can review the order and click the checkout button to confirm._ - +#### Direct backend MCP flow +![A diagram showing an agent communicating with a third-party service directly via MCP](./content/explainer_mcp.png) -### Example - Shopping +Many challenges faced by assistive technology also apply to AI agents that struggle to navigate existing human-first interfaces when agent-first "tools" are not available. Even when agents succeed, simple operations often require multiple steps and can be slow or unreliable. -Shopping often includes exploration, browsing, and reviewing. The agent can help and remove the tedious parts but the -user wants to be involved. +Web pages that use WebMCP can be thought of as in-page [Model Context Protocol (MCP)](https://modelcontextprotocol.io/introduction) servers that implement tools exposing client-side logic and DOM interaction rather than server-side APIs. WebMCP enables collaborative workflows where users and agents work together within the same web interface, leveraging existing application logic while maintaining shared context and user control. -_Maya has a few favorite clothing vendors but doesn't get formal clothes very often so she opens her AI assistant_ +### Existing web actuation techniques -**Maya**: Suggest a few formal or semi-formal dress places where I can shop for my friend’s wedding. Please make sure -they are ecofriendly and not too expensive. - -**Agent**: Sure, here's a few suggestions - -1. Elara - uses innovative recycled and plant-based fabrics to minimize waste for a modern, chic wardrobe that allows - you to embrace the latest trends, responsibly. Price Range: Dresses typically \$90 - \$220 CAD +One of the scenarios we want to enable is making the web more accessible to general-purpose AI-based agents. 
In the absence of alternatives like MCP servers to accomplish their goals, these general-purpose agents often rely on observing the browser state through a combination of screenshots and DOM and accessibility tree snapshots, and then interact with the page by simulating human user input. We believe that WebMCP will give these agents an alternative means to interact with the web that gives the web developer more control over whether and how an AI-based agent interacts with their site. -2. Linden Harbour - creates wonderfully crafted womenswear designed to last a lifetime. Carries everything from classic -shirt dresses to elegant knitwear, meticulously made from the finest organic cotton, linen, and responsibly sourced -wool. Price Range: Dresses typically \$250 - \$550 CAD +The proposed API will not conflict with these existing automation techniques. If an agent or assistive tool finds that the task it is trying to accomplish is not achievable through the WebMCP tools that the page provides, then it can fall back to general-purpose browser automation to try to accomplish its task. -3. Wildebloom - Flowing dresses brought to life through artisanal collaborations, using natural plant dyes and panels of - upcycled vintage textiles. Price Range: Dresses typically \$180 - \$450 CAD -**Maya**: Lets take a look at Wildebloom. +## Goals & Non-Goals -_The agent now opens Maya's web browser, which it is integrated with. i.e. the agent can observe and control the -browser. It navigates to the dresses page on `http://wildebloom.example/shop`_ +### Goals -**Agent**: Ok, here are the dresses that Wildebloom carries. +- **Enable human-in-the-loop workflows**: Support cooperative scenarios where users delegate tasks to AI agents while maintaining visibility, history, and control over web pages.
+- **Simplify AI agent integration**: Enable AI agents to be more reliable and helpful by interacting with web sites through well-defined client-side tools instead of through brittle UI actuation (DOM scraping, simulated clicks). +- **Prevent web content disintermediation**: Prevent disintermediation of web apps by backend integrations by adapting front-ends for use by agents, rather than replacing them. +- **Code reuse**: Any task that a user can accomplish through a page's UI can be turned into a tool by reusing much of the page's existing client-side code. +- **Improve accessibility through agents**: Enable agents to assist users of accessibility technology. WebMCP itself is not designed for ingestion by accessibility technology, nor is it designed to interact directly with a page's accessibility tree; rather, it enables agents to act as highly capable intermediaries (see [Issue #91](https://github.com/webmachinelearning/webmcp/issues/91)). -_Maya is immediately overwhelmed. There are so many options! Moreover, when she looks at filters she sees they're -quite limited with only colour and size as options._ +### Non-Goals -**Maya**: Show me only dresses available in my size, and also show only the ones that would be appropriate for a -cocktail-attire wedding. +- **Headless browsing scenarios**: While it may be possible to run these tools in headless environments, this API is primarily designed for local browser workflows with a human in the loop. +- **Fully autonomous workflows**: The API is not intended for fully autonomous agents operating without human oversight or where a browser UI is not present. +- **Replacement of backend integrations**: WebMCP is designed to complement, not replace, existing backend-focused protocols like MCP. +- **Replacement of human interfaces**: The human web interface remains primary; agent tools augment rather than replace user interaction. 
-_The agent notices the dresses page registers several tools:_ -```js -/* - * Returns an array of product listings containing an id, detailed description, price, and photo of each - * product - * - * size - optional - a number between 2 and 14 to filter the results by EU dress size - * size - optional - a color from [Red, Blue, Green, Yellow, Black, White] to filter dresses by - */ -getDresses(size, color) - -/* - * Displays the given products to the user - * - * product_ids - An array of numbers each of which is a product id returned from getDresses - */ -showDresses(product_ids) -``` - -_The agent calls `getDresses(6)` and receives a JSON object:_ +## Use Cases -```json -{ +WebMCP enables cooperative workflows where the user collaborates with the agent rather than completely delegating their goal to it. + +### Creative & Graphic Design + +Jen wants to create a yard sale flyer on `https://easely.example`. She wants to filter templates and make visual edits. Instead of navigating menus, she interacts with her browser's agent: +- **Jen**: "Show me templates that are spring themed and that prominently feature the date and time. They should be on a white background so I don't have to print in color." +- The website has already registered the following tool: + ```js + navigator.modelContext.registerTool({ + name: "filter-templates", + description: "Filters the list of templates based on a natural language visual description.", + inputSchema: { + type: "object", + properties: { + description: { type: "string", description: "A visual description of templates to show." } + }, + required: ["description"] + }, + execute({ description }) { + filterTemplatesInUI(description); + } + }); + ``` +- The agent invokes the `filter-templates` tool, and the UI instantly updates to show matching layouts. +- Once Jen selects a template, the agent notices another tool that was dynamically registered: `edit-design(instructions)`.
+- **Jen**: "Please fill in the time and place using my home address. The time should be in my e-mail in a message from my husband." +- **Agent**: "Ok, I've found it—I'll fill in the flyer with: *Aug 5-8, 2025 from 10am-3pm | 123 Queen Street West*. Would you like me to make the date font larger and swap out the clipart for yard-sale illustrations?" +- **Jen**: "Yes, please. Also, let's use 'Yard Sale Extravaganza!' as the title, and create duplicate pages comparing different calls to action." +- The agent automates this by executing a sequence of tool calls to `edit-design`. The graphic design page applies these edits as a batch of "uncommitted" changes in the UI, allowing Jen to review or adjust them. +- **Agent**: "Done! I've created three variations of your design, each with a unique call to action." +- **Jen is ready to finalize the flyers**. Normally, she would export a PDF and find a local print shop. However, the page has also registered an `order-prints` tool: + ```js + navigator.modelContext.registerTool({ + name: "order-prints", + description: "Orders the current design for printing and shipping to the user.", + inputSchema: { + type: "object", + properties: { + copies: { type: "number", description: "Number of copies between 1 and 1000." }, + pageSize: { type: "string", enum: ["Letter", "Legal", "A4"], default: "Letter" } + }, + required: ["copies"] + }, + execute({ copies, pageSize }) { + initiatePrintCheckout(copies, pageSize); + } + }); + ``` +- Spotting this tool, the agent offers to help and surfaces an inline print option. Jen specifies she wants 10 copies, and the agent executes the tool, automatically navigating the browser tab to the secure checkout page where Jen can complete the order with a single click. + +### E-Commerce & Tailored Shopping + +Maya is shopping for dresses on `http://wildebloom.example/shop`. 
+- **Maya**: "Show me only dresses available in my size, and also show only the ones that would be appropriate for a cocktail-attire wedding." +- The page has already registered tools to search and display products: + ```js + navigator.modelContext.registerTool({ + name: "get-dresses", + description: "Returns an array of product listings containing id, description, price, and photo.", + inputSchema: { + type: "object", + properties: { + size: { type: "number", description: "Optional EU dress size to filter by." }, + color: { type: "string", description: "Optional color to filter by." } + } + }, + async execute({ size, color }) { + const response = await fetchDresses(size, color); + return response.json(); + } + }); + navigator.modelContext.registerTool({ + name: "show-dresses", + ... + }); + navigator.modelContext.registerTool({ + name: "filter-products", + ... + }); + ``` +- The agent calls `get-dresses(6)` (automatically translating Maya's size into EU units from her browser profile context) and receives a JSON array of detailed product listings: + ```json + { "products": [ - { - "id": 1021, - "description": "A short sleeve long dress with full length button placket...", - "price": "€180", - "image": "img_1024.png" - }, - { - "id": 4320, - "description": "A straight midi dress in organic cotton...", - "price": "€140", - "image": "img_4320.png" - }, - ... + { + "id": 1021, + "description": "A short sleeve midi dress in organic cotton with a floral print...", + "price": "€180", + "image": "img_1021.png" + }, + { + "id": 4320, + "description": "A straight-cut formal linen gown on plant-based dyes...", + "price": "€220", + "image": "img_4320.png" + }, + { + "id": 684, + "description": ... + }, + ... 
] -} -``` - -> [!Note] -> How to pass images and other non-textual data is something we should improve (See [Issue #41](https://github.com/webmachinelearning/webmcp/issues/41)) - -_The agent can now process this list, fetching each image, and using the user's criteria to filter the list. When -completed it makes another call, this time to `showDresses([4320, 8492, 5532, ...])`. This call updates the UI on the -page to show only the requested dresses._ - -_This is still too many dresses so Maya finds an old photo of herself in a summer dress that she really likes and shares -it with her agent._ - -**Maya**: Are there any dresses similar to the dress worn in this photo? Try to match the colour and style, but continue -to show me dresses appropriate for cocktail-attire. - -_The agent uses this image to identify several new parameters including: the colour, the fit, and the neckline and -narrows down the list to just a few dresses. Maya finds and clicks on a dress she likes._ - -_Notice, the user did not give their size, but the agent knows this from personalization and may even translate the stored -size into EU units to use it with this site._ - -### Example - Code Review + } + ``` +- The agent processes this list, fetching each image and using the user's criteria to filter the dresses. It then calls the tool `show-dresses([1021, 4320, 684, ...])`. This updates the UI on the page to show only the requested dresses. +- **Maya** uploads a photo of a favorite summer dress she owns: "Are there any dresses similar to the color and style of the one in this photo?" +- **Agent**: "I've analyzed your photo's color tone and A-line cut. Let me filter the store grid to show options matching that style." +- The agent uses its vision capabilities to match the product images against Maya's photo, compiles the list of matching IDs, and runs the tool `filter-products([1021, 684])`, instantly updating the site's UI with relevant dresses. 
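The get, filter, and show sequence in the shopping walkthrough above can be sketched outside a browser. This is a minimal, non-authoritative sketch under stated assumptions: `createModelContext`, `callTool`, `registerShopTools`, `narrowDresses`, the `productIds` parameter, and the sample catalogue are all hypothetical names introduced for illustration. Only the `registerTool({ name, description, execute })` shape is taken from the examples in this document. Tools are kept synchronous for brevity; a real page's `execute` may return a promise.

```js
// Hypothetical stand-in for navigator.modelContext so the sketch runs outside a browser.
// callTool() is an assumption: it stands in for whatever the browser does when an
// agent invokes a registered tool.
function createModelContext() {
  const tools = new Map();
  return {
    registerTool(tool) {
      tools.set(tool.name, tool);
    },
    callTool(name, args = {}) {
      const tool = tools.get(name);
      if (!tool) throw new Error(`Unknown tool: ${name}`);
      return tool.execute(args);
    },
  };
}

// Invented catalogue; only the ids echo the walkthrough, every other field is made up.
const CATALOGUE = [
  { id: 1021, size: 6, color: "Red", description: "Short sleeve midi dress" },
  { id: 4320, size: 6, color: "Black", description: "Straight-cut linen gown" },
  { id: 684, size: 8, color: "Red", description: "A-line cocktail dress" },
];

// Stands in for the product grid that the page would render.
const visibleIds = [];

function registerShopTools(modelContext) {
  modelContext.registerTool({
    name: "get-dresses",
    description: "Returns product listings, optionally filtered by size and color.",
    execute({ size, color }) {
      const products = CATALOGUE.filter(
        (p) =>
          (size === undefined || p.size === size) &&
          (color === undefined || p.color === color)
      );
      return { products };
    },
  });
  modelContext.registerTool({
    name: "show-dresses",
    description: "Displays only the products with the given ids.",
    execute({ productIds }) {
      visibleIds.length = 0; // clear the previous selection
      visibleIds.push(...productIds);
      return { shown: visibleIds.length };
    },
  });
}

// Agent-side flow: fetch candidates, apply the user's criteria, then update the page.
function narrowDresses(modelContext, size, matches) {
  const { products } = modelContext.callTool("get-dresses", { size });
  const productIds = products.filter(matches).map((p) => p.id);
  modelContext.callTool("show-dresses", { productIds });
  return productIds;
}

const modelContext = createModelContext();
registerShopTools(modelContext);
const shown = narrowDresses(modelContext, 6, (p) => p.color === "Red");
console.log(shown, visibleIds);
```

The predicate passed to `narrowDresses` is where an agent would apply criteria the site's own filters do not cover, such as the cocktail-attire judgment in the walkthrough; the page stays in control of what is actually rendered.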
+ +### Specialized Developer Workflows + +John is a software developer performing a code review in [Gerrit](https://www.gerritcodereview.com/). The interface is complex, but the page registers helpful tools to inspect trybot statuses and retrieve logs, perfect for agents that are typically trained on everyday usage, and may otherwise do a poor job actuating such complicated interfaces. + +- **John**: "Why are the Mac and Android trybots failing?" +- The page has already registered the following tools: + ```js + navigator.modelContext.registerTool({ + name: "get-trybot-statuses", + description: "Returns the current status of all trybot runs for the active patch.", + execute() { + return activePatch.getStatuses(); + } + }); + + navigator.modelContext.registerTool({ + name: "get-trybot-failure-snippet", + description: "If a bot failed, returns the tail log snippet describing the error.", + inputSchema: { + type: "object", + properties: { + botName: { type: "string", description: "The bot name to query." } + }, + required: ["botName"] + }, + execute({ botName }) { + return activePatch.getFailureSnippet(botName); + } + }); + ``` +- The agent invokes `get-trybot-statuses` and receives a JSON array representing the trybot statuses: + ```json + [ + { "botName": "mac-x64-rel", "status": "FAIL" }, + { "botName": "android-15-rel", "status": "FAIL" } + ] + ``` +- The agent then automatically calls `get-trybot-failure-snippet` for each failing bot. After ingesting the logs, it reports back: + - **Agent**: "The Mac bot is failing with an 'Out of Space' infrastructure error. The Android bot is failing while linking with a missing symbol `gfx::DisplayCompositor`." + - **John**: "Ah! BUILD.gn is missing `display_compositor_android.cc`. Please add a suggested edit to the build file adding it to the Android sources." +- The agent uses a registered `add-suggested-edit(filename, patch)` tool to apply the diff. 
The Gerrit UI instantly displays the suggested patch as a code-review diff for John to accept, modify, or reject. -Some services are very domain specific and/or provide a lot of functionality. A real world example is the Chromium code -review tool: Gerrit. See [CL#5142508](https://crrev.com/c/5142508). Gerrit has many features but they're not obvious just by -looking at the UI (you can press the '?' key to show a shortcut guide). In order to add a comment to a line, the user -must know to press the 'c' key. The user can suggest edits but has to open a comment to do so. Results from test runs -are available but are hidden in a generically-named "Checks" tab. -Agents are typically trained on everyday usage so may do a poor job on more specialized, complex interfaces. However, -such sites could provide the agent with tools which serve as both a shortcut and a user manual for the agent. +## Detailed Design -_John is a software developer and opens a code review sent from his colleague. He notices there's two red bots -indicating test failures on this patch._ +WebMCP introduces an imperative API on the web platform under `navigator.modelContext`. This interface allows pages to expose client-side actions that agents can discover and invoke in a secure, browser-mediated environment. -**John**: Why are the Mac and Android bots failing? +### Imperative Tool Registration: `navigator.modelContext` -_The site includes some relevant tools:_ +A Model Context Provider registers tools by calling the `navigator.modelContext.registerTool()` method. ```js -/** - * Returns the status of each bot run in a try run job. Use the returned bot_name in getTryRunFailureSnippet - * to get more details about a run. - */ -getTryRunStatuses(); - -/** - * If a bot failed, returns the TAIL snippet of the log containing the error message describing the failure. 
- * - * bot_name - The name of the bot, as returned from getTryRunStatuses, to get failure snippet from - */ -getTryRunFailureSnippet(bot_name) -``` - -_The agent calls `getTryRunStatuses()` to find the names of the failing bots. It returns:_ - -```json -[ - { - "bot_name": "mac-x86-rel", - "status": "PASSED" +const controller = new AbortController(); + +navigator.modelContext.registerTool({ + name: "add-todo", + description: "Add a new item to the user's active todo list", + inputSchema: { + type: "object", + properties: { + text: { type: "string", description: "The text content of the todo item" } }, - { - "bot_name": "mac-x64-rel", - "status": "FAIL" - }, - { - "bot_name": "android-15-rel", - "status": "PASSED" - }, - ... -] -``` - -**Agent**: I see that the Mac x86 Android 10 bots are failing. Let me get more details... - -_The agent now calls `getTryRunFailureSnippet` on each of the failing bots and receives a snippet of the failing log of -each._ - -**Agent**: The Mac bot is failing to start with an "Out of Space" error. This is probaby just an ifrastructure issue. - -The Android bot is failing while linking with a missing symbol `gfx::DisplayCompositor`. + required: ["text"] + }, + async execute({ text }) { + // Reuse existing client-side application logic and update UI. + await addTodoItemToCollection(text); + + return { + content: [ + { + type: "text", + text: `Added todo item: "${text}" successfully.` + } + ] + }; + } +}, { signal: controller.signal }); -_John realizes his colleague forgot to add the Android implementation of this class to the build file._ +// To unregister the tool later, abort the signal. +// controller.abort(); +``` -**John**: I think BUILD.gn is missing `display_compositor_android.cc`, please add a suggested edit adding it in the -Android section. +### Lifecycle of a Tool Call +1. **Registration**: The web page registers one or more tools using `navigator.modelContext.registerTool()`. +2. 
**Discovery**: An agent connected to the page queries the browser to discover the active list of tools and their schemas. +3. **Invocation**: The agent requests a tool call, sending structured arguments matching the tool's `inputSchema`. +4. **Execution**: The browser mediates the call, invokes the tool's `execute` callback with the provided arguments, and executes client-side logic on the page. +5. **Response**: The page's callback returns structured results back to the agent, which processes them to continue collaborating with the user. -_The agent has access to BUILD.gn from the main UI as well as in DOM. It notices it also has access to a relevant tool -which is uses to add the change:_ +### Declarative API -```js -/* - * Adds a suggested edit to the review - * - * filename - the name of the file in which to make the edit - * patch - the edit diff in unidiff format - */ -addSuggestedEdit(filename, patch) -``` +For forms and standard HTML inputs, a declarative counterpart to the imperative API allows the browser to automatically synthesize tool definitions from `<form>` elements. This is detailed in the [Declarative API Explainer](./declarative-api-explainer.md). It will soon be folded into this explainer document. -**Agent**: Ok, I've added `display_compositor_android.cc` to the BUILD.gn file. +### Permissions policy and iframes -_The UI displays the suggested diff with an option for the user to accept, modify, or reject the change. John accepts -the change._ +While much of this explainer assumes integration with built-in browser agents, WebMCP also supports **author-provided agents**, such as agents embedded directly on a page or running in an iframe, that can collaborate with parent frames and nested contexts. 
See: + - [Issue #57](https://github.com/webmachinelearning/webmcp/issues/57) + - [Issue #117](https://github.com/webmachinelearning/webmcp/issues/117) + - [Issue #159](https://github.com/webmachinelearning/webmcp/issues/159) + - [Issue #160](https://github.com/webmachinelearning/webmcp/issues/160) + - [Issue #178](https://github.com/webmachinelearning/webmcp/issues/178) -_Reading the rest of the review, John notices a small issue repeated across multiple files._ +By default, WebMCP is enabled in top-level `Window`s and their same-origin iframes, but access can be delegated to cross-origin iframes using the [Permissions Policy](https://developer.mozilla.org/en-US/docs/Web/HTTP/Guides/Permissions_Policy) `allow="tools"`: -**John**: Add a polite comment to the review that we should use "PointF" rather than "Point" for input coordinates since -the latter can cause unintended rounding. Then add suggested edits changing all instances where Point was added to -PointF. + ```html + + ``` -_The agent automates the repetitive task of making all the simple changes. The UI provides John with a visual way to -quickly review the agent's actions and accept/modify/reject them._ +Calls to `navigator.modelContext.registerTool()` will throw a `NotAllowedError` DOMException when the permission is disabled, whether by the `allow` attribute or the `Permissions-Policy: tools=()` header. Handling of declarative tool registration errors, including when the permission is disabled, is TBD; see [Issue #182](https://github.com/webmachinelearning/webmcp/issues/182). -## Assumptions +#### Cross-origin iframe exposure: `exposedTo` -* For many sites wanting to integrate with agents quickly - augmenting their existing UI with WebMCP tools will be - easier vs. backend integration -* Agents will perform quicker and more successfully with specific tools compared to using a human interface. -* Users might use an agent for a direct action query (e.g. 
“create a 30 minute meeting with Pat at 3:00pm”), complex - cross-site queries (e.g. “Find the 5 highest rated restaurants in Toronto, pin them in my Map, and book a table at - each one over the next 5 weeks”) and everything in between. +By default, tools registered by a document are only exposed to itself, same-origin documents in the same tree, and built-in browser agents (see this discussion). To support author-provided agents running in frames, developers can selectively share tools with secure origins of their choice using the `exposedTo` option during registration: -## Prior Art +```js +navigator.modelContext.registerTool({ + name: "share-location", + description: "Returns the user's office location.", + execute() { return { office: "Building 4" }; } +}, { exposedTo: ["https://trusted-partner.example"] }); +``` -### Model Context Protocol (MCP) +Any document in the tree matching these origins (and allowed to use the `tools` permission) will: -MCP is a protocol for applications to interface with an AI model. Developed by Anthropic, MCP is supported by Claude -Desktop and Open AI's Agents SDK as well as a growing ecosystem of clients and servers. +- Receive the `toolchange` event on its `navigator.modelContext` when the tool is registered or unregistered. +- Be able to discover and run the tools. -In MCP, an application can expose tools, resources, and more to an AI-enabled application by implementing an MCP server. -The server can be implemented in various languages, as long as it conforms to the protocol. For example, here’s an -implementation of a tool using the Python SDK from the MCP quickstart guide: +#### Discovering and running tools -```python -@mcp.tool() -async def get_alerts(state: str) -> str: - """Get weather alerts for a US state. +TODO: Spec and describe the `modelContext.getTools()` and `modelContext.executeTool()` APIs. - Args: - state: Two-letter US state code (e.g. 
CA, NY) - """ - url = f"{NWS_API_BASE}/alerts/active/area/{state}" - data = await make_nws_request(url) - if not data or "features" not in data: - return "Unable to fetch alerts or no alerts found." +## Alternatives Considered - if not data["features"]: - return "No active alerts for this state." +### 1. Direct Adoption of the Backend MCP Specification +We considered directly adopting the full Model Context Protocol (MCP) spec in the browser without creating a web-native API. However: +- MCP was built primarily for server-to-client and stdio/SSE process communication. It lacks native web concepts like origins, standard browser permissions, DOM integration, and tab-level lifecycle management. +- Coupling a web API directly to an actively evolving backend protocol would hinder backward compatibility and platform stability. - alerts = [format_alert(feature) for feature in data["features"]] - return "\n---\n".join(alerts) -``` +Instead, WebMCP derives direct inspiration and shares a **common vocabulary** with MCP (e.g., tools, schemas, parameters), but provides a form-fitting, client-safe solution designed natively for the web platform. -A client application implements a matching MCP client which takes a user’s query, communicates with one or more MCP -servers to enumerate their capabilities, and constructs a prompt to the AI platform, passing along any server-provided -tools or data. +### 2. Static Declarative Manifests +We considered declaring tools solely inside static manifest files (like the Web App Manifest). While useful for offline or background discovery: +- Static manifests prevent web developers from dynamically adding, updating, or removing tools based on the active page state or user authentication status. +- Manifests cannot contain executable code, meaning developers would still need an imperative way to register execution handlers. -The MCP protocol defines how this client-server communication happens. 
For example, a client can ask the server to list -all tools which might return a response like this: +Our current approach allows imperative script-based registration, with the potential for static declarations to be layered on in the future. -```json -{ - "jsonrpc": "2.0", - "id": 1, - "result": { - "tools": [ - { - "name": "get_weather", - "description": "Get current weather information for a location", - "inputSchema": { - "type": "object", - "properties": { - "location": { - "type": "string", - "description": "City name or zip code" - } - }, - "required": ["location"] - } - } - ], - "nextCursor": "next-page-cursor" +### 3. Event-Based Tool Execution (`'toolcall'`) +Another alternative was to handle tool execution exclusively via window-level events: +```js +navigator.agent.addEventListener('toolcall', async (e) => { + if (e.name === 'add-todo') { + e.respondWith(handleAddTodo(e.arguments)); } -} -``` - -Unlike OpenAPI, MCP is transport-agnostic. It comes with two built in transports: stdio which uses the systems standard -input/output, well suited for local communication between apps, and Server-Sent Events (SSE) which uses HTTP commands -for remote execution. - -### WebMCP (MCP-B) - -[MCP-B](https://mcp-b.ai/), or Model Context Protocol for the Browser, is an open source project found on GitHub [here](https://github.com/MiguelsPizza/WebMCP) and has much the same motivation and solution as described in this proposal. MCP-B's underlying protocol, also named WebMCP, extends MCP with tab transports that allow in-page communication between a website's MCP server and any client in the same tab. It also extends MCP with extension transports that use Chromium's runtime messaging to make a website's MCP server available to other extension components within the browser (background, sidebar, popup), and to other external MCP clients running on the same machine. 
MCP-B enables tools from different sites to work together, and for sites to cache tools so that they are discoverable even if the browser isn't currently navigated to the site. - -### OpenAPI - -OpenAPI is a standard for describing HTTP based APIs. Here’s an example in YAML (from the ChatGPT Actions guide): - -```yaml -openapi: 3.1.0 -info: - title: NWS Weather API - description: Access to weather data including forecasts, alerts, and observations. - version: 1.0.0 -servers: - - url: https://api.weather.gov - description: Main API Server -paths: - /points/{latitude},{longitude}: - get: - operationId: getPointData - summary: Get forecast grid endpoints for a specific location - parameters: - - name: latitude - in: path - required: true - schema: - type: number - format: float - description: Latitude of the point - - name: longitude - in: path - required: true - schema: - type: number - format: float - description: Longitude of the point - responses: - '200': - description: Successfully retrieved grid endpoints - content: - application/json: - schema: - type: object - properties: - properties: - type: object - properties: - forecast: - type: string - format: uri - forecastHourly: - type: string - format: uri - forecastGridData: - type: string - format: uri +}); ``` +- *Disadvantages*: This approach separates a tool's schema declaration from its implementation, making it harder to keep definitions and code in sync. It also leads to large `switch-case` statement blocks in event handlers. +- *Hybrid Approach*: We may still consider a hybrid model where a `"toolcall"` event is dispatched on the window *before* falling back to executing the registered imperative `execute` callback, allowing advanced interception. -A subset of the OpenAPI specification is used for function-calling / tool use for various AI platforms, such as ChatGPT -Actions and Gemini Function Calling. 
A user or developer on the AI platform would provide the platform with the OpenAPI -schema for an API they wish to provide as a “tool”. The AI is trained to understand this schema and is able to select -the tool and output a “call” to it, providing the correct arguments. Typically, some code external to the AI itself -would be responsible for making the API call and passing the returned result back to the AI’s conversation context to -reply to the user’s query. - -### Agent2Agent Protocol - -The Agent2Agent Protocol is another protocol for communication between agents. While similar in structure to MCP (client -/ server concepts that communicate via JSON-RPC), A2A attempts to solve a different problem. MCP (and OpenAPI) are -generally about exposing traditional capabilities to AI models (i.e. “tools”), A2A is a protocol for connecting AI -agents to each other. It provides some additional features to make common tasks in this scenario more streamlined, such -as: capability advertisement, long running and multi-turn interactions, and multimodal input/output. -## Open topics +## Prior Art -### Security considerations +- **Model Context Protocol (MCP)**: Developed by Anthropic, MCP is supported by Claude Desktop and enables applications to connect with AI models. +- **WebMCP (MCP-B)**: An open-source project (see [MCP-B](https://mcp-b.ai/)) implementing browser tab and extension transports for local in-page communication. +- **OpenAPI**: The standard specification for describing HTTP APIs, used in platform-specific extensions like ChatGPT Actions. +- **Agent2Agent (A2A) Protocol**: A protocol focused on connecting distinct autonomous AI agents to one another. -There are security considerations that will need to be accounted for, especially if the WebMCP API is used by semi-autonomous systems like LLM-based agents. Engagement from the community is welcome. 
-### Model poisoning +## Security and Privacy Considerations -Explorations should be made on the potential implications of allowing web developers to create tools in their front-end code for use in AI agents and LLMs. For example, vulnerabilities like being able to access content the user would not typically be able to see will need to be investigated. +Interacting with AI agents crosses traditional trust boundaries. Security, privacy, permissions policy, and origin isolation are crucial aspects of this proposal. -### Cross-Origin Isolation +For detailed discussions, see [Security & Privacy Considerations](./docs/security-privacy-considerations.md) and the active community updates in [PR #181](https://github.com/webmachinelearning/webmcp/pull/181). -Client applications would have access to many different web sites that expose tools. Consider an LLM-based agent. It is possible and even likely that data output from one application's tools could find its way into the input parameters for a second application's tool. There are legitimate reasons for the user to want to send data across origins to achieve complex tasks. Care should be taken to indicate to the user which web applications are being invoked and with what data so that the user can intervene. -### Permissions +## Open Questions -A trust boundary is crossed both when a web site first registers tools via WebMCP, and when a new client agent wants to use these tools. When a web site registers tools, it exposes information about itself and the services it provides to the host environment (i.e. the browser). When agents send tool calls, the site receives untrusted input in the parameters and the outputs in turn may contain sensitive user information. The browser should prompt the user at both points to grant permission and also provide a means to see what information is being sent to and from the site when a tool is called. 
To streamline workflows, browsers may give users the choice to always allow tool calls for a specific web app and client app pair. +As the WebMCP proposal continues to evolve with community and stakeholder feedback, we are tracking several active design discussions and technical challenges: -### Model Context Protocol (MCP) without WebMCP +- **Multimodal input/output**: AI agents are increasingly multimodal, and we should consider how tools can consume binary media as inputs and how to return them as outputs (e.g., audio, streams, media blobs, etc.). See [Issue #41](https://github.com/webmachinelearning/webmcp/issues/41), [Issue #86](https://github.com/webmachinelearning/webmcp/issues/86), and [Issue #81](https://github.com/webmachinelearning/webmcp/issues/81), and [Prompt API: Multimodal inputs](https://github.com/webmachinelearning/prompt-api#multimodal-inputs). -MCP has quickly garnered wide interest from the developer community, with hundreds of MCP servers being created. WebMCP is designed to work well with MCP, so that developers can reuse many of the MCP topics with their front-end website using JavaScript. We originally planned to propose an explainer very tightly aligned with MCP, providing all the same concepts supported by MCP at the time of writing, including tools, resources, and prompts. Since MCP is still actively changing, matching its exact capabilities would be an ongoing effort. Aligning the WebMCP API tightly with MCP would also make it more difficult to tailor WebMCP for non-LLM scenarios like OS and accessibility assistant integrations. Keeping the WebMCP API as agnostic as possible increases the chance of it being useful to a broader range of potential clients. +- **Cross-document tool response**: How should WebMCP handle tool responses when a tool (a form submission, for example) causes the page to navigate to another document? See [Issue #135](https://github.com/webmachinelearning/webmcp/issues/135). 
-We expect some web developers will continue to prefer standalone MCP instead of WebMCP if they want to have an always-on MCP server running that does not require page navigation in a full browser process. For example, server-to-server scenarios such as fully autonomous agents will likely benefit more from MCP servers. WebMCP is best suited for local browser workflows with a human in the loop. +- **Built-in agent exposure by default**:

The [`exposedTo`](https://webmachinelearning.github.io/webmcp/#dom-modelcontextregistertooloptions-exposedto) array only takes origins, but we're considering introducing a new keyword like `native-agent`, letting authors control a tool's exposure to a built-in agent. The running idea is that, by default, a missing `exposedTo` array would expose tools to the built-in agent in the top-level document, but would not expose them to the built-in agent in iframes.

-The WebMCP API still maps nicely to MCP, and exposing WebMCP tools to external applications via an MCP server is still a useful scenario that a browser implementation may wish to enable. +- **Transferable/streamable tool inputs and outputs**: AI models inherently support streaming data. WebMCP should consider enabling streaming tool inputs and outputs (such as chunked generation or large data transfers) without blocking on a massive copy. See [Issue #82](https://github.com/webmachinelearning/webmcp/issues/82). See also [MCP discussion](https://github.com/modelcontextprotocol/modelcontextprotocol/discussions/263) and [MCP Apps streaming tool inputs](https://github.com/modelcontextprotocol/ext-apps/blob/main/specification/draft/apps.mdx#notifications-host--view). -### Existing web automation techniques (DOM, accessibility tree) +- **Input and output schema validation**: Investigating native validation of tool inputs and outputs against declared JSON schemas before invoking the page's JS execution callback, or letting the output reach the model. See [Issue #92](https://github.com/webmachinelearning/webmcp/issues/92). -One of the scenarios we want to enable is making the web more accessible to general-purpose AI-based agents. In the absence of alternatives like MCP servers to accomplish their goals, these general-purpose agents often rely on observing the browser state through a combination of screenshots, and DOM and accessibility tree snapshots, and then interact with the page by simulating human user input. We believe that WebMCP will give these tools an alternative means to interact with the web that give the web developer more control over whether and how an AI-based agent interacts with their site. +- **Skills Integration**: Determining if the author should expose a higher-level "skill" to help the agent coordinate multiple related tools to fulfill a user journey. See [Issue #161](https://github.com/webmachinelearning/webmcp/issues/161). 
-The proposed API will not conflict with these existing automation techniques. If an agent or assistive tool finds that the task it is trying to accomplish is not achievable through the WebMCP tools that the page provides, then it can fall back to general-purpose browser automation to try and accomplish its task. +- **Output schema**: Supporting structured `outputSchema` contracts (complementing `inputSchema`) to help LLMs reliably reason about the return values of tools. See [Issue #9](https://github.com/webmachinelearning/webmcp/issues/9). -## Future explorations +- **User prompting and elicitation**: Exploring a way for a tool to prompt the user for confirmation when tools require explicit user authorization. This could be done by delegating to the agent and its harness, or by invoking native browser permission dialogue outside of the agent loop. See [Issue #165](https://github.com/webmachinelearning/webmcp/issues/165) and [Issue #50](https://github.com/webmachinelearning/webmcp/issues/50) for discussion about the [`ModelContextClient`](https://webmachinelearning.github.io/webmcp/#modelcontextclient) interface. -### Progressive web apps (PWA) +- **Tool progress reporting**: For long-running tasks (e.g., batch processing or generating content), the agent may want a way to track a tool's progress. We are exploring how this intersects with the established [MCP Progress](https://modelcontextprotocol.io/specification/2025-11-25/basic/utilities/progress) specification. -PWAs should also be able to use the WebMCP API as described in this proposal. There are potential advantages to installing a site as a PWA. In the current proposal, tools are only discoverable once a page has been navigated to and only persist for the lifetime of the page. A PWA with an app manifest could declare tools that are available "offline", that is, even when the PWA is not currently running. 
The host system would then be able to launch the PWA and navigate to the appropriate page when a tool call is requested. +- **Service workers integration**: Extending WebMCP to background Service Workers to allow agents to discover and invoke tools on sites the user doesn't currently have open. This is detailed in the supplementary [Service Workers Explainer](./docs/service-workers.md), which proposes background discovery mechanisms, session identification, and JIT worker installation. -### Background model context providers -Some tools that a web app may want to provide for agents and assistive technologies may not require any web UI. For example, a web developer building a "To Do" application may want to expose a tool that adds an item to the user's todo list without showing a browser window. The web developer may be content to just show a notification that the todo item was added. +## Acknowledgments -For scenarios like this, it may be helpful to combine tool call handling with something like the ['launch'](https://github.com/WICG/web-app-launch/blob/main/sw_launch_event.md) event. A client application might attach a tool call to a "launch" request which is handled entirely in a service worker without spawning a browser window. +> First published August 13, 2025 +> +> Brandon Walderman <brwalder@microsoft.com>
+> Leo Lee <leo.lee@microsoft.com>
+> Andrew Nolan <annolan@microsoft.com>
+> David Bokan <bokan@google.com>
+> Khushal Sagar <khushalsagar@google.com>
+> Hannah Van Opstal <hvanopstal@google.com> -## Acknowledgments +Since then, the specification draft has evolved significantly, primarily driven by [Dominic Farolino](https://github.com/domfarolino). -Many thanks to [Alex Nahas](https://github.com/MiguelsPizza) and [Jason McGhee](https://github.com/jasonjmcghee/) for sharing related [implementation](https://github.com/MiguelsPizza/WebMCP) [experience](https://github.com/jasonjmcghee/WebMCP). +Many thanks to [Alex Nahas](https://github.com/MiguelsPizza) and [Jason McGhee](https://github.com/jasonjmcghee/) for sharing their valuable [implementation](https://github.com/MiguelsPizza/WebMCP) [experience](https://github.com/jasonjmcghee/WebMCP). diff --git a/docs/explainer.md b/docs/explainer.md deleted file mode 100644 index 3a46787..0000000 --- a/docs/explainer.md +++ /dev/null @@ -1,3 +0,0 @@ -# WebMCP 🧪 - -The latest WebMCP explainer draft can be found in this repo's README.md file [here](https://github.com/webmachinelearning/webmcp/blob/main/README.md) \ No newline at end of file diff --git a/docs/proposal.md b/docs/proposal.md deleted file mode 100644 index 9ebd3f0..0000000 --- a/docs/proposal.md +++ /dev/null @@ -1,317 +0,0 @@ -# WebMCP API Proposal - -> August 13, 2025 -> -> Brandon Walderman <brwalder@microsoft.com>
-> Andrew Nolan <annolan@microsoft.com>
-> David Bokan <bokan@google.com>
-> Khushal Sagar <khushalsagar@google.com>
-> Hannah Van Opstal <hvanopstal@google.com> - -## Definitions - -- **Model context provider**: A single top-level browsing context navigated to a page that uses the WebMCP API to provide context (i.e. tools) to agents. -- **Agent**: An application that uses the provided context. This may be something like an AI assistant integrated into the browser, or possibly a native/desktop application. - -## Understanding WebMCP - -Only a top-level browsing context, such as a browser tab can be a model context provider. A page calls the WebMCP API's methods to register tools with the browser. An agent requires some information from the tool in order to use it. A simple, common subset emerges from [existing AI integration APIs](explainer.md#prior-art): - -* A natural language description of the tool / function -* For each parameter: - * A natural language description of the parameter - * The expected type (e.g. Number, String, Enum, etc) - * Any restrictions on the parameter (e.g. integers greater than 0) -* A JS callback function that implements the tool and returns a result - -When an agent that is connected to the page sends a tool call, the JavaScript callback is invoked, where the page can handle the tool call and respond to the agent. The function can be asynchronous and return a promise, in which case the agent will receive the result once the promise is resolved. Simple applications can handle tool calls entirely in page script, but more complex applications may choose to delegate computationally heavy operations to workers and respond to the agent asynchronously. - -Handling tool calls in the main thread with the option of delegating to workers serves a few purposes: - -- Ensures tool calls run one at a time and sequentially. -- The page can update UI to reflect state changes performed by tools. -- Handling tool calls in page script may be sufficient for simple applications. 
- -## Benefits of this design - -- **Familiar language/tools**: Lets a web developer implement their tools in JavaScript. -- **Code reuse**: A web developer may only need to make minimal changes to expose existing functionality as tools if their page already has an appropriate JavaScript function. -- **Local tool call handling**: Enables web developers to integrate their pages with AI-based agents by working with, but not solely relying on, techniques like Model Context Protocol that require a separate server and authentication. A web developer may only need to maintain one codebase for their frontend UI and agent integration, improving maintainability and quality-of-life for the developer. Local handling also potentially reduces network calls and enhances privacy/security. -- **Fine-grained permissions**: Tool calls are mediated through the browser, so the user has the opportunity to review the requesting client apps and provide consent. -- **Developer involvement**: Encourages developer involvement in the agentic web, required for a thriving web. Reduces the need for solutions like UI automation where the developer is not involved, improving privacy, reducing site expenses, and a better customer experience. -- **Seamless integration**: Since tool calls are handled locally on a real browser, the agent can interleave these calls with human input when necessary (e.g. for consent, auth flows, dialogs, etc.). -- **Accessibility**: Bringing tools to webpages via WebMCP may help users with accessibility needs by allowing them to complete the same job-to-be-done via agentic or conversational interfaces instead of relying on the accessibility tree, which many websites have not implemented. - -## Limitations of this design - -- **Browsing context required**: Since tool calls are handled in JavaScript, a browsing context (i.e. a browser tab or a webview) must be opened. 
There is currently no support for agents or assistive tools to call tools "headlessly" without visible browser UI. This is a future consideration which is discussed further below. -- **UI synchronization**: For a satisfactory end user experience, web developers need to ensure their UI is updated to reflect the current app state, regardless of whether the state updates came from human interaction or from a tool call. -- **Complexity overhead**: In cases where the site UI is very complex, developers will likely need to do some refactoring or add JavaScript that handles app and UI state with appropriate outputs. -- **Tool discoverability**: There is no built-in mechanism for client applications to discover which sites provide callable tools without visiting or querying them directly. Search engines, or directories of some kind may play a role in helping client applications determine whether a site has relevant tools for the task it is trying to perform. - -## API - -### modelContext -The `window.navigator.modelContext` interface is introduced for the site to declare functionality that can be used by an AI Agent. Access to these tools is arbitrated by the browser. - -The `modelContext`'s `registerTool()` method is used to add and remove tools from the agent's context. - -```js -const addTodoTool = { - execute: ({ text }, agent) => { - // Add todo item and update UI. - return /* structured content response */ - }, - name: "add-todo", - description: "Add a new todo item to the list", - inputSchema: { - type: "object", - properties: { - text: { type: "string", description: "The text of the todo item" } - }, - required: ["text"] - }, -}; -const controller = new AbortController(); - -window.navigator.modelContext.registerTool(addTodoTool, { signal: controller.signal }); - -// Unregister tool later... -controller.abort(); -``` - -### agent -The `agent` interface is introduced to represent an AI Agent using the functionality declared by the site through the `modelContext`. 
The lifetime of this interface is scoped to the execution of a tool. It is passed as a parameter when executing a tool's function. This interface provides the dependencies required by the site from the Agent. - -The `agent` provides a `requestUserInteraction` API to asynchronously seek user input during the execution of a tool. The API can be invoked multiple times during the execution of a tool. - -```js - window.navigator.modelContext.registerTool({ - execute: buyProduct, - name: "buyProduct", - description: "Use this tool to purchase a product given its unique product_id.", - inputSchema: { - type: "object", - properties: { - "product_id": { - description: "The unique identifier for the product to be purchased.", - type: "string", - } - }, - required: ["product_id"] - }, - }); -async function buyProduct({ product_id }, agent) { - // Request user confirmation before executing the action. - const confirmed = await agent.requestUserInteraction(async () => { - return new Promise((resolve) => { - const confirmed = confirm(`Buy product ${product_id}?\nClick OK to confirm, Cancel to abort.`); - resolve(confirmed); - }); - }); - - if (!confirmed) { - throw new Error("Purchase cancelled by user."); - } - - executePurchase(product_id); - return `Product ${product_id} purchased.`; -} -``` - -## Alternatives Considered -One disadvantage of the current registration approach is that the browser must navigate to the page and run JavaScript to discover tools. If WebMCP gains traction in the web developer community, it will become important to have a way to discover which sites have tools that are relevant to a user's request. Discovery is a topic that may warrant its own explainer, but suffice to say, it may be beneficial to have a way to know what capabilities a page offers without having to navigate to the web site first. 
As an example, a future iteration of this feature could introduce declarative tool definitions that are placed in an app manifest so that agents would only need to fetch the manifest with a simple HTTP GET request. Agents will of course still need to navigate to the site to actually use its tools, but a manifest makes it far less costly to discover these tools and reason about their relevance to the user's task. - -To make such a scenario easier, it would be beneficial to consider an alternate means of tool call execution, one that separates the tool definition and schema (which may exist in an external manifest file) from the implementation function. - -One way to do this is to handle tool calls as events, as shown below: - -```json -// 1. manifest.json: Define tools declaratively. Exact syntax TBD. - -{ - // .. other manifest fields .. - "tools": [ - { - "name": "add-todo", - "description": "Add a new todo item to the list", - "inputSchema": { - "type": "object", - "properties": { - "text": { "type": "string", "description": "The text of the todo item" } - }, - "required": ["text"] - } - } - ] -} -``` - -```js -// 2. script.js: Handle tool calls as events. - -window.agent.addEventListener('toolcall', async e => { - if (e.name === "add-todo") { - // Add todo item and update UI. - e.respondWith(/* structured content response */); - return; - } // etc... -}); -``` - -Tool calls are handled as events. Since event handler functions can't respond to the agent by returning a value directly, the `'toolcall'` event object has a `respondWith()` method that needs to be called to signal completion and respond to the agent. This is based on the existing service worker `'fetch'` event. - -**Advantages:** - -- Allows additional discovery mechanisms that work without rendering a page. - -**Disadvantages:** - -- Slightly harder to keep definition and implementation in sync. -- Potentially large switch-case in event handler. 
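One way to limit the switch-case growth noted above is a lookup table keyed by tool name. The sketch below is illustrative only: `agent` is a plain `EventTarget` standing in for the proposed `window.agent`, and `ToolCallEvent`, its `input` and `response` fields, the `list-todos` tool, and the error response shape are all assumptions, since the event's exact shape is not defined here.

```js
// Stand-in for the proposed `window.agent` event target (hypothetical).
const agent = new EventTarget();

// Hypothetical event shape: `name` and `respondWith()` follow the sketch above,
// while `input` and `response` are assumptions made for this example.
class ToolCallEvent extends Event {
  constructor(name, input) {
    super("toolcall");
    this.name = name;
    this.input = input;
    this.response = undefined;
  }

  respondWith(response) {
    this.response = response;
  }
}

const todos = [];

// One entry per tool. Adding a tool means adding an entry, not another branch.
const handlers = new Map([
  ["add-todo", ({ text }) => {
    todos.push(text);
    return { content: [{ type: "text", text: `Added "${text}"` }] };
  }],
  ["list-todos", () => ({
    content: [{ type: "text", text: todos.join(", ") }]
  })],
]);

agent.addEventListener("toolcall", (e) => {
  const handler = handlers.get(e.name);
  if (!handler) {
    // Assumed error shape, loosely modeled on MCP-style tool results.
    e.respondWith({
      isError: true,
      content: [{ type: "text", text: `Unknown tool: ${e.name}` }]
    });
    return;
  }
  e.respondWith(handler(e.input));
});

// Simulate an incoming tool call, as the browser might dispatch it.
function simulateToolCall(name, input) {
  const event = new ToolCallEvent(name, input);
  agent.dispatchEvent(event); // listeners run synchronously
  return event.response;
}

console.log(simulateToolCall("add-todo", { text: "buy milk" }));
```

A table like this also gives one place to compare registered handler names against the names declared in the manifest, which may help with the definition/implementation drift listed as the other disadvantage.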
- -### Open Question - -A **hybrid** approach of both of the examples above should be considered as this would make it easy for web developers to get started adding tools to their page, while leaving open the possibility of manifest-based approaches in the future. To implement this hybrid approach, a `"toolcall"` event is dispatched on every incoming tool call _before_ executing the tool's `execute` function. The event handler can handle the tool call by calling the event's `preventDefault()` method, and then responding to the agent with `respondWith()` as shown above. If the event handler does not call `preventDefault()` then the browser's default behavior for tool calls will occur. The `execute` function for the requested tool is called. If a tool with the requested name does not exist, then the browser responds to the agent with an error. - -## Example of WebMCP API usage - -Consider a web application like an example Historical Stamp Database. TODO(brwalder): Port the source code for example here. - -Screenshot of Historical Stamp Database - -The page shows the stamps currently in the database and has a form to add a new stamp to the database. The author of this app is interested in leveraging the WebMCP API to enable agentic scenarios like: - -- Importing multiple stamps from outside data sources -- Back-filling missing images -- Populating/correcting descriptions with deep research -- Adding information to descriptions about rarity -- Allowing end users to engage in a conversational interface about the stamps on the site and use that information in agentic flows - -Using the WebMCP API, the author can add just a few simple tools to the page for adding, updating, and retrieving stamps. With these relatively simple tools, an AI agent would have the ability to perform complex tasks like the ones illustrated above on behalf of the user. 
- -The example below walks through adding one such tool, the "add-stamp" tool, using the WebMCP API, so that AI agents can update the stamp collection. - -The webpage today is designed with a visual UX in mind. It uses simple JavaScript with a `'submit'` event handler that reads the form fields, adds the new record, and refreshes the UI: - -```js -document.getElementById('addStampForm').addEventListener('submit', (event) => { - event.preventDefault(); - - const stampName = document.getElementById('stampName').value; - const stampDescription = document.getElementById('stampDescription').value; - const stampYear = document.getElementById('stampYear').value; - const stampImageUrl = document.getElementById('stampImageUrl').value; - - addStamp(stampName, stampDescription, stampYear, stampImageUrl); -}); -``` - -To facilitate code reuse, the developer has already extracted the code to add a stamp and refresh the UI into a helper function `addStamp()`: - -```js -function addStamp(stampName, stampDescription, stampYear, stampImageUrl) { - // Add the new stamp to the collection - stamps.push({ - name: stampName, - description: stampDescription, - year: stampYear, - imageUrl: stampImageUrl || null - }); - - // Confirm addition and update the collection - document.getElementById('confirmationMessage').textContent = `Stamp "${stampName}" added successfully!`; - renderStamps(); -} -``` - -To let AI agents use this functionality, the author defines the available tools. The `modelContext` property on `window.navigator` is checked to ensure the browser supports WebMCP. If supported, the `registerTool()` method is called with an object describing the new "Add Stamp" tool. The tool accepts as parameters the same set of fields that are present in the HTML form, since this tool and the form should be functionally equivalent. 
- -```js -if ("modelContext" in window.navigator) { - window.navigator.modelContext.registerTool({ - name: "add-stamp", - description: "Add a new stamp to the collection", - inputSchema: { - type: "object", - properties: { - name: { type: "string", description: "The name of the stamp" }, - description: { type: "string", description: "A brief description of the stamp" }, - year: { type: "number", description: "The year the stamp was issued" }, - imageUrl: { type: "string", description: "An optional image URL for the stamp" } - }, - required: ["name", "description", "year"] - }, - execute({ name, description, year, imageUrl }, agent) { - // TODO - } - }); -} -``` - -Now the author needs to implement the tool. The tool needs to update the stamp database and refresh the UI to reflect the change. Since the code to do this is already available in the `addStamp()` function written earlier, the tool implementation is very simple and just needs to call this helper when an "add-stamp" tool call is received. After calling the helper, the tool needs to signal completion and should also provide some sort of feedback to the client application that requested the tool call. It returns a text message indicating the stamp was added: - -```js -execute({ name, description, year, imageUrl }, agent) { - addStamp(name, description, year, imageUrl); - - return { - content: [ - { - type: "text", - text: `Stamp "${name}" added successfully! The collection now contains ${stamps.length} stamps.`, - }, - ] - }; -} -``` -### Future improvements to this example - -#### Use a worker - -To improve the user experience and make it possible for the stamp application to handle a large number of tool calls without tying up the document's main thread, the web developer may choose to move the tool handling into a dedicated worker script. Handling tool calls in a worker keeps the UI responsive, and makes it possible to handle potentially long-running operations. 
For example, if the user asks an AI agent to add a list of hundreds of stamps from an external source such as a spreadsheet, this will result in hundreds of tool calls. - -#### Adaptive UI - -The author may also wish to change the on-page user experience when a client is connected. For example, if the user is interacting with the page primarily through an AI agent or assistive tool, then the author might choose to disable or hide the HTML form input and use more of the available space to show the stamp collection. - -## Intersection with MCP - -MCP is a layered protocol enabling client-server communication. The client owns the AI Agent connecting to external systems using this protocol and the server is the external system. The protocol has the following layers: - -- Primitives like tools (executable APIs), resources (static context) and prompts (templates for system prompts). -- Data layer for control messages between the client and server. For example, the client sends a `tools/list` message to request the set of tools from the server. -- Transport layer to abstract how the control messages are exchanged between the client-server (for example, HTTP POST requests). - -This proposal aligns the Web API closely with MCP primitives. This ensures agentic capabilities on the Web declared via WebMCP can be used by any MCP compatible Agent with minimal translation layers; and makes it easier for web authors to reuse code with their MCP service. - -Implementation of the data layer to arbitrate access to these primitives for an Agent is left to the browser. This design has the following advantages: - -1. It doesn’t directly couple the Web to a specific MCP version. The fact that the control flow is intermediated by the browser, instead of being opaque messages exchanged between the site and the Agent, allows the browser to maintain backwards compatibility as the protocol evolves. -2. The browser can apply security policies unique to the web platform. 
For example, embedders would need to manage the capabilities provided to iframes. -3. The API ergonomics can align with the Web platform. For example, a tool response can use `img` or `video` elements for multi-modal output. -4. There can be a declarative counterpart to imperative tools, see [issue 22](https://github.com/webmachinelearning/webmcp/issues/22). - -## Other API Alternatives considered - -### Web App Manifest, other manifest-based or declarative approaches - -We considered declaring tools statically in a site's Web App Manifest. Declaring tools solely in the Web App Manifest limits WebMCP to PWAs, which could impact adoption since users would need to install a site as an app for tools to be available. - -Another type of manifest could be proposed, but using this approach also means that only a fixed set of static tools is available and can't be updated dynamically based on application state, which seems like an important ability for web developers. Since manifests can't execute code, it also means manifests are additional work for the developer since they will still need to implement the tool somewhere. - -Our recommended approach above allows for the possibility of declarative tools in the future while giving web developers as much control as possible by defining tools in script. - -### Handling tool calls in worker threads - -Handling tool calls on the main thread raises performance concerns, especially if an agent requests a large number of tool calls in sequence, and/or the tools are computationally expensive. A design alternative that required tool calls to be handled in workers was considered instead. - -One proposal was to expose the WebMCP API only in service workers and let the service worker post messages to individual client windows/tabs as needed in order to update UI. This would have complicated the architecture and required web developers to add a service worker. 
This would also have required the Session concept described earlier to help the service worker differentiate between agents that are connected to different windows and dispatch requests from a particular agent to the correct window. - -For long-running, batched, or expensive tool calls, we expect web developers will dynamically update their UI when these are taking place to temporarily cede control to the agent (e.g. disable or remove human form inputs, indicate via UI that an agent is in control), and take advantage of dedicated workers as needed to offload expensive operations. This can be achieved with existing dedicated or shared workers. - -## Acknowledgments - -Many thanks to [Alex Nahas](https://github.com/MiguelsPizza) and [Jason McGhee](https://github.com/jasonjmcghee/) for sharing related [implementation](https://github.com/MiguelsPizza/WebMCP) [experience](https://github.com/jasonjmcghee/WebMCP).