Benchmarking LLM Accuracy in Real-World API Orchestration


Introduction

AI agents are becoming a core part of enterprise integration - not just for pre-built flows like ETL and API aggregation, but for ad-hoc, on-demand orchestration inside real-time agent workflows.

As adoption accelerates, a key question is whether LLMs can reliably navigate the complexity of real-world API estates.

So we set out to answer:

How well can LLMs plan workflows that orchestrate multiple APIs, under real-world conditions?

We recently completed a detailed study into exactly this - measuring LLM accuracy when planning API orchestration tasks under increasingly realistic conditions.

We gave the agents a task (in the form of an “email from my boss”) and an ever-increasing collection of OpenAPI specs, and asked them to design a solution.

We then scored each response against a set of criteria, running each test 30 times and averaging the results.

If you’re in a hurry, here’s a link to the results.


Key findings

Four key findings emerged:

Planning accuracy falls to unusable levels somewhere between 60 and 300 endpoints

Planning accuracy (more on how we measured this below) fell to ~30% (Sonnet 4.5) and ~49% (ChatGPT 5.1) once we increased the number of endpoints to 300.

Interestingly, the degradation didn’t continue at 600 endpoints. We were surprised by this and don’t have a clear answer as to why.

Note: An “endpoint” is a Path and a Verb (e.g. GET /customers/123)

# of endpoints | Sonnet 4.5 | ChatGPT 5.1
60 endpoints | 51.3% | 62.7%
300 endpoints | 30.1% | 49.4%
600 endpoints | 30.9% | 46.4%

Adding even minimal semantic metadata improves planning accuracy

  • We added semantic metadata to the OpenAPI specs on fields and input parameters.
  • We updated our prompts to “consider semantic metadata (expressed via OpenAPI x-taxi-type annotations), which explicitly describe the semantic meaning of data contained within each field.”
  • We didn’t provide any additional information about “Taxi”, “x-taxi-type”, or “Semantic Metadata” beyond what’s already present in standard base models.

The impact was a meaningful increase in accuracy compared to baseline:

Configuration | Sonnet 4.5 | ChatGPT 5.1
600 endpoints with plain OpenAPI | 30.9% | 46.1%
600 endpoints with OpenAPI using Taxi annotations | 46.4% | 58.8%

Importantly, this improvement came purely from base model knowledge of semantic metadata - we didn’t provide specialized prompts, examples, or training about how to use these annotations.

Example of semantic metadata

We added x-taxi-type annotations to the OpenAPI specs. To reiterate - we didn’t tell the LLMs anything more about what this was. In practice, both models appeared to be familiar with Taxi and TaxiQL from their training data.

/quotes/{symbol}:
  get:
    summary: Get single quote
    operationId: getQuote
    parameters:
      - name: symbol
        in: path
        required: true
        schema:
          type: string
          x-taxi-type: com.bank.trading.Ticker
    responses:
      '200':
        description: OK
        content:
          application/json:
            schema:
              $ref: '#/components/schemas/Quote'

And to response schemas:

components:
  schemas:
    Quote:
      type: object
      properties:
        symbol:
          type: string
          x-taxi-type: com.bank.trading.Ticker
        bid:
          type: number
          x-taxi-type: com.bank.trading.BidPrice
        ask:
          type: number
          x-taxi-type: com.bank.trading.AskPrice
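
The x-taxi-type values point at semantic types which, in Taxi itself, would be declared separately. The examples above don’t show those declarations, but a minimal sketch - assuming each type is a simple refinement of a primitive - might look like this:

namespace com.bank.trading

// Illustrative only - the actual type definitions used in the study aren't shown here.
// Each semantic type narrows a primitive, giving fields a meaning beyond "string" or "number".
type Ticker inherits String
type BidPrice inherits Decimal
type AskPrice inherits Decimal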

Adopting a declarative query language improves LLM planning accuracy by between 73% and 142%

We asked the agents to express data requirements using TaxiQL queries.

The result was a significant improvement in planning accuracy.

Configuration | Sonnet 4.5 | ChatGPT 5.1
600 endpoints with plain OpenAPI | 30.9% | 46.1%
600 endpoints with OpenAPI using Taxi + TaxiQL | 74.7% | 85.5%

There are some important caveats here:

  • For this test run, we provided Taxi schemas, not OpenAPI.
  • Critically, this reduced token usage by 80%, which meant much smaller context windows.
  • LLMs were told to “outline the API calls as TaxiQL queries” (an illustrative sketch follows this list)
  • We didn’t provide specific information about TaxiQL to the LLMs beyond what’s already in their base training
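
For illustration, here’s roughly what such a TaxiQL query could look like, reusing the Quote model from earlier. This is a hedged sketch rather than a verbatim output from the study - the exact queries the agents produced varied:

// Illustrative sketch only - not taken verbatim from the study's outputs.
// Given a ticker, ask for a Quote; which endpoint to call is left to the
// orchestration layer rather than spelled out by the LLM.
given { ticker : com.bank.trading.Ticker = "AAPL" }
find { Quote }

Because the query names types and models rather than specific endpoints, the LLM doesn’t need the service definitions at all - which is what enables the much smaller Taxi schemas described in the next section.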

Using Taxi for APIs (instead of OpenAPI) reduced token usage by 80%

In the final, TaxiQL-focused tests we sent API specs to the LLMs in Taxi, rather than OpenAPI.

This dramatically reduced the tokens consumed (and therefore the size of the context sent to the model). As teams become sensitive to token spend, this is a quick win for budgets.

Format | Tokens consumed
OpenAPI | 62,337
Taxi | 12,542

For reference, here’s the same example from above, expressed in Taxi:

service QuotesApi {
   @HttpOperation(method = "GET", path = "/quotes/{symbol}")
   operation getQuote(symbol:Ticker):Quote
}
model Quote {
   ticker: com.bank.trading.Ticker
   bid: com.bank.trading.BidPrice
   ask: com.bank.trading.AskPrice
}

However, because we deferred orchestration to TaxiQL, the LLM doesn’t actually need to know about the service definitions, only the models and types, so the ACTUAL Taxi schemas ended up looking more like this:

model Quote {
   ticker: com.bank.trading.Ticker
   bid: com.bank.trading.BidPrice
   ask: com.bank.trading.AskPrice
}

(That’s 36 tokens vs 177 in the OpenAPI definition)


Scoring

What we measured - and why

We did this research to answer the question:

How good are LLMs at building an orchestration plan that spans multiple APIs?

More specifically, we wanted to answer three questions:

  1. How well do LLMs perform at API orchestration when facing real-world complexity (hundreds of endpoints)?
  2. Does adding semantic metadata improve their accuracy?
  3. Does the adoption of a declarative orchestration language (specifically TaxiQL) make a difference?

Disclosures

It’s important to be transparent about motivations - we did this research specifically because we’re building in this area, and we want to be sure that we’re both:

  1. addressing a problem that actually exists, and
  2. proposing a solution that works

However, we took care to test in a balanced way.

We didn’t give specialized training or prompting for the alternative approaches we were testing (which were adding a semantic layer, and using a declarative orchestration language — specifically TaxiQL).

Independently of the main research, we also tested how well the LLMs could produce real compiling TaxiQL queries. For this we did use specialized prompts (and we’re continuing to iterate them), but we’ve called these out separately in the research results.

How we measured

The agents were measured against four criteria:

  • API Flow - The agents had to pick the correct 5 endpoints (path and verb), and sequence them in the correct order
  • Correct Identifier - Multiple ID schemes were used. The agents had to recognize this, and call an API endpoint to swap IDs, passing in and reading out the correct fields
  • Business logic - Agents needed to describe (but not implement) how data from API responses would be used to implement business logic checks, or to pass to the next API call
  • Robustness - Data is messy (nulls and partial records exist), and agents needed to describe how they’d handle this

Importantly:

  • Agents only had to build an accurate plan. If they produced code, they weren’t penalized if the code didn’t compile.

To score:

  • A scoring sheet defined the key elements agents needed to articulate
  • We used multiple LLMs as judges in a multi-stage review process:
    • Two separate LLMs scored the output against the scoring sheet (GPT-5.1 and Gemini 2.5 Pro)
    • Both scoring results were sent back to an LLM to critique and refine the scores (Gemini 2.5 Pro)
    • The first LLM then reviewed the critiques and had the opportunity to adjust its scores (GPT-5.1)

Each scenario was run and scored 30 times, and the scores were averaged.

Summary

Our research shows that AI agents need a semantic layer to operate reliably in enterprise environments.

Without it, accuracy degrades to unusable levels as API complexity grows - exactly the conditions agents will face in production.

The encouraging result is that even minimal semantic metadata improves outcomes, and moving orchestration into a declarative layer (TaxiQL) improved planning accuracy substantially while reducing token consumption.

Both Taxi and TaxiQL are open source, making this approach accessible to any organization. Taxi is designed to be interoperable with existing schema languages, so you can start adding semantic metadata to your OpenAPI, Avro, SOAP, or Protobuf specs now - without retooling.

If you’d like to discuss applying this approach in your environment, join us on Slack or reach out by email. Or you can simply try Orbital now.