How we define instruments

How The Grid turns "AI inference" into a tradable contract. What an instrument is, what its Quality Score measures, and what suppliers attest to when they offer supply.

Specifications make a market possible

Every commodity market starts with the same problem: defining the product. Wheat trades on protein content, moisture, and grade. Natural gas trades on BTU per cubic foot. Until the specification is written down, a buyer in one city has no way to transact with a seller in another without inspecting the goods. Specifications are what turn supply and demand into a working market.

Instruments are how we establish specifications for inference. An instrument is a standardized contract defined by measurable performance thresholds and audited continuously. Any model from any supplier that meets every threshold can fill orders for that instrument. You pass the instrument string in the model parameter; the order book routes your request to a qualifying supplier offering the best price.

You access instruments through both the OpenAI Chat Completions API at https://api.thegrid.ai/v1 and the Anthropic Messages beta at https://messages-beta.api.thegrid.ai/v1. The same instrument string works on either surface. Use whichever SDK your code already uses.

Output thresholds, not model names

Every instrument is defined by measurable output thresholds, not by the model running underneath. The specific model that serves your request is an implementation detail. You buy a quality floor measured against independent benchmarks. Any model from any supplier can fill your request as long as its output clears every threshold. That fungibility is what makes a liquid market possible.

Each instrument specifies a core set of parameters:

  • Quality Score. A composite floor on the relevant Artificial Analysis index. Text instruments use the Intelligence Indexarrow-up-right. Code instruments use the Coding Indexarrow-up-right. Agent instruments use the Agentic Indexarrow-up-right.

  • Throughput. Minimum tokens per second the supplier sustains.

  • Time to first token (TTFT). Maximum allowed latency before the first token streams back.

  • Context window. Minimum input token capacity.

  • Output length. Maximum tokens the model produces per response.

  • Uptime and error rates. Compliance bands monitored continuously across the live supplier set.

We do not create or maintain any of these benchmarks. Every evaluation comes from an independent third party (Artificial Analysis, SEAL, Aider, METR, Berkeley). Qualification is verifiable against the public leaderboards.

Each task type uses benchmarks tailored to its workload

The benchmarks that matter depend on the task type. The mix below is what determines whether a model is eligible for entry.

Text instruments

Code instruments

Agent instruments

A high score on a chat benchmark does not predict whether a model can complete a multi-step coding task, and a strong coding model is not necessarily good at long-horizon agent work. Each task type is benchmarked on what its workload actually looks like.

Instruments cover text, code, and agentic use cases

Three task types (text, code, agent), three tiers each (Standard, Prime, Max). Nine instruments total. Text is live today. Code and Agent are in Preview.

The three tiers across each task type follow the same shape:

  • Prime is the production default. Strong reasoning at a fraction of frontier cost. Most workloads belong here.

  • Max is for frontier work. Long context windows, the highest Quality Score floors, and the broadest task-specific evaluations. Use it when getting the wrong answer is expensive.

  • Standard is the high-throughput tier. Lower Quality Score floor, faster TTFT, and per-call costs that hold up at volume.

For the live catalog, with thresholds and when-to-use guidance for each instrument, see Current instruments.

Qualification has two stages

Meeting an instrument specification is not a one-time event. Qualification has two stages, and both must hold for a model and supplier to keep serving orders.

Stage 1: clear every threshold on a fresh benchmark run

A model clears every applicable threshold on a fresh benchmark run before any supplier can offer supply on that instrument. Thresholds are public on Current instruments. The relevant evaluations depend on the task type:

  • Text: Intelligence Index for entry.

  • Code: Coding Index, SWE-bench Pro via SEAL, and Aider Polyglot for entry.

  • Agent: Agentic Index, METR time horizons, and tau-squared-Bench for entry.

Once a model clears every threshold, suppliers can list inventory and start filling orders.

Stage 2: live endpoints are evaluated continuously to catch drift

Passing entry is the start, not the finish line. We run an internal evaluation suite against live supplier endpoints on a recurring basis to catch drift. Drift is when a model that qualified at launch has degraded, or a supplier's serving stack starts missing the latency or throughput floor.

Suppliers that drift below the compliance band get flagged. The supplier gets a remediation window to fix the issue or roll over to a qualifying model. Models that do not recover get removed from that instrument's eligible set. Your code never changes; the instrument string is stable. The supplier mix underneath shifts as the specification gets enforced.

The eligible model list is dynamic

Eligible models change as new releases hit the leaderboards and existing models get updated. The eligible set is dynamic by design. That is what keeps each instrument honest as the frontier moves.

Specific models are not pinned in this section of the docs because the list churns. Treat any names you see as illustrative.

Output-based specifications give every buyer the same floor

A model name does not tell you what you are buying. Two suppliers serving the same model can produce different outputs depending on their serving stack: quantization, batching, context handling, sampling defaults. A specification defined in measurable terms (Quality Score, throughput, latency, context, output length, uptime) gives every buyer the same floor regardless of which supplier fills the order.

This matters for three reasons:

  1. You get benchmarked quality with continuous compliance. Every order gets filled by a model and supplier that has cleared the same evaluations and is monitored against them in production.

  2. Suppliers compete on price within a specification. When multiple suppliers can fill the same instrument, your request gets routed to the one offering the best price at that moment. Competitive pricing comes out of the order book, not from negotiating SKUs.

  3. You avoid vendor lock-in. Your code targets text-prime, not a specific model name from a specific supplier. When a better model joins the eligible set, your traffic benefits without any code change.

Specifications get revised on a cadence

Instrument specifications are not static. As frontier models improve, the thresholds get raised so each tier keeps its meaning. The cycle, change-notice windows, and supplier rollover process live on a dedicated page.

How instrument specifications evolve →

Your code targets an instrument, not a model

You write model: "text-prime" (or any other instrument string) in the request. You do not pick a model. You do not maintain a routing table. You do not re-evaluate every new release.

The order book handles routing. The benchmark suites handle qualification. The compliance suite handles drift. You focus on what your application is doing and let the specification do the work of guaranteeing what you are buying.

A few practical implications:

  • The model behind a request can change between calls. Two consecutive text-prime requests might get filled by different qualifying models from different suppliers. That is the whole point: the specification, not the model, is what you contracted for.

  • Adding a new instrument is a config change, not a migration. When Code and Agent instruments go live, the same SDK calls work with the new instrument strings. No new credentials, no new endpoints.

  • Mixing instruments across an application is the norm. Most production deployments combine two or three instruments. Triage on Standard, reason on Prime, escalate to Max only when context is large or the cost of an error is high. The savings come from this kind of routing.

Next

Last updated

Was this helpful?