> For the complete documentation index, see [llms.txt](https://thegrid.ai/docs/llms.txt). Markdown versions of documentation pages are available by appending `.md` to page URLs; this page is available as [Markdown](https://thegrid.ai/docs/integrations-and-best-practices/routing-patterns.md).

# Routing patterns

Default to Prime; route to Max for hard tasks; route to Standard for volume. That single sentence captures the whole game. The bulk of the savings on The Grid comes from following it, not from running everything on Max, and not from compressing everything onto Standard. Most teams who switch already know this in principle. The hard part is acting on it. This page is a working guide for how to think about routing across our [nine instruments](/docs/instrument-specifications/current-instruments.md).

## Default to Prime, escalate to Max, demote to Standard

Start with this four-step heuristic for any new workload:

1. **Default to Prime.** Use `text-prime`, `code-prime`, or `agent-prime` as the starting point for any meaningful workload. Prime covers most everyday work: writing, daily coding, standard agent loops, summarization, Q\&A, analysis.
2. **Identify the harder workloads.** Find the requests with long reasoning chains, very large context, or high error cost. Multi-file refactors, deep research, complex architecture, autonomous long-horizon agents. Escalate those to Max.
3. **Identify the lighter workloads.** Find the requests that are narrow, structured, repetitive, or transformational. Classification, extraction, autocomplete, simple tool calls, format conversions, batch processing. Move those to Standard.
4. **If the prompt is doing heavy lifting, stay on Prime or Max.** When the prompt itself includes policies, retrieval, memory, tool schemas, or extensive few-shot examples, keep it on Prime or Max even if the user query sounds simple. Standard is for genuinely simple tasks, not for compressing complex tasks into a small model.

## Aim for 25–35% Prime, 5–10% Max, 55–70% Standard

A production application that's routing honestly tends to land somewhere in this range:

| Tier                          | Share of requests |
| ----------------------------- | ----------------- |
| Prime (default)               | 25–35%            |
| Max (escalate for hard tasks) | 5–10%             |
| Standard (demote for volume)  | 55–70%            |

If your distribution is heavily skewed to Max, you're probably overpaying. If it's heavily skewed to Standard with no escalation path, you're probably hitting quality issues you haven't measured. The path there starts with Prime as the default, then moves workloads down or up based on evidence.

## Mix instrument types across one application

Our nine instruments cover three task types: text, code (preview), and agent (preview). You can mix them freely within one application. A few patterns hold up well in production.

### Code review with escalation

Run the first pass of code review on Code Prime. For PRs the model flags as high-risk (security-relevant, large diffs, complex refactors), re-run on Code Max. Most reviews never escalate. The ones that do justify the higher cost.

```python
review_result = client.chat.completions.create(
    model="code-prime",
    messages=review_messages,
)

if review_result.choices[0].message.flagged_high_risk:
    review_result = client.chat.completions.create(
        model="code-max",
        messages=review_messages,
    )
```

### Agent loops with cheap subtasks

Run the agent loop on Agent Prime for the planning and tool-calling steps. When a tool call hands off to a subtask that's mostly text generation, like summarizing a doc, extracting a field, or formatting output, call Text Standard for that subtask. The agent stays coherent on Agent Prime; the subtasks run cheaply on Text Standard.

A common mistake is running the entire agent loop on Agent Max to avoid managing tier transitions. That works, but you pay frontier prices for every step including the trivial ones.

### Code generation with text fallback

For applications where you sometimes generate code and sometimes generate prose, use Code Prime for code paths and Text Prime for everything else. The `model` parameter is per-request, so the routing logic lives in your application, not in your auth setup.

### Batch processing with a quality probe

For a batch job that processes 100k items, run the bulk through Text Standard. Sample 1% and re-process on Text Prime. Compare outputs. If the Standard outputs match Prime closely on the sample, ship the Standard run. If they diverge, escalate the affected segments.

## Common routing mistakes

**Running everything on Max.** The most common pattern in early Grid usage. Easy to set up, easy to leave in place. The savings from honest routing across tiers usually fall in the 70–90% range relative to running everything on a frontier instrument.

**Refusing to use Standard.** Standard isn't a worse Prime. It's a different tier with different specifications (≥100 tok/s throughput, optimized for high-volume structured tasks). For classification, extraction, and simple generation, Standard is usually the right tool, not a downgrade.

**Skipping the eval.** Most teams find Prime is enough where they assumed they needed Max. The only way to know is to run a representative eval across tiers and measure. Generic benchmarks don't tell you whether your specific workload can run on Standard or Prime. Your own eval does.

**Hardcoding the tier.** If switching tiers requires a code change, it doesn't get optimized. Map workload classes to instruments in configuration. Log which instrument served each request. Make tier changes a config change.

**Routing based on cost alone.** A workload that fails on Standard 5% of the time can cost more in retries, manual review, and downstream errors than running it on Prime to begin with. Track quality and latency per route, not just cost.

## Log workload class, instrument, latency, and outcome

For each request, log:

* The workload class (the route, not the prompt)
* The instrument that served the request
* TTFT and total latency
* Input and output tokens
* The supplier (returned in the response, useful for diagnostics)
* The Quality Score returned for the request, where applicable
* The downstream outcome (was the response used, did it pass schema validation, did the user accept it)

Aggregate by route. That's how you learn which routes can move down a tier and which need to move up. Our dashboard surfaces per-instrument usage and spend; route-level attribution lives in your application logs.

Quality Score is our per-request quality signal, derived from the [AA Index](https://artificialanalysis.ai/) and our own benchmarks. Use it as one input alongside latency and downstream outcomes when deciding which routes to move.

## Pick Code and Agent tiers the same way

The same heuristic, Prime first then Max for hard tasks then Standard for volume, applies inside each task type.

Within Code:

* `code-prime`. Daily coding, code review, standard debugging, function-level generation.
* `code-max`. Multi-file refactors, complex architecture, deep debugging, security review.
* `code-standard`. Autocomplete, linting, fast inline suggestions, single-line generation.

Within Agent:

* `agent-prime`. Reliable multi-step tool use, standard agent loops, customer-facing agents.
* `agent-max`. Autonomous long-horizon tasks, deep tool chains, agents that need to recover from errors without human intervention.
* `agent-standard`. Fast tool calls, simple loops, high-throughput orchestration, deterministic workflows.

For full instrument specifications, see [Current instruments](/docs/instrument-specifications/current-instruments.md). For a shorter decision guide on picking a starting tier, see [Choose an instrument](/docs/start-here/choose-an-instrument.md).


---

# Agent Instructions
This documentation is published with GitBook. GitBook is the documentation platform designed so that both humans and AI agents can read, navigate, and reason over technical content effectively. Learn more at gitbook.com.

## Querying This Documentation
If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://thegrid.ai/docs/integrations-and-best-practices/routing-patterns.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.