> For the complete documentation index, see [llms.txt](https://thegrid.ai/docs/llms.txt). Markdown versions of documentation pages are available by appending `.md` to page URLs; this page is available as [Markdown](https://thegrid.ai/docs/integrations-and-best-practices/best-practices.md).

# Best practices for development

These are the patterns that hold up across production usage of The Grid. They map onto the platform's behavior (instrument tiers, two-balance accounting, FIFO token metering, Auto Mode purchasing), so following them keeps your application stable and your bill lower.

## 1. Use the right tier per task

Don't run everything on Max. Standard for high-throughput pipelines and structured tasks (classification, extraction, simple generation). Prime for reliable daily work (writing, daily coding, standard agent loops, Q\&A). Max for deep reasoning, long context, and tasks where getting it wrong has real consequences.

You can switch per-request. Call Text Standard for classification and Text Max for analysis in the same application. The `model` parameter is just a string. Honest routing across tiers cuts inference spend by 70–90% relative to running everything on a frontier instrument, with no meaningful hit to quality on the workloads that don't need it. See [Routing patterns](/docs/integrations-and-best-practices/routing-patterns.md) for the decision flow and healthy distribution targets, and [Current instruments](/docs/instrument-specifications/current-instruments.md) for the full set of nine.

## 2. Set the input and output token limits in your tool

Most agent harnesses decide how much of the conversation to send by reading two numbers from their model config: the context window and a maximum output (sometimes called max completion or max tokens). If a tool exposes an explicit maximum input field, set it. If it doesn't, the tool derives the input budget as context minus output, so a real output cap below the context window is the only lever you have to stop it from overrunning.

The conservative input limits to configure per tier:

| Tier                                                          | Maximum input tokens |
| ------------------------------------------------------------- | -------------------- |
| Standard (`text-standard`, `code-standard`, `agent-standard`) | 120,000              |
| Prime (`text-prime`, `code-prime`, `agent-prime`)             | 120,000              |
| Max (`text-max`, `code-max`, `agent-max`)                     | 922,000              |

These are conservative floors. Many requests can send more, but configuring these values keeps a harness from optimistically packing a request that overflows. Always keep your configured maximum output below the instrument's context window so input plus output fits in a single request. When a tool sends more than the window because no limit was set, the request fails with a 400 before it ever reaches a supplier.

The Standard and Prime instruments carry a 128K context window; the Max instruments carry 1M. Tool-specific config keys are in each [integration guide](/docs/integrations-and-best-practices/integrations.md). If you use an instrument not listed here, get its current context and output limits from [Current instruments](/docs/instrument-specifications/current-instruments.md).

## 3. Monitor consumption balance and credits separately

There are two balances to watch. They serve different functions and they fail in different ways.

* **USD credits.** Funds available for buying tokens. If this hits zero, Auto Buy can't replenish your consumption tokens, and requests start failing with 402.
* **Per-instrument consumption balance.** Tokens available for API calls on that specific instrument. Each of [our nine instruments](/docs/instrument-specifications/current-instruments.md) has its own balance.

Auto-Reload keeps your credits topped up by charging your saved payment method when they drop below a threshold. Auto Buy keeps the per-instrument balances stocked by buying tokens automatically when they get low. Both should be enabled for uninterrupted usage.

The dashboard shows both balances. Wire alerts to whichever one matters more for your application. For high-throughput batch jobs, the per-instrument balance; for low-volume agents, the credits balance.

## 4. Retry on retryable errors

Three error codes are safe to retry: `429` (rate limited), `500` (transient server error), and `503` (balance replenishment in progress). `402` is retryable when Auto Buy is active and a buy is in progress.

Use exponential backoff with jitter. Start at 1 second, double on each retry, add random jitter (10–25% of the wait). Cap at a sensible maximum (30–60 seconds). Failed requests are not billed, so retries don't cost you anything beyond latency.

```python
import time, random
from openai import OpenAI, APIStatusError

client = OpenAI(base_url="https://api.thegrid.ai/v1", api_key="...")

def call_with_retry(messages, max_retries=5):
    for attempt in range(max_retries):
        try:
            return client.chat.completions.create(
                model="text-prime",
                messages=messages,
            )
        except APIStatusError as e:
            if e.status_code in (402, 429, 500, 503) and attempt < max_retries - 1:
                wait = (2 ** attempt) + random.uniform(0, 1)
                time.sleep(wait)
                continue
            raise
```

For agent loops specifically: when the per-instrument consumption balance hits zero, the first request triggers an asynchronous market buy and returns 402. The agent should retry after a short delay. By the second or third attempt, the new tokens are usually in the account and the request succeeds.

## 5. Use streaming where it matters

Streaming reduces perceived time-to-first-token. For interactive applications (chat UIs, IDE assistants, anything a user is watching), turn it on. Set `stream=True` in the request body and consume the SSE chunks as they arrive.

Streaming doesn't make the total response faster. It makes the start of the response faster. If your downstream code needs the full response anyway (parsing JSON, running validators, computing aggregates), streaming adds complexity without value. For batch jobs, classification pipelines, and tool calls that block on the full response, leave streaming off.

## 6. Tokens consume oldest-first (FIFO)

Tokens are consumed FIFO from the oldest lot in your consumption account. Every API call is attributed to a specific instrument and a specific trade. Usage records track input tokens, output tokens, the supplier that served the request, time-to-first-token, and throughput.

This matters for two reasons. First, your cost per request is set by the price of the lot the tokens are drawn from, not the current market price. Second, when sweeps run on instruments with consumption windows, the soonest-expiring tokens are consumed first. Plan your buying around the workloads you actually run.

## 7. Set sensible Auto Mode limits

Auto Mode handles purchasing, balance management, and top-ups. Most developers never need to touch Advanced Mode. Auto Mode has a few configurable limits worth setting deliberately:

* **Auto-Reload threshold.** The credits balance below which the system charges your saved payment method. Set this high enough that you don't run dry between reloads, low enough that you're not holding more credit than you need.
* **Auto Buy trigger.** The per-instrument consumption balance below which the system buys more tokens. Set this based on your typical request rate and the lead time for a market buy to settle.
* **Per-instrument caps.** Limits on how much can be spent per instrument in a given window. Useful for protecting against runaway agent loops.

If you want price control, like setting price ceilings or timing your purchases, flip to Advanced Mode in [profile settings](https://app.thegrid.ai/profile). Advanced Mode gives you direct order book access via the Trading API.

## 8. Pin instruments by string

The `model` parameter takes an instrument string: `text-prime`, `code-max`, `agent-prime`, and so on. Don't hardcode underlying model names. They're not how we route. The market routes to whichever qualifying supplier best meets the specification, and the pool of qualifying suppliers changes over time as new ones qualify and others fall below threshold.

This is the abstraction. Your code says "I need Prime-tier text generation" and the market handles which supplier serves it. Quality Score on each request, derived from the [AA Index](https://artificialanalysis.ai/) and our own benchmarks, gives you a per-request quality signal independent of which supplier got the routing. If you depend on specific model behavior (a particular tokenizer, a particular failure mode, a particular phrasing), you've coupled your code to something the specification doesn't promise. Build evals against your real workload, not against a specific model's quirks. Use structured outputs and schema validation to make routing-induced variance show up loudly rather than silently.

## 9. Plan for instrument tier changes

Code and Agent instruments are currently in preview. They move to general availability on their own timeline as suppliers qualify and benchmarks stabilize. Specifications evolve as entry thresholds tighten, additional benchmarks get added, or market parameters change.

Treat tier strings as a contract you're consuming, not as static labels. When `code-prime` moves from preview to GA, your code still works. The string is stable. But the underlying specification and the supplier pool mature. New instruments may launch (more tiers, more task types). Your routing config should be easy to update.

Keep your tier-to-workload mapping in configuration. Log which instrument served each request. Re-run your evals when specifications change. Text Standard, Text Prime, and Text Max are live; the rest are preview. See [Instrument specifications → How specifications evolve](/docs/instrument-specifications/how-specifications-evolve.md) for the policy on specification changes, and [Current instruments](/docs/instrument-specifications/current-instruments.md) for the live list.


---

# Agent Instructions
This documentation is published with GitBook. GitBook is the documentation platform designed so that both humans and AI agents can read, navigate, and reason over technical content effectively. Learn more at gitbook.com.

## Querying This Documentation
If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://thegrid.ai/docs/integrations-and-best-practices/best-practices.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.