Rate limits & quotas

When RB_ENABLE_QUOTAS=true, two per-tenant quotas cap what can be stored and queried, and a short-term token-bucket rate limiter protects the service from bursts. The values on this page are the per-tenant defaults.

Default limits

A tenant with default settings gets the limits below. The daily query counter resets at UTC midnight (00:00:00 UTC).

Quota	Default	Resets
Vectors stored	100,000	never (cumulative)
Queries per day	10,000	daily, at UTC midnight

Check usage

GET /auth/usage

Check current usage against your quotas at any time from any JWT-authenticated client.

curl -s http://localhost:8080/auth/usage \
  -H "Authorization: Bearer $RB_API_KEY"

Response (HTTP 200):

{
  "vectors_used": 12500,
  "vector_quota": 100000,
  "queries_today": 342,
  "daily_query_quota": 10000,
  "queries_reset_at": "2026-05-16"
}

queries_reset_at is a YYYY-MM-DD date string: the UTC day on which the daily counter was last reset (effectively "today"). queries_today is zeroed lazily, on the first usage call of a new UTC day. The call performs that reset before reading, so it never returns counts left over from a previous day.
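As an illustration of how a client might use this response, the sketch below computes the remaining headroom for each quota. The function name usage_headroom and the output keys are hypothetical; only the input field names come from the sample response above.

```python
def usage_headroom(usage: dict) -> dict:
    """Return how many vectors and queries remain, given a /auth/usage body.

    `usage` is the parsed JSON response. The output key names are
    illustrative and not part of the API.
    """
    return {
        # Clamp at zero in case usage is already at or beyond the quota.
        "vectors_remaining": max(0, usage["vector_quota"] - usage["vectors_used"]),
        "queries_remaining": max(0, usage["daily_query_quota"] - usage["queries_today"]),
    }


# The sample response body shown above.
sample = {
    "vectors_used": 12500,
    "vector_quota": 100000,
    "queries_today": 342,
    "daily_query_quota": 10000,
    "queries_reset_at": "2026-05-16",
}

print(usage_headroom(sample))
# {'vectors_remaining': 87500, 'queries_remaining': 9658}
```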

What happens at the limit

Quota breaches return HTTP 429 with the standard error envelope (see Authentication).

vector_quota_exceeded: an upload to POST /v1/datasets/{name}/vectors (or a bulk POST /v1/datasets/{name}/imports) would push stored vectors past the 100,000 cap. The upload is rejected whole; there is no partial acceptance up to the cap. details carries limit and used.
query_quota_exceeded: a POST /v1/query is made after the daily 10,000-query cap has been reached. details carries limit and reset_at.
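Because an over-quota upload is rejected whole, a client may want to check a batch against the remaining vector headroom before sending it. The sketch below is one way to do that from the vectors_used and vector_quota values returned by GET /auth/usage. The helper names are hypothetical, and the check is only advisory: another client of the same tenant could upload between the usage call and the upload.

```python
def batch_fits(vectors_used: int, vector_quota: int, batch_size: int) -> bool:
    """True if uploading `batch_size` vectors would not push usage past the quota."""
    return vectors_used + batch_size <= vector_quota


def largest_batch(vectors_used: int, vector_quota: int) -> int:
    """Largest batch that still fits under the quota (0 if already at the cap)."""
    return max(0, vector_quota - vectors_used)


# With the sample usage figures from this page: 12,500 used of 100,000.
print(batch_fits(12500, 100000, 50000))   # True
print(batch_fits(12500, 100000, 90000))   # False
print(largest_batch(12500, 100000))       # 87500
```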

Sample 429 query_quota_exceeded body:

{
  "error": {
    "code": "query_quota_exceeded",
    "message": "Daily query quota exceeded for this tenant",
    "details": {
      "limit": 10000,
      "reset_at": "2026-05-16"
    }
  }
}
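A client can branch on error.code to decide what to do with a 429. The sketch below assumes the envelope shape shown in the samples on this page. The returned action labels and both function names are hypothetical, and the wait is computed from the documented UTC-midnight reset rather than from reset_at.

```python
import json
from datetime import datetime, timedelta, timezone


def seconds_until_utc_midnight(now: datetime) -> float:
    """Seconds from `now` (must be timezone-aware) to the next 00:00:00 UTC."""
    now_utc = now.astimezone(timezone.utc)
    next_midnight = (now_utc + timedelta(days=1)).replace(
        hour=0, minute=0, second=0, microsecond=0
    )
    return (next_midnight - now_utc).total_seconds()


def plan_for_429(body: str) -> str:
    """Map a 429 response body to an illustrative client action."""
    code = json.loads(body)["error"]["code"]
    if code == "rate_limited":
        return "retry_with_backoff"      # transient, retry shortly
    if code == "query_quota_exceeded":
        return "wait_for_daily_reset"    # counter resets at UTC midnight
    if code == "vector_quota_exceeded":
        return "stop"                    # stored-vector quota does not reset
    return "unknown"


body = '{"error": {"code": "query_quota_exceeded", "message": "...", "details": {"limit": 10000, "reset_at": "2026-05-16"}}}'
print(plan_for_429(body))  # wait_for_daily_reset
print(seconds_until_utc_midnight(datetime(2026, 5, 16, 23, 0, tzinfo=timezone.utc)))  # 3600.0
```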

Per-key rate limit

Separate from the quotas above, a short-term limit caps the request rate per API key. Each API key gets roughly 50 requests/second sustained. Exceeding that allowance returns 429 rate_limited, and the body reports the configured limit_rps and burst:

{
  "error": {
    "code": "rate_limited",
    "message": "Rate limit exceeded; slow down and retry",
    "details": { "limit_rps": 50, "burst": 100 }
  }
}
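For readers unfamiliar with token buckets, here is a minimal generic sketch, not the service's actual implementation. It assumes that limit_rps maps to the refill rate and burst to the bucket capacity, which this page does not confirm. The class and parameter names are hypothetical, and the clock is passed in explicitly so the behavior is easy to follow.

```python
import time
from typing import Optional


class TokenBucket:
    """Generic token bucket: refills at `rate` tokens/second up to `capacity`."""

    def __init__(self, rate: float, capacity: float, now: Optional[float] = None):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity  # start full, so an initial burst is allowed
        self.updated = time.monotonic() if now is None else now

    def allow(self, now: Optional[float] = None) -> bool:
        """Consume one token if available; return whether the request may proceed."""
        now = time.monotonic() if now is None else now
        elapsed = max(0.0, now - self.updated)
        # Refill for the elapsed time, never exceeding the capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


# Assumed mapping: rate = limit_rps (50), capacity = burst (100).
bucket = TokenBucket(rate=50, capacity=100, now=0.0)
burst_allowed = sum(bucket.allow(now=0.0) for _ in range(150))
after_one_second = sum(bucket.allow(now=1.0) for _ in range(60))
print(burst_allowed, after_one_second)  # 100 50
```

Under these assumptions, an idle key could send up to 100 requests at once, after which requests are admitted at about 50 per second.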

Treat rate_limited as transient: retry the request with exponential backoff and jitter. The limiter is an in-memory token bucket and best-effort in the MVP — it is process-local, so it resets on restart and is not shared across pods. It is applied to the customer-facing /v1/* endpoints only; the /auth/* surface is not rate-limited. Requests authenticated with a JWT are bucketed per tenant rather than per key.
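The retry advice above can be sketched as follows. This is a minimal example of exponential backoff with full jitter, not an official client. The send callable, its (status, error_code) return shape, and the base and cap values are all assumptions made for illustration. Sleep and randomness are injectable so the logic can be exercised without real delays.

```python
import random
import time
from typing import Callable, Optional, Tuple

Response = Tuple[int, Optional[str]]  # (HTTP status, error code or None)


def call_with_retry(
    send: Callable[[], Response],
    max_attempts: int = 5,
    base: float = 0.1,   # first backoff ceiling in seconds (assumed value)
    cap: float = 5.0,    # upper bound on any single delay (assumed value)
    sleep: Callable[[float], None] = time.sleep,
    rng: Callable[[], float] = random.random,
) -> Response:
    """Call `send`, retrying only on 429 rate_limited with jittered backoff."""
    status, code = send()
    for attempt in range(max_attempts - 1):
        if not (status == 429 and code == "rate_limited"):
            break  # success, or an error that a quick retry will not fix
        # Full jitter: a random delay in [0, min(cap, base * 2^attempt)).
        sleep(rng() * min(cap, base * 2 ** attempt))
        status, code = send()
    return status, code


# Demo with a fake sender: two rate-limited replies, then success.
replies = iter([(429, "rate_limited"), (429, "rate_limited"), (200, None)])
delays = []
result = call_with_retry(lambda: next(replies), sleep=delays.append, rng=lambda: 1.0)
print(result, delays)  # (200, None) [0.1, 0.2]
```

Quota errors (vector_quota_exceeded, query_quota_exceeded) are deliberately not retried here, since backing off for a few seconds does not clear them.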