Track and reduce the cost of every AI call
2 lines of code. Full cost attribution by feature and team. 4 automatic optimizations that cut waste with zero quality risk. CogsLayer shows you where every dollar goes and eliminates what's wasted.
Installation
CogsLayer has no required dependencies. It uses Python's standard library for HTTP transport, so it won't interfere with your existing dependency tree.
pip install cogslayer

Requires Python 3.10 or later. Works on Linux, macOS, and Windows.
Quick Start
Two changes to your existing code: swap the provider import, and decorate functions you want to track. CogsLayer handles the rest. Every call inside a tracked function is captured with full token counts, cost, and latency.
import cogslayer
from cogslayer.openai import OpenAI  # 1. Replace the import

cogslayer.init(
    api_key="cl_live_xxx",  # 2. Init once at startup
    service="my-api",
)
client = OpenAI()
@cogslayer.track(feature="chat", team="growth")
def ask(prompt: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content
answer = ask("Explain quantum computing")
# -> Event captured: model=gpt-4o, cost=$0.0032, feature=chat, team=growth
answer = ask("Explain quantum computing")
# -> Deduped! Returned cached response, saved $0.0032What happens under the hood
init()validates your API key, starts the event transport, and activates 4 automatic optimizations.- Before each call, the optimizer checks for cached responses and retry duplicates.
- If no optimization applies, the call goes through. Cache headers are injected, max_tokens may be set.
- After the call, the response is cached for future dedup.
- Every optimization is logged with the exact dollar amount saved, visible in your dashboard.
Provider Wrappers
CogsLayer provides drop-in replacements for every major LLM provider. The wrapped clients are API-identical to the originals. You change the import line and nothing else breaks. Async clients are included.
from cogslayer.openai import OpenAI, AsyncOpenAI
# Standard OpenAI
client = OpenAI()
# Groq, Together, Fireworks, xAI, Ollama: anything OpenAI-compatible
groq = OpenAI(base_url="https://api.groq.com/openai/v1")
together = OpenAI(base_url="https://api.together.xyz/v1")from cogslayer.anthropic import Anthropic, AsyncAnthropic
client = Anthropic()from cogslayer.gemini import Client
client = Client(api_key="...")All wrappers support both sync and async usage. The async variants (AsyncOpenAI, AsyncAnthropic) work seamlessly with asyncio and the decorator is async-safe with no special configuration needed.
import cogslayer
from openai import OpenAI # Keep your original import
cogslayer.init(api_key="cl_live_xxx")
client = cogslayer.wrap(OpenAI()) # Same client, now tracked + optimizedcogslayer.wrap() patches the underlying provider SDK and returns the same client instance. Use it when you prefer not to change your import lines.
Automatic Optimization: Guaranteed Savings
CogsLayer automatically applies four zero-risk optimizations to every LLM call. These never change your model or affect output quality. They only prevent wasted spend. Enabled by default when you call cogslayer.init().
The guarantee
CogsLayer saves you money from day one. Every prevented call, every cached response, every capped output is logged with the exact dollar amount saved. You can see the running total in your dashboard's Savings page.
Response Dedup
When your application sends the same prompt twice, CogsLayer returns the cached response instead of calling the provider. The cache is keyed on the SHA-256 fingerprint of the prompt content with a configurable TTL (default: 5 minutes).
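The fingerprint described above can be sketched as a hash over the canonical prompt content. The helper below is illustrative only, not CogsLayer's actual key function:

```python
import hashlib
import json

def dedup_key(model: str, messages: list[dict]) -> str:
    """Illustrative dedup cache key: SHA-256 over the canonical JSON
    of the model plus prompt content (sketch, not CogsLayer internals)."""
    canonical = json.dumps({"model": model, "messages": messages},
                           sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

k1 = dedup_key("gpt-4o", [{"role": "user", "content": "What is your return policy?"}])
k2 = dedup_key("gpt-4o", [{"role": "user", "content": "What is your return policy?"}])
assert k1 == k2  # identical prompts collapse to one cache entry
```

Because the key is content-derived, any change to the model or the prompt produces a different entry, so only true repeats are served from cache.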
import cogslayer
from cogslayer.openai import OpenAI
cogslayer.init(api_key="cl_live_xxx")
client = OpenAI()
@cogslayer.track(feature="faq")
def answer(question: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": question}],
    )
    return resp.choices[0].message.content
# First call: hits OpenAI ($0.003)
answer("What is your return policy?")
# Second call: returns cached response ($0.00)
answer("What is your return policy?")
# -> CogsLayer logs: saved $0.003 via response_dedup

Prompt Caching
Providers offer discounted pricing for cached input tokens. Anthropic gives a 90% discount, OpenAI 50%, and Gemini 75%. CogsLayer automatically injects the required cache control headers so your repeated system prompts and prefixes get the discount without any code changes.
Anthropic: 90% off cached tokens. cache_control header injected on the system message.
OpenAI: 50% off cached tokens. Automatic when the prefix matches (CogsLayer ensures system-first ordering).
Gemini: 75% off cached tokens. CachedContent support.
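As illustrative arithmetic only: assuming a $2.50 per million input-token list price (an assumption here, so check your provider's current pricing) and the discount rates above, caching a large repeated prefix cuts input cost substantially:

```python
# Assumed base price: $2.50 per 1M input tokens (illustrative, not a quote).
BASE_PRICE_PER_TOKEN = 2.50 / 1_000_000

def input_cost(tokens: int, cached_tokens: int, cache_discount: float) -> float:
    """Input cost where `cached_tokens` receive `cache_discount` off."""
    fresh = tokens - cached_tokens
    return (fresh * BASE_PRICE_PER_TOKEN
            + cached_tokens * BASE_PRICE_PER_TOKEN * (1 - cache_discount))

# 8,000-token system prompt served from cache, 500 fresh tokens, 50% discount
with_cache = input_cost(8_500, 8_000, 0.50)     # 0.01125
without_cache = input_cost(8_500, 0, 0.50)      # 0.02125
```

At a 50% discount the cached portion costs half as much, so requests dominated by a stable system prompt see close to a 50% reduction in input spend.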
Retry Elimination
When the same user triggers the same feature with the same prompt within a short window (default: 5 seconds), CogsLayer recognizes it as a retry and returns the original response. This catches application-level retries, double-clicks, and race conditions.
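The window check above can be sketched as a keyed lookup with a timestamp comparison. Names and structure here are illustrative, not CogsLayer internals:

```python
RETRY_WINDOW_S = 5.0  # default window described above
_last_seen: dict[tuple, tuple[float, str]] = {}  # (user, feature, prompt_hash) -> (timestamp, response)

def check_retry(user_id: str, feature: str, prompt_hash: str, now: float):
    """Same user + feature + prompt inside the window counts as a retry
    and replays the original response (illustrative sketch)."""
    key = (user_id, feature, prompt_hash)
    hit = _last_seen.get(key)
    if hit is not None and now - hit[0] <= RETRY_WINDOW_S:
        return hit[1]  # retry detected: return the cached response
    return None  # no recent duplicate: let the call go through

def record(user_id: str, feature: str, prompt_hash: str, now: float, response: str):
    """Store the response so a near-term duplicate can be replayed."""
    _last_seen[(user_id, feature, prompt_hash)] = (now, response)
```

Keying on user and feature as well as the prompt means two different users asking the same question are not treated as retries of each other.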
# User clicks "Generate" twice rapidly
# First click: calls GPT-4o ($0.004)
# Second click: CogsLayer returns cached response ($0.00)
# -> Saved $0.004 via retry_dedup

Output Capping
Most applications don't set max_tokens, so models generate until they're done, often producing 2 to 3x more output than needed. CogsLayer learns your feature's output patterns and automatically sets a max_tokens ceiling at 120% of the historical p95 output length, after observing at least 10 calls.
from cogslayer._optimizer import configure
# Set explicit ceilings per feature
configure(max_tokens_profiles={
    "summarization": 300,
    "classification": 50,
    "chat": 1000,
})

Optimizer Configuration
All optimizations are enabled by default. You can disable the entire optimizer or configure individual techniques.
cogslayer.init(api_key="cl_live_xxx", optimize=False)

from cogslayer._optimizer import configure
configure(
    dedup_ttl=600.0,          # Cache responses for 10 minutes
    dedup_max_size=2000,      # LRU cache holds 2000 entries
    retry_dedup_window=10.0,  # Detect retries within 10 seconds
)

# Disable specific techniques
configure(
    dedup_enabled=False,          # Disable response dedup
    cache_headers_enabled=False,  # Disable cache header injection
)

Wrap your agent run with cogslayer.session()
Group every LLM call in one user or agent flow into a single run in the dashboard. CogsLayer does not store prompt text, only fingerprints and technical metadata.
import cogslayer
from cogslayer.openai import OpenAI
cogslayer.init(api_key="cl_live_xxx")
client = OpenAI()
with cogslayer.session():
    draft = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": user_input}],
    )
    refined = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": draft.choices[0].message.content}],
    )
# Both calls appear under one run: total cost, timeline, and call breakdown.

View run costs in the dashboard
Open Runs to see each flow's total cost, call count, tokens, and duration. Expand a run to see its summary, a timeline of calls with per-call duration and cost, and a nested call breakdown when your SDK emits span metadata. Older data without spans still shows accurate costs; the nested structure appears once you upgrade to a newer SDK version.
Estimate savings with model swaps and cache assumptions
On a run, use Estimate savings to map models to cheaper alternatives and optionally assume a cache hit rate. Results are hypothetical using published list prices; your actual invoice may differ. The Savings page ranks org-wide opportunities so you know what to validate first.
API: POST /v1/sessions/:session_id/replay with model_overrides and optional cache_rate (0 to 1).
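A minimal sketch of the replay request body, using only the parameters named above (model_overrides and the optional cache_rate); the exact override mapping shown is a hypothetical example:

```python
import json

# Hypothetical replay payload: remap gpt-4o calls to gpt-4o-mini and
# assume 60% of input tokens would hit the cache.
payload = {
    "model_overrides": {"gpt-4o": "gpt-4o-mini"},
    "cache_rate": 0.6,  # optional, between 0 and 1
}
body = json.dumps(payload)
# POST this body to /v1/sessions/:session_id/replay with your API key
# in the Authorization header.
```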
Attribution with @track()
Attribution is what separates CogsLayer from billing scrapers. Instead of seeing “you spent $400 on GPT-4o this month”, you see “the summarization feature in the growth team spent $180 on GPT-4o for user X.” Every LLM call inside a tracked function inherits the attribution context.
@cogslayer.track(
    feature="summarization",  # Which product feature
    team="growth",            # Which team owns it
    user_id="usr_123",        # Which end-user triggered it
    tenant="acme-corp",       # Multi-tenant: which customer
)
def summarize(text: str) -> str:
    # Every LLM call in here is tagged with all four labels
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": text}],
    )
    return resp.choices[0].message.content

feature: Product feature name. Use this to answer "which feature costs the most?"
team: Team ownership. Use this for departmental cost allocation and chargebacks.
user_id: End-user identifier. Track per-user spend for abuse detection or usage-based billing.
tenant: Customer/tenant in multi-tenant apps. See cost per customer.
You can pass any additional keyword arguments. They're stored as custom metadata on the event and available in exports and the dashboard.
Streaming Support
Streaming responses work out of the box. CogsLayer captures usage from the final chunk, which contains the complete token counts. Both sync generators and async generators are supported with no additional configuration.
@cogslayer.track(feature="chat")
def stream_response(prompt: str):
    stream = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    for chunk in stream:
        yield chunk.choices[0].delta.content or ""

# Async streaming works the same way
@cogslayer.track(feature="chat")
async def stream_async(prompt: str):
    stream = await async_client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    async for chunk in stream:
        yield chunk.choices[0].delta.content or ""

Custom Model Pricing
CogsLayer ships with a built-in pricing registry for all major models. For fine-tuned models, self-hosted endpoints, or new models not yet in the registry, register custom pricing so cost estimates stay accurate.
cogslayer.register_model(
    "ft:gpt-4o:my-org:custom-model:abc123",
    prompt_price_per_1k=0.005,      # $ per 1K input tokens
    completion_price_per_1k=0.015,  # $ per 1K output tokens
)

# Now calls to this model are priced correctly
@cogslayer.track(feature="extraction")
def extract(text: str):
    return client.chat.completions.create(
        model="ft:gpt-4o:my-org:custom-model:abc123",
        messages=[{"role": "user", "content": text}],
    )

Configuration
All configuration is passed to cogslayer.init(). There are no config files or environment variable conventions. Everything is explicit.
api_key (str, required): Your CogsLayer API key. Starts with cl_live_ or cl_test_.
service (str): Name of your application or microservice. Useful when multiple services share an org.
environment (str): Deployment environment (e.g. production, staging). Defaults to None.
base_url (str): Platform API base URL. Override this for self-hosted deployments.
cogslayer.init(
    api_key="cl_live_abc123def456",
    service="recommendation-api",
    environment="production",
    base_url="https://cogslayer.example.com",  # self-hosted
)

Budget Alerts
Set spend thresholds per feature, team, model, or total spend. Alerts are evaluated on-demand. Call check_alerts() from the SDK, hit the API endpoint, or use the “Run Check” button in the dashboard. When spend exceeds a threshold, the alert triggers and logs to the alert history.
import cogslayer
cogslayer.init(api_key="cl_live_xxx")
# Check which alerts are currently triggered
result = cogslayer.check_alerts()
for alert in result["triggered"]:
    name = alert["alert_name"]
    spend = alert["spend_usd"]
    limit = alert["threshold_usd"]
    print(f"Warning: {name}: spent {spend:.2f} / {limit:.2f} USD")

POST /v1/alerts
Content-Type: application/json
Authorization: Bearer cl_live_xxx
{
  "name": "Daily GPT-4o budget",
  "threshold_usd": 50.00,
  "period": "daily",
  "scope": "model",
  "scope_value": "gpt-4o"
}

Period: daily, weekly, monthly
Scope: total, feature, team, model, service
Evaluation: on-demand via SDK, API, or dashboard
Cost Insights
CogsLayer analyzes your usage patterns and generates actionable recommendations to reduce spend. Because it tracks code-level attribution, recommendations are scoped to specific features, not generic “use a cheaper model” advice.
Model downgrade
Identifies features calling expensive models (GPT-4o, Claude Opus) for tasks where cheaper alternatives (GPT-4o-mini, Haiku) would work. Estimates savings based on actual token volumes.
Caching opportunity
Detects features with low cached_tokens ratios and high request volumes. Suggests enabling prompt caching for repeated system prompts.
Anomaly detection
Flags features with sudden cost spikes compared to their historical baseline. Catches runaway loops, unexpected traffic, or prompt injection attacks.
Reasoning overkill
Identifies calls using reasoning models (o1, o3) where reasoning_tokens dominate the cost. Suggests non-reasoning alternatives for simpler tasks.
insights = cogslayer.get_insights(days=30)
total = insights["total_estimated_savings_usd"]
print(f"Total estimated savings: {total:.2f} USD")
for rec in insights["insights"]:
    print(f"[{rec['severity']}] {rec['message']}")
    print(f"  -> Save ~{rec['estimated_savings_usd']:.2f} USD")
    print()

CSV Export
Export raw events as CSV for offline analysis, compliance, or piping into your own BI tools. Every column, including all attribution fields, is included. Available from the SDK, the API, or the dashboard's Events page.
from datetime import datetime
csv_data = cogslayer.export_csv(
    start=datetime(2025, 1, 1),
    end=datetime(2025, 1, 31),
    feature="chat",
)
with open("january_chat_events.csv", "w") as f:
    f.write(csv_data)

GET /v1/events/export?start=2025-01-01&end=2025-01-31&feature=chat
Authorization: Bearer cl_live_xxx
# Response: text/csv with all event columns

Time-series Data
Query bucketed cost data over time for building charts, tracking trends, or feeding into monitoring pipelines. Supports hourly, daily, and weekly intervals with optional grouping.
from datetime import datetime
data = cogslayer.get_timeseries(
    interval="day",
    group_by="feature",
    start=datetime(2025, 1, 1),
    end=datetime(2025, 1, 31),
)
for bucket in data["data"]:
    cost = bucket["cost_usd"]
    count = bucket["event_count"]
    print(f"{bucket['bucket']}: {cost:.4f} USD ({count} events)")

SDK Methods
All public methods available after import cogslayer. Platform query methods require a valid API key set via init().
Initialization
cogslayer.init(api_key, *, service, environment, base_url, optimize=True)
Initialize the SDK. Validates your API key, starts the event transport, and activates automatic optimizations. Pass optimize=False to disable all automatic savings.

cogslayer.wrap(client)
Wrap any LLM client to enable tracking and optimization. Returns the same client instance. Equivalent to calling cogslayer.patch() but with a fluent API: client = cogslayer.wrap(OpenAI()).

cogslayer.shutdown()
Flush all pending events and stop the background transport. Call this before your process exits to avoid losing data.
Tracking
@cogslayer.track(feature, team, user_id, tenant, ...)
Decorator that attaches attribution context to every LLM call within the decorated function. Supports arbitrary keyword arguments for custom dimensions.

cogslayer.session(name, session_id=None)
Context manager that groups all LLM calls into a single agent run. Works with both sync (with) and async (async with). All calls inside share the same session_id in the dashboard.

cogslayer.estimate(model, messages, max_tokens=500)
Estimate cost before making a call. Returns prompt_tokens, completion_tokens, and estimated_cost_usd. Uses the local pricing registry; no API call needed.

cogslayer.register_model(model, prompt_price_per_1k, completion_price_per_1k)
Register custom per-token pricing for fine-tuned or self-hosted models. Prices are per 1K tokens.
Optimizer
from cogslayer._optimizer import configure
Fine-tune optimizer settings: dedup_ttl, dedup_max_size, retry_dedup_window, cache_headers_enabled, max_tokens_profiles.

from cogslayer._optimizer import get_total_savings
Returns the total USD saved by automatic optimizations in the current process. Useful for logging or monitoring.
Platform queries
cogslayer.get_insights(*, days=30) -> dict
Fetch cost optimization recommendations. Returns insights with type, severity, estimated savings, and affected features.

cogslayer.check_alerts() -> dict
Evaluate all active budget alerts against current period spend. Returns which alerts triggered and their spend vs threshold.

cogslayer.get_cost_summary(*, group_by, start, end) -> dict
Fetch aggregated cost breakdown from the platform. Group by model, provider, feature, team, or service.

cogslayer.get_timeseries(*, interval, group_by, start, end) -> dict
Fetch time-series cost data bucketed by hour, day, or week. Useful for building custom dashboards.

cogslayer.export_csv(*, start, end, **filters) -> str
Export events as a CSV string with all attribution columns. Filters by model, provider, feature, team.
API Endpoints
The CogsLayer platform exposes a REST API. All endpoints require authentication via the Authorization header: either an API key (cl_live_...) or a JWT from the dashboard. All request and response bodies are JSON unless noted otherwise.
Authentication
GET /v1/auth/verify
Validate an API key. Called automatically by the SDK during init.
Events
POST /v1/events
Ingest a batch of events (max 1000 per request). Used internally by the SDK transport.

GET /v1/events
Query events with filters. Supports model, provider, feature, team, start, end, limit, offset.

GET /v1/events/export
Download all matching events as a CSV file with full attribution columns.
Costs
GET /v1/costs/summary
Aggregated cost breakdown. Group by model, provider, feature, team, or service.

GET /v1/costs/timeseries
Bucketed cost data over time. Supports hourly, daily, and weekly intervals.
Insights
GET /v1/insights
Returns actionable cost optimization recommendations based on your usage patterns.
Savings
GET /v1/savings/opportunities
Ranked savings opportunities (model swaps, expensive runs, cache hints). Query: days, limit.

GET /v1/savings/realized
Total savings realized: includes both rule-based savings and automatic SDK optimizations (dedup, caching, retry, output cap). Returns per-technique breakdown.
Budget alerts
POST /v1/alerts
Create a new budget alert with threshold, period, and scope.

GET /v1/alerts
List all budget alerts for your organization.

PATCH /v1/alerts/:id
Update an alert's name, threshold, period, scope, or enabled state.

DELETE /v1/alerts/:id
Permanently delete a budget alert.

POST /v1/alerts/check
Evaluate all active alerts against current spend. Returns which ones triggered.

GET /v1/alerts/history
View a log of past alert triggers with spend vs threshold data.
API keys
POST /v1/keys
Create a new API key. The full key is only returned once.

GET /v1/keys
List all API keys (prefix only, not the full key).

DELETE /v1/keys/:id
Revoke an API key. Revoked keys stop working immediately.
Tracked Metrics
Every LLM call captured by CogsLayer records these fields. Usage metrics are extracted automatically from provider responses. Attribution fields are set through @cogslayer.track() and cogslayer.init().
cost_usd: Estimated cost in USD from the built-in pricing registry
prompt_tokens: Input tokens sent to the model
completion_tokens: Output tokens generated by the model
total_tokens: Sum of prompt + completion tokens
reasoning_tokens: Thinking/reasoning tokens (o1, o3, o4-mini)
cached_tokens: Tokens served from prompt cache rather than recomputed
latency_ms: Round-trip response time in milliseconds
model: Model identifier (e.g. gpt-4o, claude-sonnet-4-20250514)
provider: Provider name (openai, anthropic, google, etc.)
feature: Business feature attribution, set via @track()
team: Team attribution, set via @track()
user_id: End-user attribution, set via @track()
service: Application or microservice name, set via init()
environment: Deployment environment (production, staging), set via init()

Error Handling
CogsLayer is designed to never break your application. Tracking failures are silently swallowed. If the platform is unreachable or returns an error, your LLM calls still execute normally. Failed events are logged at debug level and dropped to avoid unbounded memory growth.
401: Invalid or revoked API key. Check the key passed to init().
429: Rate limit exceeded (1000 req/min per org). The SDK will drop events that fail to send.
400: Malformed event payload. Usually means a provider wrapper bug. Please report it.
500: Platform error. Failed events are logged at debug level and dropped.
Platform query methods (get_insights, check_alerts, etc.) do raise exceptions on failure, since they're explicit data fetches. Wrap them in try/except if your application needs to handle platform downtime gracefully.
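A generic sketch of that pattern, with fetch_insights standing in for any platform query method (the stub below simulates downtime rather than calling the real SDK):

```python
def fetch_insights():
    """Stand-in for a platform query such as cogslayer.get_insights;
    here it simulates an unreachable platform."""
    raise ConnectionError("platform unreachable")

def insights_or_empty():
    """Fall back to a safe empty result when the platform is down."""
    try:
        return fetch_insights()
    except (ConnectionError, TimeoutError):
        return {"insights": [], "total_estimated_savings_usd": 0.0}

result = insights_or_empty()  # application keeps working during downtime
```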