Track and reduce the cost of every AI call

2 lines of code. Full cost attribution by feature and team. 4 automatic optimizations that cut waste with zero quality risk. CogsLayer shows you where every dollar goes and eliminates what's wasted.


Installation

CogsLayer has no required dependencies. It uses Python's standard library for HTTP transport, so it won't interfere with your existing dependency tree.

Terminal
pip install cogslayer

Requires Python 3.10 or later. Works on Linux, macOS, and Windows.

Quick Start

Two changes to your existing code: swap the provider import, and decorate functions you want to track. CogsLayer handles the rest. Every call inside a tracked function is captured with full token counts, cost, and latency.

app.py
import cogslayer
from cogslayer.openai import OpenAI          # 1. Replace the import

cogslayer.init(api_key="cl_live_xxx",        # 2. Init once at startup
               service="my-api")
client = OpenAI()

@cogslayer.track(feature="chat", team="growth")
def ask(prompt: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

answer = ask("Explain quantum computing")
# -> Event captured: model=gpt-4o, cost=$0.0032, feature=chat, team=growth

answer = ask("Explain quantum computing")
# -> Deduped! Returned cached response, saved $0.0032

What happens under the hood

  1. init() validates your API key, starts the event transport, and activates 4 automatic optimizations.
  2. Before each call, the optimizer checks for cached responses and retry duplicates.
  3. If no optimization applies, the call goes through; cache headers are injected and max_tokens may be capped.
  4. After the call, the response is cached for future dedup.
  5. Every optimization is logged with the exact dollar amount saved, visible in your dashboard.
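The pre-call path (steps 2 through 4) can be sketched in a few lines. This is an illustrative sketch of the decision order, not CogsLayer's actual implementation; a plain dict stands in for the fingerprint-keyed cache.

```python
# Sketch of the optimizer's pre-call decision order (illustrative only).
def route_call(prompt_key, cache, make_call):
    if prompt_key in cache:           # step 2: dedup / retry check
        return cache[prompt_key]      # cached response, no provider call
    response = make_call()            # step 3: the call goes through
    cache[prompt_key] = response      # step 4: cache for future dedup
    return response

cache = {}
first = route_call("explain-qc", cache, lambda: "Quantum computing is...")
second = route_call("explain-qc", cache, lambda: "never evaluated")
assert first == second  # second call never reached the provider
```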

Provider Wrappers

CogsLayer provides drop-in replacements for every major LLM provider. The wrapped clients are API-identical to the originals. You change the import line and nothing else breaks. Async clients are included.

OpenAI + OpenAI-compatible
from cogslayer.openai import OpenAI, AsyncOpenAI

# Standard OpenAI
client = OpenAI()

# Groq, Together, Fireworks, xAI, Ollama: anything OpenAI-compatible
groq = OpenAI(base_url="https://api.groq.com/openai/v1")
together = OpenAI(base_url="https://api.together.xyz/v1")
Anthropic
from cogslayer.anthropic import Anthropic, AsyncAnthropic

client = Anthropic()
Google Gemini
from cogslayer.gemini import Client

client = Client(api_key="...")

All wrappers support both sync and async usage. The async variants (AsyncOpenAI, AsyncAnthropic) work seamlessly with asyncio, and the @track decorator is async-safe with no special configuration.

Alternative: cogslayer.wrap()
import cogslayer
from openai import OpenAI  # Keep your original import

cogslayer.init(api_key="cl_live_xxx")
client = cogslayer.wrap(OpenAI())  # Same client, now tracked + optimized

cogslayer.wrap() patches the underlying provider SDK and returns the same client instance. Use it when you prefer not to change your import lines.

Automatic Optimization: Guaranteed Savings

CogsLayer automatically applies four zero-risk optimizations to every LLM call. These never change your model or affect output quality. They only prevent wasted spend. Enabled by default when you call cogslayer.init().

The guarantee

CogsLayer saves you money from day one. Every prevented call, every cached response, every capped output is logged with the exact dollar amount saved. You can see the running total in your dashboard's Savings page.

Response Dedup

When your application sends the same prompt twice, CogsLayer returns the cached response instead of calling the provider. The cache is keyed on the SHA-256 fingerprint of the prompt content with a configurable TTL (default: 5 minutes).

How it works
import cogslayer
from cogslayer.openai import OpenAI

cogslayer.init(api_key="cl_live_xxx")
client = OpenAI()

@cogslayer.track(feature="faq")
def answer(question: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": question}],
    )
    return resp.choices[0].message.content

# First call: hits OpenAI ($0.003)
answer("What is your return policy?")

# Second call: returns cached response ($0.00)
answer("What is your return policy?")
# -> CogsLayer logs: saved $0.003 via response_dedup
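The cache key can be approximated like this. It is a sketch of the fingerprinting idea; the exact fields CogsLayer includes in the hash are not specified here.

```python
import hashlib
import json

def prompt_fingerprint(model: str, messages: list[dict]) -> str:
    # Serialize deterministically, then hash: identical prompts always
    # map to the same cache key, regardless of dict ordering.
    payload = json.dumps({"model": model, "messages": messages},
                         sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

key_a = prompt_fingerprint("gpt-4o", [{"role": "user", "content": "What is your return policy?"}])
key_b = prompt_fingerprint("gpt-4o", [{"role": "user", "content": "What is your return policy?"}])
assert key_a == key_b  # same prompt, same key: a dedup hit
```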

Prompt Caching

Providers offer discounted pricing for cached input tokens. Anthropic gives a 90% discount, OpenAI 50%, and Gemini 75%. CogsLayer automatically injects the required cache control headers so your repeated system prompts and prefixes get the discount without any code changes.

Anthropic

90% off cached tokens

cache_control header injected on system message

OpenAI

50% off cached tokens

Automatic when prefix matches (CogsLayer ensures system-first ordering)

Gemini

75% off cached tokens

CachedContent support
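To see what the discounts are worth, here is the arithmetic for a prompt with a large cached prefix. The per-1K price is an illustrative placeholder, not a quoted list price.

```python
def input_cost(total_tokens, cached_tokens, price_per_1k, cache_discount):
    """Input cost in USD when cached tokens are billed at a discount."""
    fresh = total_tokens - cached_tokens
    cached_price = price_per_1k * (1 - cache_discount)
    return (fresh * price_per_1k + cached_tokens * cached_price) / 1000

# 10,000-token prompt, 8,000 of it cached, hypothetical $0.0025/1K input price
full = input_cost(10_000, 0, 0.0025, 0.90)          # $0.025 with no cache
discounted = input_cost(10_000, 8_000, 0.0025, 0.90)  # $0.007 at 90% off cached
```

With an 80% cached prefix at Anthropic's 90% discount, the input bill drops by 72% with no code changes.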

Retry Elimination

When the same user triggers the same feature with the same prompt within a short window (default: 5 seconds), CogsLayer recognizes it as a retry and returns the original response. This catches application-level retries, double-clicks, and race conditions.

# User clicks "Generate" twice rapidly
# First click: calls GPT-4o ($0.004)
# Second click: CogsLayer returns cached response ($0.00)
# -> Saved $0.004 via retry_dedup

Output Capping

Most applications don't set max_tokens, so models generate until they're done, often producing 2 to 3x more output than needed. CogsLayer learns your feature's output patterns and automatically sets a max_tokens ceiling at 120% of the historical p95 output length, after observing at least 10 calls.

Manual ceiling
from cogslayer._optimizer import configure

# Set explicit ceilings per feature
configure(max_tokens_profiles={
    "summarization": 300,
    "classification": 50,
    "chat": 1000,
})

Optimizer Configuration

All optimizations are enabled by default. You can disable the entire optimizer or configure individual techniques.

Disable entirely
cogslayer.init(api_key="cl_live_xxx", optimize=False)
Fine-tune settings
from cogslayer._optimizer import configure

configure(
    dedup_ttl=600.0,           # Cache responses for 10 minutes
    dedup_max_size=2000,       # LRU cache holds 2000 entries
    retry_dedup_window=10.0,   # Detect retries within 10 seconds
)

# Disable specific techniques
configure(
    dedup_enabled=False,       # Disable response dedup
    cache_headers_enabled=False,  # Disable cache header injection
)

Wrap your agent run with cogslayer.session()

Group every LLM call in one user or agent flow into a single run in the dashboard. CogsLayer does not store prompt text, only fingerprints and technical metadata.

Multi-call workflow
import cogslayer
from cogslayer.openai import OpenAI

cogslayer.init(api_key="cl_live_xxx")
client = OpenAI()

with cogslayer.session():
    draft = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": user_input}],
    )
    refined = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": draft.choices[0].message.content}],
    )
# Both calls appear under one run: total cost, timeline, and call breakdown.

View run costs in the dashboard

Open Runs to see each flow’s total cost, call count, tokens, and duration. Expand a run to see its summary, a timeline of calls (duration and cost), and a nested call breakdown when your SDK emits span metadata. Older data without spans still shows accurate costs; structure appears for newer SDK versions.
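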

Estimate savings with model swaps and cache assumptions

On a run, use Estimate savings to map models to cheaper alternatives and optionally assume a cache hit rate. Results are hypothetical using published list prices; your actual invoice may differ. The Savings page ranks org-wide opportunities so you know what to validate first.

API: POST /v1/sessions/:session_id/replay with model_overrides and optional cache_rate (0 to 1).
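For reference, a replay request might look like this. The session id and override values are illustrative.

POST /v1/sessions/sess_abc123/replay
Content-Type: application/json
Authorization: Bearer cl_live_xxx

{
  "model_overrides": {"gpt-4o": "gpt-4o-mini"},
  "cache_rate": 0.5
}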

Attribution with @track()

Attribution is what separates CogsLayer from billing scrapers. Instead of seeing “you spent $400 on GPT-4o this month”, you see “the summarization feature in the growth team spent $180 on GPT-4o for user X.” Every LLM call inside a tracked function inherits the attribution context.

Attribution decorator
@cogslayer.track(
    feature="summarization",   # Which product feature
    team="growth",             # Which team owns it
    user_id="usr_123",         # Which end-user triggered it
    tenant="acme-corp",        # Multi-tenant: which customer
)
def summarize(text: str) -> str:
    # Every LLM call in here is tagged with all four labels
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": text}],
    )
    return resp.choices[0].message.content
feature

Product feature name. Use this to answer "which feature costs the most?"

team

Team ownership. Use this for departmental cost allocation and chargebacks.

user_id

End-user identifier. Track per-user spend for abuse detection or usage-based billing.

tenant

Customer/tenant in multi-tenant apps. See cost per customer.

You can pass any additional keyword arguments. They're stored as custom metadata on the event and available in exports and the dashboard.

Streaming Support

Streaming responses work out of the box. CogsLayer captures usage from the final chunk, which contains the complete token counts. Both sync generators and async generators are supported with no additional configuration.

Streaming with CogsLayer
@cogslayer.track(feature="chat")
def stream_response(prompt: str):
    stream = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    for chunk in stream:
        yield chunk.choices[0].delta.content or ""

# Async streaming works the same way
@cogslayer.track(feature="chat")
async def stream_async(prompt: str):
    stream = await async_client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    async for chunk in stream:
        yield chunk.choices[0].delta.content or ""

Custom Model Pricing

CogsLayer ships with a built-in pricing registry for all major models. For fine-tuned models, self-hosted endpoints, or new models not yet in the registry, register custom pricing so cost estimates stay accurate.

cogslayer.register_model(
    "ft:gpt-4o:my-org:custom-model:abc123",
    prompt_price_per_1k=0.005,     # $ per 1K input tokens
    completion_price_per_1k=0.015, # $ per 1K output tokens
)

# Now calls to this model are priced correctly
@cogslayer.track(feature="extraction")
def extract(text: str):
    return client.chat.completions.create(
        model="ft:gpt-4o:my-org:custom-model:abc123",
        messages=[{"role": "user", "content": text}],
    )

Configuration

All configuration is passed to cogslayer.init(). There are no config files or environment variable conventions. Everything is explicit.

api_key (str, required)

Your CogsLayer API key. Starts with cl_live_ or cl_test_.

service (str)

Name of your application or microservice. Useful when multiple services share an org.

environment (str)

Deployment environment (e.g. production, staging). Defaults to None.

base_url (str)

Platform API base URL. Override this for self-hosted deployments.

Full init example
cogslayer.init(
    api_key="cl_live_abc123def456",
    service="recommendation-api",
    environment="production",
    base_url="https://cogslayer.example.com",  # self-hosted
)

Budget Alerts

Set spend thresholds per feature, team, model, or total spend. Alerts are evaluated on-demand. Call check_alerts() from the SDK, hit the API endpoint, or use the “Run Check” button in the dashboard. When spend exceeds a threshold, the alert triggers and logs to the alert history.

Create and check alerts from the SDK
import cogslayer

cogslayer.init(api_key="cl_live_xxx")

# Check which alerts are currently triggered
result = cogslayer.check_alerts()
for alert in result["triggered"]:
    name = alert["alert_name"]
    spend = alert["spend_usd"]
    limit = alert["threshold_usd"]
    print(f"Warning: {name}: spent {spend:.2f} / {limit:.2f} USD")
Create alert via API
POST /v1/alerts
Content-Type: application/json
Authorization: Bearer cl_live_xxx

{
  "name": "Daily GPT-4o budget",
  "threshold_usd": 50.00,
  "period": "daily",
  "scope": "model",
  "scope_value": "gpt-4o"
}

Period

daily, weekly, monthly

Scope

total, feature, team, model, service

Evaluation

On-demand via SDK, API, or dashboard

Cost Insights

CogsLayer analyzes your usage patterns and generates actionable recommendations to reduce spend. Because it tracks code-level attribution, recommendations are scoped to specific features, not generic “use a cheaper model” advice.

Model downgrade

Identifies features calling expensive models (GPT-4o, Claude Opus) for tasks where cheaper alternatives (GPT-4o-mini, Haiku) would work. Estimates savings based on actual token volumes.

Caching opportunity

Detects features with low cached_tokens ratios and high request volumes. Suggests enabling prompt caching for repeated system prompts.

Anomaly detection

Flags features with sudden cost spikes compared to their historical baseline. Catches runaway loops, unexpected traffic, or prompt injection attacks.

Reasoning overkill

Identifies calls using reasoning models (o1, o3) where reasoning_tokens dominate the cost. Suggests non-reasoning alternatives for simpler tasks.

Fetch insights from the SDK
insights = cogslayer.get_insights(days=30)

total = insights["total_estimated_savings_usd"]
print(f"Total estimated savings: {total:.2f} USD")

for rec in insights["insights"]:
    print(f"[{rec['severity']}] {rec['message']}")
    print(f"  -> Save ~{rec['estimated_savings_usd']:.2f} USD")
    print()

CSV Export

Export raw events as CSV for offline analysis, compliance, or piping into your own BI tools. Every column, including all attribution fields, is included. Available from the SDK, the API, or the dashboard's Events page.

Export from the SDK
from datetime import datetime

csv_data = cogslayer.export_csv(
    start=datetime(2025, 1, 1),
    end=datetime(2025, 1, 31),
    feature="chat",
)

with open("january_chat_events.csv", "w") as f:
    f.write(csv_data)
Export via API
GET /v1/events/export?start=2025-01-01&end=2025-01-31&feature=chat
Authorization: Bearer cl_live_xxx

# Response: text/csv with all event columns

Time-series Data

Query bucketed cost data over time for building charts, tracking trends, or feeding into monitoring pipelines. Supports hourly, daily, and weekly intervals with optional grouping.

Fetch time-series from the SDK
from datetime import datetime

data = cogslayer.get_timeseries(
    interval="day",
    group_by="feature",
    start=datetime(2025, 1, 1),
    end=datetime(2025, 1, 31),
)

for bucket in data["data"]:
    cost = bucket["cost_usd"]
    count = bucket["event_count"]
    print(f"{bucket['bucket']}: {cost:.4f} USD ({count} events)")

SDK Methods

All public methods available after import cogslayer. Platform query methods require a valid API key set via init().

Initialization

cogslayer.init(api_key, *, service, environment, base_url, optimize=True)

Initialize the SDK. Validates your API key, starts the event transport, and activates automatic optimizations. Pass optimize=False to disable all automatic savings.

cogslayer.wrap(client)

Wrap any LLM client to enable tracking and optimization. Returns the same client instance. Equivalent to calling cogslayer.patch() but with a fluent API: client = cogslayer.wrap(OpenAI()).

cogslayer.shutdown()

Flush all pending events and stop the background transport. Call this before your process exits to avoid losing data.

Tracking

@cogslayer.track(feature, team, user_id, tenant, ...)

Decorator that attaches attribution context to every LLM call within the decorated function. Supports arbitrary keyword arguments for custom dimensions.

cogslayer.session(name, session_id=None)

Context manager that groups all LLM calls into a single agent run. Works with both sync (with) and async (async with). All calls inside share the same session_id in the dashboard.

cogslayer.estimate(model, messages, max_tokens=500)

Estimate cost before making a call. Returns prompt_tokens, completion_tokens, estimated_cost_usd. Uses the local pricing registry, no API call needed.
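Under the hood, the estimate is registry math applied to token counts. A back-of-envelope version, with illustrative placeholder prices rather than registry values:

```python
def estimated_cost_usd(prompt_tokens: int, completion_tokens: int,
                       prompt_price_per_1k: float,
                       completion_price_per_1k: float) -> float:
    # Per-1K prices scaled by actual token counts.
    return (prompt_tokens * prompt_price_per_1k
            + completion_tokens * completion_price_per_1k) / 1000

# 1,200 input + 500 output tokens at hypothetical $0.0025/$0.01 per 1K
cost = estimated_cost_usd(1_200, 500, 0.0025, 0.01)
# 1200 * 0.0025/1000 + 500 * 0.01/1000 = 0.003 + 0.005 = 0.008
```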

cogslayer.register_model(model, prompt_price_per_1k, completion_price_per_1k)

Register custom per-token pricing for fine-tuned or self-hosted models. Prices are per 1K tokens.

Optimizer

from cogslayer._optimizer import configure

Fine-tune optimizer settings: dedup_ttl, dedup_max_size, retry_dedup_window, cache_headers_enabled, max_tokens_profiles.

from cogslayer._optimizer import get_total_savings

Returns the total USD saved by automatic optimizations in the current process. Useful for logging or monitoring.

Platform queries

cogslayer.get_insights(*, days=30) -> dict

Fetch cost optimization recommendations. Returns insights with type, severity, estimated savings, and affected features.

cogslayer.check_alerts() -> dict

Evaluate all active budget alerts against current period spend. Returns which alerts triggered and their spend vs threshold.

cogslayer.get_cost_summary(*, group_by, start, end) -> dict

Fetch aggregated cost breakdown from the platform. Group by model, provider, feature, team, or service.

cogslayer.get_timeseries(*, interval, group_by, start, end) -> dict

Fetch time-series cost data bucketed by hour, day, or week. Useful for building custom dashboards.

cogslayer.export_csv(*, start, end, **filters) -> str

Export events as a CSV string with all attribution columns. Filters by model, provider, feature, team.

API Endpoints

The CogsLayer platform exposes a REST API. All endpoints require authentication via the Authorization header: either an API key (cl_live_...) or a JWT from the dashboard. All request and response bodies are JSON unless noted otherwise.

Authentication

GET
/v1/auth/verify

Validate an API key. Called automatically by the SDK during init.

Events

POST
/v1/events

Ingest a batch of events (max 1000 per request). Used internally by the SDK transport.

GET
/v1/events

Query events with filters. Supports model, provider, feature, team, start, end, limit, offset.

GET
/v1/events/export

Download all matching events as a CSV file with full attribution columns.

Costs

GET
/v1/costs/summary

Aggregated cost breakdown. Group by model, provider, feature, team, or service.

GET
/v1/costs/timeseries

Bucketed cost data over time. Supports hourly, daily, and weekly intervals.

Insights

GET
/v1/insights

Returns actionable cost optimization recommendations based on your usage patterns.

Savings

GET
/v1/savings/opportunities

Ranked savings opportunities (model swaps, expensive runs, cache hints). Query: days, limit.

GET
/v1/savings/realized

Total savings realized: includes both rule-based savings and automatic SDK optimizations (dedup, caching, retry, output cap). Returns per-technique breakdown.

Budget alerts

POST
/v1/alerts

Create a new budget alert with threshold, period, and scope.

GET
/v1/alerts

List all budget alerts for your organization.

PATCH
/v1/alerts/:id

Update an alert's name, threshold, period, scope, or enabled state.

DELETE
/v1/alerts/:id

Permanently delete a budget alert.

POST
/v1/alerts/check

Evaluate all active alerts against current spend. Returns which ones triggered.

GET
/v1/alerts/history

View a log of past alert triggers with spend vs threshold data.

API keys

POST
/v1/keys

Create a new API key. The full key is only returned once.

GET
/v1/keys

List all API keys (prefix only, not the full key).

DELETE
/v1/keys/:id

Revoke an API key. Revoked keys stop working immediately.

Tracked Metrics

Every LLM call captured by CogsLayer records these fields. Usage metrics are extracted automatically from provider responses. Attribution fields are set through @cogslayer.track() and cogslayer.init().

cost_usd           Estimated cost in USD from the built-in pricing registry
prompt_tokens      Input tokens sent to the model
completion_tokens  Output tokens generated by the model
total_tokens       Sum of prompt + completion tokens
reasoning_tokens   Thinking/reasoning tokens (o1, o3, o4-mini)
cached_tokens      Tokens served from prompt cache rather than recomputed
latency_ms         Round-trip response time in milliseconds
model              Model identifier (e.g. gpt-4o, claude-sonnet-4-20250514)
provider           Provider name (openai, anthropic, google, etc.)
feature            Business feature attribution, set via @track()
team               Team attribution, set via @track()
user_id            End-user attribution, set via @track()
service            Application or microservice name, set via init()
environment        Deployment environment (production, staging), set via init()

Error Handling

CogsLayer is designed to never break your application. Tracking failures are silently swallowed. If the platform is unreachable or returns an error, your LLM calls still execute normally. Failed events are logged at debug level and dropped to avoid unbounded memory growth.

401

Invalid or revoked API key. Check the key passed to init().

429

Rate limit exceeded (1000 req/min per org). The SDK will drop events that fail to send.

400

Malformed event payload. Usually means a provider wrapper bug. Please report it.

500

Platform error. Failed events are logged at debug level and dropped.

Platform query methods (get_insights, check_alerts, etc.) do raise exceptions on failure, since they're explicit data fetches. Wrap them in try/except if your application needs to handle platform downtime gracefully.
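A minimal defensive pattern for those explicit fetches. The helper name and fallback choice are ours; `fetch` stands in for any platform query such as cogslayer.get_insights.

```python
def safe_platform_query(fetch, fallback=None):
    """Run a platform query; return a fallback instead of raising
    when the platform is down or unreachable."""
    try:
        return fetch()
    except Exception:
        # Platform downtime should not take the request path down with it;
        # log here if you want visibility into failures.
        return fallback

# Example usage (assumes cogslayer.init() has already run):
# insights = safe_platform_query(lambda: cogslayer.get_insights(days=30),
#                                fallback={"insights": []})
```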