Track and reduce the cost of every AI call
2 lines of code. Full cost attribution by feature and team. 4 automatic optimizations that cut waste with zero quality risk. CogsLayer shows you where every dollar goes and eliminates what's wasted.
Installation
CogsLayer has no required dependencies. It uses Python's standard library for HTTP transport, so it won't interfere with your existing dependency tree.
pip install cogslayer

Requires Python 3.10 or later. Works on Linux, macOS, and Windows.
Quick Start
Two changes to your existing code: swap the provider import, and decorate functions you want to track. CogsLayer handles the rest. Every call inside a tracked function is captured with full token counts, cost, and latency.
import cogslayer
from cogslayer.openai import OpenAI  # 1. Replace the import

cogslayer.init(
    api_key="cl_live_xxx",  # 2. Init once at startup
    service="my-api",
)
client = OpenAI()
@cogslayer.track(feature="chat", team="growth")
def ask(prompt: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content
answer = ask("Explain quantum computing")
# -> Event captured: model=gpt-4o, cost=$0.0032, feature=chat, team=growth
answer = ask("Explain quantum computing")
# -> Deduped! Returned cached response, saved $0.0032What happens under the hood
init()validates your API key, starts the event transport, and activates 4 automatic optimizations.- Before each call, the optimizer checks for cached responses and retry duplicates.
- If no optimization applies, the call goes through. Cache headers are injected, max_tokens may be set.
- After the call, the response is cached for future dedup.
- Every optimization is logged with the exact dollar amount saved, visible in your dashboard.
Provider Wrappers
CogsLayer provides drop-in replacements for every major LLM provider. The wrapped clients are API-identical to the originals. You change the import line and nothing else breaks. Async clients are included.
from cogslayer.openai import OpenAI, AsyncOpenAI
# Standard OpenAI
client = OpenAI()
# Groq, Together, Fireworks, xAI, Ollama: anything OpenAI-compatible
groq = OpenAI(base_url="https://api.groq.com/openai/v1")
together = OpenAI(base_url="https://api.together.xyz/v1")from cogslayer.anthropic import Anthropic, AsyncAnthropic
client = Anthropic()from cogslayer.gemini import Client
client = Client(api_key="...")All wrappers support both sync and async usage. The async variants (AsyncOpenAI, AsyncAnthropic) work seamlessly with asyncio and the decorator is async-safe with no special configuration needed.
import cogslayer
from openai import OpenAI # Keep your original import
cogslayer.init(api_key="cl_live_xxx")
client = cogslayer.wrap(OpenAI()) # Same client, now tracked + optimizedcogslayer.wrap() patches the underlying provider SDK and returns the same client instance. Use it when you prefer not to change your import lines.
Automatic Optimization: Guaranteed Savings
CogsLayer automatically applies four zero-risk optimizations to every LLM call. These never change your model or affect output quality. They only prevent wasted spend. Enabled by default when you call cogslayer.init().
The guarantee
CogsLayer saves you money from day one. Every prevented call, every cached response, every capped output is logged with the exact dollar amount saved. You can see the running total in your dashboard's Savings page.
Response Dedup
When your application sends the same prompt twice, CogsLayer returns the cached response instead of calling the provider. The cache is keyed on the SHA-256 fingerprint of the prompt content with a configurable TTL (default: 5 minutes).
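The fingerprint described above can be sketched as a hash over the canonical prompt content. The helper below is illustrative only, not CogsLayer's actual key function:

```python
import hashlib
import json

def dedup_key(model: str, messages: list[dict]) -> str:
    """Illustrative dedup cache key: SHA-256 over the canonical JSON
    of the model plus prompt content (sketch, not CogsLayer internals)."""
    canonical = json.dumps({"model": model, "messages": messages},
                           sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

k1 = dedup_key("gpt-4o", [{"role": "user", "content": "What is your return policy?"}])
k2 = dedup_key("gpt-4o", [{"role": "user", "content": "What is your return policy?"}])
assert k1 == k2  # identical prompts collapse to one cache entry
```

Because the key is content-derived, any change to the model or the prompt produces a different entry, so only true repeats are served from cache.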
import cogslayer
from cogslayer.openai import OpenAI
cogslayer.init(api_key="cl_live_xxx")
client = OpenAI()
@cogslayer.track(feature="faq")
def answer(question: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": question}],
    )
    return resp.choices[0].message.content
# First call: hits OpenAI ($0.003)
answer("What is your return policy?")
# Second call: returns cached response ($0.00)
answer("What is your return policy?")
# -> CogsLayer logs: saved $0.003 via response_dedup

Prompt Caching
Providers offer discounted pricing for cached input tokens. Anthropic gives a 90% discount, OpenAI 50%, and Gemini 75%. CogsLayer automatically injects the required cache control headers so your repeated system prompts and prefixes get the discount without any code changes.
Anthropic: 90% off cached tokens. cache_control header injected on the system message.
OpenAI: 50% off cached tokens. Automatic when the prefix matches (CogsLayer ensures system-first ordering).
Gemini: 75% off cached tokens. CachedContent support.
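As illustrative arithmetic only: assuming a $2.50 per million input-token list price (an assumption here, so check your provider's current pricing) and the discount rates above, caching a large repeated prefix cuts input cost substantially:

```python
# Assumed base price: $2.50 per 1M input tokens (illustrative, not a quote).
BASE_PRICE_PER_TOKEN = 2.50 / 1_000_000

def input_cost(tokens: int, cached_tokens: int, cache_discount: float) -> float:
    """Input cost where `cached_tokens` receive `cache_discount` off."""
    fresh = tokens - cached_tokens
    return (fresh * BASE_PRICE_PER_TOKEN
            + cached_tokens * BASE_PRICE_PER_TOKEN * (1 - cache_discount))

# 8,000-token system prompt served from cache, 500 fresh tokens, 50% discount
with_cache = input_cost(8_500, 8_000, 0.50)     # 0.01125
without_cache = input_cost(8_500, 0, 0.50)      # 0.02125
```

At a 50% discount the cached portion costs half as much, so requests dominated by a stable system prompt see close to a 50% reduction in input spend.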
Retry Elimination
When the same user triggers the same feature with the same prompt within a short window (default: 5 seconds), CogsLayer recognizes it as a retry and returns the original response. This catches application-level retries, double-clicks, and race conditions.
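The window check above can be sketched as a keyed lookup with a timestamp comparison. Names and structure here are illustrative, not CogsLayer internals:

```python
RETRY_WINDOW_S = 5.0  # default window described above
_last_seen: dict[tuple, tuple[float, str]] = {}  # (user, feature, prompt_hash) -> (timestamp, response)

def check_retry(user_id: str, feature: str, prompt_hash: str, now: float):
    """Same user + feature + prompt inside the window counts as a retry
    and replays the original response (illustrative sketch)."""
    key = (user_id, feature, prompt_hash)
    hit = _last_seen.get(key)
    if hit is not None and now - hit[0] <= RETRY_WINDOW_S:
        return hit[1]  # retry detected: return the cached response
    return None  # no recent duplicate: let the call go through

def record(user_id: str, feature: str, prompt_hash: str, now: float, response: str):
    """Store the response so a near-term duplicate can be replayed."""
    _last_seen[(user_id, feature, prompt_hash)] = (now, response)
```

Keying on user and feature as well as the prompt means two different users asking the same question are not treated as retries of each other.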
# User clicks "Generate" twice rapidly
# First click: calls GPT-4o ($0.004)
# Second click: CogsLayer returns cached response ($0.00)
# -> Saved $0.004 via retry_dedup

Output Capping
Most applications don't set max_tokens, so models generate until they're done, often producing 2 to 3x more output than needed. CogsLayer learns your feature's output patterns and automatically sets a max_tokens ceiling at 120% of the historical p95 output length, after observing at least 10 calls.
from cogslayer._optimizer import configure
# Set explicit ceilings per feature
configure(max_tokens_profiles={
    "summarization": 300,
    "classification": 50,
    "chat": 1000,
})

Optimizer Configuration
All optimizations are enabled by default. You can disable the entire optimizer or configure individual techniques.
cogslayer.init(api_key="cl_live_xxx", optimize=False)

from cogslayer._optimizer import configure
configure(
    dedup_ttl=600.0,          # Cache responses for 10 minutes
    dedup_max_size=2000,      # LRU cache holds 2000 entries
    retry_dedup_window=10.0,  # Detect retries within 10 seconds
)

# Disable specific techniques
configure(
    dedup_enabled=False,          # Disable response dedup
    cache_headers_enabled=False,  # Disable cache header injection
)

Wrap your agent run with cogslayer.session()
Group every LLM call in one user or agent flow into a single run in the dashboard. CogsLayer does not store prompt text, only fingerprints and technical metadata.
import cogslayer
from cogslayer.openai import OpenAI
cogslayer.init(api_key="cl_live_xxx")
client = OpenAI()
with cogslayer.session():
    draft = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": user_input}],
    )
    refined = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": draft.choices[0].message.content}],
    )
# Both calls appear under one run: total cost, timeline, and call breakdown.

View run costs in the dashboard
Open Runs to see each flow's total cost, call count, tokens, and duration. Expand a run to see its summary, a timeline of calls with per-call duration and cost, and a nested call breakdown when your SDK emits span metadata. Older data without spans still shows accurate costs; the nested structure appears once you upgrade to a newer SDK version.
Estimate savings with model swaps and cache assumptions
On a run, use Estimate savings to map models to cheaper alternatives and optionally assume a cache hit rate. Results are hypothetical using published list prices; your actual invoice may differ. The Savings page ranks org-wide opportunities so you know what to validate first.
API: POST /v1/sessions/:session_id/replay with model_overrides and optional cache_rate (0 to 1).
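A minimal sketch of the replay request body, using only the parameters named above (model_overrides and the optional cache_rate); the exact override mapping shown is a hypothetical example:

```python
import json

# Hypothetical replay payload: remap gpt-4o calls to gpt-4o-mini and
# assume 60% of input tokens would hit the cache.
payload = {
    "model_overrides": {"gpt-4o": "gpt-4o-mini"},
    "cache_rate": 0.6,  # optional, between 0 and 1
}
body = json.dumps(payload)
# POST this body to /v1/sessions/:session_id/replay with your API key
# in the Authorization header.
```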
Attribution with @track()
Attribution is what separates CogsLayer from billing scrapers. Instead of seeing “you spent $400 on GPT-4o this month”, you see “the summarization feature in the growth team spent $180 on GPT-4o for user X.” Every LLM call inside a tracked function inherits the attribution context.
@cogslayer.track(
    feature="summarization",  # Which product feature
    team="growth",            # Which team owns it
    user_id="usr_123",        # Which end-user triggered it
    tenant="acme-corp",       # Multi-tenant: which customer
)
def summarize(text: str) -> str:
    # Every LLM call in here is tagged with all four labels
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": text}],
    )
    return resp.choices[0].message.content

feature: Product feature name. Use this to answer "which feature costs the most?"
team: Team ownership. Use this for departmental cost allocation and chargebacks.
user_id: End-user identifier. Track per-user spend for abuse detection or usage-based billing.
tenant: Customer/tenant in multi-tenant apps. See cost per customer.
You can pass any additional keyword arguments. They're stored as custom metadata on the event and available in exports and the dashboard.
Streaming Support
Streaming responses work out of the box. CogsLayer captures usage from the final chunk, which contains the complete token counts. Both sync generators and async generators are supported with no additional configuration.
@cogslayer.track(feature="chat")
def stream_response(prompt: str):
    stream = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    for chunk in stream:
        yield chunk.choices[0].delta.content or ""

# Async streaming works the same way
@cogslayer.track(feature="chat")
async def stream_async(prompt: str):
    stream = await async_client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    async for chunk in stream:
        yield chunk.choices[0].delta.content or ""

Custom Model Pricing
CogsLayer ships with a built-in pricing registry for all major models. For fine-tuned models, self-hosted endpoints, or new models not yet in the registry, register custom pricing so cost estimates stay accurate.
cogslayer.register_model(
    "ft:gpt-4o:my-org:custom-model:abc123",
    prompt_price_per_1k=0.005,      # $ per 1K input tokens
    completion_price_per_1k=0.015,  # $ per 1K output tokens
)

# Now calls to this model are priced correctly
@cogslayer.track(feature="extraction")
def extract(text: str):
    return client.chat.completions.create(
        model="ft:gpt-4o:my-org:custom-model:abc123",
        messages=[{"role": "user", "content": text}],
    )

Configuration
All configuration is passed to cogslayer.init(). There are no config files or environment variable conventions. Everything is explicit.
api_key (str, required): Your CogsLayer API key. Starts with cl_live_ or cl_test_.
service (str): Name of your application or microservice. Useful when multiple services share an org.
environment (str): Deployment environment (e.g. production, staging). Defaults to None.
base_url (str): Platform API base URL. Override this for self-hosted deployments.
cogslayer.init(
    api_key="cl_live_abc123def456",
    service="recommendation-api",
    environment="production",
    base_url="https://cogslayer.example.com",  # self-hosted
)

Budget Alerts
Set spend thresholds per feature, team, model, or total spend. Alerts are evaluated on-demand. Call check_alerts() from the SDK, hit the API endpoint, or use the “Run Check” button in the dashboard. When spend exceeds a threshold, the alert triggers and logs to the alert history.
import cogslayer
cogslayer.init(api_key="cl_live_xxx")
# Check which alerts are currently triggered
result = cogslayer.check_alerts()
for alert in result["triggered"]:
    name = alert["alert_name"]
    spend = alert["spend_usd"]
    limit = alert["threshold_usd"]
    print(f"Warning: {name}: spent {spend:.2f} / {limit:.2f} USD")

POST /v1/alerts
Content-Type: application/json
Authorization: Bearer cl_live_xxx
{
  "name": "Daily GPT-4o budget",
  "threshold_usd": 50.00,
  "period": "daily",
  "scope": "model",
  "scope_value": "gpt-4o"
}

Period: daily, weekly, monthly
Scope: total, feature, team, model, service
Evaluation: on-demand via SDK, API, or dashboard
Cost Insights
CogsLayer analyzes your usage patterns and generates actionable recommendations to reduce spend. Because it tracks code-level attribution, recommendations are scoped to specific features, not generic “use a cheaper model” advice.
Model downgrade
Identifies features calling expensive models (GPT-4o, Claude Opus) for tasks where cheaper alternatives (GPT-4o-mini, Haiku) would work. Estimates savings based on actual token volumes.
Caching opportunity
Detects features with low cached_tokens ratios and high request volumes. Suggests enabling prompt caching for repeated system prompts.
Anomaly detection
Flags features with sudden cost spikes compared to their historical baseline. Catches runaway loops, unexpected traffic, or prompt injection attacks.
Reasoning overkill
Identifies calls using reasoning models (o1, o3) where reasoning_tokens dominate the cost. Suggests non-reasoning alternatives for simpler tasks.
insights = cogslayer.get_insights(days=30)
total = insights["total_estimated_savings_usd"]
print(f"Total estimated savings: {total:.2f} USD")
for rec in insights["insights"]:
    print(f"[{rec['severity']}] {rec['message']}")
    print(f"  -> Save ~{rec['estimated_savings_usd']:.2f} USD")
    print()

CSV Export
Export raw events as CSV for offline analysis, compliance, or piping into your own BI tools. Every column, including all attribution fields, is included. Available from the SDK, the API, or the dashboard's Events page.
from datetime import datetime
csv_data = cogslayer.export_csv(
    start=datetime(2025, 1, 1),
    end=datetime(2025, 1, 31),
    feature="chat",
)
with open("january_chat_events.csv", "w") as f:
    f.write(csv_data)

GET /v1/events/export?start=2025-01-01&end=2025-01-31&feature=chat
Authorization: Bearer cl_live_xxx
# Response: text/csv with all event columns

Time-series Data
Query bucketed cost data over time for building charts, tracking trends, or feeding into monitoring pipelines. Supports hourly, daily, and weekly intervals with optional grouping.
from datetime import datetime
data = cogslayer.get_timeseries(
    interval="day",
    group_by="feature",
    start=datetime(2025, 1, 1),
    end=datetime(2025, 1, 31),
)
for bucket in data["data"]:
    cost = bucket["cost_usd"]
    count = bucket["event_count"]
    print(f"{bucket['bucket']}: {cost:.4f} USD ({count} events)")

SDK Methods
All public methods available after import cogslayer. Platform query methods require a valid API key set via init().
Initialization
cogslayer.init(api_key, *, service, environment, base_url, optimize=True)
Initialize the SDK. Validates your API key, starts the event transport, and activates automatic optimizations. Pass optimize=False to disable all automatic savings.

cogslayer.wrap(client)
Wrap any LLM client to enable tracking and optimization. Returns the same client instance. Equivalent to calling cogslayer.patch() but with a fluent API: client = cogslayer.wrap(OpenAI()).

cogslayer.shutdown()
Flush all pending events and stop the background transport. Call this before your process exits to avoid losing data.
Tracking
@cogslayer.track(feature, team, user_id, tenant, ...)
Decorator that attaches attribution context to every LLM call within the decorated function. Supports arbitrary keyword arguments for custom dimensions.

cogslayer.session(name, session_id=None)
Context manager that groups all LLM calls into a single agent run. Works with both sync (with) and async (async with). All calls inside share the same session_id in the dashboard.

cogslayer.estimate(model, messages, max_tokens=500)
Estimate cost before making a call. Returns prompt_tokens, completion_tokens, and estimated_cost_usd. Uses the local pricing registry; no API call needed.

cogslayer.register_model(model, prompt_price_per_1k, completion_price_per_1k)
Register custom per-token pricing for fine-tuned or self-hosted models. Prices are per 1K tokens.
Optimizer
from cogslayer._optimizer import configure
Fine-tune optimizer settings: dedup_ttl, dedup_max_size, retry_dedup_window, cache_headers_enabled, max_tokens_profiles.

from cogslayer._optimizer import get_total_savings
Returns the total USD saved by automatic optimizations in the current process. Useful for logging or monitoring.
Platform queries
cogslayer.get_insights(*, days=30) -> dict
Fetch cost optimization recommendations. Returns insights with type, severity, estimated savings, and affected features.

cogslayer.check_alerts() -> dict
Evaluate all active budget alerts against current period spend. Returns which alerts triggered and their spend vs threshold.

cogslayer.get_cost_summary(*, group_by, start, end) -> dict
Fetch aggregated cost breakdown from the platform. Group by model, provider, feature, team, or service.

cogslayer.get_timeseries(*, interval, group_by, start, end) -> dict
Fetch time-series cost data bucketed by hour, day, or week. Useful for building custom dashboards.

cogslayer.export_csv(*, start, end, **filters) -> str
Export events as a CSV string with all attribution columns. Filters by model, provider, feature, team.
API Endpoints
The CogsLayer platform exposes a REST API. All endpoints require authentication via the Authorization header: either an API key (cl_live_...) or a JWT from the dashboard. All request and response bodies are JSON unless noted otherwise.
Authentication
GET /v1/auth/verify
Validate an API key. Called automatically by the SDK during init.
Events
POST /v1/events
Ingest a batch of events (max 1000 per request). Used internally by the SDK transport.

GET /v1/events
Query events with filters. Supports model, provider, feature, team, start, end, limit, offset.

GET /v1/events/export
Download all matching events as a CSV file with full attribution columns.
Costs
GET /v1/costs/summary
Aggregated cost breakdown. Group by model, provider, feature, team, or service.

GET /v1/costs/timeseries
Bucketed cost data over time. Supports hourly, daily, and weekly intervals.
Insights
GET /v1/insights
Returns actionable cost optimization recommendations based on your usage patterns.
Savings
GET /v1/savings/opportunities
Ranked savings opportunities (model swaps, expensive runs, cache hints). Query: days, limit.

GET /v1/savings/realized
Total savings realized: includes both rule-based savings and automatic SDK optimizations (dedup, caching, retry, output cap). Returns per-technique breakdown.
Budget alerts
POST /v1/alerts
Create a new budget alert with threshold, period, and scope.

GET /v1/alerts
List all budget alerts for your organization.

PATCH /v1/alerts/:id
Update an alert's name, threshold, period, scope, or enabled state.

DELETE /v1/alerts/:id
Permanently delete a budget alert.

POST /v1/alerts/check
Evaluate all active alerts against current spend. Returns which ones triggered.

GET /v1/alerts/history
View a log of past alert triggers with spend vs threshold data.
API keys
POST /v1/keys
Create a new API key. The full key is only returned once.

GET /v1/keys
List all API keys (prefix only, not the full key).

DELETE /v1/keys/:id
Revoke an API key. Revoked keys stop working immediately.
Tracked Metrics
Every LLM call captured by CogsLayer records these fields. Usage metrics are extracted automatically from provider responses. Attribution fields are set through @cogslayer.track() and cogslayer.init().
cost_usd: Estimated cost in USD from the built-in pricing registry
prompt_tokens: Input tokens sent to the model
completion_tokens: Output tokens generated by the model
total_tokens: Sum of prompt + completion tokens
reasoning_tokens: Thinking/reasoning tokens (o1, o3, o4-mini)
cached_tokens: Tokens served from prompt cache rather than recomputed
latency_ms: Round-trip response time in milliseconds
model: Model identifier (e.g. gpt-4o, claude-sonnet-4-20250514)
provider: Provider name (openai, anthropic, google, etc.)
feature: Business feature attribution, set via @track()
team: Team attribution, set via @track()
user_id: End-user attribution, set via @track()
service: Application or microservice name, set via init()
environment: Deployment environment (production, staging), set via init()

Error Handling
CogsLayer is designed to never break your application. Tracking failures are silently swallowed. If the platform is unreachable or returns an error, your LLM calls still execute normally. Failed events are logged at debug level and dropped to avoid unbounded memory growth.
401: Invalid or revoked API key. Check the key passed to init().
429: Rate limit exceeded (1000 req/min per org). The SDK will drop events that fail to send.
400: Malformed event payload. Usually means a provider wrapper bug. Please report it.
500: Platform error. Failed events are logged at debug level and dropped.
Platform query methods (get_insights, check_alerts, etc.) do raise exceptions on failure, since they're explicit data fetches. Wrap them in try/except if your application needs to handle platform downtime gracefully.
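A generic sketch of that pattern, with fetch_insights standing in for any platform query method (the stub below simulates downtime rather than calling the real SDK):

```python
def fetch_insights():
    """Stand-in for a platform query such as cogslayer.get_insights;
    here it simulates an unreachable platform."""
    raise ConnectionError("platform unreachable")

def insights_or_empty():
    """Fall back to a safe empty result when the platform is down."""
    try:
        return fetch_insights()
    except (ConnectionError, TimeoutError):
        return {"insights": [], "total_estimated_savings_usd": 0.0}

result = insights_or_empty()  # application keeps working during downtime
```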