In Short: AI Agents Cost More Than Most Organisations Budget For
The total cost of operating a production AI agent is typically two to five times higher than the raw model API cost that most organisations estimate during planning. Token costs are real and significant, but they represent only one dimension of a cost structure that also includes infrastructure, orchestration, storage, retrieval, observability, and ongoing maintenance.
This guide breaks down the full cost structure of production AI agents on Microsoft Azure and Foundry, so you can build a realistic total cost of ownership model before committing to production infrastructure.
The Four Cost Dimensions of Production AI Agents
Production AI agent costs fall into four categories:
- Token costs - model input and output tokens consumed per agent turn
- Infrastructure costs - compute, storage, retrieval, and orchestration
- Microsoft platform costs - Foundry, Fabric, and Microsoft 365 components
- Operational costs - monitoring, maintenance, prompt engineering, and iteration
Most cost estimates focus only on the first. All four need to be included in a credible TCO model.
Token Costs: The Largest Variable
Token costs are driven by three factors: model tier, context window size, and the number of agent turns per task.
Model Tier
Azure OpenAI pricing varies significantly across models. As of mid-2026 approximate pricing (USD per million tokens):
- GPT-4o: $2.50 input / $10.00 output
- GPT-4o mini: $0.15 input / $0.60 output
- o3: $10.00 input / $40.00 output
- o1: $15.00 input / $60.00 output
The difference between using GPT-4o for every turn versus using GPT-4o mini for routing and GPT-4o only for complex reasoning is often a 5–10x reduction in model costs. Model tiering is the single highest-leverage cost optimisation available.
Context Window Size
Each agent turn passes the full conversation history (or a summarised version of it) plus all tool results plus the system prompt to the model. Context accumulates quickly in multi-turn conversations. An agent handling a task that requires ten tool calls and responses could be passing 20,000–50,000 tokens of context per turn by the end of the workflow.
Prompt caching (available on GPT-4o and o1 in Azure) reduces the input cost for repeated prompt prefixes by approximately 50–75%. For agents with long, stable system prompts, caching should always be enabled.
Turns Per Task
An agent that requires three tool calls to complete a task consumes roughly three times the tokens of an agent that requires one. Planning agent architectures to minimise unnecessary tool calls - through better tool design, better system prompt structure, and appropriate task decomposition - directly reduces token costs.
Infrastructure Costs
Vector Store and Retrieval
Agents that use retrieval-augmented generation (RAG) - querying a knowledge base or document store to retrieve relevant context - incur Azure AI Search costs in addition to model costs:
- Azure AI Search: Charged per search unit per hour (S1 tier: approximately $0.10/hour = ~$73/month per search unit) plus per-query execution costs
- Embedding model calls: Converting documents and queries to vector embeddings uses an additional model (typically text-embedding-3-small or text-embedding-3-large) at a separate token cost
For agents with large knowledge bases queried frequently, retrieval infrastructure costs can match or exceed model token costs.
Thread Storage and Memory
Persistent thread storage in Foundry Agent Service is billed per GB stored. For agents handling hundreds of concurrent users with long conversation histories, thread storage can become a material line item. Implement thread pruning policies - deleting or summarising old threads above a defined age threshold - to control this cost.
Code Interpreter and File Processing
The code interpreter tool (for data analysis, calculations, and code execution) is billed per session. Agents that use code interpreter for every task will accumulate session costs quickly. Use code interpreter only where it provides genuine capability advantage over a custom function tool.
Compute (for Custom Orchestration)
Agents using Azure Container Apps, Azure Functions, or AKS for custom orchestration logic incur compute costs independent of the model costs. These are typically modest for low-to-medium volume agents but can become significant at high request volumes.
Microsoft Platform-Specific Costs
Azure AI Foundry
Foundry Agent Service itself does not carry a per-agent or per-deployment fee - costs are driven by underlying resource consumption (model tokens, search, storage). The Foundry hub and project infrastructure has a small base cost from associated Azure resources (Key Vault, Storage Account, Container Registry), typically under $50/month for a standard configuration.
Microsoft Fabric
Agents that query Fabric data consume Fabric capacity (F-SKU compute units). A Fabric Data Agent or a Foundry agent using Fabric function tools will consume CUs from your provisioned F-SKU capacity. For organisations already running Fabric for analytics workloads, agent query consumption can be absorbed by existing capacity if headroom exists, or may require capacity scaling if agents run at high volume.
Microsoft Copilot and Microsoft 365
Agents surfaced through Microsoft 365 Copilot (including Rayfin-powered natural language interfaces) require Microsoft 365 Copilot licences for end users at $30/user/month. For organisation-wide deployment, this is often the largest single line item in the AI stack cost - larger than all infrastructure costs combined.
Agents accessed programmatically (not through the Copilot user interface) do not require Copilot licences.
Cost Optimisation Strategies
Model Tiering
Route simple tasks (intent classification, information extraction, format conversion) to GPT-4o mini. Reserve GPT-4o or reasoning models for complex multi-step analysis. This is the highest-leverage cost reduction available and typically reduces model costs by 50–80% relative to using a single high-tier model for all tasks.
Prompt Caching
Enable prompt caching on Azure OpenAI for agents with long, stable system prompts. Cache hits on the prompt prefix reduce input token costs by 50–75%. This requires structuring your prompts so the stable prefix comes before dynamic content.
Retrieval Precision
Improve retrieval precision in your vector store to reduce the number of chunks returned per query. Returning ten chunks when three are needed inflates context size and token costs. Tune chunk size, overlap, and top-k values to balance recall against context size.
Turn Minimisation
Audit your agent's tool call patterns. Agents that call the same tool multiple times in a single task, or that call tools unnecessarily due to vague system prompt instructions, are burning tokens that better tool design would eliminate.
Batching and Caching at the Application Layer
For agents handling repetitive queries (the same question asked frequently by different users), application-layer response caching can dramatically reduce model calls. Cache responses for parameterised queries where the answer is deterministic given the same inputs.
Building a TCO Model
A realistic AI agent TCO model for 2026 should include:
- Model costs: estimated tokens per task × tasks per month × model tier pricing, with caching adjustments
- Retrieval costs: search units + embedding calls per month
- Storage costs: thread storage + knowledge base storage
- Compute costs: orchestration infrastructure at expected request volume
- Fabric capacity headroom: additional CU consumption from agent queries
- Microsoft 365 Copilot licences: if agents surface through the Copilot interface
- Operational costs: engineering time for prompt maintenance, evaluation, and iteration (typically 0.5–1 FTE equivalent for a production agent in its first year)
For most organisations deploying a single well-scoped production agent at moderate volume (hundreds to low thousands of tasks per day), the monthly infrastructure cost excluding Copilot licences is typically $500–$3,000/month. Copilot licences for broad deployment can add $30,000–$100,000+/month depending on user count.
These numbers vary significantly based on context window size, tool call frequency, retrieval volume, and model tier choices. Build a bottoms-up model specific to your use case before committing to production infrastructure.
Our AI Solutions team runs agent cost modelling workshops with organisations evaluating production agent deployments on Microsoft Foundry. We help you build a realistic TCO before you commit, not after.



