AI Solutions

    The Real Cost of Building AI Agents in 2026: Token Spend, Capacity, and Infrastructure

    8 June 2026
    ·
    9–10 min read read
    ·
    Nick de Vrye, CTO
    A visual cost breakdown diagram for AI agents showing token consumption, infrastructure layers, and Microsoft Azure platform fees contributing to total cost of ownership.
    A visual cost breakdown diagram for AI agents showing token consumption, infrastructure layers, and Microsoft Azure platform fees contributing to total cost of ownership.

    In Short: AI Agents Cost More Than Most Organisations Budget For

    The total cost of operating a production AI agent is typically two to five times higher than the raw model API cost that most organisations estimate during planning. Token costs are real and significant, but they represent only one dimension of a cost structure that also includes infrastructure, orchestration, storage, retrieval, observability, and ongoing maintenance.

    This guide breaks down the full cost structure of production AI agents on Microsoft Azure and Foundry, so you can build a realistic total cost of ownership model before committing to production infrastructure.

    The Four Cost Dimensions of Production AI Agents

    Production AI agent costs fall into four categories:

    • Token costs - model input and output tokens consumed per agent turn
    • Infrastructure costs - compute, storage, retrieval, and orchestration
    • Microsoft platform costs - Foundry, Fabric, and Microsoft 365 components
    • Operational costs - monitoring, maintenance, prompt engineering, and iteration

    Most cost estimates focus only on the first. All four need to be included in a credible TCO model.

    Token Costs: The Largest Variable

    Token costs are driven by three factors: model tier, context window size, and the number of agent turns per task.

    Model Tier

    Azure OpenAI pricing varies significantly across models. As of mid-2026 approximate pricing (USD per million tokens):

    • GPT-4o: $2.50 input / $10.00 output
    • GPT-4o mini: $0.15 input / $0.60 output
    • o3: $10.00 input / $40.00 output
    • o1: $15.00 input / $60.00 output

    The difference between using GPT-4o for every turn versus using GPT-4o mini for routing and GPT-4o only for complex reasoning is often a 5–10x reduction in model costs. Model tiering is the single highest-leverage cost optimisation available.

    Context Window Size

    Each agent turn passes the full conversation history (or a summarised version of it) plus all tool results plus the system prompt to the model. Context accumulates quickly in multi-turn conversations. An agent handling a task that requires ten tool calls and responses could be passing 20,000–50,000 tokens of context per turn by the end of the workflow.

    Prompt caching (available on GPT-4o and o1 in Azure) reduces the input cost for repeated prompt prefixes by approximately 50–75%. For agents with long, stable system prompts, caching should always be enabled.

    Turns Per Task

    An agent that requires three tool calls to complete a task consumes roughly three times the tokens of an agent that requires one. Planning agent architectures to minimise unnecessary tool calls - through better tool design, better system prompt structure, and appropriate task decomposition - directly reduces token costs.

    Infrastructure Costs

    Vector Store and Retrieval

    Agents that use retrieval-augmented generation (RAG) - querying a knowledge base or document store to retrieve relevant context - incur Azure AI Search costs in addition to model costs:

    • Azure AI Search: Charged per search unit per hour (S1 tier: approximately $0.10/hour = ~$73/month per search unit) plus per-query execution costs
    • Embedding model calls: Converting documents and queries to vector embeddings uses an additional model (typically text-embedding-3-small or text-embedding-3-large) at a separate token cost

    For agents with large knowledge bases queried frequently, retrieval infrastructure costs can match or exceed model token costs.

    Thread Storage and Memory

    Persistent thread storage in Foundry Agent Service is billed per GB stored. For agents handling hundreds of concurrent users with long conversation histories, thread storage can become a material line item. Implement thread pruning policies - deleting or summarising old threads above a defined age threshold - to control this cost.

    Code Interpreter and File Processing

    The code interpreter tool (for data analysis, calculations, and code execution) is billed per session. Agents that use code interpreter for every task will accumulate session costs quickly. Use code interpreter only where it provides genuine capability advantage over a custom function tool.

    Compute (for Custom Orchestration)

    Agents using Azure Container Apps, Azure Functions, or AKS for custom orchestration logic incur compute costs independent of the model costs. These are typically modest for low-to-medium volume agents but can become significant at high request volumes.

    Microsoft Platform-Specific Costs

    Azure AI Foundry

    Foundry Agent Service itself does not carry a per-agent or per-deployment fee - costs are driven by underlying resource consumption (model tokens, search, storage). The Foundry hub and project infrastructure has a small base cost from associated Azure resources (Key Vault, Storage Account, Container Registry), typically under $50/month for a standard configuration.

    Microsoft Fabric

    Agents that query Fabric data consume Fabric capacity (F-SKU compute units). A Fabric Data Agent or a Foundry agent using Fabric function tools will consume CUs from your provisioned F-SKU capacity. For organisations already running Fabric for analytics workloads, agent query consumption can be absorbed by existing capacity if headroom exists, or may require capacity scaling if agents run at high volume.

    Microsoft Copilot and Microsoft 365

    Agents surfaced through Microsoft 365 Copilot (including Rayfin-powered natural language interfaces) require Microsoft 365 Copilot licences for end users at $30/user/month. For organisation-wide deployment, this is often the largest single line item in the AI stack cost - larger than all infrastructure costs combined.

    Agents accessed programmatically (not through the Copilot user interface) do not require Copilot licences.

    Cost Optimisation Strategies

    Model Tiering

    Route simple tasks (intent classification, information extraction, format conversion) to GPT-4o mini. Reserve GPT-4o or reasoning models for complex multi-step analysis. This is the highest-leverage cost reduction available and typically reduces model costs by 50–80% relative to using a single high-tier model for all tasks.

    Prompt Caching

    Enable prompt caching on Azure OpenAI for agents with long, stable system prompts. Cache hits on the prompt prefix reduce input token costs by 50–75%. This requires structuring your prompts so the stable prefix comes before dynamic content.

    Retrieval Precision

    Improve retrieval precision in your vector store to reduce the number of chunks returned per query. Returning ten chunks when three are needed inflates context size and token costs. Tune chunk size, overlap, and top-k values to balance recall against context size.

    Turn Minimisation

    Audit your agent's tool call patterns. Agents that call the same tool multiple times in a single task, or that call tools unnecessarily due to vague system prompt instructions, are burning tokens that better tool design would eliminate.

    Batching and Caching at the Application Layer

    For agents handling repetitive queries (the same question asked frequently by different users), application-layer response caching can dramatically reduce model calls. Cache responses for parameterised queries where the answer is deterministic given the same inputs.

    Building a TCO Model

    A realistic AI agent TCO model for 2026 should include:

    • Model costs: estimated tokens per task × tasks per month × model tier pricing, with caching adjustments
    • Retrieval costs: search units + embedding calls per month
    • Storage costs: thread storage + knowledge base storage
    • Compute costs: orchestration infrastructure at expected request volume
    • Fabric capacity headroom: additional CU consumption from agent queries
    • Microsoft 365 Copilot licences: if agents surface through the Copilot interface
    • Operational costs: engineering time for prompt maintenance, evaluation, and iteration (typically 0.5–1 FTE equivalent for a production agent in its first year)

    For most organisations deploying a single well-scoped production agent at moderate volume (hundreds to low thousands of tasks per day), the monthly infrastructure cost excluding Copilot licences is typically $500–$3,000/month. Copilot licences for broad deployment can add $30,000–$100,000+/month depending on user count.

    These numbers vary significantly based on context window size, tool call frequency, retrieval volume, and model tier choices. Build a bottoms-up model specific to your use case before committing to production infrastructure.

    Our AI Solutions team runs agent cost modelling workshops with organisations evaluating production agent deployments on Microsoft Foundry. We help you build a realistic TCO before you commit, not after.

    FAQ

    Frequently Asked Questions

    Quick answers to your questions about AI Solutions.

    The total cost of a production AI agent depends on four dimensions: token costs (model input/output per agent turn), infrastructure costs (vector store, thread storage, compute), Microsoft platform costs (Foundry, Fabric, Copilot licences), and operational costs (ongoing engineering). For a single well-scoped production agent at moderate volume, monthly infrastructure costs excluding Microsoft 365 Copilot licences typically range from $500 to $3,000/month. Copilot licences for broad deployment can add substantially more.

    Model tiering is the practice of using different model tiers for different task types within an agent workflow. Simple tasks like intent classification, routing, and information extraction use cheaper models (GPT-4o mini at $0.15/$0.60 per million tokens). Complex reasoning tasks use more capable models (GPT-4o at $2.50/$10.00, or o3 at $10.00/$40.00). This typically reduces model costs by 50–80% compared to using a single high-tier model for all tasks.

    Prompt caching (available on GPT-4o and o1 in Azure) reduces input token costs for repeated prompt prefixes by approximately 50–75%. For agents with long system prompts that are stable across many calls, structuring the prompt so the stable content comes first (before dynamic user input or tool results) allows the cache to apply to the largest portion of the prompt. On agents with high system prompt volumes, this can reduce model costs by 30–50%.

    Foundry agents that query Microsoft Fabric data through function tools consume Fabric capacity (F-SKU compute units) for each query. For organisations already running Fabric for analytics, agent consumption can often be absorbed by existing capacity if headroom exists. High-volume agent deployments that query Fabric frequently may require capacity scaling. The Fabric capacity consumption should be modelled separately from Azure OpenAI token costs when building an agent TCO.

    The most common reason is underestimating context window growth in multi-turn workflows. Each agent turn passes the full conversation history plus all tool results to the model - context accumulates fast. A task requiring ten tool calls could be passing 30,000–50,000 tokens of context per turn by the end of the workflow. Add retrieval infrastructure costs, thread storage, compute, and operational overhead, and the total is typically two to five times the raw model API cost that teams estimate during planning.

    Microsoft 365 Copilot licences are required when agents surface through the Copilot interface in Teams, Outlook, or other Microsoft 365 applications. As of 2026, Microsoft 365 Copilot is priced at approximately $30/user/month. For organisation-wide deployments of hundreds or thousands of users, this often becomes the largest single line item in the total AI stack cost - larger than all infrastructure costs combined. Agents accessed programmatically (not through the Copilot interface) do not require Copilot licences.

    Build a bottoms-up model covering: estimated tokens per task (input context size + expected output length) × tasks per month × model tier pricing with caching adjustments; retrieval costs (search units + embedding calls); storage costs (thread + knowledge base); compute costs at expected request volume; Fabric capacity headroom; and Microsoft 365 Copilot licences if required. Run the model against three scenarios - conservative (low volume, cheap model), base case, and aggressive (high volume, premium model) - to understand the cost range before committing to infrastructure.

    Want a Realistic AI Agent Cost Model Before You Build?

    We help organisations scope, size, and cost AI agent deployments on Microsoft Foundry and Azure before committing to production infrastructure. Get a clear view of what your specific use case will cost.

    Get in Touch
    Solv.

    Experts in Power BI, Microsoft Fabric & AI Automation Consulting. Empowering businesses through data and AI excellence.

    Navigate

    Office

    1 Crane Ave, Greenshields Park, Gqeberha, South Africa

    info@solv-systems.com

    © 2026 Solv Systems. All rights reserved.