FinOps-enhanced GenAI: Inform, Optimise, Operate, Innovate

Produced by Luca Giannone, supported by Derek Ho and Benjamin Num 

The Next Wave of IT Innovation: Generative AI

“Generative AI could expand to 10-12% of total information-technology hardware, software, services, advertising, and gaming expenditures by 2032 from less than 1% today, according to our analysis. Training of AI platforms (creating a machine-learning model using large datasets) will be key, driven initially by spending on servers and storage and eventually by cloud-related infrastructure” Bloomberg, 8th March 2024

Over the past six months, we have seen an increase in boardroom discussions about AI solutions. This is a clear indication that leadership recognises AI's transformative potential to enhance business operations and deliver significant value. 

However, as exciting as these possibilities are, we must not overlook a critical factor: cost.

Drive cost awareness in generative AI with FinOps

From a cost perspective, generative AI tools and applications follow the same principles as every other digital product implemented in the cloud.

Organisations will use their existing FinOps processes and tools to:

  • Ingest and normalise cost and usage data
  • Allocate and share the cost of cloud services
  • Manage spend anomalies
  • Define a budget for cloud spend and forecast digital product costs

When approaching generative AI, the main cost a FinOps team should focus on is the charge for model inference and customisation.
This expense arises from the compute resources consumed every time an LLM is given an input (or prompt) to produce an output (or completion).

Generative AI models break text down into units called tokens for processing. How text is converted into tokens depends on the tokenizer used; a token can be a character, a word, or a phrase. Generative AI services usually charge per 1,000 tokens of input (prompt) and output (response). For example:

An application developer makes the following API call to Amazon Bedrock in the US West (Oregon) Region: a request to Anthropic’s Claude model to summarise 11K tokens of input text into an output of 4K tokens.

Total cost incurred = 11K tokens/1000 x $0.008 + 4K tokens/1000 x $0.024 = $0.088 + $0.096 = $0.184
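
As a quick illustration, this per-1,000-token arithmetic can be wrapped in a small helper. This is a sketch only; the function name is ours and the rates are the on-demand Claude prices quoted in the example above:

```python
def inference_cost(input_tokens: int, output_tokens: int,
                   price_in_per_1k: float, price_out_per_1k: float) -> float:
    """Estimate on-demand inference cost from token counts and per-1,000-token rates."""
    return (input_tokens / 1000) * price_in_per_1k + (output_tokens / 1000) * price_out_per_1k

# Claude on Amazon Bedrock, US West (Oregon): $0.008 per 1K input tokens, $0.024 per 1K output tokens
print(f"${inference_cost(11_000, 4_000, 0.008, 0.024):.3f}")  # $0.184
```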

Cost-Effective GenAI: Design for Efficiency

The cost optimisation techniques supported by FinOps translate perfectly to the realm of generative AI. Just like any cloud-based service, GenAI tools incur ongoing expenses that require careful management. 

FinOps tenets like rightsizing resources, leveraging automation for cost control, and negotiating committed-use discounts with cloud providers apply equally well to GenAI. However, since inference is a major cost driver for GenAI products, solutions must be architected carefully: prompts should minimise token counts while maintaining the desired response accuracy, and the foundational model chosen should deliver maximum value for the task.

To illustrate a poorly architected solution's cost impact, consider 200 large text documents (around 10,000 tokens each) containing detailed information on a specific topic. For each document, we want to distil the information into a concise summary and generate additional content based on that summary.

Lazy option
We use the Claude 3 Sonnet model on AWS Bedrock to perform the entire task, producing a 1,000-token output per document.

Total cost = Claude ($0.003 x 10 + $0.015 x 1) = $0.045 for a single inference x 200 = $9 for 200 summaries

Well-architected option
We implement a model chaining pattern to separate the summarisation and content creation stages. A smaller Mistral 8x7B model produces a 500-token summary per document, which then feeds into Claude 3 Sonnet to produce the same 1,000-token output.

Total cost = Mistral ($0.00045 x 10 + $0.0007 x 0.5) + Claude ($0.003 x 0.5 + $0.015 x 1) = $0.021 for a single inference x 200 = $4.2 for 200 summaries

By carefully designing the solution (model selection and chaining), we cut the cost by more than half (from $9 to roughly $4.2) without compromising accuracy.
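
The comparison can be reproduced with a short script. The rates are the on-demand Bedrock prices used above; the token counts are the assumptions stated in this example:

```python
def cost(tokens_in: int, tokens_out: int, rate_in: float, rate_out: float) -> float:
    """Cost of one inference given per-1,000-token input/output rates."""
    return tokens_in / 1000 * rate_in + tokens_out / 1000 * rate_out

DOCS = 200
# Lazy option: Claude 3 Sonnet reads each 10,000-token document and writes 1,000 tokens
lazy = cost(10_000, 1_000, 0.003, 0.015) * DOCS
# Well-architected option: Mistral 8x7B produces a 500-token summary,
# which Claude 3 Sonnet then expands into the same 1,000-token output
chained = (cost(10_000, 500, 0.00045, 0.0007) + cost(500, 1_000, 0.003, 0.015)) * DOCS

print(f"lazy=${lazy:.2f}, chained=${chained:.2f}, saving={1 - chained / lazy:.0%}")
# lazy=$9.00, chained=$4.27, saving=53%
```

Without intermediate rounding, the chained total comes to about $4.27, a saving of just over 50% versus $9.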

When building a chaining model for your AI implementation, using the right LLM at each stage is key to keeping costs low. For example, if you need to pre-scan and post-scan each user interaction, each stage may require a separate LLM.

A poorly architected chaining model can incur LLM charges that are 100 times higher than necessary.

Here are a few ideas for a well-architected AI pattern; a minimal routing sketch follows the list:

  • Use smaller, task-specific models for pre-scanning (intent) and post-scanning (response generation). This leverages their strengths and avoids paying for unnecessary capabilities.
  • Hard-code common responses (confirmations, refusals, etc.) to eliminate LLM usage entirely.
  • Pre-compute responses for limited user inputs (e.g. category selections) and store them for retrieval, minimising real-time LLM calls.
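
Below is a minimal sketch of this layered routing. The model wrappers are placeholders rather than real APIs; what matters is the escalation order, with the expensive model called only as a last resort:

```python
# Placeholder model wrappers: in practice these would call your deployed models.
def classify_intent(text: str) -> str:
    """A small, task-specific classifier (cheap to run)."""
    return "greeting" if "hello" in text.lower() else "order_status"

def small_llm(text: str) -> str:
    """A cheap, specialised model for routine lookups."""
    return f"[small-model answer to: {text}]"

def large_llm(text: str) -> str:
    """An expensive, general-purpose model, used only when genuinely needed."""
    return f"[large-model answer to: {text}]"

# Hard-coded responses for common intents: no LLM call, zero inference cost
HARDCODED = {"greeting": "Hello! How can I help?", "refusal": "Sorry, I can't help with that."}
# Pre-computed answers for constrained inputs such as category selections
PRECOMPUTED = {"availability:books": "Books are in stock and ship within 2 working days."}

def route(user_input: str) -> str:
    intent = classify_intent(user_input)
    if intent in HARDCODED:
        return HARDCODED[intent]                  # free: nothing is generated
    key = f"{intent}:{user_input.strip().lower()}"
    if key in PRECOMPUTED:
        return PRECOMPUTED[key]                   # free: retrieved, not generated
    if intent in ("availability", "order_status"):
        return small_llm(user_input)              # cheap specialised model
    return large_llm(user_input)                  # expensive model as a last resort

print(route("Hello"))               # answered without any LLM call
print(route("Where is order 42?"))  # answered by the cheap specialised model
```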


Let us look at another scenario: 

An e-commerce chatbot is designed to handle customer inquiries about product availability and order status. 

Poorly Architected Approach

  • Single powerful LLM: The developers use one powerful LLM for both pre-scanning and post-scanning ($0.06 per 1,000 tokens). This LLM processes 100 tokens per user interaction (including the user's question and the system's response).
  • High Cost: Every user interaction, even simple ones, requires processing by this expensive LLM. This translates to a cost of $0.06 per 1,000 tokens x 100 tokens/interaction = $0.006 per interaction.

Cost Breakdown

  • Assuming the chatbot handles 10,000 interactions per day, the daily cost would be $0.006/interaction x 10,000 interactions = $60.
  • This translates to a monthly cost of $60/day x 30 days = $1800.

Well-Architected Approach

  • Specialised LLMs
  • Pre-scanning: A smaller LLM, optimised for intent recognition, analyses the user's question. It requires just 20 tokens for processing and costs $0.003 per 1,000 tokens.
  • Post-scanning: Another, even smaller LLM, retrieves product information and order details from the database and generates a concise response, requiring 50 tokens. This LLM costs $0.006 per 1,000 tokens.
  • Hard-coding: The chatbot is programmed with pre-defined responses for common inquiries, eliminating LLM processing for these interactions altogether, at zero cost.
  • Pre-computing: For product categories with a limited number of options, pre-generated responses for availability can be stored. The chatbot retrieves the appropriate response based on the user's selection, further minimising LLM usage, again at zero cost.

Cost Breakdown

  • Pre-scanning LLM: $0.003 per 1,000 tokens x 20 tokens/interaction = $0.00006 per interaction
  • Post-scanning LLM: $0.006 per 1,000 tokens x 50 tokens/interaction = $0.0003 per interaction
  • Hard-coded and pre-computed responses: $0 cost

Assuming an even split between pre-scanning and post-scanning interactions (5,000 each per day), the total daily cost becomes:

  • Pre-scanning cost: $0.00006/interaction x 5,000 interactions = $0.3
  • Post-scanning cost: $0.0003/interaction x 5,000 interactions = $1.5
  • Hard-coded/pre-computed responses: $0 cost

Combined, the daily cost is $1.8. This translates to a monthly cost of $1.8/day x 30 days = $54.

By implementing a well-architected design, this example demonstrates a potential 97% reduction in monthly costs ($1800 vs $54) by using specialised LLMs, hard-coded responses, and pre-computed responses. This highlights the substantial financial benefits of a well-architected chaining model for your AI chatbot.
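
A back-of-the-envelope check of these figures, using the per-1,000-token rates and interaction volumes assumed in this scenario:

```python
def per_interaction(rate_per_1k: float, tokens: int) -> float:
    """Cost of one interaction at a given per-1,000-token rate."""
    return rate_per_1k * tokens / 1000

DAILY_INTERACTIONS = 10_000
# Poorly architected: one powerful LLM ($0.06 per 1K tokens) handles every interaction
poor_daily = per_interaction(0.06, 100) * DAILY_INTERACTIONS
# Well-architected: 5,000 interactions hit the pre-scanning LLM and 5,000 the post-scanning LLM,
# while hard-coded and pre-computed responses cost nothing
good_daily = per_interaction(0.003, 20) * 5_000 + per_interaction(0.006, 50) * 5_000

print(f"monthly: ${poor_daily * 30:,.0f} vs ${good_daily * 30:,.0f} "
      f"({1 - good_daily / poor_daily:.0%} reduction)")
# monthly: $1,800 vs $54 (97% reduction)
```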

A further approach to enhancing performance and minimising the total cost of ownership (TCO) for GenAI products is Retrieval-Augmented Generation (RAG). RAG optimises the workload for foundational models by pre-selecting relevant information. Instead of requiring the expensive generative model to process information from scratch, RAG efficiently reduces the volume of text the model needs to handle. 

This is achieved by first retrieving pertinent information from a knowledge base or corpus, and then augmenting the generative model with this pre-selected, relevant context only. By avoiding the need to process irrelevant information, RAG leads to fewer tokens being processed by the costly model, subsequently lowering inference costs while maintaining high-quality outputs.
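
A minimal sketch of the retrieve-then-augment flow; here `vector_store` and `invoke_model` are hypothetical stand-ins for whichever retrieval index and foundation-model client you use. The key point is that only the top-k retrieved passages reach the expensive model:

```python
def answer_with_rag(question: str, vector_store, invoke_model, k: int = 3) -> str:
    """Retrieve a small amount of relevant context, then call the generative model once."""
    # 1. Retrieve only the k most relevant passages from the domain corpus
    passages = vector_store.search(question, top_k=k)   # hypothetical retrieval API
    context = "\n\n".join(p.text for p in passages)
    # 2. Augment the prompt with that pre-selected context only,
    #    keeping the token count (and therefore the inference cost) low
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )
    return invoke_model(prompt)
```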

However, implementing RAG requires a number of additional cloud services (compute, storage, network, etc.) to be deployed, which increases the overall TCO of the implementation, as explored in more detail below.

The right model for the right application

Various models come with diverse computational needs and functionalities. Opting for a model that aligns well with the project's objectives, without paying for unnecessary capabilities, can notably reduce overall expenses.

Multiple factors may influence the choice of a generative AI foundational model for your application, for example:

  • Task type
  • Accuracy
  • Performance
  • Supported number of input/output tokens
  • Multi-language support

Considering these factors is key to maximising the business value of a generative AI application because, even within a single family of models, there is a wide range of options available, each with a significantly different associated cost.

Let’s take the AWS offering as an example. The table below displays on-demand pricing per 1,000 tokens for Anthropic models on Amazon Bedrock.

Model | Price per 1,000 input tokens | Price per 1,000 output tokens
Claude Instant | $0.0008 | $0.0024
Claude 2.0/2.1 | $0.008 | $0.024
Claude 3 Opus | $0.015 | $0.075
Claude 3 Sonnet | $0.003 | $0.015
Claude 3 Haiku | $0.00025 | $0.00125

The price difference between Claude Instant and Opus models is significant. The table below explores the reasons behind this. 

Model | Max Tokens | Languages | Use Cases
Claude Instant | 100K | English and multiple other languages | Casual dialogue, text analysis, summarisation, and document comprehension
Claude 3 Opus | 200K | English, Spanish, Japanese, and multiple other languages | Task automation, interactive coding, research review, brainstorming and hypothesis generation, advanced analysis of charts and graphs, financials and market trends, forecasting

 

The pricing data highlights the importance of selecting the appropriate model for your use case. Opting for the wrong model could result in costs that are nearly 20 times higher than necessary, significantly impacting the overall implementation expenses.

Larger and more complex models like GPT-3 or PaLM require significantly more computational resources for training and inference, leading to higher costs. Selecting a smaller, more efficient model can reduce costs if the application does not require the full capabilities of a larger model. 

For instance, Anthropic Claude 3 Sonnet, with its large 200K-token context window, excels at complex tasks like dialogue and creative content generation but costs $0.003 per 1K input tokens on Amazon Bedrock. In contrast, the simpler Amazon Titan Text Express, suitable for summarisation and basic text generation, is nearly four times cheaper at $0.0008 per 1K input tokens.

Considering this, if we want to implement a digital platform that aggregates news articles from various sources and delivers curated content to users based on their preferences and interests, Amazon Titan Text Express might be the right foundational model to optimise costs.

Another example: let us consider the use case of building an AI assistant for customer service. The OpenAI GPT-3 Davinci model, with its impressive language understanding and generation capabilities, might seem like a natural choice. However, at $0.06 per 1,000 input tokens and $0.06 per 1,000 output tokens, it could quickly become cost-prohibitive for high-volume interactions. On the other hand, the more specialised Anthropic Claude Instant, designed for conversational AI, offers a more cost-effective solution at $0.0008 per 1,000 input tokens and $0.0024 per 1,000 output tokens.   

Given the requirement for real-time, interactive responses in customer service scenarios, the Claude Instant model could potentially deliver the necessary performance at a fraction of the cost, making it the more suitable option for this particular implementation.
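
To put “a fraction of the cost” into numbers, here is a rough per-conversation comparison using the prices quoted above; the 1,000-token input and output volumes are illustrative assumptions, not measured figures:

```python
def per_conversation(rate_in: float, rate_out: float,
                     tokens_in: int = 1_000, tokens_out: int = 1_000) -> float:
    """Cost of one conversation given per-1,000-token rates and assumed token volumes."""
    return tokens_in / 1000 * rate_in + tokens_out / 1000 * rate_out

davinci = per_conversation(0.06, 0.06)             # $0.1200 per conversation
claude_instant = per_conversation(0.0008, 0.0024)  # $0.0032 per conversation
print(f"Davinci costs about {davinci / claude_instant:.0f}x as much per conversation as Claude Instant")
# Davinci costs about 38x as much per conversation as Claude Instant
```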

There are more costs to consider

Frequently, when integrating generative AI services within an organisation, it is necessary to provide domain context to a generative AI foundation model. This need arises in various scenarios where the model must produce outputs tailored to a specific domain or industry.

As explained before, a Retrieval-Augmented Generation (RAG) approach can help provide domain context to a generative AI foundation model. RAG is a technique that combines a pre-trained language model (like GPT-3 or BERT) with a retrieval system (like a search engine or knowledge base). The retrieval system is used to fetch relevant documents or passages from a corpus of domain-specific data, which can then be used to augment the context provided to the language model.

However, it is important to note that implementing such a process requires several additional cloud services at each step, for example:

  • Data storage: AWS S3, Azure Storage Account or GCP bucket
  • Data cleanup: AWS Glue, Azure ML Studio or Vertex AI
  • Vector embedding: AWS OpenSearch with k-NN, Azure Cosmos DB for PostgreSQL or Vertex AI Vector Search

To manage the TCO of the AI implementation, there are standard optimisation patterns that can be applied to these services:

  • Tiered Storage: Leverage cost-effective storage options such as lifecycle management policies in AWS S3 (see the sketch after this list), the Archive tier in Azure Blob Storage, or the Coldline storage class in GCP buckets to archive infrequently accessed data.
  • Serverless Data Processing: Explore serverless data processing services such as AWS Glue ETL jobs or Azure Data Factory pipelines to clean and prepare data efficiently, minimising resource usage.
  • Pay-per-use Vector Embedding: Utilise managed services with pay-per-use pricing, such as Amazon OpenSearch Service for vector embedding and k-nearest-neighbour search, or consider cost-effective open-source alternatives like Faiss for GPU-accelerated similarity search.
  • Automated Training Pipelines: Consider Vertex AI Pipelines in GCP or Azure Machine Learning pipelines to automate and optimise training workflows for the RAG components, potentially reducing training costs. Explore SageMaker Neo for efficient model deployment on AWS, or leverage containerisation technologies like Docker for deployment across cloud providers.
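
As an example of the tiered-storage pattern above, a lifecycle rule can move the raw RAG corpus to an archive tier after a period of inactivity. A minimal sketch using boto3; the bucket name, prefix and 90-day window are hypothetical and should match your own data and access patterns:

```python
import boto3

s3 = boto3.client("s3")

# Hypothetical bucket/prefix holding raw documents ingested for the RAG corpus
s3.put_bucket_lifecycle_configuration(
    Bucket="my-rag-corpus-bucket",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-raw-documents",
                "Status": "Enabled",
                "Filter": {"Prefix": "raw/"},
                # Transition objects that are no longer read day-to-day to a cheaper tier
                "Transitions": [{"Days": 90, "StorageClass": "GLACIER"}],
            }
        ]
    },
)
```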

Conclusion

As AI workloads continue to grow in complexity and scale, effective FinOps practices become increasingly crucial for organisations to manage their cloud costs and optimise resource utilisation. 

By adopting the right AI architectural patterns and implementing cost monitoring and optimisation strategies, organisations can strike the right balance between innovation and fiscal responsibility.

Embracing FinOps principles enables organisations to future-proof their AI investments, ensuring sustainable growth and a competitive edge in the rapidly evolving AI landscape.

 
