The push for AI-powered applications is accelerating around the world and shows little sign of slowing down. According to data from IBM, 42% of companies with more than 1,000 employees actively use AI in their businesses, and another 40% test and experiment with it.
As AI adoption gains pace, with platforms like OpenAI’s GPT-4o and Google’s Gemini setting new performance benchmarks, organizations are discovering new applications for these technologies that can deliver better results, even as they face the challenge of implementing the technology at scale. More and more enterprise workflows incorporate calls to these AI models, and their use is increasing dramatically. Do the use cases justify the increased spending on the latest models?
Embracing AI also means embracing the use of AI models and paying AI inference costs, at a time when many organizations are in cost-cutting mode. With continued economic uncertainty, rising operating costs, and increasing pressure from stakeholders to generate return on investment, companies are looking for ways to optimize their budgets and reduce unnecessary expenses. The rising costs of AI infrastructure can be a cause of tension as organizations want to remain competitive and harness the power of AI, while balancing these investments with financial prudence.
To further complicate matters, AI agents, which McKinsey says are the next frontier of GenAI and are expected to form the next wave of applications, will dramatically increase the use of AI models, as they rely on them for reflection and ongoing planning steps. Instead of singular API calls to underlying models like those from OpenAI, agent architectures can make dozens of calls, quickly racking up costs. How can businesses address these rising AI costs while powering the applications they need?
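To illustrate that multiplier effect, here is a minimal sketch of an agent-style loop in Python. The call_llm placeholder, the example task, and the stopping condition are all hypothetical, not a real agent framework; the point is simply that a single user request can fan out into many model calls.

```python
# Minimal sketch of an agent-style loop (hypothetical placeholders throughout).
# One user request triggers repeated plan -> act -> reflect model calls, so
# inference costs scale with loop iterations, not with user requests.

llm_calls = 0  # running count of inference calls for this one request

def call_llm(prompt: str) -> str:
    """Placeholder for a real model API call (e.g. a chat completion)."""
    global llm_calls
    llm_calls += 1
    return "model output (placeholder)"

def run_agent(task: str, max_steps: int = 10) -> str:
    plan = call_llm(f"Plan the steps needed for: {task}")  # 1 call
    result = ""
    for step in range(max_steps):
        result = call_llm(f"Execute step {step} of plan: {plan}")      # 1 call per step
        reflection = call_llm(f"Is the task done? Result: {result}")   # 1 more per step
        if "done" in reflection.lower():
            break
    return result

run_agent("summarize last quarter's support tickets")
print(f"LLM calls for one request: {llm_calls}")  # 21 here; easily dozens in practice
```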
Understanding the cost of AI at scale
The rapid deployment of AI is driving higher costs on multiple fronts. First, organizations are spending on AI inference, the process of using a trained model to make predictions or decisions based on the data provided. They often rely on APIs from leading providers like OpenAI, Anthropic, or cloud service providers like AWS or Google, and pay based on usage. Alternatively, some organizations run their own inference, purchasing or renting GPUs on which they deploy open source models such as Meta’s Llama.
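As a rough illustration of how usage-based pricing adds up, the back-of-the-envelope calculation below uses hypothetical per-token prices and traffic figures; real rates vary by provider and model.

```python
# Back-of-the-envelope inference cost estimate (all figures assumed for illustration).
PRICE_PER_1M_INPUT_TOKENS = 2.50    # USD, hypothetical
PRICE_PER_1M_OUTPUT_TOKENS = 10.00  # USD, hypothetical

requests_per_day = 50_000
input_tokens_per_request = 1_500    # prompt plus retrieved context
output_tokens_per_request = 400

daily_cost = requests_per_day * (
    input_tokens_per_request * PRICE_PER_1M_INPUT_TOKENS
    + output_tokens_per_request * PRICE_PER_1M_OUTPUT_TOKENS
) / 1_000_000

print(f"Estimated daily inference cost: ${daily_cost:,.2f}")        # ~$387.50
print(f"Estimated monthly inference cost: ${daily_cost * 30:,.2f}") # ~$11,625
```

Note how agent architectures multiply the requests_per_day figure: if each request fans out into twenty model calls, the same arithmetic yields a twentyfold bill.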
Second, in many cases organizations want to customize their AI models by fine-tuning them. This can be an expensive process that involves preparing training data sets and requires computing resources for the training itself.
Finally, building AI applications requires additional components, such as vector databases, which augment inference by retrieving relevant content from designated knowledge bases, thus improving the accuracy and relevance of AI models’ responses.
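To make that retrieval step concrete, here is a minimal sketch of the pattern, assuming a hypothetical embed function standing in for a real embedding model and an in-memory list standing in for a dedicated vector database.

```python
import numpy as np

# Minimal retrieval sketch; embed() is a hypothetical stand-in for a real
# embedding model, and the in-memory list stands in for a vector database.

def embed(text: str) -> np.ndarray:
    """Placeholder: hash-seeded random unit vector, so the example is self-contained."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(384)
    return v / np.linalg.norm(v)

documents = [
    "Refunds are processed within 5 business days.",
    "Premium support is available 24/7 for enterprise customers.",
    "Passwords must be reset every 90 days.",
]
doc_vectors = np.stack([embed(d) for d in documents])

def retrieve(query: str, k: int = 2) -> list[str]:
    """Return the k documents most similar to the query (cosine similarity)."""
    q = embed(query)
    scores = doc_vectors @ q  # vectors are unit-length, so dot product = cosine
    top = np.argsort(scores)[::-1][:k]
    return [documents[i] for i in top]

context = retrieve("How long do refunds take?")
prompt = f"Answer using this context: {context}\n\nQuestion: How long do refunds take?"
# prompt would then be sent to the model, grounding its answer in retrieved facts
```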
By examining the root causes and drivers of their AI costs, such as inference, training or fine-tuning, and additional components such as vector databases, companies can minimize those costs and improve the performance of their AI applications.
Optimizing efficiency using semantic caching
Semantic caching is a highly effective technique that organizations are implementing to manage the cost of AI inference and increase the speed and responsiveness of their applications. It refers to storing and reusing the results of previous calculations based on their semantic meaning.
In other words, instead of relying on new AI calculations for new queries, a semantic cache can check a database for queries with similar meanings that have been formulated before, thus saving costs. This approach helps reduce redundant calculations and improves efficiency in applications such as inference or search.
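As a minimal sketch of the idea, the example below stores each answered query alongside its embedding and serves a new query from the cache when its cosine similarity to a stored query clears a threshold. The embed and call_llm functions and the 0.85 threshold are illustrative assumptions; in production the lookup would typically run in a vector database rather than an in-memory list (Redis, for instance, ships semantic caching tooling).

```python
import numpy as np

# Minimal semantic cache sketch; embed() and call_llm() are hypothetical
# placeholders for a real embedding model and a real model API.

def embed(text: str) -> np.ndarray:
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(384)
    return v / np.linalg.norm(v)

def call_llm(prompt: str) -> str:
    return f"(expensive model answer for: {prompt})"

cache: list[tuple[np.ndarray, str]] = []  # (query embedding, cached answer)
SIMILARITY_THRESHOLD = 0.85  # assumed value; tune per application

def answer(query: str) -> str:
    q = embed(query)
    # Check for a semantically similar previous query before calling the model.
    for cached_vec, cached_answer in cache:
        if float(cached_vec @ q) >= SIMILARITY_THRESHOLD:  # cosine similarity
            return cached_answer  # cache hit: no inference cost incurred
    result = call_llm(query)  # cache miss: pay for one inference call
    cache.append((q, result))
    return result

answer("What is your refund policy?")  # miss: calls the model
answer("What is your refund policy?")  # repeated query: served from cache
```

The threshold is the key design choice: set it too low and users receive stale or mismatched answers; set it too high and the cache rarely hits, eroding the savings.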
In one particular study, researchers showed that up to 31% of queries to AI applications can be repetitive. Every unnecessary inference call adds avoidable cost, but by implementing a semantic cache, organizations can significantly reduce these calls, cutting them by 30% to 80%. This technique is crucial for building scalable, responsive chatbots and generative AI applications: it not only optimizes costs but also accelerates response times, helping companies achieve more with less investment.
Balancing performance and cost
Organizations need to optimize their technology and operational strategies to be able to deploy cutting-edge AI applications without incurring unsustainable infrastructure costs. This can help them achieve that crucial balance between performance and cost. Techniques like semantic caching can play a vital role in this.
For companies struggling to scale AI applications efficiently and cost-effectively, learning how to manage this well can become a key differentiator in the market. The key to addressing the rising cost of generative AI applications and maximizing their value could lie in a company’s AI inference strategy. As generative AI systems become increasingly complex, each LLM call must be as efficient as possible. That way, customers get the information they need faster and businesses minimize their cost footprint.
This article was produced as part of TechRadar Pro’s Expert Insights channel, where we feature the best and brightest minds in today’s tech industry. The views expressed here are those of the author and are not necessarily those of TechRadar Pro or Future plc. If you are interested in contributing, find out more here.