Dominating conversations from boardrooms to breakrooms, the Generative AI euphoria has led to overzealous investments, unchecked demands on underlying cloud infrastructure, localized use cases, and limited business value. An IBM report identifies GenAI as a significant driver of rising computing costs, which are up 89% from 2023 to 2025 and trending toward $76 billion by 2028. Despite this, 90% of CEOs are still waiting for GenAI to move past experimentation within their organizations.
Significant investments in cloud, teams, and GenAI-focused solutions have failed to yield the intended returns, compelling more than half of enterprises with active GenAI investments to stall or abandon these projects over the next three years.
Here Are Three Main Reasons:
High Total Cost of Ownership for LLMs
Enterprises leveraging pre-trained large language models (LLMs) incur inference costs that are priced separately for prompt processing and task completion, with the final cost driven by model capability, usage, complexity, and data volume. For instance, GPT-4 can offer contextually aware, detailed responses on broader topics. It is priced higher than its newer specialized version, GPT-4o, which is optimized for speed but may only offer concise output.
GPT-4 also has robust memory and can keep track of more extended conversations, whereas GPT-4o may not. Regarding pricing, the 128K-context GPT-4 model charges $0.01 per 1,000 input tokens and $0.03 per 1,000 completion tokens, so a request with a 1,000-token prompt and a 1,000-token completion totals $0.04. For the same request, GPT-4o costs $0.0025 for input and $0.01 for completion, totaling $0.0125, roughly a third of the cost. Enterprises that do not weigh LLM capabilities against business needs will end up with a higher total cost of ownership.
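To see how these per-1,000-token rates translate into a monthly bill, here is a minimal sketch that applies the prices quoted above; the 1,000/1,000 token split and the request volume are illustrative assumptions.

```python
# Per-1,000-token rates quoted above (USD); check current provider pricing before relying on them.
PRICING = {
    "gpt-4":  {"input": 0.01,   "output": 0.03},
    "gpt-4o": {"input": 0.0025, "output": 0.01},
}

def request_cost(model: str, prompt_tokens: int, completion_tokens: int) -> float:
    """Estimate the cost of a single inference request."""
    rates = PRICING[model]
    return (prompt_tokens / 1000) * rates["input"] + (completion_tokens / 1000) * rates["output"]

# Illustrative workload: 1,000-token prompt, 1,000-token completion, 100,000 requests per month.
for model in PRICING:
    per_call = request_cost(model, 1000, 1000)
    print(f"{model}: ${per_call:.4f} per call, ${per_call * 100_000:,.0f} per 100k calls")
```

At this volume, the gap between $4,000 and $1,250 a month for the same workload shows why model selection is a cost decision, not just a capability one.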
Overprovisioned Resources
Enterprises that host LLMs in their own cloud environments for enhanced data privacy invest heavily in locally hosted models and in setting up GenAI applications. This requires a scalable infrastructure that can handle a high volume of requests, store large amounts of data, and deliver high performance. Enterprises tend to provision these cloud resources around the clock to avoid model latency. While it is easier to anticipate resource requirements for a GenAI proof of concept, most tend to overestimate them while scaling up, leading to an underutilized cloud sprawl that is harder to govern or monitor and leads to overbilling. Cloud costs associated with GenAI are now twice as high as the cost of the underlying model.
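One way to surface this sprawl is to compare average utilization against a right-sizing threshold. A minimal sketch follows, assuming hypothetical instance names, utilization figures, and a 30% threshold; in practice these metrics would come from the hyperscaler's monitoring service.

```python
# A minimal sketch: flag instances whose average GPU utilization suggests overprovisioning.
# Instance names, utilization figures, and the threshold are illustrative assumptions.

avg_gpu_utilization = {          # percent, averaged over the billing period
    "inference-pool-a": 72.0,
    "inference-pool-b": 18.5,
    "embedding-workers": 9.0,
}

UNDERUTILIZED_THRESHOLD = 30.0   # below this, consider downsizing or consolidating

for instance, utilization in avg_gpu_utilization.items():
    if utilization < UNDERUTILIZED_THRESHOLD:
        print(f"{instance}: {utilization:.1f}% average GPU utilization -> candidate for right-sizing")
```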
Lack of Visibility into Cloud Pricing Models
Enterprises often treat the cloud as a cost-takeout measure and struggle to understand how cloud resources are billed, which makes it challenging to visualize cloud expenses. Even if they opt for pay-per-use, among the highest-priced cloud subscription options, or choose the right instances, costs will still trend upward as long as enterprises do not fully understand the cost structures or lack an optimal utilization strategy for GenAI deployments.
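To illustrate why the billing model matters, here is a minimal sketch comparing a pay-per-use rate against a committed-use rate at different utilization levels; the hourly rates are hypothetical, not actual hyperscaler prices.

```python
# A minimal sketch comparing pricing models for the same GPU instance.
# The hourly rates and commitment terms are illustrative assumptions.

ON_DEMAND_RATE = 4.00        # USD per hour, pay-per-use
COMMITTED_RATE = 2.40        # USD per hour, billed for all 730 hours/month under a commitment
HOURS_PER_MONTH = 730

for hours_used in (100, 300, 500, 730):
    pay_per_use = hours_used * ON_DEMAND_RATE
    committed = HOURS_PER_MONTH * COMMITTED_RATE   # paid whether used or not
    cheaper = "pay-per-use" if pay_per_use < committed else "committed"
    print(f"{hours_used:>3} h/month: pay-per-use ${pay_per_use:,.0f} vs committed ${committed:,.0f} -> {cheaper}")
```

With these made-up rates, pay-per-use wins below roughly 440 hours of monthly usage and committed capacity wins above it; the break-even point depends entirely on utilization, which is exactly what many GenAI teams never measure.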
Running GenAI in the Cloud on a Budget
While GenAI is here to stay, enterprises need a strategy that considers GenAI’s cost complexity and technical challenges while creating a roadmap for scalable, sustainable, and profitable deployment. Here is a three-pronged approach to whiteboarding this strategy, with an explicit focus on cost-benefit analysis and stringent FinOps principles:
- Optimize Cloud Instances: Enterprises should start by understanding workload needs to assess computational and memory requirements. Hyperscalers offer AI-optimized instances that help reduce the cost of running AI workloads in the cloud. Monitoring and adjusting instance sizes can help avoid overprovisioning and ensure significant cost savings. For instance, spot instances are a viable option for workloads designed to be fault-tolerant, since they can be up to 90% cheaper than on-demand instances. Establishing clear scaling policies can help accurately align resources with demand, using predictive models to anticipate demand spikes. Implementing budget cut-offs prevents scaling beyond a specified cost threshold (see the budget cut-off sketch after this list).
- Adopt FinOps at Scale: Fostering a cost-conscious culture, which includes regular reviews, audits, and optimization efforts, is vital to ensure ongoing efficiency and effectiveness. GenAI is best run in cloud environments with aggressive FinOps principles. A good start would be a commercial FinOps tool, or the one provided by the hyperscaler, combined with FinOps practices such as right-sizing, cost tracking and auditing, and repurposing available resources. Enhanced cost visibility will facilitate better decision-making, and allocating costs with accurate tagging will provide deeper insights into cost distribution (see the tagging sketch after this list).
- Leverage AI for Cost Control: AI-optimized hardware, such as custom chips like Google Tensor Processing Units (TPUs) and AWS Inferentia, delivers superior computational power tailored for AI tasks without cost overruns. Model efficiency can guide the selection of appropriate function-specific LLMs, further optimizing performance and resource usage. Enterprises can adopt advanced compression techniques to reduce memory and computational needs during training and inference. Serverless AI platforms built on Function-as-a-Service (FaaS) shift infrastructure management to the hyperscaler, so AI workloads run flexibly without provisioning traditional servers. To safeguard data, enterprises should adopt robust regulatory compliance and data governance measures that keep data within local jurisdictions while protecting user privacy through anonymization and encryption.
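The budget cut-off referenced above can be expressed as a simple guard in the scaling logic. Here is a minimal sketch; the per-replica cost, budget, demand forecast, and sizing rule are illustrative assumptions, not any hyperscaler's API.

```python
# A minimal sketch of a scaling policy with a budget cut-off.
# Costs, budget, and the demand forecast are illustrative assumptions.

HOURLY_COST_PER_REPLICA = 3.50      # USD per inference replica per hour
MONTHLY_BUDGET = 20_000             # hard cost threshold
HOURS_REMAINING_IN_MONTH = 200

def target_replicas(forecast_requests_per_min: int, spent_so_far: float) -> int:
    """Scale to forecast demand, but never beyond what the remaining budget allows."""
    desired = max(1, forecast_requests_per_min // 60)      # assume one replica per ~60 req/min
    remaining_budget = MONTHLY_BUDGET - spent_so_far
    affordable = int(remaining_budget / (HOURLY_COST_PER_REPLICA * HOURS_REMAINING_IN_MONTH))
    return min(desired, max(1, affordable))

# Demand calls for 15 replicas, but the remaining budget caps the fleet at 7.
print(target_replicas(forecast_requests_per_min=900, spent_so_far=14_500))
```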
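Tag-based cost allocation can be as simple as grouping billing line items by the tags applied to each resource. A minimal sketch with hypothetical billing records; real data would come from the hyperscaler's billing export or a FinOps tool.

```python
# A minimal sketch of tag-based cost allocation. The billing records are hypothetical.
from collections import defaultdict

billing_records = [
    {"resource": "gpu-node-1",  "cost": 812.40, "tags": {"team": "search", "env": "prod"}},
    {"resource": "gpu-node-2",  "cost": 640.10, "tags": {"team": "support-bot", "env": "prod"}},
    {"resource": "vector-db",   "cost": 230.00, "tags": {"team": "search", "env": "prod"}},
    {"resource": "dev-sandbox", "cost": 95.75,  "tags": {"team": "support-bot", "env": "dev"}},
]

cost_by_team = defaultdict(float)
for record in billing_records:
    cost_by_team[record["tags"].get("team", "untagged")] += record["cost"]

for team, cost in sorted(cost_by_team.items(), key=lambda kv: -kv[1]):
    print(f"{team}: ${cost:,.2f}")
```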
Platform-Based Approach to GenAI Cost Governance
Understanding cloud cost structures, the computational requirements of GenAI workloads, and the underlying model is key to cost efficiency. Not every workload requires an LLM that handles high cognitive loads, which means not every GenAI application will need the same resources or instances in the cloud or cost the same. Understanding these dynamics and factoring them into the GenAI strategy can help enterprises make informed decisions about budgets, investments, and operating costs.
One way could be a hybrid-cloud environment for GenAI deployments, where enterprises orchestrate a best-fit capability stack that meets their budgetary requirements. A dashboard that collates resource utilization, cost centers, and consumption patterns can make tracking GenAI cloud investments easier. This can help enterprises set cost thresholds or policy-based pricing for LLM deployments across hyperscalers. A cheaper LLM can handle routine use, while tasks with a higher cognitive load are routed to a more expensive LLM, all assembled from different hyperscalers via a GenAI platform that uses quantization to create smaller models for low-cost virtual machines, as sketched below.
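A minimal sketch of such policy-based routing follows; the model names, per-call costs, and complexity scoring are illustrative assumptions, not a specific platform's API.

```python
# A minimal sketch of policy-based LLM routing: a cheap (e.g., quantized) model for routine
# prompts, an expensive model only when the estimated cognitive load justifies it.
# Model names, per-call costs, and the scoring heuristic are illustrative assumptions.

MODELS = {
    "small-quantized": {"cost_per_call": 0.002, "max_complexity": 0.5},
    "frontier":        {"cost_per_call": 0.040, "max_complexity": 1.0},
}

def estimate_complexity(prompt: str) -> float:
    """Crude heuristic: longer prompts and analytical keywords score higher."""
    keywords = ("analyze", "compare", "derive", "multi-step", "reason")
    score = min(len(prompt) / 2000, 0.4) + 0.6 * any(k in prompt.lower() for k in keywords)
    return min(score, 1.0)

def route(prompt: str) -> str:
    """Pick the cheapest model whose policy allows the estimated complexity."""
    complexity = estimate_complexity(prompt)
    for name, policy in MODELS.items():            # insertion order: cheapest first
        if complexity <= policy["max_complexity"]:
            return name
    return "frontier"

print(route("What are today's support hours?"))                                # -> small-quantized
print(route("Analyze these contracts and derive a multi-step risk summary."))  # -> frontier
```

The routing policy, not the model catalog, is where the cost threshold lives, which keeps the expensive model as the exception rather than the default.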
A cost-efficient GenAI deployment depends on granular visibility into cloud spending and the technical needs of an AI workload — all tied to the business value delivered on the ground.