New AI techniques promise huge cost savings and improved performance for enterprises

Recent research has unveiled two promising approaches that could dramatically reduce the costs of running large language models (LLMs) while simultaneously improving their performance on complex reasoning tasks. These innovations come at a critical time as enterprises increasingly deploy AI solutions but struggle with computational expenses.

Chain of draft: Less is more

Researchers at Zoom Communications have developed a technique called “chain of draft” (CoD) that enables LLMs to solve problems with minimal text. According to their paper published on arXiv, the method uses as little as 7.6% of the text produced by standard chain-of-thought prompting while maintaining or even improving accuracy.

CoD draws inspiration from human problem-solving, where people typically jot down only essential information rather than articulating every detail. This approach dramatically reduces computational overhead without sacrificing performance.
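
To make the mechanism concrete, the sketch below contrasts a standard chain-of-thought prompt with a chain-of-draft-style prompt. The instruction wording, the example question, and the chat-message format are illustrative assumptions for this article, not the paper's verbatim prompts.

    # Minimal sketch of chain-of-draft (CoD) vs. chain-of-thought (CoT) prompting.
    # The instruction wording and message format are illustrative assumptions,
    # not the exact prompts from the Zoom paper.

    COT_SYSTEM = (
        "Think step by step to answer the question. "
        "Explain your full reasoning, then give the final answer after '####'."
    )

    COD_SYSTEM = (
        "Think step by step, but keep only a minimal draft for each step, "
        "at most five words per step. Give the final answer after '####'."
    )

    QUESTION = "A team scored 3, 7, and 12 points in three quarters. How many in total?"


    def build_messages(system_prompt: str, question: str) -> list[dict]:
        """Package the system instruction and user question for a chat-style LLM API."""
        return [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": question},
        ]


    if __name__ == "__main__":
        # With CoT the model typically writes several sentences of reasoning; with CoD
        # it should emit terse drafts such as "3+7=10; 10+12=22 #### 22", which is
        # where the token (and cost) savings come from.
        for name, system in [("CoT", COT_SYSTEM), ("CoD", COD_SYSTEM)]:
            print(name, build_messages(system, QUESTION))

Because only the prompt changes, the savings come entirely from shorter model outputs rather than from any modification to the model itself.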

In testing with Claude 3.5 Sonnet on sports-related questions, CoD reduced the average output from 189.4 tokens to just 14.3 tokens—a 92.4% reduction—while simultaneously improving accuracy from 93.2% to 97.3%.

The financial implications are significant. According to AI researcher Ajith Vallath Prabhakar, “For an enterprise processing 1 million reasoning queries monthly, CoD could cut costs from $3,800 to $760, saving over $3,000 per month.”
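
Those dollar figures depend on pricing and prompt-length assumptions the article does not spell out. The back-of-envelope sketch below shows how such an estimate is typically assembled; the per-token prices, input length, and query volume are hypothetical placeholders, so its output will not exactly reproduce Prabhakar's numbers.

    # Back-of-envelope monthly cost estimate for reasoning queries.
    # Prices, input length, and query volume are hypothetical assumptions used only
    # to illustrate how output-token savings translate into dollar savings.

    QUERIES_PER_MONTH = 1_000_000
    INPUT_TOKENS_PER_QUERY = 200        # assumed prompt length
    PRICE_PER_M_INPUT_TOKENS = 3.00     # assumed $ per 1M input tokens
    PRICE_PER_M_OUTPUT_TOKENS = 15.00   # assumed $ per 1M output tokens

    # Average output lengths reported for the sports-QA experiment above.
    OUTPUT_TOKENS = {"CoT": 189.4, "CoD": 14.3}


    def monthly_cost(output_tokens_per_query: float) -> float:
        """Total monthly cost = input-token cost + output-token cost."""
        input_cost = QUERIES_PER_MONTH * INPUT_TOKENS_PER_QUERY * PRICE_PER_M_INPUT_TOKENS / 1e6
        output_cost = QUERIES_PER_MONTH * output_tokens_per_query * PRICE_PER_M_OUTPUT_TOKENS / 1e6
        return input_cost + output_cost


    if __name__ == "__main__":
        for method, tokens in OUTPUT_TOKENS.items():
            print(f"{method}: ${monthly_cost(tokens):,.0f} per month")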

Chain of experts: Sequential efficiency

Another approach, called “chain of experts” (CoE), addresses efficiency by activating specialized parts of a model sequentially rather than in parallel. This structure allows experts to communicate intermediate results and build on each other’s work.

Conventional mixture-of-experts (MoE) models already improve efficiency by selecting only certain experts for each input, but CoE takes this further. By restructuring how information flows through the model, CoE achieves better results with similar computational overhead.
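
The contrast between parallel MoE routing and sequential CoE routing can be sketched in a few lines. The toy experts, router, and layer sizes below are illustrative assumptions rather than the researchers' implementation; the point is only that CoE re-routes at each iteration and feeds every pass the previous pass's output.

    import numpy as np

    # Toy contrast between parallel MoE routing and sequential chain-of-experts (CoE).
    # Sizes, the gating scheme, and the expert functions are illustrative assumptions.
    # Note the two layers below invoke the same number of experts in total (4), so
    # they represent roughly equal compute budgets.

    rng = np.random.default_rng(0)
    NUM_EXPERTS, HIDDEN = 8, 16

    # Each "expert" is a small random linear map; the router scores experts per input.
    experts = [rng.standard_normal((HIDDEN, HIDDEN)) * 0.1 for _ in range(NUM_EXPERTS)]
    router = rng.standard_normal((HIDDEN, NUM_EXPERTS)) * 0.1


    def route_top_k(x: np.ndarray, k: int) -> np.ndarray:
        """Pick the k experts with the highest router score for this input."""
        return np.argsort(x @ router)[-k:]


    def moe_layer(x: np.ndarray, k: int = 4) -> np.ndarray:
        """Conventional MoE: the selected experts all read the same input in parallel."""
        selected = route_top_k(x, k)
        return x + sum(np.tanh(experts[i] @ x) for i in selected) / k


    def coe_layer(x: np.ndarray, k: int = 2, iterations: int = 2) -> np.ndarray:
        """Chain of experts: fewer experts per step, but routing repeats and each
        iteration reads the previous iteration's output, so experts build on each other."""
        h = x
        for _ in range(iterations):
            selected = route_top_k(h, k)
            h = h + sum(np.tanh(experts[i] @ h) for i in selected) / k
        return h


    if __name__ == "__main__":
        token_state = rng.standard_normal(HIDDEN)
        print("MoE output norm:", np.linalg.norm(moe_layer(token_state)))
        print("CoE output norm:", np.linalg.norm(coe_layer(token_state)))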

Researchers found that CoE models outperform both dense LLMs and MoEs when operating with equal computational resources. For example, a CoE with 64 experts, 4 routed experts, and 2 inference iterations outperformed an MoE with 64 experts and 8 routed experts on mathematical benchmarks.

CoE also reduces memory requirements. One CoE configuration matched the performance of a larger MoE while using 17.6% less memory.

Simple implementation for immediate impact

What makes both techniques particularly valuable for enterprises is their simplicity of implementation. Both CoD and CoE can be deployed with existing models without expensive retraining or architectural changes.

These approaches could prove especially valuable for latency-sensitive applications like real-time customer support, mobile AI, educational tools, and financial services, where even small delays significantly impact user experience.
