TL;DR:
LLMs are incredibly powerful, but their strengths and weaknesses vary, and no single model can deliver optimal performance across all tasks while remaining cost-effective. LLM routing is crucial for selecting the most suitable model for each specific use case, exploiting the diversity of the LLM landscape while managing cost and performance. In collaboration with Prof. Kurt Keutzer's lab at UC Berkeley, we are open-sourcing RouterBench to provide a standardized benchmark, in the hope that it can do for routing what ImageNet did for computer vision. arXiv | GitHub | Huggingface dataset
1 Introduction

Large Language Models (LLMs) have demonstrated remarkable capabilities across a wide range of tasks, with new models being introduced at an unprecedented pace. However, no single model can achieve optimal performance for all applications while remaining cost-effective. Just as in the early days of cloud computing, AI developers face a tradeoff between capability and affordability.
High-end models like GPT-4 may be the Lamborghinis of the AI world, but their high cost and latency make them unsuitable for many applications. Techniques like prompt engineering, quantization, and systems optimization offer ways to increase performance or reduce cost with a cheaper model. But as the LLM landscape grows more crowded by the day, striking an optimal balance between capability and cost just keeps getting more complex.
2 The Rise of the Router

Enter the router. In the simplest terms, LLM routing involves dynamically selecting the optimal model for each prompt based on the nature of the input. Instead of committing to a single model and pushing for incremental gains on that model alone (through fine-tuning, prompt engineering, or RAG), the router looks at the entire LLM ecosystem and identifies the best tool for each prompt.
In fact, if you had a crystal ball that could see into the future and tell you which model would give you the best output for every prompt, you'd have what we call the "Oracle Router," outperforming every individual model on both cost and accuracy.
The advantages are clear. By matching queries to models based on their strengths and weaknesses, the router benefits from the full diversity of today's LLM landscape while sidestepping the intricate infrastructural complexities of any single model. The result is an AI system that can achieve an optimal blend of performance, cost, and speed by dynamically allocating resources in real-time.
3 Introducing RouterBench

But for all the promise of LLM routing, the field has been missing a standardized framework for evaluating the efficacy of different routers, analogous to the role ImageNet played in the advancement of computer vision. ImageNet provided a large, diverse dataset that set a clear direction for the field, enabling large-scale experiments and consistent measurement of progress. This benchmark catalyzed the deep learning revolution, exemplified by the breakthrough of AlexNet, which underscored the potency of deep learning in computer vision.
That's where RouterBench comes in. Developed in collaboration with Prof. Kurt Keutzer's lab at UC Berkeley, RouterBench aspires to become the standard benchmark for LLM routing. It is the first comprehensive benchmark suite to systematically assess the strengths and weaknesses of LLM routing systems. By providing a clear, standardized direction for routing research through a massive and diverse dataset, RouterBench aims to accelerate progress in the field, paving the way for innovations like cost-effective, dynamic model selection.
3.1 The Dataset

At the core of RouterBench is a massive dataset comprising over 405,000 inference outcomes across eight representative task domains, from commonsense reasoning to math word problems to code synthesis. Crucially, the samples come complete with pre-generated outputs and quality metrics from leading open-source and commercial models, enabling apples-to-apples comparisons of routing strategies without the overhead of live model queries (a sketch of loading this data follows the dataset list below).
For the initial release, we selected eight representative datasets spanning multiple task categories:
Commonsense Reasoning:
- HellaSwag (Zellers et al., 2019): A dataset that challenges models to complete realistic, commonsense scenarios, requiring an understanding of everyday activities.
- WinoGrande (Sakaguchi et al., 2021): An improved version of the Winograd Schema Challenge with a focus on large-scale commonsense reasoning.
- ARC Challenge (Clark et al., 2018): A dataset of difficult multiple-choice science questions aimed at testing advanced reasoning and commonsense knowledge.

Knowledge-based Language Understanding:
- MMLU (Hendrycks et al., 2021): A massive multitask dataset for evaluating language models across a wide range of subjects, from professional domains to high-school topics.

Conversation:
- MT-Bench (Zheng et al., 2023b): A multi-turn benchmark for evaluating models in conversational contexts, focusing on quality and coherence in dialogue.

Math:
- GSM8K (Cobbe et al., 2021): A dataset of grade-school-level math problems designed to test the mathematical reasoning abilities of AI systems.

Coding:
- MBPP (Austin et al., 2021): A collection of Python programming problems, paired with unit tests, to evaluate the code-synthesis capabilities of models.

Additionally, we gathered 4,000 prompts from different news sources and generated questions with GPT-4 to evaluate routers on retrieval-augmented generation tasks. We partnered with folks at Berkeley's BAIR to help create a representative and unbiased dataset for evaluating routers.
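Because every prompt ships with pre-generated responses, quality scores, and costs, evaluating a router amounts to a table lookup rather than live inference. Below is a minimal sketch of loading such data and summarizing each model; the file name and column layout are illustrative assumptions, so consult the Huggingface dataset card for the actual schema.

```python
# Minimal sketch: load RouterBench-style results and summarize each model.
# The file name and the "<model>|quality" / "<model>|cost" column naming
# scheme are assumptions for illustration; see the dataset card for the
# real schema.
import pandas as pd

df = pd.read_parquet("routerbench.parquet")  # hypothetical local export

models = ["gpt-4", "gpt-3.5-turbo", "mixtral-8x7b"]  # illustrative subset
for m in models:
    avg_quality = df[f"{m}|quality"].mean()  # assumed: score in [0, 1]
    avg_cost = df[f"{m}|cost"].mean()        # assumed: $ per query
    print(f"{m}: quality {avg_quality:.3f} at ${avg_cost:.5f}/query")
```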
3.2 Predictive and Non-Predictive Routers

In our study, we evaluate the following routers:
Predictive Routers
Predictive routers are fascinating because they don't generate LLM outputs beforehand. Instead, they predict which LLM will handle a given prompt best based on performance scores calculated for each model. These scores take into account both the predicted quality of the output and the cost of using the model. Essentially, a predictive router forecasts the most cost-efficient model to use for a particular task.
Two main types of predictive routers are highlighted in the study:
- KNN Router: This router uses the k-nearest-neighbors algorithm. It looks at similar examples in the training data and selects the LLM that performed best on those examples (sketched below).
- MLP Router: Here, a multi-layer perceptron, a type of neural network, predicts performance. It's trained to evaluate the performance of different LLMs on various prompts and then uses this knowledge to route new prompts to the most suitable LLM.
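As a concrete illustration, here is a minimal sketch of the KNN idea: embed the incoming prompt, find the k most similar training prompts, and score each model by its observed quality on those neighbors minus a cost penalty. The data layout and the tradeoff weight `lam` are assumptions rather than the paper's exact formulation; an MLP router would simply swap the neighbor lookup for a learned regressor.

```python
# Minimal sketch of a KNN predictive router (illustrative assumptions).
import numpy as np
from sklearn.neighbors import NearestNeighbors

class KNNRouter:
    def __init__(self, embeddings, quality, cost, k=10):
        # embeddings: (n_prompts, d) embeddings of the training prompts
        # quality:    (n_prompts, n_models) observed quality of each model
        # cost:       (n_models,) average per-query cost of each model
        self.knn = NearestNeighbors(n_neighbors=k).fit(embeddings)
        self.quality = quality
        self.cost = np.asarray(cost)

    def route(self, query_embedding, lam=0.0):
        # Average each model's quality over the k nearest training prompts,
        # then penalize expensive models; lam=0 routes on quality alone.
        _, idx = self.knn.kneighbors(query_embedding.reshape(1, -1))
        predicted_quality = self.quality[idx[0]].mean(axis=0)
        return int(np.argmax(predicted_quality - lam * self.cost))
```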
Non-Predictive Routers

Non-predictive routers, on the other hand, generate outputs from multiple LLMs and then decide which output to use. They operate in two main ways:
- Cascading Router: This router processes a request through a series of LLMs in increasing order of cost and quality. If an early, cheaper model's output meets the quality threshold, the process stops. Otherwise, it continues to the next model (a sketch of this cascade appears later in the post).
- Overgenerate-and-Rerank: This method generates potential answers from all LLMs, evaluates them, and chooses the best one. It's not cost-effective due to the multiple inferences involved, but it serves as an upper bound on routing performance.

4 Comparing Routers

In order to compare the routers, we propose a metric and a couple of "baseline routers".
4.1 The AIQ Criterion

Given the non-parametric nature of routers, comparing them is not straightforward: how do you balance the tradeoff between cost and quality? We propose the metric Average Improvement in Quality (AIQ) to address this problem. In essence, AIQ averages the performance of the routing system across different cost levels by taking the area under the routing curve. The larger the area, the better the router.
The AIQ criterion allows developers to compare different routers objectively. A higher AIQ means the router is doing a better job of balancing cost and quality, providing better value for the money spent.
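To make this concrete, here is a small sketch of AIQ as the area under a router's cost-quality curve, normalized by the cost range so routers swept over different budgets remain comparable. The paper's exact normalization may differ; treat this as illustrative.

```python
import numpy as np

def aiq(costs, qualities):
    """Sketch of Average Improvement in Quality: area under a router's
    cost-quality curve, normalized by the cost range. The paper's exact
    normalization may differ."""
    order = np.argsort(costs)
    c = np.asarray(costs, dtype=float)[order]
    q = np.asarray(qualities, dtype=float)[order]
    return np.trapz(q, c) / (c[-1] - c[0])

# Example: one router swept across three cost budgets.
print(aiq(costs=[0.1, 0.5, 1.0], qualities=[0.60, 0.72, 0.80]))
```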
4.2 Baseline Routers

The Zero Router
The Zero Router provides a conceptual baseline for evaluating the efficiency of actual routers. It represents the performance that could be achieved by choosing among a group of LLMs based solely on their expected cost and quality, without accounting for the specific prompt.
To visualize this, imagine plotting all the LLMs on a graph that measures cost versus quality. The Zero Router would be a line connecting the most cost-effective points along that curve, ensuring that for any given cost, you're getting the best possible quality. It's like having an idealized route that always picks the best model for the cost you're willing to pay, creating a baseline for router performance. This is equivalent to taking the non-decreasing convex hull of the performance curves of all the models.
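For the geometrically inclined, here is a sketch of that construction: sort the models by cost, take the upper convex hull of their (cost, quality) points, and keep the non-decreasing portion. Any point on a segment between two surviving models is achievable by randomly mixing those two models. The data values below are made up for illustration.

```python
def cross(o, a, b):
    # Cross product of vectors o->a and o->b (positive = left turn).
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def zero_router_frontier(costs, qualities):
    """Non-decreasing convex hull of per-model (cost, quality) points,
    i.e. the curve the Zero Router baseline interpolates along."""
    pts = sorted(zip(costs, qualities))  # sort by cost
    hull = []
    for p in pts:
        # Pop while the last turn is not clockwise (keeps the upper hull).
        while len(hull) >= 2 and cross(hull[-2], hull[-1], p) >= 0:
            hull.pop()
        hull.append(p)
    # Keep only the non-decreasing part of the hull (drop points after
    # the quality peak: they cost more for lower quality).
    peak = max(range(len(hull)), key=lambda i: hull[i][1])
    return hull[:peak + 1]

# Example: five hypothetical models; the dominated one is excluded.
print(zero_router_frontier(
    costs=[0.2, 0.5, 1.0, 3.0, 10.0],
    qualities=[0.55, 0.70, 0.68, 0.78, 0.85],
))
```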
The Oracle Router: The Ideal Router
The Oracle Router is the theoretical best router you could have. If you had a crystal ball that could see into the future and tell you which model would give you the best output for every prompt, you'd have an Oracle Router. In our RouterBench framework, the Oracle Router is used as the gold standard to showcase the maximum potential of efficient routing.
The Zero and Oracle Routers are critical in benchmarking because they set the lower and upper bounds of routing system performance:
- The Zero Router sets a fundamental baseline. If a router can't outperform this conceptual model, it's not adding value. Essentially, the Zero Router is the "do-nothing" strategy that performs as well as the best static mix of the available LLMs, without any prompt-level routing logic.
- The Oracle Router, by contrast, is an unattainable ideal. No real-world router can match its performance because it requires omniscience. It is nonetheless useful as a benchmark showing the maximum possible efficiency of a routing system: the closer a router's performance is to the Oracle, the better it is.

In practice, while we can't have an Oracle Router, predictive routers aim to approximate its decision-making using algorithms and data-driven insights. The goal is to get as close to Oracle-level performance as possible.
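With RouterBench's pre-generated outputs, the Oracle is straightforward to compute offline: for every prompt, pick the model with the highest recorded quality, breaking ties toward the cheaper one. A minimal sketch, assuming quality and cost matrices with one row per prompt and one column per model:

```python
import numpy as np

def oracle_route(quality, cost):
    """Per-prompt oracle: highest observed quality, cheapest among ties.

    quality: (n_prompts, n_models) recorded quality scores
    cost:    (n_prompts, n_models) recorded per-query costs
    Returns the chosen model index for each prompt.
    """
    best = quality.max(axis=1, keepdims=True)
    # Hide every model that falls short of the best score, then take
    # the cheapest of the remaining (tied) models.
    tie_cost = np.where(quality == best, cost, np.inf)
    return tie_cost.argmin(axis=1)

# Usage: idx = oracle_route(q, c)
# rows = np.arange(len(idx))
# print(q[rows, idx].mean(), c[rows, idx].mean())  # oracle quality & cost
```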
5 High Value of Routing

Our experiments, detailed in the paper, have been illuminating. Even relatively naive routing schemes delivered dramatic improvements over single-model baselines.
Notably, in the retrieval-augmented generation (RAG) use case, our study showed that all the routers significantly outperformed the Zero Router baseline. Moreover, the routers exhibited an impressive ability to discern time-sensitive features in user queries, intelligently routing them to the most appropriate models: online models for time-sensitive queries and GPT-4/GPT-3.5 for others. These findings underscore the immense potential of model routing in enhancing LLM applications within "Compound AI Systems" that integrate retrieval and generation capabilities. This is demonstrated in the figure below (reproduced directly from Figure 6 in the paper).
Figure: Cost vs. performance for five models and four routers on the RAG dataset (Figure 6 of the paper)

Moreover, results from cascading routers also suggest that a well-orchestrated sequence of LLMs can achieve superior performance at a lower cumulative cost than a single high-end model. By intelligently navigating through a series of models and stopping at the one that meets a predetermined quality threshold, cascading routers can optimize the balance between cost and output quality.
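The control flow of such a cascade fits in a few lines. In the sketch below, `models` is a list of callables ordered from cheapest to most expensive and `judge` is an assumed quality estimator (in practice a small scoring model or an LLM-as-judge) returning a score in [0, 1]; both are illustrative assumptions, not RouterBench APIs.

```python
def cascade(prompt, models, judge, threshold=0.8):
    """Run models cheapest-first, stopping at the first answer the
    judge scores above the threshold. Sketch only: `models` and
    `judge` are assumed callables."""
    answer = None
    for model in models:
        answer = model(prompt)
        if judge(prompt, answer) >= threshold:
            return answer  # good enough: stop paying for bigger models
    return answer  # fell through: return the most expensive model's answer
```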
Figure: Cost vs. performance of cascading routers

The Complexity of Building Effective Routers
However, making an effective router is difficult. On the majority of tasks in RouterBench, basic routing systems do no better than the Zero Router! This highlights the intricate and nuanced nature of router development, where a deep understanding of the LLM landscape is necessary to match queries with the most suitable models effectively.
RouterBench is a diverse and rigorous benchmark for routing, offering a fresh challenge to the AI community. Much like ImageNet did for the computer vision community, RouterBench offers a standardized suite of tasks and metrics to stimulate innovation and facilitate collaboration in the pursuit of advanced routing techniques.
Partner with Martian
If the problem of LLM routing sounds interesting to you, you should work with Martian! We are actively developing sophisticated routing methods that have attracted top-notch talent, including PhDs and professors who have chosen to leave their academic pursuits to contribute to our mission.
If you're part of a company that's using LLMs, reach out to us. We'd love to collaborate with awesome folks in the AI space. To explore how our technology can streamline and optimize your LLM deployment strategy, we invite you to reach out at meet.withmartian.com/enterprise.