Understanding Request Routing in Generative AI

Authored by Airwalk Reply Senior Consultant Derek Ho

In the rapidly evolving landscape of artificial intelligence, optimising resource allocation and improving user experience are paramount. Directing all user messages to a single large language model (LLM) can be inefficient and costly, particularly as the complexity and volume of queries increase. A single LLM, while versatile, often requires substantial computational power and financial investment, making it an impractical solution for handling all types of user interactions. By intelligently routing requests to the most appropriate AI models, organisations can significantly enhance the efficiency and cost-effectiveness of their AI systems.

Furthermore, while multimodal LLMs are designed to handle various tasks across different domains, they may not always deliver the best results in specialised areas. Selecting LLMs that are specifically tailored to particular fields can yield superior outcomes compared to relying solely on a general-purpose, multimodal model. For instance, a financial-specific LLM can offer more accurate and nuanced insights in finance-related queries than a broader, multimodal counterpart. Additionally, some requests can be addressed more effectively without the use of LLMs at all. Simpler, rule-based systems or traditional algorithms may provide faster and more precise solutions for certain tasks, reducing the unnecessary deployment of complex AI resources and enhancing overall system performance.

Example Request Routing 

User intentions

  • Intention 1. Requests that cannot, or should not, be handled by an LLM, such as asking about the weather or simple user greetings.
  • Intention 2. Summarising an article.
  • Intention 3. Sensitive data processing, such as asking questions about an internal document.
  • Intention 4. General user messages.
  • Intention 5. Requests that can only be handled by internal systems, such as retrieving account balances.

There are numerous ways to utilise routing in AI systems. Requests can be routed based on data classification, such as directing sensitive data to a self-hosted LLM while routing public data to a public LLM. Routing can also be organised by department, with sales-related requests going to an LLM managed by the sales department and complaint-related requests to one managed by customer service. Additionally, requests can be routed based on user tier, such as directing customer queries to a customer-specific LLM and staff queries to a staff-specific LLM. This targeted routing approach ensures that each request is handled by the most suitable AI or non-AI resource, optimising performance and efficiency across the board.
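
To make this concrete, here is a minimal dispatch-table sketch in Python that mirrors the intentions listed above; all handler names and canned responses are illustrative stubs, not a prescribed design.

```python
from typing import Callable

# Stub back-ends: in a real system each would wrap a model endpoint,
# an internal API, or a hard-coded response.
def call_general_llm(message: str) -> str:
    return f"[general LLM] {message}"

def call_self_hosted_llm(message: str) -> str:
    return f"[self-hosted LLM] {message}"

def call_weather_api(message: str) -> str:
    return "[weather API] Sunny, 21°C"

def call_core_banking_api(message: str) -> str:
    return "[core banking] Balance: £1,234.56"

# Dispatch table: one entry per intention from the list above.
ROUTES: dict[str, Callable[[str], str]] = {
    "greeting": lambda m: "Hello! How can I help you today?",  # Intention 1
    "weather": call_weather_api,                               # Intention 1
    "summarise": call_general_llm,                             # Intention 2
    "internal_document": call_self_hosted_llm,                 # Intention 3
    "account_balance": call_core_banking_api,                  # Intention 5
}

def route(intent: str, message: str) -> str:
    # Intention 4 (general messages) falls back to the general LLM.
    return ROUTES.get(intent, call_general_llm)(message)

print(route("weather", "Will it rain in London today?"))
```

The same table works whether the routing key is an intent label, a data classification, a department, or a user tier; only the keys and handlers change.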

Implementation Techniques

LLM-Based Routing

This approach begins by using a lightweight LLM to analyse the input message. This initial step keeps processing quick and resource-efficient, determining the nature and complexity of the request without invoking a heavyweight model unnecessarily.

Once the message is analysed, it is routed to the most suitable LLM or service. For example, if the request involves video generation, the system can direct it to a specialised model like Stable Video Diffusion, which is optimised for high-quality video outputs. Alternatively, some requests might be best fulfilled by calling an external API directly. If a user inquires about stock prices, the system can route the request to a stock API, ensuring accurate and up-to-date information without the need for complex language processing.
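
A minimal sketch of this classify-then-dispatch pattern is shown below, using the OpenAI Python SDK purely as an example gateway; the model names, intent labels, and downstream wrappers are illustrative assumptions rather than a prescribed stack.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

INTENTS = ["video_generation", "stock_price", "general"]

def classify_intent(message: str) -> str:
    """Use a lightweight model to label the request before dispatch."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative lightweight classifier model
        temperature=0,
        messages=[
            {"role": "system",
             "content": "Classify the user message into exactly one of: "
                        + ", ".join(INTENTS) + ". Reply with the label only."},
            {"role": "user", "content": message},
        ],
    )
    label = response.choices[0].message.content.strip()
    return label if label in INTENTS else "general"

def submit_to_video_model(message: str) -> str:
    # Hypothetical wrapper around a video model such as Stable Video Diffusion.
    return "[video model] job submitted"

def query_stock_api(message: str) -> str:
    # Hypothetical wrapper around an external market-data API; no LLM needed.
    return "[stock API] AAPL: 195.23"

def route(message: str) -> str:
    intent = classify_intent(message)
    if intent == "video_generation":
        return submit_to_video_model(message)
    if intent == "stock_price":
        return query_stock_api(message)
    # Everything else goes to a general-purpose model.
    return client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": message}],
    ).choices[0].message.content
```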

Additionally, for requests involving personally identifiable information (PII) or highly restricted data, routing to a self-hosted LLM ensures data privacy and security. This approach allows organisations to leverage advanced AI capabilities while maintaining control over sensitive information, avoiding the potential risks associated with processing such data through public LLMs. This strategic routing system not only optimises performance but also enhances data security and operational efficiency.
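
One way to implement this safeguard is a pre-flight sensitivity check before any public model is contacted. The sketch below uses simple regular expressions as a crude stand-in for a real PII detector, and the endpoint URLs are hypothetical.

```python
import re

# Crude PII patterns: a stand-in for a proper detector such as a
# dedicated classification model or a DLP service.
PII_PATTERNS = [
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),    # US SSN-style identifier
    re.compile(r"\b\d{16}\b"),               # bare 16-digit card number
    re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),  # email address
]

def contains_pii(message: str) -> bool:
    return any(p.search(message) for p in PII_PATTERNS)

def choose_endpoint(message: str) -> str:
    # Hypothetical endpoints: sensitive traffic never leaves the network.
    if contains_pii(message):
        return "https://llm.internal.example.com/v1"   # self-hosted LLM
    return "https://api.public-llm.example.com/v1"     # public LLM

print(choose_endpoint("My email is jane@example.com"))  # -> internal endpoint
```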

Vector-Based Routing

Vector-based routing involves converting each input message into a multidimensional vector, capturing the semantic meaning of the text. Various methods can achieve this transformation, including embeddings from models like BERT, Word2Vec, or FastText. These techniques encode the input message into a numerical format that reflects its contextual meaning, making it suitable for further analysis.

[Text to Vector conversion]
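
As an illustration, the embedding step might look like the following sketch using the sentence-transformers library; the model name shown is one common choice, not a requirement.

```python
from sentence_transformers import SentenceTransformer

# Any sentence-embedding model works here; all-MiniLM-L6-v2 is a small,
# widely used default that produces 384-dimensional vectors.
model = SentenceTransformer("all-MiniLM-L6-v2")

message = "Can you summarise this article for me?"
vector = model.encode(message)  # numpy array of shape (384,)
print(vector.shape)
```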

Once the message is represented as a vector, the system can utilise the concept of vector distance to determine the intent of the request. By comparing the vector of the input message with predefined intention vectors, the system can identify the purpose behind the user query. This comparison helps route the request to the most suitable service or model.

[Distance Comparison]

Techniques such as cosine similarity and Euclidean distance are commonly used for these comparisons. Cosine similarity measures the cosine of the angle between two vectors, providing a metric for how similar the vectors (and thus the messages) are. Euclidean distance, on the other hand, calculates the straight-line distance between two points in the vector space. By leveraging these geometric properties, the system can take the most appropriate action for the identified intent, handling diverse requests precisely and efficiently.

Notes: Cosine similarity measures the cosine of the angle (θ) between two vectors A and B: cos θ = (A · B) / (‖A‖‖B‖). As θ approaches 0, cos θ approaches 1, so a higher cosine similarity indicates that the two vectors point in more similar directions, meaning the messages are semantically closer.

[Intention Comparison]
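
Putting the pieces together, the sketch below classifies an incoming message by comparing its embedding against one example phrase per intention using cosine similarity; the intent phrases and threshold are illustrative.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

# One representative phrase per intention; a production system would
# average several example phrases per intent into a centroid vector.
INTENT_EXAMPLES = {
    "greeting": "Hello, how are you?",
    "summarise": "Please summarise this article.",
    "account_balance": "What is my current account balance?",
}
intent_vectors = {k: model.encode(v) for k, v in INTENT_EXAMPLES.items()}

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def classify(message: str, threshold: float = 0.5) -> str:
    v = model.encode(message)
    intent, score = max(
        ((k, cosine_similarity(v, iv)) for k, iv in intent_vectors.items()),
        key=lambda pair: pair[1],
    )
    # Below the (illustrative) threshold, treat it as a general message.
    return intent if score >= threshold else "general"

print(classify("Hi there!"))           # -> greeting
print(classify("Show me my balance"))  # -> account_balance
```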

Choosing an Approach

LLM-based routing is praised for its simplicity and ease of implementation. It improves routing accuracy significantly, although it incurs additional cost because of the extra LLM request per message. It is a straightforward solution that can be quickly integrated into existing systems, making it an attractive option where accuracy is paramount.

Vector-based routing requires a slightly more complex setup than LLM-based routing but remains straightforward to implement. It provides faster response times, which can be crucial for time-sensitive applications; however, it tends to be less accurate than its LLM-based counterpart, making it suitable where speed matters more than precision.

Hybrid Routing combines the strengths of both approaches. It uses Vector-based routing as the base layer, supplemented by an LLM escalation mechanism. This hybrid model offers a great balance between cost, performance, and accuracy, making it the preferred choice for systems that require both quick responses and high reliability.
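
A minimal sketch of the hybrid pattern, assuming the same embedding setup as above: the cheap vector layer answers confident cases directly and escalates ambiguous ones to an LLM classifier. The confidence threshold and the escalation stub are illustrative.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
intent_vectors = {
    "greeting": model.encode("Hello, how are you?"),
    "summarise": model.encode("Please summarise this article."),
}

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

HIGH_CONFIDENCE = 0.75  # illustrative; would be tuned on real traffic

def llm_classify(message: str) -> str:
    # Escalation path: in practice this calls the LLM-based classifier
    # from the earlier sketch; a fixed label keeps the example runnable.
    return "general"

def hybrid_classify(message: str) -> str:
    v = model.encode(message)
    intent, score = max(
        ((k, cosine_similarity(v, iv)) for k, iv in intent_vectors.items()),
        key=lambda pair: pair[1],
    )
    if score >= HIGH_CONFIDENCE:
        return intent               # fast path: no LLM call at all
    return llm_classify(message)    # ambiguous: escalate to the LLM router
```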

Final Thoughts

Implementing AI request routing brings a multitude of advantages, optimising both performance and resource utilisation. Here are the key benefits of this approach:

  • Cost Saving: Avoid unnecessary AI API calls by handling many requests with hard-coded responses or traditional APIs (e.g., greeting messages, real-time data requests).
  • Cost Saving: Simple tasks may not require a powerful LLM; request routing can automatically select a more cost-effective LLM.
  • Faster Response: Reducing unnecessary AI API calls frees up capacity for meaningful AI API calls, which is crucial given the rate limitations of many public LLMs.
  • Better Response Quality: Using tailored LLMs for specific tasks yields better results than relying on a single multimodal LLM.
  • More Secure: Employing self-hosted LLMs ensures sensitive data remains within the organisation, enhancing data security.
  • Scalability: Efficiently managing diverse requests allows the system to scale better, accommodating increased traffic without a proportional cost increase.
  • Improved User Experience: Faster, more accurate responses improve user satisfaction and engagement.
