Understanding Request Routing in Generative AI

Authored by Airwalk Reply Senior Consultant Derek Ho

In the rapidly evolving landscape of artificial intelligence, optimising resource allocation and improving user experience are paramount. Directing all user messages to a single large language model (LLM) can be inefficient and costly, particularly as the complexity and volume of queries increase. A single LLM, while versatile, often requires substantial computational power and financial investment, making it an impractical solution for handling all types of user interactions. By intelligently routing requests to the most appropriate AI models, organisations can significantly enhance the efficiency and cost-effectiveness of their AI systems.

Furthermore, while multimodal LLMs are designed to handle various tasks across different domains, they may not always deliver the best results in specialised areas. Selecting LLMs that are specifically tailored to particular fields can yield superior outcomes compared to relying solely on a general-purpose, multimodal model. For instance, a financial-specific LLM can offer more accurate and nuanced insights in finance-related queries than a broader, multimodal counterpart. Additionally, some requests can be addressed more effectively without the use of LLMs at all. Simpler, rule-based systems or traditional algorithms may provide faster and more precise solutions for certain tasks, reducing the unnecessary deployment of complex AI resources and enhancing overall system performance.

Example Request Routing 

User intentions

  • Intention 1. Requests that cannot, or should not, be handled by an LLM, such as asking about the weather or simple user greetings.
  • Intention 2. Summarising an article.
  • Intention 3. Sensitive data processing, such as asking questions about an internal document.
  • Intention 4. General user messages.
  • Intention 5. Requests that can only be handled by internal systems, such as retrieving account balances.

There are numerous ways to utilise routing in AI systems. Requests can be routed based on data classification, such as directing sensitive data to a self-hosted LLM while routing public data to a public LLM. Routing can also be organised by department, with sales-related requests going to an LLM managed by the sales department and complaint-related requests to one managed by customer service. Additionally, requests can be routed based on user tier, such as directing customer queries to a customer-specific LLM and staff queries to a staff-specific LLM. This targeted routing approach ensures that each request is handled by the most suitable AI or non-AI resource, optimising performance and efficiency across the board.
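
To make this concrete, here is a minimal dispatch-table sketch in Python that mirrors the intentions listed above; all handler names and canned responses are illustrative stubs, not a prescribed design.

```python
from typing import Callable

# Stub back-ends: in a real system each would wrap a model endpoint,
# an internal API, or a hard-coded response.
def call_general_llm(message: str) -> str:
    return f"[general LLM] {message}"

def call_self_hosted_llm(message: str) -> str:
    return f"[self-hosted LLM] {message}"

def call_weather_api(message: str) -> str:
    return "[weather API] Sunny, 21°C"

def call_core_banking_api(message: str) -> str:
    return "[core banking] Balance: £1,234.56"

# Dispatch table: one entry per intention from the list above.
ROUTES: dict[str, Callable[[str], str]] = {
    "greeting": lambda m: "Hello! How can I help you today?",  # Intention 1
    "weather": call_weather_api,                               # Intention 1
    "summarise": call_general_llm,                             # Intention 2
    "internal_document": call_self_hosted_llm,                 # Intention 3
    "account_balance": call_core_banking_api,                  # Intention 5
}

def route(intent: str, message: str) -> str:
    # Intention 4 (general messages) falls back to the general LLM.
    return ROUTES.get(intent, call_general_llm)(message)

print(route("weather", "Will it rain in London today?"))
```

The same table works whether the routing key is an intent label, a data classification, a department, or a user tier; only the keys and handlers change.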

Implementation Techniques

LLM-Based Routing

This approach begins by using a lightweight LLM to analyse the input message. This initial step keeps processing quick and resource-efficient, determining the nature and complexity of the request without invoking a heavyweight model unnecessarily.

Once the message is analysed, it is routed to the most suitable LLM or service. For example, if the request involves video generation, the system can direct it to a specialised model like Stable Video Diffusion, which is optimised for high-quality video outputs. Alternatively, some requests might be best fulfilled by calling an external API directly. If a user inquires about stock prices, the system can route the request to a stock API, ensuring accurate and up-to-date information without the need for complex language processing.
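
A minimal sketch of this classify-then-dispatch pattern is shown below, using the OpenAI Python SDK purely as an example gateway; the model names, intent labels, and downstream wrappers are illustrative assumptions rather than a prescribed stack.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

INTENTS = ["video_generation", "stock_price", "general"]

def classify_intent(message: str) -> str:
    """Use a lightweight model to label the request before dispatch."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative lightweight classifier model
        temperature=0,
        messages=[
            {"role": "system",
             "content": "Classify the user message into exactly one of: "
                        + ", ".join(INTENTS) + ". Reply with the label only."},
            {"role": "user", "content": message},
        ],
    )
    label = response.choices[0].message.content.strip()
    return label if label in INTENTS else "general"

def submit_to_video_model(message: str) -> str:
    # Hypothetical wrapper around a video model such as Stable Video Diffusion.
    return "[video model] job submitted"

def query_stock_api(message: str) -> str:
    # Hypothetical wrapper around an external market-data API; no LLM needed.
    return "[stock API] AAPL: 195.23"

def route(message: str) -> str:
    intent = classify_intent(message)
    if intent == "video_generation":
        return submit_to_video_model(message)
    if intent == "stock_price":
        return query_stock_api(message)
    # Everything else goes to a general-purpose model.
    return client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": message}],
    ).choices[0].message.content
```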

Additionally, for requests involving personally identifiable information (PII) or highly restricted data, routing to a self-hosted LLM ensures data privacy and security. This approach allows organisations to leverage advanced AI capabilities while maintaining control over sensitive information, avoiding the potential risks associated with processing such data through public LLMs. This strategic routing system not only optimises performance but also enhances data security and operational efficiency.
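
One way to implement this safeguard is a pre-flight sensitivity check before any public model is contacted. The sketch below uses simple regular expressions as a crude stand-in for a real PII detector, and the endpoint URLs are hypothetical.

```python
import re

# Crude PII patterns: a stand-in for a proper detector such as a
# dedicated classification model or a DLP service.
PII_PATTERNS = [
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),    # US SSN-style identifier
    re.compile(r"\b\d{16}\b"),               # bare 16-digit card number
    re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),  # email address
]

def contains_pii(message: str) -> bool:
    return any(p.search(message) for p in PII_PATTERNS)

def choose_endpoint(message: str) -> str:
    # Hypothetical endpoints: sensitive traffic never leaves the network.
    if contains_pii(message):
        return "https://llm.internal.example.com/v1"   # self-hosted LLM
    return "https://api.public-llm.example.com/v1"     # public LLM

print(choose_endpoint("My email is jane@example.com"))  # -> internal endpoint
```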

Vector-Based Routing

Vector-based routing involves converting each input message into a multidimensional vector, capturing the semantic meaning of the text. Various methods can achieve this transformation, including embeddings from models like BERT, Word2Vec, or FastText. These techniques encode the input message into a numerical format that reflects its contextual meaning, making it suitable for further analysis.

[Text to Vector conversion]
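
As an illustration, the embedding step might look like the following sketch using the sentence-transformers library; the model name shown is one common choice, not a requirement.

```python
from sentence_transformers import SentenceTransformer

# Any sentence-embedding model works here; all-MiniLM-L6-v2 is a small,
# widely used default that produces 384-dimensional vectors.
model = SentenceTransformer("all-MiniLM-L6-v2")

message = "Can you summarise this article for me?"
vector = model.encode(message)  # numpy array of shape (384,)
print(vector.shape)
```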

Once the message is represented as a vector, the system can utilise the concept of vector distance to determine the intent of the request. By comparing the vector of the input message with predefined intention vectors, the system can identify the purpose behind the user query. This comparison helps route the request to the most suitable service or model.

[Distance Comparison]

Techniques such as cosine similarity and Euclidean distance are commonly used for these comparisons. Cosine similarity measures the cosine of the angle between two vectors, providing a metric for how similar the vectors (and thus the messages) are. Euclidean distance, on the other hand, calculates the straight-line distance between two points in the vector space. By leveraging these geometric properties, the system can take the most appropriate action for the identified intent, handling diverse requests precisely and efficiently.

Notes: Cosine similarity measures the cosine of the angle (θ) between two vectors A and B: cos θ = (A · B) / (‖A‖‖B‖). As θ approaches 0, cos θ approaches 1, so a higher cosine similarity indicates that the two vectors point in more similar directions, meaning the messages are semantically closer.

[Intention Comparison]
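
Putting the pieces together, the sketch below classifies an incoming message by comparing its embedding against one example phrase per intention using cosine similarity; the intent phrases and threshold are illustrative.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

# One representative phrase per intention; a production system would
# average several example phrases per intent into a centroid vector.
INTENT_EXAMPLES = {
    "greeting": "Hello, how are you?",
    "summarise": "Please summarise this article.",
    "account_balance": "What is my current account balance?",
}
intent_vectors = {k: model.encode(v) for k, v in INTENT_EXAMPLES.items()}

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def classify(message: str, threshold: float = 0.5) -> str:
    v = model.encode(message)
    intent, score = max(
        ((k, cosine_similarity(v, iv)) for k, iv in intent_vectors.items()),
        key=lambda pair: pair[1],
    )
    # Below the (illustrative) threshold, treat it as a general message.
    return intent if score >= threshold else "general"

print(classify("Hi there!"))           # -> greeting
print(classify("Show me my balance"))  # -> account_balance
```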

Choosing an Approach

LLM-based routing is praised for its simplicity and ease of implementation. It improves routing accuracy significantly, although it incurs additional cost because of the extra LLM request per message. It is a straightforward solution that can be quickly integrated into existing systems, making it an attractive option where accuracy is paramount.

Vector-based routing requires a slightly more complex setup than LLM-based routing but remains straightforward to implement. It provides faster response times, which can be crucial for time-sensitive applications; however, it tends to be less accurate than its LLM-based counterpart, making it suitable where speed matters more than precision.

Hybrid Routing combines the strengths of both approaches. It uses Vector-based routing as the base layer, supplemented by an LLM escalation mechanism. This hybrid model offers a great balance between cost, performance, and accuracy, making it the preferred choice for systems that require both quick responses and high reliability.
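
A minimal sketch of the hybrid pattern, assuming the same embedding setup as above: the cheap vector layer answers confident cases directly and escalates ambiguous ones to an LLM classifier. The confidence threshold and the escalation stub are illustrative.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
intent_vectors = {
    "greeting": model.encode("Hello, how are you?"),
    "summarise": model.encode("Please summarise this article."),
}

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

HIGH_CONFIDENCE = 0.75  # illustrative; would be tuned on real traffic

def llm_classify(message: str) -> str:
    # Escalation path: in practice this calls the LLM-based classifier
    # from the earlier sketch; a fixed label keeps the example runnable.
    return "general"

def hybrid_classify(message: str) -> str:
    v = model.encode(message)
    intent, score = max(
        ((k, cosine_similarity(v, iv)) for k, iv in intent_vectors.items()),
        key=lambda pair: pair[1],
    )
    if score >= HIGH_CONFIDENCE:
        return intent               # fast path: no LLM call at all
    return llm_classify(message)    # ambiguous: escalate to the LLM router
```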

Final Thoughts

Implementing AI request routing brings a multitude of advantages, optimising both performance and resource utilisation. Here are the key benefits of this approach:

  • Cost Saving: Avoid unnecessary AI API calls by handling many requests with hard-coded responses or traditional APIs (e.g., greeting messages, real-time data requests).
  • Cost Saving: Simple tasks may not require a powerful LLM; request routing can automatically select a more cost-effective LLM.
  • Faster Response: Reducing unnecessary AI API calls frees up capacity for meaningful AI API calls, which is crucial given the rate limitations of many public LLMs.
  • Better Response Quality: Using tailored LLMs for specific tasks yields better results than relying on a single multimodal LLM.
  • More Secure: Employing self-hosted LLMs ensures sensitive data remains within the organisation, enhancing data security.
  • Scalability: Efficiently managing diverse requests allows the system to scale better, accommodating increased traffic without a proportional cost increase.
  • Improved User Experience: Faster, more accurate responses improve user satisfaction and engagement.
