Cutting-edge AI tools require cutting-edge reliability measures. As the developers of modern service management and workflow automation software, we need to be certain that our solution holds up to the highest ethical and legal standards.
We have been fortunate to find a strong collaboration partner for this endeavour in ragas, whose evaluation and synthetic data generation techniques have helped us achieve significant improvements in AI accuracy and performance. Let’s dive into it.
The primary goal of the ragas experiment was to improve our system's ability to identify and retrieve accurate information for employee queries.
When employees message our AI assistant on Slack, Teams, or email, Atom needs to identify the intent and either retrieve relevant information, a service item, or a form. We’ve previously written about our experiments with Loaders and knowledge support using LlamaIndex; this is the next step in getting fast answers to employees. It involved enhancing intent recognition to categorize user queries into three distinct types:
To optimize Atom’s responses, we adopted Weaviate’s hybrid search feature.
While vector search is fundamental for an AI assistant, traditional keyword search is still important for use cases where precision matters, such as critical HR scenarios or legal documents.
Hybrid search in Weaviate combines keyword (BM25) and vector search to leverage both exact term matching and semantic context. By merging results within the same system, developers can build intuitive search applications faster.
Implementing hybrid search in Weaviate, however, involves specifying parameters such as the alpha value and the fusion type. For example, to weight keyword and vector search equally, you can set alpha to 0.5, and you can use the relativeScoreFusion method to combine the scores from the two techniques. This keeps the search results both contextually relevant and precise on keyword matches.
💡 The alpha value determines the weight given to each search method. An alpha value of 0 uses only BM25 (keyword search), while a value of 1 relies solely on vector search. An alpha value of 0.5 balances both methods equally. Extensive experimentation with different alpha values allowed us to find the optimal balance for each dataset, enhancing the accuracy of user query responses.
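For illustration, here is a minimal sketch of such a query using the Weaviate Python client (v4). The collection name (`KnowledgeArticle`), the `title` property, and the local connection helper are assumptions for the example, not our production setup.

```python
# Minimal hybrid-search sketch with the Weaviate Python client v4.
# "KnowledgeArticle" and its "title" property are illustrative assumptions.
import weaviate
from weaviate.classes.query import HybridFusion

client = weaviate.connect_to_local()  # or connect_to_weaviate_cloud(...)
articles = client.collections.get("KnowledgeArticle")

response = articles.query.hybrid(
    query="How do I reset my VPN password?",
    alpha=0.5,                                # 0 = pure BM25, 1 = pure vector
    fusion_type=HybridFusion.RELATIVE_SCORE,  # relativeScoreFusion
    limit=5,
)

for obj in response.objects:
    print(obj.properties["title"])

client.close()
```

In our experiments we reran queries like this with different alpha values per dataset and compared the retrieval scores, rather than settling on 0.5 everywhere.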
To implement alpha tuning for our hybrid search method, we developed specialized datasets tailored to different user engagement facets:
We used ragas both to create a synthetic knowledge-base dataset and to evaluate retrieval.
An ideal evaluation dataset should encompass the various types of questions encountered in production, including questions of varying difficulty. Out of the box, LLMs are not good at creating diverse samples, as they tend to follow common paths. ragas test data generation employs state-of-the-art methods such as evolve-instruct to generate a high-quality, diverse test dataset from any given list of documents.
This ensures comprehensive coverage of potential user queries, enhancing the robustness of our AI evaluation.
The process begins with loading a collection of documents, which are then used to generate synthetic Question/Context/Answer samples. Techniques like multi-context rephrasing and conditional modification add complexity to the questions, creating a more challenging and representative dataset for evaluation. This approach not only saves significant time compared to manual dataset creation but also ensures a higher quality of evaluation through diversity and complexity in the generated questions.
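As a rough end-to-end sketch of that workflow, the snippet below follows the ragas 0.1-style API (and assumes an OpenAI key in the environment). The document path, collection name, property names, question-type mix, and test-set size are illustrative assumptions, not our production configuration.

```python
# Sketch: generate a synthetic test set with ragas, then score retrieval.
# Paths, names, and sizes are illustrative assumptions.
import weaviate
from datasets import Dataset
from langchain_community.document_loaders import DirectoryLoader, TextLoader
from ragas import evaluate
from ragas.metrics import context_precision, context_recall
from ragas.testset.evolutions import multi_context, reasoning, simple
from ragas.testset.generator import TestsetGenerator

# 1. Load the source documents the assistant answers from.
documents = DirectoryLoader(
    "./knowledge_base", glob="**/*.md", loader_cls=TextLoader
).load()

# 2. Generate a synthetic test set with a mix of question types:
#    straightforward lookups, multi-context questions, and reasoning questions.
generator = TestsetGenerator.with_openai()
testset = generator.generate_with_langchain_docs(
    documents,
    test_size=50,
    distributions={simple: 0.5, multi_context: 0.25, reasoning: 0.25},
)
test_df = testset.to_pandas()

# 3. Run every synthetic question through the retriever under test
#    (here, the hybrid query from the earlier sketch).
client = weaviate.connect_to_local()
articles = client.collections.get("KnowledgeArticle")

def retrieve(question: str) -> list[str]:
    response = articles.query.hybrid(query=question, alpha=0.5, limit=5)
    return [str(obj.properties["content"]) for obj in response.objects]

eval_dataset = Dataset.from_dict(
    {
        "question": test_df["question"].tolist(),
        "contexts": [retrieve(q) for q in test_df["question"]],
        "ground_truth": test_df["ground_truth"].tolist(),
    }
)

# 4. Score retrieval quality; rerunning this step with different alpha
#    values is how a sweep like ours can be compared.
scores = evaluate(eval_dataset, metrics=[context_precision, context_recall])
print(scores)

client.close()
```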
We achieved the following through the ragas experiment:
The improvements in accuracy have been particularly noteworthy.
For instance, our AI's ability to correctly identify user intent has increased by 15%, and the precision of responses has improved by 20%.
These enhancements have had a direct impact on the quality of service we provide to our users, making our AI systems more reliable and effective. In enterprise IT service management, a 20% improvement in precision can make a world of difference to end-user satisfaction and an agent’s workload. We are eager to continue building on this success and exploring new opportunities for innovation.
Our partnership with ragas has not only enhanced our AI capabilities but also paved the way for future innovations in synthetic data generation and evaluation methods. Our enterprise AI systems are now more capable than ever, providing precise and reliable responses that enhance the user experience.
To see our assistant in action, sign up for a demo today or connect with us on LinkedIn.