A demo is just a few hours of work, and anyone can build one. The journey to production, however, takes months of effort, and very few have completed it for enterprise use cases beyond content generation or coding. Most of what you find online won't work for your use cases and needs right out of the box.
Enterprise use cases beyond marketing and content generation need accuracy and factuality. Even for basic HR use cases such as health benefits, insurance, and payroll support, you cannot simply rephrase answers. Many enterprises need FAQs served as is, not rephrased, wherever compliance is involved (healthcare, finance, and other regulated domains). And if you sell to enterprises, you cannot easily share their data with third parties, which puts most of the new GenAI infrastructure tools (LLMOps SaaS) out of scope for immediate use.
For some enterprise Q&A use cases, the Generative AI approach doesn't work because of LLM rephrasing, summarization, and hallucination. What they need is an Extractive QA or classic Information Retrieval solution. If your customer just wants factual FAQ answers served as is from the source documents, it's not a GenAI problem but a classic Information Retrieval challenge. Don't waste time trying to optimize LLM prompts to avoid rephrasing or to serve the answer verbatim; it just doesn't work. You have to determine the acceptable level of accuracy for your use case, and that requires a good amount of work.
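To make the contrast concrete, here is a minimal sketch of the extractive approach: match the user's question against stored FAQ questions and return the authored answer verbatim, with no LLM in the loop. The model name, FAQ entries, and similarity threshold are illustrative assumptions, not what any particular product ships with.

```python
# Minimal sketch: serve the stored FAQ answer verbatim instead of generating text.
# Model name, FAQ entries, and threshold are illustrative assumptions.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

faqs = [
    {"q": "When does open enrollment for health benefits start?",
     "a": "Open enrollment runs November 1-30. Changes take effect January 1."},
    {"q": "How do I update my direct deposit for payroll?",
     "a": "Submit the direct deposit form in the HR portal at least 5 business days before payday."},
]
faq_embeddings = model.encode([f["q"] for f in faqs], convert_to_tensor=True)

def answer(query: str, threshold: float = 0.6):
    query_emb = model.encode(query, convert_to_tensor=True)
    scores = util.cos_sim(query_emb, faq_embeddings)[0]
    best = int(scores.argmax())
    if float(scores[best]) < threshold:
        return None  # defer to a human or a fallback flow instead of guessing
    return faqs[best]["a"]  # returned exactly as authored, no LLM rephrasing

print(answer("When can I change my health insurance plan?"))
```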
Enterprise use cases that demand accuracy and factuality are an uphill battle. LLMs excel at creative and generative use cases but fail badly at providing consistently accurate information. You need to incorporate a GuardRails layer for fact-checking, data verification, and sensitive-information leak detection. You also need to scope data visibility based on entities (people, location, department, level, role), so a good NER model is essential for de-identification, data masking, and data visibility segmentation.
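As a rough illustration of visibility scoping, the sketch below filters retrieved chunks by the requesting user's department and role level before anything reaches the LLM. The field names and rules are assumptions for illustration, not a standard schema.

```python
# Illustrative sketch: filter retrieved chunks by the requesting user's scope
# before they ever reach the LLM. Field names and levels are assumptions.
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str
    department: str
    min_role_level: int  # e.g. 1 = employee, 3 = director

@dataclass
class User:
    department: str
    role_level: int

def visible_chunks(user: User, chunks: list[Chunk]) -> list[Chunk]:
    return [
        c for c in chunks
        if c.department in (user.department, "all")
        and user.role_level >= c.min_role_level
    ]

chunks = [
    Chunk("Company-wide holiday calendar.", "all", 1),
    Chunk("Finance department salary bands.", "finance", 3),
]
print([c.text for c in visible_chunks(User("engineering", 1), chunks)])
# -> only the company-wide chunk is visible to this user
```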
Don't believe that any single open-source library can easily handle all formats and all types of data. Reading data from PDF tables is complex enough on its own, parsing dynamic HTML pages won't work with most HTML readers, and you have to leverage both the new (Unstructured, Azure Form Recognizer) and the old (web crawlers, Adobe PDF Reader).
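A rough sketch of what that ends up looking like: route each document to a parser by type rather than hoping one reader handles everything. Unstructured's partition() is its documented entry point; the routing table and the caveat about pre-rendering dynamic HTML are our own illustrative glue.

```python
# Rough sketch of routing documents to different parsers by file type.
# unstructured's partition() is its documented entry point; the routing
# logic here is illustrative glue, not a complete ingestion pipeline.
from pathlib import Path
from unstructured.partition.auto import partition

def parse_document(path: str) -> str:
    suffix = Path(path).suffix.lower()
    if suffix in {".pdf", ".docx", ".pptx"}:
        # element-based parsing handles tables better than raw text extraction
        elements = partition(filename=path)
        return "\n".join(el.text for el in elements if el.text)
    if suffix in {".html", ".htm"}:
        # dynamic pages usually need a crawler/renderer first; this assumes
        # the HTML has already been fetched and saved to disk
        elements = partition(filename=path)
        return "\n".join(el.text for el in elements if el.text)
    raise ValueError(f"No parser configured for {suffix}")
```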
You must go beyond Basic or Naive RAG (just top-k) to Advanced RAG, which leverages use-case-based metadata enrichment during ingestion for recursive retrieval, auto-eval for accuracy monitoring, and a chunking strategy for performance. There is an excellent paper from Nvidia, and we use most of those techniques, including limiting the number of retrieved chunks to under 10. Remember that while RAG is useful for bringing in enterprise knowledge context, it's not a silver bullet that overcomes LLM limitations.
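A hedged sketch of the ingestion-side piece, assuming a recent LlamaIndex install with its default embedding settings (an OpenAI key, unless you configure otherwise): attach use-case metadata to each chunk at ingestion and cap retrieval depth under 10. The documents and metadata fields are made up for illustration.

```python
# Hedged sketch (recent LlamaIndex, default embedding settings assumed):
# enrich chunks with use-case metadata at ingestion, then cap retrieval depth.
from llama_index.core import Document, VectorStoreIndex

documents = [
    Document(
        text="Employees accrue 1.5 PTO days per month.",
        metadata={"source": "hr_handbook.pdf", "section": "leave", "year": 2023},
    ),
    Document(
        text="Expense reports must be filed within 30 days.",
        metadata={"source": "finance_policy.pdf", "section": "expenses", "year": 2023},
    ),
]

index = VectorStoreIndex.from_documents(documents)
# Keep the number of retrieved chunks under 10, per the retrieval-depth finding above.
retriever = index.as_retriever(similarity_top_k=8)
nodes = retriever.retrieve("How fast do I need to submit expenses?")
for node in nodes:
    print(node.node.metadata["source"], node.score)
```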
No provider today can deliver without rate limits unless you deploy the LLMs in your own environment, which isn't ideal initially for most use cases unless usage is at a scale where the cost advantage kicks in. With API and regional limits in place, you need an LLM API load balancer to run health checks, monitor latency and timeouts, and reroute requests to other regions or different providers. This is a must, as cloud regions can have network outages, like the recent Azure outage that caused their hosted OpenAI service APIs to fail.
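A simplified failover sketch of that idea follows. The endpoint URLs, their ordering, and the payload shape are placeholders; a production balancer would also track rolling latency and health per endpoint.

```python
# Simplified failover sketch. Endpoint URLs, ordering, and payload shape are
# placeholders; a real balancer would also track rolling latency and health.
import httpx

ENDPOINTS = [
    "https://eastus.example-llm.com/v1/chat/completions",      # primary region (placeholder)
    "https://westeurope.example-llm.com/v1/chat/completions",  # secondary region (placeholder)
    "https://api.other-provider.example/v1/chat",              # different provider (placeholder)
]

def complete(payload: dict, timeout_s: float = 10.0) -> dict:
    last_error = None
    for url in ENDPOINTS:
        try:
            resp = httpx.post(url, json=payload, timeout=timeout_s)
            if resp.status_code == 429:  # rate limited: try the next region/provider
                last_error = RuntimeError(f"429 from {url}")
                continue
            resp.raise_for_status()
            return resp.json()
        except httpx.HTTPError as exc:   # timeouts, network outages, 5xx: reroute
            last_error = exc
    raise RuntimeError("All LLM endpoints failed") from last_error
```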
In enterprise use cases, it's extremely important to have an auto-eval framework in place for your domain and use cases. LLMs are frequently updated, and you must be able to detect improvements or regressions caused by model changes or upgrades from LLM providers. The emerging practice of Evaluation Driven Development might become the norm for LLM apps as models constantly evolve.
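A tiny illustration of the core loop: run a golden set through the pipeline on every model change and gate the rollout on the score. The dataset, the containment scorer, and ask_llm() are stand-ins for your own domain data and pipeline.

```python
# Tiny illustration of a golden-set regression check run on every model change.
# The dataset, the scorer, and ask_llm() are stand-ins for your own pipeline.
golden_set = [
    {"question": "How many PTO days per month?", "expected": "1.5"},
    {"question": "Expense filing deadline?", "expected": "30 days"},
]

def ask_llm(question: str) -> str:
    # Stand-in: wire this to your actual RAG pipeline / LLM call.
    return "Employees accrue 1.5 PTO days per month."

def score(answer: str, expected: str) -> bool:
    return expected.lower() in answer.lower()  # naive containment; swap in your own metric

def run_eval() -> float:
    passed = sum(score(ask_llm(item["question"]), item["expected"]) for item in golden_set)
    return passed / len(golden_set)

# Gate releases and provider model upgrades on this score staying above a threshold you trust.
print(run_eval())
```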
For LLM apps being sold to enterprises, you need input layers of protection for de-identification and data masking (we use Microsoft Presidio). You need to think beyond prompt templates, with prompt chaining, dynamic prompt programming, compression, and control of the output size. Monitoring median, average, and 95th/99th percentile LLM query latency matters in production for end-user perception and product experience. We use different models for intent classification (claude-instant), embedding (azure-openai-ada-text-embedding-002-v2), summarization (azure-openai-3.5-turbo with 16K context), and rerank (cohere).
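Since the post names Microsoft Presidio, here is a minimal analyze-and-anonymize pass over an input before it is sent to any LLM; the choice to replace every detected entity with a single redaction token is an illustrative assumption, and real deployments usually configure per-entity operators.

```python
# Minimal Presidio sketch: detect PII in the input and mask it before the LLM call.
# Replacing everything with one token is an illustrative choice, not a recommendation.
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine
from presidio_anonymizer.entities import OperatorConfig

analyzer = AnalyzerEngine()
anonymizer = AnonymizerEngine()

text = "John Smith's SSN is 078-05-1120 and his email is john.smith@acme.com"
results = analyzer.analyze(text=text, language="en")
masked = anonymizer.anonymize(
    text=text,
    analyzer_results=results,
    operators={"DEFAULT": OperatorConfig("replace", {"new_value": "<REDACTED>"})},
)
print(masked.text)  # de-identified text, safe to pass downstream
```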
Building production-ready AI apps is just 10% AI work and 90% the software engineering we have always done, even with GenAI and LLMs in the picture. You need DevOps and Platform Engineering added to the mix to ensure you have load balancing for LLM APIs, recovery for your vector databases, caching for performance, and monitoring and tracing of metrics for troubleshooting.
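On the caching point, a small illustration: key responses on a hash of the prompt so repeated queries skip the rate-limited API entirely. The in-process dict stands in for Redis or memcached, and TTL/invalidation handling is omitted.

```python
# Small illustration: cache LLM responses keyed by a hash of the prompt.
# The in-process dict stands in for Redis/memcached; TTL handling is omitted.
import hashlib

_cache: dict[str, str] = {}

def cached_complete(prompt: str, complete_fn) -> str:
    key = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
    if key in _cache:
        return _cache[key]          # cache hit: no API call, no latency spike
    response = complete_fn(prompt)  # cache miss: call the (rate-limited) LLM API
    _cache[key] = response
    return response
```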
We want to thank the folks at LlamaIndex (Jerry and Ravi), Guardrails (Shreya), Predera (Vamshi), Ozonetel (Chaitanya), and many others who have shared their knowledge and supported us in our journey.
We still have a long way to go, but we felt we must share what we have learned so far with the B2B SaaS community that's racing to build solutions using GenAI and LLMs, the exciting new tech.
Bonus read on boring tech vs exciting tech - https://mcfunley.com/choose-boring-technology
Founding Team, Engineering
Co-Founder and CTO
Founding Team, Engineering
Founding Team, Engineering