Lessons Building an Enterprise GenAI (RAG) Product
As Generative AI becomes more ubiquitous among consumers, enterprises are scrambling to incorporate GenAI into their platforms, find the right use cases, and build robust customer-facing bots. Some early experience reports are now available, for example LinkedIn’s, and they resonate with my own experience building both generative AI and deep learning products with enterprise clients. This article is an attempt to structure my personal experiences and highlight the key patterns and challenges emerging in the design of enterprise AI apps.
RAG in Production for Enterprises
Let’s start with some patterns common to many such conversational + RAG apps.
Getting the RAG architecture designed and deployed over enterprise datasets is quicker if the data sources are already available via structured APIs.
Different org-units/teams hold expertise in different knowledge areas. Hence, the frontend of the bot app is a Router which, on receiving a request, picks the right specialized Knowledge Agent to respond.
The architecture consists of three modules: Routing, Retrieval, and Generation. The Routing and Retrieval modules are classifier-type (discriminative) models, while the Generation module is, well, generative. The classifiers are generally smaller (task-specific, fine-tuned) models than the generative model. Retrieval typically utilizes a cascade of models.
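As a concrete illustration of the retrieval cascade, here is a minimal sketch in which toy scoring functions stand in for the usual embedding/BM25 recall stage and cross-encoder re-ranker; the names and scoring logic are illustrative, not from any particular framework.

```python
from typing import List, Tuple

def recall_stage(query: str, corpus: List[str], top_n: int = 20) -> List[str]:
    # Cheap recall over the whole corpus (stand-in for BM25 / embedding search).
    overlap = lambda t: len(set(query.lower().split()) & set(t.lower().split()))
    return sorted(corpus, key=overlap, reverse=True)[:top_n]

def rerank_stage(query: str, candidates: List[str], top_k: int = 3) -> List[Tuple[str, float]]:
    # More expensive scoring over the shortlist (stand-in for a cross-encoder).
    def score(t: str) -> float:
        return len(set(query.lower().split()) & set(t.lower().split())) / (len(t.split()) + 1)
    return sorted(((t, score(t)) for t in candidates), key=lambda p: p[1], reverse=True)[:top_k]

def retrieve(query: str, corpus: List[str]) -> List[str]:
    shortlist = recall_stage(query, corpus)
    return [text for text, _ in rerank_stage(query, shortlist)]

docs = ["Refunds are allowed within 30 days.", "Our office is in Austin.", "Refund requests need an order id."]
print(retrieve("how do I get a refund", docs))
```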
Once the system is up, we need Evals for each step to gain insights into errors and refine the modules.
A friendly reminder about the building curve for AI models: squeezing the last mile of performance out of a Generative AI product is painfully slow. More on this below.
"Now, generation, that was a different story. It followed the 80/20 rule; getting it 80% was fast, but that last 20% took most of our work."
Unique Challenges in building an Enterprise Generative AI Product
What are some issues that are hard and unique to the enterprise? I’ll discuss the following core issues:
Assembling heterogeneous Org Data
Building Evals, Tackling Hallucinations
Prompt + RAG, Fine-tune or both?
Scaling GenAI app to large user base
Assembling the Org Data (and Processes)
A large org has many specialized teams with niche knowledge. In LinkedIn’s case: general knowledge, job assessment, unique data about people, companies, skills and courses.
That necessitates a Router at the top of the pipeline, which decides which API / Tool / Skill / Agent to use and routes the incoming query/request to the right Agent.
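A minimal sketch of this routing layer, assuming hypothetical agent names and a keyword stub in place of the fine-tuned intent classifier:

```python
from typing import Callable, Dict

# Hypothetical specialized agents; each would wrap a team's API, tool or skill.
def jobs_agent(query: str) -> str:     return f"[job-assessment agent] {query}"
def people_agent(query: str) -> str:   return f"[people-data agent] {query}"
def courses_agent(query: str) -> str:  return f"[courses agent] {query}"
def general_agent(query: str) -> str:  return f"[general-knowledge agent] {query}"

AGENTS: Dict[str, Callable[[str], str]] = {
    "jobs": jobs_agent, "people": people_agent,
    "courses": courses_agent, "general": general_agent,
}

def classify_intent(query: str) -> str:
    """Stand-in for a small fine-tuned classifier that labels the request."""
    q = query.lower()
    if "job" in q or "role" in q:       return "jobs"
    if "course" in q or "learn" in q:   return "courses"
    if "profile" in q or "people" in q: return "people"
    return "general"

def route(query: str) -> str:
    label = classify_intent(query)
    return AGENTS.get(label, general_agent)(query)

print(route("Which course should I take to learn SQL?"))
```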
The first issue is the obvious one. Exposing all the org knowledge via robust APIs is a challenge and requires a systematic effort to build a unified data platform with access controls, which all teams can build upon.
The second one is more interesting and specific to LLMs. Even though the relevant APIs may already exist, they may not provide data in an LLM-friendly format. To invoke external Tools, LLMs need a well-defined API schema, and the API results should follow that schema. Therefore, we need wrappers around existing APIs that let LLMs talk to them.
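A sketch of such a wrapper, assuming a hypothetical internal course-search API: the tool schema follows the common JSON-Schema style used by tool-calling LLM APIs, and the wrapper flattens the legacy payload into a small, stable JSON shape.

```python
import json
from typing import Any, Dict

# Hypothetical tool schema, in the JSON-Schema style most tool-calling LLMs expect.
COURSE_SEARCH_TOOL = {
    "name": "search_courses",
    "description": "Search the internal course catalog by topic.",
    "parameters": {
        "type": "object",
        "properties": {"topic": {"type": "string"}, "limit": {"type": "integer"}},
        "required": ["topic"],
    },
}

def legacy_course_api(topic: str, limit: int = 5) -> Dict[str, Any]:
    """Stand-in for an existing internal API with an LLM-unfriendly payload."""
    return {"payload": {"items": [{"CourseTitle": f"{topic} 101", "Mins": 90}]}, "status": 200}

def search_courses(topic: str, limit: int = 5) -> str:
    """Wrapper: call the legacy API and return a compact, schema-stable JSON string."""
    raw = legacy_course_api(topic, limit)
    courses = [
        {"title": item["CourseTitle"], "duration_minutes": item["Mins"]}
        for item in raw.get("payload", {}).get("items", [])[:limit]
    ]
    return json.dumps({"courses": courses})

print(search_courses("SQL"))
```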
Next, the user-facing LLMs must respond in a safe, responsible, and empathetic manner. This imposes response-style constraints on each specialized Agent and requires org-wide consensus on, and adherence to, output formats.
Finally, we want common processes across teams and to avoid spurious differences. Specifically, inconsistent eval processes across teams really hurt, and many teams could reuse and build upon similar prompt templates. Thus, there are tradeoffs in what tasks/modules to delegate to separate teams vs. what to keep centralized.
Building Evals and Taming Hallucinations
From the report:
The team achieved 80% of the basic experience we were aiming to provide within the first month and then spent an additional four months attempting to surpass 95% completion of our full experience - as we worked diligently to refine, tweak and improve various aspects. We underestimated the challenge of detecting and mitigating hallucinations, as well as the rate at which quality scores improved—initially shooting up, then quickly plateauing.
LLM responses are unreliable. For the same input prompt, the response may vary across runs, not only in content but also in format. The LLM may adhere to the expected JSON/YAML format in 90% of runs, but in the other 10% it simply goes wrong, violating the structure spec in many different ways.
There are several ways to tackle this nuisance, ranging from hacky to sophisticated. One could set up multi-step self-reflection/correction flows, but they increase latency for the end user as well as inference costs (and are not guaranteed to work). What often works is patching the code for every type of output-format violation until you’ve covered most cases (or lost patience). These patches are regex rules; they don’t affect latency much, but they take you back to the most primitive form of computer science in production! 🤷
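To make this concrete, here is a sketch of what such patches tend to look like; the repair rules below are illustrative examples of common violations, not an exhaustive or production-tested list.

```python
import json
import re

def patch_llm_json(raw: str) -> dict:
    """Best-effort repair of common JSON format violations in LLM output.
    Each rule below corresponds to a failure mode observed in practice."""
    text = raw.strip()
    # 1. Strip markdown code fences the model sometimes wraps around the JSON.
    text = re.sub(r"^`{3}(?:json)?\s*|\s*`{3}$", "", text)
    # 2. Keep only the outermost {...} block if the model added chatty preamble.
    match = re.search(r"\{.*\}", text, flags=re.DOTALL)
    if match:
        text = match.group(0)
    # 3. Remove trailing commas before closing braces/brackets.
    text = re.sub(r",\s*([}\]])", r"\1", text)
    return json.loads(text)

print(patch_llm_json('Sure, here is the JSON you asked for: {"answer": "42",}'))
```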
I’ve also seen the practice of layered evals for responses at several enterprises. The first layer is handled by an engineer, who writes automated evals to validate outputs. The second layer belongs to domain experts: SMEs who eyeball the outputs for subtle or newer errors. The third layer is A/B testing with end users. Eval completion time increases as we move down the layers.
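As an illustration of the first (automated) layer, here is a sketch with a few example checks; the check names and rules are hypothetical.

```python
import json
import re
from typing import Callable, Dict, List

def check_json_parses(response: str) -> bool:
    try:
        json.loads(response)
        return True
    except ValueError:
        return False

def check_no_ssn_like_strings(response: str) -> bool:
    # Crude example of a safety/PII check.
    return re.search(r"\b\d{3}-\d{2}-\d{4}\b", response) is None

def check_length(response: str, max_chars: int = 2000) -> bool:
    return len(response) <= max_chars

CHECKS: Dict[str, Callable[[str], bool]] = {
    "json_parses": check_json_parses,
    "no_ssn_like_strings": check_no_ssn_like_strings,
    "length_ok": check_length,
}

def run_layer1(responses: List[str]) -> Dict[str, float]:
    """Pass-rate per check; responses that fail get escalated to SME review."""
    return {name: sum(check(r) for r in responses) / len(responses)
            for name, check in CHECKS.items()}

print(run_layer1(['{"answer": "ok"}', "not json at all"]))
```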
Hallucination metrics are a form of eval that detects how faithful a response is to closed- or open-world knowledge. While they depend on the judge LLM and are far from perfect, they can help identify outliers quickly.
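One common pattern is an LLM-as-judge faithfulness check that grades a response against the retrieved context; the prompt, scoring scale, and `judge_llm` callable below are illustrative placeholders, not a specific metric library.

```python
from typing import Callable

FAITHFULNESS_PROMPT = """You are grading an assistant's answer for faithfulness.
Context:
{context}

Answer:
{answer}

Reply with a single number: 1 if every claim in the answer is supported by the
context, 0 otherwise."""

def faithfulness_score(context: str, answer: str, judge_llm: Callable[[str], str]) -> int:
    """LLM-as-judge faithfulness check; imperfect, but useful for flagging outliers."""
    verdict = judge_llm(FAITHFULNESS_PROMPT.format(context=context, answer=answer))
    return 1 if verdict.strip().startswith("1") else 0

# Stub judge for a dry run; in practice this calls your judge model of choice.
print(faithfulness_score("Our refund window is 30 days.",
                         "Refunds are allowed within 30 days.",
                         lambda prompt: "1"))
```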
More abstractly, aligning bots with the goals of enterprises is hard and something we haven’t quite gotten the hang of; this is articulated well in this quote by Alex Ratner:
the reality we see working with enterprises today is that they're struggling to get basic LLMs aligned with their organizational standards, ethics, and objectives, in order to be production-ready.
E.g. the priority is: let's get basic customer-facing chatbots to communicate without saying inappropriate and inaccurate things, before we worry about them turning into Terminators!
Prompt + RAG, Fine-tune or both?
Fine-tuning embeds new knowledge into the model’s memory ahead-of-time whereas RAG techniques find relevant knowledge at run-time and infer the answer to a query.
Fine-tuning creates a model runtime with fixed memory, whereas RAG can access unbounded external memory.
Fine-tuning has an upfront cost and capital investment, while RAG pays a small runtime cost for each run to access external knowledge stores.
Fine-tuning models requires more specialized team skills (data curation, annotation, model training, infra, ...) than implementing RAG via LLM APIs.
Both methods may fail to generate accurate responses on account of missing information, improper reasoning, or hallucinated invalid information. We need to create evals specific to each method and iterate.
Here is a rule-of-thumb that I’ve found to work in several cases:
Develop an initial prototype using a combination of RAG and prompt-engineering. Usually, the prototype consists of multiple stages.
Identify tasks/stages that can do well with a fixed memory and carve them out into separate models. This involves fine-tuning one or more LLMs; a minimal sketch of this two-step progression follows.
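The sketch below assumes a generic `llm` callable; the stage boundaries are the point, since each stage can later be measured on its own and, where it works with fixed knowledge, fine-tuned into a smaller dedicated model.

```python
from typing import Callable, List

def rewrite_query(query: str, llm: Callable[[str], str]) -> str:
    # Stage 1: query understanding; often works with fixed knowledge, so it is a
    # natural candidate to carve out and fine-tune into a smaller model later.
    return llm(f"Rewrite this as a standalone search query: {query}")

def retrieve(query: str, corpus: List[str], k: int = 3) -> List[str]:
    # Stage 2: retrieval; a toy lexical scorer stands in for a real vector index.
    overlap = lambda t: len(set(query.lower().split()) & set(t.lower().split()))
    return sorted(corpus, key=overlap, reverse=True)[:k]

def generate_answer(query: str, context: List[str], llm: Callable[[str], str]) -> str:
    # Stage 3: grounded generation via prompt engineering.
    prompt = "Answer using only the context below.\nContext:\n" + "\n".join(context) + f"\n\nQuestion: {query}"
    return llm(prompt)

def prototype(query: str, corpus: List[str], llm: Callable[[str], str]) -> str:
    rewritten = rewrite_query(query, llm)
    return generate_answer(rewritten, retrieve(rewritten, corpus), llm)

canned_llm = lambda prompt: "stub response"     # stand-in LLM for a dry run
print(prototype("refund policy?", ["Refunds are allowed within 30 days."], canned_llm))
```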
Scaling to a Large User Base
There are three popular metrics for LLM-based app performance:
TTFT, Time To First Token
TBT, Time Between Tokens
TPS, Tokens Per Second
TTFT closely relates to the app latency perceived by the user; in essence, it is like the initial page-load time of a SaaS web app. Long TTFTs can turn users away very easily.
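For concreteness, here is a small sketch that computes the three metrics from token-arrival timestamps recorded around a streaming call (timestamps in seconds; the function and field names are illustrative):

```python
from typing import List

def latency_metrics(request_time: float, token_times: List[float]) -> dict:
    """Compute TTFT, average TBT, and TPS from wall-clock token timestamps."""
    ttft = token_times[0] - request_time                       # Time To First Token
    gaps = [b - a for a, b in zip(token_times, token_times[1:])]
    tbt = sum(gaps) / len(gaps) if gaps else 0.0               # avg Time Between Tokens
    total = token_times[-1] - request_time
    tps = len(token_times) / total if total > 0 else 0.0       # Tokens Per Second
    return {"ttft_s": ttft, "avg_tbt_s": tbt, "tps": tps}

print(latency_metrics(0.0, [0.4, 0.45, 0.52, 0.58]))  # -> ttft 0.4s, ~0.06s gaps, ~6.9 tok/s
```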
Async everywhere. LLMs have a non-trivial response-generation time, so we cannot simply call and wait for the complete LLM response in user-facing apps. Instead, all components that rely on LLMs handle async streaming tokens, and UIs present tokens as they arrive, keeping the user aware of delays with visual cues. This, in turn, leads to designing the complete app around async call patterns.
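A minimal sketch of this pattern using asyncio, with `fake_llm_stream` standing in for whichever streaming client you actually use:

```python
import asyncio
from typing import AsyncIterator

async def fake_llm_stream(prompt: str) -> AsyncIterator[str]:
    """Stand-in for a streaming LLM call; yields tokens with simulated delay."""
    for token in ["Enterprise ", "GenAI ", "answers ", "stream ", "in."]:
        await asyncio.sleep(0.1)
        yield token

async def render_response(prompt: str) -> None:
    print("thinking...", flush=True)            # visual cue while waiting for the first token
    async for token in fake_llm_stream(prompt):
        print(token, end="", flush=True)        # UI appends tokens as they arrive
    print()

asyncio.run(render_response("Summarize our refund policy."))
```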
Besides the above key challenges, there are several more that lurk as we optimize more deeply:
Refactor away tasks from large LLMs to smaller, narrow, task-specific models. Narrow models can be deployed on-premise to save costs.
Choosing/routing between LLMs to reduce costs and latency (sketched after this list)
Prompt optimizations and pre-/post-processing to reduce token counts at every stage of the pipeline
More parallelization among tasks to reduce LLM-blocking delays.
Managing Prompt versions, monitoring platforms for error analysis
and more..
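To make the model-routing item concrete, here is a sketch of cost-aware routing between a small and a large model; the difficulty heuristic, threshold, and model tiers are all placeholders.

```python
from typing import Callable, Dict

# Placeholder model tiers: a small, cheaper (possibly on-premise) model and a
# larger, more capable one.
MODELS: Dict[str, Callable[[str], str]] = {
    "small": lambda p: f"[small-model] {p[:40]}",
    "large": lambda p: f"[large-model] {p[:40]}",
}

def estimate_difficulty(prompt: str) -> float:
    """Toy heuristic; in practice a small classifier or token-count + task-type rules."""
    return min(1.0, len(prompt.split()) / 200)

def route_to_model(prompt: str, threshold: float = 0.3) -> str:
    tier = "small" if estimate_difficulty(prompt) < threshold else "large"
    return MODELS[tier](prompt)

print(route_to_model("What are your support hours?"))   # short/simple -> small model
```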
Wrap up
Generative AI using LLMs allows large enterprises to enable better customer engagement (availability, content and response quality) while saving costs. I discussed four key challenges that are common to generative apps in enterprise settings:
unifying the enterprise-wide data, APIs and response formats,
deciding between fine-tuning LLMs vs. external RAG,
building a robust Eval platform with minimal impact on latency, and
system design metrics and patterns that enable scaling to large user bases.
LLMs are very powerful language calculators and enable seamless conversational analytics. However, there is a trust gap between humans and LLMs, which makes it difficult to build robust apps. Design processes as well as Evals need to work around and bridge this gap to build robust, deterministic, customer-facing generative apps for enterprises.