
Ask Tim: Buy or Build Generative AI

Written by Tim Oates | Jun 11, 2024 3:15:00 PM

Home ownership comes with lots of decisions about whether to do it yourself or get some help. For months my garage door opener was hit or miss. Sometimes I’d press the button and the door would noisily slide open. Other times the door would sit there in silence. I replaced the button. That worked for a while. Then I replaced the sensors that keep the door from closing on you. That worked for a while. Then I called a professional, which, in hindsight, I should have done months earlier. Now, when it’s time to take out the recycling, I never have to wonder whether I’ll carry it through the garage or have to lug it through the house.

Decision makers at companies considering deploying generative AI face a similar choice: build it themselves or get some help. The stunning pace of scientific advancement in GenAI is matched by equally rapid development of powerful open source tools. What used to take a small team a few months to build can now be done in a week by someone who knows what they are doing. Thanks to the hype around GenAI, there is usually a person inside your company eager to skill up and be that “someone”, and there is no end of consultants who have hung a GenAI shingle outside their virtual office.

Unlike my struggle with the garage door opener, which was a simple time vs. money tradeoff, deciding whether to build or buy a GenAI solution is more complicated and the stakes are clearly higher. This blog breaks down the factors to consider and, for each one, offers advice on when building or buying makes more sense.

The Most Common Use Case: RAG Chatbots

For the sake of simplicity we’ll focus on Retrieval Augmented Generation (RAG) chatbots. This is by far the most common use case we’re seeing at Synaptiq as we talk to potential clients. Large Language Models (LLMs) like ChatGPT know a lot. You can ask ChatGPT who won the eighth Super Bowl, or for a recipe for clam chowder, or how to fix a broken garage door opener, and the answers you get will be really good. But LLMs were not trained on your company’s data and therefore can’t answer questions about it. That’s where RAG comes into the picture.

RAG systems ingest your documents into a database. When a user poses a question, relevant documents are retrieved and a prompt is created for the LLM that says something like “Please answer the following question based solely on the content below. The question is … and the content is …”. That way you get the amazing language abilities of the LLM but force the core of the answer to focus on your data.
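
To make that concrete, here is a minimal sketch of the prompt-assembly step in Python. The `retrieve` function is a stand-in for whatever vector search you use, and the model name is illustrative rather than a recommendation:

```python
# A minimal sketch of the RAG prompt-assembly step.
# `retrieve` is a placeholder for your vector search, and the
# model name is an illustrative assumption, not a recommendation.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def answer(question: str, retrieve) -> str:
    # Fetch the chunks most relevant to the question.
    chunks = retrieve(question, k=3)
    context = "\n\n".join(chunks)
    prompt = (
        "Please answer the following question based solely on the "
        f"content below.\n\nQuestion: {question}\n\nContent:\n{context}"
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```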


The RAG System Pipeline

Now let’s break that pipeline down and, for each part, consider which factors would make you lean more toward buying or building.

Data ingest: The data used by RAG chatbots is typically text.  It could be as simple as a single PDF file of the employee handbook to build an HR portal for your employees, or as complicated as a large set of user manuals, customer service call transcripts, and field reports from technicians to build a customer-facing portal for people who bought your appliances.  

In the first case there is a small number of documents that change infrequently, have just a few formats (e.g., PDF or Word), are written well, and probably all live in the “HR” folder in your corporate intranet. In the second case the content is spread out over multiple locations and formats (files and tables in different databases), has highly varying quality and content (well written manuals, automated transcriptions of calls, hastily written free form notes from busy field technicians), and is constantly being updated. Further, you’ve got to be really careful not to surface raw call data to a customer in the portal so that, for example, nobody sees a response from the LLM like “many customers think that our Model X is the worst dishwasher ever made”.

So you might consider building if the data are uniform, small, high quality, and static, but you may want to buy if the data are heterogeneous, large, dynamic, or need special preprocessing (e.g., you might want to use an LLM to summarize the call logs to extract just the model number, problem reported, and resolution to create a more semi-structured dataset devoid of any content that conveys sentiment).
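
As a sketch of what that preprocessing step could look like, you might prompt an LLM to reduce each transcript to just the fields you want to index. The prompt and model name here are illustrative assumptions:

```python
# Illustrative sketch: use an LLM to distill a raw call transcript
# into a small semi-structured record, deliberately omitting sentiment.
import json
from openai import OpenAI

client = OpenAI()

def summarize_call(transcript: str) -> dict:
    prompt = (
        "Extract only the appliance model number, the problem reported, "
        "and the resolution from this customer service call. Respond as "
        "JSON with keys model, problem, resolution. Do not include any "
        f"opinions or sentiment.\n\nTranscript:\n{transcript}"
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumption: a small, cheap model suits this task
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)
```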

Data storage and retrieval: RAG systems first break your text up into chunks, like sections or pages or paragraphs.  Each chunk is stored in a vector database that associates each bit of text with a vector (list) of numbers such that two chunks that are semantically similar will have similar vectors.  When a user submits a query, the query is turned into a vector and the database can quickly find the most relevant chunks (those with the most similar vectors to the query).
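
Chroma is one of many databases that implement this pattern; here is a minimal sketch, assuming a plain-text handbook file and Chroma’s default embedding model. The naive paragraph splitting is exactly where the “art” mentioned below comes in:

```python
# Minimal sketch of chunking, storage, and retrieval with Chroma.
# Chroma embeds documents with a default model unless you supply one.
import chromadb

client = chromadb.Client()
collection = client.create_collection("handbook")

# Naive chunking: split the handbook into paragraphs.
text = open("employee_handbook.txt").read()
chunks = [p.strip() for p in text.split("\n\n") if p.strip()]
collection.add(
    documents=chunks,
    ids=[f"chunk-{i}" for i in range(len(chunks))],
)

# Retrieval: the query is embedded and the nearest chunks come back.
results = collection.query(
    query_texts=["How much vacation do I get?"], n_results=3
)
print(results["documents"][0])
```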

If there’s any part of the standard RAG pipeline that is commoditized, this is it.  There are many vector databases that do the same thing, are performant, and scale well.  There is some art to picking the chunk size and the number of chunks that are retrieved per query; too few may miss the best content and too many may confuse the LLM.  

In general, building the data storage and retrieval component makes sense if you’ve got the expertise on staff, but buying may be the better choice if finding the right content is hard with standard mechanisms and you need to explore more sophisticated approaches like post-retrieval reranking to push the right content higher.

Choosing an LLM: Choosing an LLM can be daunting, but there are ways of future-proofing your system in case buyer’s remorse sets in. The main things to consider are quality, control, and cost.

LLMs from different vendors and different versions of the same LLM can give very different responses given the same data, prompts, and queries.  It’s a good idea to do a few simple tests with your content to see if one LLM feels like a better fit for your use case and data.  Stephen Sklarew, the CEO of Synaptiq, uses AnythingLLM to quickly build chatbots for internal use cases and side projects.  AnythingLLM makes it shockingly easy to ingest data, choose from a number of different LLMs, and surface a chatbot.  It’s a great tool for that quick test drive to inform the choice of an LLM.

Another key decision is whether you’ll host your own LLM or use a commercial API.  There are really good open source LLMs, such as the Llama series from Meta available from Hugging Face, that you can run on your own hardware, giving you complete control.  The programming interface will not change, the data never leaves your IT footprint, and your ongoing costs all revolve around paying for compute.  Commercial APIs open up the world of ever more powerful LLMs like the ChatGPT series and free you from managing compute for the models, but expose you to per-call costs that can be surprisingly high for the most powerful models.

Frameworks like Amazon’s Bedrock make swapping in different LLMs easy, ensuring that changing your mind will not lead to lots of additional work on infrastructure.  But individual LLMs can produce very different output given small changes in the prompt, and different LLMs can respond very differently to the same prompt.  So there will probably be prompt engineering work to do if you swap in a different LLM.
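
For a flavor of what that looks like, here is a minimal sketch using Bedrock’s Converse API, which gives different model families a single calling convention. The model IDs are examples; what’s actually available depends on your region and account access:

```python
# Sketch: Bedrock's Converse API gives different LLMs one calling
# convention, so swapping models is a one-line change. Model IDs are
# examples; availability depends on your region and account.
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

def ask(model_id: str, question: str) -> str:
    response = bedrock.converse(
        modelId=model_id,
        messages=[{"role": "user", "content": [{"text": question}]}],
    )
    return response["output"]["message"]["content"][0]["text"]

# Same call, two different model families.
print(ask("anthropic.claude-3-haiku-20240307-v1:0", "Summarize RAG in one sentence."))
print(ask("meta.llama3-8b-instruct-v1:0", "Summarize RAG in one sentence."))
```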

This is all complicated by the fact that modern RAG systems often make multiple calls to the LLM to, for example, enforce guardrails (see below). Each call adds cost and latency, though some types of calls can be routed to smaller, cheaper, faster LLMs, leaving the core question-answering task to the larger, slower, and more expensive one.

It may be a good idea to build if preliminary tests with a tool like AnythingLLM show promising results with a wide range of LLMs, but it may be a good idea to buy if it’s hard to get quick results that look good, your application needs to scale to lots of concurrent users, or throughput and costs need to be carefully controlled.

Guardrails: Early LLMs were eager to answer your questions, including ones about how to make a bomb or successfully engage in various illegal activities.  It’s harder to coax that kind of information out of the current generation of LLMs, but there are other kinds of interactions that you might like to guard against.  For example, you would not want an HR chatbot to exhibit bias against a protected group, surface personal information about an employee, answer questions unrelated to HR, or be overly eager to suggest firing people.  

Guardrails, bits of code that prevent unwanted behavior, are commonplace in chatbot deployments.  Some are as simple as filters against lists of vulgar words, while others involve calling an LLM to judge the relevance of a user query to prevent off-topic interactions.
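
Here is a sketch of both styles. The blocked-word list and the judge prompt are illustrative, and the relevance judge is routed to a small, cheap model for the cost and latency reasons discussed above:

```python
# Sketch of two guardrail styles: a cheap word-list filter and an
# LLM-as-judge relevance check routed to a small, inexpensive model.
from openai import OpenAI

client = OpenAI()
BLOCKED_WORDS = {"example_slur", "example_vulgarity"}  # illustrative list

def passes_word_filter(query: str) -> bool:
    return not (set(query.lower().split()) & BLOCKED_WORDS)

def is_on_topic(query: str) -> bool:
    # This second LLM call adds cost and latency, so use a small model.
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumption: a cheap judge model
        messages=[{
            "role": "user",
            "content": "Is this question about HR policies or benefits? "
                       f"Answer only yes or no.\n\nQuestion: {query}",
        }],
    )
    return response.choices[0].message.content.strip().lower().startswith("yes")

def guarded(query: str) -> bool:
    return passes_word_filter(query) and is_on_topic(query)
```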

Guardrails can drive up cost and latency if they involve calls to LLMs, and can be finicky to get right if they involve using LLMs to make judgements about things like relevance or response quality.  Consider building if your guardrails are simple and few, but buying may be better if there are many guardrails and they involve potentially expensive, slow, and finicky calls to LLMs.

Evaluation: It’s important to know how well your RAG chatbot is doing before rolling it out to users.  Qualitative evaluation based on ad hoc queries can give you a good idea as to whether things are on track or off the rails.  But it’s clear at this point in the blog that there are lots of moving parts that may be undergoing iterative development, and repeatable, quantitative metrics are key to track the impact of changes.

There are well-defined metrics that are now common in RAG chatbot deployments, such as the fraction of relevant content retrieved, how much of the answer is drawn from that content, and response time.  There are also frameworks to help compute the metrics, with Ragas being a popular one that we use.
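
For a flavor of Ragas in practice, here is a minimal sketch. The example records are invented, the exact column names vary a bit across Ragas versions, and note that these metrics themselves call an LLM behind the scenes, which carries its own cost:

```python
# Minimal sketch of a Ragas evaluation run. The records are invented,
# and Ragas uses an LLM (OpenAI by default) to score the metrics.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_recall

data = {
    "question": ["How much vacation do new employees get?"],
    "answer": ["New employees accrue 15 days of vacation per year."],
    "contexts": [["Full-time employees accrue 15 vacation days annually."]],
    "ground_truth": ["15 days per year."],
}
results = evaluate(
    Dataset.from_dict(data),
    metrics=[faithfulness, answer_relevancy, context_recall],
)
print(results)
```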

There is a bit of an art to defining a set of metrics that measure the things you actually care about, rather than taking a “kitchen sink” approach, and to creating the right kind and amount of ground truth data for those metrics that require it (such as answer accuracy).

Consider building if a single framework like Ragas has all of the metrics you need, and it is clear that those metrics provide information about the things that will make your system successful for the target audience. Consider buying if the available metrics are a bit confusing, are thought to be incomplete with respect to your specific use case, or the task of creating the ground truth data feels daunting.

So, Buy or Build Generative AI?

Though I characterized my approach to home improvement projects as being based on a time-money tradeoff, there are really three axes: time, money, and quality. Given enough time I can do many things well, but there are some projects where learning while doing will lead to subpar results. Due to the many interacting factors discussed above, RAG pipelines intended for external use tend to be the kind of project where learning while doing is ill advised. I hope this blog helps you think about which option is best for your organization.

—Dr. Tim Oates, Co-Founder & Chief Data Scientist at Synaptiq
