~20min read
Introduction
Context
In the modern world, applications such as ChatGPT, based on LLMs (Large Language Models), are becoming ubiquitous. Their potential extends across all sectors, from customer service to human resources management, offering tailored solutions to problems of all kinds.
The many announcements from industry leaders underline the democratization of AI, and the constant evolution of these tools lets companies anticipate gains in performance and operational efficiency through GenAI.
Our R&D process
In view of the advances made in recent months in the field of generative artificial intelligence, Nobori decided to launch an internal initiative to understand and master the key concepts involved. The aim is to use this newly-acquired knowledge to provide our customers with the best possible support in their transition to GenAI applications. We are also committed to developing innovative internal solutions to facilitate the work of our consultants and automate certain processes.
Our R&D has led us to develop a chatbot that implements RAG (Retrieval-Augmented Generation). A chatbot is simply a conversational agent, which in our case relies on an LLM to generate answers to the user’s questions.
Two problems are regularly encountered when interacting with LLMs:
- Hallucinations: a model’s tendency to fabricate information during text generation.
- Outdated answers: some of the information used by models may come from sources that are out of date at the time of generation.
RAG aims to alleviate these problems, by implementing an information retrieval system to provide the model with context when generating responses. This fundamental concept makes it possible to enrich text generation by extracting relevant data beforehand to contextualize and improve the accuracy of answers.
High-level view of the PoC
This PoC is an application that lets you talk to a chatbot. You provide it with documents, such as texts or articles, and then you can ask it questions. It will respond with information or thoughts based on the documents you’ve previously provided. It’s as if you were having a discussion with an intelligent virtual assistant that draws its knowledge from the documents you share with it.
Below is a diagram of the general operation, which will be explained in detail throughout this article:
- First, the user uploads their documents, which go through a step of Document Processing before being stored.
- The user then interacts with the application as with a conventional chatbot. During Query Processing, the question is reformulated to include the conversation history, and used for the rest of the process.
- The Retrieval phase is the core of a document-based Q&A application. The aim is to find relevant information in the documents, based on the user’s query. This information serves as context for answering the query.
- Once the context and the question have been obtained, they are passed on to the final prompt, which is then used by the model to generate our answer. This concludes the Answer Generation step!
Technological landscape
We first looked at a number of existing open-source solutions that partially met our requirements, to understand how they worked and how their projects were structured. These included h2oGPT and PrivateGPT. When we tested them, they proved insufficiently flexible, whereas we were looking for an application offering long-term modularity.
Since we couldn’t find a suitable solution, we decided to develop our own application. We explored framework options to start developing the PoC. We made a comparison between LlamaIndex and LangChain, and finally opted for the latter. It has concrete examples illustrating the various concepts it implements, as well as extensive and explicit documentation. In addition, the project’s GitHub is much more active, which is a major argument to consider when choosing an open-source framework.
Once the choice of framework was made, we needed to figure out a way of accessing models. We interact with an LLM through API requests or via a CLI. Two hosting methods are available:
- External providers such as HuggingFace, together.ai, Azure AI, or Amazon Bedrock.
- Running a model locally, with a tool such as Ollama. The LLM then runs on our own hardware, in this case a laptop.
In the current version of the PoC, we have the choice of using local open-source models or paid models such as GPT-4 from OpenAI. Our application makes it easy to interchange the model(s) used.
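As an illustration, here is a minimal sketch of what swapping models can look like behind LangChain's chat model wrappers. This is not the PoC's exact code: it assumes an OpenAI API key is configured for the hosted option and a local Ollama server for the other, and the import paths depend on the LangChain version.

```python
# Sketch: switching between a hosted model and a local one behind the same
# LangChain interface. Assumes OPENAI_API_KEY is set for the hosted option
# and an Ollama server is running locally for the other.
from langchain_openai import ChatOpenAI
from langchain_community.chat_models import ChatOllama

USE_LOCAL = True

# Both classes expose the same .invoke() interface, so the rest of the
# application does not need to change when the model does.
llm = ChatOllama(model="mistral") if USE_LOCAL else ChatOpenAI(model="gpt-4")

print(llm.invoke("Summarize what RAG is in one sentence.").content)
```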
Focus on the notions of Prompt and Chain
Before getting into detail, let’s simply cover two important notions to keep in mind for the rest of the article:
- Prompt: a series of instructions written in natural language, given to the model to indicate what is expected of its response. This prompt can be a question, a sentence starter, or even a combination of several elements, and serves as a guide for generation. It conditions the model’s understanding of the task at hand, and directly influences the nature and content of the output response.
- Chain: the use of an LLM in isolation is appropriate for simple applications, but more complex applications require the chaining of several LLMs, either with each other or with other components. One of the fundamental principles of the LangChain framework, from which it takes its name, is the concept of chain. A chain is defined very generically as a sequence of calls to different components managed by the framework (prompt, LLM, memory, etc.), which may themselves include other chains.
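To make the chain concept concrete, here is a minimal example in LangChain's expression language (LCEL). It is a sketch rather than PoC code, and assumes a recent LangChain version with any chat model behind the `llm` variable (for instance the one from the previous snippet).

```python
# Minimal chain: prompt -> LLM -> string output parser.
# `llm` can be any LangChain chat model (see the previous snippet).
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

prompt = ChatPromptTemplate.from_template(
    "Explain the concept of {concept} in two sentences."
)

# The | operator composes components into a chain; each component's
# output becomes the next component's input.
chain = prompt | llm | StrOutputParser()

print(chain.invoke({"concept": "Retrieval-Augmented Generation"}))
```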
In-depth process analysis
With these concepts in mind, we can get to the core of the explanations, starting with a more detailed version of the diagram presented earlier:
This representation provides a complete view of how each part of the process works and how they are interconnected.
1. Documents processing
There are two types of data: structured and unstructured. To operate at their best, LLMs need to be fed with clear, well-organized data.
When the information comes from databases or tools like Notion, which store data in a structured way, it’s easy to use. In our case, we’re dealing with PDF documents, which fall into the category of unstructured data. To manage the extraction of elements from this type of document, we use the ETL (Extract, Transform, Load) tool Unstructured.
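For illustration, a minimal sketch of extracting elements from a PDF with the Unstructured library. The file name is a placeholder, and the exact element types returned depend on the document and the library version.

```python
# Sketch: extracting elements (narrative text, titles, tables, ...) from a PDF
# with Unstructured. "report.pdf" is a placeholder file name.
from unstructured.partition.pdf import partition_pdf

elements = partition_pdf(filename="report.pdf")

for element in elements:
    # Each element carries its type (NarrativeText, Title, Table, ...) and the
    # extracted text, which can then be grouped into chunks.
    print(type(element).__name__, ":", element.text[:80])
```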
Document processing management is based on the Parent Document concept. The original document is broken down into sections, called chunks, which are then stored.
Modifications are then made to these chunks, such as the creation of summaries (chunks could also be divided again to obtain smaller sections). An ID maintains the link between the original chunks and the summarized chunks, which are then saved as embeddings. This method makes it easy to search through embeddings and still obtain usable results.
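LangChain implements this idea with its MultiVectorRetriever, which keeps original chunks in a docstore and summary embeddings in a vectorstore, linked by a shared ID. Below is a hedged sketch of that setup, not the PoC's exact code: the chunks and summaries are placeholders (in practice the summaries would be produced by an LLM chain), and it assumes an OpenAI key for the embeddings.

```python
# Sketch: original chunks in a docstore, summary embeddings in a vectorstore,
# linked by a shared ID, using LangChain's MultiVectorRetriever.
import uuid
from langchain.retrievers.multi_vector import MultiVectorRetriever
from langchain.storage import InMemoryStore
from langchain_community.vectorstores import Chroma
from langchain_core.documents import Document
from langchain_openai import OpenAIEmbeddings

chunks = ["<original chunk 1>", "<original chunk 2>"]         # from the PDF ETL
summaries = ["<summary of chunk 1>", "<summary of chunk 2>"]  # produced by an LLM

ids = [str(uuid.uuid4()) for _ in chunks]

vectorstore = Chroma(collection_name="summaries", embedding_function=OpenAIEmbeddings())
docstore = InMemoryStore()  # original chunks kept in RAM, as in the PoC

retriever = MultiVectorRetriever(vectorstore=vectorstore, docstore=docstore, id_key="doc_id")

# Summaries go into the vectorstore (as embeddings), originals into the docstore;
# the shared ID lets the retriever map a matched summary back to its parent chunk.
retriever.vectorstore.add_documents(
    [Document(page_content=s, metadata={"doc_id": i}) for s, i in zip(summaries, ids)]
)
retriever.docstore.mset(list(zip(ids, [Document(page_content=c) for c in chunks])))
```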
But what does embedding mean? Let’s find out!
An embedding is a vector representation commonly used by AI models to facilitate data manipulation.
In our case, a word embedding can have hundreds of values, each representing a different aspect of a word’s meaning. Example with the word "framework" in the sentences "LangChain is a framework" and "Framework provides a structure for software development":
As the model processes the set of words represented by a sentence, in this case "LangChain is a framework", it produces a vector (a list of values) and adjusts it according to the proximity of each word to other words in the training data.
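As a short illustration, here is how text can be turned into embeddings with LangChain's embedding wrappers. This sketch assumes an OpenAI key; any other embedding model could be substituted.

```python
# Sketch: computing embeddings for two sentences containing "framework".
# Assumes OPENAI_API_KEY is set; any embedding model could be used instead.
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings()

vectors = embeddings.embed_documents([
    "LangChain is a framework",
    "Framework provides a structure for software development",
])

# Each sentence becomes a list of floats (hundreds or thousands of values);
# semantically close sentences end up with geometrically close vectors.
print(len(vectors[0]), vectors[0][:5])
```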
At the end of the document processing step, a mixed storage is obtained, containing both chunks and embeddings. This storage will be used in the Retrieval phase, which appears later in the process.
2. Query processing
Processing the user’s query involves a rephrasing step based on the chat history. This allows the information previously exchanged to be integrated into the new question.
Here’s an example of a prompt that could be used for this purpose:
Given the following conversation and a follow up question, rephrase the follow up question to be a standalone question.
At the end of standalone question add this ‘Answer the question in {language}.’
Chat History:
{chat_history}
Follow Up Input: {question}
Standalone question:
Variables between curly brackets are replaced by their value when the prompt is used.
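In LangChain, this would typically be a PromptTemplate whose variables are filled at runtime. Here is a minimal sketch using the wording above; the chat history and follow-up question are invented examples, and a chat model would then turn the filled prompt into the standalone question.

```python
# Sketch: filling the condense-question prompt at runtime.
from langchain_core.prompts import PromptTemplate

condense_prompt = PromptTemplate.from_template(
    "Given the following conversation and a follow up question, rephrase the "
    "follow up question to be a standalone question.\n"
    "At the end of standalone question add this 'Answer the question in {language}.'\n"
    "Chat History:\n{chat_history}\n"
    "Follow Up Input: {question}\n"
    "Standalone question:"
)

filled = condense_prompt.format(
    language="English",
    chat_history="Human: What is RAG?\nAI: Retrieval-Augmented Generation is...",
    question="What problems does it solve?",
)
print(filled)  # this text is then sent to the LLM to get the standalone question
```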
3. Retrieval
This phase is the core of a document Q&A application. As previously mentioned, the retriever is based on a mixed storage:
- docstore: chunks of original documents
- vectorstore: embeddings of chunk summaries
In our case, the documents (docstore) are stored in RAM, while the vectorstore uses a vector database called ChromaDB.
We then perform a similarity search on the contents of the vectorstore, using the rephrased question obtained earlier, and in return we get the chunks containing the relevant information that will serve as context for the final LLM.
We thus succeed in retrieving chunks present in the original documents that contain information useful for generating the answer.
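With the mixed storage built during document processing, this lookup is a single call on the retriever. The sketch below reuses the `retriever` from the earlier MultiVectorRetriever snippet and assumes a recent LangChain version where retrievers expose `invoke`.

```python
# Sketch: retrieving the parent chunks relevant to the rephrased question,
# reusing the MultiVectorRetriever built during document processing.
standalone_question = "What problems does RAG solve?"  # output of Query Processing

# The similarity search runs on the summary embeddings; the retriever then
# returns the corresponding original chunks from the docstore.
relevant_chunks = retriever.invoke(standalone_question)

context = "\n\n".join(doc.page_content for doc in relevant_chunks)
```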
Let’s take a closer look at similarity search:
Similarity search can be applied in various ways. For the purposes of this article, we’ll focus solely on the one we use, namely cosine similarity. This consists of finding the vectors closest to the one representing the question formulated by the user.
Mathematically, cosine similarity measures the cosine of the angle between two vectors in a multi-dimensional space, based on the idea that vectors pointing in similar directions are considered similar.
The smaller the angle, the more similar the vectors.
It is thanks to this mathematical formula that the retriever works out which document chunks are most relevant to the user’s question.
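For reference, cosine similarity between two vectors A and B is (A · B) / (‖A‖ × ‖B‖); values close to 1 mean the vectors point in nearly the same direction. A small NumPy sketch with toy vectors (real embeddings have hundreds of dimensions):

```python
# Sketch: cosine similarity = (A . B) / (||A|| * ||B||).
# Values close to 1 mean the vectors point in nearly the same direction.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

question_vec = np.array([0.2, 0.7, 0.1])    # toy embeddings, not real ones
chunk_vec = np.array([0.25, 0.65, 0.05])

print(cosine_similarity(question_vec, chunk_vec))  # close to 1 -> similar
```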
We now have everything we need: a question and the information required to answer it. All that’s left is to make use of them to conclude the process.
4. Answer generation
The prompt used by the LLM to generate the final response is made up of several sections, some built from the previous steps and others pre-written:
- The "System" section is hard-coded in the prompt. It is used to tell the model what behavior to adopt and how to generate the response.
- The "Context" section contains the chunks of documents used as context to generate the response.
- The "Query" section contains the user’s question, reformulated with the chat history.
- The "Assistant" section is also hard-coded, and specifies where the model should start generating.
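Put together, the final prompt can be expressed as a template like the one below. This is a hedged sketch: the System wording is illustrative rather than the PoC's exact prompt, and the context and query values are placeholders for the outputs of the previous steps.

```python
# Sketch: assembling the final prompt from the four sections described above
# and generating the answer. The System wording is illustrative, not the PoC's.
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4")  # or a local model via ChatOllama

answer_prompt = ChatPromptTemplate.from_template(
    "System: You are an assistant that answers strictly from the provided context. "
    "If the answer is not in the context, say you do not know.\n\n"
    "Context:\n{context}\n\n"
    "Query: {query}\n\n"
    "Assistant:"
)

answer_chain = answer_prompt | llm | StrOutputParser()

answer = answer_chain.invoke({
    "context": "<chunks returned by the retriever>",
    "query": "<rephrased user question>",
})
```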
To facilitate the use of the PoC, the generated content is sent directly to the user interface. We are currently using Streamlit, a solution offering easy-to-use features for the development of LLM-based applications.
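On the interface side, Streamlit's chat components make this straightforward. The following is a minimal sketch rather than the PoC's interface; `generate_answer` is a placeholder standing in for the RAG chain described above.

```python
# Sketch of a minimal Streamlit chat page around the RAG chain.
# Run with: streamlit run app.py
import streamlit as st

def generate_answer(question: str) -> str:
    # Placeholder: in the PoC this calls the retrieval + answer-generation chain.
    return f"(answer to: {question})"

st.title("Document Q&A")

if question := st.chat_input("Ask a question about your documents"):
    with st.chat_message("user"):
        st.write(question)
    with st.chat_message("assistant"):
        st.write(generate_answer(question))
```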
This concludes the explanation of how the PoC works, and leads to an opening on the challenges encountered during development and the limitations of the product in its current state.
Challenges and limitations
Prompt Engineering: One of the most important areas for improvement in the PoC results is prompt engineering. The application is based on chains made up of different LLMs, and their prompts are key to increasing the relevance and quality of responses.
Context quality and size: In this version of the PoC, we chose to summarize the chunks of the original documents and store them as embeddings. In reality, we could simply store several versions of the chunks without making summaries, to avoid losing information. To do so, we would have to split the chunks again and keep all the embeddings in our storage. We are keeping an eye on current progress to find, or develop, a solution to this problem. For example, in its latest announcements, OpenAI released a version of GPT-4 with a context window of 128k tokens, a higher capacity than the vast majority of existing LLMs.
Generic chunk storage: The last point to mention is the storage of original document chunks used by the retriever. The PoC manipulates both text and tables, but does not store chunks from one session to the next. This is because, currently, the framework does not offer the possibility of storing different types of chunks other than in RAM. This slows down testing and is not a viable long-term solution. If, in the future, the PoC also has to manage images, as mentioned in this article, this problem will become even more pressing. It will then be necessary to set up storage that can handle text, tables and images, while still being usable by the retriever.
Conclusion
At our scale, we do not claim to develop a product capable of answering every problem, but a solution capable of handling specific use cases.
As stated in the limitations, the emphasis is placed on improving results through document processing. We are keeping an eye on the evolution of RAG techniques in order to process documents more efficiently. With this in mind, we are currently exploring solutions to make our RAG multimodal, i.e. able to process images in the same way as text or tables.
Progress in open-source technology augurs well for the use of LLMs by small organizations and individuals. We can mention the work of Mistral AI, which has enabled the development of several models based on Mistral-7B, such as Zephyr-7B-beta. These LLMs have performance levels that rival paid models like GPT-3.5 and can be hosted locally, making them accessible to the general public.
To round off this article, we’d like to express our gratitude to the open-source community, without whom this PoC wouldn’t exist in its current form. The resources made available and the dedication of the community have contributed to the emergence of numerous initiatives which, we are sure, will help the whole ecosystem to thrive.
We hope that this article will contribute in its own way to sharing the knowledge acquired throughout the development of this PoC, and that it will inspire others to explore this fascinating subject.
References
- HuggingFace: https://huggingface.co/blog/inference-endpoints-llm
- together.ai: https://www.together.ai/products#inference
- Azure ML: https://ml.azure.com/model/catalog
- Amazon Bedrock: https://aws.amazon.com/fr/bedrock/
- Ollama: https://ollama.ai
- h2oGPT: https://gpt.h2o.ai
- privateGPT: https://github.com/imartinez/privateGPT
- LangChain: https://www.langchain.com
- LlamaIndex: https://www.llamaindex.ai
- Mistral: https://mistral.ai
- Unstructured: https://unstructured.io/
- ChromaDB: https://docs.trychroma.com/
Written by:
- Ewan Lemonnier
- Etienne Le Gourrierec