Retrieval-Augmented Generation (RAG) Application with LLamaIndex, Hugging Face, and QDrant Vector Database

Wajeeh Ul Hassan
5 min read · Sep 24, 2024


In the age of massive data and sophisticated AI models, one area has advanced significantly: large language models (LLMs). However, most LLMs are trained on broad datasets and are often too generalized. The challenge isn’t just generating responses with an LLM, but generating relevant and contextual answers based on external knowledge, that is, data the LLM was not trained on. This is where Retrieval-Augmented Generation (RAG) steps in.


LLMs are general:

You can ask a general LLM who Messi is, and it will be able to tell you, but it doesn’t know who you are. Now suppose you teach it your company’s data or your own data: it will be able to tell you who you are too. Yes, you are a superstar now!

RAG combines document retrieval with language generation, enhancing a model’s capacity to deliver precise and informed responses by using both external and internal knowledge. This article will walk you through how I built a RAG-based application using LLamaIndex, QDrant, and Hugging Face models and embeddings, with an external data source.

By the end, you’ll get a sense of how to structure your data pipeline, integrate it with an LLM, and build a dynamic RAG-based system that can actually learn from your data!

What are other ways of teaching an LLM?

There are several ways of teaching an LLM: RAG, prompt engineering, or fine-tuning. In this article we will use the RAG approach.

I previously fine-tuned BERT for sentiment analysis on Kubeflow, which you can find here.

Why RAG?

Large Language Models (LLMs) like GPT can generate text with impressive fluency, but they’re restricted by their training data. What if you wanted to build a chatbot that answers questions about restaurant reviews in real time, but the model doesn’t “know” about the latest reviews?

RAG solves this problem by pairing retrieval and generation:
- Retrieval: Instead of relying solely on the LLM’s internal knowledge, the model first retrieves relevant documents from an external source that we define.
- Augmented Generation: The LLM then generates responses based on both its own knowledge and the retrieved documents.

With this approach, the responses stay fresh, relevant, and factual based on real-time data.

For better performance, update the data source frequently with the latest data.

The Tech Stack

To implement RAG, I used a combination of:

- LLamaIndex
- QDrant
- Hugging Face models and embeddings
- Streamlit
- An LLM model from Hugging Face
- Ollama or LM Studio for running the LLM on a local machine

Scraping Data | Any source

We need data to fuel the RAG pipeline, so the first step is to scrape it (or gather it from any other source) to feed the LLM. I downloaded the contents and stored them in a downloaded_content folder.
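
Here is a minimal sketch of what that download step might look like; the URLs below are placeholders rather than the actual sources used in this project:

from pathlib import Path
import requests

# Placeholder URLs; replace with the pages you actually want to scrape
urls = [
    "https://example.com/reviews/page-1",
    "https://example.com/reviews/page-2",
]

out_dir = Path("downloaded_content")
out_dir.mkdir(exist_ok=True)

for i, url in enumerate(urls):
    resp = requests.get(url, timeout=30)
    resp.raise_for_status()
    # Save each page as an HTML file for later parsing
    (out_dir / f"page_{i}.html").write_text(resp.text, encoding="utf-8")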

Now, we have the review texts ready for processing!

Building the Document Index with LLamaIndex

To make retrieval efficient, I used LLamaIndex to create a document index from the downloaded data. LLamaIndex lets you break large documents into smaller pieces (nodes), which can then be retrieved based on user queries, so we can query the data effectively.

Splitting the reviews into meaningful chunks is necessary; otherwise the model would be inefficient and prone to irrelevant results if we fed it the entire document.

We can use SimpleDirectoryReader from LLamaIndex to read the directory and UnstructuredReader to parse the downloaded documents.
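
A minimal sketch of that loading and chunking step, assuming a recent llama-index package layout; the chunk size and overlap are illustrative values:

from llama_index.core import SimpleDirectoryReader
from llama_index.core.node_parser import SentenceSplitter
from llama_index.readers.file import UnstructuredReader

# Read everything in downloaded_content, using UnstructuredReader for the HTML files
documents = SimpleDirectoryReader(
    "./downloaded_content",
    file_extractor={".html": UnstructuredReader()},
).load_data()

# Split the documents into smaller nodes (chunks) for retrieval
splitter = SentenceSplitter(chunk_size=512, chunk_overlap=50)
nodes = splitter.get_nodes_from_documents(documents)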

Using Hugging Face for Embeddings

Now we need to transform the data into embeddings. We can use Hugging Face embeddings to turn the text into high-dimensional vectors. For similarity search, we are using the QDrant vector database; we could also have used Pinecone, PGVector, etc.

I used a model from Hugging Face to generate the embeddings:
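
Something along these lines works with the llama-index Hugging Face integration; the model name below is just an example, not necessarily the one used in this project:

from llama_index.core import Settings
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

# Example sentence-embedding model from Hugging Face; swap in any embedding model you prefer
Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")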

With these embeddings, I stored the vectors in QDrant, which allows efficient retrieval using vector similarity search.

Integrating QDrant for Fast Vector Retrieval

QDrant is a vector search engine that stores the embeddings and helps retrieve relevant reviews when queried.

We can create a QDrant vector database with the following docker-compose script:

version: '3.5'

services:
  qdrant:
    image: qdrant/qdrant:latest
    restart: always
    container_name: qdrant
    ports:
      - 6333:6333
      - 6334:6334
    expose:
      - 6333
      - 6334
      - 6335
    configs:
      - source: qdrant_config
        target: /qdrant/config/production.yaml
    volumes:
      - ./qdrant_data:/qdrant/storage

# Minimal definition for the config referenced above; adjust the Qdrant settings as needed
configs:
  qdrant_config:
    content: |
      log_level: INFO

I connected QDrant to store and retrieve vectors based on the embeddings generated in the previous step. This allows us to find the closest (most semantically similar) reviews for any given query.
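
A sketch of wiring QDrant into LLamaIndex, reusing the nodes and embedding model from the earlier steps; the collection name restaurant_reviews is just an example:

import qdrant_client
from llama_index.core import StorageContext, VectorStoreIndex
from llama_index.vector_stores.qdrant import QdrantVectorStore

# Connect to the Qdrant instance started by docker-compose
client = qdrant_client.QdrantClient(host="localhost", port=6333)

# Store the embedded nodes in a Qdrant collection and build the index over it
vector_store = QdrantVectorStore(client=client, collection_name="restaurant_reviews")
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex(nodes, storage_context=storage_context)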

Now, when a user queries the system, we first retrieve the closest matching reviews from QDrant based on the user’s question.

Retrieval and Generation

Finally, with both the document retrieval and the LLM set up, we implement the retrieval-augmented generation process. We can now use a query engine to query the index and generate answers from the external data source (see the sketch after the steps below).

When a user asks a question, the following steps take place:

  1. Retrieve: Search QDrant for the most relevant reviews based on the user’s query.
  2. Augment: Pass the retrieved reviews to the Hugging Face LLM.
  3. Generate: The LLM generates a response, combining both its internal knowledge and the retrieved reviews.
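
A minimal sketch of that flow, assuming a local model served through Ollama; the model name is an example, and any LLM from Hugging Face, Ollama, or LM Studio can be plugged in the same way through LLamaIndex:

from llama_index.core import Settings
from llama_index.llms.ollama import Ollama

# Use a locally running model via Ollama; the model name is illustrative
Settings.llm = Ollama(model="llama3.1", request_timeout=120.0)

# The query engine retrieves the top matches from QDrant and passes them to the LLM
query_engine = index.as_query_engine(similarity_top_k=3)
response = query_engine.query("What do reviewers say about the desserts?")
print(response)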

Building the Frontend with Streamlit

To make the experience user-friendly and to build the UI quickly, we can use Streamlit for a simple UI; for more complex interfaces we would need a different front-end stack. Even a mobile app could be built to serve the LLM.
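
A small sketch of what that Streamlit app might look like, loading the index back from the QDrant collection created earlier; it assumes Settings.embed_model and Settings.llm are configured as in the previous steps, and the collection name matches the example above:

import streamlit as st
import qdrant_client
from llama_index.core import VectorStoreIndex
from llama_index.vector_stores.qdrant import QdrantVectorStore

@st.cache_resource  # build the query engine once per session
def get_query_engine():
    # Assumes Settings.embed_model and Settings.llm are already configured
    client = qdrant_client.QdrantClient(host="localhost", port=6333)
    vector_store = QdrantVectorStore(client=client, collection_name="restaurant_reviews")
    index = VectorStoreIndex.from_vector_store(vector_store)
    return index.as_query_engine(similarity_top_k=3)

st.title("Review Q&A with RAG")
question = st.text_input("Ask a question about the reviews")
if question:
    with st.spinner("Retrieving and generating..."):
        answer = get_query_engine().query(question)
    st.write(str(answer))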

Conclusion

With LLamaIndex for indexing, QDrant for vector similarity search, and Hugging Face for generation, we have a powerful RAG-based system capable of answering questions based on real-world, constantly updated data. This system leverages the strengths of retrieval and generation, making it possible to query large datasets like reviews and get meaningful, context-aware responses.

If you’re looking to build a smart, responsive application with external data sources, RAG is a fantastic architecture to explore. This project is just a glimpse of its capabilities, and with more datasets and models, you can build even more robust and specialized systems.

We implemented this LLM application on a smaller scale; to learn how to scale machine learning models and serve millions of users, check out Hands On MLOps With Kubeflow.

I hope you enjoyed the article.
