This blog post shows how Retrieval-Augmented Generation (RAG) can be done easily in Dataiku, without writing any code. We want to ask questions about the CPF and retrieve answers based on information in the news releases published by the CPF Board. The CPF (Central Provident Fund) is a mandatory social security savings scheme in Singapore, funded by contributions from employers and employees.
First, news releases on the CPF website were scraped into a dataset containing the headline, URL and text of each news release.
Then we use the Embedding recipe to perform embedding, as the name suggests: it creates vector representations of chunks of text. Specialized embedding models do this; in this case, we select one from OpenAI (note that embedding is handled by a dedicated embedding model rather than a chat model such as GPT-3.5). These vector representations are saved in a vector store, a specialized kind of database that lets us quickly search for the “closest vectors” when we run a query against the information in them.
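Dataiku handles all of this without code, but a small sketch may help show what "searching for the closest vectors" means. The example below is not Dataiku's implementation: it assumes a toy embedding (plain word counts) in place of a real embedding model, and a Python list in place of a real vector store. The function names and sample chunks are made up for illustration.

```python
import math
from collections import Counter


def embed(text):
    """Toy embedding (assumption): word counts instead of a dense model vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[word] * b[word] for word in a if word in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def closest(query, store, k=1):
    """Return the k chunks whose vectors are most similar to the query vector."""
    query_vector = embed(query)
    ranked = sorted(store, key=lambda item: cosine(query_vector, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


# Hypothetical chunks standing in for pieces of news releases.
chunks = [
    "CPF interest rates remain unchanged for the next quarter",
    "New housing grant announced for first-time buyers",
    "Retirement sum will be raised next year",
]

# The "vector store": each chunk is kept next to its vector.
store = [(chunk, embed(chunk)) for chunk in chunks]

print(closest("what are the cpf interest rates", store))
```

A real embedding model places texts with similar meaning close together even when they share no words, which word counts cannot do; the ranking step, however, follows the same idea.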
The output of the Embedding recipe that shows up in the Flow is a Knowledge Bank pointing to the vector store.
Clicking on the Knowledge Bank shows its settings.
Next, within Prompt Studios, we can test a simple prompt against questions saved in a dataset and assess the correctness of the answers.
Note that we will need to select the Retrieval augmented option under the choice of LLM.
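Conceptually, the retrieval-augmented option places the most relevant chunks from the Knowledge Bank in front of the question before it reaches the LLM. The sketch below illustrates that idea only; the prompt wording and the function name `build_augmented_prompt` are assumptions, not what Dataiku sends to the model.

```python
def build_augmented_prompt(question, retrieved_chunks):
    """Combine retrieved text chunks and a question into a single prompt string."""
    # Number each chunk so an answer can refer back to it.
    context = "\n\n".join(
        f"[{i}] {chunk}" for i, chunk in enumerate(retrieved_chunks, start=1)
    )
    return (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Hypothetical retrieved chunks, standing in for results from the vector store.
example_chunks = [
    "CPF interest rates remain unchanged for the next quarter",
    "Retirement sum will be raised next year",
]
prompt = build_augmented_prompt("Are CPF interest rates changing?", example_chunks)
print(prompt)
```

Telling the model to rely only on the supplied context is a common way to keep answers grounded in the retrieved documents rather than in what the model already knows.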
If we’re satisfied with the returned results, we can save the prompt as a Recipe, which will then show up in the Flow.
This is what you see in the Prompt Recipe if you click on it.
Running the recipe then generates answers for all the questions in the dataset.
You can see the sources that were referenced in the answer to check for accuracy.
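Source checking is possible because each retrieved chunk keeps a link back to the news release it came from. As a rough sketch, assuming each retrieved record carries the `headline`, `url` and `text` fields of the scraped dataset (the field names, headlines and example.com URLs below are hypothetical), the list of sources for an answer could be assembled like this:

```python
def list_sources(retrieved_records):
    """Return one 'headline (url)' entry per distinct source, in retrieval order."""
    seen_urls = set()
    sources = []
    for record in retrieved_records:
        # Several chunks can come from the same news release; list it only once.
        if record["url"] not in seen_urls:
            seen_urls.add(record["url"])
            sources.append(f'{record["headline"]} ({record["url"]})')
    return sources


# Hypothetical retrieved records; two chunks share the same news release.
retrieved = [
    {"headline": "Interest rates update", "url": "https://example.com/a", "text": "chunk 1"},
    {"headline": "Interest rates update", "url": "https://example.com/a", "text": "chunk 2"},
    {"headline": "Retirement sum changes", "url": "https://example.com/b", "text": "chunk 3"},
]

for source in list_sources(retrieved):
    print(source)
```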
The answers returned suggest that the retrieval-augmented LLM handles these questions well. And we’re done! We have performed RAG on an internal knowledge base of our own.
Apart from performing RAG, we can also make use of other LLM recipes or write our own prompts in Prompt Studio (before deploying them) to perform text classification and text summarization.
If you want to learn more about what’s needed to perform RAG in Dataiku, check out this blog post.