RAG Chatbot Arena
Tags: artificial intelligence, chatbots, RAG, Elevation

In the ever-changing landscape of AI, Retrieval Augmented Generation (RAG) is one of the latest trends. If you log in to any of the popular platforms that use artificial intelligence, you’ll probably see a disclaimer that reads something like: “chatbot can make mistakes. Check your responses, etc.” One of the key benefits of RAG is mitigating this. If implemented correctly, RAG drastically reduces the chances of an LLM generating incorrect or misleading information, so-called “hallucinations.”

The idea is very simple at its core. We introduce an external knowledge base (it could be a basic text document) to the LLM so that it can reference this information before generating a response. This is also known as grounding, and it serves to optimize and customize the LLM's output. This way, the output stays relevant, trustworthy, and useful in various contexts.
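
The grounding step can be sketched in a few lines. The snippet below is a toy illustration only: it retrieves the best-matching passage by word overlap (a stand-in for the embedding search a real RAG system would use) and prepends it to the prompt. All function names and the scoring heuristic are our own simplifications, not any particular framework's API.

```python
import re

def _words(text: str) -> set[str]:
    """Lowercased alphabetic tokens, ignoring punctuation."""
    return set(re.findall(r"[a-z]+", text.lower()))

def retrieve(query: str, knowledge_base: list[str]) -> str:
    """Return the passage sharing the most words with the query (toy scoring)."""
    q = _words(query)
    return max(knowledge_base, key=lambda passage: len(q & _words(passage)))

def build_grounded_prompt(query: str, knowledge_base: list[str]) -> str:
    """Prepend the retrieved passage so the model answers from it, not from memory."""
    context = retrieve(query, knowledge_base)
    return (
        "Answer using ONLY the context below. If the answer is not "
        "in the context, say you don't know.\n\n"
        f"Context: {context}\n\nQuestion: {query}"
    )

kb = [
    "Soland is a leader in luxurious property acquisition and business initiation.",
    "Citizenship by investment is available in St. Lucia, Grenada, and St. Kitts and Nevis.",
]
prompt = build_grounded_prompt("What is Soland?", kb)
```

The explicit “say you don't know” instruction is what keeps a grounded chatbot from inventing an answer when the knowledge base has nothing relevant.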

RAG is particularly useful for customer service and support. Businesses can feed company-specific data to the chatbot to generate personalized responses and improve service quality.

As you can guess, all the big players in AI have their own RAG chatbots. We’re talking OpenAI with its Assistant, Intercom with Fin, IBM’s watsonx, etc. The list goes on. So, we decided to put these to the test. Is RAG really worth the hype, or just a buzzword?

For the external knowledge base, we will use information gathered from a company called Soland, which specializes in investment services such as citizenship, residency, and real estate. All of the information from their website has been condensed and formatted into a single PDF.

For testing, we opted for Langsmith along with our own (hopefully objective) judgment. Also worth noting: we evaluated only the free trial versions of the products covered.

The list of candidates was long, but we decided to go for the usual suspects: IBM watsonx AI Assistant, Intercom Fin, Anthropic Claude, OpenAI Assistant, as well as our own custom chatbot that we like to call Elevation. It’s still in development, though.

QA Dataset

For Langsmith evaluation to work, we need to provide a dataset, which is basically a set of input–output pairs. Each input is a question and each output is the corresponding answer. These are also called ground-truth questions and answers, since they’re derived from our uploaded PDF file.

Examples from the Dataset

Inputs:

  • "What is Soland?"
  • "Which countries are included in your Citizenship by investment program?"
  • "Is it possible to acquire residency by investment through Soland?"
  • "How much would it cost to acquire citizenship by investment in North Macedonia?"

Outputs:

  • "Soland is a leader in luxurious property acquisition and business initiation."
  • "St. Lucia, Grenada, St. Kitts and Nevis, Antigua and Barbuda, Dominica, Vanuatu, Turkey, and Egypt."
  • "Of course. This is also known as golden visa programs."
  • "I have no knowledge of this."

The outputs are the reference, or desired, answers. They represent the core data used for both the Langsmith and our subjective evaluations.
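
In code, such a dataset is just a list of question–answer pairs. The sketch below builds the four examples above in the parallel inputs/outputs shape that Langsmith's SDK accepts; the dict keys ("question", "answer") and the dataset name are our own choices, and we leave the actual upload commented out since it requires an API key.

```python
# Ground-truth QA pairs derived from the Soland PDF (the examples above).
qa_pairs = [
    ("What is Soland?",
     "Soland is a leader in luxurious property acquisition and business initiation."),
    ("Which countries are included in your Citizenship by investment program?",
     "St. Lucia, Grenada, St. Kitts and Nevis, Antigua and Barbuda, Dominica, "
     "Vanuatu, Turkey, and Egypt."),
    ("Is it possible to acquire residency by investment through Soland?",
     "Of course. This is also known as golden visa programs."),
    ("How much would it cost to acquire citizenship by investment in North Macedonia?",
     "I have no knowledge of this."),
]

# Langsmith takes parallel lists of input and output dicts.
inputs = [{"question": q} for q, _ in qa_pairs]
outputs = [{"answer": a} for _, a in qa_pairs]

# The upload itself would look roughly like this (needs a Langsmith API key):
# from langsmith import Client
# client = Client()
# dataset = client.create_dataset(dataset_name="soland-qa")
# client.create_examples(inputs=inputs, outputs=outputs, dataset_id=dataset.id)
```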

We started by asking each chatbot the same question multiple times to check for any inconsistencies. Then the Langsmith evaluators came into play, specifically the Reference Answer Evaluator, which compares the RAG chain’s answer (the LLM’s output) to the reference answer in our dataset. For our own product, Elevation, we also added the Hallucination Evaluator, which checks whether the generated answer is fully grounded in the retrieved documents. The evaluators were applied to first answers only. For the purposes of this document, we included evaluations only for the examples mentioned above, as including all of them would have made the document too lengthy.

The Bottom Line

RAG chatbots work. They’re excellent at what they do—not perfect, though. They are cost-effective and simple to use. Mistakes creep in here and there, but overall, their performance is impressive, and they mostly deliver on their promises. RAG technology is here to stay, and we can definitely expect rapid advances in this field.

Anthropic Claude led the pack with a 91% success rate, followed closely by Elevation at 88%. Fin was third with 83%, followed by OpenAI Assistant and watsonx at 81% and 77%, respectively.

The Breakdown

Question #1: “What is Soland?”

watsonx AI Assistant:

All the answers from watsonx were consistent and did not contradict one another. One is a bit longer than the others, but they’re fairly similar.

  • Langsmith Reference Answer Evaluator output gave a binary evaluation of CORRECT and explained: "The student's answer also provides additional information about the company's location, services, team, and values. This additional information does not conflict with the context. Therefore, the student's answer is factually accurate."

Elevation:

Our custom chatbot scored similarly. All answers were consistent with one another and in the same tone.

  • Reference Answer Evaluator gave the score of CORRECT, stating: "The student's answer is more detailed than the context provided, but it does not contradict the context. The context states that Soland is a leader in luxurious property acquisition and business initiation. The student's answer agrees with this, stating that Soland specializes in luxurious property acquisition and business initiation. The additional information provided by the student about Soland's location, services, and team does not conflict with the context."

There were no hallucinations, according to the Hallucination Evaluator: "The assistant's response accurately reflects the information provided in the ground truth documentation. It correctly identifies Soland as a Dubai-based company specializing in luxurious property acquisition and business initiation. The assistant also correctly mentions the services offered by Soland, including citizenship, residency, real estate, and business setup. The assistant's response is well-grounded in the provided documents and does not include any hallucinated information. Therefore, I would rate it as follows: Rating: 10."

Anthropic Claude:

One of the answers from Claude was significantly more detailed than the others. But again, there was no conflicting information.

  • Reference Answer Evaluator output: "The student's answer aligns with the context provided, as these services would indeed position Soland as a leader in luxurious property acquisition and business initiation. Therefore, the student's answer is factually accurate. GRADE: CORRECT."

Intercom Fin:

Fin’s answers were almost identical; the difference was just the tense of one word.

  • Reference Answer Evaluator output: "The student's answer is more detailed than the context provided, but it does not contradict the context. Therefore, the student's answer is factually accurate. GRADE: CORRECT."

OpenAI Assistant:

Answers were not identical, but they weren’t conflicting either; they basically used synonyms for the same things.

  • Reference Answer Evaluator output: "The student's answer is more detailed than the context provided, but it does not contradict the context."

Question #2: “Which countries are included in your Citizenship by investment program?”

watsonx AI Assistant:

Answers from watsonx were similar, but they did contradict each other to some extent. The second one omitted one of the countries and added a country that wasn’t on the list. We also tried asking whether there were other countries offering citizenship by investment, and it just made more mistakes.

  • Langsmith Reference Answer Evaluator output: "The student's answer includes St. Lucia, St. Kitts and Nevis, Grenada, and Antigua and Barbuda, which are all part of the context provided. However, the student's answer does not include Dominica, Vanuatu, Turkey, and Egypt, which are also part of the context. The student's answer also mentions 'additional options that can be discussed upon request,' but this is vague and does not specifically mention the other countries in the context. Therefore, the student's answer is not fully accurate. GRADE: INCORRECT."

Elevation:

Answers were consistent and similar.

  • Langsmith Reference Answer Evaluator output: "The student correctly listed St. Lucia, St. Kitts and Nevis, Antigua and Barbuda, Dominica, Vanuatu, Grenada, Turkey, and Egypt as countries included in the Citizenship by investment program. GRADE: CORRECT."

  • Hallucination Evaluator: "The assistant's response is accurate and aligns with the information provided in the ground truth documentation. The assistant correctly lists all the countries mentioned in the documents that offer a Citizenship by Investment program. Therefore, the assistant's response is fully based on the ground truth documentation. Rating: 10."

Anthropic Claude:

One of the answers from Claude was a lot more detailed than others but did not contradict any of them.

  • Langsmith Reference Answer Evaluator output: "The student's answer lists all the countries mentioned in the context: St. Lucia, Grenada, St. Kitts and Nevis, Antigua and Barbuda, Dominica, Vanuatu, Turkey, and Egypt. The student also provides additional information about the programs, which does not conflict with the context. Therefore, the student's answer is factually accurate. GRADE: CORRECT."

Intercom Fin:

Answers were very similar but not identical, and consistent throughout.

  • Langsmith Reference Answer Evaluator output: "The student's answer lists all the countries mentioned in the context: St. Lucia, Grenada, St. Kitts and Nevis, Antigua and Barbuda, Dominica, Vanuatu, Turkey, and Egypt. The student also provides additional information about each country's program, but this information does not conflict with the context. Therefore, the student's answer is factually accurate. GRADE: CORRECT."

OpenAI Assistant:

Answers were very terse and identical to one another. Basically just a list of countries.

  • Langsmith Reference Answer Evaluator output: "The student has listed all the countries mentioned in the context. The countries listed by the student match exactly with the countries provided in the context. Therefore, the student's answer is factually accurate. GRADE: CORRECT."

Question #3: “Is it possible to acquire residency by investment through Soland?”

watsonx AI Assistant:

Some answers were incorrect. The question was about residency by investment, but watsonx provided information on citizenship by investment programs. This was probably the most problematic question for watsonx.

  • Langsmith Reference Answer Evaluator output: "The context confirms that Soland does offer residency by investment programs, also known as golden visa programs. The student's answer also states that Soland offers these programs, and even provides examples of specific programs. Therefore, the student's answer is in line with the context provided. GRADE: CORRECT."

Note: As stated above, the answer is in line with the context provided (“Of course. This is also known as golden visa programs.”) BUT it did mix up citizenship by investment and residency by investment when providing additional details. ...

Elevation:

Answers were quite similar and did not contradict one another.

  • Langsmith Reference Answer Evaluator output: "The student's answer is in line with the context provided. The context mentions that Soland offers residency by investment, also known as golden visa programs. The student's answer correctly states that Soland offers these programs and provides additional information about what these programs offer to individuals. The student's answer does not contradict the context in any way. GRADE: CORRECT."

  • Hallucination Evaluator: "The assistant's response is accurate and grounded in the ground truth documentation. The documents mention that Soland specializes in helping global clients achieve residency and citizenship through dedicated residence by investment programs. Therefore, the assistant's response is in line with the information provided in the documents. Rating: 10."

Anthropic Claude:

Some answers by Claude were more detailed, but all of them were correct and consistent.

  • Langsmith Reference Answer Evaluator output: "The student's answer is in line with the context provided. The context confirms that Soland offers services related to residency by investment, also known as 'golden visa' programs. The student's answer not only confirms this but also provides additional information about the specific programs that Soland can help with. The student's answer does not contradict the context in any way. Therefore, the student's answer is correct. GRADE: CORRECT."

Intercom Fin:

The pattern continued: answers were almost identical, differing only in one sentence that was phrased a bit differently.

  • Langsmith Reference Answer Evaluator output: "The student's answer is in line with the context provided. The context confirms that it is possible to acquire residency by investment through Soland, also known as golden visa programs. The student's answer not only confirms this but also provides additional information about the specific programs offered by Soland. Therefore, the student's answer is factually accurate and does not contradict the context. GRADE: CORRECT."

OpenAI Assistant:

Answers varied slightly in wording, but the substance was basically the same.

  • Langsmith Reference Answer Evaluator output: "The student's answer is in line with the context provided. The context confirms that Soland offers residency by investment, also known as golden visa programs. The student's answer also states that Soland offers these programs and provides additional information about what these programs entail. There are no conflicting statements between the context and the student's answer. GRADE: CORRECT."

Question #4: “How much would it cost to acquire citizenship by investment in North Macedonia?”

watsonx AI Assistant:

Answers were almost identical in length and substance.

  • Langsmith Reference Answer Evaluator output: "The student's answer is in line with the context provided. The context states that there is no knowledge about the cost of acquiring citizenship by investment in North Macedonia. The student's answer reflects this lack of information. Therefore, the student's answer is factually accurate based on the context. GRADE: CORRECT."

Elevation:

Answers were a bit different but not contradictory.

  • Langsmith Reference Answer Evaluator output: "The student's answer does not provide a specific cost for acquiring citizenship by investment in North Macedonia, which is the question asked. However, the context provided does not offer a specific cost either, and in fact, states that North Macedonia is not included in the citizenship by investment programs. The student's answer, while not providing a specific cost, does accurately reflect the information provided in the context. The student's suggestion to contact relevant authorities or agencies for more information is a reasonable response given the lack of specific information in the context. GRADE: CORRECT."

  • Hallucination Evaluator: "The assistant's response is accurate as the ground truth documents do not provide any information about the cost of acquiring citizenship by investment in North Macedonia. The assistant correctly states that it does not have this information. Therefore, the assistant's response is fully based on the ground truth documentation. Rating: 10."

Anthropic Claude:

All the answers were correct, but some were not to the point. They correctly state that the document does not include information about a citizenship by investment program in North Macedonia or the associated costs.

  • Langsmith Reference Answer Evaluator output: "The student's answer does not provide a direct answer to the question asked. The question asks for the cost of acquiring citizenship by investment in North Macedonia, but the student's answer only states that the document does not provide information about a citizenship by investment program in North Macedonia. The student does not provide any information about the cost of such a program in North Macedonia. Therefore, the student's answer does not answer the question."

Note: This evaluation seems a bit off; the answer is in fact true and in line with the context, precisely because this information is not present in the context.

Intercom Fin:

The answers were significantly different. One answer states that the information provided does not specify the cost of acquiring citizenship by investment in North Macedonia through Soland. However, other answers do not mention North Macedonia at all and instead list examples of citizenship by investment programs offered by Soland in other countries. The first answer is acceptable and accurate.

  • Langsmith Reference Answer Evaluator output: "The student's answer is not providing a factual answer to the question. Instead, the student is asking for clarification on the question. Therefore, the student's answer does not provide the information asked for in the question. GRADE: INCORRECT."

OpenAI Assistant:

Most answers were incorrect and not grounded in the documents. The chatbot provided specific information on the cost of citizenship by investment, which suggests it may have fallen back on web search, despite the prompt specifying not to include information not provided in the context.

  • Langsmith Reference Answer Evaluator output: "Since the context does not provide any information about the cost to acquire citizenship by investment in North Macedonia, I cannot verify the accuracy of the student's answer. Therefore, I cannot grade the student's answer as correct or incorrect. GRADE: Cannot be determined."

Final Thoughts

Overall, the results of our evaluation show that while there are discrepancies and areas for improvement, RAG-based chatbots generally provide valuable and reliable information when properly grounded in a curated knowledge base. The different models show varying levels of effectiveness, with some handling nuances and context better than others. Moving forward, further refinement in both the models and the evaluative criteria will help in enhancing the accuracy and usability of these systems.

Copyright 2024 Dinaric.io