Comparing LLMs on "Real-World" Retrieval

The Observable notebook for this report (containing the data) is available here.

I have been struggling to quantitatively and qualitatively compare LLMs. It has become increasingly hard to rely on standard evals on standard benchmarks: every new model beats the previous ones, and no one knows whether the models were trained on the benchmarks.

How accurate are LLMs in reasoning over unstructured data, especially when confronted with information that may not be in their training data? The existing needle-in-the-haystack benchmarks are a good start, but they are a bit limited for two reasons:

The documents are likely to be in the LLMs’ training set (e.g., Paul Graham’s essays)
The questions and inserted facts are not relevant to any particular task

So, I decided to make my own small task and evaluate the models! This is more of an educational exercise for me, and not intended to be a rigorous eval at all.

The dataset consists of ~85 transcripts of video calls between doctors and their patients, pulled from train.csv and valid.csv in this repository.

I ask 3 questions (a simple question, a medium-difficulty question, and a hard question) to the 9 instruction-tuned models listed below:

Model	Input Token Limit	Last Updated
OpenAI GPT-3.5-Turbo-16k	16k	July 2023
OpenAI GPT-4 11-06 preview	128k	November 2023
Anthropic Haiku	200k	March 2024
Anthropic Sonnet	200k	February 2024
Anthropic Opus	200k	February 2024
Gemini Pro	32k	February 2024
Mistral-7B	128k	March 2024
Gemma-7B	8k	March 2024
DBRX-Instruct	32k	March 2024

In this report, I show each model’s performance, as well as the responses for each question. You can hover over the scatter plots to see each model’s answers vs. the reference answer. At the end, I run the synthetic needle-in-a-haystack evals.

Note that I manually created the reference answer, with a lot of data wrangling, so there are bound to be mistakes. I’m not trying to have perfect data; that would take ages :-)

Simple Question: What is the first name of the patient?

Takeaways

Gemma is pretty bad; the rest of the models are good
Typically the name is mentioned in the first couple of sentences of the transcript. If not, the name is likely mentioned at the end of the transcript (e.g., “Bye [name], have a good day”)
DBRX gives super long answers?
This is the most basic of questions to make sure the pipeline works, and TLDR: it works

Medium: What medications is the patient currently taking?

I realize this question is grammatically incorrect; oh well.

Here, I define recall as the proportion of medicine-like words in the reference answer that also appear in the model’s output. A word counts as medicine-like if it sounds like a medicine name and is longer than 5 characters. I did a lot of manual blacklisting and curation to get these words, but I bet my logic isn’t exhaustive.
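To make the metric concrete, here is a minimal sketch of how such a recall could be computed. It is not the notebook's actual code: the `BLACKLIST` entries, the regex tokenizer, and the function names are all assumptions for illustration, and the real curated word list is not shown in this post.

```python
import re

# Hypothetical blacklist of long words that are not medicine names.
# The real curated list is not shown in the post; these are placeholders.
BLACKLIST = {"patient", "currently", "taking", "medication", "medications",
             "tablet", "tablets"}

def medicine_like_words(text, min_len=6, blacklist=BLACKLIST):
    """Return lowercase words longer than 5 characters that are not blacklisted."""
    words = re.findall(r"[a-z]+", text.lower())
    return {w for w in words if len(w) >= min_len and w not in blacklist}

def medication_recall(reference, output):
    """Fraction of medicine-like reference words that also appear in the output."""
    ref_words = medicine_like_words(reference)
    if not ref_words:
        return None  # no medications in the reference, so recall is undefined
    out_words = set(re.findall(r"[a-z]+", output.lower()))
    return len(ref_words & out_words) / len(ref_words)
```

With this sketch, a reference of "Lisinopril and metformin daily" and an output that only mentions Lisinopril scores 0.5. Exact word matching like this would also penalize a model that rephrases a medication name slightly.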

Takeaways

NOTE: The idx/position here is the average of the positions of each medication in the transcript, so that’s why you see floats

The task, listing medications, simply asks the model to regurgitate various medications from the transcript
Medications currently taken are distributed mostly in the first third of the transcript
Opus is slightly better than GPT-4 in my metrics. But if you hover over the bad answers in the scatterplot for GPT-4, you’ll find that GPT-4 rephrases the medications slightly
Claude models follow a super consistent and slightly more verbose structure?
Personally I would prefer Opus to GPT-4 here
All models degrade in performance when the answer is in the second and third segments of the transcript (note that segments are equally sized)
Gemini fails to return any meaningful result (e.g., it returns empty string or N/A) for several of the calls
Haiku and Sonnet look eerily similar in performance (but clearly have different responses)
DBRX has super long responses (and the beginning of each response seems to be half system prompt? Maybe I prompted incorrectly)
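For reference, the averaged position from the note above and the equally sized segments could be computed roughly like this. This is a sketch under assumptions: positions are taken as character offsets of the first occurrence of each medication, and segments are thirds of each individual transcript. The post does not say which unit or segmentation the notebook uses, and the function names are hypothetical.

```python
def mean_position(transcript, medications):
    """Average character offset of the first occurrence of each medication found."""
    lowered = transcript.lower()
    positions = [lowered.find(m.lower()) for m in medications]
    found = [p for p in positions if p >= 0]  # find() returns -1 when absent
    return sum(found) / len(found) if found else None

def segment_of(position, transcript_length, n_segments=3):
    """Map a position to one of n equally sized segments (0-based)."""
    if transcript_length <= 0:
        raise ValueError("transcript_length must be positive")
    segment = int(position * n_segments / transcript_length)
    return min(segment, n_segments - 1)  # clamp a position at the very end
```

Averaging is what produces the fractional positions: two medications at offsets 5 and 20 give a position of 12.5.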

Hard: What is the longest sentence the patient said?

I compute Jaccard similarity between the model’s response and the reference answer.
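As a quick illustration, Jaccard similarity is the size of the intersection of two sets divided by the size of their union. The sketch below assumes the sets are lowercase word tokens; the post does not say whether the notebook compares words, characters, or something else, so treat the tokenizer as a placeholder.

```python
import re

def tokenize(text):
    """Split text into a set of lowercase word tokens (assumed granularity)."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))

def jaccard_similarity(response, reference):
    """|A intersect B| / |A union B| over the word sets of the two strings."""
    a, b = tokenize(response), tokenize(reference)
    if not a and not b:
        return 1.0  # two empty strings are treated as identical
    return len(a & b) / len(a | b)
```

Because the comparison is set-based, word order and repeated words are ignored, and any extra text a model wraps around the quoted sentence lowers the score.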

Takeaways

GPT-4 seems significantly better than the other models, and the top two models are OpenAI models
All models degrade in performance when the answer is in the second and third segments of the transcript (note that segments are equally sized)
Transcript quality is really bad. Often the quotes that are labeled “patient” seem like they are coming from the doctor (and vice versa). I wonder what would happen if we flip the patient-doctor labels and rerun this experiment :-)
Mistral seems to append a word count to its response? No idea why, and these counts are not always correct
The better-performing models give a consistent structure
Gemini is just…bad?
DBRX is really bad…maybe I prompted it wrong? I did copy the system prompt they gave on Hugging Face

A Final Test: Synthetic Eval

How do the models perform on the standard needle-in-the-haystack eval, where the haystack is our transcripts?

To test this, we leverage a similar experimental setup to what’s been done before, but we focus on retrieving a single needle.

For each transcript, we randomly select 3 toppings from a list of 10 pizza toppings, and concatenate them into a sentence. E.g., The secret ingredients needed to build the perfect pizza are: Espresso-soaked dates, Lemon and Goat cheese.

Next, we (uniformly) randomly pick an index in the document to insert this sentence.
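The two steps above can be sketched as follows. The function names are hypothetical, the topping list is passed in because the post only names three of the ten toppings, and the insertion index is assumed to be a character offset (so the needle may land mid-word), which the post does not specify.

```python
import random

PREFIX = "The secret ingredients needed to build the perfect pizza are: "

def build_needle(toppings, rng, k=3):
    """Pick k distinct toppings (k >= 2) and join them into one sentence."""
    chosen = rng.sample(toppings, k)
    sentence = f"{PREFIX}{', '.join(chosen[:-1])} and {chosen[-1]}."
    return chosen, sentence

def insert_needle(transcript, needle, rng):
    """Insert the needle at a uniformly random character offset."""
    idx = rng.randint(0, len(transcript))  # randint is inclusive on both ends
    haystack = transcript[:idx] + " " + needle + " " + transcript[idx:]
    return haystack, idx
```

Passing a seeded `random.Random` instance as `rng` keeps the chosen toppings and the insertion index reproducible across runs.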

Then, we ask each model the following question: What are the secret ingredients needed to build the perfect pizza?

For each model and document, we compute the recall of secret ingredients (i.e., 0, 0.33, 0.67, or 1.0).
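With three ingredients per needle, the recall can only take four values. A minimal sketch, assuming an ingredient counts as retrieved when it appears as a case-insensitive substring of the model output (the notebook's actual matching rule is not shown, and the function name is hypothetical):

```python
def ingredient_recall(chosen, output):
    """Fraction of the inserted ingredients that appear in the model output."""
    text = output.lower()
    hits = sum(1 for ingredient in chosen if ingredient.lower() in text)
    return hits / len(chosen)
```

Substring matching is lenient toward verbose answers: extra text around the ingredients does not hurt the score, while an empty output scores 0.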

Takeaways

NOTE: The max idx is larger here (~13k) because all transcripts contain an answer. In previous questions, the super-long transcripts didn’t contain any answers, so they weren’t included in the analysis.

None of the models are perfect! But other people’s evals show perfect results? I guess this medical transcript data wasn’t in the training set, so who knows what the result would be.
GPT-3.5 is the best!! That is quite surprising. I double- and triple-checked my code and couldn’t find a bug, but maybe there is one. Also, the GPT-4 responses are quite verbose, which aligns with my expectations.
The models either get all 3 or get none, with the exception of Gemma, which occasionally retrieves 2 of the 3??
Claude models are worse than Mistral
Surprisingly Mistral gets the ingredients, but you can see that its responses are a bit verbose
Gemini is quite bad, but many of the outputs are empty so that explains the poor performance
DBRX completely flops on all but 1!! It does not even output anything…again, maybe I’m sampling wrong
Sometimes the models are self-aware, and say that the transcript is not about pizza :-)