
Evaluating LLM Accuracy

By Justin Rahardjo on Aug 21, 2024

LLMs are everywhere now. We're building all kinds of applications with them to make our lives easier. However, accuracy is not always guaranteed: models can hallucinate and may struggle with quantitative responses. While many reports evaluate the models themselves, those results don't always translate well to our own applications. So, what should we consider when evaluating an LLM for our specific application?

When evaluating a particular application or task, I like to think of a combination of prompt, model, and configurations as the input to the system. Adjusting any of these—like tweaking the model’s temperature—changes the output. Even removing one or two words from the prompt can alter the outcome. So, let’s consider all these factors as part of the input, understanding that any adjustments mean we’re changing the input itself.
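If it helps to make that concrete, you can picture the whole bundle as a single data structure (a rough sketch; the field names here are just illustrative):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class LLMInput:
    """Everything that influences the output is treated as one input."""
    prompt: str
    model: str         # e.g. "gpt-4" or "claude-3-opus"
    temperature: float
    # ...any other model parameters you tune

# Changing any field here, even a single word in the prompt,
# produces a different input and potentially a different output.
```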

Throughout this article, I’ll use the example of categorizing customer support tickets to explain everything, but these methods are applicable to any use case. Let’s dive in.

Garbage In, Garbage Out

Before we even start measuring the accuracy of our application, we need a good dataset. This should be a list of example inputs and expected outputs that are accurate for your domain, application, or industry. Here’s an example:

| ID | Support Ticket | Category |
|----|----------------|----------|
| 1 | “I forgot my password and can’t log in. Can you help me reset it?” | Account Management |
| 2 | “My credit card was charged twice for the same order. How do I get a refund?” | Billing |
| 3 | “The website keeps giving me an error when I try to submit my order. Can you help?” | Tech Support |
| 4 | “I received a large instead of a medium size shirt. How can I exchange it?” | Product Support |
| 5 | “I’m having trouble accessing my account.” | Tech Support |

This looks like a solid set of inputs, right? But let’s take a closer look at the last one:

I’m having trouble accessing my account.

Depending on your company, this could fall under “Tech Support” or “Account Management”. Without more context, it’s hard to say. You need to ensure the categorization aligns with your company’s policies, workflows, and expected behaviour. How would a person triaging these tickets classify this one? For our fictional company, they would prefer to categorize this ticket under “Account Management” since it’s more likely to be a password issue than a technical one.

This is why we need to ensure that our inputs are solid before we start measuring accuracy.
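To make this concrete, here's one way the golden dataset above might look in code (a minimal sketch using plain Python structures):

```python
# A "golden" dataset: real-world inputs paired with the outputs
# your domain experts agree are correct.
GOLDEN_DATASET = [
    ("I forgot my password and can't log in. Can you help me reset it?", "Account Management"),
    ("My credit card was charged twice for the same order. How do I get a refund?", "Billing"),
    ("The website keeps giving me an error when I try to submit my order. Can you help?", "Tech Support"),
    ("I received a large instead of a medium size shirt. How can I exchange it?", "Product Support"),
    # Ambiguous on its own; categorized per our fictional company's policy.
    ("I'm having trouble accessing my account.", "Account Management"),
]
```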

Metrics

Now that we have a good dataset as a foundation, how do we actually measure accuracy? There are many metrics we can use. Here are some common ones that I rely on.

Basic Accuracy

This metric is generally used when you have a right or wrong answer. For our example, it would be whether or not the category output is correct. When it is correct, we give it a score of 1; when it is incorrect, we assign a score of 0. Here’s a sample:

| Expected | LLM Output | Score |
|----------|------------|-------|
| Account Management | Account Management | 1 |
| Account Management | Tech Support | 0 |
| Billing | Product Support | 0 |
| Tech Support | Tech Support | 1 |

Based on this, the total accuracy would be 50% as it got 2 out of 4 correct.
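In code, this is just an exact-match comparison averaged over the dataset. A minimal sketch, using the rows from the table above:

```python
def basic_accuracy(expected: list[str], outputs: list[str]) -> float:
    """Score 1 for an exact match, 0 otherwise, then average."""
    scores = [1 if e == o else 0 for e, o in zip(expected, outputs)]
    return sum(scores) / len(scores)

expected = ["Account Management", "Account Management", "Billing", "Tech Support"]
outputs = ["Account Management", "Tech Support", "Product Support", "Tech Support"]
print(basic_accuracy(expected, outputs))  # 0.5, i.e. 50%
```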

F1 Score

What if we expected a list of categories for each support ticket? For example, we might expect ambiguous tickets to be categorized under multiple categories. Instead of choosing one or the other, the model could specify both. Now, it’s no longer a simple right or wrong scenario. Here’s an example output:

| ID | Expected | LLM Output |
|----|----------|------------|
| 1 | Account Management | Account Management |
| 2 | Account Management, Tech Support | Tech Support, Billing |
| 3 | Billing | Product Support |
| 4 | Product Support, Billing | Product Support, Account Management |

To measure this, I find that the F1 score is the best approach, as it balances false positives and false negatives against true positives. Here’s the formula:

Precision = True Positives / (True Positives + False Positives)
Recall = True Positives / (True Positives + False Negatives)

F1 = 2 * (Precision * Recall) / (Precision + Recall)

Let’s calculate it for ID-2:

True Positives = 1 (Tech Support)
False Positives = 1 (Billing)
False Negatives = 1 (Account Management)

Precision = 1 / (1 + 1) = 1/2
Recall    = 1 / (1 + 1) = 1/2

F1 = 2 * (1/2 * 1/2) / (1/2 + 1/2) = 2 * 1/4 / 1 = 1/2

Doing this across the entire set:

| ID | Expected | LLM Output | Precision | Recall | F1 Score |
|----|----------|------------|-----------|--------|----------|
| 1 | Account Management | Account Management | 1.0 | 1.0 | 1.0 |
| 2 | Account Management, Tech Support | Tech Support, Billing | 0.5 | 0.5 | 0.5 |
| 3 | Billing | Product Support | 0.0 | 0.0 | 0.0 |
| 4 | Product Support, Billing | Product Support, Account Management | 0.5 | 0.5 | 0.5 |

Averaging the F1 scores across the dataset gives us an overall score of 50%.
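Here's the same calculation as a sketch in code, treating each row as a set of labels and averaging the per-row F1 scores:

```python
def f1_score(expected: set[str], output: set[str]) -> float:
    """Per-row F1 over sets of category labels."""
    true_positives = len(expected & output)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(output)  # TP / (TP + FP)
    recall = true_positives / len(expected)   # TP / (TP + FN)
    return 2 * precision * recall / (precision + recall)

rows = [
    ({"Account Management"}, {"Account Management"}),
    ({"Account Management", "Tech Support"}, {"Tech Support", "Billing"}),
    ({"Billing"}, {"Product Support"}),
    ({"Product Support", "Billing"}, {"Product Support", "Account Management"}),
]
scores = [f1_score(expected, output) for expected, output in rows]
print(sum(scores) / len(scores))  # 0.5, i.e. 50%
```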

Mean Squared Error

If we expect the output to be on a scale or rating system, we might want to use a metric that considers how close the output was to the actual answer. Let’s assign these tickets a priority between 1 and 3, with 1 being low priority and 3 being high priority. We want to account for the fact that if we expected a rating of 3 and the output was 1, that should be considered less accurate than if the output was 2. Here’s an example:

| ID | Expected | LLM Output |
|----|----------|------------|
| 1 | 2 - Medium Priority | 3 |
| 2 | 3 - High Priority | 2 |
| 3 | 3 - High Priority | 1 |
| 4 | 1 - Low Priority | 1 |

The metric to use here would be the Mean Squared Error, which measures the amount of error in the output. We can then invert this to get the overall accuracy. Here’s the formula:

MSE = (1/n) * Σ(actual - predicted)²

where:
- n is the number of data points.
- actual is the true value.
- predicted is the predicted value.

Accuracy = 1 - (MSE / Max Squared Error)

Let’s calculate the squared error for the first row:

SE = (actual - predicted)² = (2 - 3)² = (-1)² = 1

Doing it for the rest:

| ID | Expected | LLM Output | Squared Error |
|----|----------|------------|---------------|
| 1 | 2 - Medium Priority | 3 | 1 |
| 2 | 3 - High Priority | 2 | 1 |
| 3 | 3 - High Priority | 1 | 4 |
| 4 | 1 - Low Priority | 1 | 0 |

Now, let’s calculate the rest, noting that the Maximum Squared Error would be (3 - 1)² = 4 based on our scale from 1 to 3.

MSE = (1/n) * Σ(actual - predicted)² = (1/4) * (1 + 1 + 4 + 0) = 1/4 * 6 = 1.5

Accuracy = 1 - (MSE / Max Squared Error) = 1 - (1.5 / 4) = 0.625

So, our accuracy is 62.5%.
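As a sketch in code, assuming the 1-to-3 priority scale above:

```python
def mse_accuracy(expected: list[int], outputs: list[int],
                 scale_min: int = 1, scale_max: int = 3) -> float:
    """Invert the mean squared error into an accuracy on a bounded scale."""
    mse = sum((e - o) ** 2 for e, o in zip(expected, outputs)) / len(expected)
    max_squared_error = (scale_max - scale_min) ** 2
    return 1 - mse / max_squared_error

expected = [2, 3, 3, 1]
outputs = [3, 2, 1, 1]
print(mse_accuracy(expected, outputs))  # 0.625, i.e. 62.5%
```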

Precision

Another thing to measure is the precision, or consistency, of the outputs (not to be confused with the precision term from the F1 calculation above). LLMs are not always consistent. You can test this by running the exact same input (prompt, model, and model parameters) a minimum of 10 times. Ideally, it would produce the same output every time, but due to the nature of LLMs, it may vary. I’ve found that most prompts running against the GPT-4 and Claude 3 models land at around 80-90% precision.
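One simple way to quantify this is to run the identical input repeatedly and see how often the most common output comes back. A rough sketch, where call_llm is a hypothetical stand-in for however you invoke your model:

```python
from collections import Counter

def consistency(prompt: str, runs: int = 10) -> float:
    """Share of runs that produced the single most common output."""
    # call_llm is a placeholder for your own model call (API client, SDK, etc.)
    outputs = [call_llm(prompt) for _ in range(runs)]
    _, most_common_count = Counter(outputs).most_common(1)[0]
    return most_common_count / runs
```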


Now that you know how to measure the accuracy and precision of your prompts, you might find that your application isn’t as accurate as you thought. So, how can you improve it? There are various methods you can try, like prompt engineering, ensembling, and more. I’ll write about the different methods I’ve used to improve accuracy in future posts.

In the meantime, let me know what you think. Are there other methods you’ve used to evaluate your prompts?