a) there is a test set of p-many problems,
b) for each problem, k solution candidates are generated by that black-box system,
c) some black-box assessment is made of how many of these k solution candidates come close to solving the problem,
d) and finally, the metric aggregates all the p × k assessments into a single number.
I collectively call these “@k-type metrics”. “Black-box assessment” refers to the fact that no introspection into the assessment mechanism (which can be a human, software, or a combination of the two) is required. Unfortunately, these metrics are often not fully defined, sometimes even ill-defined, confusing, and inconsistently used. In this post, I will clear up several misconceptions around these metrics. Knowing about these issues will hopefully make your evals more rigorous.
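To make this structure concrete, here is a minimal sketch in Python (all names are mine, for illustration only): a p × k matrix of black-box assessments is reduced to a single number by some aggregation function, and the metrics discussed below differ only in that aggregation.

```python
from typing import Callable, List

def at_k_metric(
    assessments: List[List[bool]],
    aggregate: Callable[[List[List[bool]]], float],
) -> float:
    """assessments[i][j] is the black-box judgment of candidate j for problem i
    (binary here; it could just as well be a BLEU score). The aggregate function
    is where the individual metrics (pass@k, n@k, ...) differ."""
    assert assessments and all(len(row) == len(assessments[0]) for row in assessments)
    return aggregate(assessments)

def mean_accuracy(assessments: List[List[bool]]) -> float:
    # One possible aggregation: overall fraction of correct candidates
    # across all p x k samples.
    p, k = len(assessments), len(assessments[0])
    return sum(sum(row) for row in assessments) / (p * k)

print(at_k_metric([[True, False], [False, False]], mean_accuracy))  # 0.25
```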
I will focus here on the case where the assessment of the output of the black-box system we want to evaluate makes only binary judgments (correct/incorrect), because this is the most widespread setup. There are other types of metrics that can be distinguished by whether they are execution-based or execution-free (see, e.g., https://arxiv.org/pdf/2212.10481), but we will not delve into these. Note that classic metrics, such as BLEU, are subsumed under the black-box assessment (modulo binary outputs); one can use pass@k with a BLEU-based assessment just as well as with accuracy.
n@k = x%: This metric says: Sample k candidates for each problem. Let your black-box system pick n <= k of these per problem. Check whether at least one of the n picked candidates solves the problem; if so, mark that problem as solved. Report the percentage of the p problems solved this way. This metric was first defined in the AlphaCode paper [https://arxiv.org/pdf/2203.07814, page 6 and page 15].
The problem is: this is an ill-defined (incomplete) metric; its value changes depending on how the system chooses the n samples, but this information is not included in the metric’s name. If some function f is used to choose the n samples, then the metric should actually be called something like f@n@k. For example, if you use logprobs (assuming your system exposes these) to rank all k samples and pick the top n, the metric should be called logprobrank@n@k. If some other scoring function is used, for example a reward model M that guesses how well a generated solution candidate will work, then the metric should be named M@n@k. The metrics logprobrank@n@k and M@n@k may not agree, but reporting either as “n@k” without stating the ranking procedure is incomplete: if I just tell you that n@k = 0.7 and you don’t have access to the system (which you don’t in the case of commercial LLMs), you cannot fully interpret the evaluation, because you don’t know the function f that chose the n samples! The AlphaCode paper makes this suboptimal state of affairs explicit (page 6): “The filtering method is up to the system itself”.
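To see why the choice of f matters, here is a minimal sketch in Python (the function names, the toy candidate format, and the two selection functions are made up for illustration): n@k is computed relative to an explicit selection function, and two different selectors over the same k candidates will in general report different numbers under the same “n@k” label.

```python
import random
from typing import Callable, List, Sequence

# Toy candidate: {"correct": bool, "logprob": float, "reward": float}
Candidate = dict

def n_at_k(
    problems: Sequence[Sequence[Candidate]],          # per problem: k generated candidates
    select_n: Callable[[Sequence[Candidate], int], List[Candidate]],  # the function f
    n: int,
) -> float:
    """Fraction of problems where at least one of the n selected candidates is correct."""
    solved = 0
    for candidates in problems:
        picked = select_n(candidates, n)
        if any(c["correct"] for c in picked):
            solved += 1
    return solved / len(problems)

# Two hypothetical selection functions f: rank by logprob vs. by a reward-model score.
def top_by_logprob(cands, n):
    return sorted(cands, key=lambda c: c["logprob"], reverse=True)[:n]

def top_by_reward(cands, n):
    return sorted(cands, key=lambda c: c["reward"], reverse=True)[:n]

if __name__ == "__main__":
    random.seed(0)
    # Toy data: k=10 candidates for each of p=50 problems; correctness is only
    # loosely related to either score, so the two rankings disagree.
    problems = [
        [{"correct": random.random() < 0.2,
          "logprob": random.gauss(0, 1),
          "reward": random.gauss(0, 1)} for _ in range(10)]
        for _ in range(50)
    ]
    print("logprobrank@1@10:", n_at_k(problems, top_by_logprob, 1))
    print("M@1@10:          ", n_at_k(problems, top_by_reward, 1))
    # The two numbers generally differ, yet both would be reported as "1@10".
```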
One way to fix this situation, if we want to stick with the “n@k” name for the metric, is to change its definition slightly. Here is a very similar definition that makes the n@k metric rigorous; let’s call it “n@@k” and define it as follows: a problem counts as solved if there exists a subset of n samples, out of the total of k samples, at least one of which solves the problem; x% is then the percentage of problems solved this way. Notice that this definition no longer relies on a mechanism for finding the n samples. n@@k then becomes an upper bound for the entire family of n@k-like metrics, since each choosing mechanism picks one specific subset of n samples, but some subset that wasn’t picked might score higher.
I am not saying that n@@k is useful - I am saying that it is rigorous, while n@k is not: for the latter, you typically have to figure out the mechanism by which the n samples were chosen.
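A minimal sketch of n@@k as just defined (Python, illustrative names): it literally checks whether some subset of n of the k candidates contains a correct one, with no choosing mechanism involved.

```python
from itertools import combinations
from typing import Sequence

def n_at_at_k(correct: Sequence[Sequence[bool]], n: int) -> float:
    """correct[i][j]: whether candidate j for problem i was judged correct.

    A problem counts as solved if there exists a subset of n candidates
    (out of the k generated ones) containing at least one correct candidate.
    """
    solved = 0
    for row in correct:
        if any(any(subset) for subset in combinations(row, n)):
            solved += 1
    return solved / len(correct)

# Any concrete selection mechanism picks one specific subset of size n,
# so its n@k value can never exceed n@@k on the same samples.
```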
pass@k. This metric is simple to define. Note that if n = k, then the n@k metric (for any choosing mechanism) coincides with n@@k. This is the pass@k metric, as all k samples are assessed for correctness, not just a subset of n of them. pass@k was first defined in the SPoC paper (https://proceedings.neurips.cc/paper/2019/file/7298332f04ac004a0ca44cc69ecf6f6b-Paper.pdf), though not under the pass@k name. Rather, it is called “success rate at B” (page 6), where B is a “budget” of trials (i.e., what we denote by k) that a system can generate per problem.
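A minimal pass@k sketch in the same binary-assessment setup (illustrative code, not taken from either paper):

```python
from typing import Sequence

def pass_at_k(correct: Sequence[Sequence[bool]]) -> float:
    """correct[i][j]: whether candidate j (of the k generated) for problem i is correct.

    A problem is solved if at least one of its k candidates is correct;
    report the fraction of solved problems.
    """
    return sum(any(row) for row in correct) / len(correct)

# Example: 2 of 3 problems have at least one correct candidate among k=2.
print(pass_at_k([[True, False], [False, False], [False, True]]))  # ~0.667
```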
ratiopass@k. pass@k is not sensitive to how many of the k samples were positive: obviously, it makes a difference whether, for a given problem, only a few solution candidates are correct or most of them are. pass@k does not pick up on this difference. One metric that does is ratiopass@k, introduced only recently, in 2024: https://onlinelibrary.wiley.com/doi/full/10.4218/etrij.2023-0357. This is surprising, since it is a completely natural metric to consider. ratiopass@k was introduced in the context of assessing generated software programs, where there are test input-output pairs. In this context, it is natural to move away from a binary evaluation of whether all tests succeed and to allow for fractional points based on the number of tests that passed. This does not work for all kinds of assessments: for example, if you assess mathematical proofs, it is not straightforward to split the assessment into “additive categories” (like test input-output pairs for programs) such that the entire assessment passes if the individual parts pass.
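Since the exact aggregation used by ratiopass@k is not spelled out here, the following sketch assumes one plausible reading: each candidate earns the fraction of test input-output pairs it passes, and these fractions are averaged over the k candidates and the p problems. The aggregation details are my assumption; consult the linked paper for the authoritative definition.

```python
from typing import Sequence

def ratio_pass_at_k(tests_passed: Sequence[Sequence[float]]) -> float:
    """tests_passed[i][j]: fraction of test input-output pairs that candidate j
    for problem i passes (a value in [0, 1]).

    ASSUMED aggregation (see above): average the fractions over the k
    candidates per problem, then over the p problems.
    """
    per_problem = [sum(row) / len(row) for row in tests_passed]
    return sum(per_problem) / len(per_problem)

# Unlike pass@k, this rewards problems where many candidates pass many tests.
print(ratio_pass_at_k([[1.0, 0.5], [0.0, 0.25]]))  # 0.4375
```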
Hopefully, this helps you get your LLM evals underway.