Fullrank is an interactive CLI tool for Bayesian inference of list rankings based on noisy comparisons. It takes a list of items, then efficiently prompts the user to compare pairs of items until the user decides that the posterior distribution over rankings has sufficiently low entropy. It can then sample from the resulting posterior distribution and compute various statistics.
```bash
# install via pip
pip install git+https://github.com/max-niederman/fullrank.git

# interactively compare items
fullrank compare items.txt > comparisons.json

# infer a ranking distribution and compute statistics
fullrank stats < comparisons.json

# compute raw samples in JSONL format for your own processing
fullrank raw-sample 10000 samples.jsonl < comparisons.json

# get the unified skew-normal distribution parameters of the posterior
fullrank infer-sun < comparisons.json
```

Deterministic sorting algorithms rank lists by comparing pairs of items. If an item is greater than another, it is moved higher in the list. However, sometimes it is uncertain which item is greater. For example:
- The best chess player wins only 60% of their non-draw games against the second-best player.
- Only three-quarters of human raters prefer one LLM completion over another.
- A person prefers one food over another, but only on 80% of days.
Estimating rankings in the presence of this uncertainty is called noisy sorting. A common approach is to model comparisons between items as depending on a latent numerical value ("skill" or "rating") for each item. For example, the commonly used Bradley–Terry model assumes that

$$\Pr(i \succ j) = \operatorname{logit}^{-1}(s_i - s_j) = \frac{1}{1 + e^{-(s_i - s_j)}},$$

where $i \succ j$ denotes the event that item $i$ beats item $j$ in a comparison, and $s_i$ is item $i$'s latent skill.
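To make that concrete, here is the chess example from above under Bradley–Terry. The skill values are hypothetical, chosen only to reproduce the 60% win rate:

```python
import math

def p_win(s_i: float, s_j: float) -> float:
    """Bradley-Terry: win probability is the inverse logit of the skill gap."""
    return 1 / (1 + math.exp(-(s_i - s_j)))

# A 60% win rate corresponds to a skill gap of logit(0.6) = ln(1.5) ~= 0.405.
print(p_win(0.405, 0.0))  # ~= 0.60
```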
Gwern Branwen's Resorter is a CLI tool for manual noisy sorting of lists based on the Bradley–Terry model. However, its frequentist approach limits it in a few ways:
- It has to use imperfect heuristics to guess which comparison is most informative.
- There's no principled way to quantify how accurate the resulting maximum-likelihood ranking is, or tell when to stop comparing items.
- It can't answer questions like "What is the probability that item $i$ is truly the best item?"
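With posterior samples in hand, that last question reduces to a Monte Carlo average. A minimal sketch, assuming each line of `fullrank raw-sample`'s JSONL output is a JSON array holding one sampled skill vector (the exact output format is an assumption here):

```python
import json
import numpy as np

# One posterior sample of the skill vector per JSONL line (assumed format).
with open("samples.jsonl") as f:
    samples = np.array([json.loads(line) for line in f])

# P(item i is truly the best) ~= fraction of samples where item i ranks first.
n_items = samples.shape[1]
p_best = np.bincount(samples.argmax(axis=1), minlength=n_items) / len(samples)
print(p_best)
```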
As a project to learn more about Bayesian inference, I decided to build a Bayesian version of Resorter.
The Bradley–Terry model is quite nice for maximum-likelihood estimation, but I was unable to get it to work well in a Bayesian setting. Given a standard normal prior on the skills, the posterior density is

$$p(s \mid D) \propto \phi_n(s) \prod_{(i \succ j) \in D} \operatorname{logit}^{-1}(s_i - s_j),$$

where $\phi_n$ is the density of the $n$-dimensional standard normal distribution and the product runs over the observed comparisons $D$. As far as I can tell, this density doesn't belong to any named family of distributions that admits efficient sampling.
Instead, I used a probability model very similar to Bradley–Terry, but with a probit link instead of a logit link. That is, under the Thurstonian model,

$$\Pr(i \succ j) = \Phi(s_i - s_j),$$

where $\Phi$ is the CDF of the standard normal distribution.
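The probit link behaves much like a rescaled logit link; under it, the 60% chess example corresponds to a smaller skill gap:

```python
from scipy.stats import norm

gap = norm.ppf(0.6)   # ~= 0.253, vs. ~= 0.405 under the logit link
print(norm.cdf(gap))  # 0.6: the win probability Phi(s_i - s_j)
```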
I'll now derive the posterior density in the Thurstonian model. For convenience, I'll represent the $m$ observed comparisons of $n$ items as a matrix $D \in \{-1, 0, +1\}^{m \times n}$, where row $k$ has a $+1$ in the winner's column, a $-1$ in the loser's column, and zeros elsewhere. The likelihood of the observations is then

$$p(D \mid s) = \prod_{k=1}^{m} \Phi\big((Ds)_k\big) = \Phi_m(Ds;\, I_m),$$

where $\Phi_m(\,\cdot\,; \Sigma)$ denotes the CDF of the $m$-dimensional centered normal distribution with covariance $\Sigma$. With a standard normal prior $s \sim \mathcal{N}_n(0, I_n)$, Bayes' rule gives the posterior density

$$p(s \mid D) = \frac{\phi_n(s)\, \Phi_m(Ds;\, I_m)}{\int_{\mathbb{R}^n} \phi_n(t)\, \Phi_m(Dt;\, I_m)\, \mathrm{d}t}.$$
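Concretely, building $D$ from a list of (winner, loser) index pairs might look like this (a sketch for illustration, not Fullrank's internals):

```python
import numpy as np

def comparison_matrix(pairs: list[tuple[int, int]], n: int) -> np.ndarray:
    """Row k holds +1 in the winner's column and -1 in the loser's column."""
    D = np.zeros((len(pairs), n))
    for k, (winner, loser) in enumerate(pairs):
        D[k, winner] = 1.0
        D[k, loser] = -1.0
    return D

# Three comparisons among four items: 0 beat 1, 1 beat 2, and 0 beat 3.
D = comparison_matrix([(0, 1), (1, 2), (0, 3)], n=4)
```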
It turns out that the normalization constant can be represented quite nicely using the multivariate normal CDF. Let $T \sim \mathcal{N}_n(0, I_n)$ and $U \sim \mathcal{N}_m(0, I_m)$ be independent; then

$$\int_{\mathbb{R}^n} \phi_n(t)\, \Phi_m(Dt;\, I_m)\, \mathrm{d}t = \Pr(U \le DT) = \Pr(U - DT \le 0).$$

And since $U - DT \sim \mathcal{N}_m(0,\, I_m + DD^\top)$, the normalization constant is simply $\Phi_m(0;\, I_m + DD^\top)$. Likewise, the full posterior density can be written

$$p(s \mid D) = \frac{\phi_n(s)\, \Phi_m(Ds;\, I_m)}{\Phi_m(0;\, I_m + DD^\top)}.$$
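This identity is easy to sanity-check numerically, e.g. with the small example matrix from above:

```python
import numpy as np
from scipy.stats import multivariate_normal, norm

D = np.array([[1.0, -1.0, 0.0, 0.0],
              [0.0, 1.0, -1.0, 0.0],
              [1.0, 0.0, 0.0, -1.0]])
m, n = D.shape
rng = np.random.default_rng(0)

# Monte Carlo estimate of the integral of phi_n(t) * Phi_m(Dt; I_m)
# over a standard normal t.
t = rng.standard_normal((200_000, n))
mc = norm.cdf(t @ D.T).prod(axis=1).mean()

# Closed form: Phi_m(0; I_m + D D^T).
closed = multivariate_normal(cov=np.eye(m) + D @ D.T).cdf(np.zeros(m))
print(mc, closed)  # the two should agree to roughly three decimal places
```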
This is called a unified skew-normal (SUN) distribution, and it is the posterior of most probit models. Using the convention of Arellano-Valle and Azzalini[^1], we can write

$$s \mid D \;\sim\; \mathrm{SUN}_{n,m}\big(0,\; I_n,\; \Delta,\; 0,\; \Gamma\big), \qquad \Delta = D^\top s_\Gamma^{-1}, \quad \Gamma = s_\Gamma^{-1}\,(I_m + DD^\top)\,s_\Gamma^{-1},$$

where $s_\Gamma = \big(\operatorname{diag}(I_m + DD^\top)\big)^{1/2}$ is the diagonal matrix of standard deviations, so that $\Gamma$ is a correlation matrix.
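In code, these parameters fall straight out of $D$ (again a sketch of the math above, not Fullrank's API):

```python
import numpy as np

def sun_parameters(D: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Posterior SUN parameters (Delta, Gamma) under a standard normal prior."""
    m = D.shape[0]
    S = np.eye(m) + D @ D.T                 # I_m + D D^T
    s_gamma = np.sqrt(np.diag(S))
    Gamma = S / np.outer(s_gamma, s_gamma)  # S rescaled to a correlation matrix
    Delta = D.T / s_gamma                   # Delta = D^T s_Gamma^-1
    return Delta, Gamma
```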
Arellano-Valle and Azzalini[^1] also give us a convolutional representation of the posterior. If

$$V_0 \sim \mathcal{N}_n\big(0,\; I_n - \Delta \Gamma^{-1} \Delta^\top\big) \qquad \text{and} \qquad V_1 \sim \mathcal{N}_m(0, \Gamma) \ \text{truncated below } 0$$

are independent, then $V_0 + \Delta \Gamma^{-1} V_1$ is distributed according to the posterior, where the truncation means that $V_1$ is conditioned on all of its components being positive.
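A direct, if naive, way to exploit this representation is to rejection-sample the truncated component. The acceptance rate decays quickly as comparisons accumulate, so this is only a sketch of the idea, not how a serious implementation would do it:

```python
import numpy as np

def sample_posterior(D: np.ndarray, k: int, seed: int = 0) -> np.ndarray:
    """Draw k samples of the skill vector via the convolutional representation."""
    rng = np.random.default_rng(seed)
    m, n = D.shape
    S = np.eye(m) + D @ D.T
    s_gamma = np.sqrt(np.diag(S))
    Gamma = S / np.outer(s_gamma, s_gamma)
    Delta = D.T / s_gamma
    coef = Delta @ np.linalg.inv(Gamma)  # Delta Gamma^-1
    cov0 = np.eye(n) - coef @ Delta.T    # covariance of V_0
    out = []
    while len(out) < k:
        v1 = rng.multivariate_normal(np.zeros(m), Gamma)
        if np.all(v1 > 0):  # keep only draws in the positive orthant (the truncation)
            v0 = rng.multivariate_normal(np.zeros(n), cov0)
            out.append(v0 + coef @ v1)
    return np.array(out)
```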
Fullrank exploits this fact to efficiently sample from the posterior using samples of the truncated normal $V_1$ and the untruncated normal $V_0$.
Ideally, Fullrank should always present the user with the most informative comparison. That is, the comparison whose probit $s_i - s_j$ has maximal posterior entropy.
Unified skew-normal distributions are closed under full-rank linear transformations, so each comparison probit is distributed according to a one-dimensional SUN distribution. At least in the case of a standard normal prior, each comparison has identical first and second moments, so intuitively the entropy should be controlled by the skewness.
Fullrank currently assumes that the entropy is decreasing in the magnitude of the probit's skewness, and therefore presents the user with the comparison whose probit is least skewed.
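A sample-based sketch of that selection rule, using posterior samples like those produced above (Fullrank itself may compute the skewness analytically from the SUN parameters instead):

```python
import itertools
import numpy as np
from scipy.stats import skew

def next_comparison(samples: np.ndarray) -> tuple[int, int]:
    """Pick the pair (i, j) whose probit s_i - s_j has the least-skewed posterior."""
    n = samples.shape[1]
    return min(
        itertools.combinations(range(n), 2),
        key=lambda ij: abs(skew(samples[:, ij[0]] - samples[:, ij[1]])),
    )
```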
[^1]: R. B. Arellano-Valle and A. Azzalini, "Some properties of the unified skew-normal distribution," *Statistical Papers*, vol. 63, pp. 461–487, 2022. doi:10.1007/s00362-021-01235-2