Optimal prediction sets for plant identification: an interactive guide

Insights and visualization of conformal prediction in the context of long-tailed classification, with challenges from citizen science platforms like Pl@ntNet.
Categories: Conformal prediction, classification, machine learning, Pl@ntNet

Authors: Tiffany Ding, Jean-Baptiste Fermanian, Joseph Salmon

Published: Oct 20, 2025

Introduction and motivation

This blog post is an effort led by Joseph Salmon, with contributions from Tiffany Ding and Jean-Baptiste Fermanian, to synthesize joint research by the three authors. The full paper (Ding, Fermanian, and Salmon 2025) is available on arXiv and the code is on GitHub.

Why settle for one label when you could have a set?

Traditionally, machine learning models give you a single best guess. But what if the problem is tricky, or multiple answers are plausible? In those cases, a set of possible labels is far more useful, especially when it comes with a statistical guarantee that the true answer is included.

Conformal prediction (often abbreviated CP below) provides a framework for constructing prediction sets with valid coverage guarantees (Vovk, Gammerman, and Shafer 2005). A particular focus on classification was considered later (Vovk et al. 2016), and methods for minimizing the size of prediction sets, such as the Least Ambiguous Set-Valued Classifiers, were proposed in parallel by Sadinle, Lei, and Wasserman (2019). In this post, we focus on extensions of the latter approach.

What does it mean for a prediction set to be optimal? Given that we have decided to generate prediction sets, how do we generate an “optimal” prediction set? A reasonable definition, and the one we will use, is that a prediction-set-generating procedure is optimal if it generates the smallest sets that satisfy a given coverage constraint (e.g., the probability that the true label is contained in the prediction set is at least 90%). What kind of coverage we care about depends on the setting, and the coverage notion you choose can make a big difference in the resulting sets. Sadinle, Lei, and Wasserman (2019) derived the theoretically optimal sets under a marginal-coverage or class-conditional-coverage constraint and showed how to approximate these optimal sets using conformal prediction while maintaining a coverage guarantee. We do the same for two new notions of coverage, macro-coverage and weighted macro-coverage, and show that these lead to more useful prediction sets in long-tailed classification settings.

Here is an overview of the four coverage notions we cover below:

  • Marginal coverage: Guarantees coverage across all data points, ensuring the true label is included with the desired probability on average over the dataset, but may neglect rare or underrepresented classes.

  • Class-conditional coverage: Guarantees coverage within each class separately, ensuring the desired inclusion probability for every class, but often producing impractically large sets for classes with high uncertainty or few samples.

  • Macro coverage: Balances coverage evenly across classes, by averaging the per-class coverages to emphasize fairness between common and rare classes.

  • Weighted macro coverage: Balances coverage according to class importance, extending macro coverage by incorporating user-specified weights that reflect the relative importance of covering each class.

Depending on which coverage notion appears in the constraint, the resulting sets can look very different. We will illustrate this using an artificial 2D example and real-world plant identification data.

Motivating example: Pl@ntNet

Pl@ntNet, a leading citizen-science platform for plant identification, serves as our real-world context for exploring these ideas. More precisely, we focus on the Pl@ntNet-300K dataset (Garcin et al. 2021), a representative and more manageable subset of the full Pl@ntNet collection. The Pl@ntNet-300K dataset is available on Zenodo, with documentation and code provided on the Pl@ntNet-300K GitHub repository. The dataset comprises approximately 300,000 images spanning over 1,000 plant species. Here are a few sample images from the Pl@ntNet-300K dataset (cropped to a consistent size for visualization):

Sample images (species shown): Pelargonium capitatum (L.) L’Hér., Sedum litoreum Guss., Daucus pusillus Michx., Geropogon hybridus (L.) Sch.Bip.

The dataset exhibits a highly imbalanced class distribution: a few species are well-represented with numerous images, while the majority have only a few examples. This characteristic makes it an ideal benchmark for studying long-tailed classification problems, as illustrated in the interactive figure below.

Conservation status in Pl@ntNet-300K

Each species in the dataset has an International Union for Conservation of Nature (IUCN) conservation status, which indicates its risk of extinction. We categorize these statuses into two groups for later analysis:

IUCN conservation status groups:

  • At Risk: Critically Endangered (CR), Endangered (EN), Vulnerable (VU), Near Threatened (NT)
  • Not at Risk: Least Concern (LC), Data Deficient (DD)
  • Unknown/unclear: Not assessed (or missing data)


This metadata allows us to analyze species distributions both in terms of their popularity and their conservation status, adding a critical ecological perspective to our study. This dual focus serves as a key motivation for introducing our weighted macro coverage metric.

Plant species visualization

The interactive visualization below lets you explore the species distribution in the Pl@ntNet-300K training set, enriched with their IUCN conservation statuses.

[Interactive figure: per-species observation counts with selectable Y-scale and a species preview panel.]

This visualization provides a dynamic way to explore the intersection of species popularity and ecological importance. “At-risk” species are highlighted and typically have far fewer observations, reflecting the challenges in documenting rare or threatened flora. Use the interactive features (hover the points on the left!) to navigate observation counts, conservation status, and the distribution of both common and rare species.

Notation and model

Let \((X,Y) \in \mathcal{X} \times \mathcal{Y}\) be a (feature, label) pair. In our example, \(X\) is an image, and \(Y\) is the true label (species) associated with it. Let \(\hat{p}(y|x)\) be an estimate of the probability \(p(y|x)\) that the label is \(y\) given features \(x\). Let \(\alpha \in [0,1]\) be a user-specified miscoverage probability, so \(1-\alpha\) corresponds to the desired coverage level.

The available data is randomly split into four distinct sets: a training set for fitting the model \(\hat{p}(y|x)\), a validation set for hyperparameter tuning (in our experiments, the only hyperparameter is the number of training epochs), a calibration set \((X_i, Y_i) \in \mathcal{X} \times \mathcal{Y}\), \(i \in [n]\), used to calibrate the prediction sets, and a test set for final evaluation. Note that in order for the coverage guarantees from conformal prediction to hold, the calibration set must be exchangeable with the test set. The random splitting procedure ensures this is satisfied.
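As a minimal illustration (not the exact pipeline used in our experiments), a random four-way split can be sketched as follows; the split fractions and array names below are only placeholders.

```python
import numpy as np

def random_split(n, fractions=(0.6, 0.1, 0.15, 0.15), seed=0):
    """Randomly split indices 0..n-1 into train/val/cal/test subsets.

    A uniformly random permutation makes the calibration and test points
    exchangeable, which is what the conformal coverage guarantee requires.
    """
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n)
    cuts = np.cumsum(np.round(np.array(fractions[:-1]) * n).astype(int))
    return np.split(perm, cuts)  # [train_idx, val_idx, cal_idx, test_idx]

# Usage with hypothetical arrays X (features) and y (labels):
# train_idx, val_idx, cal_idx, test_idx = random_split(len(y))
```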

Marginal coverage

In the classification setting, suppose our aim is to control the size (here the cardinality) of our prediction sets while ensuring marginal coverage. This corresponds to the following optimization problem:

\[ \begin{align} &\min_{\mathcal{C} : \mathcal{X} \mapsto 2^\mathcal{Y}} \mathbb{E}[|\mathcal{C}(X)|] \\ \text{ s.t. } & {\mathbb P}\left( Y \in \mathcal{C}(X) \right) \geq 1-\alpha, \end{align} \tag{1}\] The following result due to Sadinle, Lei, and Wasserman (2019) describes the optimal set.

Proposition 1 (Marginal coverage and prediction set) Assuming that \(p(Y|X)\) does not have point mass at its \(\alpha\)-quantile, denoted \(t^*_{\alpha}\), the solution of (1) is given by \[ \begin{align*} \mathcal{C}^*(x) & = \left\{ y \in \mathcal{Y} : p(y|x) \geq t^*_{\alpha}\right\}. \end{align*} \]

Proof

This result is a direct consequence of Proposition 4 stated below with the particular choice of \(\omega(y) = p(y)\) (the class prevalences): with these weights, the weighted macro-coverage reduces to marginal coverage and the thresholded quantity \(\omega(y)p(y|x)/p(y)\) reduces to \(p(y|x)\).

Interpretation: This result implies that the smallest sets that achieve marginal coverage are created by including the classes with the highest probability of being the correct class. Specifically, we include all labels \(y\) where \(p(y|x)\) exceeds some threshold \(t^*_{\alpha}\) chosen to ensure the desired coverage level of \(1-\alpha\).

In practice: While we do not have direct access to \(p(y|x)\), we can use an estimate \(\hat{p}(y|x)\) to approximate the optimal prediction set for a chosen threshold \(t_\alpha\):

\[ \widehat{{\mathcal C}}(X) = \{y \in {\mathcal Y}: \hat{p}(y|X) \geq t_\alpha\}, \] To ensure marginal coverage of \(1-\alpha\), we select \(t_\alpha\) by rewriting \(\widehat{{\mathcal C}}(x)\) in terms of a conformal score: \[ \widehat{{\mathcal C}}(x) = \{y \in {\mathcal Y}: s(x,y) \leq -t_\alpha\}, \] where the score is defined as \[ s(x,y) = -\hat{p}(y|x). \]

By setting \(-t_\alpha\) as the \((1-\alpha)(1+\tfrac{1}{n})\)-level quantile of the calibration scores \(s(X_{i},Y_{i})\) for \(i=1,\dots, n\), i.e., \(\hat{q} = -t_\alpha\) is the \(\lceil (n+1)(1-\alpha) \rceil\)-th smallest value in \(\{s(X_{1},Y_{1}), \dots, s(X_{n},Y_{n})\}\), the conformal prediction set \(\smash{\widehat{{\mathcal C}}}\) achieves a marginal coverage guarantee (Vovk, Gammerman, and Shafer 2005), that is,

\[ \begin{align} {\mathbb P}(Y \in \widehat{{\mathcal C}}(X)) \geq 1-\alpha. \end{align} \] We will refer to the above procedure as Standard Conformal Prediction (Standard CP) with the softmax score function (so named because the \(\hat{p}\) that appears in \(s(x,y) = -\hat p(y|x)\) is frequently obtained from a softmax vector). Note that other scores have also been considered in the literature, see for instance Romano, Sesia, and Candès (2020).

We generalize this procedure for generating prediction sets in Algorithm 1. Standard CP is a special case where the threshold function \(\hat{\mathbf{q}}=(\hat{q},\dots,\hat{q}) \in {\mathbb R}^{|{\mathcal Y}|}\) applies the same threshold to all classes. However, the other conformal procedures we present below, such as Classwise CP, impose different thresholds for each class.

\begin{algorithm} \caption{Conformal prediction meta-algorithm} \begin{algorithmic} \REQUIRE Score function $s: \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$, miscoverage level $\alpha \in [0,1]$, calibration set $\mathcal{D}_{\mathrm{cal}} = \{(X_i, Y_i)\}_{i=1}^n$, test point $X_{n+1}$, threshold function $\hat{\mathbf{q}}: \mathbb{R}^n \times [0,1] \to \mathbb{R}^{|\mathcal{Y}|}$ \STATE Compute scores: $S_i \gets s(X_i, Y_i)$ for $i = 1, \ldots, n$ \STATE Compute thresholds: $\mathbf{q} \gets \hat{\mathbf{q}}((S_i)_{i=1}^n, \alpha)$ \RETURN Prediction set $\mathcal{C}(X_{n+1}) = \{ y : s(X_{n+1}, y) \leq q_y \}$, where $q_y$ is the $y$-th entry of $\mathbf{q}$ \end{algorithmic} \end{algorithm}
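For concreteness, here is a minimal NumPy sketch of Algorithm 1 (illustrative only, not the exact code from the accompanying repository); Standard CP corresponds to the threshold function that repeats a single quantile \(\hat q\) across all classes.

```python
import numpy as np

def standard_threshold(cal_scores, alpha, n_classes):
    """Threshold function for Standard CP: a single shared quantile q-hat,
    the ceil((n+1)(1-alpha))-th smallest calibration score, repeated for
    every class."""
    n = len(cal_scores)
    k = int(np.ceil((n + 1) * (1 - alpha)))
    if k > n:  # too few calibration points: include every label
        return np.full(n_classes, np.inf)
    q_hat = np.sort(cal_scores)[k - 1]
    return np.full(n_classes, q_hat)

def conformal_prediction_set(test_scores, thresholds):
    """Return {y : s(x_test, y) <= q_y} for one test point.

    test_scores: shape (n_classes,), the scores s(x_test, y) for each y.
    thresholds:  shape (n_classes,), the per-class thresholds q_y.
    """
    return np.flatnonzero(test_scores <= thresholds)

# Usage sketch (hypothetical inputs):
# cal_scores[i] = s(X_i, Y_i) on the calibration set, e.g. -p_hat(Y_i | X_i);
# test_scores[y] = s(X_test, y) for every candidate label y.
# q = standard_threshold(cal_scores, alpha=0.1, n_classes=test_scores.size)
# pred_set = conformal_prediction_set(test_scores, q)
```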

Currently (as of October 2025), Standard CP with softmax is more or less the strategy used to generate prediction sets in the Pl@ntNet app. This performs reasonably well, but can we do better?

2D example

To build intuition, let’s visualize the Standard CP prediction sets for a simple 2D Gaussian mixture model.

Parameters

We generate \(n=900\) samples from a mixture of \(|\mathcal{Y}|=3\) Gaussian distributions with the following parameters:

  • Class 1: \(\pi_1 = 0.6\), \(\mu_1 = \begin{pmatrix} 0 \\ 3.5 \end{pmatrix}\), \(\Sigma_1 = \begin{pmatrix} 1 & 0.8 \\ 0.8 & 1 \end{pmatrix}\)
  • Class 2: \(\pi_2 = 0.1\), \(\mu_2 = \begin{pmatrix} -2 \\ 0 \end{pmatrix}\), \(\Sigma_2 = \begin{pmatrix} 0.25 & 0 \\ 0 & 0.25 \end{pmatrix}\)
  • Class 3: \(\pi_3 = 0.3\), \(\mu_3 = \begin{pmatrix} 1 \\ 1 \end{pmatrix}\), \(\Sigma_3 = \begin{pmatrix} 1.85 & 1.4 \\ 1.4 & 1.85 \end{pmatrix}\)

Visualizing the data

The data-generating model is a three-class Gaussian mixture with known parameters (means, covariances, mixing proportions) given above, enabling exact density and theoretical boundary computation. We provide an interactive visualization below.

In both plots, the red boundary is the decision boundary of the Bayes classifier (the theoretically optimal boundary).

  • Left plot: Individual class densities (color-filled contours) and mixture density (grey level lines).
  • Right plot: The points are samples from the mixture distribution, colored by true class. The colored shading indicates the Bayes classifier’s predicted class for each region.
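Because the mixture parameters are known, the conditional probabilities \(p(y|x)\) and the Bayes decision boundary can be computed exactly via Bayes’ rule. Here is a minimal Python sketch (with SciPy; function and variable names are ours) for sampling the data and evaluating the exact posterior.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Mixture parameters from the table above
pis = np.array([0.6, 0.1, 0.3])
mus = [np.array([0.0, 3.5]), np.array([-2.0, 0.0]), np.array([1.0, 1.0])]
covs = [np.array([[1.0, 0.8], [0.8, 1.0]]),
        np.array([[0.25, 0.0], [0.0, 0.25]]),
        np.array([[1.85, 1.4], [1.4, 1.85]])]

def posterior(x):
    """Exact p(y | x) for the three-component Gaussian mixture (Bayes' rule)."""
    joint = np.array([pi * multivariate_normal.pdf(x, mean=mu, cov=cov)
                      for pi, mu, cov in zip(pis, mus, covs)])
    return joint / joint.sum()

def sample(n, seed=0):
    """Draw n (X, Y) pairs from the mixture."""
    rng = np.random.default_rng(seed)
    ys = rng.choice(3, size=n, p=pis)
    xs = np.stack([rng.multivariate_normal(mus[y], covs[y]) for y in ys])
    return xs, ys

# Example: the exact class probabilities at the point x = (0, 1)
# print(posterior(np.array([0.0, 1.0])))
```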

Visualizing Standard CP sets

Concerning the distribution \(p(y|x)\), we provide three candidates in our Gaussian mixture model visualization, though for the moment only the first (true density function) is used:

  • Theoretical: \(\hat p(y|x)=p(y|x)\) (only available in simulations),
  • Empirical: \(\hat p(y|x)\) is a Gaussian mixture whose parameters are estimated with vanilla estimators (empirical means, empirical covariances and empirical proportions),
  • Logistic Reg.: \(\hat p(y|x)\) is obtained by logistic regression.

We now provide a visualization for Standard conformal prediction, as described in the previous section. Here, the classifier can output multiple classes, forming prediction sets rather than a single label. The size of these sets is controlled by the allowed miscoverage level \(\alpha\) (accessible through the slider), which reflects the confidence of the predictions.

As \(\alpha\) increases, the prediction sets shrink: small \(\alpha\) yields larger, conservative sets, while large \(\alpha\) allows smaller, more selective ones. This reflects the core principle of conformal prediction: balancing coverage and uncertainty.


Using the \(\alpha\) slider above reveals two extremes of this trade-off:

  • When \(\alpha\) is small (high coverage), the classifier may need to include all classes in some regions to meet the coverage requirement. For example, at \(\alpha = 0.0001\), the central area contains full prediction sets.
  • When \(\alpha\) is large (low coverage), it may output empty sets in uncertain regions (shown in white), as seen around \(\alpha = 0.15\). If a non-empty output is required, one can instead return the class with the highest conditional probability.

Together, these extremes illustrate how \(\alpha\) directly controls the size–coverage trade-off in conformal prediction.

Conditional coverage

A concern with the standard conformal prediction approach is that all classes are treated similarly. If certain rare classes behave differently from the majority, they may never appear in the prediction set. For instance, the classifier may consistently assign low probabilities to rare classes that are not sufficiently represented during training, so these classes never rise above the threshold.

For this reason, it makes sense to control the conditional coverage for all classes \(y \in {\mathcal Y}\), while targeting sets of minimum size. The optimization formulation then becomes:

\[ \begin{align} & \min_{\mathcal{C} : \mathcal{X} \mapsto 2^\mathcal{Y}} \mathbb{E}[|\mathcal{C}(X)|] \\ \text{ s.t. } & {\mathbb P}\left( y \in \mathcal{C}(X) \mid Y=y \right) \geq 1-\alpha, \quad \forall y \in \mathcal{Y}. \end{align} \tag{2}\]

Sadinle, Lei, and Wasserman (2019) have derived the optimal solution to this problem:

Proposition 2 (Conditional-coverage and prediction set) The solutions of (Equation 2) can be written

\[ \begin{align*} \mathcal{C}^*(x) = \left\{ y \in \mathcal{Y} : p(y|x) \geq t_{\alpha,y}\right\}, \end{align*} \] where the threshold \(t_{\alpha,y}\) is fixed as the \(\alpha\)-quantile of the conditional distribution of \(p(Y|X)\) given \(Y=y\), i.e., \({\mathbb P}(p(Y|X) \geq t_{\alpha,y} \mid Y = y) = 1 - \alpha\).

Proof

This result also follows from the Neyman-Pearson lemma applied to each class separately. We refer to Theorem 6 of Sadinle, Lei, and Wasserman (2019) for the complete proof.

Interpretation: the optimal way to achieve class-conditional coverage is to select the labels whose probability exceeds a class-specific threshold \(t_{\alpha,y}\), fixed as the \(\alpha\)-quantile of the conditional distribution of \(p(Y|X)\) given \(Y=y\), i.e., \({\mathbb P}(p(Y|X) \geq t_{\alpha,y} \mid Y = y) = 1 - \alpha\).

In practice, the threshold of each class \(y\) is now computed as a quantile of the scores of class \(y\). This is known as Mondrian conformal prediction (Vovk, Gammerman, and Shafer 2005). Formally, for any class \(y\in \mathcal{Y}\), we set the class threshold as the \(\lceil (n_y+1)(1-\alpha)\rceil\)-th smallest score among the calibration points of class \(y\), where \(n_y\) is the number of these points. This corresponds to applying Algorithm 1 with the threshold function \(\hat{\mathbf{q}}\) equal to the vector of these class-wise quantiles. We refer to this method as Classwise (or conditional) conformal prediction.

For classes whose number of calibration points \(n_y\) is too small (i.e., \(n_y< \alpha^{-1}\)), this threshold is set to the maximum possible value of the score function, or formally to \(+\infty\). This implies that these rare classes are always included in the prediction sets, as can be observed in the plots below for small enough \(\alpha\).
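A minimal sketch of this classwise threshold function (our variable names; the per-class rule matches the description above, with an infinite threshold when a class has too few calibration points):

```python
import numpy as np

def classwise_thresholds(cal_scores, cal_labels, alpha, n_classes):
    """Per-class thresholds for Classwise (Mondrian) CP.

    For each class y, the threshold is the ceil((n_y + 1)(1 - alpha))-th
    smallest calibration score among points of class y; if class y has too
    few calibration points, the threshold is +inf so y is always included.
    """
    q = np.full(n_classes, np.inf)
    for y in range(n_classes):
        scores_y = np.sort(cal_scores[cal_labels == y])
        n_y = scores_y.size
        k = int(np.ceil((n_y + 1) * (1 - alpha)))
        if k <= n_y:
            q[y] = scores_y[k - 1]
    return q

# The prediction set for a test point is then {y : s(x_test, y) <= q[y]},
# exactly as in Algorithm 1.
```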


The figure above illustrates the influence of conditional modeling: overall, the predicted sets tend to be larger, especially when the miscoverage rate \(\alpha\) is close to 0, and often include the marginal prediction sets. In practice, estimating the conditional distribution in high-dimensional spaces with limited data is challenging, which can lead to excessively large prediction sets. We also include bar plots showing per-class coverage, which reveal that Standard CP tends to under-cover rare classes (e.g., Class 2 in green). This poor class-conditional behavior motivates our next proposal.

Macro-coverage

Standard conformal prediction often under-covers rare classes, despite producing compact prediction sets. On the other end of the spectrum, conditional conformal prediction ensures equal coverage across all classes, but at the cost of unwieldy, oversized sets. What if we could find a middle ground?

This is what we tackle by introducing a macro-coverage constraint, inspired by the macro-accuracy metric in multi-class classification. Our objective is to minimize the expected size of prediction sets while ensuring that all classes (including those with limited data) benefit from equitable and reliable coverage.

The macro-coverage formulation reads as follows: \[ \begin{align} &\min_{\mathcal{C} : \mathcal{X} \mapsto 2^\mathcal{Y}} \mathbb{E}[|\mathcal{C}(X)|] \\ \text{ s.t. } & \text{MacroCov}({\mathcal C}) \geq 1-\alpha, \end{align} \tag{3}\] where we define the macro-coverage as \[ \text{MacroCov}({\mathcal C}) = \frac{1}{|\mathcal{Y}|}\sum_{y \in \mathcal{Y}} {\mathbb P}(Y \in \mathcal{C}(X) \mid Y = y). \tag{4}\] The motivation of this program is to treat all classes equally. We provide an adaptation of the previous proposition for this new constraint:

Proposition 3 (Macro-coverage and prediction set) The solutions of (Equation 3) can be written \[ \begin{align*} \mathcal{C}^*(x) = \left\{ y \in \mathcal{Y} : \frac{p(y|x)}{p(y)} \geq t\right\}, \end{align*} \] for some threshold \(t\) that depends on \(\alpha\).

Proof

This result is a direct consequence of Proposition 4 stated below with the particular choice of \(\omega(y) = \frac{1}{|\mathcal{Y}|}\).

Though we do not have access to \(p(y|x)\) and \(p(y)\) in practice, we have estimates \(\hat{p}(y|x)\) and \(\hat{p}(y)\) from our classifier and the empirical distribution of training labels, respectively. We create prediction sets as \(\smash{\widehat{{\mathcal C}}_{\mathsf{PAS}}(X) = \{y \in {\mathcal Y}: \hat{p}(y|X)/\hat{p}(y) \geq t\}}\) for a threshold \(t\), chosen to achieve \(1-\alpha\) marginal coverage in the following way: observe that \(\smash{\widehat{{\mathcal C}}_{\mathsf{PAS}}}\) can be rewritten as \(\smash{\widehat{{\mathcal C}}_{\mathsf{PAS}}(x) = \{y \in {\mathcal Y}: s_{\mathsf{PAS}}(x,y) \leq -t\}}\), where \[ \begin{align} s_{\mathsf{PAS}}(x,y) = - \frac{\hat{p}(y|x)}{\hat{p}(y)}, \end{align} \] and \(\mathsf{PAS}\) stands for prevalence-adjusted softmax.

In summary, the method we propose is simply running Algorithm 1 with the \(\mathsf{PAS}\) score function (which we will refer to as Standard with \(\mathsf{PAS}\)) and \(\hat{\mathbf{q}}=(\hat{q},\dots,\hat{q})\), where \(\hat{q}\) is the \(\lceil (n+1)(1-\alpha) \rceil\)-th smallest value in \(\{s_{\mathsf{PAS}}(X_{1},Y_{1}), \dots, s_{\mathsf{PAS}}(X_{n},Y_{n})\}\). This corresponds to running Standard CP with the new \(s_{\mathsf{PAS}}\) score. Note that for fair comparisons with Standard CP, we set the threshold in Standard with \(\mathsf{PAS}\) to achieve marginal coverage \(1-\alpha\), i.e., we fix \(t\) so that \({\mathbb P}(Y \in \widehat{{\mathcal C}}_{\mathsf{PAS}}(X)) = 1-\alpha\).
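A minimal sketch of the \(\mathsf{PAS}\) score and of Standard CP run with it, reusing the `standard_threshold` helper sketched earlier; the arrays `cal_probs`, `test_probs`, and `prevalence` are hypothetical placeholders for \(\hat p(y|x)\) on the calibration/test points and \(\hat p(y)\).

```python
import numpy as np

def pas_scores(probs, prevalence):
    """Prevalence-adjusted softmax scores s_PAS(x, y) = -p_hat(y|x) / p_hat(y).

    probs:      shape (n_samples, n_classes), estimated p_hat(y|x)
                (e.g. softmax outputs of the classifier).
    prevalence: shape (n_classes,), estimated p_hat(y)
                (e.g. empirical label frequencies on the training set).
    """
    return -probs / prevalence[None, :]

# Standard CP with PAS (sketch): score calibration points at their true
# labels, take the usual quantile, then threshold the test scores.
# cal_s = pas_scores(cal_probs, prevalence)[np.arange(len(cal_labels)), cal_labels]
# q = standard_threshold(cal_s, alpha, n_classes)
# pred_set = np.flatnonzero(pas_scores(test_probs, prevalence)[0] <= q)
```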

Then, this achieves the desired marginal coverage guarantee while (approximately) optimally trading off set size and macro-coverage. In the visualization below, we also illustrate the variation of the (empirical) coverage per class w.r.t. \(\alpha\) (evaluated on the test set).


Weighted-macro coverage

Recall that macro-coverage is the unweighted average of the class-conditional coverages. However, in some settings it may be more important to cover some classes than others; in these settings, it makes sense to instead optimize a weighted average of the class-conditional coverages. Given user-chosen class weights \(\omega(y)\) for \(y \in {\mathcal Y}\) that sum to one, we can define the \(\omega\)-weighted macro-coverage as

\[ \begin{align} \mathrm{MacroCov}_{\omega}({\mathcal C}) = \sum_{y \in {\mathcal Y}} \omega(y) {\mathbb P}(Y \in {\mathcal C}(X) \mid Y = y). \end{align} \tag{5}\]

Proposition 4 (Weighted macro-coverage and prediction set) The solutions of Equation 3 when \(\mathrm{MacroCov}\) is replaced with \(\mathrm{MacroCov}_{\omega}\) are of the form \[ \begin{align} {\mathcal C}^*(x) = \left\{ y \in {\mathcal Y}: \omega(y) \dfrac{p(y|x)}{p(y)} \geq t\right\}, \end{align} \] for some threshold \(t\) that depends on \(\omega\) and \(\alpha\).

Proof

This proof mainly relies on Lemma 1. We assume that \(X\) has a density \(p\) with respect to the Lebesgue measure \(\lambda\), but the proof can be adapted for any measure. Define the functions \(f(x,y) := \omega(y)p(x|y)\) and \(g(x,y) = p(x)\) and the measure \(\mu = \lambda \times \Big( \sum_{y\in {\mathcal Y}} \delta_y\Big)\), which is the product measure between the Lebesgue measure (denoted by \(\lambda\)) and the counting measure on \({\mathcal Y}\). Let \(\nu = 1-\alpha\). Observe that plugging these values of \(f\), \(g\), \(\mu\), and \(\nu\) into Equation 6 yields Equation 3 with \(\mathrm{MacroCov}\) replaced by \(\mathrm{MacroCov}_{\omega}\), because \[\begin{align*} \int_{{\mathcal X}\times {\mathcal Y}}1_{y\in {\mathcal C}(x)}f(x,y)d \mu(x,y) &= \sum_{y\in {\mathcal Y}} \omega(y) \int_{\mathcal X}1_{y\in {\mathcal C}(x)} p(x|y) dx \\ &= \sum_{y\in {\mathcal Y}} \omega(y) {\mathbb P}\left( y \in {\mathcal C}(X) \mid Y=y \right), \end{align*}\] and \[\begin{align*} \int_{{\mathcal X}\times {\mathcal Y}}1_{y\in {\mathcal C}(x)} g(x,y)d\mu(x,y) &= \sum_{y\in {\mathcal Y}} \int_{\mathcal X}1_{y\in {\mathcal C}(x)} p(x) dx \\ &= \int_{\mathcal X}\sum_{y\in {\mathcal Y}} 1_{y\in {\mathcal C}(x)} p(x) dx \\ &= \int_{\mathcal X}|{\mathcal C}(x)| p(x) dx \\ &= \mathbb{E}\left( |{\mathcal C}(X)| \right). \end{align*}\] By Lemma 1, it follows that the optimal solution to this problem is given by Equation 8, which for our choice of \(f\) and \(g\) can be rewritten as \[{\mathcal C}_{1-\alpha}^*(x) = \{y \in {\mathcal Y}: \frac{\omega(y)p(y|x)}{p(y)} \geq t_{1-\alpha}\}\] by observing that \[\begin{equation*} \frac{f(x,y)}{g(x,y)} = \frac{\omega(y)p(x|y)}{p(x)} = \frac{\omega(y)p(x,y)}{p(x)p(y)} =\frac{\omega(y)p(y|x)}{p(y)}\,. \end{equation*}\]

We can approximate these optimal sets by running standard CP with the weighted prevalence-adjusted softmax (WPAS), \[ \begin{align} s_{\mathsf{WPAS}}(x,y) := - \omega(y) \frac{\hat{p}(y|x)}{\hat{p}(y)}, \end{align} \] as the score function.

Again, for fair comparisons with Standard CP, we set the threshold in Standard with WPAS to achieve marginal coverage \(1-\alpha\), i.e., we fix \(t\) so \({\mathbb P}(Y \in \widehat{{\mathcal C}}_{\mathsf{WPAS}}(X)) = 1-\alpha\).

Weighted score adapted to handle endangered species

Motivated by plant conservation, we design weights for the weighted prevalence-adjusted softmax (WPAS) score to target coverage of at-risk species in Pl@ntNet-300K.

Let \(\mathcal{Y}_{\text{at-risk}} \subseteq \mathcal{Y}\) be the set of at-risk species. The coverage of at-risk species is weighted \(\lambda \geq 1\) times more than the coverage of other species, so

\[ \omega(y) = \begin{cases} \frac{\lambda}{W} & \text{if } y \in \mathcal{Y}_{\text{at-risk}} \\ \frac{1}{W} & \text{otherwise}, \end{cases} \]

where \(W = \lambda|\mathcal{Y}_{\text{at-risk}}| + |\mathcal{Y} \setminus \mathcal{Y}_{\text{at-risk}}|\) is a normalizing factor to ensure \(\sum_{y \in \mathcal{Y}} \omega(y) = 1\). We will show the result of running WPAS with this \(\omega\) in the Pl@ntNet-300K experimental results below.
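A minimal sketch of these weights and of the corresponding \(\mathsf{WPAS}\) score (again with hypothetical array names):

```python
import numpy as np

def at_risk_weights(n_classes, at_risk_idx, lam):
    """Class weights: lambda / W for at-risk classes, 1 / W for the others."""
    w = np.ones(n_classes)
    w[at_risk_idx] = lam
    return w / w.sum()  # dividing by W so that the weights sum to one

def wpas_scores(probs, prevalence, weights):
    """Weighted PAS score s_WPAS(x, y) = -omega(y) * p_hat(y|x) / p_hat(y)."""
    return -weights[None, :] * probs / prevalence[None, :]

# Example: up-weight at-risk species by a factor lambda = 100
# omega = at_risk_weights(n_classes, at_risk_idx, lam=100)
# scores = wpas_scores(test_probs, prevalence, omega)
```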

Summary of Propositions 1-4

It might help readers to compare at a glance the objective and the quantity being thresholded in each approach (for instance \(p(y\mid x)\) vs \(p(y\mid x)/p(y)\)). Below is a compact synthesis of the propositions discussed above.

Theoretical synthesis (coverage \(\geq 1-\alpha\)):

  • Marginal coverage: include \(y\) if \(p(y\mid x) \ge t_{\alpha}\) (same threshold for all \(y\))
  • Class-conditional coverage: include \(y\) if \(p(y\mid x) \geq t_{\alpha,y}\) (class-specific threshold)
  • Macro-coverage: include \(y\) if \(\dfrac{p(y\mid x)}{p(y)} \geq t_{\alpha}\) (prevalence-adjusted)
  • Weighted macro-coverage: include \(y\) if \(\omega(y)\dfrac{p(y\mid x)}{p(y)} \geq t_{\alpha}\) (user-specified weights \(\omega\))

In each case, \(t_{\alpha}\) is the threshold that ensures \(1-\alpha\) marginal coverage (or in the case of class-conditional coverage, \(t_{\alpha,y}\) is chosen to ensure \(1-\alpha\) class-conditional coverage for class \(y\), which implies \(1-\alpha\) marginal coverage after marginalizing over classes).

Conformal prediction in action on Pl@ntNet-300K

We now turn back to the Pl@ntNet-300K dataset presented earlier.

Evaluation metrics

We introduce here some evaluation metrics of interest. Let \(\hat{c}_y\) denote the empirical coverage for class \(y\) (evaluated on the test set), and let \(1-\alpha\) be the target coverage level. We leverage in particular the per-class coverage \(\hat{c}_y\), through class-balanced averages: the at‑risk class-coverage average \(\tfrac{1}{|\mathcal{Y}_{\mathrm{at\text{-}risk}}|}\sum_{y\in\mathcal{Y}_{\mathrm{at\text{-}risk}}}\hat{c}_y\) and the not‑at‑risk average computed analogously.

Evaluation metrics (\(\hat{c}_y\): empirical coverage for class \(y\)):

  • Avg. set size: \(\frac{1}{N}\sum_{i=1}^{N}|\mathcal{C}(X_i^{\mathcal{T}})|\) (lower is better)
  • MarginalCov: \(\frac{1}{N}\sum_{i=1}^{N}\mathbb{1}\{Y_i^{\mathcal{T}}\in\mathcal{C}(X_i^{\mathcal{T}})\}\) (higher is better)
  • MacroCov: \(\frac{1}{|\mathcal{Y}|}\sum_{y\in \mathcal{Y}}\hat{c}_y\) (higher is better)
  • FracBelow50%: \(\frac{1}{|\mathcal{Y}|}\sum_{y\in \mathcal{Y}} \mathbb{1}\{\hat{c}_y \le 0.5\}\) (lower is better)
  • UnderCovGap: \(\frac{1}{|\mathcal{Y}|}\sum_{y\in \mathcal{Y}} \max(1-\alpha-\hat{c}_y,0)\) (lower is better)
  • At-risk average \(\hat{c}_y\): \(\tfrac{1}{|\mathcal{Y}_{\mathrm{at\text{-}risk}}|} \sum_{y \in \mathcal{Y}_{\mathrm{at\text{-}risk}}} \hat{c}_y\) (higher is better)
  • Not-at-risk average \(\hat{c}_y\): \(\tfrac{1}{|\mathcal{Y} \setminus \mathcal{Y}_{\mathrm{at\text{-}risk}}|} \sum_{y \in \mathcal{Y} \setminus \mathcal{Y}_{\mathrm{at\text{-}risk}}} \hat{c}_y\) (higher is better)
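Given prediction sets on the test set, all of these metrics reduce to simple averages of the per-class empirical coverages \(\hat c_y\). A minimal NumPy sketch (with hypothetical inputs: `covered`, a boolean array indicating whether each test point's true label was covered, `set_sizes`, and `test_labels`):

```python
import numpy as np

def per_class_coverage(covered, test_labels, n_classes):
    """Empirical coverage c_hat_y for each class y (assumes every class
    appears at least once in the test set)."""
    return np.array([covered[test_labels == y].mean() for y in range(n_classes)])

def summarize(covered, set_sizes, test_labels, n_classes, alpha, at_risk_idx):
    """Compute the evaluation metrics listed above from per-class coverages."""
    c_hat = per_class_coverage(covered, test_labels, n_classes)
    not_at_risk_idx = np.setdiff1d(np.arange(n_classes), at_risk_idx)
    return {
        "avg_set_size": set_sizes.mean(),
        "marginal_cov": covered.mean(),
        "macro_cov": c_hat.mean(),
        "frac_below_50": (c_hat <= 0.5).mean(),
        "under_cov_gap": np.maximum(1 - alpha - c_hat, 0).mean(),
        "at_risk_avg_cov": c_hat[at_risk_idx].mean(),
        "not_at_risk_avg_cov": c_hat[not_at_risk_idx].mean(),
    }
```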

Results

We present a summary of the results obtained on this dataset. We consider the softmax score function described above as the initial conformal score, and the base model is a ResNet-50 (He et al. 2016) trained using the standard cross-entropy loss. Code for reproducing the experiments is available at https://github.com/tiffanyding/long-tail-conformal. We consider four different values of \(\alpha\) in the following, but the behaviors remain similar irrespective of this choice.

Overall analysis

Let us start with a synthesis over all classes/species.

When targeting set size and class-conditional or macro-coverage, it is more effective to optimize for this trade-off directly than to trade off set size and marginal coverage. Adjusting \(\alpha\) in Standard CP is a plausible way to target class-conditional coverage, but the results show that this does not optimally navigate the set size/class-conditional coverage trade-off (Pareto front), which is our goal. In comparison, our methods, which explicitly optimize macro-coverage, consistently achieve better trade-offs than Standard CP.

Classwise CP should generally be avoided in long-tail settings, as comparable class-conditional and macro-coverage can be achieved with significantly smaller sets using our proposed methods (in the figure above, its prediction sets nearly include the full set of classes).

Standard with \(\mathsf{PAS}\) is Pareto optimal, in the sense that at any marginal coverage level, no method simultaneously achieves both a smaller set size and better class-conditional/macro-coverage. This suggests that Standard with \(\mathsf{PAS}\) is a good starting place for practitioners due to its simplicity and strong performance on all metrics.

At-risk species analysis

We now turn to a more detailed analysis of at-risk species. To this end, we present results for the Standard method applied to the Pl@ntNet-300K dataset, using three conformal score functions: softmax, PAS, and WPAS with \(\lambda \in \{1, 10, 10^2, 10^3\}\). Our findings show that increasing \(\lambda\) in WPAS enhances the class-conditional coverage for at-risk classes, as measured by \(\hat{c}_y\), the empirical class-conditional coverage for class \(y\). We summarize these results by computing the “at-risk average \(\hat{c}_y\)” as \((1/|\mathcal{Y}_{\text{at-risk}}|)\sum_{y \in \mathcal{Y}_{\text{at-risk}}} \hat{c}_y\), and similarly for the “not-at-risk average \(\hat{c}_y\)”.

We observe that Standard with \(\mathsf{WPAS}\) improves the class-conditional coverage of at-risk classes relative to Standard with softmax or \(\mathsf{PAS}\). Comparing \(\mathsf{WPAS}\) to \(\mathsf{PAS}\), we see that increasing \(\lambda\), the amount by which we up-weight at-risk classes, leads to larger improvements in the class-conditional coverage of at-risk classes, as expected. These increases are “paid for” by a mild increase in average set size and have no discernible effect on the class-conditional coverage of not-at-risk classes, which is appealing from a practical perspective.

Conclusion

In summary, conformal prediction provides a principled and flexible framework for uncertainty quantification in plant species identification, especially in the presence of long-tailed and imbalanced data. By carefully choosing the coverage notion, we can achieve both reliable guarantees and practical interpretability, empowering researchers and practitioners to make more informed decisions in challenging real-world settings. For the interested reader, the associated paper not only considers new conformal score functions but also proposes another approach that requires modifying the conformal procedure itself.

References

Ding, T., J.-B. Fermanian, and J. Salmon. 2025. “Conformal Prediction for Long-Tailed Classification.” ArXiv e-Prints. https://arxiv.org/abs/2507.06867.
Garcin, Camille, Alexis Joly, Pierre Bonnet, Jean-Christophe Lombardo, Antoine Affouard, Mathias Chouet, Maximilien Servajean, Titouan Lorieul, and Joseph Salmon. 2021. “Pl@ntNet-300K: A Plant Image Dataset with High Label Ambiguity and a Long-Tailed Distribution.” In NeurIPS Datasets and Benchmarks 2021.
He, Kaiming, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. “Deep Residual Learning for Image Recognition.” In CVPR, 770–78.
Neyman, Jerzy, and Egon Sharpe Pearson. 1933. “IX. On the problem of the most efficient tests of statistical hypotheses.” Philosophical Transactions of the Royal Society of London. Series A, Containing Papers of a Mathematical or Physical Character 231 (694-706): 289–337.
Romano, Yaniv, Matteo Sesia, and Emmanuel J. Candès. 2020. “Classification with Valid and Adaptive Coverage.” In NeurIPS.
Sadinle, Mauricio, Jing Lei, and Larry Wasserman. 2019. “Least Ambiguous Set-Valued Classifiers with Bounded Error Levels.” J. Amer. Statist. Assoc. 114 (525): 223–34.
Vovk, Vladimir, Valentina Fedorova, Ilia Nouretdinov, and Alexander Gammerman. 2016. “Criteria of Efficiency for Conformal Prediction.” In Conformal and Probabilistic Prediction with Applications: 5th International Symposium, COPA 2016, Madrid, Spain, April 20-22, 2016, Proceedings 5, 23–39. Springer.
Vovk, Vladimir, Alex Gammerman, and Glenn Shafer. 2005. Algorithmic Learning in a Random World. Springer.

Appendix

Technical detour

As a key technical tool for our proofs, we state the following version of the Neyman–Pearson Lemma. The proof may be skipped on first reading.

Lemma 1 (Neyman and Pearson (1933)) Let \(\mu\) be a measure on \({\mathcal X}\times {\mathcal Y}\) and let \(f,g: {\mathcal X}\times {\mathcal Y}\to {\mathbb R}_{\geq 0}\) be two non-negative functions. For \(\nu \geq 0\), consider the problem \[ \begin{align*} &\min_{{\mathcal C}:{\mathcal X}\to 2^{{\mathcal Y}}} \int_{{\mathcal X}\times {\mathcal Y}}\mathbb{1}\{y \in {\mathcal C}(x)\} g(x,y)d\mu(x,y) \\ &\, \text{s.t.} \int_{{\mathcal X}\times {\mathcal Y}}\mathbb{1}\{y\in{\mathcal C}(x)\} f(x,y)d\mu(x,y)\geq \nu. \end{align*} \tag{6}\]

If there exists \(t_\nu\) such that \[ \begin{equation} \int_{{\mathcal X}\times {\mathcal Y}} \mathbb{1}\{f(x,y) \geq t_\nu \cdot g(x,y)\} f(x,y) d\mu(x,y) = \nu, \end{equation} \tag{7}\] then \[ \begin{align} {\mathcal C}^{*}_{\nu}(x) = \{y \in {\mathcal Y}: f(x,y) \geq t_\nu \cdot g(x,y) \} \end{align} \tag{8}\]

is the optimal solution to (6). Moreover, any other minimizer \({\mathcal C}\) of (6) is equal to \({\mathcal C}_{\nu}^*\), \(\mu_f\)- and \(\mu_g\)-almost everywhere, i.e., for \(h\in \{f,g\}\), \(\mu_h\big( \{ (x,y) : y\in {\mathcal C}^*_\nu(x) \}\Delta \{ (x,y) : y\in {\mathcal C}(x) \}\big)=0\), where \(\Delta\) denotes the symmetric difference between the sets and \(\mu_h\) is defined as \(\mu_h(A) = \int_A h d\mu\).

Proof

We will show that \({\mathcal C}^*(x) = \{y :f(x,y) \geq t_\nu g(x,y) \}\) is the optimal solution to (6). We first demonstrate that it is an optimal solution and then that it is unique.

Optimality. By definition of \(t_\nu\), we have \[ \begin{equation} \int_{{\mathcal X}\times {\mathcal Y}} \mathbb{1}\{ y \in {\mathcal C}^*(x)\} f(x,y) d\mu(x,y) = \nu, \end{equation} \tag{9}\] which is trivially greater than or equal to \(\nu\), so \({\mathcal C}^*(x)\) is indeed a feasible solution. To show that \({\mathcal C}^*(x)\) is optimal, we must argue that it achieves a smaller objective value than any other feasible solution. Let \({\mathcal C}: {\mathcal X}\to 2^{\mathcal Y}\) be any other set-generating procedure that satisfies the constraint in (6). We want to show that \[ \begin{equation*} \int_{{\mathcal X}\times {\mathcal Y}} \mathbb{1}\{ y \in {\mathcal C}^*(x)\} g(x,y) d\mu(x,y) \leq \int_{{\mathcal X}\times {\mathcal Y}} \mathbb{1}\{ y \in {\mathcal C}(x)\} g(x,y) d\mu(x,y)\,. \end{equation*} \] We prove this by showing their difference is nonpositive: \[ \begin{align*} &\int_{{\mathcal X}\times {\mathcal Y}} \mathbb{1}\{ y \in {\mathcal C}^*(x)\} g(x,y) d\mu(x,y) - \int_{{\mathcal X}\times {\mathcal Y}} \mathbb{1}\{ y \in {\mathcal C}(x)\} g(x,y) d\mu(x,y) \\ &= \int_{{\mathcal X}\times {\mathcal Y}} \Big(\mathbb{1}\{ y \in {\mathcal C}^*(x) \backslash {\mathcal C}(x)\} - \mathbb{1}\{ y \in {\mathcal C}(x)\backslash {\mathcal C}^*(x)\}\Big) g(x,y) d\mu(x,y) \\ & \leq \frac{1}{t_\nu}\int_{{\mathcal X}\times {\mathcal Y}} \Big(\mathbb{1}\{ y \in {\mathcal C}^*(x) \backslash {\mathcal C}(x)\} - \mathbb{1}\{ y \in {\mathcal C}(x)\backslash {\mathcal C}^*(x)\} \Big) f(x,y) d\mu(x,y) \\ &= \frac{1}{t_\nu} \int_{{\mathcal X}\times {\mathcal Y}} \Big( \mathbb{1}\{ y \in {\mathcal C}^*(x) \} - \mathbb{1}\{ y \in {\mathcal C}(x)\}\Big) f(x,y) d\mu(x,y) \\ & \leq \frac{1}{t_\nu} \left( \nu - \nu \right) \leq 0\,. \end{align*} \] The first inequality follows from \(y \in {\mathcal C}^*(x) \Longleftrightarrow g(x,y) \leq t_\nu^{-1}f(x,y)\). The second inequality comes from applying the equality stated in (9) to the first integral and then using the definition of \({\mathcal C}\) as satisfying the constraint of (6) to lower bound the second integral.

Uniqueness. Let \({\mathcal C}: {\mathcal X}\to 2^{{\mathcal Y}}\) be another optimal set-generating procedure, so it achieves the same objective value as \({\mathcal C}^*\), \[ \begin{equation} \int_{{\mathcal X}\times {\mathcal Y}} \!\!\!\! \mathbb{1}\{ y \in {\mathcal C}(x)\} g(x,y) d\mu(x,y) = \! \int_{{\mathcal X}\times {\mathcal Y}} \!\!\!\!\mathbb{1}\{ y \in {\mathcal C}^*(x)\} g(x,y) d\mu(x,y)\,, \end{equation} \tag{10}\] and it is a feasible solution, \[ \begin{equation*} \int_{{\mathcal X}\times {\mathcal Y}} \mathbb{1}\{ y \in {\mathcal C}(x)\} f(x,y) d\mu(x,y) \geq \nu\,. \end{equation*} \] Let us first note the following non-negativity relationship, for all \((x,y) \in {\mathcal X}\times {\mathcal Y}\): \[ \begin{equation}\label{eq:aux_uniq2} \left( \mathbb{1} \{ y \in {\mathcal C}^*(x)\} - \mathbb{1}\{ y \in {\mathcal C}(x)\} \right)\left( f(x,y) - t_\nu g(x,y) \right) \geq 0\,. \end{equation} \tag{11}\] Integrating (11) over \({\mathcal X}\times {\mathcal Y}\), then applying (10), we get \[ \begin{equation*} \int_{{\mathcal X}\times {\mathcal Y}} \mathbb{1}\{ y \in {\mathcal C}^*(x)\} f(x,y) d\mu(x,y) - \int_{{\mathcal X}\times {\mathcal Y}} \mathbb{1}\{ y \in {\mathcal C}(x)\} f(x,y) d\mu(x,y)\geq 0\,. \end{equation*} \]

Combining this with the definition of \({\mathcal C}^*\) and the feasibility of \({\mathcal C}\), we must have \(\int_{{\mathcal X}\times {\mathcal Y}} \mathbb{1}\{ y \in {\mathcal C}(x)\} f(x,y) d\mu(x,y) = \nu\). Thus, (11) integrates to zero. Since we also know that (11) is nonnegative, it must be true that, \(\mu\)-almost everywhere, \[ \begin{equation*} \left( \mathbb{1}\{ y \in {\mathcal C}^*(x)\} - \mathbb{1}\{ y \in {\mathcal C}(x)\} \right)\left( f(x,y) - t_\nu g(x,y) \right) = 0. \end{equation*} \] For \((x,y) \in {\mathcal X}\times {\mathcal Y}\) such that \(f(x,y) \neq t_\nu g(x,y)\), we thus have \(\mathbb{1}\{ y \in {\mathcal C}^*(x)\}= \mathbb{1}\{ y \in {\mathcal C}(x)\}\), which in turn implies, using the definition of \({\mathcal C}^*(x) = \{y :f(x,y) \geq t_\nu g(x,y) \}\), that \(\mu\)-almost everywhere, \({\mathcal C}(x) \subseteq {\mathcal C}^*(x)\) and \({\mathcal C}^*(x) \subseteq {\mathcal C}(x) \cup \{ y : f(x,y) = t_\nu g(x,y)\}\). It remains to show that the sets are equal almost everywhere by showing that the set \[ \begin{align*} D &:= \big\{ (x,y) : y \notin {\mathcal C}(x) \text{ and } y \in {\mathcal C}^*(x) \big\} \\ &= \big\{ (x,y) : y \notin {\mathcal C}(x) \text{ and } f(x,y) = t_\nu g(x,y)\big\} \end{align*} \] is of measure \(0\) under \(\mu_f\). By definition of \({\mathcal C}^*\), \[ \begin{align*} \nu &= \int_{{\mathcal X}\times {\mathcal Y}} \mathbb{1}\{ y \in {\mathcal C}^*(x)\} f(x,y) d\mu(x,y) \\ &= \int_{{\mathcal X}\times {\mathcal Y}} \mathbb{1}\{ y \in {\mathcal C}(x)\} f(x,y) d\mu(x,y) + \int_{{\mathcal X}\times {\mathcal Y}} \mathbb{1}\{ (x,y) \in D\} f(x,y) d\mu(x,y) \\ &= \nu + \int_{D} f d\mu \,. \end{align*} \] Thus, we must have \(\mu_f(D) = 0\). Since \(t_\nu g=f\) on \(D\), we also get that \(\mu_g(D) =0\).

Citation

BibTeX citation:
@online{ding2025,
  author = {Ding, Tiffany and Fermanian, Jean-Baptiste and Salmon,
    Joseph},
  title = {Optimal Prediction Sets for Plant Identification: An
    Interactive Guide},
  date = {2025-10-20},
  url = {https://josephsalmon.eu/blog/long-tail/},
  langid = {en}
}