Large Language Models Must Be Taught to Know What They Don’t Know (2024)


Sanyam Kapoor*1, Nate Gruver*1, Manley Roberts2, Katherine Collins3, Arka Pal2, Umang Bhatt1, Adrian Weller3, Samuel Dooley2, Micah Goldblum1, Andrew Gordon Wilson1
1New York University  2Abacus AI  3Cambridge University
*Equal contribution. Order decided by coin flip. Correspondence to: sk6876@nyu.edu & nvg7279@nyu.edu

Abstract

When using large language models (LLMs) in high-stakes applications, we need to know when we can trust their predictions. Some works argue that prompting high-performance LLMs is sufficient to produce calibrated uncertainties, while others introduce sampling methods that can be prohibitively expensive. In this work, we first argue that prompting on its own is insufficient to achieve good calibration and then show that fine-tuning on a small dataset of correct and incorrect answers can create an uncertainty estimate with good generalization and small computational overhead. We show that a thousand graded examples are sufficient to outperform baseline methods and that training through the features of a model is necessary for good performance and tractable for large open-source models when using LoRA. We also investigate the mechanisms that enable reliable LLM uncertainty estimation, finding that many models can be used as general-purpose uncertainty estimators, applicable not just to their own uncertainties but also to the uncertainties of other models. Lastly, through a user study, we show that uncertainty estimates inform how humans use LLMs in human-AI collaborative settings.

1 Introduction

“I have high cortisol but low ACTH on a dexamethasone suppression test. What should I do?” If the answer to such a question is given without associated confidence, it is not actionable, and if the answer is presented with erroneously high confidence, then acting on the answer is dangerous. Whether large language models (LLMs) can benefit society and reliably be used for decision making hinges largely on whether they can accurately represent uncertainty over the correctness of their output.

There is anything but consensus on whether LLMs accurately represent uncertainty, or even how we should approach uncertainty representation with language models. Claims regarding language models’ ability to estimate uncertainty vary widely, with some works suggesting that language models are increasingly capable of estimating their uncertainty directly through prompting, without any fine-tuning or changes to the training data [25, 51], and others suggesting that LLMs remain far too overconfident in their predictions [59, 60]. Uncertainty estimation in LLMs is further complicated by the linguistic variation of freeform generation, which cannot be exhaustively accounted for during training. LLM practitioners are therefore faced with the challenge of deciding which estimation method to use.

One particular dichotomy in uncertainty estimation methods for language models centers around whether the estimates are black- or white-box. Black-box estimates do not require training and can be used with closed-source models like GPT-4 [1] or Gemini [48], while white-box methods require training parameters on a calibration dataset. Although black-box estimates have become popular with the rise of restricted models, the increased availability of strong open-source models, such as LLaMA [53] or Mistral [24], has made more effective white-box methods more accessible.

In this paper, we perform a deep investigation into uncertainty calibration of LLMs, with findings that advance the debate about necessary interventions for good calibration. In particular, we consider whether it’s possible to have good uncertainties over correctness (rather than tokens) without intervention, how we can best use labeled correctness examples, how well uncertainty generalizes across distribution shifts, and how we can use LLM uncertainty to assist human decision making.

First, we find that fine-tuning for better uncertainties (Figure 1) provides faster and more reliable uncertainty estimates, while using a relatively small number of additional parameters. The resulting uncertainties also generalize to new question types and tasks, beyond what is present in the fine-tuning dataset. We further provide a guide to teaching language models to know what they don’t know using a calibration dataset. Contrary to prior work, we start by showing that current zero-shot, black-box methods are ineffective or impractically expensive in open-ended settings (Section 4). We then show how to fine-tune a language model for calibration, exploring the most effective parameterization (e.g., linear probes vs. LoRA) and the amount of data required for good generalization (Section 5). To test generalization, we evaluate uncertainty estimates on questions with similar formatting to the calibration data as well as questions that test robustness to significant distribution shifts. Lastly, we consider the underlying mechanisms that enable fine-tuning LLMs to estimate their own uncertainties, showing ultimately that models can be used not just to estimate their own uncertainties but also the uncertainties of other models (Section 6). Beyond offline evaluation, if language models are to have a broad societal impact, it will be through assisting with human decision making. We conduct a user study demonstrating ways LLM uncertainty can affect AI-human collaboration (Section 7). Our code is available at https://github.com/activatedgeek/calibration-tuning.

[Figure 1]

2 Related Work

As generative models, LLMs naturally express a distribution over possible outcomes and should capture variance in the underlying data. On multiple-choice tests, where the answer is a single token, an LLM’s predicted token probabilities can lead to a calibrated distribution over the answer choices [43]. When answers consist of entire sentences, however, language model likelihoods become a less reliable indicator of uncertainty because probabilities must be spread over many phrasings of the same concept. Kuhn et al. [30] attempt to mitigate this issue by clustering semantically equivalent answers. However, such methods are hindered by substantial computational overhead: accounting for equivalent phrasings of the same semantic content requires enumerating a large space of sentences and clustering for semantic similarity with an auxiliary model.

Because LLMs are trained on text written by humans, it is possible for them to learn concepts like “correctness” and probabilities, and to express uncertainty through these abstractions. Leveraging this observation, Kadavath et al. [25] and Tian et al. [51] show that careful prompting can produce uncertainty estimates in text that grow more calibrated as model capabilities increase. In light of this phenomenon, language models might gain an intrinsic notion of uncertainty, applicable across a broad range of topics. In the same vein, Burns et al. [9] and Azaria and Mitchell [4] find that pre-trained models have hidden representations which are predictive of truthfulness and use linear probes to classify a model’s correctness.

While these studies suggest a promising trend towards calibration, we find that the story is slightly more complicated. Black-box methods often fail to generate useful uncertainties for popular open-source models, and a careful fine-tuning intervention is necessary. In this way, our findings are closer to those of Xiong et al. [59], who show that zero-shot uncertainty estimates have limited ability to discriminate between correct and incorrect answers, even when used with the best available models (e.g., GPT-4). We go further by showing that black-box methods struggle on open-ended generation, which is both practically important and defined by different challenges than the multiple choice evaluations of prior work. Moreover, while others have focused on improving black-box methods [30, 51, 59], we embrace open-source models and their opportunities for fine-tuning, showing that we can maintain the speed of prompting methods while dramatically boosting performance.

Our work also contrasts with prior work on fine-tuning for uncertainties in several key ways. While we build on prior work from Lin et al. [33] and Zhang et al. [62] that poses uncertainty estimation as text completion on a graded dataset, we introduce several changes to the fine-tuning procedure, such as regularization to maintain similar predictions to the base model, and provide extensive ablations that yield actionable insights. For example, we show that, contrary to prior work [4], frozen features are typically insufficient for uncertainty estimates that generalize effectively, and that fine-tuning on as few as 1000 graded examples with LoRA is sufficient to generalize across practical distribution shifts. Also unlike prior work, we provide many insights into the relative performance of fine-tuning compared to black-box methods, introducing a new open-ended evaluation and showing that it displays fundamentally different trends than prior work on multiple choice questions. Although Kadavath et al. [25] also consider calibration for multiple choice questions, many of our conclusions differ. For example, while Kadavath et al. [25] suggest that language models are strongest when evaluating their own generations and subsequently posit that uncertainty estimation is linked to self-knowledge, we find that capable models can readily learn good uncertainties for the predictions of other models without any knowledge of their internals. Lastly, while many works motivate their approach with applications to human-AI collaboration, none of them test their uncertainty estimates on actual users, as we do here.

3 Preliminaries

Question answering evaluations.

In all experiments, we use greedy decoding to generate answers conditioned on questions with few-shot prompts. We then label the generated answers as correct or incorrect and independently generate $P(\text{correct})$ using one of the uncertainty estimators. For evaluation, we primarily use the popular MMLU dataset [18], which covers 57 subjects including STEM, humanities, and social sciences. Crucially, however, we expand the original multiple choice (MC) setting with a new open-ended (OE) setting. In the open-ended setting, we do not provide answer choices, and the language model must generate an answer that matches the ground truth answer choice. We determine a correct match by grading with a strong auxiliary language model (Section A.2). We verify that grading via language models provides a cheap and effective proxy for gold-standard human grading (Section A.3), consistent with related findings [10].

Metrics. A model that assigns confidence $p$ to an answer is well-calibrated if its answer is correct $p$ percent of the time it assigns that confidence. Calibration is typically measured using expected calibration error (ECE) [37], which compares empirical frequencies with estimated probabilities through binning (Section A.4). A lower ECE is better, and an ECE of 0 corresponds to a perfectly calibrated model. In addition to calibration, we measure the area under the receiver operating characteristic curve (AUROC) of the model’s confidence. High AUROC indicates the ability to separate answers that are likely to be correct from answers that are likely to be incorrect, a setting typically called selective prediction.
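As a concrete illustration, a minimal sketch of how these metrics can be computed from per-answer confidences and binary correctness labels (equal-width binning for ECE, scikit-learn for AUROC); the arrays here are illustrative placeholders.

```python
# Minimal sketch: computing ECE and AUROC from per-example confidences
# P(correct) and binary correctness labels (toy data, not real results).
import numpy as np
from sklearn.metrics import roc_auc_score

def expected_calibration_error(confidence, correct, n_bins=10):
    """Equal-width binning estimate of ECE (lower is better)."""
    confidence = np.asarray(confidence, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidence > lo) & (confidence <= hi)
        if mask.sum() == 0:
            continue
        avg_conf = confidence[mask].mean()   # average confidence in the bin
        accuracy = correct[mask].mean()      # accuracy in the bin
        ece += (mask.sum() / len(confidence)) * abs(avg_conf - accuracy)
    return ece

# Toy usage: confidences from an uncertainty estimator and graded answers.
conf = np.array([0.9, 0.8, 0.6, 0.95, 0.4, 0.7])
corr = np.array([1, 1, 0, 1, 0, 1])
print("ECE:  ", expected_calibration_error(conf, corr))
print("AUROC:", roc_auc_score(corr, conf))  # selective-prediction quality
```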

Temperature scaling. Temperature scaling [42, 17] improves the calibration of a classifier by scaling its logits by $\frac{1}{T}$ (where $T$ is the temperature) before applying the softmax function. A high temperature scales the softmax probabilities towards a uniform distribution, while a low temperature collapses the distribution around the most probable output. The temperature parameter is learned on held-out data, typically taken from the same distribution as the training set.
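A minimal sketch of temperature scaling, assuming held-out logits and labels are available; a single scalar $T$ is fit by minimizing the negative log-likelihood on the held-out set.

```python
# Sketch of temperature scaling (assumes held-out logits and labels exist).
import torch
import torch.nn.functional as F

def fit_temperature(logits, labels, steps=200, lr=0.01):
    """Learn a scalar temperature T by minimizing NLL on held-out data."""
    log_t = torch.zeros(1, requires_grad=True)  # parameterize T = exp(log_t) > 0
    optimizer = torch.optim.Adam([log_t], lr=lr)
    for _ in range(steps):
        optimizer.zero_grad()
        loss = F.cross_entropy(logits / log_t.exp(), labels)
        loss.backward()
        optimizer.step()
    return log_t.exp().item()

# Toy usage with random "held-out" logits over 4 answer choices.
logits = torch.randn(128, 4) * 3.0   # deliberately overconfident logits
labels = torch.randint(0, 4, (128,))
T = fit_temperature(logits, labels)
calibrated_probs = torch.softmax(logits / T, dim=-1)
```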

4 Do We Get Good Uncertainties Out-of-the-Box?

In this section, we focus on black-box methods for estimating a language model’s uncertainty. (We consider access to a model’s samples and token-level likelihoods as black-box; some models do not expose likelihoods directly, but they can be approximated through sampling.) Due to computational cost, we focus on methods that require a single sample or forward pass, and defer sampling-based methods to the next section.

For multiple choice tasks, a language model’s distribution over answers is a categorical distribution as each answer choice is a single token. Early work on LLMs, such as GPT-3, showed that this distribution is often poorly calibrated [18]. Fundamentally, however, maximum likelihood training should encourage calibration over individual tokens [15], and the calibration of recent LLMs appears to improve in proportion with their accuracy [43].

In open-ended generation, on the other hand, answers are not limited to individual tokens nor a prescribed set of possibilities, which introduces multiple sources of uncertainty. The probability assigned to an answer can be low not only because it is conceptually unlikely to be correct, but also because probability mass must be spread over many possible phrasings (and normalization is intractable), or because the answer is an unusual phrasing of the correct information; the uncertainty is over the probability of a sequence of tokens, not over correctness. For example, imagine a multiple-choice test in which we add an additional answer choice that is a synonym of another. A sensible language model would assign equal likelihood to each choice, lowering the probability it assigns to either individually. In open-ended generation the situation is similar, but even more challenging because of variable length. Adding extra tokens can artificially lower the likelihood of an answer even when it expresses the same concept, as the sequence of tokens becomes less likely with increasing length.

We demonstrate the difference between multiple-choice question answering and open-ended generation in Figure 2 (left), where we compare the AUROC of a likelihood-based method on standard MMLU and open-ended MMLU (ours). For open-ended generations, we use perplexity, $\text{PPL}(s)=\exp\left(-\frac{1}{N}\sum_{i=1}^{N}\log p(s_{i}\mid s_{<i})\right)$, where $s$ is the tokenized sequence, because it is a length-normalized metric and commonly used when token-level probabilities are exposed by the model [19]. From the AUROCs, we observe that while token-level uncertainties often improve in multiple choice as models improve, perplexity is generally not predictive of a language model’s correctness in open-ended settings and does not exhibit the same favorable scaling with the language model’s underlying ability.
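For concreteness, a minimal sketch of this length-normalized score, computed from the per-token log-probabilities of a generated answer (assumed to be exposed by the model or API); lower perplexity is then treated as higher confidence.

```python
# Sketch: length-normalized perplexity of an answer from per-token log-probs.
# Assumes log p(s_i | s_<i) is available for each generated token.
import math

def perplexity(token_logprobs):
    """PPL(s) = exp(-(1/N) * sum_i log p(s_i | s_<i))."""
    n = len(token_logprobs)
    return math.exp(-sum(token_logprobs) / n)

# Toy usage: two answers with similar per-token likelihoods but different lengths.
short_answer = [-0.3, -0.5, -0.2]
long_answer = [-0.3, -0.5, -0.2, -0.4, -0.6, -0.35]
print(perplexity(short_answer), perplexity(long_answer))
# A likelihood-based confidence could be taken as, e.g., 1 / PPL, but as shown
# in Figure 2 this is a weak predictor of correctness for open-ended answers.
```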

Because sequence likelihood (or perplexity) is limited as a confidence measure, prompting methods have become an increasingly popular alternative. Lin et al. [33] introduced the following formats, which lay the foundation for recent work [51, 62]:

Name                  | Format                                  | Confidence
Zero-Shot Classifier  | “Question. Answer. True/False: True”    | P(“ True”) / (P(“ True”) + P(“ False”))
Verbalized            | “Question. Answer. Confidence: 90%”     | float(“90%”)

In the first approach, the language model’s logits are used to create a binary classifier by scoring two possible strings denoting true and false. Similarly, in Kadavath et al. [25], the classifier takes in a slightly modified prompt, “Is the answer correct? (a) Yes (b) No”, and confidence is then computed as P(“(a)”) / (P(“(a)”) + P(“(b)”)). In the second approach (also used in [51, 59]), uncertainty estimates are sampled as text and then converted into numbers. We provide extended details in Section B.2.
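Both formats reduce to a few lines of post-processing; the sketch below is illustrative, and the log-probabilities and generated text are assumed to come from whatever model or API is being queried.

```python
# Sketch of the two black-box formats. The inputs are stand-ins for whatever
# model/API exposes token log-probs (first case) or sampled text (second case).
import math
import re

def zero_shot_classifier_confidence(logprob_true, logprob_false):
    """P(' True') / (P(' True') + P(' False')) from the two token log-probs."""
    p_true, p_false = math.exp(logprob_true), math.exp(logprob_false)
    return p_true / (p_true + p_false)

def verbalized_confidence(generated_text):
    """Parse a stated confidence such as 'Confidence: 90%' into a probability."""
    match = re.search(r"(\d+(?:\.\d+)?)\s*%", generated_text)
    return float(match.group(1)) / 100.0 if match else None

print(zero_shot_classifier_confidence(-0.1, -2.5))              # ~0.92
print(verbalized_confidence("Answer: Paris. Confidence: 90%"))  # 0.9
```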

[Figure 2]

The prospects of calibration by learning to model human language. If we view language modeling as behavior cloning [46] on human writing, the optimal outcome is a language model that recapitulates the full distribution of human writers present in the training data. Unfortunately, most humans exhibit poor calibration on tasks they are unfamiliar with [28, 29, 32], and not all pre-training data is generated by experts. Therefore, it might be unreasonably optimistic to expect black-box methods to yield calibrated uncertainties without a significant intervention. Alignment procedures (e.g., RLHF) could improve the situation by penalizing cases of poor calibration, and the resulting procedure would be akin to fine-tuning on graded data, which we explore in Section 5.

Experiments with open-source models. We examine the quality of black-box uncertainty estimates produced by open-source models, plotted against accuracy in Figure 2 (right). We use LLaMA-2 [52, 53], Mistral [24], and LLaMA-3 models, and we evaluate on open-ended MMLU to highlight how the methods might perform in a “chat-bot” setting. Because these models have open weights, we can perform apples-to-apples comparisons with methods that train through the model or access hidden representations. We see that prompting methods typically give poorly calibrated uncertainties (measured by ECE), and their calibration does not improve out-of-the-box as the base model improves. By contrast, AUROC does improve slightly with the power of the underlying model, but even the best model still lags far behind the worst model fine-tuned for uncertainty.

5 How Should We Use Labeled Examples?

Our goal is to construct an estimate of $P(\text{correct})$, the probability that the model’s answer is correct. Predicting a model’s correctness is a simple binary classification problem, which we learn on a small labeled dataset of correct and incorrect answers. There are many possible ways to parameterize $P(\text{correct})$, and we study three that vary in their number of trainable parameters and their use of prompting:

  • Probe: Following Azaria and Mitchell [4], we train a small feed-forward neural network on the last-layer features of an LLM given the prompt, question, and proposed answer as input. The probe outputs $P(\text{correct})$ while the base LLM remains frozen.

  • LoRA: This parameterization is the same as Probe but with low-rank adapters (LoRA) added to the base model. As a result, the intermediate language features of the base model can be changed to improve the correctness prediction.

  • LoRA + Prompt: Following Kadavath et al. [25], we pose correctness classification as a multiple-choice query with two options, using the target tokens “i” and “ii” to represent ‘no’ and ‘yes’ respectively. We perform LoRA fine-tuning on strings with this formatting.

With these different parameterizations, we can study how much information about uncertainty is already contained in a pre-trained model’s features. Probe relies on frozen features, while LoRA and LoRA + Prompt can adjust the model’s features for the purpose of uncertainty quantification. Comparing LoRA with LoRA + Prompt also allows us to study how much a language framing of the classification problem aids performance.
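To make the Probe parameterization concrete, the following minimal sketch (with placeholder features and labels) trains a small feed-forward head on frozen last-layer features to output $P(\text{correct})$; LoRA and LoRA + Prompt differ in that the underlying features are also updated.

```python
# Sketch of the Probe parameterization: a small MLP on frozen last-layer
# features predicting P(correct). Feature extraction is stubbed out here.
import torch
import torch.nn as nn

class CorrectnessProbe(nn.Module):
    def __init__(self, hidden_size, width=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(hidden_size, width), nn.ReLU(),
            nn.Linear(width, 1),
        )

    def forward(self, features):  # features: last-token hidden state of the LLM
        return torch.sigmoid(self.net(features)).squeeze(-1)  # P(correct)

# Toy training loop on hypothetical cached features and correctness labels.
hidden_size = 4096                          # e.g., a 7B-scale hidden size
features = torch.randn(1000, hidden_size)   # frozen LLM features (cached offline)
labels = torch.randint(0, 2, (1000,)).float()

probe = CorrectnessProbe(hidden_size)
optimizer = torch.optim.AdamW(probe.parameters(), lr=1e-3)
loss_fn = nn.BCELoss()
for epoch in range(5):
    optimizer.zero_grad()
    loss = loss_fn(probe(features), labels)
    loss.backward()
    optimizer.step()
```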

Datasets. For training, we build a diverse set of samples from a collection of benchmark datasets, similar to instruction-tuning [56]. From the list of 16 benchmark datasets in Section C.2, we use a sampled subset of approximately 20,000 examples. We hold out 2,000 data points to use as a temperature-scaling calibration set [17].

Table 1: Ablation of the KL regularization term.
Method   ECE     AUROC
w/o KL   29.9%   70.2%
w/ KL    10.8%   71.6%
Training and regularization.

We consider three base models, LLaMA-2 7B, LLaMA-2 13B, and Mistral 7B, and their instruction-tuned variants. For fine-tuning, we use 8-bit quantization and Low-Rank Adapters (LoRA) [20]. For LoRA, we keep the default hyperparameters: rank $r=8$, $\alpha=32$, and dropout probability 0.1. Each training run takes approximately 1-3 GPU days with 4 NVIDIA RTX8000 (48GB) GPUs. To keep LoRA and LoRA + Prompt in the neighborhood of the initial model, we introduce a regularization term that encourages low divergence between the predictions of the fine-tuned model and the base model (ablation in Table 1).
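The sketch below illustrates the shape of this objective, assuming logits over the two correctness targets from both the adapted and the frozen base model; the loss weighting and KL direction are illustrative simplifications rather than the exact implementation.

```python
# Sketch of a calibration-tuning loss with a KL regularizer that keeps the
# LoRA-adapted model close to the frozen base model (weights are illustrative).
import torch
import torch.nn.functional as F

def calibration_tuning_loss(adapted_logits, base_logits, correctness_labels,
                            kl_weight=1.0):
    """Cross-entropy on correctness targets plus a KL term to the base model."""
    # Binary classification over the two target tokens (e.g., "i" / "ii").
    ce = F.cross_entropy(adapted_logits, correctness_labels)

    # KL(adapted || base) over the predictions, discouraging drift of the
    # fine-tuned model away from the base model's behavior.
    kl = F.kl_div(
        F.log_softmax(base_logits, dim=-1),      # "input" (log-probs of base)
        F.log_softmax(adapted_logits, dim=-1),   # "target" (log-probs of adapted)
        log_target=True,
        reduction="batchmean",
    )
    return ce + kl_weight * kl

# Toy usage with random logits.
adapted = torch.randn(8, 2, requires_grad=True)
base = torch.randn(8, 2)
labels = torch.randint(0, 2, (8,))
loss = calibration_tuning_loss(adapted, base, labels)
loss.backward()
```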

Sampling baseline. We estimate the uncertainty by clustering generations by semantic similarity [30]. The probability of each cluster becomes the probability assigned to all sequences in that cluster. To assign an uncertainty to a prediction, we find the cluster closest to the prediction and use the probability of that cluster as our uncertainty estimate (full details in Section B.1). The clear drawback of this approach is its poor scaling. We draw $K$ samples from the model ($K=10$ in our case), and these samples must then be clustered using $O(K^2)$ comparisons with an auxiliary model of semantic similarity. Sampling methods are also complicated by their dependence on hyperparameters such as temperature or nucleus size. In the special case where the sampling parameters reduce to greedy decoding (e.g., temperature zero), the model will always assign probability one to its answer. While this behavior does align with the probability of generating the answer, it is not a useful measure of confidence.
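A schematic sketch of this baseline, where the pairwise equivalence judgment (`are_equivalent`) is a placeholder for the auxiliary semantic-similarity model; the quadratic number of comparisons is what makes the approach expensive.

```python
# Sketch of the sampling baseline: cluster K sampled answers by semantic
# equivalence and use the cluster mass of the greedy answer as its confidence.
# `are_equivalent` is a placeholder for an auxiliary NLI/grader model.
def are_equivalent(a, b):
    return a.strip().lower() == b.strip().lower()   # placeholder judgment

def cluster_confidence(greedy_answer, sampled_answers):
    clusters = []                      # each cluster is a list of answers
    for answer in sampled_answers:     # O(K^2) equivalence checks overall
        for cluster in clusters:
            if are_equivalent(answer, cluster[0]):
                cluster.append(answer)
                break
        else:
            clusters.append([answer])
    # Probability of the cluster containing the greedy answer.
    for cluster in clusters:
        if are_equivalent(greedy_answer, cluster[0]):
            return len(cluster) / len(sampled_answers)
    return 1.0 / len(sampled_answers)  # greedy answer unseen among samples

samples = ["Paris", "paris", "Lyon", "Paris", "Marseille", "Paris ", "Paris",
           "Lyon", "Paris", "paris"]
print(cluster_confidence("Paris", samples))   # 0.7
```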

Fine-tuning results. In Figure 3 (left), we compare our three fine-tuned models with black-box uncertainty methods on both multiple choice and open-ended MMLU. For multiple choice MMLU, we also include the language model’s maximum softmax probability as a baseline. Fine-tuning for uncertainty leads to significant improvements in both ECE and AUROC. While frozen features (Probe) are sufficient to outperform baselines on multiple choice MMLU, performing well on open-ended MMLU requires training through the model and prompting. Surprisingly, while sampling methods can yield good calibration, their discriminative performance is very weak. By contrast, verbal elicitation is relatively strong in discriminative performance, on par with weaker fine-tuning methods, but generally has poor calibration, even after temperature scaling.

How much data do we need? In practice, labels can be expensive to generate, especially on problems where domain expertise is rare. Therefore, it would be advantageous if fine-tuning with even a small number of examples were sufficient for building a good uncertainty estimate. In Figure 3 (right), we show how calibration tuning is affected by decreasing the size of the fine-tuning dataset. We find that around 1,000 labeled examples are enough to improve performance over simpler baselines, and that increasing the size of the fine-tuning dataset yields consistent improvements in both calibration and selective prediction, although the marginal benefit of additional data points decreases after around 5,000 examples.

6 When and Why Do These Estimates Generalize?

To better understand when our estimates generalize, we now investigate distribution shifts between the training and evaluation datasets. To be practically useful, an uncertainty estimator should be robust to the following shifts, among others:

Subject matter. Ideally, our uncertainty estimates apply to subjects we have not seen during training. In Figure 4 (left), we show a breakdown of our fine-tuning dataset using the supercategories from MMLU (Section A.5). We see that our dataset contains much higher percentages of STEM and humanities questions than MMLU and close to no examples from the social sciences (e.g., government, economics, sociology). Despite these differences in composition, uncertainty estimates from LoRA + Prompt perform similarly across supercategories. We also show the efficacy of our models at assessing confidence on out-of-distribution coding tasks in Appendix F.

Format. Like a change in subject matter, a change in the way a question is posed should not break the uncertainty estimate. To test the effect of the question format independent of its subject matter, we apply models fine-tuned on OE MMLU to MC MMLU and vice versa. In Figure 4 (center), we see that fine-tuned models often perform better than a zero-shot baseline even when applied across a distribution shift, though transfer from MC to OE is more challenging than from OE to MC. Probe is insufficient to generalize effectively from MC to OE, but training through the features of the model (LoRA + Prompt) does generalize effectively, even outperforming a Probe trained on OE data.

Solvability. Even though we focus on questions with a single known answer, we might hope that our estimates remain useful when a question is ill-posed or does not have a known solution, ideally returning high uncertainty. We generate answers, labels, and uncertainty estimates for the answerable and unanswerable questions in the SelfAware dataset [60] using the same procedure as for OE MMLU. In Figure 4 (right), we plot the $P(\text{correct})$ predicted by the Zero-Shot Classifier and LoRA + Prompt for each answerable and unanswerable question. Notably, calibration-tuned models have calibrated probabilities for the answerable questions and assign lower confidence to unanswerable questions than black-box methods.

[Figure 4]

6.1 What are uncertainty estimates learning?

Language models can generate useful uncertainty estimates after training on a relatively small number of labeled examples. How is this possible? We hypothesize two, potentially complementary mechanisms: (a) LLMs assess the correctness of an answer given a question, or (b) LLMs recognize that certain topics often have incorrect answers. To understand the difference, let’s explore a useful metaphor. Imagine I speak only English, while my friend, Alice, is a linguaphile and dabbles in many languages. I have a spreadsheet of how often Alice makes mistakes in each language. Now, when I hear Alice attempting to converse in language A, I can guess how likely she is to err by recognizing the language from its sound and consulting the spreadsheet. I can do this without understanding the language at all. Alternatively, I can learn each language, which would be more complex but would strengthen my predictions.

To disentangle these two possibilities in our setting, we perform an additional experiment in which we replace the language model’s answers in the fine-tuning dataset with incorrect answer options. If a language model is simply learning patterns in the errors present in the training data, then we would expect this ablation to perform on par with the original method, because it suffices to learn patterns in the content of the question and answer without the true causal relationship between question, answer, and correctness label. The results are shown in Figure 5 (left). We see that the model trained on incorrect answers performs surprisingly well, on par with a Probe model, but significantly worse than a model trained on the original sampled answers. Correlating question content with error rates, while moderately successful, cannot fully account for the LoRA + Prompt estimates.

Self-knowledge. Lastly, we examine whether a language model can be used to model not just its own uncertainties but also the uncertainties of other models. Several prior works argue that models identify correct answers by way of internal representations of truth, which might be unique to a model evaluating its own generations [4, 9]. In Figure 5 (right), we show that, by contrast, Mistral 7B actually has better AUROC values when applied to LLaMA-2 7B than LLaMA-2 7B applied to itself. In Figure 5 (left), we show that sBERT [44] and OpenAI sentence embeddings are competitive with Probe on both LLaMA-2 7B and Mistral. Together, these results suggest that LLM uncertainties are likely not model-specific. The practical upside of this insight is that one strong base model can be used to estimate the uncertainties of many other models, even closed-source models behind APIs, when a small labeled dataset is available or can be generated.

[Figure 5]

7 Does Calibrated Confidence Improve Collaboration with AI Assistants?

One key motivation for estimating LLM uncertainty is to signal the model’s reliability during collaborative decision making. To examine how our uncertainty estimates can be used in this capacity, we perform a preliminary user study (with $N=181$ participants) in which participants complete a multiple choice exam in collaboration with an LLM (Mistral 7B Instruct). For each question, the participant is provided both the LLM’s prediction and an uncertainty estimate, which can come from a calibrated or an uncalibrated method. We hope to show that users are more likely to adopt calibrated uncertainty scores as part of their decision process. A more detailed description of the study setup is available in Appendix G.

People are sensitive to informed confidence scores.

Figure 6 shows density plots of the model’s reported confidence and whether the user chose to agree with the model’s prediction. We find that participants are sensitive to the confidence scores and tend to use them when deciding whether to agree or disagree with the model’s prediction if the uncertainties are reliable. On the other hand, participants generally do not modulate their reliance on the output of a random confidence baseline (Figure 6(c)), in which the displayed uncertainty estimate is generated uniformly at random. We see the strongest discrepancy in reliance choices when LoRA + Prompt confidence scores are presented, highlighting that calibrated confidence does influence user behavior.

We include additional details and results in Appendix G. We find that confidence scores have the biggest effect on improving the lowest-performing users, rather than on average accuracy. However, this is a preliminary result in the nascent field of studying LLM uncertainties in practical collaborative decision making with users; we are still only scratching the surface of this question. For more fine-grained conclusions, a dedicated study of this subject is needed. We outline several limitations and future directions in Appendix G.

[Figure 6: (a) Zero-Shot Prompt, (b) LoRA + Prompt, (c) Random (Control)]

8 Discussion

There is much disagreement about the role of calibrated uncertainty in large language models, how it can best be achieved, and the promise of black-box methods. We hope to have shed light on these questions throughout this paper. In contrast to prior results, we find that out-of-the-box uncertainties from LLMs are unreliable for open-ended generation, and we introduce a suite of fine-tuning procedures that produce calibrated uncertainties with practical generalization properties. In the process, we discovered that fine-tuning is surprisingly sample-efficient and does not seem to rely on representations of correctness specific to a model evaluating its own generations, allowing estimators to be applied from one model to another. Moreover, we found that it is possible, at least in the cases we considered, for calibrated uncertainties to be robust to distribution shifts.

There are many exciting questions for future work. Currently, fine-tuning relies on two separate models for question answering and uncertainty estimation. Ideally, we want a single model that can generate answers and uncertainties without switching between model weights. We anticipate that an uncertainty-aware pre-training or alignment phase might become essential, but implementing such a procedure while maintaining base language modeling abilities will introduce a challenging online learning problem in which the correctness labels evolve during training.

Beyond improving the safety and usefulness of language models, high-quality uncertainties can also be used in active learning procedures, e.g., for sample-efficient fine-tuning [39], where data points are selected based on their predicted utility and the model’s uncertainty in order to balance the explore-exploit trade-off. Uncertainty estimates can also be used to improve the factuality of language models by increasing the likelihood of generations that the model is confident about (i.e., judged likely to be correct), for example by using an alignment procedure (e.g., RLHF or DPO) with a reward function that encourages confident generations [50].

We also showed how uncertainty information could be used to influence human decision making. In the end, LLMs will impact society through decision making, and to make reasonable decisions we need uncertainty information — particularly to protect against rare but costly mistakes.

Acknowledgements

This work is supported by NSF CAREER IIS-2145492, NSF CDS&E-MSS 2134216, NSF HDR-2118310, BigHat Biosciences, Capital One, and an Amazon Research Award.

References

  • Achiam etal. [2023]Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, FlorenciaLeoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, etal.Gpt-4 technical report.arXiv preprint arXiv:2303.08774, 2023.
  • Amini etal. [2019]Aida Amini, Saadia Gabriel, Shanchuan Lin, Rik Koncel-Kedziorski, Yejin Choi, and Hannaneh Hajishirzi.MathQA: Towards interpretable math word problem solving with operation-based formalisms.In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2357–2367. Association for Computational Linguistics, jun 2019.doi: 10.18653/v1/N19-1245.
  • Aroyo and Welty [2015]Lora Aroyo and Chris Welty.Truth is a lie: Crowd truth and the seven myths of human annotation.AI Magazine, 36(1):15–24, 2015.
  • Azaria and Mitchell [2023]Amos Azaria and TomM. Mitchell.The internal state of an llm knows when its lying.ArXiv, abs/2304.13734, 2023.
  • Bhatt etal. [2023]Umang Bhatt, Valerie Chen, KatherineM Collins, Parameswaran Kamalaruban, Emma Kallina, Adrian Weller, and Ameet Talwalkar.Learning personalized decision support policies.arXiv preprint arXiv:2304.06701, 2023.
  • Bishop [2006]ChristopherM Bishop.Pattern recognition and machine learning.Springer google schola, 2:1122–1128, 2006.
  • Bisk etal. [2019]Yonatan Bisk, Rowan Zellers, RonanLe Bras, Jianfeng Gao, and Yejin Choi.Piqa: Reasoning about physical commonsense in natural language.ArXiv, abs/1911.11641, 2019.
  • Bowman etal. [2015]SamuelR. Bowman, Gabor Angeli, Christopher Potts, and ChristopherD. Manning.A large annotated corpus for learning natural language inference.In Conference on Empirical Methods in Natural Language Processing, 2015.
  • Burns etal. [2022]Collin Burns, Hao-Tong Ye, Dan Klein, and Jacob Steinhardt.Discovering latent knowledge in language models without supervision.ArXiv, abs/2212.03827, 2022.
  • Chiang and yiLee [2023]Cheng-Han Chiang and Hung yiLee.Can large language models be an alternative to human evaluations?In Annual Meeting of the Association for Computational Linguistics, 2023.
  • Clark etal. [2019]Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova.Boolq: Exploring the surprising difficulty of natural yes/no questions.ArXiv, abs/1905.10044, 2019.
  • Clark etal. [2018]Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord.Think you have solved question answering? try arc, the ai2 reasoning challenge.ArXiv, abs/1803.05457, 2018.
  • Collins etal. [2023]KatherineMaeve Collins, Matthew Barker, Mateo EspinosaZarlenga, Naveen Raman, Umang Bhatt, Mateja Jamnik, Ilia Sucholutsky, Adrian Weller, and Krishnamurthy Dvijotham.Human uncertainty in concept-based ai systems.In Proceedings of the 2023 AAAI/ACM Conference on AI, Ethics, and Society, pages 869–889, 2023.
  • DeMarneffe etal. [2019]Marie-Catherine DeMarneffe, Mandy Simons, and Judith Tonhauser.The commitmentbank: Investigating projection in naturally occurring discourse.In proceedings of Sinn und Bedeutung, volume23, pages 107–124, 2019.
  • Gneiting and Raftery [2007]Tilmann Gneiting and AdrianE Raftery.Strictly proper scoring rules, prediction, and estimation.Journal of the American statistical Association, 102(477):359–378, 2007.
  • Gordon etal. [2011]AndrewS. Gordon, Zornitsa Kozareva, and Melissa Roemmele.Semeval-2012 task 7: Choice of plausible alternatives: An evaluation of commonsense causal reasoning.In International Workshop on Semantic Evaluation, 2011.
  • Guo etal. [2017]Chuan Guo, Geoff Pleiss, YuSun, and KilianQ. Weinberger.On calibration of modern neural networks.In International Conference on Machine Learning, 2017.
  • Hendrycks etal. [2020]Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, DawnXiaodong Song, and Jacob Steinhardt.Measuring massive multitask language understanding.ArXiv, abs/2009.03300, 2020.
  • Hills and Anadkat [2023]James Hills and Shyamal Anadkat.Using logprobs, Dec 2023.URL https://cookbook.openai.com/examples/using_logprobs.
  • Hu etal. [2021]J.Edward Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, and Weizhu Chen.Lora: Low-rank adaptation of large language models.ArXiv, abs/2106.09685, 2021.
  • Huang etal. [2019]Lifu Huang, RonanLe Bras, Chandra Bhagavatula, and Yejin Choi.Cosmos qa: Machine reading comprehension with contextual commonsense reasoning.In Conference on Empirical Methods in Natural Language Processing, 2019.
  • Jain etal. [2024]Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, and Ion Stoica.Livecodebench: Holistic and contamination free evaluation of large language models for code.arXiv preprint arXiv:2403.07974, 2024.
  • Janssen etal. [2008]KJM Janssen, KGM Moons, CJKalkman, DEGrobbee, and YVergouwe.Updating methods improved the performance of a clinical prediction model in new patients.Journal of clinical epidemiology, 61(1):76–86, 2008.
  • Jiang etal. [2023]AlbertQiaochu Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, DevendraSingh Chaplot, Diego deLasCasas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, L’elioRenard Lavaud, Marie-Anne Lachaux, Pierre Stock, TevenLe Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and WilliamEl Sayed.Mistral 7b.ArXiv, abs/2310.06825, 2023.
  • Kadavath etal. [2022]Saurav Kadavath, Tom Conerly, Amanda Askell, T.J. Henighan, Dawn Drain, Ethan Perez, Nicholas Schiefer, Zachary Dodds, Nova DasSarma, Eli Tran-Johnson, Scott Johnston, Sheer El-Showk, Andy Jones, Nelson Elhage, Tristan Hume, Anna Chen, Yuntao Bai, Sam Bowman, Stanislav Fort, Deep Ganguli, Danny Hernandez, Josh Jacobson, John Kernion, Shauna Kravec, Liane Lovitt, Kamal Ndousse, Catherine Olsson, Sam Ringer, Dario Amodei, TomB. Brown, Jack Clark, Nicholas Joseph, Benjamin Mann, Sam McCandlish, Christopher Olah, and Jared Kaplan.Language Models (Mostly) Know What They Know.ArXiv, abs/2207.05221, 2022.
  • Keren [1991]Gideon Keren.Calibration and probability judgements: Conceptual and methodological issues.Acta psychologica, 77(3):217–273, 1991.
  • Khashabi etal. [2018]Daniel Khashabi, Snigdha Chaturvedi, Michael Roth, Shyam Upadhyay, and Dan Roth.Looking beyond the surface: A challenge set for reading comprehension over multiple sentences.In North American Chapter of the Association for Computational Linguistics, 2018.
  • Kruger and Dunning [1999]Justin Kruger and David Dunning.Unskilled and unaware of it: how difficulties in recognizing one’s own incompetence lead to inflated self-assessments.Journal of personality and social psychology, 77(6):1121, 1999.
  • Kruger and Dunning [2002]Justin Kruger and David Dunning.Unskilled and unaware–but why? a reply to krueger and mueller (2002).American Psychological Association, 2002.
  • Kuhn etal. [2023]Lorenz Kuhn, Yarin Gal, and Sebastian Farquhar.Semantic uncertainty: Linguistic invariances for uncertainty estimation in natural language generation.ArXiv, abs/2302.09664, 2023.
  • Li and Roth [2002]Xin Li and Dan Roth.Learning question classifiers.In International Conference on Computational Linguistics, 2002.
  • Lichtenstein etal. [1977]Sarah Lichtenstein, Baruch Fischhoff, and LawrenceD Phillips.Calibration of probabilities: The state of the art.In Decision Making and Change in Human Affairs: Proceedings of the Fifth Research Conference on Subjective Probability, Utility, and Decision Making, Darmstadt, 1–4 September, 1975, pages 275–324. Springer, 1977.
  • Lin etal. [2022]StephanieC. Lin, Jacob Hilton, and Owain Evans.Teaching models to express their uncertainty in words.Trans. Mach. Learn. Res., 2022, 2022.
  • Loshchilov and Hutter [2017]Ilya Loshchilov and Frank Hutter.Fixing weight decay regularization in adam.ArXiv, abs/1711.05101, 2017.
  • MacKay [2004]David JohnCameron MacKay.Information theory, inference, and learning algorithms.IEEE Transactions on Information Theory, 50:2544–2545, 2004.
  • Mihaylov etal. [2018]Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal.Can a suit of armor conduct electricity? a new dataset for open book question answering.In Conference on Empirical Methods in Natural Language Processing, 2018.
  • Naeini etal. [2015]MahdiPakdaman Naeini, GregoryF. Cooper, and Milos Hauskrecht.Obtaining well calibrated probabilities using bayesian binning.Proceedings of the … AAAI Conference on Artificial Intelligence. AAAI Conference on Artificial Intelligence, 2015:2901–2907, 2015.
  • Nie etal. [2019]Yixin Nie, Adina Williams, Emily Dinan, Mohit Bansal, Jason Weston, and Douwe Kiela.Adversarial nli: A new benchmark for natural language understanding.ArXiv, abs/1910.14599, 2019.
  • Osband etal. [2022]Ian Osband, SeyedMohammad Asghari, Benjamin VanRoy, Nat McAleese, John Aslanides, and Geoffrey Irving.Fine-tuning language models via epistemic neural networks.arXiv preprint arXiv:2211.01568, 2022.
  • Palan and Schitter [2018]Stefan Palan and Christian Schitter.Prolific. ac—a subject pool for online experiments.Journal of Behavioral and Experimental Finance, 17:22–27, 2018.
  • Paszke etal. [2019]Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Köpf, Edward Yang, Zach DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, LuFang, Junjie Bai, and Soumith Chintala.Pytorch: An imperative style, high-performance deep learning library.In Neural Information Processing Systems, 2019.
  • Platt etal. [1999]John Platt etal.Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods.Advances in large margin classifiers, 10(3):61–74, 1999.
  • Plaut etal. [2024]Benjamin Plaut, Khanh Nguyen, and TuTrinh.Softmax probabilities (mostly) predict large language model correctness on multiple-choice q&a.arXiv preprint arXiv:2402.13213, 2024.
  • Reimers and Gurevych [2019]Nils Reimers and Iryna Gurevych.Sentence-bert: Sentence embeddings using siamese bert-networks.arXiv preprint arXiv:1908.10084, 2019.
  • Sakaguchi etal. [2019]Keisuke Sakaguchi, RonanLe Bras, Chandra Bhagavatula, and Yejin Choi.Winogrande: An adversarial winograd schema challenge at scale.ArXiv, abs/1907.10641, 2019.
  • Schaal [1996]Stefan Schaal.Learning from demonstration.Advances in neural information processing systems, 9, 1996.
  • Talmor etal. [2019]Alon Talmor, Jonathan Herzig, Nicholas Lourie, and Jonathan Berant.Commonsenseqa: A question answering challenge targeting commonsense knowledge.ArXiv, abs/1811.00937, 2019.
  • Team [2024]Gemini Team.Gemini: A family of highly capable multimodal models, 2024.
  • Terwilliger etal. [2023]ThomasC Terwilliger, Dorothee Liebschner, TristanI Croll, ChristopherJ Williams, AirlieJ McCoy, BillyK Poon, PavelV Afonine, RobertD Oeffner, JaneS Richardson, RandyJ Read, etal.Alphafold predictions are valuable hypotheses and accelerate but do not replace experimental structure determination.Nature Methods, pages 1–7, 2023.
  • Tian etal. [2023a]Katherine Tian, Eric Mitchell, Huaxiu Yao, ChristopherD Manning, and Chelsea Finn.Fine-tuning language models for factuality.arXiv preprint arXiv:2311.08401, 2023a.
  • Tian etal. [2023b]Katherine Tian, Eric Mitchell, Allan Zhou, Archit Sharma, Rafael Rafailov, Huaxiu Yao, Chelsea Finn, and ChristopherD Manning.Just ask for calibration: Strategies for eliciting calibrated confidence scores from language models fine-tuned with human feedback.arXiv preprint arXiv:2305.14975, 2023b.
  • Touvron etal. [2023a]Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample.Llama: Open and efficient foundation language models.ArXiv, abs/2302.13971, 2023a.
  • Touvron etal. [2023b]Hugo Touvron, Louis Martin, KevinR. Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, DanielM. Bikel, Lukas Blecher, CristianCanton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, AnthonyS. Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, IsabelM. Kloumann, A.V. Korenev, PunitSingh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, EricMichael Smith, R.Subramanian, Xia Tan, Binh Tang, Ross Taylor, Adina Williams, JianXiang Kuan, Puxin Xu, Zhengxu Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurelien Rodriguez, Robert Stojnic, Sergey Edunov, andThomas Scialom.Llama 2: Open foundation and fine-tuned chat models.ArXiv, abs/2307.09288, 2023b.
  • Uma etal. [2021]AlexandraN Uma, Tommaso Fornaciari, Dirk Hovy, Silviu Paun, Barbara Plank, and Massimo Poesio.Learning from disagreement: A survey.Journal of Artificial Intelligence Research, 72:1385–1470, 2021.
  • Vodrahalli etal. [2022]Kailas Vodrahalli, Tobias Gerstenberg, and JamesY Zou.Uncalibrated models can improve human-ai collaboration.Advances in Neural Information Processing Systems, 35:4004–4016, 2022.
  • Wei etal. [2021]Jason Wei, Maarten Bosma, Vincent Zhao, Kelvin Guu, AdamsWei Yu, Brian Lester, Nan Du, AndrewM. Dai, and QuocV. Le.Finetuned language models are zero-shot learners.ArXiv, abs/2109.01652, 2021.
  • Welbl etal. [2017]Johannes Welbl, NelsonF. Liu, and Matt Gardner.Crowdsourcing multiple choice science questions.ArXiv, abs/1707.06209, 2017.
  • Wolf etal. [2020]Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, TevenLe Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and AlexanderM. Rush.Transformers: State-of-the-art natural language processing.In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38–45, Online, October 2020. Association for Computational Linguistics.URL https://www.aclweb.org/anthology/2020.emnlp-demos.6.
  • Xiong etal. [2023]Miao Xiong, Zhiyuan Hu, Xinyang Lu, Yifei Li, Jie Fu, Junxian He, and Bryan Hooi.Can llms express their uncertainty? an empirical evaluation of confidence elicitation in llms.ArXiv, abs/2306.13063, 2023.
  • Yin etal. [2023]Zhangyue Yin, Qiushi Sun, Qipeng Guo, Jiawen Wu, Xipeng Qiu, and Xuanjing Huang.Do large language models know what they don’t know?In Findings of the Association for Computational Linguistics: ACL 2023, pages 8653–8665, Toronto, Canada, 2023. Association for Computational Linguistics.
  • Zellers etal. [2019]Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi.Hellaswag: Can a machine really finish your sentence?In Annual Meeting of the Association for Computational Linguistics, 2019.
  • Zhang etal. [2023]Hanning Zhang, Shizhe Diao, Yong Lin, YiR Fung, Qing Lian, Xingyao Wang, Yangyi Chen, Heng Ji, and Tong Zhang.R-tuning: Teaching large language models to refuse unknown questions.arXiv preprint arXiv:2311.09677, 2023.

Appendix


Appendix A Evaluation Methods

A.1 Evaluating Correctness

For a given question with known and generated answers $(Q, A, \hat{A})$, the correctness $C$ is True if the generated answer $\hat{A}$ matches the ground truth answer $A$. For multiple-choice question answering, the matching process only involves checking the first token generated via greedy decoding.

For open-ended evaluations, determining whether the answer given is correct is more complex. One simple approach is to check if the ground truth answer $A$ appears as a substring of the generated answer $\hat{A}$. However, this does not capture rephrasings that are essentially equivalent, such as “NYC” for “New York City,” or “Daoism” and “Taoism.” Conversely, it has the potential to be over-generous if the model is particularly verbose and emits many incorrect answers along with the correct string. Given the difficulty of writing a rule-based method for evaluating open-ended answer correctness, we instead use a strong auxiliary language model to evaluate correctness. The auxiliary language model is shown the query $Q$, the ground truth answer $A$, and the model’s output $\hat{A}$, and is prompted to grade the answer while tolerating nuance. For full details of the prompt used, see Figure 7. In this paper, we use GPT 3.5 Turbo as the auxiliary grading model. We compare human grading, substring grading, and GPT 3.5 Turbo grading on select subsets of MMLU in Section A.3, and find that humans and GPT 3.5 Turbo agree far more often than humans and the substring method.

A.2 Grading

Dataset Construction.

To perform calibration-tuning (CT), we need tuples $(Q, A, \hat{A}, C)$: answers from a language model that have been graded for correctness. When calibration-tuning on multiple choice questions, we can use an exact string match to generate $C$. To grade open-ended answers, we instead use a strong language model and a grading prompt $G$ (Figure 7):

  • $G$: a prompt used for grading answers $\hat{A}$ against $A$.

Compared to alternatives like exact match, language model grading is insensitive to rephrasings that are equivalent in meaning, such as “NYC” and “New York City,” or “Daoism” and “Taoism”. LLM grading can also penalize answers that are overly verbose or use a different meaning of the same word, potentially containing incorrect answers along with the correct string. For example, if the question is “What’s it called when you move quickly by foot and both feet aren’t always touching the ground?” and the LLM response is “A bank run”, the grader should be able to recognize that this is semantically dissimilar to the true answer “run”.

In this paper, we utilize GPT 3.5 Turbo as the auxiliary grading model. When comparing many possible grading methods on subsets of MMLU, we find that GPT 3.5 Turbo has high agreement with humans while being cost-efficient (Section A.3).

Grading prompt ($G$):
The problem is: $Q$
The correct answer is: $A$
A student submitted: $\hat{A}$

The student’s answer must be correct and specific but not overcomplete (for example, if they provide two different answers, they did not get the question right). However, small differences in formatting should not be penalized (for example, ‘New York City’ is equivalent to ‘NYC’). Did the student provide an equivalent answer to the ground truth? Please answer yes or no without any explanation: $C$ </s>
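Programmatically, grading amounts to formatting this prompt and parsing a single yes/no token from the grader; the sketch below is illustrative, and the OpenAI client usage and model name are assumptions that may need updating.

```python
# Illustrative sketch of LLM-based grading with the prompt above
# (OpenAI Python client usage shown; treat details as indicative).
from openai import OpenAI

GRADING_TEMPLATE = (
    "The problem is: {question}\n"
    "The correct answer is: {answer}\n"
    "A student submitted: {prediction}\n\n"
    "The student's answer must be correct and specific but not overcomplete "
    "(for example, if they provide two different answers, they did not get the "
    "question right). However, small differences in formatting should not be "
    "penalized (for example, 'New York City' is equivalent to 'NYC'). Did the "
    "student provide an equivalent answer to the ground truth? Please answer "
    "yes or no without any explanation: "
)

def grade(question, answer, prediction, model="gpt-3.5-turbo"):
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user",
                   "content": GRADING_TEMPLATE.format(
                       question=question, answer=answer, prediction=prediction)}],
        max_tokens=1,
        temperature=0,
    )
    # Returns True if the grader judged the answer equivalent to the ground truth.
    return response.choices[0].message.content.strip().lower().startswith("yes")
```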

A.3 Comparison of Grading Techniques

We conducted an analysis of the methods outlined in Section A.1 for open-ended evaluation. First, the base LLaMA-2 13B-chat model was prompted with questions from the following test subsets of MMLU: World Religions, Philosophy, Anatomy, High School Chemistry, and Elementary School Math. The questions were stripped of their multiple-choice options before being supplied to the model.

A response was generated by the model via greedy decoding and this response was compared to the ground truth answer. The grading methods tested were Human, Substring Match, GPT 3.5 Turbo, and GPT 4.

The humans (a subset of the authors) were tasked with judging whether the model response was essentially equivalent to the ground truth. For substring match, equivalence was determined by simply checking whether the ground truth answer existed as a substring within the model response. For GPT 3.5 Turbo and GPT 4, the models were supplied with the question, the ground truth, and the base model response, as well as a prompt indicating they should determine essential equivalence (see Figure 7).

Table 2: Absolute error in estimated accuracy relative to human grading.
MMLU Subset       Substring Match   GPT 3.5   GPT 4
World Religions   21.6%             6.4%      1.8%
Philosophy        22.8%             2.3%      14.5%
Anatomy           13.3%             14.8%     1.5%
Chemistry         13.8%             5.4%      1.0%
Math              12.4%             14.8%     3.7%
Average           16.8%             8.7%      4.5%

We recorded the binary decision on correctness for each query and response from each of the grading methods above. Taking the human scores as the gold standard of correctness, we computed the model accuracy for each subset and then derived the absolute error in the estimate of model accuracy for each of the other grading methods. These are displayed in Table 2. We see that GPT 4 is a better estimator of human-judged correctness than GPT 3.5 Turbo, which in turn is substantially better than substring match, although there is some variance on a per-subset basis. For expediency of processing time and cost, we chose to use GPT 3.5 Turbo in this paper.

A.4 Metrics

ECE

Given $N$ samples and $B$ equally-spaced bins $b_j$, examples are assigned to bins based on the confidence of the model, and ECE is estimated as
$$\widehat{\text{ECE}} = \sum_{j=1}^{B} \frac{|b_j|}{N}\,\left|\,\mathrm{conf}(b_j) - \mathrm{acc}(b_j)\,\right|,$$
where $\mathrm{conf}(b_j)$ is the average confidence of samples in bin $b_j$, $\mathrm{acc}(b_j)$ is the accuracy within the bin, and $|b_j|$ is the number of samples assigned to bin $j$. In our experiments, $\mathrm{conf}$ is equivalent to $P(\text{correct})$.

A.5 MMLU Supercategory Classifier

To understand the impact of the subject matter of the training data on generalization, we follow the prescription of Hendrycks et al. [18] and categorize each of the 57 tasks into one of four supercategories: Humanities, STEM, Social Sciences, and Other. Since we do not have such a categorization for the training set, we must estimate the proportions.

First, we use the OpenAI embeddings (dimension 1536) of the MMLU samples with their ground truth supercategories to train a linear 4-way classifier, using 10 samples from each of the 57 tasks. We use AdamW [34] with learning rate 1e-3 and weight decay 1e-2. This classifier is then used to estimate the supercategory of each sample in the training set used for fine-tuning, yielding the breakdown of results in fig. 4 (Left).
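A sketch of this classifier is given below, assuming the embeddings have already been computed; the tensor names and number of epochs are placeholders rather than our exact training script.

```python
import torch
import torch.nn as nn


def train_supercategory_classifier(embeddings: torch.Tensor, labels: torch.Tensor,
                                   n_categories: int = 4, epochs: int = 100):
    """Linear 4-way classifier over 1536-d OpenAI embeddings,
    trained with AdamW (lr 1e-3, weight decay 1e-2)."""
    clf = nn.Linear(embeddings.shape[1], n_categories)
    opt = torch.optim.AdamW(clf.parameters(), lr=1e-3, weight_decay=1e-2)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(clf(embeddings), labels)
        loss.backward()
        opt.step()
    return clf


# Usage sketch: predicted supercategory of each fine-tuning example.
# clf = train_supercategory_classifier(mmlu_embs, mmlu_labels)
# supercats = clf(train_embs).argmax(dim=-1)
```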

Appendix B Baseline Methods

B.1 Sampling Methods

We use two baselines that obtain an estimate of certainty by sampling answers to the same question $n=10$ times and then estimating the proportion of sampled answers that agree with the greedily decoded “main" answer. There are two critical downsides to these approaches: (i) the uncertainty depends on the sampling parameters; for example, in the limit where sampling converges to greedy decoding, the LLM will produce $n$ identical samples and the certainty will always be 1; (ii) these approaches require $O(n)$ answer generations to provide a certainty estimate for a single generation. This computational cost prevents us from easily searching the space of sampling parameters for the optimal set, so we choose parameters arbitrarily; here we sample with top\_$p = 0.95$.

Counting

In this baseline, each sampled answer is compared to the greedy answer by prompting an expert LLM with both answers and asking it to judge their equivalence. The proportion of samples that are equivalent to the greedy answer is the certainty estimate. This baseline is similar to Label prob [51]; our method differs by not choosing the argmax semantic group as the final prediction, but instead using the greedy decode as the final prediction, so as to maintain the same accuracy as our uncertainty query method.
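A minimal sketch of the counting estimate, assuming an `is_equivalent(a, b)` helper that queries the expert LLM (for example, the grader sketched in section A.3):

```python
def counting_certainty(greedy_answer: str, sampled_answers: list[str],
                       is_equivalent) -> float:
    """Fraction of the n sampled answers judged equivalent to the greedy answer."""
    matches = sum(is_equivalent(greedy_answer, s) for s in sampled_answers)
    return matches / len(sampled_answers)
```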

Likelihood accumulation

In this baseline, we add up likelihoods of sampled answers to estimate the mass associated with the predicted answer. We begin by prompting an expert LLM to find which sampled answers are equivalent to the greedy answer, as in the counting baseline. The certainty estimate is then produced by adding the length-normalized likelihoods of those sampled answers equivalent to the greedy answer, and dividing this quantity by the sum of all sampled answers’ length-normalized likelihoods. This procedure of adding likelihoods of samples in order to estimate the likelihood of an equivalence class is similar to that used by [30], although they use it to produce entropy scores rather than certainty estimates. In practice, the scores produced by these two methods are very similar, so we report only likelihood accumulation numbers in the main text.
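The corresponding likelihood-accumulation estimate, under the same assumed `is_equivalent` helper, is sketched below; `logprobs[i]` is the summed token log-probability of the i-th sampled answer and `lengths[i]` its token count.

```python
import numpy as np


def likelihood_accumulation_certainty(greedy_answer, sampled_answers, logprobs,
                                      lengths, is_equivalent) -> float:
    """Certainty = length-normalized likelihood mass of samples equivalent
    to the greedy answer, divided by the total mass of all samples."""
    norm_liks = np.exp(np.asarray(logprobs) / np.asarray(lengths))  # length-normalized
    match = np.array([is_equivalent(greedy_answer, s) for s in sampled_answers])
    return float(norm_liks[match].sum() / norm_liks.sum())
```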

B.2 Verbal Elicitation

Although Tian et al. [51] introduce several prompting strategies, involving multiple guesses or multiple stages of interleaved prompting and generation, we did not find that any strategy consistently outperformed the others. This finding is consistent with the results of Xiong et al. [59]. Ultimately, for convenience, we adopted a two-stage strategy with a single guess because it can be used in tandem with logged datasets of generated answers per model.

The exact prompt we used is essentially the same as in [51], but with small modifications that improved the rate of correctly formatted responses:

“Provide the probability that your answer is correct. Give ONLY the probability, no other words or explanation.

For example:

Probability: <the probability between 0.0 and 1.0 that your guess is correct, without any extra commentary whatsoever; just the probability!>

Include probability for the answer below:

Probability:”

Verbal elicitation methods typically output complex strings containing both answers and associated probabilities. This means that if any element of parsing fails, it can be challenging to construct partial results. This effect tends to diminish when using large models, which are more responsive to zero-shot prompting.

Parsing Details

The original verbal elicitation prompts are given in the appendix of [51]. However, it is not clear how the original authors parse answers from the generations or how failure to parse is handled. When we fail to parse the guess from the generation, we return an empty string with associated probability 0.5. When we fail to parse a probability, we also return probability 0.5. For versions with multiple guesses, if any part of the parsing process fails in an ambiguous way, we default back to an empty string for the answer and 0.5 for the probability. The only unambiguous cases are those that explicitly succeed in generating a valid guess and probability for the first guess but not for subsequent ones. In this scenario, we default to using the successfully parsed first guess and its associated probability.
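A sketch of the parsing fallback described above; the regular expression is illustrative rather than our exact parser.

```python
import re


def parse_probability(generation: str, default: float = 0.5) -> float:
    """Extract the verbalized probability, falling back to 0.5 when parsing fails."""
    match = re.search(r"Probability:\s*([01](?:\.\d+)?)", generation)
    if match is None:
        return default
    try:
        prob = float(match.group(1))
    except ValueError:
        return default
    return prob if 0.0 <= prob <= 1.0 else default
```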

Appendix C Fine-tuning Method

C.1 Regularization Term

To keep the calibration-tuned parameters $\theta$ within the neighborhood of the initial parameters $\theta_0$, we use a regularization term that penalizes the divergence between the original sampling distribution and the calibration-tuned model on the target sequence $A$, yielding regularization $\mathcal{R}(\theta; \theta_0)$, which we use with weighting parameter $\kappa$.

Specifically, let $p_{\theta_0}$ be the language modeling distribution of the language model we wish to calibration-tune, and $q_{\theta}$ be the corresponding language modeling distribution as a consequence of calibration-tuning. We then use the Jensen-Shannon Divergence $\mathrm{JSD}(p_{\theta_0} \parallel q_{\theta})$ [35] between the two language modeling distributions as the regularizer, where $\mathrm{JSD}(p \parallel q) \triangleq \nicefrac{1}{2}\,(\mathrm{KL}(p \parallel m) + \mathrm{KL}(q \parallel m))$ and $m \triangleq \nicefrac{1}{2}\,(p + q)$ is the mixture distribution. JSD regularization is applied only to the logits corresponding to the target sequence $A$.

We note that using either direction of the KL-divergence, i.e. the forward KL $\mathrm{KL}(p_{\theta_0} \parallel q_{\theta})$ or the reverse KL $\mathrm{KL}(q_{\theta} \parallel p_{\theta_0})$, was insufficient for optimal performance with calibration tuning. The forward KL-divergence encourages zero-avoiding behavior, such that the mass of $q_{\theta}$ is spread across multiple modes of $p_{\theta_0}$ to avoid assigning no mass to regions of the probability space. By contrast, the reverse KL-divergence encourages zero-forcing behavior, such that $q_{\theta}$ only needs to cover a single mode of $p_{\theta_0}$ [6]. It is not obvious which of these behaviors one should prefer for large language models, so as a practical choice we pick the regularizer that yields the most performant calibration-tuned model.
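For concreteness, a sketch of the JSD regularizer over target-token logits is given below; the tensor shapes and the `target_mask` argument are assumptions about how the target positions are identified, not a verbatim excerpt of our training code.

```python
import torch
import torch.nn.functional as F


def jsd_regularizer(logits_q: torch.Tensor, logits_p0: torch.Tensor,
                    target_mask: torch.Tensor) -> torch.Tensor:
    """Jensen-Shannon divergence between the frozen model p_theta0 and the
    calibration-tuned model q_theta, restricted to target-sequence positions.

    logits_*: (batch, seq_len, vocab); target_mask: (batch, seq_len) booleans.
    """
    p = F.softmax(logits_p0, dim=-1)
    q = F.softmax(logits_q, dim=-1)
    m = 0.5 * (p + q)
    kl_pm = (p * (torch.log(p + 1e-12) - torch.log(m + 1e-12))).sum(-1)
    kl_qm = (q * (torch.log(q + 1e-12) - torch.log(m + 1e-12))).sum(-1)
    jsd = 0.5 * (kl_pm + kl_qm)                      # (batch, seq_len)
    return (jsd * target_mask).sum() / target_mask.sum()


# Total loss, with kappa the weighting parameter from the text:
# loss = ce_loss + kappa * jsd_regularizer(logits_q, logits_p0.detach(), mask)
```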

C.2 Training Data

We reserve the following datasets for training.

  • AI2 Reasoning Challenge (ARC) [12],

  • Boolean Questions (BoolQ) [11],

  • CommonsenseQA [47],

  • CosmosQA [21],

  • HellaSwag [61],

  • MathQA [2],

  • Recognizing Textual Entailment (RTE/SNLI) [8],

  • Adversarial NLI [38],

  • OpenBookQA [36],

  • PIQA [7],

  • SciQ [57],

  • The CommitmentBank (CB) [14],

  • Multi-Sentence Reading Comprehension (MultiRC) [27],

  • Choice of Plausible Alternatives (CoPA) [16],

  • TREC [31],

  • Adversarial Winograd (Winogrande) [45].

C.3 Training Hyperparameters

We use HuggingFace Transformers [58] and PyTorch [41] for the implementation of these models. For all our experiments, we use the AdamW optimizer [34] with a learning rate of $10^{-4}$, a cosine decay schedule, and an effective batch size of $M = 32$. Training runs for $G = 10000$ steps with an initial linear warmup schedule for $1000$ steps.
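As a reference point, the sketch below shows a comparable configuration with HuggingFace Transformers and PEFT; the LoRA rank, target modules, and per-device batch size are assumptions not specified in this section.

```python
from transformers import TrainingArguments
from peft import LoraConfig

lora_config = LoraConfig(               # assumed LoRA settings, for illustration only
    r=8, lora_alpha=16, lora_dropout=0.0,
    target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM",
)

training_args = TrainingArguments(
    output_dir="calibration-tuning",
    optim="adamw_torch",                # AdamW optimizer
    learning_rate=1e-4,
    lr_scheduler_type="cosine",
    warmup_steps=1000,                  # linear warmup over 1,000 steps
    max_steps=10_000,                   # G = 10,000 training steps
    per_device_train_batch_size=4,      # assumed split of the effective batch
    gradient_accumulation_steps=8,      # effective batch size M = 32
)
```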

Appendix D Extended MMLU Results

We report the breakdown of uncertainty query accuracy and ECE on all MMLU tasks in figs. 8, 9, 10, and 11.

[Figures 8–11: uncertainty query accuracy and ECE for each MMLU task.]

Appendix E Confidence as a Function of Target Length

As we noted when motivating calibration tuning, one limitation of sequence-level probabilities is their intrinsic connection to sequence length: the probability of a sequence decreases with increasing length, regardless of the correctness of the response. By contrast, we would not expect concept-level probabilities to have any discernible relationship with sequence length. In this appendix, we show there is no consistent relationship between the confidence estimated by the calibration-tuned model and target sequence length on MMLU tasks.

Because token likelihoods necessarily decay with the length of the generation, we confirm in figs. 12, 13, and 14, over all subsets of MMLU, that the length of the target does not strongly correlate with the confidence assigned to it. This behavior is essential for effective confidence estimation in practice: longer sequences are not penalized in confidence despite being correct.
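A simple per-subset check of this property is a rank correlation between confidence and target length, as in the hedged sketch below; the argument names are illustrative.

```python
from scipy.stats import spearmanr


def length_confidence_correlation(confidences, target_token_lengths):
    """Spearman rank correlation between P(correct) and target length.

    A coefficient near zero indicates that confidence does not simply decay
    (or grow) with the length of the target sequence.
    """
    rho, p_value = spearmanr(confidences, target_token_lengths)
    return rho, p_value
```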

[Figures 12–14: confidence as a function of target sequence length for each MMLU subset.]

Appendix F Generalization to Coding Tasks

Because there are no coding tasks in our training dataset, we can use a coding competition task introduced in LiveCodeBench [22] to assess how well fine-tuned uncertainty estimation methods perform on entirely out-of-distribution tasks.

To conduct the analysis in table 3, we evaluate several base models on the 62 LeetCode easy questions from the livecodebench_generation_lite task. We ask the model to write a Python solution and grade the solution using test cases, marking it as correct iff it passes all test cases. We then apply the LoRA + Prompt and Zero-Shot Classifier uncertainty estimation methods, with these methods only using training and temperature-scaling data from our main dataset mixture, which notably does not include any coding tasks (section C.2). Accuracy is shown to contextualize the model’s overall level of performance on the task. On Mistral-7B, the best-performing model on the coding task, the supervised LoRA + Prompt approach dramatically improves calibration and selective prediction compared to the Zero-Shot Classifier; on the worse-performing Mistral-7B-Instruct and LLaMa-2-7B, selective prediction improves but calibration slightly degrades.

Table 3: Accuracy, ECE, and AUROC on the LiveCodeBench coding task.

Model               | Method               | Acc   | ECE   | AUROC
LLaMa-2-7B          | Zero-Shot Classifier | 3.2%  | 41.0% | 56.9%
LLaMa-2-7B          | LoRA + Prompt        | 3.2%  | 46.4% | 80.0%
Mistral-7B          | Zero-Shot Classifier | 27.4% | 70.2% | 66.2%
Mistral-7B          | LoRA + Prompt        | 27.4% | 21.4% | 85.1%
Mistral-7B-Instruct | Zero-Shot Classifier | 21.0% | 52.7% | 47.1%
Mistral-7B-Instruct | LoRA + Prompt        | 21.0% | 56.1% | 70.2%
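Below is a minimal sketch of the pass/fail grading described above, where a solution counts as correct only if it passes every test case. The `run_solution` callable stands in for LiveCodeBench’s sandboxed execution harness, which we do not reproduce here.

```python
def grade_solution(solution_code: str, test_cases: list[tuple[str, str]],
                   run_solution) -> bool:
    """Correct iff the generated program passes all test cases.

    run_solution(code, stdin) -> stdout is an assumed sandboxed runner;
    each test case is an (input, expected_output) pair.
    """
    for stdin, expected in test_cases:
        try:
            output = run_solution(solution_code, stdin)
        except Exception:
            return False
        if output.strip() != expected.strip():
            return False
    return True
```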

Appendix G User Studies

G.1 Additional Details on Setup

Stimuli and Participant Selection

We closely followed the setup of [5]. We used the same 180 MMLU questions, which were pre-batched into three sets of 60 MMLU questions. Within each variant, we randomly assigned participants to one of the three batches. In total, we recruited 181 participants (20 per variant, with the exception of one extra participant due to random batching allocation effects). All participants were recruited through the crowdsourcing platform Prolific [40]; we restricted our participant pool to those based in the United States who speak English as a first language.

Compensation

Participants were told that the study would take approximately 30 minutes, were paid at a base rate of $9/hr, and were informed that they would receive an optional bonus of up to $10 for answering questions correctly. We applied the bonus to all participants.

LLM Answers and Uncertainty Elicitation

Bhatt et al. [5] originally used GPT-3.5 as their LLM. We first explored user performance when providing confidence scores computed over the original GPT-3.5 responses that the authors had collected; however, the authors had filtered LLM performance to ensure the LLM achieved high performance on biology, computer science, and foreign policy and poor performance on mathematics. As a result, we noticed that participants overwhelmingly adopted the LLM’s answer (which was rational behaviour, given the model’s high performance). To explore a more nuanced performance profile, we regenerated LLM answers using Mistral 7B Instruct via greedy decoding and then generated confidence scores on top of the LLM responses. For our random baseline, we sample a confidence score uniformly between 0 and 100% for each question.

G.2 Important considerations

There are many reasons to exercise caution in interpreting our results as definitive indications of the utility of displaying confidence to users in LLM-assistive settings. In particular: (i) users are presented with feedback after each trial, as in [5]; as such, they can determine (potentially rapidly) whether or not a model is reliable, even without confidence scores. However, in practical settings users may not know whether the model was truly correct, so confidence scores could have an even larger impact; (ii) MMLU questions can be challenging for non-experts; we see the biggest differences in performance between the no-LLM and any-LLM-assistance conditions. We may see a wider range of reliance behaviors in settings where people have more confidence in their own abilities; (iii) we present users with numeric confidence; however, humans are not always able to reliably process confidence estimates nor appropriately calibrate uncertainty estimates themselves [26, 55, 13, 32]. It may be that alternate modes of communicating confidence improve users’ ability to appropriately leverage the confidence scores in their decision-making process. We see targeted exploration of each component through interdisciplinary collaboration across AI, behavioral science, and human-computer interaction as ripe for future work.

G.3 Extended Results

Task Accuracy and Reliance Sensibility

We depict average user task accuracy and reliance sensibility across variants in Figure 15. We follow Bhatt et al. [5] in computing reliance sensibility as the proportion of times the user appropriately sided with the model prediction when the model was correct and did not respond with the model’s prediction when the model was incorrect.
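Concretely, given per-trial records of whether the model was correct and whether the participant sided with the model, reliance sensibility can be computed as in the sketch below.

```python
def reliance_sensibility(model_correct: list[bool], sided_with_model: list[bool]) -> float:
    """Proportion of trials where the user relied on the model appropriately:
    sided with a correct model, or deviated from an incorrect one."""
    appropriate = [
        sided == correct  # agree when correct, disagree when incorrect
        for correct, sided in zip(model_correct, sided_with_model)
    ]
    return sum(appropriate) / len(appropriate)
```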

[Figure 15: average user task accuracy and reliance sensibility across variants.]

We depict per-topic accuracy, alongside the LLM’s average performance, in Figure 16.

[Figure 16: per-topic user accuracy with the LLM’s average performance.]

GPT-3.5 Confidence Generalization

As noted, we ran variants using the same GPT-3.5 generations as [5]. We show aggregate and per-topic accuracy in fig. 17, as well as reliance sensibility in fig. 18.

[Figures 17 and 18: aggregate and per-topic accuracy, and reliance sensibility, for the GPT-3.5 variants.]
Freeform User Responses

We permitted users to provide freeform responses at the end of the study. Some users were sensitive to confidence scores being reported and came up with their own heuristics for whether to rely on the model’s output. We include a sampling of comments across confidence variants:

  • “if it had a confidence of less than 50% it made me very skeptical.”

  • “The model’s confidence indeed helped me choose and select my answer as I trusted in them most of the time.”

  • “I didn’t really rely on the confidence level. If I had 0 confidence in the answer myself I relied on the AI regardless.”

  • “if the models confidence fell below 45 I decided to investigate it myself by remembering pieces of information. and also reasoning the question. If it was above 45 I would automatically agree to its prediction but there were some few cases I challenged it even though it was above 45”

  • “At first I was hesistant to trust the model much because of the lower confidence levels but I still trusted it enough on topics I struggled with. As it went on, I was comfortable with confidence levels above 40.”

  • “If the model’s confidence was low and I thought I knew the answer (and it was different) I chose my answer”

G.4 Interface and Instructions

We show a sample interface of our extension of Modiste with model confidence in Figure 19, and present the full set of instructions provided to users in Figures 20 and 21. Note that for the LLM-only and no-LLM conditions, we followed the instruction text from [5] directly, i.e., participants who saw only the LLM did not see the instruction page about model confidence, and participants in the “No-LLM” variant were not instructed about any model variant and were simply instructed to answer the questions as best as they could by themselves. Participants also responded to a post-survey questionnaire after completing the user study, which we depict in Figure 22.

[Figures 19–22: sample interface, full participant instructions, and post-survey questionnaire.]

Appendix H Broader Impact and Implications

The goal of this work is to associate better confidence values with LLM outputs. With successful, calibrated confidence values, machine systems ultimately become more interpretable and trustworthy to a user [23]. When applied correctly, our advancements will help users make decisions based on LLM outputs in a more informed way. Similar examples in other domains, like AlphaFold [49], have shown how well-calibrated confidence scores can be useful in complex decision-making domains. Our hope is to replicate those broad findings in LLMs.

We acknowledge the ongoing debate over the appropriateness, limitations, and harms of LLMs. We highlight that the development of more confident, interpretable, and trustworthy LLMs can lead to continued techno-solutionism in unintended applications. Specifically, our work is limited to use-cases with fact-based questions. Many applications of text-based LLMs are generative, meaning that there is no way for our paradigm to be applied appropriately, and the use of confidences from calibration-tuned models could be misleading or damaging without checks and guardrails. Additionally, even within the fact-based paradigm, what is true can be subjective, with ground truth in machine learning being a contested topic [3, 54].

The philosophical debate on these topics is beyond the expertise of the authors; nonetheless, we believe that the ongoing debate over the appropriateness of LLMs should be considered in context with the benefits of our approach in making LLMs more interpretable and useful.


References
