AnswerLeakage¶
The AnswerLeakage metric measures how much a hint reveals the correct answer. It ensures that hints guide the user without directly revealing the solution, maintaining the educational value of the hint. This metric is essential for evaluating whether hints are subtle enough to encourage problem-solving rather than giving away the answer.
The HintEval framework provides two different methods for computing the AnswerLeakage metric:
Note
The evaluate function takes a list of Instance objects as its input, where each instance contains a question, its associated hints, and the correct answers.
Lexical¶
The Lexical method assesses the similarity between the hint and the answer at the word level, without considering deeper contextual meaning. This method checks for explicit word overlap, making it useful for detecting hints that directly repeat or use the same words as the answer. For more information, refer to the 📝original paper.
Lexical methods are available in two main variants:
With Stop-Words: Considers common stop-words when evaluating overlap, making the method more permissive.
Without Stop-Words: Ignores stop-words, making the method more precise in identifying relevant word overlap.
Example¶
from hinteval.cores import Instance, Question, Hint, Answer
from hinteval.evaluation.answer_leakage import Lexical
lexical = Lexical(method='include_stop_words')
instance_1 = Instance(
question=Question('What is the capital of Austria?'),
answers=[Answer('Vienna')],
hints=[Hint('This city, once home to Mozart and Beethoven.'),
Hint('This city is called as Vienna.')])
instance_2 = Instance(
question=Question('Who was the president of USA in 2009?'),
answers=[Answer('Barack Obama')],
hints=[Hint('His lastname is Obama.'),
Hint('He was named the 2009 Nobel Peace Prize laureate.')])
instances = [instance_1, instance_2]
results = lexical.evaluate(instances)
print(results)
# [[0, 1], [1, 0]]
metrics = [f'{metric_key}: {metric_value.value}' for
instance in instances
for hint in instance.hints for metric_key, metric_value in
hint.metrics.items()]
print(metrics)
# ['answer-leakage-lexical-include_stop_words-sm: 0', 'answer-leakage-lexical-include_stop_words-sm: 1',
# 'answer-leakage-lexical-include_stop_words-sm: 1', 'answer-leakage-lexical-include_stop_words-sm: 0']
Contextual¶
The Contextual method uses embeddings to evaluate whether the hint is semantically similar to the answer. By leveraging advanced embeddings, this method captures nuanced similarities, even when different words are used to convey the same idea. It detects subtle answer leakage that might not be evident through word overlap.
Note
This method supports SentenceBert models to compute contextualized word embeddings.
Example¶
from hinteval.cores import Instance, Question, Hint, Answer
from hinteval.evaluation.answer_leakage import ContextualEmbeddings
contextual = ContextualEmbeddings(sbert_model='paraphrase-multilingual-mpnet-base-v2')
instance_1 = Instance(
question=Question('What is the capital of Austria?'),
answers=[Answer('Vienna')],
hints=[Hint('This city, once home to Mozart and Beethoven.'),
Hint('This city is called as Vienna.')])
instance_2 = Instance(
question=Question('Who was the president of USA in 2009?'),
answers=[Answer('Barack Obama')],
hints=[Hint('His lastname is Obama.'),
Hint('He was named the 2009 Nobel Peace Prize laureate.')])
instances = [instance_1, instance_2]
results = contextual.evaluate(instances)
print(results)
# [[0.495, 1.0], [0.967, 0.332]]
metrics = [f'{metric_key}: {metric_value.value}' for
instance in instances
for hint in instance.hints for metric_key, metric_value in
hint.metrics.items()]
print(metrics)
# ['answer-leakage-lexical-include_stop_words-sm: 0.495', 'answer-leakage-lexical-include_stop_words-sm: 1.0',
# 'answer-leakage-lexical-include_stop_words-sm: 0.967', 'answer-leakage-lexical-include_stop_words-sm: 0.332']
Comparison¶
For each method, we provide details on:
Method |
Preferred Device |
Cost-Effectiveness |
Accuracy |
Execution Speed |
|---|---|---|---|---|
Lexical |
CPU |
Very High |
Low |
Very Fast |
Contextual |
GPU |
Moderate |
High |
Moderate |
Preferred Device: Indicates whether the method works best on CPU or GPU.
Cost-Effectiveness: Evaluates how computationally expensive the method is, considering the resources needed.
Accuracy: Reflects how accurate the method is in assessing the metric.
Execution Speed: How quickly the method executes (e.g., Fast, Moderate, Slow).