Relevance

class hinteval.cores.evaluation_metrics.relevance.Rouge(model: Literal['rouge1', 'rouge2', 'rougeL'] = 'rouge1', checkpoint: bool = False, checkpoint_step: int = 1, enable_tqdm: bool = False)

Class for evaluating relevance between question and hints using ROUGE metrics [3].

checkpoint

Whether checkpointing is enabled.

Type: bool

checkpoint_step

Step interval for checkpointing.

Type: int

enable_tqdm

Whether the tqdm progress bar is enabled.

Type: bool
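
The checkpoint and checkpoint_step attributes are described only briefly here. As a rough illustration of the general pattern, and not of how hinteval itself stores checkpoints, the sketch below saves partial scores every checkpoint_step items and resumes from the saved file on the next call. The function name, the JSON file format, and the file name are all hypothetical.

```python
import json
import tempfile
from pathlib import Path


def score_with_checkpoints(items, score_fn, checkpoint_path, checkpoint_step=1):
    """Score items in order, saving partial results every `checkpoint_step` items."""
    checkpoint_path = Path(checkpoint_path)
    # Resume from an earlier run if a checkpoint file is already present.
    scores = json.loads(checkpoint_path.read_text()) if checkpoint_path.exists() else []
    for index in range(len(scores), len(items)):
        scores.append(score_fn(items[index]))
        done = index + 1
        # Write at every step boundary and once more at the very end.
        if done % checkpoint_step == 0 or done == len(items):
            checkpoint_path.write_text(json.dumps(scores))
    return scores


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "relevance_checkpoint.json"  # hypothetical file name
    first_run = score_with_checkpoints(["a", "bb", "ccc"], len, path, checkpoint_step=2)
    # The second call finds the finished checkpoint and recomputes nothing.
    second_run = score_with_checkpoints(["a", "bb", "ccc"], len, path, checkpoint_step=2)
```

A larger checkpoint_step means fewer writes but more repeated work after an interruption.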

See also

NonContextualEmbeddings: Class for evaluating relevance between question and hints using non-contextual embeddings such as word embeddings (GloVe).
ContextualEmbeddings: Class for evaluating relevance between question and hints using contextual embeddings such as BERT and RoBERTa models.
LlmBased: Class for evaluating relevance between question and hints using large language models.

evaluate(instances: List[Instance], **kwargs) → List[List[float]]

Evaluates the relevance of the question and hints of the given instances using the ROUGE metric [5].

Parameters:

instances (List[Instance]) – List of instances to evaluate.
**kwargs – Additional keyword arguments.

Returns:

List of relevance scores for each instance.

Return type:

List[List[float]]

Notes

This function stores the scores as Metric objects within the metrics attribute of the Hint of the instances, with names based on the model, such as “relevance-rouge1”.

Examples

>>> from hinteval.cores import Instance, Question, Hint, Answer
>>> from hinteval.evaluation.relevance import Rouge
>>>
>>> rouge = Rouge(model='rouge1')
>>> instance_1 = Instance(
...     question=Question('What is the capital of Austria?'),
...     answers=[Answer('Vienna')],
...     hints=[Hint('This city, once home to Mozart and Beethoven.'),
...            Hint('This city is the best city for life in 2024.')])
>>> instance_2 = Instance(
...     question=Question('Who was the president of USA in 2009?'),
...     answers=[Answer('Barack Obama')],
...     hints=[Hint('He was the first African-American president in U.S. history.'),
...            Hint('He was named the 2009 Nobel Peace Prize laureate.')])
>>> instances = [instance_1, instance_2]
>>> results = rouge.evaluate(instances)
>>> print(results)
# [[0.0, 0.25], [0.421, 0.353]]
>>> metrics = [f'{metric_key}: {metric_value.value}' for
...            instance in instances
...            for hint in instance.hints for metric_key, metric_value in
...            hint.metrics.items()]
>>> print(metrics)
# ['relevance-rouge1: 0.0', 'relevance-rouge1: 0.25', 'relevance-rouge1: 0.421', 'relevance-rouge1: 0.353']
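
To make the scores above easier to interpret, here is a minimal standalone sketch of ROUGE-1 F1 as clipped unigram overlap between the question and a hint. It does not call hinteval; the lowercase alphanumeric tokenizer and the choice of F-measure are assumptions, and the function names are hypothetical. Under those assumptions the sketch gives the same rounded values as the output listed above for these four hints.

```python
import re
from collections import Counter


def tokenize(text):
    # Lowercase and keep alphanumeric runs; an assumption, not necessarily hinteval's tokenizer.
    return re.findall(r"[a-z0-9]+", text.lower())


def rouge1_f1(reference, candidate):
    """F1 of clipped unigram overlap between two strings."""
    ref_counts = Counter(tokenize(reference))
    cand_counts = Counter(tokenize(candidate))
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


question = "What is the capital of Austria?"
hint = "This city is the best city for life in 2024."
score = rouge1_f1(question, hint)
print(round(score, 3))  # 0.25: two shared tokens ("is", "the") out of 6 and 10
```

Because only surface tokens are compared, a hint that never repeats a word from the question scores 0.0 even when it is clearly on topic, as the first hint for Vienna shows.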

See also

NonContextualEmbeddings: Class for evaluating relevance between question and hints using non-contextual embeddings such as word embeddings (GloVe).
ContextualEmbeddings: Class for evaluating relevance between question and hints using contextual embeddings such as BERT and RoBERTa models.
LlmBased: Class for evaluating relevance between question and hints using large language models.

release_memory()

Releases the memory used by the class instance.

This method deletes the instance of the class and triggers garbage collection to free up memory.

Examples

>>> from hinteval.evaluation.relevance import Rouge
>>>
>>> rouge = Rouge(model='rouge1')
>>> rouge.release_memory()

class hinteval.cores.evaluation_metrics.relevance.NonContextualEmbeddings(glove_version: Literal['glove.6B', 'glove.42B'] = 'glove.6B', spacy_pipeline: Literal['en_core_web_sm', 'en_core_web_lg', 'en_core_web_md', 'en_core_web_trf'] = 'en_core_web_sm', batch_size: int = 256, checkpoint: bool = False, checkpoint_step: int = 1, force_download=False, enable_tqdm=False)

Class for evaluating relevance between question and hints using non-contextual embeddings such as word embeddings (GloVe) [6].

batch_size

The batch size for processing.

Type: int

checkpoint

Whether checkpointing is enabled.

Type: bool

checkpoint_step

Step interval for checkpointing.

Type: int

enable_tqdm

Whether the tqdm progress bar is enabled.

Type: bool

See also

Rouge: Class for evaluating relevance between question and hints using ROUGE metrics.
ContextualEmbeddings: Class for evaluating relevance between question and hints using contextual embeddings such as BERT and RoBERTa models.
LlmBased: Class for evaluating relevance between question and hints using large language models.

evaluate(instances: List[Instance], **kwargs) → List[List[float]]

Evaluates the relevance of the question and hints of the given instances using non-contextual embeddings such as GloVe [8].

Parameters:

instances (List[Instance]) – List of instances to evaluate.
**kwargs – Additional keyword arguments.

Returns:

List of relevance scores for each instance.

Return type:

List[List[float]]

Notes

This function stores the scores as Metric objects within the metrics attribute of the Hint of the instances, with names based on the model, such as “relevance-non-contextual-6B-sm”.

Examples

>>> from hinteval.cores import Instance, Question, Hint, Answer
>>> from hinteval.evaluation.relevance import NonContextualEmbeddings
>>>
>>> non_contextual = NonContextualEmbeddings(glove_version='glove.6B')
>>> instance_1 = Instance(
...     question=Question('What is the capital of Austria?'),
...     answers=[Answer('Vienna')],
...     hints=[Hint('This city, once home to Mozart and Beethoven.'),
...            Hint('This city is the best city for life in 2024.')])
>>> instance_2 = Instance(
...     question=Question('Who was the president of USA in 2009?'),
...     answers=[Answer('Barack Obama')],
...     hints=[Hint('He was the first African-American president in U.S. history.'),
...            Hint('He was named the 2009 Nobel Peace Prize laureate.')])
>>> instances = [instance_1, instance_2]
>>> results = non_contextual.evaluate(instances)
>>> print(results)
# [[0.867, 0.889], [0.91, 0.891]]
>>> metrics = [f'{metric_key}: {metric_value.value}' for
...            instance in instances
...            for hint in instance.hints for metric_key, metric_value in
...            hint.metrics.items()]
>>> print(metrics)
# ['relevance-non-contextual-6B-sm: 0.867', 'relevance-non-contextual-6B-sm: 0.889', 'relevance-non-contextual-6B-sm: 0.91', 'relevance-non-contextual-6B-sm: 0.891']
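
As a rough picture of how word embeddings can yield a relevance score, the sketch below averages the word vectors of each text and takes the cosine similarity of the two averages. This is a common baseline and an assumption here; the source does not state how hinteval aggregates GloVe vectors. The three-dimensional vectors are invented for illustration and are not real GloVe values, and the helper names are hypothetical.

```python
import math

# Toy vectors invented for illustration; real GloVe vectors have far more dimensions.
TOY_VECTORS = {
    "capital": [0.9, 0.1, 0.0],
    "austria": [0.7, 0.3, 0.1],
    "city": [0.8, 0.2, 0.1],
    "mozart": [0.1, 0.9, 0.2],
}


def mean_vector(tokens, vectors):
    """Average the vectors of the tokens that have an entry; None if none do."""
    known = [vectors[token] for token in tokens if token in vectors]
    if not known:
        return None
    return [sum(dim) / len(known) for dim in zip(*known)]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


question_vec = mean_vector(["capital", "austria"], TOY_VECTORS)
hint_vec = mean_vector(["city", "mozart"], TOY_VECTORS)
similarity = cosine(question_vec, hint_vec)
print(round(similarity, 3))
```

Averaged embeddings of short English sentences tend to lie close together, which is consistent with the narrow, high range of scores in the output above.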

See also

Rouge: Class for evaluating relevance using ROUGE metrics.
ContextualEmbeddings: Class for evaluating relevance using contextual embeddings such as BERT and RoBERTa models.
LlmBased: Class for evaluating relevance between question and hints using large language models.

release_memory()

Releases the memory used by the class instance.

This method deletes the instance of the class and triggers garbage collection to free up memory.

Examples

>>> from hinteval.evaluation.relevance import NonContextualEmbeddings
>>>
>>> non_contextual = NonContextualEmbeddings(glove_version='glove.6B')
>>> non_contextual.release_memory()

class hinteval.cores.evaluation_metrics.relevance.ContextualEmbeddings(model_name: Literal['bert-base', 'roberta-large'] = 'bert-base', batch_size: int = 256, checkpoint: bool = False, checkpoint_step: int = 1, force_download=False, enable_tqdm=False)

Class for evaluating relevance between question and hints using contextual embeddings such as BERT and RoBERTa models [9].

batch_size

The batch size for processing.

Type: int

checkpoint

Whether checkpointing is enabled.

Type: bool

checkpoint_step

Step interval for checkpointing.

Type: int

enable_tqdm

Whether the tqdm progress bar is enabled.

Type: bool

See also

Rouge: Class for evaluating relevance between question and hints using ROUGE metrics.
NonContextualEmbeddings: Class for evaluating relevance between question and hints using non-contextual embeddings such as word embeddings (GloVe).
LlmBased: Class for evaluating relevance between question and hints using large language models.

evaluate(instances: List[Instance], **kwargs) → List[List[float]]

Evaluates the relevance of the question and hints of the given instances using contextual embeddings such as BERT and RoBERTa models [11].

Parameters:

instances (List[Instance]) – List of instances to evaluate.
**kwargs – Additional keyword arguments.

Returns:

List of relevance scores for each instance.

Return type:

List[List[float]]

Notes

This function stores the scores as Metric objects within the metrics attribute of the Hint of the instances, with names based on the model, such as “relevance-contextual-bert-base”.

Examples

>>> from hinteval.cores import Instance, Question, Hint, Answer
>>> from hinteval.evaluation.relevance import ContextualEmbeddings
>>>
>>> contextual = ContextualEmbeddings(model_name='bert-base')
>>> instance_1 = Instance(
...     question=Question('What is the capital of Austria?'),
...     answers=[Answer('Vienna')],
...     hints=[Hint('This city, once home to Mozart and Beethoven.'),
...            Hint('This city is the best city for life in 2024.')])
>>> instance_2 = Instance(
...     question=Question('Who was the president of USA in 2009?'),
...     answers=[Answer('Barack Obama')],
...     hints=[Hint('He was the first African-American president in U.S. history.'),
...            Hint('He was named the 2009 Nobel Peace Prize laureate.')])
>>> instances = [instance_1, instance_2]
>>> results = contextual.evaluate(instances)
>>> print(results)
# [[1.0, 1.0], [1.0, 1.0]]
>>> metrics = [f'{metric_key}: {metric_value.value}' for
...            instance in instances
...            for hint in instance.hints for metric_key, metric_value in
...            hint.metrics.items()]
>>> print(metrics)
# ['relevance-contextual-bert-base: 1.0', 'relevance-contextual-bert-base: 1.0', 'relevance-contextual-bert-base: 1.0', 'relevance-contextual-bert-base: 1.0']

See also

Rouge: Class for evaluating relevance between question and hints using ROUGE metrics.
NonContextualEmbeddings: Class for evaluating relevance between question and hints using non-contextual embeddings such as word embeddings (GloVe).
LlmBased: Class for evaluating relevance between question and hints using large language models.

release_memory()

Releases the memory used by the class instance.

This method deletes the instance of the class and triggers garbage collection to free up memory.

Examples

>>> from hinteval.evaluation.relevance import ContextualEmbeddings
>>>
>>> contextual = ContextualEmbeddings(model_name='bert-base')
>>> contextual.release_memory()

class hinteval.cores.evaluation_metrics.relevance.LlmBased(model_name: str, api_key: str = None, base_url: str = 'https://api.together.xyz/v1', checkpoint: bool = False, checkpoint_step: int = 1, enable_tqdm=False)

Class for evaluating relevance between question and hints using large language models [12].

checkpoint

Whether checkpointing is enabled.

Type: bool

checkpoint_step

Step interval for checkpointing.

Type: int

enable_tqdm

Whether the tqdm progress bar is enabled.

Type: bool

See also

Rouge: Class for evaluating relevance between question and hints using ROUGE metrics.
NonContextualEmbeddings: Class for evaluating relevance between question and hints using non-contextual embeddings such as word embeddings (GloVe).
ContextualEmbeddings: Class for evaluating relevance between question and hints using contextual embeddings such as BERT and RoBERTa models.

evaluate(instances: List[Instance], **kwargs) → List[List[float]]

Evaluates the relevance of the question and hints of the given instances using large language models [14].

Parameters:

instances (List[Instance]) – List of instances to evaluate.
**kwargs – Additional keyword arguments.

Returns:

List of relevance scores for each instance.

Return type:

List[List[float]]

Notes

This function stores the scores as Metric objects within the metrics attribute of the Hint of the instances, with names based on the model, such as “relevance-llm-meta-llama_Meta-Llama-3.1-70B-Instruct-Turbo”.
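
The metric names quoted in the Notes of the four relevance classes follow a visible pattern. The sketch below reconstructs that pattern from the documented examples only; it is an inference, not the library's code, and the helper function names are hypothetical. It can be handy when looking up a score in hint.metrics by key.

```python
def rouge_metric_name(model):
    # 'rouge1' -> 'relevance-rouge1'
    return f"relevance-{model}"


def non_contextual_metric_name(glove_version, spacy_pipeline):
    # 'glove.6B', 'en_core_web_sm' -> 'relevance-non-contextual-6B-sm'
    glove_size = glove_version.split(".")[-1]
    pipeline_size = spacy_pipeline.split("_")[-1]
    return f"relevance-non-contextual-{glove_size}-{pipeline_size}"


def contextual_metric_name(model_name):
    # 'bert-base' -> 'relevance-contextual-bert-base'
    return f"relevance-contextual-{model_name}"


def llm_metric_name(model_name):
    # The slash in the model identifier appears as an underscore in the documented key.
    return f"relevance-llm-{model_name.replace('/', '_')}"


key = llm_metric_name("meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo")
print(key)  # relevance-llm-meta-llama_Meta-Llama-3.1-70B-Instruct-Turbo
```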

Examples

>>> from hinteval.cores import Instance, Question, Hint, Answer
>>> from hinteval.evaluation.relevance import LlmBased
>>>
>>> llm = LlmBased(model_name='meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo', api_key='your_api_key', enable_tqdm=True)
>>> instance_1 = Instance(
...    question=Question('What is the capital of Austria?'),
...    answers=[Answer('Vienna')],
...    hints=[Hint('This city, once home to Mozart and Beethoven.'),
...           Hint('This city is the best city for life in 2024.')])
>>> instance_2 = Instance(
...    question=Question('Who was the president of USA in 2009?'),
...    answers=[Answer('Barack Obama')],
...    hints=[Hint('He was the first African-American president in U.S. history.'),
...           Hint('He was named the 2009 Nobel Peace Prize laureate.')])
>>> instances = [instance_1, instance_2]
>>> results = llm.evaluate(instances)
>>> print(results)
# [[1.00, 0.81], [1.00, 0.95]]
>>> metrics = [f'{metric_key}: {metric_value.value}' for
...           instance in instances
...           for hint in instance.hints for metric_key, metric_value in
...           hint.metrics.items()]
>>> print(metrics)
# ['relevance-llm-meta-llama_Meta-Llama-3.1-70B-Instruct-Turbo: 1.00', 'relevance-llm-meta-llama_Meta-Llama-3.1-70B-Instruct-Turbo: 0.81',
#  'relevance-llm-meta-llama_Meta-Llama-3.1-70B-Instruct-Turbo: 1.00', 'relevance-llm-meta-llama_Meta-Llama-3.1-70B-Instruct-Turbo: 0.95']
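
The nested list returned by evaluate holds one inner list per instance and one score per hint. A small sketch of summarizing such a result is shown below, using the scores printed above as literal input; the variable names are hypothetical, and the code assumes every instance has at least one hint.

```python
from statistics import mean

# Scores copied from the example output above: one inner list per instance.
results = [[1.00, 0.81], [1.00, 0.95]]

# Average relevance of the hints belonging to each instance.
per_instance_mean = [round(mean(scores), 3) for scores in results]

# Position of the most relevant hint within each instance (first one wins ties).
best_hint_index = [max(range(len(scores)), key=scores.__getitem__) for scores in results]

print(per_instance_mean)
print(best_hint_index)
```

The same indices can be used to pick the corresponding Hint objects from instance.hints, since the scores are returned in hint order.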

See also

Rouge: Class for evaluating relevance between question and hints using ROUGE metrics.
NonContextualEmbeddings: Class for evaluating relevance between question and hints using non-contextual embeddings such as word embeddings (GloVe).
ContextualEmbeddings: Class for evaluating relevance between question and hints using contextual embeddings such as BERT and RoBERTa models.

release_memory()

Releases the memory used by the class instance.

This method deletes the instance of the class and triggers garbage collection to free up memory.

Examples

>>> from hinteval.evaluation.relevance import LlmBased
>>>
>>> llm = LlmBased(model_name='meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo', api_key='your_api_key')
>>> llm.release_memory()