Relevance

class hinteval.cores.evaluation_metrics.relevance.Rouge(model: Literal['rouge1', 'rouge2', 'rougeL'] = 'rouge1', checkpoint: bool = False, checkpoint_step: int = 1, enable_tqdm: bool = False)

Class for evaluating relevance between question and hints using ROUGE metrics [3].

checkpoint

Whether checkpointing is enabled.

Type:

bool

checkpoint_step

Step interval for checkpointing.

Type:

int

enable_tqdm

Whether the tqdm progress bar is enabled.

Type:

bool

See also

NonContextualEmbeddings

Class for evaluating relevance between question and hints using non-contextual embeddings such as word embeddings (GloVe).

ContextualEmbeddings

Class for evaluating relevance between question and hints using contextual embeddings such as BERT and RoBERTa models.

LlmBased

Class for evaluating relevance between question and hints using large language models.

evaluate(instances: List[Instance], **kwargs) → List[List[float]]

Evaluates the relevance of the question and hints of the given instances using the ROUGE metric [5].

Parameters:
  • instances (List[Instance]) – List of instances to evaluate.

  • **kwargs – Additional keyword arguments.

Returns:

One inner list per instance, holding one relevance score per hint in that instance.

Return type:

List[List[float]]

Notes

This function stores the scores as Metric objects within the metrics attribute of the Hint of the instances, with names based on the model, such as “relevance-rouge1”.

Examples

>>> from hinteval.cores import Instance, Question, Hint, Answer
>>> from hinteval.evaluation.relevance import Rouge
>>>
>>> rouge = Rouge(model='rouge1')
>>> instance_1 = Instance(
...     question=Question('What is the capital of Austria?'),
...     answers=[Answer('Vienna')],
...     hints=[Hint('This city, once home to Mozart and Beethoven.'),
...            Hint('This city is the best city for life in 2024.')])
>>> instance_2 = Instance(
...     question=Question('Who was the president of USA in 2009?'),
...     answers=[Answer('Barack Obama')],
...     hints=[Hint('He was the first African-American president in U.S. history.'),
...            Hint('He was named the 2009 Nobel Peace Prize laureate.')])
>>> instances = [instance_1, instance_2]
>>> results = rouge.evaluate(instances)
>>> print(results)
# [[0.0, 0.25], [0.421, 0.353]]
>>> metrics = [f'{metric_key}: {metric_value.value}' for
...            instance in instances
...            for hint in instance.hints for metric_key, metric_value in
...            hint.metrics.items()]
>>> print(metrics)
# ['relevance-rouge1: 0.0', 'relevance-rouge1: 0.25', 'relevance-rouge1: 0.421', 'relevance-rouge1: 0.353']
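
The scores in the example above are consistent with ROUGE-1 computed as a unigram-overlap F1 between the question and each hint. The following is a minimal sketch of that computation, assuming lowercase alphanumeric tokenization. The helpers `tokenize` and `rouge1_f1` are hypothetical and not part of hinteval, and the library's own implementation may differ (for example in stemming or tokenization details).

```python
import re
from collections import Counter
from typing import List


def tokenize(text: str) -> List[str]:
    # Assumption: lowercase, then keep runs of letters and digits as tokens.
    return re.findall(r"[a-z0-9]+", text.lower())


def rouge1_f1(reference: str, candidate: str) -> float:
    """Unigram-overlap F1 between two texts (illustrative, not hinteval's code)."""
    ref_counts = Counter(tokenize(reference))
    cand_counts = Counter(tokenize(candidate))
    # Clipped overlap: each token counts at most as often as it occurs in both texts.
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


question = 'Who was the president of USA in 2009?'
hints = ['He was the first African-American president in U.S. history.',
         'He was named the 2009 Nobel Peace Prize laureate.']
scores = [round(rouge1_f1(question, hint), 3) for hint in hints]
print(scores)
```

Under these assumptions the sketch yields 8/19 and 6/17 for the two hints, which round to the 0.421 and 0.353 shown in the example.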

See also

NonContextualEmbeddings

Class for evaluating relevance between question and hints using non-contextual embeddings such as word embeddings (GloVe).

ContextualEmbeddings

Class for evaluating relevance between question and hints using contextual embeddings such as BERT and RoBERTa models.

LlmBased

Class for evaluating relevance between question and hints using large language models.

release_memory()

Releases the memory used by the class instance.

This method deletes the instance of the class and triggers garbage collection to free up memory.

Examples

>>> from hinteval.evaluation.relevance import Rouge
>>>
>>> rouge = Rouge(model='rouge1')
>>> rouge.release_memory()

class hinteval.cores.evaluation_metrics.relevance.NonContextualEmbeddings(glove_version: Literal['glove.6B', 'glove.42B'] = 'glove.6B', spacy_pipeline: Literal['en_core_web_sm', 'en_core_web_lg', 'en_core_web_md', 'en_core_web_trf'] = 'en_core_web_sm', batch_size: int = 256, checkpoint: bool = False, checkpoint_step: int = 1, force_download=False, enable_tqdm=False)

Class for evaluating relevance between question and hints using non-contextual embeddings such as word embeddings (GloVe) [6].

batch_size

The batch size for processing.

Type:

int

checkpoint

Whether checkpointing is enabled.

Type:

bool

checkpoint_step

Step interval for checkpointing.

Type:

int

enable_tqdm

Whether the tqdm progress bar is enabled.

Type:

bool

See also

Rouge

Class for evaluating relevance between question and hints using ROUGE metrics.

ContextualEmbeddings

Class for evaluating relevance between question and hints using contextual embeddings such as BERT and RoBERTa models.

LlmBased

Class for evaluating relevance between question and hints using large language models.

evaluate(instances: List[Instance], **kwargs) → List[List[float]]

Evaluates the relevance of the question and hints of the given instances using non-contextual embeddings such as GloVe [8].

Parameters:
  • instances (List[Instance]) – List of instances to evaluate.

  • **kwargs – Additional keyword arguments.

Returns:

One inner list per instance, holding one relevance score per hint in that instance.

Return type:

List[List[float]]

Notes

This function stores the scores as Metric objects within the metrics attribute of the Hint of the instances, with names based on the model, such as “relevance-non-contextual-6B-sm”.

Examples

>>> from hinteval.cores import Instance, Question, Hint, Answer
>>> from hinteval.evaluation.relevance import NonContextualEmbeddings
>>>
>>> non_contextual = NonContextualEmbeddings(glove_version='glove.6B')
>>> instance_1 = Instance(
...     question=Question('What is the capital of Austria?'),
...     answers=[Answer('Vienna')],
...     hints=[Hint('This city, once home to Mozart and Beethoven.'),
...            Hint('This city is the best city for life in 2024.')])
>>> instance_2 = Instance(
...     question=Question('Who was the president of USA in 2009?'),
...     answers=[Answer('Barack Obama')],
...     hints=[Hint('He was the first African-American president in U.S. history.'),
...            Hint('He was named the 2009 Nobel Peace Prize laureate.')])
>>> instances = [instance_1, instance_2]
>>> results = non_contextual.evaluate(instances)
>>> print(results)
# [[0.867, 0.889], [0.91, 0.891]]
>>> metrics = [f'{metric_key}: {metric_value.value}' for
...            instance in instances
...            for hint in instance.hints for metric_key, metric_value in
...            hint.metrics.items()]
>>> print(metrics)
# ['relevance-non-contextual-6B-sm: 0.867', 'relevance-non-contextual-6B-sm: 0.889', 'relevance-non-contextual-6B-sm: 0.91', 'relevance-non-contextual-6B-sm: 0.891']
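
For intuition only: one common way to compare two texts with non-contextual word embeddings is to average the word vectors of each text and take the cosine similarity of the two averages. Whether hinteval computes its score this way is not stated here, so the following is a sketch of the general technique rather than the library's method. The three-dimensional `TOY_VECTORS` are invented for illustration (real GloVe vectors have far more dimensions), and `average_vector` and `cosine` are hypothetical helpers.

```python
import math
from typing import Dict, List

# Assumption: invented toy vectors, standing in for real GloVe embeddings.
TOY_VECTORS: Dict[str, List[float]] = {
    "capital": [0.9, 0.1, 0.0],
    "austria": [0.7, 0.0, 0.3],
    "city": [0.8, 0.2, 0.1],
    "mozart": [0.5, 0.1, 0.6],
}


def average_vector(tokens: List[str], vectors: Dict[str, List[float]]) -> List[float]:
    """Mean of the vectors of all tokens found in the vocabulary."""
    known = [vectors[token] for token in tokens if token in vectors]
    if not known:
        return []
    dim = len(known[0])
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity; returns 0.0 for empty or zero-length vectors."""
    if not a or not b:
        return 0.0
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


question_vec = average_vector(["capital", "austria"], TOY_VECTORS)
hint_vec = average_vector(["city", "mozart"], TOY_VECTORS)
score = cosine(question_vec, hint_vec)
print(round(score, 3))
```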

See also

Rouge

Class for evaluating relevance between question and hints using ROUGE metrics.

ContextualEmbeddings

Class for evaluating relevance between question and hints using contextual embeddings such as BERT and RoBERTa models.

LlmBased

Class for evaluating relevance between question and hints using large language models.

release_memory()

Releases the memory used by the class instance.

This method deletes the instance of the class and triggers garbage collection to free up memory.

Examples

>>> from hinteval.evaluation.relevance import NonContextualEmbeddings
>>>
>>> non_contextual = NonContextualEmbeddings(glove_version='glove.6B')
>>> non_contextual.release_memory()

class hinteval.cores.evaluation_metrics.relevance.ContextualEmbeddings(model_name: Literal['bert-base', 'roberta-large'] = 'bert-base', batch_size: int = 256, checkpoint: bool = False, checkpoint_step: int = 1, force_download=False, enable_tqdm=False)

Class for evaluating relevance between question and hints using contextual embeddings such as BERT and RoBERTa models [9].

batch_size

The batch size for processing.

Type:

int

checkpoint

Whether checkpointing is enabled.

Type:

bool

checkpoint_step

Step interval for checkpointing.

Type:

int

enable_tqdm

Whether the tqdm progress bar is enabled.

Type:

bool

See also

Rouge

Class for evaluating relevance between question and hints using ROUGE metrics.

NonContextualEmbeddings

Class for evaluating relevance between question and hints using non-contextual embeddings such as word embeddings (GloVe).

LlmBased

Class for evaluating relevance between question and hints using large language models.

evaluate(instances: List[Instance], **kwargs) → List[List[float]]

Evaluates the relevance of the question and hints of the given instances using contextual embeddings such as BERT and RoBERTa models [11].

Parameters:
  • instances (List[Instance]) – List of instances to evaluate.

  • **kwargs – Additional keyword arguments.

Returns:

One inner list per instance, holding one relevance score per hint in that instance.

Return type:

List[List[float]]

Notes

This function stores the scores as Metric objects within the metrics attribute of the Hint of the instances, with names based on the model, such as “relevance-contextual-bert-base”.

Examples

>>> from hinteval.cores import Instance, Question, Hint, Answer
>>> from hinteval.evaluation.relevance import ContextualEmbeddings
>>>
>>> contextual = ContextualEmbeddings(model_name='bert-base')
>>> instance_1 = Instance(
...     question=Question('What is the capital of Austria?'),
...     answers=[Answer('Vienna')],
...     hints=[Hint('This city, once home to Mozart and Beethoven.'),
...            Hint('This city is the best city for life in 2024.')])
>>> instance_2 = Instance(
...     question=Question('Who was the president of USA in 2009?'),
...     answers=[Answer('Barack Obama')],
...     hints=[Hint('He was the first African-American president in U.S. history.'),
...            Hint('He was named the 2009 Nobel Peace Prize laureate.')])
>>> instances = [instance_1, instance_2]
>>> results = contextual.evaluate(instances)
>>> print(results)
# [[1.0, 1.0], [1.0, 1.0]]
>>> metrics = [f'{metric_key}: {metric_value.value}' for
...            instance in instances
...            for hint in instance.hints for metric_key, metric_value in
...            hint.metrics.items()]
>>> print(metrics)
# ['relevance-contextual-bert-base: 1.0', 'relevance-contextual-bert-base: 1.0', 'relevance-contextual-bert-base: 1.0', 'relevance-contextual-bert-base: 1.0']
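
Because evaluate returns a nested list (one inner list per instance, one score per hint), a small amount of post-processing is often useful. The sketch below summarizes such a result in plain Python. The helpers `mean_per_instance` and `best_hint_index` are hypothetical and not part of hinteval, and the sample scores are copied from the Rouge example earlier in this section.

```python
from statistics import mean
from typing import List, Optional


def mean_per_instance(results: List[List[float]]) -> List[float]:
    """Average relevance over the hints of each instance (0.0 if it has no hints)."""
    return [mean(scores) if scores else 0.0 for scores in results]


def best_hint_index(results: List[List[float]]) -> List[Optional[int]]:
    """Index of the highest-scoring hint per instance (None if it has no hints)."""
    return [
        max(range(len(scores)), key=scores.__getitem__) if scores else None
        for scores in results
    ]


# Scores copied from the Rouge example: two instances with two hints each.
results = [[0.0, 0.25], [0.421, 0.353]]
averages = [round(value, 3) for value in mean_per_instance(results)]
best = best_hint_index(results)
print(averages)
print(best)
```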

See also

Rouge

Class for evaluating relevance between question and hints using ROUGE metrics.

NonContextualEmbeddings

Class for evaluating relevance between question and hints using non-contextual embeddings such as word embeddings (GloVe).

LlmBased

Class for evaluating relevance between question and hints using large language models.

release_memory()

Releases the memory used by the class instance.

This method deletes the instance of the class and triggers garbage collection to free up memory.

Examples

>>> from hinteval.evaluation.relevance import ContextualEmbeddings
>>>
>>> contextual = ContextualEmbeddings(model_name='bert-base')
>>> contextual.release_memory()

class hinteval.cores.evaluation_metrics.relevance.LlmBased(model_name: str, api_key: str = None, base_url: str = 'https://api.together.xyz/v1', checkpoint: bool = False, checkpoint_step: int = 1, enable_tqdm=False)

Class for evaluating relevance between question and hints using large language models [12].

checkpoint

Whether checkpointing is enabled.

Type:

bool

checkpoint_step

Step interval for checkpointing.

Type:

int

enable_tqdm

Whether the tqdm progress bar is enabled.

Type:

bool

See also

Rouge

Class for evaluating relevance between question and hints using ROUGE metrics.

NonContextualEmbeddings

Class for evaluating relevance between question and hints using non-contextual embeddings such as word embeddings (GloVe).

ContextualEmbeddings

Class for evaluating relevance between question and hints using contextual embeddings such as BERT and RoBERTa models.

evaluate(instances: List[Instance], **kwargs) → List[List[float]]

Evaluates the relevance of the question and hints of the given instances using large language models [14].

Parameters:
  • instances (List[Instance]) – List of instances to evaluate.

  • **kwargs – Additional keyword arguments.

Returns:

One inner list per instance, holding one relevance score per hint in that instance.

Return type:

List[List[float]]

Notes

This function stores the scores as Metric objects within the metrics attribute of the Hint of the instances, with names based on the model, such as “relevance-llm-meta-llama_Meta-Llama-3.1-70B-Instruct-Turbo”.
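
The metric name in this note appears to be the model name with the slash replaced by an underscore, prefixed with "relevance-llm-". The sketch below reproduces that pattern so the key can be built when looking up a score in a hint's metrics. The rule is inferred from the single documented name only, and `llm_metric_name` is a hypothetical helper, not a hinteval function.

```python
def llm_metric_name(model_name: str) -> str:
    """Build the metric key for an LLM-based relevance score.

    Assumption: '/' in the model name becomes '_', as suggested by the one
    documented example; other characters are left unchanged.
    """
    return f"relevance-llm-{model_name.replace('/', '_')}"


name = llm_metric_name('meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo')
print(name)
```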

Examples

>>> from hinteval.cores import Instance, Question, Hint, Answer
>>> from hinteval.evaluation.relevance import LlmBased
>>>
>>> llm = LlmBased(model_name='meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo', api_key='your_api_key', enable_tqdm=True)
>>> instance_1 = Instance(
...    question=Question('What is the capital of Austria?'),
...    answers=[Answer('Vienna')],
...    hints=[Hint('This city, once home to Mozart and Beethoven.'),
...           Hint('This city is the best city for life in 2024.')])
>>> instance_2 = Instance(
...    question=Question('Who was the president of USA in 2009?'),
...    answers=[Answer('Barack Obama')],
...    hints=[Hint('He was the first African-American president in U.S. history.'),
...           Hint('He was named the 2009 Nobel Peace Prize laureate.')])
>>> instances = [instance_1, instance_2]
>>> results = llm.evaluate(instances)
>>> print(results)
# [[1.00, 0.81], [1.00, 0.95]]
>>> metrics = [f'{metric_key}: {metric_value.value}' for
...           instance in instances
...           for hint in instance.hints for metric_key, metric_value in
...           hint.metrics.items()]
>>> print(metrics)
# ['relevance-llm-meta-llama_Meta-Llama-3.1-70B-Instruct-Turbo: 1.00', 'relevance-llm-meta-llama_Meta-Llama-3.1-70B-Instruct-Turbo: 0.81',
#  'relevance-llm-meta-llama_Meta-Llama-3.1-70B-Instruct-Turbo: 1.00', 'relevance-llm-meta-llama_Meta-Llama-3.1-70B-Instruct-Turbo: 0.95']

See also

Rouge

Class for evaluating relevance between question and hints using ROUGE metrics.

NonContextualEmbeddings

Class for evaluating relevance between question and hints using non-contextual embeddings such as word embeddings (GloVe).

ContextualEmbeddings

Class for evaluating relevance between question and hints using contextual embeddings such as BERT and RoBERTa models.

release_memory()

Releases the memory used by the class instance.

This method deletes the instance of the class and triggers garbage collection to free up memory.

Examples

>>> from hinteval.evaluation.relevance import LlmBased
>>>
>>> llm = LlmBased(model_name='meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo', api_key='your_api_key')
>>> llm.release_memory()
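
As a rough illustration of the general pattern behind releasing memory in Python (dropping the last reference to an object and then triggering garbage collection), the sketch below uses a stand-in class. `Evaluator` is hypothetical and unrelated to hinteval's internals; the weak reference is used only to observe that the object is gone.

```python
import gc
import weakref


class Evaluator:
    """Hypothetical stand-in for a metric object that holds a large resource."""

    def __init__(self) -> None:
        # Placeholder for a loaded model or embedding table.
        self.model = [0.0] * 1_000_000


evaluator = Evaluator()
probe = weakref.ref(evaluator)  # Does not keep the object alive.

del evaluator   # Drop the last strong reference.
gc.collect()    # Also reclaim anything kept alive only by reference cycles.

freed = probe() is None
print(freed)
```

On CPython the object is normally reclaimed as soon as its last strong reference disappears; the explicit gc.collect() call matters mainly when reference cycles are involved.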