Evaluates the relevance of the question and hints of the given instances using the ROUGE metric [5].
Parameters:
instances (List[Instance]) – List of instances to evaluate.
**kwargs – Additional keyword arguments.
Returns:
List of relevance scores for each instance.
Return type:
List[List[float]]
Notes
This function stores each score as a Metric object in the metrics attribute of the corresponding Hint, under a name derived from the model, such as “relevance-rouge1”.
Examples
>>> from hinteval.cores import Instance, Question, Hint, Answer
>>> from hinteval.evaluation.relevance import Rouge
>>>
>>> rouge = Rouge(model='rouge1')
>>> instance_1 = Instance(
...     question=Question('What is the capital of Austria?'),
...     answers=[Answer('Vienna')],
...     hints=[Hint('This city, once home to Mozart and Beethoven.'),
...            Hint('This city is the best city for life in 2024.')])
>>> instance_2 = Instance(
...     question=Question('Who was the president of USA in 2009?'),
...     answers=[Answer('Barack Obama')],
...     hints=[Hint('He was the first African-American president in U.S. history.'),
...            Hint('He was named the 2009 Nobel Peace Prize laureate.')])
>>> instances = [instance_1, instance_2]
>>> results = rouge.evaluate(instances)
>>> print(results)
# [[0.0, 0.25], [0.421, 0.353]]
>>> metrics = [f'{metric_key}: {metric_value.value}' for
...            instance in instances
...            for hint in instance.hints for metric_key, metric_value in
...            hint.metrics.items()]
>>> print(metrics)
# ['relevance-rouge1: 0.0', 'relevance-rouge1: 0.25', 'relevance-rouge1: 0.421', 'relevance-rouge1: 0.353']
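To make the ROUGE-1 scores above easier to interpret, the following is a minimal standalone sketch of a unigram-overlap F1 score between a question and a hint. It is not the library's implementation: the function name rouge1_f1 and the tokenization rule (lowercase, alphanumeric runs only, no stemming) are assumptions made for illustration.

```python
import re
from collections import Counter


def rouge1_f1(reference: str, candidate: str) -> float:
    """Unigram-overlap F1 between two texts (illustrative ROUGE-1 sketch)."""
    # Assumed tokenization: lowercase, keep runs of letters and digits.
    ref_counts = Counter(re.findall(r"[a-z0-9]+", reference.lower()))
    cand_counts = Counter(re.findall(r"[a-z0-9]+", candidate.lower()))

    # Clipped overlap: each shared token counts at most as often as it
    # appears in both texts.
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0

    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


question = "What is the capital of Austria?"
hints = [
    "This city, once home to Mozart and Beethoven.",
    "This city is the best city for life in 2024.",
]
for hint in hints:
    print(round(rouge1_f1(question, hint), 3))
```

Under these assumptions the first hint shares no tokens with the question and scores 0.0, while the second shares "is" and "the" (precision 2/10, recall 2/6) and scores 0.25. A low score here reflects little word overlap with the question, which is not necessarily the same as a poor hint.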
Evaluates the relevance of the question and hints of the given instances using non-contextual embeddings such as GloVe [8].
Parameters:
instances (List[Instance]) – List of instances to evaluate.
**kwargs – Additional keyword arguments.
Returns:
List of relevance scores for each instance.
Return type:
List[List[float]]
Notes
This function stores each score as a Metric object in the metrics attribute of the corresponding Hint, under a name derived from the model, such as “relevance-non-contextual-6B-sm”.
Examples
>>> from hinteval.cores import Instance, Question, Hint, Answer
>>> from hinteval.evaluation.relevance import NonContextualEmbeddings
>>>
>>> non_contextual = NonContextualEmbeddings(glove_version='glove.6B')
>>> instance_1 = Instance(
...     question=Question('What is the capital of Austria?'),
...     answers=[Answer('Vienna')],
...     hints=[Hint('This city, once home to Mozart and Beethoven.'),
...            Hint('This city is the best city for life in 2024.')])
>>> instance_2 = Instance(
...     question=Question('Who was the president of USA in 2009?'),
...     answers=[Answer('Barack Obama')],
...     hints=[Hint('He was the first African-American president in U.S. history.'),
...            Hint('He was named the 2009 Nobel Peace Prize laureate.')])
>>> instances = [instance_1, instance_2]
>>> results = non_contextual.evaluate(instances)
>>> print(results)
# [[0.867, 0.889], [0.91, 0.891]]
>>> metrics = [f'{metric_key}: {metric_value.value}' for
...            instance in instances
...            for hint in instance.hints for metric_key, metric_value in
...            hint.metrics.items()]
>>> print(metrics)
# ['relevance-non-contextual-6B-sm: 0.867', 'relevance-non-contextual-6B-sm: 0.889', 'relevance-non-contextual-6B-sm: 0.91', 'relevance-non-contextual-6B-sm: 0.891']
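The documentation does not state how word vectors are pooled or compared. One common approach with non-contextual embeddings is to average the word vectors of each text and take the cosine similarity of the two averages. The sketch below illustrates that idea only. The three-dimensional vectors in TOY_VECTORS are invented for the example (real GloVe vectors are loaded from pretrained files and have far more dimensions), and the helper names are hypothetical.

```python
import math
from typing import Dict, List

# Toy vectors, invented for illustration only; these are NOT GloVe values.
TOY_VECTORS: Dict[str, List[float]] = {
    "capital": [0.9, 0.1, 0.0],
    "city": [0.8, 0.2, 0.1],
    "austria": [0.7, 0.0, 0.3],
    "president": [0.0, 0.9, 0.2],
}


def mean_vector(tokens: List[str], vectors: Dict[str, List[float]]) -> List[float]:
    """Average the vectors of all tokens found in the vocabulary."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return []
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is empty or zero."""
    if not a or not b:
        return 0.0
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def embedding_similarity(question: str, hint: str,
                         vectors: Dict[str, List[float]] = TOY_VECTORS) -> float:
    """Cosine similarity between the averaged word vectors of two texts."""
    q_vec = mean_vector(question.lower().split(), vectors)
    h_vec = mean_vector(hint.lower().split(), vectors)
    return cosine(q_vec, h_vec)


print(embedding_similarity("capital austria", "city"))
print(embedding_similarity("capital austria", "president"))
```

With averaged vectors, most sentence pairs end up fairly close to one another, which may be why the scores in the example above all fall in a narrow, high range.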
Evaluates the relevance of the question and hints of the given instances using contextual embeddings from models such as BERT [11].
Parameters:
instances (List[Instance]) – List of instances to evaluate.
**kwargs – Additional keyword arguments.
Returns:
List of relevance scores for each instance.
Return type:
List[List[float]]
Notes
This function stores each score as a Metric object in the metrics attribute of the corresponding Hint, under a name derived from the model, such as “relevance-contextual-bert-base”.
Examples
>>> from hinteval.cores import Instance, Question, Hint, Answer
>>> from hinteval.evaluation.relevance import ContextualEmbeddings
>>>
>>> contextual = ContextualEmbeddings(model_name='bert-base')
>>> instance_1 = Instance(
...     question=Question('What is the capital of Austria?'),
...     answers=[Answer('Vienna')],
...     hints=[Hint('This city, once home to Mozart and Beethoven.'),
...            Hint('This city is the best city for life in 2024.')])
>>> instance_2 = Instance(
...     question=Question('Who was the president of USA in 2009?'),
...     answers=[Answer('Barack Obama')],
...     hints=[Hint('He was the first African-American president in U.S. history.'),
...            Hint('He was named the 2009 Nobel Peace Prize laureate.')])
>>> instances = [instance_1, instance_2]
>>> results = contextual.evaluate(instances)
>>> print(results)
# [[1.0, 1.0], [1.0, 1.0]]
>>> metrics = [f'{metric_key}: {metric_value.value}' for
...            instance in instances
...            for hint in instance.hints for metric_key, metric_value in
...            hint.metrics.items()]
>>> print(metrics)
# ['relevance-contextual-bert-base: 1.0', 'relevance-contextual-bert-base: 1.0', 'relevance-contextual-bert-base: 1.0', 'relevance-contextual-bert-base: 1.0']
Evaluates the relevance of the question and hints of the given instances using large language models [14].
Parameters:
instances (List[Instance]) – List of instances to evaluate.
**kwargs – Additional keyword arguments.
Returns:
List of relevance scores for each instance.
Return type:
List[List[float]]
Notes
This function stores each score as a Metric object in the metrics attribute of the corresponding Hint, under a name derived from the model, such as “relevance-llm-meta-llama_Meta-Llama-3.1-70B-Instruct-Turbo”.
Examples
>>> from hinteval.cores import Instance, Question, Hint, Answer
>>> from hinteval.evaluation.relevance import LlmBased
>>>
>>> llm = LlmBased(model_name='meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo',
...                api_key='your_api_key', enable_tqdm=True)
>>> instance_1 = Instance(
...     question=Question('What is the capital of Austria?'),
...     answers=[Answer('Vienna')],
...     hints=[Hint('This city, once home to Mozart and Beethoven.'),
...            Hint('This city is the best city for life in 2024.')])
>>> instance_2 = Instance(
...     question=Question('Who was the president of USA in 2009?'),
...     answers=[Answer('Barack Obama')],
...     hints=[Hint('He was the first African-American president in U.S. history.'),
...            Hint('He was named the 2009 Nobel Peace Prize laureate.')])
>>> instances = [instance_1, instance_2]
>>> results = llm.evaluate(instances)
>>> print(results)
# [[1.00, 0.81], [1.00, 0.95]]
>>> metrics = [f'{metric_key}: {metric_value.value}' for
...            instance in instances
...            for hint in instance.hints for metric_key, metric_value in
...            hint.metrics.items()]
>>> print(metrics)
# ['relevance-llm-meta-llama_Meta-Llama-3.1-70B-Instruct-Turbo: 1.00', 'relevance-llm-meta-llama_Meta-Llama-3.1-70B-Instruct-Turbo: 0.81',
# 'relevance-llm-meta-llama_Meta-Llama-3.1-70B-Instruct-Turbo: 1.00', 'relevance-llm-meta-llama_Meta-Llama-3.1-70B-Instruct-Turbo: 0.95']
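Each evaluate method returns a nested List[List[float]] with one inner list per instance and one score per hint. The sketch below shows one possible way to post-process such a result in plain Python, computing the mean relevance and the index of the highest-scoring hint for each instance. The helper name summarize is hypothetical and not part of the library, and the input reuses the scores printed in the example above.

```python
from statistics import mean
from typing import List, Optional, Tuple


def summarize(results: List[List[float]]) -> List[Tuple[Optional[float], Optional[int]]]:
    """Return (mean score, index of best hint) for each instance.

    Instances without hints yield (None, None).
    """
    summary: List[Tuple[Optional[float], Optional[int]]] = []
    for scores in results:
        if not scores:
            summary.append((None, None))
            continue
        # Index of the hint with the highest relevance score
        # (the first one wins on ties).
        best_index = max(range(len(scores)), key=scores.__getitem__)
        summary.append((mean(scores), best_index))
    return summary


# Scores copied from the example output above.
results = [[1.00, 0.81], [1.00, 0.95]]
for number, (avg, best) in enumerate(summarize(results), start=1):
    print(f"instance_{number}: mean={avg:.3f}, best hint index={best}")
```

Because the return value follows the order of the instances and of their hints, the indices can be mapped back to instance.hints directly.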