Readability

The Readability metric measures how easy a hint or question is to understand. It helps ensure that hints guide the user effectively without causing confusion.

The HintEval framework provides four methods for computing the Readability metric: Traditional, Machine-Learning, Neural-Network, and LLM.

Note

The evaluate function takes a list of Hint or Question objects as its input, where each object contains the text that needs to be evaluated for readability.

Traditional

The Traditional method evaluates readability using classic readability formulas that have been widely adopted in educational and linguistic fields. These formulas compute readability from surface features such as sentence length, word length, and word complexity. For more information, refer to the 📝original paper.

Traditional methods available:

  • Gunning Fog Index (G-Fog): Estimates the number of years of formal education a person needs to understand the text, based on sentence length and the share of complex words.

  • Flesch Reading Ease: Rates text on a 100-point scale, with higher scores indicating easier readability.

  • Coleman-Liau Index: Estimates a grade level from the average number of letters and sentences per 100 words.

  • SMOG Index: Estimates the years of education a person needs to comprehend the text, based on the number of polysyllabic (complex) words.

  • Automated Readability Index (ARI): Developed with technical documents in mind; estimates a grade level from characters per word and words per sentence.
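To make the formulas concrete, here is a minimal sketch of two of the indexes above that need no syllable counting: ARI and Coleman-Liau, using their standard published coefficients. The helper names and the simple regex tokenisation are assumptions made for illustration. This is not HintEval's implementation, so its scores may differ from what TraditionalIndexes returns.

```python
import re

def text_counts(text):
    """Return (letters, words, sentences) using simple regex tokenisation."""
    words = re.findall(r"[A-Za-z0-9']+", text)
    if not words:
        raise ValueError("text contains no words")
    # Count only alphanumeric characters, so apostrophes are ignored.
    letters = sum(ch.isalnum() for word in words for ch in word)
    # Treat each run of '.', '!' or '?' as one sentence end; assume at least one sentence.
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    return letters, len(words), sentences

def automated_readability_index(text):
    """ARI = 4.71 * (characters / words) + 0.5 * (words / sentences) - 21.43"""
    letters, words, sentences = text_counts(text)
    return 4.71 * letters / words + 0.5 * words / sentences - 21.43

def coleman_liau_index(text):
    """CLI = 0.0588 * L - 0.296 * S - 15.8, with L and S counted per 100 words."""
    letters, words, sentences = text_counts(text)
    letters_per_100 = 100 * letters / words
    sentences_per_100 = 100 * sentences / words
    return 0.0588 * letters_per_100 - 0.296 * sentences_per_100 - 15.8

question = "What is the capital of Austria?"
print(round(automated_readability_index(question), 3))  # 1.195
print(round(coleman_liau_index(question), 3))           # 3.767
```

Both values are grade-level estimates, so lower numbers mean easier text. This is the opposite direction from Flesch Reading Ease.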

Example

from hinteval.cores import Question, Hint
from hinteval.evaluation.readability import TraditionalIndexes

# Choose the readability formula to compute.
traditional_indexes = TraditionalIndexes(method='flesch_kincaid_reading_ease')
sentence_1 = Question('What is the capital of Austria?')
sentence_2 = Hint('This city, once home to Mozart and Beethoven, is the capital of Austria.')
sentences = [sentence_1, sentence_2]

# evaluate returns one score per input, in input order.
results = traditional_indexes.evaluate(sentences)
print(results)
# [87.945, 69.994]

# The scores are also stored in each object's metrics dictionary.
metrics = [f'{metric_key}: {metric_value.value}' for sent in sentences for metric_key, metric_value in
           sent.metrics.items()]
print(metrics)
# ['readability-flesch_kincaid_reading_ease-sm: 87.945', 'readability-flesch_kincaid_reading_ease-sm: 69.994']
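
The scores in the example above can be read against the difficulty bands conventionally quoted for Flesch Reading Ease. The sketch below maps a score to a band. The function name flesch_band is hypothetical and not part of HintEval, and the band boundaries follow the commonly cited interpretation table rather than anything HintEval defines.

```python
def flesch_band(score):
    """Map a Flesch Reading Ease score to its commonly cited difficulty band."""
    # (lower bound, label) pairs, checked from the easiest band downwards.
    bands = [
        (90, "very easy"),
        (80, "easy"),
        (70, "fairly easy"),
        (60, "standard"),
        (50, "fairly difficult"),
        (30, "difficult"),
    ]
    for lower_bound, label in bands:
        if score >= lower_bound:
            return label
    return "very confusing"

for score in [87.945, 69.994]:
    print(score, flesch_band(score))
# 87.945 easy
# 69.994 standard
```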

Machine-Learning

The Machine-Learning method evaluates readability using classical machine learning models that predict a readability score from text features. These models are trained on labeled datasets that map each text to a specific readability score. For more information, refer to the 📝original paper.

Machine-Learning methods available:

  • XGBoost: A gradient boosting algorithm known for its speed and accuracy in machine learning tasks.

  • Random-Forest: A popular ensemble learning method that builds multiple decision trees to improve prediction accuracy.
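
As an illustration of what "text features" can mean here, the sketch below extracts a few surface features from a sentence. The feature set and the function name readability_features are assumptions chosen for illustration. The source does not list the features HintEval's models actually use.

```python
import re

def readability_features(text):
    """Extract a few surface features of the kind a readability classifier could use."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        raise ValueError("text contains no words")
    # Treat each run of '.', '!' or '?' as one sentence end; assume at least one sentence.
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    return {
        "words_per_sentence": len(words) / sentences,
        "avg_word_length": sum(len(w) for w in words) / len(words),
        # Share of words with seven or more characters (an arbitrary cut-off).
        "long_word_ratio": sum(len(w) >= 7 for w in words) / len(words),
        # Distinct words divided by total words, a rough vocabulary-diversity measure.
        "type_token_ratio": len(set(words)) / len(words),
    }

hint = "This city, once home to Mozart and Beethoven, is the capital of Austria."
features = readability_features(hint)
print(features["words_per_sentence"])  # 13.0
```

In a setup like the one described above, feature vectors of this kind would be paired with readability labels and passed to a gradient-boosting or random-forest classifier for training.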

Example

from hinteval.cores import Question, Hint
from hinteval.evaluation.readability import MachineLearningBased

# Choose the trained model to use.
machine_learning = MachineLearningBased(method='xgboost')
sentence_1 = Question('What is the capital of Austria?')
sentence_2 = Hint('This city, once home to Mozart and Beethoven, is the capital of Austria.')
sentences = [sentence_1, sentence_2]

# evaluate returns one predicted class index per input.
results = machine_learning.evaluate(sentences)
print(results)
# [0, 0]

# The label for each class index is stored in the metric's metadata.
classes = [sent.metrics['readability-ml-xgboost-sm'].metadata['description'] for sent in sentences]
print(classes)
# ['beginner', 'beginner']

# The scores are also stored in each object's metrics dictionary.
metrics = [f'{metric_key}: {metric_value.value}' for sent in sentences for metric_key, metric_value in
           sent.metrics.items()]
print(metrics)
# ['readability-ml-xgboost-sm: 0', 'readability-ml-xgboost-sm: 0']

Neural-Network

The Neural-Network method uses pre-trained transformer models to evaluate readability. Because these models capture the context and structure of the text rather than only surface features, they allow for more nuanced readability evaluations. For more information, refer to the 📝original paper.

Neural-Network methods available:

  • BERT-base: A popular model that captures contextual relationships between words and is trained on a large corpus of English text.

  • RoBERTa-large: A larger BERT variant with a refined pre-training procedure and more training data, which typically gives better accuracy for readability predictions.

Example

from hinteval.cores import Question, Hint
from hinteval.evaluation.readability import NeuralNetworkBased

# Choose the pre-trained transformer model to use.
neural_network = NeuralNetworkBased(model_name='bert-base')
sentence_1 = Question('What is the capital of Austria?')
sentence_2 = Hint('This city, once home to Mozart and Beethoven, is the capital of Austria.')
sentences = [sentence_1, sentence_2]

# evaluate returns one predicted class index per input.
results = neural_network.evaluate(sentences)
print(results)
# [0, 0]

# The label for each class index is stored in the metric's metadata.
classes = [sent.metrics['readability-nn-bert-base'].metadata['description'] for sent in sentences]
print(classes)
# ['beginner', 'beginner']

# The scores are also stored in each object's metrics dictionary.
metrics = [f'{metric_key}: {metric_value.value}' for sent in sentences for metric_key, metric_value in
           sent.metrics.items()]
print(metrics)
# ['readability-nn-bert-base: 0', 'readability-nn-bert-base: 0']

LLM

The LLM method uses large language models to evaluate the readability of a hint or question. With models like GPT-4 or Meta-LLaMA, this approach aims to provide accurate, context-aware readability scores that reflect how easily the text can be comprehended. For more information, refer to the 📝original paper.

Note

The example below assumes you have an active API key for the TogetherAI platform and use that platform for LLM-based readability evaluation. It uses meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo as the model, which is available on TogetherAI. If you wish to use another platform, ensure the model name is valid for that platform.

For local execution, you can set api_key to None. HintEval supports running large language models (LLMs) locally via HuggingFace.

Warning

The output may vary from the example shown below due to the inherent non-deterministic nature of large language models.

Example

from hinteval.cores import Question, Hint
from hinteval.evaluation.readability import LlmBased

# Replace 'your_api_key' with your TogetherAI key, or pass None to run the model locally.
llm = LlmBased(model_name='meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo',
               api_key='your_api_key', batch_size=2)
sentence_1 = Question('What is the capital of Austria?')
sentence_2 = Hint('This city, once home to Mozart and Beethoven, is the capital of Austria.')
sentences = [sentence_1, sentence_2]

# evaluate returns one predicted class index per input.
results = llm.evaluate(sentences)
print(results)
# [0, 0]

# The label for each class index is stored in the metric's metadata.
classes = [sent.metrics['readability-llm-meta-llama_Meta-Llama-3.1-70B-Instruct-Turbo'].metadata['description'] for
           sent in sentences]
print(classes)
# ['beginner', 'beginner']

# The scores are also stored in each object's metrics dictionary.
metrics = [f'{metric_key}: {metric_value.value}' for sent in sentences for metric_key, metric_value in
           sent.metrics.items()]
print(metrics)
# ['readability-llm-meta-llama_Meta-Llama-3.1-70B-Instruct-Turbo: 0',
# 'readability-llm-meta-llama_Meta-Llama-3.1-70B-Instruct-Turbo: 0']

Comparison

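The table below compares the four methods:

| Method           | Preferred Device | Cost-Effectiveness | Accuracy  | Execution Speed |
|------------------|------------------|--------------------|-----------|-----------------|
| Traditional      | CPU              | High               | Low       | Very Fast       |
| Machine-Learning | CPU              | High               | Moderate  | Moderate        |
| Neural-Network   | GPU              | Moderate           | High      | Moderate        |
| LLM              | GPU              | Low                | Very High | Slow            |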

  • Preferred Device: Indicates whether the method works best on CPU or GPU.

  • Cost-Effectiveness: Indicates how economical the method is in terms of the computational resources it needs; High means the method is inexpensive to run.

  • Accuracy: Reflects how accurate the method is in assessing the metric.

  • Execution Speed: Indicates how quickly the method executes (e.g., Very Fast, Moderate, Slow).
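
As a small sketch of how the comparison can guide a choice, the snippet below encodes the ratings from this section as a dictionary and filters the methods by preferred device. The COMPARISON structure and the methods_by_device helper are hypothetical conveniences, not part of HintEval. A method's preferred device is where it works best, not necessarily a hard requirement.

```python
# Ratings copied from the comparison in this section.
COMPARISON = {
    "Traditional": {"device": "CPU", "cost_effectiveness": "High",
                    "accuracy": "Low", "speed": "Very Fast"},
    "Machine-Learning": {"device": "CPU", "cost_effectiveness": "High",
                         "accuracy": "Moderate", "speed": "Moderate"},
    "Neural-Network": {"device": "GPU", "cost_effectiveness": "Moderate",
                       "accuracy": "High", "speed": "Moderate"},
    "LLM": {"device": "GPU", "cost_effectiveness": "Low",
            "accuracy": "Very High", "speed": "Slow"},
}

def methods_by_device(device):
    """Return the methods whose preferred device matches, in table order."""
    return [name for name, row in COMPARISON.items() if row["device"] == device]

print(methods_by_device("CPU"))  # ['Traditional', 'Machine-Learning']
print(methods_by_device("GPU"))  # ['Neural-Network', 'LLM']
```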