Large Language Models (LLMs) are increasingly used to generate questions from a given fact or context, but assessing the quality of those questions is difficult. LLM-generated questions often differ from human-written ones in length, type, and how well they fit the context and can be answered. Evaluating them is hard because most existing methods either demand substantial human effort or rely on simple statistics that miss important aspects of quality. This makes it difficult to judge the questions reliably, to improve how LLMs generate them, or to guard against misuse.
Current question generation (QG) methods automatically produce questions from given facts or context. Existing approaches either rely on simple statistical measures or require extensive manual labeling, and both are limited in capturing the full quality of generated questions: statistical metrics miss deeper meaning and context, while human labeling is time-consuming and inefficient. Although LLMs have improved substantially, how these models generate questions, and how to evaluate their quality, has received limited attention, leaving gaps in understanding and improvement.
To address these issues in question generation (QG), researchers from the University of California, Berkeley, KACST, and the University of Washington proposed an automated evaluation framework built on LLMs. The framework generates questions from a given context and evaluates them along six dimensions: question type, length, context coverage, answerability, uncommonness, and required answer length. Unlike conventional approaches that suffer from positional biases or rely on a narrow set of metrics, this method analyzes the quality and characteristics of LLM-generated questions comprehensively, compares them with human-generated questions, and shows that LLMs spread their focus evenly across the context, producing descriptive, self-contained questions that include all the information needed to answer them.
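To make the generate-then-evaluate idea concrete, the sketch below shows how such a loop could be wired up. The prompts, the `call_llm` placeholder, and the JSON rubric format are illustrative assumptions rather than the authors' actual implementation; only the six evaluation dimensions come from the paper.

```python
# Hypothetical sketch of an LLM-based QG evaluation loop.
# `call_llm` is a placeholder for any chat-completion client; the prompt
# wording is illustrative, not the authors' own.

import json

QG_PROMPT = (
    "Read the following context and write one self-contained question about it "
    "(do not refer to 'the passage' or 'the text').\n\nContext:\n{context}"
)

EVAL_PROMPT = (
    "Given the context and a question, rate the question on: question type, "
    "length, context coverage, answerability, uncommonness, and required "
    "answer length. Return a JSON object with those six keys.\n\n"
    "Context:\n{context}\n\nQuestion:\n{question}"
)

def call_llm(prompt: str) -> str:
    """Placeholder: route this to whatever chat-completion API you use."""
    raise NotImplementedError

def generate_and_evaluate(context: str) -> dict:
    # Step 1: generate a self-contained question from the context.
    question = call_llm(QG_PROMPT.format(context=context)).strip()
    # Step 2: ask the evaluator LLM to score it on the six dimensions.
    scores = json.loads(call_llm(EVAL_PROMPT.format(context=context, question=question)))
    return {"question": question, "scores": scores}
```

In practice the same loop can be run over a large corpus of paragraphs, with the per-question scores aggregated to characterize a model's QG behavior.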
In their evaluation, the researchers studied LLM-based QG on 860,000 paragraphs from the WikiText dataset, generating self-contained questions that do not refer directly to the context. They analyzed question type, length, and context coverage, finding an average question length of 15 words, with 51.1% word-level and 66.7% sentence-level context coverage. Answerability was very high when the context was provided and low without it, underscoring how strongly the questions depend on the context. The researchers also shortened the required answers from 36 to 26 words without losing quality, reflecting progress in automatic QG and its evaluation.
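As a rough illustration of how coverage numbers like these could be computed, the sketch below treats word-level coverage as the fraction of context words that appear in the generated questions, and sentence-level coverage as the fraction of context sentences sharing enough words with at least one question. Both definitions, and the 0.3 overlap threshold, are assumptions made for illustration, not the paper's exact metrics.

```python
# Toy context-coverage metrics; the definitions here are assumptions for
# illustration, not the authors' published formulas.

import re

def word_coverage(context: str, questions: list[str]) -> float:
    """Fraction of distinct context words that appear in any question."""
    ctx_words = set(re.findall(r"\w+", context.lower()))
    q_words: set[str] = set()
    for q in questions:
        q_words |= set(re.findall(r"\w+", q.lower()))
    return len(ctx_words & q_words) / max(len(ctx_words), 1)

def sentence_coverage(context: str, questions: list[str]) -> float:
    """Fraction of context sentences 'touched' by at least one question."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", context) if s.strip()]
    covered = 0
    for s in sentences:
        s_words = set(re.findall(r"\w+", s.lower()))
        # A sentence counts as covered if some question shares >= 30% of its words.
        if any(
            len(s_words & set(re.findall(r"\w+", q.lower()))) / max(len(s_words), 1) >= 0.3
            for q in questions
        ):
            covered += 1
    return covered / max(len(sentences), 1)
```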
In summary, the proposed method analyzed questions generated by LLMs and highlighted how their characteristics differ from human-generated ones. The researchers also introduced an automated evaluation method that improves understanding and optimization of QG tasks. This work can serve as a baseline for future research on LLM-based QG, including application-specific tasks, domain-specific contexts, and better alignment with human-generated questions.