{"id":16849,"date":"2024-08-29T07:54:19","date_gmt":"2024-08-29T07:54:19","guid":{"rendered":"https:\/\/blog.datumo.com\/en\/?p=16849"},"modified":"2024-10-22T09:11:15","modified_gmt":"2024-10-22T09:11:15","slug":"what-is-llm-evaluation","status":"publish","type":"post","link":"https:\/\/blog.datumo.com\/en\/tech\/16849","title":{"rendered":"\ud83e\udd9c What is LLM Evaluation?"},"content":{"rendered":"\t\t<div data-elementor-type=\"wp-post\" data-elementor-id=\"16849\" class=\"elementor elementor-16849\">\n\t\t\t\t\t\t<section class=\"elementor-section elementor-top-section elementor-element elementor-element-285dc7ff elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"285dc7ff\" data-element_type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-561963bb\" data-id=\"561963bb\" data-element_type=\"column\" data-settings=\"{&quot;background_background&quot;:&quot;classic&quot;}\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-58e2a43f elementor-widget elementor-widget-text-editor\" data-id=\"58e2a43f\" data-element_type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t<style>\/*! 
elementor - v3.23.0 - 05-08-2024 *\/\n.elementor-widget-text-editor.elementor-drop-cap-view-stacked .elementor-drop-cap{background-color:#69727d;color:#fff}.elementor-widget-text-editor.elementor-drop-cap-view-framed .elementor-drop-cap{color:#69727d;border:3px solid;background-color:transparent}.elementor-widget-text-editor:not(.elementor-drop-cap-view-default) .elementor-drop-cap{margin-top:8px}.elementor-widget-text-editor:not(.elementor-drop-cap-view-default) .elementor-drop-cap-letter{width:1em;height:1em}.elementor-widget-text-editor .elementor-drop-cap{float:left;text-align:center;line-height:1;font-size:50px}.elementor-widget-text-editor .elementor-drop-cap-letter{display:inline-block}<\/style>\n\n<span style=\"font-family: helvetica, arial, sans-serif; font-size: 12pt;\">Since the emergence of ChatGPT (November 30, 2022) [1], a wave of Large Language Models (LLMs) has been released as if in competition. In 2024 alone, we&#8217;ve seen the launch of Gemini 1.5 Pro (February 15) [2], LLaMA 3 (April 18) [3], GPT-4o (May 13) [4], Claude 3.5 Sonnet (June 20) [5], and LLaMA 3.1 (July 23) [6]. Each new model claims to outperform its predecessors. But how do we actually measure what&#8217;s better or worse?<\/span>\n\n<span style=\"font-family: helvetica, arial, sans-serif; font-size: 12pt;\">To compare the performance of two models, you could place them side by side or give them a &#8216;test&#8217; and compare scores\u2014this is what we call LLM Evaluation. Why is LLM evaluation important? Simply put, it helps us verify whether the LLMs are functioning as we want them to. We set specific goals for the LLM and then assess how well it achieves those goals. 
For example, how accurately does it translate, answer questions, solve scientific problems, or respond fairly without social biases?<\/span>\n\n<span style=\"font-family: helvetica, arial, sans-serif; font-size: 12pt;\">LLM evaluation is more complex than it might seem. LLMs have diverse capabilities, making it difficult to evaluate them with just one metric. Beyond functionality, we must also assess the quality of the generated text. Recently, alignment with &#8220;human preferences or values&#8221; has become an important evaluation criterion. It&#8217;s particularly crucial to ensure that the LLM does not leak personal information, generate harmful content, reinforce biases, or spread misinformation.<\/span><!-- notionvc: f78a96a3-b5d4-4ef1-93be-4ca8a72200ae -->\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"elementor-section elementor-top-section elementor-element elementor-element-405338eb elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"405338eb\" data-element_type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-53ac82b8\" data-id=\"53ac82b8\" data-element_type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-6de72c10 elementor-widget elementor-widget-pix-heading\" data-id=\"6de72c10\" data-element_type=\"widget\" data-widget_type=\"pix-heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t<div  class=\"pix-heading-el text-center \"><div><div class=\"slide-in-container\"><h3 class=\"font-weight-bold heading-text el-title_custom_color mb-12\" style=\"\" data-anim-type=\"\" data-anim-delay=\"\">LLM Evaluation 
Process<\/h3><\/div><\/div><\/div>\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-52dd7b23 elementor-widget elementor-widget-text-editor\" data-id=\"52dd7b23\" data-element_type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t<p><span style=\"font-family: helvetica, arial, sans-serif; font-size: 12pt;\">The LLM evaluation process can be broadly divided into three parts [7]: what to evaluate, what data to evaluate with, and how to evaluate. Let&#8217;s dive into each.<\/span><\/p>\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-4fa4efa elementor-widget elementor-widget-pix-heading\" data-id=\"4fa4efa\" data-element_type=\"widget\" data-widget_type=\"pix-heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t<div  class=\"pix-heading-el text-center \"><div><div class=\"slide-in-container\"><h5 class=\"font-weight-bold heading-text el-title_custom_color mb-12\" style=\"\" data-anim-type=\"\" data-anim-delay=\"\">1. What to Evaluate<\/h5><\/div><\/div><\/div>\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-3750e89 elementor-widget elementor-widget-text-editor\" data-id=\"3750e89\" data-element_type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t<p><span style=\"font-family: helvetica, arial, sans-serif; font-size: 12pt;\">What to evaluate depends on the LLM&#8217;s specific capabilities and which aspects of its performance are of interest. 
There are various metrics for evaluating the LLM&#8217;s core ability to generate text:<\/span><\/p><ul><li><span style=\"font-family: helvetica, arial, sans-serif; font-size: 12pt;\"><strong>Accuracy<\/strong>: Does the generated content align with facts?<\/span><\/li><li><span style=\"font-family: helvetica, arial, sans-serif; font-size: 12pt;\"><strong>Fluency<\/strong>: Is the sentence natural and easy to read?<\/span><\/li><li><span style=\"font-family: helvetica, arial, sans-serif; font-size: 12pt;\"><strong>Coherence<\/strong>: Is the content logically consistent?<\/span><\/li><li><span style=\"font-family: helvetica, arial, sans-serif; font-size: 12pt;\"><strong>Relevance<\/strong>: How well does the text align with the topic?<\/span><\/li><li><span style=\"font-family: helvetica, arial, sans-serif; font-size: 12pt;\"><strong>Naturalness<\/strong>: Does the content appear to be written by a human?<\/span><\/li><li><span style=\"font-family: helvetica, arial, sans-serif; font-size: 12pt;\"><strong>Factual Consistency<\/strong>: Is the information accurate?<\/span><\/li><li><span style=\"font-family: helvetica, arial, sans-serif; font-size: 12pt;\"><strong>Bias<\/strong>: Does the generated content contain social biases?<\/span><\/li><li><span style=\"font-family: helvetica, arial, sans-serif; font-size: 12pt;\"><strong>Harmfulness<\/strong>: Is there harmful content in the generated output?<\/span><\/li><li><span style=\"font-family: helvetica, arial, sans-serif; font-size: 12pt;\"><strong>Robustness<\/strong>: Does it still produce correct outputs when the input is noisy or slightly perturbed?<\/span><\/li><\/ul><p><span style=\"font-family: helvetica, arial, sans-serif; font-size: 12pt;\">In addition to generation capabilities, traditional NLP tasks like sentiment analysis, text classification, natural language inference, translation, summarization, dialogue, question answering, and named entity recognition are also important evaluation factors. 
Besides traditional NLP task evaluation, expertise in STEM (Science, Technology, Engineering, Mathematics) fields or understanding of specific domains (e.g., finance, law, medicine) may also be assessed.<\/span><\/p><p><span style=\"font-family: helvetica, arial, sans-serif;\"><!-- notionvc: 1cf2752e-f939-4432-92ec-4a5fec2eb646 --><\/span><\/p>\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-b33a469 elementor-widget elementor-widget-pix-heading\" data-id=\"b33a469\" data-element_type=\"widget\" data-widget_type=\"pix-heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t<div  class=\"pix-heading-el text-center \"><div><div class=\"slide-in-container\"><h5 class=\"font-weight-bold heading-text el-title_custom_color mb-12\" style=\"\" data-anim-type=\"\" data-anim-delay=\"\">2. What Data to Use for Evaluation<\/h5><\/div><\/div><\/div>\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-63d4da1 elementor-widget elementor-widget-text-editor\" data-id=\"63d4da1\" data-element_type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t<p><span style=\"font-family: helvetica, arial, sans-serif; font-size: 12pt;\">&#8216;What data to use for evaluation&#8217; is about selecting the test for evaluation. You can think of the data as the test questions. For fairness, all models should take the same test, so we use the same dataset. These datasets are called benchmark datasets, and most models are evaluated using these benchmark datasets. When a new model claims, &#8220;We&#8217;re the best,&#8221; it often means, &#8220;We scored the highest on a benchmark dataset.&#8221; Let&#8217;s explore some representative benchmark datasets. 
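Before looking at individual datasets, the scoring mechanic itself is worth a quick sketch: every model answers the same fixed question set, and the score is plain accuracy. The toy questions and the `stub_model` function below are made up for illustration and are not drawn from any real benchmark or harness.

```python
# Minimal sketch of benchmark-style scoring: every model takes the same
# fixed multiple-choice "test", and the score is plain accuracy.
# The questions and the stub model are invented for illustration.

benchmark = [
    {"question": "2 + 2 = ?",
     "choices": ["3", "4", "5", "6"], "answer": "B"},
    {"question": "H2O is commonly called?",
     "choices": ["salt", "water", "sand", "gold"], "answer": "B"},
]

def stub_model(question, choices):
    """Stand-in for a real LLM call; always picks the second choice."""
    return "B"

def evaluate(model, dataset):
    """Fraction of questions the model answers correctly."""
    correct = sum(
        model(item["question"], item["choices"]) == item["answer"]
        for item in dataset
    )
    return correct / len(dataset)

print(evaluate(stub_model, benchmark))  # 1.0 for this toy stub
```

Because every model is scored by the same `evaluate` loop on the same dataset, the resulting accuracies are directly comparable, which is exactly what leaderboard claims rest on.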
<\/span><!-- notionvc: efb919be-d6f8-4c60-858d-b9d8ac31f1c5 --><\/p>\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-893de78 elementor-widget elementor-widget-pix-heading\" data-id=\"893de78\" data-element_type=\"widget\" data-widget_type=\"pix-heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t<div  class=\"pix-heading-el text-center \"><div><div class=\"slide-in-container\"><h6 class=\"font-weight-bold heading-text el-title_custom_color mb-12\" style=\"\" data-anim-type=\"\" data-anim-delay=\"\">MMLU (Massive Multitask Language Understanding)<\/h6><\/div><\/div><\/div>\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-301bd0d elementor-widget elementor-widget-text-editor\" data-id=\"301bd0d\" data-element_type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t<p><span style=\"font-family: helvetica, arial, sans-serif; font-size: 12pt;\">Created by UC Berkeley, this benchmark, as the name suggests, evaluates comprehensive language understanding [8]. It includes 57 different topics, ranging from STEM, humanities, and social sciences to more specialized fields like law and ethics. With around 16,000 questions, the difficulty ranges from elementary to expert level, making it a tool to assess a model&#8217;s diverse language understanding. The questions are mostly multiple-choice. Want a sneak peek at some real data?<!-- notionvc: 463e23b9-6444-452f-88e5-30ac8764e25a --><\/span><br \/><!-- notionvc: efb919be-d6f8-4c60-858d-b9d8ac31f1c5 --><\/p>\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-7bec4718 elementor-widget elementor-widget-image\" data-id=\"7bec4718\" data-element_type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t<style>\/*! 
elementor - v3.23.0 - 05-08-2024 *\/\n.elementor-widget-image{text-align:center}.elementor-widget-image a{display:inline-block}.elementor-widget-image a img[src$=\".svg\"]{width:48px}.elementor-widget-image img{vertical-align:middle;display:inline-block}<\/style>\t\t\t\t\t\t\t\t\t\t<img fetchpriority=\"high\" decoding=\"async\" width=\"640\" height=\"256\" src=\"https:\/\/blog.datumo.com\/en\/wp-content\/uploads\/2024\/08\/STEM-tasks-1024x410.jpg\" class=\"attachment-large size-large wp-image-16857\" alt=\"STEM task example data\" srcset=\"https:\/\/blog.datumo.com\/en\/wp-content\/uploads\/2024\/08\/STEM-tasks-1024x410.jpg 1024w, https:\/\/blog.datumo.com\/en\/wp-content\/uploads\/2024\/08\/STEM-tasks-300x120.jpg 300w, https:\/\/blog.datumo.com\/en\/wp-content\/uploads\/2024\/08\/STEM-tasks-768x308.jpg 768w, https:\/\/blog.datumo.com\/en\/wp-content\/uploads\/2024\/08\/STEM-tasks-1536x615.jpg 1536w, https:\/\/blog.datumo.com\/en\/wp-content\/uploads\/2024\/08\/STEM-tasks.jpg 1920w\" sizes=\"(max-width: 640px) 100vw, 640px\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-1b385ac6 elementor-widget elementor-widget-pix-text\" data-id=\"1b385ac6\" data-element_type=\"widget\" data-widget_type=\"pix-text.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t<div class=\"pix-el-text w-100 text-center \" ><p class=\"text-xs  text-gray-6 text-center font-weight-bold font-italic\" >(Image Source) [8]<\/p><\/div>\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-592a3a8d elementor-widget elementor-widget-text-editor\" data-id=\"592a3a8d\" data-element_type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t<p style=\"text-align: left;\"><span style=\"font-family: helvetica, arial, sans-serif; font-size: 12pt;\">The baseline for random choice (guessing) is 25%, non-expert humans scored 34.5% 
accuracy, and GPT-3 achieved 43.9% accuracy back in 2020. As of August 2024, the top scorer on the MMLU benchmark is Gemini 1.0 Ultra [9], with a 90.0% accuracy, slightly surpassing human expert accuracy of 89.8%, highlighting the rapid advancements in LLMs over just four years.<\/span><\/p>\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-f0831fb elementor-widget elementor-widget-pix-heading\" data-id=\"f0831fb\" data-element_type=\"widget\" data-widget_type=\"pix-heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t<div  class=\"pix-heading-el text-center \"><div><div class=\"slide-in-container\"><h6 class=\"font-weight-bold heading-text el-title_custom_color mb-12\" style=\"\" data-anim-type=\"\" data-anim-delay=\"\">GPQA (A Graduate-Level Google-Proof Q&A Benchmark)<\/h6><\/div><\/div><\/div>\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-35b54fd elementor-widget elementor-widget-text-editor\" data-id=\"35b54fd\" data-element_type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t<p><span style=\"font-family: helvetica, arial, sans-serif; font-size: 12pt;\">GPQA is, as the name suggests, a benchmark dataset of graduate-level questions designed to be &#8216;Google-proof&#8217;: so difficult that a web search alone won&#8217;t readily surface the answers [10]. It consists of 448 multiple-choice questions crafted by experts in physics, chemistry, and biology. PhD holders and doctoral students in these fields solve the questions with 65% accuracy, while highly trained non-experts score only 34%. As of November 2023, GPT-4 scored just 39% accuracy, highlighting how challenging the benchmark is. Here&#8217;s a sample of the actual data. 
Even just understanding the question is a challenge in itself.<\/span><\/p>\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-901f7bb elementor-widget elementor-widget-image\" data-id=\"901f7bb\" data-element_type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" width=\"640\" height=\"159\" src=\"https:\/\/blog.datumo.com\/en\/wp-content\/uploads\/2024\/08\/quantum-mechanics-1024x254.png\" class=\"attachment-large size-large wp-image-16858\" alt=\"\" srcset=\"https:\/\/blog.datumo.com\/en\/wp-content\/uploads\/2024\/08\/quantum-mechanics-1024x254.png 1024w, https:\/\/blog.datumo.com\/en\/wp-content\/uploads\/2024\/08\/quantum-mechanics-300x75.png 300w, https:\/\/blog.datumo.com\/en\/wp-content\/uploads\/2024\/08\/quantum-mechanics-768x191.png 768w, https:\/\/blog.datumo.com\/en\/wp-content\/uploads\/2024\/08\/quantum-mechanics-1536x382.png 1536w, https:\/\/blog.datumo.com\/en\/wp-content\/uploads\/2024\/08\/quantum-mechanics.png 1920w\" sizes=\"(max-width: 640px) 100vw, 640px\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-d39d00d elementor-widget elementor-widget-pix-text\" data-id=\"d39d00d\" data-element_type=\"widget\" data-widget_type=\"pix-text.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t<div class=\"pix-el-text w-100 text-center \" ><p class=\"text-xs  text-gray-6 text-center font-weight-bold font-italic\" >(Image Source) [10]<\/p><\/div>\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-0cff1ee elementor-widget elementor-widget-text-editor\" data-id=\"0cff1ee\" data-element_type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t<p><span style=\"font-family: helvetica, arial, 
sans-serif; font-size: 12pt;\">As of August 2024, the top scorer on the GPQA benchmark is Claude 3.5 Sonnet [5], with an accuracy of 59.4%. While this score still lags behind the recent MMLU score of 90.0%, the rapid progress in AI suggests that GPQA scores will also quickly improve.<!-- notionvc: 0108e818-c030-4179-a673-d98f93ca8f95 --><\/span><br \/><!-- notionvc: d07aeecb-4339-472a-a94b-202eecc0821d --><\/p>\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-92e50a7 elementor-widget elementor-widget-pix-heading\" data-id=\"92e50a7\" data-element_type=\"widget\" data-widget_type=\"pix-heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t<div  class=\"pix-heading-el text-center \"><div><div class=\"slide-in-container\"><h6 class=\"font-weight-bold heading-text el-title_custom_color mb-12\" style=\"\" data-anim-type=\"\" data-anim-delay=\"\">HumanEval<\/h6><\/div><\/div><\/div>\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-d37c73c elementor-widget elementor-widget-text-editor\" data-id=\"d37c73c\" data-element_type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t<p><span style=\"font-family: helvetica, arial, sans-serif; font-size: 12pt;\">Developed by OpenAI, this dataset evaluates how well a model can generate Python functions from Python docstrings [11]. It consists of 164 programming tasks, with an average of 7.7 tests per problem. These tasks evaluate understanding of programming languages, algorithms, and basic mathematics, resembling typical software interview questions.<\/span><\/p><p><span style=\"font-family: helvetica, arial, sans-serif; font-size: 12pt;\">How do you evaluate the quality of function generation? HumanEval assesses accuracy by checking whether the generated code passes predefined unit tests. 
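The pass/fail mechanism can be sketched in a few lines: run the candidate code, then run the problem's hidden unit tests against it. This is a toy illustration of the idea only, with a made-up task and tests, not OpenAI's actual harness, which additionally sandboxes execution and reports pass@k over many samples.

```python
def run_candidate(candidate_code: str, test_code: str) -> bool:
    """Execute model-generated code, then the problem's unit tests.
    Returns True only if every assertion passes.
    NOTE: exec() on untrusted code is unsafe; real harnesses sandbox this.
    """
    namespace = {}
    try:
        exec(candidate_code, namespace)   # define the generated function
        exec(test_code, namespace)        # run the hidden unit tests
        return True
    except Exception:
        return False

# A made-up task: the model was asked to implement add(a, b).
generated = "def add(a, b):\n    return a + b\n"
tests = "assert add(2, 3) == 5\nassert add(-1, 1) == 0\n"

print(run_candidate(generated, tests))  # True
```

A buggy completion (say, `return a - b`) would raise an `AssertionError` inside the second `exec` and be scored as a failure, which is the whole point: correctness is judged by behavior, not by how the code looks.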
If the code generated by the model passes these unit tests, it&#8217;s considered correct. This approach focuses on the practical ability of the model to generate accurate and functional code, making it a more valuable evaluation method than merely comparing the code text. Here&#8217;s a sample of the actual data.<\/span><span style=\"font-family: helvetica, arial, sans-serif; font-size: 12pt;\"><!-- notionvc: 5085fd9c-381a-450b-ad68-0ac60a761c55 --><\/span><!-- notionvc: d07aeecb-4339-472a-a94b-202eecc0821d --><\/p>\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-b231e87 elementor-widget elementor-widget-image\" data-id=\"b231e87\" data-element_type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" width=\"640\" height=\"340\" src=\"https:\/\/blog.datumo.com\/en\/wp-content\/uploads\/2024\/08\/code-image-1024x544.jpg\" class=\"attachment-large size-large wp-image-16862\" alt=\"\" srcset=\"https:\/\/blog.datumo.com\/en\/wp-content\/uploads\/2024\/08\/code-image-1024x544.jpg 1024w, https:\/\/blog.datumo.com\/en\/wp-content\/uploads\/2024\/08\/code-image-300x159.jpg 300w, https:\/\/blog.datumo.com\/en\/wp-content\/uploads\/2024\/08\/code-image-768x408.jpg 768w, https:\/\/blog.datumo.com\/en\/wp-content\/uploads\/2024\/08\/code-image-1536x816.jpg 1536w, https:\/\/blog.datumo.com\/en\/wp-content\/uploads\/2024\/08\/code-image.jpg 1920w\" sizes=\"(max-width: 640px) 100vw, 640px\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-234e488 elementor-widget elementor-widget-pix-text\" data-id=\"234e488\" data-element_type=\"widget\" data-widget_type=\"pix-text.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t<div class=\"pix-el-text w-100 text-center \" ><p class=\"text-xs  text-gray-6 text-center font-weight-bold font-italic\" >(Image Source) 
[11]<\/p><\/div>\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-d054e02 elementor-widget elementor-widget-text-editor\" data-id=\"d054e02\" data-element_type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t<p><span style=\"font-family: helvetica, arial, sans-serif; font-size: 12pt;\">The code on the white background is the input prompt, and the code on the yellow background is the output generated by the model.<\/span><\/p><p><span style=\"font-family: helvetica, arial, sans-serif; font-size: 12pt;\">OpenAI created the Codex-12B model to tackle this challenge, and it successfully solved 28.8% of the tasks. In 2021, the GPT-3 model had a success rate close to 0%. Currently, the top scorer on the HumanEval benchmark is the LDB model, with an accuracy of 98.2% [12].<\/span><\/p><p><span style=\"font-family: helvetica, arial, sans-serif; font-size: 12pt;\"><!-- notionvc: 2f618566-1eb9-4565-ba99-e345995b0793 --><\/span><\/p>\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-ec1bb3b elementor-widget elementor-widget-pix-heading\" data-id=\"ec1bb3b\" data-element_type=\"widget\" data-widget_type=\"pix-heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t<div  class=\"pix-heading-el text-center \"><div><div class=\"slide-in-container\"><h6 class=\"font-weight-bold heading-text el-title_custom_color mb-12\" style=\"\" data-anim-type=\"\" data-anim-delay=\"\">Benchmark Datasets<\/h6><\/div><\/div><\/div>\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-84bde4c elementor-widget elementor-widget-text-editor\" data-id=\"84bde4c\" data-element_type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t<p><span style=\"font-family: helvetica, arial, sans-serif; font-size: 12pt;\">There are many 
different benchmark datasets for evaluating LLMs. Let&#8217;s take a look at some:<\/span><\/p><ol><li><span style=\"font-family: helvetica, arial, sans-serif; font-size: 12pt;\"><strong>ARC Challenge<\/strong>: A benchmark that evaluates a model&#8217;s reasoning ability through grade-school (U.S. grades 3-9) science exams.<\/span><\/li><li><span style=\"font-family: helvetica, arial, sans-serif; font-size: 12pt;\"><strong>HellaSwag<\/strong>: A benchmark that assesses common sense reasoning by predicting various scenarios that could occur in everyday situations.<\/span><\/li><li><span style=\"font-family: helvetica, arial, sans-serif; font-size: 12pt;\"><strong>TruthfulQA<\/strong>: A benchmark designed to evaluate whether a model\u2019s answers are truthful, i.e., whether it avoids repeating common falsehoods.<\/span><\/li><\/ol><p><span style=\"font-family: helvetica, arial, sans-serif; font-size: 12pt;\">All the datasets introduced so far are based on English. So, how can we evaluate a model&#8217;s Korean language capabilities?<\/span><\/p><p><span style=\"font-family: helvetica, arial, sans-serif; font-size: 12pt;\">Of course, there are Korean benchmark datasets available. One representative dataset is the Open Ko-LLM Leaderboard2 [13], created and managed jointly by SelectStar and Upstage. Open Ko-LLM Leaderboard (version 1) [14] is a dataset that translates existing English benchmark datasets such as ARC, HellaSwag, MMLU, and TruthfulQA into Korean. Leaderboard2 includes translated datasets like GPQA and GSM8K, plus the newly added KorNAT [15] dataset, created by SelectStar, which evaluates how well a model reflects the social values and basic knowledge of Koreans. 
This will be very useful when evaluating the Korean models you&#8217;ve created!<\/span><\/p><p><span style=\"font-family: helvetica, arial, sans-serif; font-size: 12pt;\"><!-- notionvc: 7174f20c-36cd-4a97-8c92-6d2d40fcae81 --><\/span><\/p><p><span style=\"font-family: helvetica, arial, sans-serif; font-size: 12pt;\"><!-- notionvc: 2f618566-1eb9-4565-ba99-e345995b0793 --><\/span><\/p>\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-4384277 elementor-widget elementor-widget-pix-heading\" data-id=\"4384277\" data-element_type=\"widget\" data-widget_type=\"pix-heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t<div  class=\"pix-heading-el text-center \"><div><div class=\"slide-in-container\"><h5 class=\"font-weight-bold heading-text el-title_custom_color mb-12\" style=\"\" data-anim-type=\"\" data-anim-delay=\"\">3. How to Evaluate<\/h5><\/div><\/div><\/div>\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-d2a9d2d elementor-widget elementor-widget-text-editor\" data-id=\"d2a9d2d\" data-element_type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t<p><span style=\"font-family: helvetica, arial, sans-serif; font-size: 12pt;\">Finally, let&#8217;s talk about how to evaluate. 
Evaluation methods are broadly divided into automatic evaluation and human evaluation.<\/span><\/p>\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-a95385f elementor-widget elementor-widget-pix-heading\" data-id=\"a95385f\" data-element_type=\"widget\" data-widget_type=\"pix-heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t<div  class=\"pix-heading-el text-center \"><div><div class=\"slide-in-container\"><h6 class=\"font-weight-bold heading-text el-title_custom_color mb-12\" style=\"\" data-anim-type=\"\" data-anim-delay=\"\">Automatic Evaluation<\/h6><\/div><\/div><\/div>\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-253aca7 elementor-widget elementor-widget-text-editor\" data-id=\"253aca7\" data-element_type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t<p><span style=\"font-family: helvetica, arial, sans-serif; font-size: 12pt;\">Automatic evaluation can be further divided into model-based evaluation and evaluation using LLMs. The most representative example of model-based evaluation is accuracy-based assessment, where you determine how many multiple-choice questions the model gets right, as seen in the benchmark datasets above. It\u2019s simple and straightforward. Another method gives each question a correct (or reference) text and measures how similar the generated text is to that reference. This approach was commonly used for traditional NLP tasks like translation or summarization, utilizing evaluation metrics such as perplexity, BLEU, ROUGE, and METEOR. 
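At their core, these overlap metrics count shared tokens between the candidate and the reference. A stripped-down, illustrative ROUGE-1 recall, which ignores the real implementations' stemming, n-gram variants, and multi-reference handling, might look like this:

```python
from collections import Counter

def rouge1_recall(candidate: str, reference: str) -> float:
    """Toy ROUGE-1 recall: fraction of reference unigrams that also
    appear in the candidate, with per-word counts clipped."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum(min(cand[w], ref[w]) for w in ref)
    return overlap / max(sum(ref.values()), 1)

print(rouge1_recall("the cat sat on the mat",
                    "the cat is on the mat"))  # 5/6 ≈ 0.83
```

The example also makes the token-level limitation concrete: a candidate could reorder words into nonsense and keep the same score, which is exactly why such metrics can disagree with human judgments.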
These metrics compare the generated text with the reference text at the token level to quantify how predictable the text is and how well it reflects the reference content.<\/span><\/p><p><span style=\"font-family: helvetica, arial, sans-serif; font-size: 12pt;\">However, these evaluation metrics have limitations, such as not considering the context of the generated text (since they only compare tokens) or not aligning with actual human judgments. Therefore, alternative evaluation methods have recently been explored, namely using LLMs for evaluation. In this approach, another LLM acts as a &#8216;judge&#8217; to evaluate the text generated by the target LLM. The &#8216;judge LLM&#8217; is given a prompt with the evaluation criteria and the generated text, and it assigns scores directly. Various methods, such as LLM-derived metrics, Prompting LLMs, Fine-tuning LLMs, and Human-LLM Collaborative Evaluation, are being researched [16]. 
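A minimal judge setup only needs a scoring prompt and a way to read the score back out. In the sketch below, `call_llm` is a hypothetical stand-in for whatever chat-completion API is in use, and the prompt wording is illustrative rather than any standard template:

```python
import re

# Illustrative judge prompt; real rubrics are usually more detailed.
JUDGE_PROMPT = """You are an impartial judge. Rate the RESPONSE below for
fluency, coherence, and relevance to the QUESTION, on a scale of 1-10.
Reply with the score first, e.g. "Score: 7", then a one-line justification.

QUESTION: {question}
RESPONSE: {response}
"""

def judge(question: str, response: str, call_llm) -> int:
    """Ask a judge LLM to score a response. call_llm is a hypothetical
    function mapping a prompt string to the judge model's reply text."""
    reply = call_llm(JUDGE_PROMPT.format(question=question, response=response))
    match = re.search(r"Score:\s*(\d+)", reply)
    if match is None:
        raise ValueError(f"Could not parse a score from: {reply!r}")
    return int(match.group(1))

# A canned reply stands in for a real judge model here:
print(judge("What is LLM evaluation?", "It measures model quality.",
            lambda prompt: "Score: 8 - concise and on-topic"))  # 8
```

Asking for a fixed output shape ("Score: N") and parsing it defensively matters in practice, since judge models do not always follow formatting instructions.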
Each method has its pros and cons, so the choice depends on the evaluation purpose and context.<\/span><span style=\"font-family: helvetica, arial, sans-serif; font-size: 12pt;\"><!-- notionvc: 59acff02-fdc0-417a-a4fe-906277d74edb --><\/span><!-- notionvc: 9eda776d-f16c-4f87-a1b7-7bc8ac691979 --><\/p><p><span style=\"font-family: helvetica, arial, sans-serif; font-size: 12pt;\"><!-- notionvc: 2f618566-1eb9-4565-ba99-e345995b0793 --><\/span><\/p>\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-bee656e elementor-widget elementor-widget-pix-heading\" data-id=\"bee656e\" data-element_type=\"widget\" data-widget_type=\"pix-heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t<div  class=\"pix-heading-el text-center \"><div><div class=\"slide-in-container\"><h6 class=\"font-weight-bold heading-text el-title_custom_color mb-12\" style=\"\" data-anim-type=\"\" data-anim-delay=\"\">Human Evaluation<\/h6><\/div><\/div><\/div>\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-6e0fe25 elementor-widget elementor-widget-text-editor\" data-id=\"6e0fe25\" data-element_type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t<p><span style=\"font-family: helvetica, arial, sans-serif; font-size: 12pt;\">Human evaluation, as the name suggests, involves humans, rather than models, directly evaluating the output. This approach can capture subjective content or nuances that automatic evaluation might miss and also consider ethical issues. It allows for more detailed assessment of the text&#8217;s naturalness, accuracy, and coherence. Human evaluation can be further divided into expert evaluation and crowdsourcing evaluation, depending on the evaluators. Expert evaluation involves reviewers with domain-specific knowledge assessing the model&#8217;s output. 
For example, experts in finance, law, or medicine can evaluate the model&#8217;s answers in their respective fields, allowing far more accurate evaluation than assessments by general users. Crowdsourcing, on the other hand, involves general users evaluating the model&#8217;s output for fluency, accuracy, and appropriateness.<\/span><\/p><p><span style=\"font-family: helvetica, arial, sans-serif; font-size: 12pt;\">One downside of human evaluation is its lack of scalability: having humans review and score every question and answer takes significant time and money, making large-scale evaluation challenging. Domain experts are few in number and cost more than general evaluators, which makes large-scale expert evaluation harder still. Another drawback is inconsistency: different evaluators bring different standards and interpretations, and cultural and personal differences create high variability between evaluators, reducing the stability of the assessment.<\/span><\/p>\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"elementor-section elementor-top-section elementor-element elementor-element-3d0161f4 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"3d0161f4\" data-element_type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-52984f2b\" data-id=\"52984f2b\" data-element_type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap 
elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-6e121ef0 elementor-widget elementor-widget-text-editor\" data-id=\"6e121ef0\" data-element_type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t<p><span style=\"font-family: helvetica, arial, sans-serif; font-size: 12pt;\"><br><\/span><\/p>\n<p><span style=\"font-family: helvetica, arial, sans-serif; font-size: 12pt;\">As models continue to surpass human capabilities in more areas, evaluation standards and methodologies must evolve with them. We need to work out how to incorporate complex factors such as ethics, multilingual ability, and real-world applicability into our evaluations.<\/span><\/p>\n<p><span style=\"font-family: helvetica, arial, sans-serif; font-size: 12pt;\">Only when we consider not just superior performance but also the impact on real-life applications can we truly say we&#8217;re creating AI that benefits humanity.<\/span><\/p><p><span style=\"font-family: helvetica, arial, sans-serif; font-size: 12pt;\"><br><\/span><\/p>\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-b787fe4 elementor-widget elementor-widget-pix-heading\" data-id=\"b787fe4\" data-element_type=\"widget\" data-widget_type=\"pix-heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t<div  class=\"pix-heading-el text-center \"><div><div class=\"slide-in-container\"><div class=\"font-weight-bold heading-text el-title_custom_color mb-12\" style=\"\" data-anim-type=\"\" data-anim-delay=\"\">References<\/div><\/div><\/div><\/div>\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-183adc9f elementor-widget elementor-widget-text-editor\" data-id=\"183adc9f\" data-element_type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div 
class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t<ol><li><span style=\"font-family: helvetica, arial, sans-serif; font-size: 12pt;\"><a href=\"https:\/\/openai.com\/index\/chatgpt\/\">https:\/\/openai.com\/index\/chatgpt\/<\/a><\/span><\/li><li><span style=\"font-family: helvetica, arial, sans-serif; font-size: 12pt;\">Gemini 1.5 pro,\u00a0<a href=\"https:\/\/blog.google\/technology\/ai\/google-gemini-next-generation-model-february-2024\/\">https:\/\/blog.google\/technology\/ai\/google-gemini-next-generation-model-february-2024\/<\/a><\/span><\/li><li><span style=\"font-family: helvetica, arial, sans-serif; font-size: 12pt;\">LLaMA 3,\u00a0<a href=\"https:\/\/ai.meta.com\/blog\/meta-llama-3\/\">https:\/\/ai.meta.com\/blog\/meta-llama-3\/<\/a><\/span><\/li><li><span style=\"font-family: helvetica, arial, sans-serif; font-size: 12pt;\">GPT-4o,\u00a0<a href=\"https:\/\/openai.com\/index\/hello-gpt-4o\/\">https:\/\/openai.com\/index\/hello-gpt-4o\/<\/a><\/span><\/li><li><span style=\"font-family: helvetica, arial, sans-serif; font-size: 12pt;\">Claude 3.5 Sonnet,\u00a0<a href=\"https:\/\/www.anthropic.com\/news\/claude-3-5-sonnet\">https:\/\/www.anthropic.com\/news\/claude-3-5-sonnet<\/a><\/span><\/li><li><span style=\"font-family: helvetica, arial, sans-serif; font-size: 12pt;\">LLaMA 3.1,\u00a0<a href=\"https:\/\/ai.meta.com\/blog\/meta-llama-3-1\/\">https:\/\/ai.meta.com\/blog\/meta-llama-3-1\/<\/a><\/span><\/li><li><span style=\"font-family: helvetica, arial, sans-serif; font-size: 12pt;\"><a href=\"https:\/\/arxiv.org\/abs\/2307.03109\">https:\/\/arxiv.org\/abs\/2307.03109<\/a><\/span><\/li><li><span style=\"font-family: helvetica, arial, sans-serif; font-size: 12pt;\"><a href=\"https:\/\/arxiv.org\/abs\/2009.03300\">https:\/\/arxiv.org\/abs\/2009.03300<\/a><\/span><\/li><li><span style=\"font-family: helvetica, arial, sans-serif; font-size: 12pt;\"><a 
href=\"https:\/\/deepmind.google\/technologies\/gemini\/ultra\/\">https:\/\/deepmind.google\/technologies\/gemini\/ultra\/<\/a><\/span><\/li><li><span style=\"font-family: helvetica, arial, sans-serif; font-size: 12pt;\"><a href=\"https:\/\/arxiv.org\/abs\/2311.12022\">https:\/\/arxiv.org\/abs\/2311.12022<\/a><\/span><\/li><li><span style=\"font-family: helvetica, arial, sans-serif; font-size: 12pt;\"><a href=\"https:\/\/arxiv.org\/abs\/2107.03374\">https:\/\/arxiv.org\/abs\/2107.03374<\/a><\/span><\/li><li><span style=\"font-family: helvetica, arial, sans-serif; font-size: 12pt;\"><a href=\"https:\/\/arxiv.org\/abs\/2402.16906\">https:\/\/arxiv.org\/abs\/2402.16906<\/a><\/span><\/li><li><span style=\"font-family: helvetica, arial, sans-serif; font-size: 12pt;\"><a href=\"https:\/\/huggingface.co\/spaces\/upstage\/open-ko-llm-leaderboard\">https:\/\/huggingface.co\/spaces\/upstage\/open-ko-llm-leaderboard<\/a><\/span><\/li><li><span style=\"font-family: helvetica, arial, sans-serif; font-size: 12pt;\"><a href=\"https:\/\/huggingface.co\/spaces\/choco9966\/open-ko-llm-leaderboard-old\">https:\/\/huggingface.co\/spaces\/choco9966\/open-ko-llm-leaderboard-old<\/a><\/span><\/li><li><span style=\"font-family: helvetica, arial, sans-serif; font-size: 12pt;\"><a href=\"https:\/\/arxiv.org\/abs\/2402.13605\">https:\/\/arxiv.org\/abs\/2402.13605<\/a><\/span><\/li><li><span style=\"font-family: helvetica, arial, sans-serif; font-size: 12pt;\"><a href=\"https:\/\/arxiv.org\/abs\/2402.01383\">https:\/\/arxiv.org\/abs\/2402.01383<\/a><\/span><\/li><\/ol>\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"elementor-section elementor-top-section elementor-element elementor-element-b551bc8 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"b551bc8\" data-element_type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container 
elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-532653a7\" data-id=\"532653a7\" data-element_type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-3554a5dd elementor-invisible elementor-widget elementor-widget-pix-heading\" data-id=\"3554a5dd\" data-element_type=\"widget\" data-widget_type=\"pix-heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t<div  class=\"pix-heading-el text-center \"><div><div class=\"slide-in-container\"><h3 class=\"font-weight-bold animate-in heading-text el-title_custom_color mb-12\" style=\"\" data-anim-type=\"slide-in-up\" data-anim-delay=\"0\">Your AI Data Standard<\/h3><\/div><\/div><\/div>\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"elementor-section elementor-top-section elementor-element elementor-element-50d950c3 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"50d950c3\" data-element_type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"elementor-column elementor-col-50 elementor-top-column elementor-element elementor-element-57380683\" data-id=\"57380683\" data-element_type=\"column\" data-settings=\"{&quot;background_background&quot;:&quot;classic&quot;}\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-4813205e elementor-widget elementor-widget-pix-heading\" data-id=\"4813205e\" data-element_type=\"widget\" data-widget_type=\"pix-heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t<div  class=\"pix-heading-el text-center \"><div><div class=\"slide-in-container\"><h5 class=\"text-white font-weight-bold 
heading-text el-title_custom_color mb-12\" style=\"\" data-anim-type=\"\" data-anim-delay=\"\">LLM Evaluation Platform<\/h5><\/div><\/div><\/div>\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-41b61ab4 elementor-widget elementor-widget-pix-button\" data-id=\"41b61ab4\" data-element_type=\"widget\" data-widget_type=\"pix-button.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t<span  class=\"btn m-0     text-primary btn-white d-inline-block      btn-normal\"     ><span class=\"font-weight-bold \" >Learn more<\/span> <i class=\"font-weight-bold pixicon-arrow-right2   ml-1\"><\/i><\/span>\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t<div class=\"elementor-column elementor-col-50 elementor-top-column elementor-element elementor-element-15b8a6ae\" data-id=\"15b8a6ae\" data-element_type=\"column\" data-settings=\"{&quot;background_background&quot;:&quot;classic&quot;}\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-742cf95 elementor-widget elementor-widget-pix-heading\" data-id=\"742cf95\" data-element_type=\"widget\" data-widget_type=\"pix-heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t<div  class=\"pix-heading-el text-center \"><div><div class=\"slide-in-container\"><h5 class=\"text-primary font-weight-bold heading-text el-title_custom_color mb-12\" style=\"\" data-anim-type=\"\" data-anim-delay=\"\">About Datumo<\/h5><\/div><\/div><\/div>\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-7a3438c elementor-widget elementor-widget-pix-button\" data-id=\"7a3438c\" data-element_type=\"widget\" data-widget_type=\"pix-button.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t<span  class=\"btn m-0     btn-primary d-inline-block      btn-normal\"     ><span class=\"font-weight-bold \" >Learn more<\/span> <i 
class=\"font-weight-bold pixicon-arrow-right2   ml-1\"><\/i><\/span>\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<\/div>\n\t\t","protected":false},"excerpt":{"rendered":"Since the emergence of ChatGPT (November 30, 2022) [1], a wave of Large Language Models (LLMs) has been released as if in competition. Just in 2024 alone, we&#8217;ve seen the launch of Gemini 1.5 pro (February 15) [2], LLaMA 3&#8230;","protected":false},"author":1,"featured_media":16851,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[131],"tags":[],"class_list":["post-16849","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-tech"],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v23.3 - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>LLM Evaluation: Definition &amp; methods of evaluating LLMs - Datumo<\/title>\n<meta name=\"description\" content=\"LLM evaluation process in three parts: what to evaluate, with what data to evaluate, and how to evaluate LLM for its safety and performance.\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/blog.datumo.com\/en\/tech\/16849\" \/>\n<meta property=\"og:locale\" content=\"ko_KR\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"LLM Evaluation: Definition &amp; methods of evaluating LLMs - Datumo\" \/>\n<meta property=\"og:description\" content=\"LLM evaluation process in three parts: what to evaluate, with what data to evaluate, and how to evaluate LLM for its safety and performance.\" \/>\n<meta property=\"og:url\" content=\"https:\/\/blog.datumo.com\/en\/tech\/16849\" \/>\n<meta property=\"og:site_name\" content=\"DATUMO\" \/>\n<meta property=\"article:published_time\" 
content=\"2024-08-29T07:54:19+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2024-10-22T09:11:15+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/blog.datumo.com\/en\/wp-content\/uploads\/2024\/08\/jigar-panchal-4slT2XvKnio-unsplash.jpg\" \/>\n\t<meta property=\"og:image:width\" content=\"1920\" \/>\n\t<meta property=\"og:image:height\" content=\"1080\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/jpeg\" \/>\n<meta name=\"author\" content=\"DATUMO\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:label1\" content=\"\uae00\uc4f4\uc774\" \/>\n\t<meta name=\"twitter:data1\" content=\"DATUMO\" \/>\n\t<meta name=\"twitter:label2\" content=\"\uc608\uc0c1 \ub418\ub294 \ud310\ub3c5 \uc2dc\uac04\" \/>\n\t<meta name=\"twitter:data2\" content=\"11\ubd84\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\/\/blog.datumo.com\/en\/tech\/16849#article\",\"isPartOf\":{\"@id\":\"https:\/\/blog.datumo.com\/en\/tech\/16849\"},\"author\":{\"name\":\"DATUMO\",\"@id\":\"https:\/\/blog.datumo.com\/#\/schema\/person\/02ec2d0ba953b146878dab089dc735b6\"},\"headline\":\"\ud83e\udd9c What is LLM Evaluation?\",\"datePublished\":\"2024-08-29T07:54:19+00:00\",\"dateModified\":\"2024-10-22T09:11:15+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\/\/blog.datumo.com\/en\/tech\/16849\"},\"wordCount\":1744,\"publisher\":{\"@id\":\"https:\/\/blog.datumo.com\/#organization\"},\"image\":{\"@id\":\"https:\/\/blog.datumo.com\/en\/tech\/16849#primaryimage\"},\"thumbnailUrl\":\"https:\/\/blog.datumo.com\/en\/wp-content\/uploads\/2024\/08\/jigar-panchal-4slT2XvKnio-unsplash.jpg\",\"articleSection\":[\"tech\"],\"inLanguage\":\"ko-KR\"},{\"@type\":\"WebPage\",\"@id\":\"https:\/\/blog.datumo.com\/en\/tech\/16849\",\"url\":\"https:\/\/blog.datumo.com\/en\/tech\/16849\",\"name\":\"LLM Evaluation: Definition & methods of 
evaluating LLMs - Datumo\",\"isPartOf\":{\"@id\":\"https:\/\/blog.datumo.com\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\/\/blog.datumo.com\/en\/tech\/16849#primaryimage\"},\"image\":{\"@id\":\"https:\/\/blog.datumo.com\/en\/tech\/16849#primaryimage\"},\"thumbnailUrl\":\"https:\/\/blog.datumo.com\/en\/wp-content\/uploads\/2024\/08\/jigar-panchal-4slT2XvKnio-unsplash.jpg\",\"datePublished\":\"2024-08-29T07:54:19+00:00\",\"dateModified\":\"2024-10-22T09:11:15+00:00\",\"description\":\"LLM evaluation process in three parts: what to evaluate, with what data to evaluate, and how to evaluate LLM for its safety and performance.\",\"breadcrumb\":{\"@id\":\"https:\/\/blog.datumo.com\/en\/tech\/16849#breadcrumb\"},\"inLanguage\":\"ko-KR\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/blog.datumo.com\/en\/tech\/16849\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"ko-KR\",\"@id\":\"https:\/\/blog.datumo.com\/en\/tech\/16849#primaryimage\",\"url\":\"https:\/\/blog.datumo.com\/en\/wp-content\/uploads\/2024\/08\/jigar-panchal-4slT2XvKnio-unsplash.jpg\",\"contentUrl\":\"https:\/\/blog.datumo.com\/en\/wp-content\/uploads\/2024\/08\/jigar-panchal-4slT2XvKnio-unsplash.jpg\",\"width\":1920,\"height\":1080},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/blog.datumo.com\/en\/tech\/16849#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/blog.datumo.com\/en\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"\ud83e\udd9c What is LLM Evaluation?\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/blog.datumo.com\/#website\",\"url\":\"https:\/\/blog.datumo.com\/\",\"name\":\"DATUMO\",\"description\":\"The Data for Smarter AI\",\"publisher\":{\"@id\":\"https:\/\/blog.datumo.com\/#organization\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/blog.datumo.com\/?s={search_term_string}\"},\"query-input\":\"required 
name=search_term_string\"}],\"inLanguage\":\"ko-KR\"},{\"@type\":\"Organization\",\"@id\":\"https:\/\/blog.datumo.com\/#organization\",\"name\":\"DATUMO\",\"url\":\"https:\/\/blog.datumo.com\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"ko-KR\",\"@id\":\"https:\/\/blog.datumo.com\/#\/schema\/logo\/image\/\",\"url\":\"https:\/\/blog.datumo.com\/en\/wp-content\/uploads\/2022\/05\/2.1.webp\",\"contentUrl\":\"https:\/\/blog.datumo.com\/en\/wp-content\/uploads\/2022\/05\/2.1.webp\",\"width\":1080,\"height\":600,\"caption\":\"DATUMO\"},\"image\":{\"@id\":\"https:\/\/blog.datumo.com\/#\/schema\/logo\/image\/\"}},{\"@type\":\"Person\",\"@id\":\"https:\/\/blog.datumo.com\/#\/schema\/person\/02ec2d0ba953b146878dab089dc735b6\",\"name\":\"DATUMO\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"ko-KR\",\"@id\":\"https:\/\/blog.datumo.com\/#\/schema\/person\/image\/\",\"url\":\"https:\/\/secure.gravatar.com\/avatar\/1942a8a63e1c8fa0d9be56cda789edd6c0a866259cd5dca24952597ffa8bab3d?s=96&d=mm&r=g\",\"contentUrl\":\"https:\/\/secure.gravatar.com\/avatar\/1942a8a63e1c8fa0d9be56cda789edd6c0a866259cd5dca24952597ffa8bab3d?s=96&d=mm&r=g\",\"caption\":\"DATUMO\"},\"description\":\"DATUMO, The Data for Smarter AI. We seek to drive impact in the world by providing diverse and high quality data to build smarter AI.\",\"sameAs\":[\"https:\/\/blog.datumo.com\/en\"],\"url\":\"https:\/\/blog.datumo.com\/en\/author\/selectstar\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. 
-->","yoast_head_json":{"title":"LLM Evaluation: Definition & methods of evaluating LLMs - Datumo","description":"LLM evaluation process in three parts: what to evaluate, with what data to evaluate, and how to evaluate LLM for its safety and performance.","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/blog.datumo.com\/en\/tech\/16849","og_locale":"ko_KR","og_type":"article","og_title":"LLM Evaluation: Definition & methods of evaluating LLMs - Datumo","og_description":"LLM evaluation process in three parts: what to evaluate, with what data to evaluate, and how to evaluate LLM for its safety and performance.","og_url":"https:\/\/blog.datumo.com\/en\/tech\/16849","og_site_name":"DATUMO","article_published_time":"2024-08-29T07:54:19+00:00","article_modified_time":"2024-10-22T09:11:15+00:00","og_image":[{"width":1920,"height":1080,"url":"https:\/\/blog.datumo.com\/en\/wp-content\/uploads\/2024\/08\/jigar-panchal-4slT2XvKnio-unsplash.jpg","type":"image\/jpeg"}],"author":"DATUMO","twitter_card":"summary_large_image","twitter_misc":{"\uae00\uc4f4\uc774":"DATUMO","\uc608\uc0c1 \ub418\ub294 \ud310\ub3c5 \uc2dc\uac04":"11\ubd84"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/blog.datumo.com\/en\/tech\/16849#article","isPartOf":{"@id":"https:\/\/blog.datumo.com\/en\/tech\/16849"},"author":{"name":"DATUMO","@id":"https:\/\/blog.datumo.com\/#\/schema\/person\/02ec2d0ba953b146878dab089dc735b6"},"headline":"\ud83e\udd9c What is LLM 
Evaluation?","datePublished":"2024-08-29T07:54:19+00:00","dateModified":"2024-10-22T09:11:15+00:00","mainEntityOfPage":{"@id":"https:\/\/blog.datumo.com\/en\/tech\/16849"},"wordCount":1744,"publisher":{"@id":"https:\/\/blog.datumo.com\/#organization"},"image":{"@id":"https:\/\/blog.datumo.com\/en\/tech\/16849#primaryimage"},"thumbnailUrl":"https:\/\/blog.datumo.com\/en\/wp-content\/uploads\/2024\/08\/jigar-panchal-4slT2XvKnio-unsplash.jpg","articleSection":["tech"],"inLanguage":"ko-KR"},{"@type":"WebPage","@id":"https:\/\/blog.datumo.com\/en\/tech\/16849","url":"https:\/\/blog.datumo.com\/en\/tech\/16849","name":"LLM Evaluation: Definition & methods of evaluating LLMs - Datumo","isPartOf":{"@id":"https:\/\/blog.datumo.com\/#website"},"primaryImageOfPage":{"@id":"https:\/\/blog.datumo.com\/en\/tech\/16849#primaryimage"},"image":{"@id":"https:\/\/blog.datumo.com\/en\/tech\/16849#primaryimage"},"thumbnailUrl":"https:\/\/blog.datumo.com\/en\/wp-content\/uploads\/2024\/08\/jigar-panchal-4slT2XvKnio-unsplash.jpg","datePublished":"2024-08-29T07:54:19+00:00","dateModified":"2024-10-22T09:11:15+00:00","description":"LLM evaluation process in three parts: what to evaluate, with what data to evaluate, and how to evaluate LLM for its safety and 
performance.","breadcrumb":{"@id":"https:\/\/blog.datumo.com\/en\/tech\/16849#breadcrumb"},"inLanguage":"ko-KR","potentialAction":[{"@type":"ReadAction","target":["https:\/\/blog.datumo.com\/en\/tech\/16849"]}]},{"@type":"ImageObject","inLanguage":"ko-KR","@id":"https:\/\/blog.datumo.com\/en\/tech\/16849#primaryimage","url":"https:\/\/blog.datumo.com\/en\/wp-content\/uploads\/2024\/08\/jigar-panchal-4slT2XvKnio-unsplash.jpg","contentUrl":"https:\/\/blog.datumo.com\/en\/wp-content\/uploads\/2024\/08\/jigar-panchal-4slT2XvKnio-unsplash.jpg","width":1920,"height":1080},{"@type":"BreadcrumbList","@id":"https:\/\/blog.datumo.com\/en\/tech\/16849#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/blog.datumo.com\/en\/"},{"@type":"ListItem","position":2,"name":"\ud83e\udd9c What is LLM Evaluation?"}]},{"@type":"WebSite","@id":"https:\/\/blog.datumo.com\/#website","url":"https:\/\/blog.datumo.com\/","name":"DATUMO","description":"The Data for Smarter AI","publisher":{"@id":"https:\/\/blog.datumo.com\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/blog.datumo.com\/?s={search_term_string}"},"query-input":"required 
name=search_term_string"}],"inLanguage":"ko-KR"},{"@type":"Organization","@id":"https:\/\/blog.datumo.com\/#organization","name":"DATUMO","url":"https:\/\/blog.datumo.com\/","logo":{"@type":"ImageObject","inLanguage":"ko-KR","@id":"https:\/\/blog.datumo.com\/#\/schema\/logo\/image\/","url":"https:\/\/blog.datumo.com\/en\/wp-content\/uploads\/2022\/05\/2.1.webp","contentUrl":"https:\/\/blog.datumo.com\/en\/wp-content\/uploads\/2022\/05\/2.1.webp","width":1080,"height":600,"caption":"DATUMO"},"image":{"@id":"https:\/\/blog.datumo.com\/#\/schema\/logo\/image\/"}},{"@type":"Person","@id":"https:\/\/blog.datumo.com\/#\/schema\/person\/02ec2d0ba953b146878dab089dc735b6","name":"DATUMO","image":{"@type":"ImageObject","inLanguage":"ko-KR","@id":"https:\/\/blog.datumo.com\/#\/schema\/person\/image\/","url":"https:\/\/secure.gravatar.com\/avatar\/1942a8a63e1c8fa0d9be56cda789edd6c0a866259cd5dca24952597ffa8bab3d?s=96&d=mm&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/1942a8a63e1c8fa0d9be56cda789edd6c0a866259cd5dca24952597ffa8bab3d?s=96&d=mm&r=g","caption":"DATUMO"},"description":"DATUMO, The Data for Smarter AI. 
We seek to drive impact in the world by providing diverse and high quality data to build smarter AI.","sameAs":["https:\/\/blog.datumo.com\/en"],"url":"https:\/\/blog.datumo.com\/en\/author\/selectstar"}]}},"_links":{"self":[{"href":"https:\/\/blog.datumo.com\/en\/wp-json\/wp\/v2\/posts\/16849","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/blog.datumo.com\/en\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blog.datumo.com\/en\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blog.datumo.com\/en\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/blog.datumo.com\/en\/wp-json\/wp\/v2\/comments?post=16849"}],"version-history":[{"count":24,"href":"https:\/\/blog.datumo.com\/en\/wp-json\/wp\/v2\/posts\/16849\/revisions"}],"predecessor-version":[{"id":16940,"href":"https:\/\/blog.datumo.com\/en\/wp-json\/wp\/v2\/posts\/16849\/revisions\/16940"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/blog.datumo.com\/en\/wp-json\/wp\/v2\/media\/16851"}],"wp:attachment":[{"href":"https:\/\/blog.datumo.com\/en\/wp-json\/wp\/v2\/media?parent=16849"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blog.datumo.com\/en\/wp-json\/wp\/v2\/categories?post=16849"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blog.datumo.com\/en\/wp-json\/wp\/v2\/tags?post=16849"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}