{"id":16358,"date":"2022-06-22T06:54:36","date_gmt":"2022-06-22T06:54:36","guid":{"rendered":"https:\/\/blog.datumo.com\/en\/?p=16358"},"modified":"2024-10-22T08:52:31","modified_gmt":"2024-10-22T08:52:31","slug":"diversity-accuracy-important-properties-of-your-dataset","status":"publish","type":"post","link":"https:\/\/blog.datumo.com\/en\/tech\/16358","title":{"rendered":"Diversity? Accuracy? Important Properties of your Dataset"},"content":{"rendered":"[vc_row pix_particles_check=&#8221;&#8221;][vc_column]<div id=\"el1646799961152-e3ee06c0-4e82\" class=\"w-100 d-block \"><\/div><div class=\"pix-content-box card      vc_custom_1654577545529 custom-responsive-75343301   rounded-lg bg- w-100  \"   ><div class=\"\" style=\"z-index:30;position:relative;\">[vc_column_text]\r\n<p style=\"text-align: left;\"><span style=\"font-size: 14pt;\"><strong>\ud83d\udd11<\/strong> <strong>In 9 minutes you will learn:<\/strong><\/span><\/p>\r\n&nbsp;\r\n<ul>\r\n \t<li>Characteristics of quality dataset<\/li>\r\n \t<li>How Datumo maintains data quality and quantity<\/li>\r\n<\/ul>\r\n[\/vc_column_text]<\/div><\/div>[\/vc_column][\/vc_row][vc_row pix_particles_check=&#8221;&#8221;][vc_column][vc_raw_html]JTNDbWV0YSUyMGh0dHAtZXF1aXYlM0QlMjJyZWZyZXNoJTIyJTIwY29udGVudCUzRCUyMjAlM0IlMjB1cmwlM0RodHRwcyUzQSUyRiUyRmRhdHVtby5jb20lMkZlbiUyRmRpdmVyc2l0eS1hY2N1cmFjeS1pbXBvcnRhbnQtcHJvcGVydGllcy1vZi15b3VyLWRhdGFzZXQlMkYlMjIlM0U=[\/vc_raw_html]<div id=\"el1650294698986-a1b962b5-ef42\" class=\"w-100 d-block \"><\/div>[vc_column_text css=&#8221;.vc_custom_1655880972896{padding-top: 40px !important;padding-right: 20px !important;padding-bottom: 40px !important;padding-left: 20px !important;}&#8221;]We know that a dataset is basically a collection of data. It can consist of tables where each column in a table represents a particular variable in question and each row represents a value of that variable. A dataset can also consist of various documents or files. 
Regardless of the format, having a good quality dataset is extremely important because it is directly linked to the sustainability of your algorithms like the Machine Learning model i.e. the model will fail to serve its purpose if the quality of the dataset is poor or not up to the mark. So in this tutorial, we will be discussing some of the characteristics that a good dataset should possess. So let\u2019s get started![\/vc_column_text][\/vc_column][\/vc_row][vc_section full_width=&#8221;stretch_row&#8221; pix_over_visibility=&#8221;&#8221; css=&#8221;.vc_custom_1650444445523{padding-top: 80px !important;padding-bottom: 80px !important;background-color: #f8f9fa !important;}&#8221; el_id=&#8221;pix_section_program&#8221;][vc_row full_width=&#8221;stretch_row&#8221; pix_particles_check=&#8221;&#8221;][vc_column content_align=&#8221;text-center&#8221; offset=&#8221;vc_col-lg-offset-0 vc_col-lg-12 vc_col-md-offset-1 vc_col-md-10&#8243;]<div id=\"el1650442503491-f5da6b2f-fa35\" class=\"mb-3 text-left \"><h2 class=\"mb-32 pix-sliding-headline font-weight-bold secondary-font\" data-class=\"secondary-font text-heading-default\" data-style=\"\">Prerequisites<\/h2><\/div>[vc_column_text css=&#8221;.vc_custom_1655880997020{padding-top: 40px !important;padding-bottom: 40px !important;}&#8221;]\r\n<p style=\"text-align: left;\">Before you go ahead, please note that there are a few prerequisites for this tutorial. To follow the code samples given, you should have some basic programming knowledge in any language (preferably in Python). You should also be familiar with basic machine learning concepts. 
We will be using\u00a0<a class=\"au mq\" href=\"https:\/\/colab.research.google.com\/notebooks\/intro.ipynb\" target=\"_blank\" rel=\"noopener ugc nofollow\">Google Colab<\/a>\u00a0for writing the code in our examples but you can work on any code editor of your liking.<\/p>\r\n[\/vc_column_text][\/vc_column][\/vc_row][\/vc_section][vc_row pix_particles_check=&#8221;&#8221;][vc_column]<div id=\"el1650442607008-a85a832d-43f0\" class=\"w-100 d-block \"><\/div><div  class=\"pix-heading-el text-left \"><div><div class=\"slide-in-container\"><h2 class=\"text-heading-default font-weight-bold heading-text el-title_custom_color mb-12\" style=\"\" data-anim-type=\"\" data-anim-delay=\"0\">Characteristics of a Good Quality Dataset<\/h2><\/div><\/div><\/div>[vc_column_text css=&#8221;.vc_custom_1655882501224{padding-top: 40px !important;padding-bottom: 30px !important;}&#8221;]\r\n<p id=\"4523\" class=\"pw-post-body-paragraph le lf jj bn b lg lh li lj lk ll lm ln lo lp lq lr ls lt lu lv lw lx ly lz ma jc hk\" data-selectable-paragraph=\"\">A dataset is of high quality if it fulfills its purpose and satisfies its requirements of use by the application or the client. A good machine learning model is of no use if it is trained on poor quality data. Therefore, it is vital to have a dataset of high quality. A good quality dataset should ideally have the following properties:<\/p>\r\n[\/vc_column_text][vc_column_text css=&#8221;.vc_custom_1655882615924{border-top-width: 1px !important;padding-top: 60px !important;padding-bottom: 30px !important;border-top-color: rgba(0,0,0,0.2) !important;border-top-style: solid !important;}&#8221;]\r\n<h4 id=\"d0ec\" class=\"ms kh jj bn ki mt mu mv km mw mx my kq lo mz na ku ls nb nc ky lw nd ne lc nf hk\"><strong>1. High Accuracy:<\/strong><\/h4>\r\n&nbsp;\r\n\r\nAccuracy refers to the correctness of the data. To be more precise, it refers to how close a value is to the actual value that represents the problem correctly. 
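One lightweight way to probe accuracy is a range check: assert that every value falls inside a plausible interval. A minimal pandas sketch, where the column names, values, and bounds are all hypothetical:

```python
import pandas as pd

# Hypothetical housing data; one row has an implausible negative price.
houses = pd.DataFrame({
    "Size_sqft": [1200, 850, 2100],
    "Price": [250000, -5, 420000],
})

# Flag rows whose values fall outside plausible ranges.
implausible = houses[(houses["Price"] <= 0) | (houses["Size_sqft"] <= 0)]
print(implausible)
```

Rows flagged this way are candidates for correction or removal before training.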
To analyze the accuracy of the information in your dataset, you need to ask yourself whether the information correctly reflects the situation or problem at hand. For example, if you are dealing with a dataset of houses and some of the house sizes are given in square centimeters or if some of the house prices are negative, then that information is most likely to be inaccurate.[\/vc_column_text][vc_column_text css=&#8221;.vc_custom_1655884354489{border-top-width: 1px !important;padding-top: 60px !important;border-top-color: rgba(0,0,0,0.2) !important;border-top-style: solid !important;}&#8221;]\r\n<h4 id=\"f064\" class=\"ms kh jj bn ki mt mu mv km mw mx my kq lo mz na ku ls nb nc ky lw nd ne lc nf hk\"><strong>2. Reliability:<\/strong><\/h4>\r\n&nbsp;\r\n<p id=\"1804\" class=\"pw-post-body-paragraph le lf jj bn b lg lh li lj lk ll lm ln lo lp lq lr ls lt lu lv lw lx ly lz ma jc hk\" data-selectable-paragraph=\"\">Reliability is an important data attribute: it ensures that the values in the dataset do not contradict each other and that, overall, the information the dataset contains can be trusted; it concerns the qualitative aspects of your dataset. A model trained on reliable data is more likely to make correct predictions than one trained on unreliable data. When measuring or determining the reliability of your dataset, you should make sure that it does not contain:<\/p>\r\n&nbsp;\r\n<h5 id=\"38e2\" class=\"pw-post-body-paragraph le lf jj bn b lg ng li lj lk nh lm ln lo ni lq lr ls nj lu lv lw nk ly lz ma jc hk\"><span style=\"color: #000000;\"><strong class=\"bn nl\">Duplicated values<\/strong><\/span><\/h5>\r\n&nbsp;\r\n<p id=\"3ba1\" class=\"pw-post-body-paragraph le lf jj bn b lg ng li lj lk nh lm ln lo ni lq lr ls nj lu lv lw nk ly lz ma jc hk\" data-selectable-paragraph=\"\">Because they are repeated, duplicated values provide no new information to the dataset and should be removed. 
You can remove duplicate values in Python using Pandas\u2019 drop_duplicates function. Consider the example below which contains information about students in some universities and has some duplicates.<\/p>\r\n&nbsp;\r\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\"># FirstName LastName     Sex  Age Degree  Graduation\r\n0     Jamie   Fallon    Male   20     SE        2019\r\n1      Erin   Silver  Female   23     EE        2020\r\n2      Phil   Rhodes    Male   19     ME        2021\r\n3     Jamie   Fallon    Male   20     SE        2019\r\n4      Erin   Silver  Female   23     EE        2020\r\n5     Jamie   Fallon    Male   22     SE        2020<\/pre>\r\n&nbsp;\r\n\r\nTo remove duplicate rows:\r\n\r\n&nbsp;\r\n<h6><span style=\"color: #808080;\"><strong class=\"bn nl\">Python Code:<\/strong><\/span><\/h6>\r\n&nbsp;\r\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\">new_info = info.drop_duplicates()\r\nprint(new_info)<\/pre>\r\n&nbsp;\r\n<h6><span style=\"color: #808080;\"><strong class=\"bn nl\">Output:<\/strong><\/span><\/h6>\r\n&nbsp;\r\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\"># FirstName LastName     Sex  Age Degree  Graduation\r\n0     Jamie   Fallon    Male   20     SE        2019\r\n1      Erin   Silver  Female   23     EE        2020\r\n2      Phil   Rhodes    Male   19     ME        2021\r\n5     Jamie   Fallon    Male   22     SE        2020<\/pre>\r\n&nbsp;\r\n<p id=\"a8f2\" class=\"pw-post-body-paragraph le lf jj bn b lg ng li lj lk nh lm ln lo ni lq lr ls nj lu lv lw nk ly lz ma jc hk\" data-selectable-paragraph=\"\">To remove duplicates on the basis of columns, specify a subset or column that should be unique. In our example, there are 2 Jamie Fallons. 
If we want to remove the record of the one who graduates later, we could do so by using the following code:<\/p>\r\n&nbsp;\r\n<h6><span style=\"color: #808080;\"><strong class=\"bn nl\">Python Code:<\/strong><\/span><\/h6>\r\n&nbsp;\r\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\">info = info.sort_values('Graduation', ascending=True)\r\ninfo = info.drop_duplicates(subset=['FirstName', 'LastName'], keep='first')\r\nprint(info)<\/pre>\r\n&nbsp;\r\n<h6><span style=\"color: #808080;\"><strong class=\"bn nl\">Output:<\/strong><\/span><\/h6>\r\n&nbsp;\r\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\"># FirstName LastName     Sex  Age Degree  Graduation\r\n0     Jamie   Fallon    Male   20     SE        2019\r\n1      Erin   Silver  Female   23     EE        2020\r\n2      Phil   Rhodes    Male   19     ME        2021<\/pre>\r\n&nbsp;\r\n<h5><span style=\"color: #000000;\"><strong>Missed or omitted values<\/strong><\/span><\/h5>\r\n&nbsp;\r\n<p id=\"642b\" class=\"pw-post-body-paragraph le lf jj bn b lg ng li lj lk nh lm ln lo ni lq lr ls nj lu lv lw nk ly lz ma jc hk\" data-selectable-paragraph=\"\">Missing or omitted values should also be removed or recoded, i.e. represented differently. You can remove missing values in Python using Pandas\u2019 dropna function. 
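As an aside, when dropping rows would lose too much data, missing values can instead be recoded with pandas' fillna; a minimal sketch with made-up records and a placeholder value of our choosing:

```python
import pandas as pd

# Made-up records; Erin's Sex is missing.
info = pd.DataFrame({
    "FirstName": ["Erin", "Phil"],
    "Sex": [None, "Male"],
    "Age": [23, 19],
})

# Recode the missing value with an explicit placeholder
# instead of dropping the whole row.
recoded = info.fillna({"Sex": "Unknown"})
print(recoded)
```

Whether to drop or recode depends on how much data you can afford to lose and on whether "missing" itself carries meaning for your problem.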
Consider the example below with some missed values:<\/p>\r\n&nbsp;\r\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\">FirstName LastName     Sex  Age Degree Graduation\r\n   0      Erin   Silver     NaN   23     EE       2020\r\n   1      Phil   Rhodes    Male   19     ME       2021\r\n   2     Helen    David  Female   23     EE       2020\r\n   3     Jamie   Fallon    Male   22     SE        NaT<\/pre>\r\n&nbsp;\r\n\r\n&nbsp;\r\n<h6><span style=\"color: #808080;\"><strong>Python Code:<\/strong><\/span><\/h6>\r\n&nbsp;\r\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\">new_info= info.dropna()\r\nprint(new_info)<\/pre>\r\n&nbsp;\r\n<h6><span style=\"color: #808080;\"><strong class=\"bn nl\">Output:<\/strong><\/span><\/h6>\r\n&nbsp;\r\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\">FirstName LastName     Sex  Age Degree Graduation\r\n1      Phil   Rhodes    Male   19     ME       2021\r\n2     Helen    David  Female   23     EE       2020<\/pre>\r\n<h5><\/h5>\r\n&nbsp;\r\n<h5 id=\"bb36\" class=\"pw-post-body-paragraph le lf jj bn b lg ng li lj lk nh lm ln lo ni lq lr ls nj lu lv lw nk ly lz ma jc hk\"><span style=\"color: #000000;\"><strong class=\"bn nl\">Special characters, punctuations, stop words<\/strong><\/span><\/h5>\r\n&nbsp;\r\n\r\nIn the case of textual data, special characters, punctuation marks and stop words like \u2018a\u2019, \u2018is\u2019, \u2018and\u2019, \u2018the\u2019 etc. do not add any meaning to the text. The model does not understand the grammar of the text, rather the nouns and the adjectives used. Thus, they should be removed from your textual data. This is called text pre-processing and for this, you need to:\r\n\r\n&nbsp;\r\n<h6><span style=\"color: #808080;\"><strong class=\"bn nl\">1. 
Make the necessary downloads and imports.<\/strong><\/span><\/h6>\r\n&nbsp;\r\n\r\n&nbsp;\r\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\">import nltk\r\nnltk.download('punkt')\r\nnltk.download('stopwords')\r\nfrom nltk.tokenize import word_tokenize\r\nimport string\r\nfrom nltk.corpus import stopwords<\/pre>\r\n&nbsp;\r\n<h6><strong><span style=\"color: #808080;\">2. \u00a0Split the words in the text file into tokens or list items. Consider this text file:<\/span><\/strong><\/h6>\r\n&nbsp;\r\n<h6><span style=\"color: #808080;\"><strong class=\"bn nl\">Python Code:<\/strong><\/span><\/h6>\r\n&nbsp;\r\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\">with open('text_file.txt') as f:\r\n    text = f.read()\r\ntokens = word_tokenize(text)\r\nprint(tokens)<\/pre>\r\n&nbsp;\r\n<h6><span style=\"color: #808080;\"><strong class=\"bn nl\">Output:<\/strong><\/span><\/h6>\r\n&nbsp;\r\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\">['This', 'is', '@', '@', '@', '@', '@', 'a', 'random', 'text', ',', 'file', 'thAt', 'CONTAINS', '(', 'punctuation', ')', 'marks', ',', 'special', '#', 'characters', '!', '!', '!', \"''\", '\"', 'and', 'spaces', 'is', 'for', 'the', 'purpose', 'of', 'understanding', 'how', 'important^^^', 'text', 'preprocessing', 'is', 'fOr', 'the', '&amp;', 'quality', 'of', 'a', 'dataset', '.', 'A', 'high', 'quality', 'dataset', 'has', 'preprocessed', '&amp;', 'clean', 'text..', '!', '!']<\/pre>\r\n<h6><\/h6>\r\n&nbsp;\r\n<h6 id=\"75d2\" class=\"pw-post-body-paragraph le lf jj bn b lg ng li lj lk nh lm ln lo ni lq lr ls nj lu lv lw nk ly lz ma jc hk\"><span style=\"color: #808080;\"><strong class=\"bn nl\">3. 
Remove all punctuation marks and special characters from the list.<\/strong><\/span><\/h6>\r\n&nbsp;\r\n<h6><span style=\"color: #808080;\"><strong class=\"bn nl\">Python Code:<\/strong><\/span><\/h6>\r\n&nbsp;\r\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\">intermediate_step = str.maketrans('', '', string.punctuation)\r\npunctuation_removed = [a.translate(intermediate_step) for a in tokens]\r\nprint(punctuation_removed)<\/pre>\r\n<h6><span style=\"color: #808080;\"><strong class=\"bn nl\">Output:<\/strong><\/span><\/h6>\r\n&nbsp;\r\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\">['This', 'is', '', '', '', '', '', 'a', 'random', 'text', '', 'file', 'thAt', 'CONTAINS', '', 'punctuation', '', 'marks', '', 'special', '', 'characters', '', '', '', '', '', 'and', 'spaces', 'is', 'for', 'the', 'purpose', 'of', 'understanding', 'how', 'important', 'text', 'preprocessing', 'is', 'fOr', 'the', '', 'quality', 'of', 'a', 'dataset', '', 'A', 'high', 'quality', 'dataset', 'has', 'preprocessed', '', 'clean', 'text', '', '']<\/pre>\r\n<h6><\/h6>\r\n&nbsp;\r\n<h6 id=\"d76b\" class=\"pw-post-body-paragraph le lf jj bn b lg ng li lj lk nh lm ln lo ni lq lr ls nj lu lv lw nk ly lz ma jc hk\"><span style=\"color: #808080;\"><strong class=\"bn nl\">4. 
Remove all non-alphabetic tokens from the list.<\/strong><\/span><\/h6>\r\n&nbsp;\r\n<h6><span style=\"color: #808080;\"><strong class=\"bn nl\">Python Code:<\/strong><\/span><\/h6>\r\n&nbsp;\r\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\">only_alphabets = [a for a in punctuation_removed if a.isalpha()]\r\nprint(only_alphabets)<\/pre>\r\n&nbsp;\r\n\r\n&nbsp;\r\n<h6><span style=\"color: #999999;\"><strong class=\"bn nl\">Output:<\/strong><\/span><\/h6>\r\n&nbsp;\r\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\">['This', 'is', 'a', 'random', 'text', 'file', 'thAt', 'CONTAINS', 'punctuation', 'marks', 'special', 'characters', 'and', 'spaces', 'is', 'for', 'the', 'purpose', 'of', 'understanding', 'how', 'important', 'text', 'preprocessing', 'is', 'fOr', 'the', 'quality', 'of', 'a', 'dataset', 'A', 'high', 'quality', 'dataset', 'has', 'preprocessed', 'clean', 'text']<\/pre>\r\n&nbsp;\r\n\r\n&nbsp;\r\n<h6><span style=\"color: #808080;\"><strong class=\"bn nl\">5. 
Convert all words to lower-case for consistency.<\/strong><\/span><\/h6>\r\n&nbsp;\r\n<h6><span style=\"color: #808080;\"><strong class=\"bn nl\">Python Code:<\/strong><\/span><\/h6>\r\n&nbsp;\r\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\">lowercase_tokens = [a.lower() for a in only_alphabets]\r\nprint(lowercase_tokens)<\/pre>\r\n&nbsp;\r\n\r\n&nbsp;\r\n<h6><span style=\"color: #808080;\"><strong class=\"bn nl\">Output:<\/strong><\/span><\/h6>\r\n&nbsp;\r\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\">['this', 'is', 'a', 'random', 'text', 'file', 'that', 'contains', 'punctuation', 'marks', 'special', 'characters', 'and', 'spaces', 'is', 'for', 'the', 'purpose', 'of', 'understanding', 'how', 'important', 'text', 'preprocessing', 'is', 'for', 'the', 'quality', 'of', 'a', 'dataset', 'a', 'high', 'quality', 'dataset', 'has', 'preprocessed', 'clean', 'text']<\/pre>\r\n&nbsp;\r\n\r\n&nbsp;\r\n<h6><span style=\"color: #808080;\"><strong class=\"bn nl\">6. 
Remove all stop words from the list.<\/strong><\/span><\/h6>\r\n&nbsp;\r\n<h6><span style=\"color: #808080;\"><strong class=\"bn nl\">Python Code:<\/strong><\/span><\/h6>\r\n&nbsp;\r\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\">stop_words = set(stopwords.words('english'))\r\nstop_words_removed = [w for w in lowercase_tokens if w not in stop_words]\r\nprint(stop_words_removed)<\/pre>\r\n&nbsp;\r\n<h6><span style=\"color: #808080;\"><strong class=\"bn nl\">Output:<\/strong><\/span><\/h6>\r\n&nbsp;\r\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\">['random', 'text', 'file', 'contains', 'punctuation', 'marks', 'special', 'characters', 'spaces', 'purpose', 'understanding', 'important', 'text', 'preprocessing', 'quality', 'dataset', 'high', 'quality', 'dataset', 'preprocessed', 'clean', 'text']<\/pre>\r\n[\/vc_column_text][vc_column_text css=&#8221;.vc_custom_1655884460715{border-top-width: 1px !important;padding-top: 60px !important;padding-bottom: 30px !important;border-top-color: rgba(0,0,0,0.2) !important;border-top-style: solid !important;}&#8221;]\r\n<h4 id=\"a650\" class=\"ms kh jj bn ki mt mu mv km mw mx my kq lo mz na ku ls nb nc ky lw nd ne lc nf hk\" style=\"text-align: left;\"><strong>3. Consistency in Feature Representation:<\/strong><\/h4>\r\n&nbsp;\r\n<p id=\"1e87\" class=\"pw-post-body-paragraph le lf jj bn b lg lh li lj lk ll lm ln lo lp lq lr ls lt lu lv lw lx ly lz ma jc hk\" data-selectable-paragraph=\"\">Consistency in feature representation is an important characteristic of a good dataset as it ensures data compatibility. For this, you need to:<\/p>\r\n\r\n<ul class=\"\">\r\n \t<li id=\"ed34\" class=\"nn no jj bn b lg ng lk nh lo np ls nq lw nr ma ns nt nu nv hk\" data-selectable-paragraph=\"\">Convert non-numeric features to numeric e.g. 
to perform matrix multiplication, the data needs to be numeric as such operations cannot be performed on strings.<\/li>\r\n \t<li id=\"85d9\" class=\"nn no jj bn b lg nw lk nx lo ny ls nz lw oa ma ns nt nu nv hk\" data-selectable-paragraph=\"\">Resize inputs to a fixed size, especially in the case of image models, as they require images in their dataset to be of the same size. This can be done with the Python Imaging Library (PIL) using its image resize function.<\/li>\r\n \t<li id=\"a2a0\" class=\"nn no jj bn b lg nw lk nx lo ny ls nz lw oa ma ns nt nu nv hk\" data-selectable-paragraph=\"\">Normalize numeric features i.e. change the values of numeric columns to a scale such that the differences between the ranges remain unchanged. The actual range of values is converted to a standard range of values, typically between -1 to 0 or 0 to +1 or -1 to +1. This helps models to perform better and increases overall accuracy.<\/li>\r\n<\/ul>\r\n[\/vc_column_text][vc_column_text css=&#8221;.vc_custom_1655884467199{border-top-width: 1px !important;padding-top: 60px !important;padding-bottom: 30px !important;border-top-color: rgba(0,0,0,0.2) !important;border-top-style: solid !important;}&#8221;]\r\n<h4 id=\"220c\" class=\"ms kh jj bn ki mt mu mv km mw mx my kq lo mz na ku ls nb nc ky lw nd ne lc nf hk\"><strong>4. Right Dataset Size:<\/strong><\/h4>\r\n&nbsp;\r\n<p id=\"94e0\" class=\"pw-post-body-paragraph le lf jj bn b lg lh li lj lk ll lm ln lo lp lq lr ls lt lu lv lw lx ly lz ma jc hk\" data-selectable-paragraph=\"\">The right quantity or size of the dataset is an extremely important characteristic that in turn affects its overall quality. No matter how efficient your model is, the dataset size can become a bottleneck in terms of its accuracy. 
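The consistency steps listed above can be sketched briefly; the snippet below (feature names and values are illustrative assumptions) one-hot encodes a non-numeric column and min-max normalizes a numeric one:

```python
import pandas as pd

# Illustrative data: one categorical feature, one numeric feature.
df = pd.DataFrame({
    "Color": ["red", "blue", "red"],
    "Size_sqft": [850, 1200, 2100],
})

# Convert the non-numeric feature to numeric via one-hot encoding.
df = pd.get_dummies(df, columns=["Color"])

# Min-max normalize the numeric feature to the [0, 1] range.
col = df["Size_sqft"]
df["Size_sqft"] = (col - col.min()) / (col.max() - col.min())
print(df)
```

The same scaling must later be applied, with the training set's min and max, to any new data fed to the model.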
There is no hard and fast rule about the size of the dataset, as it is specific to the type of problem that you are trying to solve, so size is mostly based on good judgment and should be sufficient to yield expected performance outcomes. You can however ensure that your dataset is at least an order of magnitude larger than the number of trainable parameters. You can also split the training and testing sets in the ratio 80\/20 or 70\/30 or use an alternative approach like K-fold cross-validation.<\/p>\r\n[\/vc_column_text][vc_column_text css=&#8221;.vc_custom_1655884501078{border-top-width: 1px !important;padding-top: 60px !important;padding-bottom: 30px !important;border-top-color: rgba(0,0,0,0.2) !important;border-top-style: solid !important;}&#8221;]\r\n<h4 id=\"7b54\" class=\"ms kh jj bn ki mt mu mv km mw mx my kq lo mz na ku ls nb nc ky lw nd ne lc nf hk\"><strong>5. 
Diversity in Dataset:<\/strong><\/h4>\r\n&nbsp;\r\n<p id=\"df83\" class=\"pw-post-body-paragraph le lf jj bn b lg lh li lj lk ll lm ln lo lp lq lr ls lt lu lv lw lx ly lz ma jc hk\" data-selectable-paragraph=\"\">Diversity is a critical factor when it comes to making a good quality dataset. Consider the example of an image recognition and classification model. If this model is trained on a dataset of images showing different breeds of dogs, and for each breed, there are images taken from different angles, under varied lighting conditions, from varying distances, in contrasting backgrounds and showing their tails, paws, etc. differently, then the model is most likely to classify the dogs more accurately than if it is trained on a dataset of similar images. In short, non-representative or non-diverse datasets are unlikely to provide useful insights compared to those that cover all facets of the problem in question.<\/p>\r\n[\/vc_column_text][vc_column_text css=&#8221;.vc_custom_1655884528984{border-top-width: 1px !important;padding-top: 60px !important;padding-bottom: 30px !important;border-top-color: rgba(0,0,0,0.2) !important;border-top-style: solid !important;}&#8221;]\r\n<h4 id=\"78eb\" class=\"ms kh jj bn ki mt mu mv km mw mx my kq lo mz na ku ls nb nc ky lw nd ne lc nf hk\"><strong>6. Completeness:<\/strong><\/h4>\r\n&nbsp;\r\n<p id=\"df83\" class=\"pw-post-body-paragraph le lf jj bn b lg lh li lj lk ll lm ln lo lp lq lr ls lt lu lv lw lx ly lz ma jc hk\" data-selectable-paragraph=\"\">Completeness refers to how comprehensive the dataset is. This is an important attribute of measuring data quality as it makes sure that every important and relevant piece of information is put into the dataset for the model to train on. 
If the information is incomplete, the data may become unusable.<\/p>\r\n[\/vc_column_text][vc_column_text css=&#8221;.vc_custom_1655884571298{border-top-width: 1px !important;padding-top: 60px !important;padding-bottom: 30px !important;border-top-color: rgba(0,0,0,0.2) !important;border-top-style: solid !important;}&#8221;]\r\n<h4 id=\"8522\" class=\"ms kh jj bn ki mt mu mv km mw mx my kq lo mz na ku ls nb nc ky lw nd ne lc nf hk\"><strong>7. Up to date Data:<\/strong><\/h4>\r\n&nbsp;\r\n<p id=\"405d\" class=\"pw-post-body-paragraph le lf jj bn b lg lh li lj lk ll lm ln lo lp lq lr ls lt lu lv lw lx ly lz ma jc hk\" data-selectable-paragraph=\"\">Having up to date data in your dataset is an important data quality characteristic because if the data is not up to date, it may not be applicable to the current scenario or problem that the model is intended to solve. Take the example of a house price prediction model that predicts house prices based on their sizes in square feet. If the dataset for this contains house prices from the 1960s, it may not be applicable to predict house prices for the year 2020.<\/p>\r\n[\/vc_column_text][vc_column_text css=&#8221;.vc_custom_1655884598969{border-top-width: 1px !important;padding-top: 60px !important;padding-bottom: 30px !important;border-top-color: rgba(0,0,0,0.2) !important;border-top-style: solid !important;}&#8221;]\r\n<h4 id=\"c337\" class=\"ms kh jj bn ki mt mu mv km mw mx my kq lo mz na ku ls nb nc ky lw nd ne lc nf hk\"><strong>8. Relevance:<\/strong><\/h4>\r\n&nbsp;\r\n<p id=\"94ad\" class=\"pw-post-body-paragraph le lf jj bn b lg lh li lj lk ll lm ln lo lp lq lr ls lt lu lv lw lx ly lz ma jc hk\" data-selectable-paragraph=\"\">This refers to how important or relevant the data is to the concerned problem. If you gather a dataset that contains some irrelevant or unrelated information to the problem that you\u2019re trying to solve, then you will not attain the desired results and waste your time. 
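As a small illustration of the point about relevance (the column names here are hypothetical), unrelated fields can simply be dropped before training:

```python
import pandas as pd

# Hypothetical house-price data with a column unrelated to the task.
houses = pd.DataFrame({
    "Size_sqft": [1200, 850, 2100],
    "Price": [250000, 175000, 420000],
    "AgentPhoneNumber": ["555-0101", "555-0102", "555-0103"],
})

# Drop the irrelevant feature so the model trains only on relevant inputs.
relevant = houses.drop(columns=["AgentPhoneNumber"])
print(list(relevant.columns))
```

Deciding which columns are irrelevant is a judgment call tied to the problem definition; the mechanics of removing them are trivial.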
Your dataset should strictly contain relevant information and should meet the requirements for the intended use.<\/p>\r\n[\/vc_column_text]<div id=\"el1650294913061-211813f5-5f2d\" class=\"w-100 d-block \"><\/div>[\/vc_column][\/vc_row][vc_row pix_particles_check=&#8221;&#8221;][vc_column]<div id=\"el1653971463480-ce74a014-4ae9\" class=\"w-100 d-block \"><\/div>[vc_column_text css=&#8221;.vc_custom_1655884673527{padding-top: 40px !important;padding-bottom: 0px !important;}&#8221;]\r\n<p id=\"e9f2\" class=\"pw-post-body-paragraph le lf jj bn b lg lh li lj lk ll lm ln lo lp lq lr ls lt lu lv lw lx ly lz ma jc hk\" data-selectable-paragraph=\"\">Creating and maintaining a top-quality dataset is not an easy task. Especially for small- to medium-sized companies, managing human resources and technical specialties is very challenging. Therefore, it is often more efficient to find another service that does the laborious work (including both collection and preprocessing) for you. For that, we could be your perfect solution!<\/p>\r\n<p id=\"fcce\" class=\"pw-post-body-paragraph le lf jj bn b lg ng li lj lk nh lm ln lo ni lq lr ls nj lu lv lw nk ly lz ma jc hk\" data-selectable-paragraph=\"\">Here at <strong>\u00a0<a class=\"au mn\" href=\"https:\/\/www.datumo.com\" target=\"_blank\" rel=\"noopener ugc nofollow\"><em class=\"pn\">D<\/em><\/a><\/strong><a href=\"https:\/\/www.datumo.com\"><strong>ATUMO<\/strong><\/a>, we crowdsource our tasks to diverse users located globally to ensure quality and quantity simultaneously. Moreover, our in-house managers double-check the quality of the collected or processed data! Check us out at <a class=\"au mq\" href=\"https:\/\/datumo.com\/\" target=\"_blank\" rel=\"noopener ugc nofollow\">datumo.com<\/a>\u00a0for more information! 
Let us be your HELP!<\/p>\r\n[\/vc_column_text]<div id=\"el1653972293756-76a5ecd1-3d25\" class=\"w-100 d-block \"><\/div>[vc_column_text css=&#8221;.vc_custom_1655884686250{border-top-width: 1px !important;padding-top: 80px !important;padding-bottom: 0px !important;border-top-color: rgba(0,0,0,0.2) !important;border-top-style: solid !important;}&#8221;]To sum it all up, we discussed how important it is for a model to be fed a high-quality dataset as it drives the quality of the overall machine learning model. We pointed out the main characteristics of a good quality dataset such that it is accurate, complete, reliable, up to date, diverse, and relevant. We also discussed the steps required in text pre-processing. With these aforementioned factors, we can make certain that a high-performance machine learning dataset is built and that we are able to reap the benefits of a robust and accurate machine learning model that has learned from such a superior quality training dataset.[\/vc_column_text]<div id=\"el1653971463481-f4f34d7c-39ce\" class=\"w-100 d-block \"><\/div>[\/vc_column][\/vc_row][vc_row pix_particles_check=&#8221;&#8221;][vc_column width=&#8221;1\/2&#8243;]<div id=\"el1646794934167-c0c94dd3-ea74\" class=\"w-100 d-block \"><\/div><div class=\" mb-3 mb-md-0 \"  ><div class=\"card w-100 h-100 bg-white  vc_custom_1652982865548  pix-hover-item rounded-10 position-relative overflow-hidden2 text-white tilt fancy_card\" ><div class=\"card-img-overlay overflow-visible d-inline-block w-100 pix-img-overlay pix-p-30 d-flex align-items-end text-left\"><div class=\"w-100 \"><h3 class=\"card-title  text-black font-weight-bold mb-0 animate-in\" style=\"\">See what we can do for you.<\/h3><p class=\"card-text pix-pt-10 text-black \" style=\"\">Build smarter AI with us.<\/p><div class=\"card-btn-div mt-4 d-inline-block w-100\"><a  href=\"https:\/\/datumo.com\" class=\"btn mb-2     text-white btn-black d-inline-block      btn-md\" target=\"_blank\" rel=\"noopener\"    ><span 
class=\"font-weight-bold \" >Learn More<\/span><\/a><\/div><\/div><\/div><\/div><\/div>[\/vc_column][vc_column width=&#8221;1\/2&#8243;]<div id=\"el1646794982519-9a19190b-7fde\" class=\"w-100 d-block \"><\/div><div class=\" mb-3 mb-md-0 \"  ><div class=\"card w-100 h-100 bg-black  vc_custom_1653971438710  pix-hover-item rounded-10 position-relative overflow-hidden2 text-white tilt fancy_card\" ><div class=\"card-img-overlay overflow-visible d-inline-block w-100 pix-img-overlay pix-p-30 d-flex align-items-end text-left\"><div class=\"w-100 \"><h3 class=\"card-title  text-white font-weight-bold mb-0 animate-in\" style=\"\">We would like to support the AI industry by sharing.<\/h3><p class=\"card-text pix-pt-10 text-white \" style=\"\"><\/p><div class=\"card-btn-div mt-4 d-inline-block w-100\"><a  href=\"https:\/\/open.datumo.com\/en\" class=\"btn mb-2    vc_custom_1653971438714  btn-primary d-inline-block      btn-md\" target=\"_blank\" rel=\"noopener\"    ><span class=\"font-weight-bold \" >Download Open Datasets<\/span><\/a><\/div><\/div><\/div><\/div><\/div>[\/vc_column][\/vc_row][vc_row pix_particles_check=&#8221;&#8221;][vc_column]<div id=\"el1646799961152-e3ee06c0-4e82\" class=\"w-100 d-block \"><\/div>[\/vc_column][\/vc_row]","protected":false},"excerpt":{"rendered":"[vc_row pix_particles_check=&#8221;&#8221;][vc_column][\/vc_column][\/vc_row][vc_row pix_particles_check=&#8221;&#8221;][vc_column][vc_raw_html]JTNDbWV0YSUyMGh0dHAtZXF1aXYlM0QlMjJyZWZyZXNoJTIyJTIwY29udGVudCUzRCUyMjAlM0IlMjB1cmwlM0RodHRwcyUzQSUyRiUyRmRhdHVtby5jb20lMkZlbiUyRmRpdmVyc2l0eS1hY2N1cmFjeS1pbXBvcnRhbnQtcHJvcGVydGllcy1vZi15b3VyLWRhdGFzZXQlMkYlMjIlM0U=[\/vc_raw_html][vc_column_text css=&#8221;.vc_custom_1655880972896{padding-top: 40px !important;padding-right: 20px !important;padding-bottom: 40px !important;padding-left: 20px !important;}&#8221;]We know that a dataset is basically a collection of data. 
It can consist of tables where each column in a table represents a particular variable in question and&#8230;","protected":false},"author":1,"featured_media":16483,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[131],"tags":[196,127,195],"class_list":["post-16358","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-tech","tag-data-centric-ai","tag-datumo","tag-high-quality-dataset"],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v23.3 - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>Diversity? Accuracy? Important Properties of your Dataset - DATUMO<\/title>\n<meta name=\"description\" content=\"Regardless of the format, having a good quality dataset is extremely important because it is directly linked to the sustainability of your algorithms like the Machine Learning model i.e. the model will fail to serve its purpose if the quality of the dataset is poor or not up to the mark.\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/blog.datumo.com\/en\/tech\/16358\" \/>\n<meta property=\"og:locale\" content=\"ko_KR\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Diversity? Accuracy? Important Properties of your Dataset\" \/>\n<meta property=\"og:description\" content=\"Regardless of the format, having a good quality dataset is extremely important because it is directly linked to the sustainability of your algorithms like the Machine Learning model i.e. 
the model will fail to serve its purpose if the quality of the dataset is poor or not up to the mark.\" \/>\n<meta property=\"og:url\" content=\"https:\/\/blog.datumo.com\/en\/tech\/16358\" \/>\n<meta property=\"og:site_name\" content=\"DATUMO\" \/>\n<meta property=\"article:published_time\" content=\"2022-06-22T06:54:36+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2024-10-22T08:52:31+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/blog.datumo.com\/en\/wp-content\/uploads\/2022\/06\/richard-horvath-cPccYbPrF-A-unsplash.jpg\" \/>\n\t<meta property=\"og:image:width\" content=\"1920\" \/>\n\t<meta property=\"og:image:height\" content=\"1080\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/jpeg\" \/>\n<meta name=\"author\" content=\"DATUMO\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:title\" content=\"Diversity? Accuracy? Important Properties of your Dataset\" \/>\n<meta name=\"twitter:description\" content=\"Regardless of the format, having a good quality dataset is extremely important because it is directly linked to the sustainability of your algorithms like the Machine Learning model i.e. 
the model will fail to serve its purpose if the quality of the dataset is poor or not up to the mark.\" \/>\n<meta name=\"twitter:image\" content=\"https:\/\/blog.datumo.com\/en\/wp-content\/uploads\/2022\/06\/richard-horvath-cPccYbPrF-A-unsplash.jpg\" \/>\n<meta name=\"twitter:label1\" content=\"\uae00\uc4f4\uc774\" \/>\n\t<meta name=\"twitter:data1\" content=\"DATUMO\" \/>\n\t<meta name=\"twitter:label2\" content=\"\uc608\uc0c1 \ub418\ub294 \ud310\ub3c5 \uc2dc\uac04\" \/>\n\t<meta name=\"twitter:data2\" content=\"12\ubd84\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\/\/blog.datumo.com\/en\/tech\/16358#article\",\"isPartOf\":{\"@id\":\"https:\/\/blog.datumo.com\/en\/tech\/16358\"},\"author\":{\"name\":\"DATUMO\",\"@id\":\"https:\/\/blog.datumo.com\/#\/schema\/person\/02ec2d0ba953b146878dab089dc735b6\"},\"headline\":\"Diversity? Accuracy? Important Properties of your Dataset\",\"datePublished\":\"2022-06-22T06:54:36+00:00\",\"dateModified\":\"2024-10-22T08:52:31+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\/\/blog.datumo.com\/en\/tech\/16358\"},\"wordCount\":2607,\"publisher\":{\"@id\":\"https:\/\/blog.datumo.com\/#organization\"},\"image\":{\"@id\":\"https:\/\/blog.datumo.com\/en\/tech\/16358#primaryimage\"},\"thumbnailUrl\":\"https:\/\/blog.datumo.com\/en\/wp-content\/uploads\/2022\/06\/richard-horvath-cPccYbPrF-A-unsplash.jpg\",\"keywords\":[\"data-centric ai\",\"datumo\",\"high quality dataset\"],\"articleSection\":[\"tech\"],\"inLanguage\":\"ko-KR\"},{\"@type\":\"WebPage\",\"@id\":\"https:\/\/blog.datumo.com\/en\/tech\/16358\",\"url\":\"https:\/\/blog.datumo.com\/en\/tech\/16358\",\"name\":\"Diversity? Accuracy? 
Important Properties of your Dataset - DATUMO\",\"isPartOf\":{\"@id\":\"https:\/\/blog.datumo.com\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\/\/blog.datumo.com\/en\/tech\/16358#primaryimage\"},\"image\":{\"@id\":\"https:\/\/blog.datumo.com\/en\/tech\/16358#primaryimage\"},\"thumbnailUrl\":\"https:\/\/blog.datumo.com\/en\/wp-content\/uploads\/2022\/06\/richard-horvath-cPccYbPrF-A-unsplash.jpg\",\"datePublished\":\"2022-06-22T06:54:36+00:00\",\"dateModified\":\"2024-10-22T08:52:31+00:00\",\"description\":\"Regardless of the format, having a good quality dataset is extremely important because it is directly linked to the sustainability of your algorithms like the Machine Learning model i.e. the model will fail to serve its purpose if the quality of the dataset is poor or not up to the mark.\",\"breadcrumb\":{\"@id\":\"https:\/\/blog.datumo.com\/en\/tech\/16358#breadcrumb\"},\"inLanguage\":\"ko-KR\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/blog.datumo.com\/en\/tech\/16358\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"ko-KR\",\"@id\":\"https:\/\/blog.datumo.com\/en\/tech\/16358#primaryimage\",\"url\":\"https:\/\/blog.datumo.com\/en\/wp-content\/uploads\/2022\/06\/richard-horvath-cPccYbPrF-A-unsplash.jpg\",\"contentUrl\":\"https:\/\/blog.datumo.com\/en\/wp-content\/uploads\/2022\/06\/richard-horvath-cPccYbPrF-A-unsplash.jpg\",\"width\":1920,\"height\":1080},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/blog.datumo.com\/en\/tech\/16358#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/blog.datumo.com\/en\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Diversity? Accuracy? 
Important Properties of your Dataset\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/blog.datumo.com\/#website\",\"url\":\"https:\/\/blog.datumo.com\/\",\"name\":\"DATUMO\",\"description\":\"The Data for Smarter AI\",\"publisher\":{\"@id\":\"https:\/\/blog.datumo.com\/#organization\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/blog.datumo.com\/?s={search_term_string}\"},\"query-input\":\"required name=search_term_string\"}],\"inLanguage\":\"ko-KR\"},{\"@type\":\"Organization\",\"@id\":\"https:\/\/blog.datumo.com\/#organization\",\"name\":\"DATUMO\",\"url\":\"https:\/\/blog.datumo.com\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"ko-KR\",\"@id\":\"https:\/\/blog.datumo.com\/#\/schema\/logo\/image\/\",\"url\":\"https:\/\/blog.datumo.com\/en\/wp-content\/uploads\/2022\/05\/2.1.webp\",\"contentUrl\":\"https:\/\/blog.datumo.com\/en\/wp-content\/uploads\/2022\/05\/2.1.webp\",\"width\":1080,\"height\":600,\"caption\":\"DATUMO\"},\"image\":{\"@id\":\"https:\/\/blog.datumo.com\/#\/schema\/logo\/image\/\"}},{\"@type\":\"Person\",\"@id\":\"https:\/\/blog.datumo.com\/#\/schema\/person\/02ec2d0ba953b146878dab089dc735b6\",\"name\":\"DATUMO\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"ko-KR\",\"@id\":\"https:\/\/blog.datumo.com\/#\/schema\/person\/image\/\",\"url\":\"https:\/\/secure.gravatar.com\/avatar\/1942a8a63e1c8fa0d9be56cda789edd6c0a866259cd5dca24952597ffa8bab3d?s=96&d=mm&r=g\",\"contentUrl\":\"https:\/\/secure.gravatar.com\/avatar\/1942a8a63e1c8fa0d9be56cda789edd6c0a866259cd5dca24952597ffa8bab3d?s=96&d=mm&r=g\",\"caption\":\"DATUMO\"},\"description\":\"DATUMO, The Data for Smarter AI. We seek to drive impact in the world by providing diverse and high quality data to build smarter AI.\",\"sameAs\":[\"https:\/\/blog.datumo.com\/en\"],\"url\":\"https:\/\/blog.datumo.com\/en\/author\/selectstar\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. -->","yoast_head_json":{"title":"Diversity? 
Accuracy? Important Properties of your Dataset - DATUMO","description":"Regardless of the format, having a good quality dataset is extremely important because it is directly linked to the sustainability of your algorithms like the Machine Learning model i.e. the model will fail to serve its purpose if the quality of the dataset is poor or not up to the mark.","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/blog.datumo.com\/en\/tech\/16358","og_locale":"ko_KR","og_type":"article","og_title":"Diversity? Accuracy? Important Properties of your Dataset","og_description":"Regardless of the format, having a good quality dataset is extremely important because it is directly linked to the sustainability of your algorithms like the Machine Learning model i.e. the model will fail to serve its purpose if the quality of the dataset is poor or not up to the mark.","og_url":"https:\/\/blog.datumo.com\/en\/tech\/16358","og_site_name":"DATUMO","article_published_time":"2022-06-22T06:54:36+00:00","article_modified_time":"2024-10-22T08:52:31+00:00","og_image":[{"width":1920,"height":1080,"url":"https:\/\/blog.datumo.com\/en\/wp-content\/uploads\/2022\/06\/richard-horvath-cPccYbPrF-A-unsplash.jpg","type":"image\/jpeg"}],"author":"DATUMO","twitter_card":"summary_large_image","twitter_title":"Diversity? Accuracy? Important Properties of your Dataset","twitter_description":"Regardless of the format, having a good quality dataset is extremely important because it is directly linked to the sustainability of your algorithms like the Machine Learning model i.e. 
the model will fail to serve its purpose if the quality of the dataset is poor or not up to the mark.","twitter_image":"https:\/\/blog.datumo.com\/en\/wp-content\/uploads\/2022\/06\/richard-horvath-cPccYbPrF-A-unsplash.jpg","twitter_misc":{"\uae00\uc4f4\uc774":"DATUMO","\uc608\uc0c1 \ub418\ub294 \ud310\ub3c5 \uc2dc\uac04":"12\ubd84"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/blog.datumo.com\/en\/tech\/16358#article","isPartOf":{"@id":"https:\/\/blog.datumo.com\/en\/tech\/16358"},"author":{"name":"DATUMO","@id":"https:\/\/blog.datumo.com\/#\/schema\/person\/02ec2d0ba953b146878dab089dc735b6"},"headline":"Diversity? Accuracy? Important Properties of your Dataset","datePublished":"2022-06-22T06:54:36+00:00","dateModified":"2024-10-22T08:52:31+00:00","mainEntityOfPage":{"@id":"https:\/\/blog.datumo.com\/en\/tech\/16358"},"wordCount":2607,"publisher":{"@id":"https:\/\/blog.datumo.com\/#organization"},"image":{"@id":"https:\/\/blog.datumo.com\/en\/tech\/16358#primaryimage"},"thumbnailUrl":"https:\/\/blog.datumo.com\/en\/wp-content\/uploads\/2022\/06\/richard-horvath-cPccYbPrF-A-unsplash.jpg","keywords":["data-centric ai","datumo","high quality dataset"],"articleSection":["tech"],"inLanguage":"ko-KR"},{"@type":"WebPage","@id":"https:\/\/blog.datumo.com\/en\/tech\/16358","url":"https:\/\/blog.datumo.com\/en\/tech\/16358","name":"Diversity? Accuracy? 
Important Properties of your Dataset - DATUMO","isPartOf":{"@id":"https:\/\/blog.datumo.com\/#website"},"primaryImageOfPage":{"@id":"https:\/\/blog.datumo.com\/en\/tech\/16358#primaryimage"},"image":{"@id":"https:\/\/blog.datumo.com\/en\/tech\/16358#primaryimage"},"thumbnailUrl":"https:\/\/blog.datumo.com\/en\/wp-content\/uploads\/2022\/06\/richard-horvath-cPccYbPrF-A-unsplash.jpg","datePublished":"2022-06-22T06:54:36+00:00","dateModified":"2024-10-22T08:52:31+00:00","description":"Regardless of the format, having a good quality dataset is extremely important because it is directly linked to the sustainability of your algorithms like the Machine Learning model i.e. the model will fail to serve its purpose if the quality of the dataset is poor or not up to the mark.","breadcrumb":{"@id":"https:\/\/blog.datumo.com\/en\/tech\/16358#breadcrumb"},"inLanguage":"ko-KR","potentialAction":[{"@type":"ReadAction","target":["https:\/\/blog.datumo.com\/en\/tech\/16358"]}]},{"@type":"ImageObject","inLanguage":"ko-KR","@id":"https:\/\/blog.datumo.com\/en\/tech\/16358#primaryimage","url":"https:\/\/blog.datumo.com\/en\/wp-content\/uploads\/2022\/06\/richard-horvath-cPccYbPrF-A-unsplash.jpg","contentUrl":"https:\/\/blog.datumo.com\/en\/wp-content\/uploads\/2022\/06\/richard-horvath-cPccYbPrF-A-unsplash.jpg","width":1920,"height":1080},{"@type":"BreadcrumbList","@id":"https:\/\/blog.datumo.com\/en\/tech\/16358#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/blog.datumo.com\/en\/"},{"@type":"ListItem","position":2,"name":"Diversity? Accuracy? 
Important Properties of your Dataset"}]},{"@type":"WebSite","@id":"https:\/\/blog.datumo.com\/#website","url":"https:\/\/blog.datumo.com\/","name":"DATUMO","description":"The Data for Smarter AI","publisher":{"@id":"https:\/\/blog.datumo.com\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/blog.datumo.com\/?s={search_term_string}"},"query-input":"required name=search_term_string"}],"inLanguage":"ko-KR"},{"@type":"Organization","@id":"https:\/\/blog.datumo.com\/#organization","name":"DATUMO","url":"https:\/\/blog.datumo.com\/","logo":{"@type":"ImageObject","inLanguage":"ko-KR","@id":"https:\/\/blog.datumo.com\/#\/schema\/logo\/image\/","url":"https:\/\/blog.datumo.com\/en\/wp-content\/uploads\/2022\/05\/2.1.webp","contentUrl":"https:\/\/blog.datumo.com\/en\/wp-content\/uploads\/2022\/05\/2.1.webp","width":1080,"height":600,"caption":"DATUMO"},"image":{"@id":"https:\/\/blog.datumo.com\/#\/schema\/logo\/image\/"}},{"@type":"Person","@id":"https:\/\/blog.datumo.com\/#\/schema\/person\/02ec2d0ba953b146878dab089dc735b6","name":"DATUMO","image":{"@type":"ImageObject","inLanguage":"ko-KR","@id":"https:\/\/blog.datumo.com\/#\/schema\/person\/image\/","url":"https:\/\/secure.gravatar.com\/avatar\/1942a8a63e1c8fa0d9be56cda789edd6c0a866259cd5dca24952597ffa8bab3d?s=96&d=mm&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/1942a8a63e1c8fa0d9be56cda789edd6c0a866259cd5dca24952597ffa8bab3d?s=96&d=mm&r=g","caption":"DATUMO"},"description":"DATUMO, The Data for Smarter AI. 
We seek to drive impact in the world by providing diverse and high quality data to build smarter AI.","sameAs":["https:\/\/blog.datumo.com\/en"],"url":"https:\/\/blog.datumo.com\/en\/author\/selectstar"}]}},"_links":{"self":[{"href":"https:\/\/blog.datumo.com\/en\/wp-json\/wp\/v2\/posts\/16358","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/blog.datumo.com\/en\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blog.datumo.com\/en\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blog.datumo.com\/en\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/blog.datumo.com\/en\/wp-json\/wp\/v2\/comments?post=16358"}],"version-history":[{"count":12,"href":"https:\/\/blog.datumo.com\/en\/wp-json\/wp\/v2\/posts\/16358\/revisions"}],"predecessor-version":[{"id":16922,"href":"https:\/\/blog.datumo.com\/en\/wp-json\/wp\/v2\/posts\/16358\/revisions\/16922"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/blog.datumo.com\/en\/wp-json\/wp\/v2\/media\/16483"}],"wp:attachment":[{"href":"https:\/\/blog.datumo.com\/en\/wp-json\/wp\/v2\/media?parent=16358"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blog.datumo.com\/en\/wp-json\/wp\/v2\/categories?post=16358"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blog.datumo.com\/en\/wp-json\/wp\/v2\/tags?post=16358"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}