{"id":16171,"date":"2022-05-30T12:49:35","date_gmt":"2022-05-30T12:49:35","guid":{"rendered":"https:\/\/blog.datumo.com\/en\/?p=16171"},"modified":"2024-10-22T08:23:36","modified_gmt":"2024-10-22T08:23:36","slug":"handling-missing-data-in-pandas-dataframes-in-python","status":"publish","type":"post","link":"https:\/\/blog.datumo.com\/en\/tech\/16171","title":{"rendered":"Handling Missing Data in Pandas Dataframes in Python"},"content":{"rendered":"[vc_row pix_particles_check=&#8221;&#8221;][vc_column]<div class=\"pix-content-box card      vc_custom_1650362800073    rounded-lg bg- w-100  \"   ><div class=\"\" style=\"z-index:30;position:relative;\">[vc_raw_html]JTNDbWV0YSUyMGh0dHAtZXF1aXYlM0QlMjJyZWZyZXNoJTIyJTIwY29udGVudCUzRCUyMjAlM0IlMjB1cmwlM0RodHRwcyUzQSUyRiUyRmRhdHVtby5jb20lMkZlbiUyRmhhbmRsaW5nLW1pc3NpbmctZGF0YS1pbi1wYW5kYXMtZGF0YWZyYW1lcy1pbi1weXRob24lMkYlMjIlM0U=[\/vc_raw_html][vc_column_text css=&#8221;.vc_custom_1653914996046{padding-top: 40px !important;padding-right: 40px !important;padding-bottom: 0px !important;padding-left: 40px !important;}&#8221;]Missing data, as the name suggests, are data observations that do not contain any value. Missing values can have a huge impact on the performance of data science models. Therefore, it is important to understand the reasons behind missing data, and the techniques that can be adopted to handle missing data.\r\n\r\nIn this article, you will study some of the common reasons for having missing data in your observations. 
Finally, you will study how to handle missing values in data stored in a Pandas dataframe, which is one of the most common structures for storing data in the Python programming language.[\/vc_column_text]<\/div><\/div><div id=\"el1653916872425-9a5af7cc-7062\" class=\"w-100 d-block \"><\/div><div id=\"el1650442503491-f5da6b2f-fa35\" class=\"mb-3 text-left \"><h2 class=\"mb-32 pix-sliding-headline font-weight-bold secondary-font\" data-class=\"secondary-font text-heading-default\" data-style=\"\">Reasons for Missing Data<\/h2><\/div>[vc_column_text css=&#8221;.vc_custom_1653915132335{padding-top: 40px !important;padding-bottom: 40px !important;}&#8221;]There can be multiple reasons for having missing data in your observations. Three of the most common reasons are listed below:\r\n<ol>\r\n \t<li>Sometimes the data is intentionally not recorded. For instance, while completing market surveys, some people might not enter their annual revenue. In such cases, you will have missing observations in your data.<\/li>\r\n \t<li>The data is often not available at the time the observation is recorded. For instance, some information about patients brought to a hospital emergency department might not be available at the time of registration.<\/li>\r\n \t<li>Information is sometimes lost for technical reasons or during calculations. For instance, you might have entered a person\u2019s weight into a computer, but an application crash or power failure prevented the data from being saved.<\/li>\r\n<\/ol>\r\n&nbsp;\r\n<h5><strong>Types of Missing Data<\/strong><\/h5>\r\n&nbsp;\r\n\r\nMissing data can be divided into three main categories:\r\n\r\n&nbsp;\r\n<h5><strong>Missing Data Completely Randomly<\/strong><\/h5>\r\n&nbsp;\r\n\r\nIf a missing observation for a particular attribute has no relationship with any other attribute, we can say that the data is missing completely randomly. 
For instance, if a person forgets to write their city name in a survey, you can say that this data is missing completely randomly, since the omission is not related to any other attribute.\r\n\r\n&nbsp;\r\n<h5><strong>Missing Data Randomly<\/strong><\/h5>\r\n&nbsp;\r\n\r\nIf a missing observation is related to one of the other attributes of the data, we can say that the data is missing randomly. For instance, overweight patients may be less likely to report their weight when filling in surveys. In such cases we can say that the data is missing randomly.\r\n\r\n&nbsp;\r\n<h5><strong>Missing Data Not Randomly<\/strong><\/h5>\r\n[\/vc_column_text]<div id=\"el1650294698986-a1b962b5-ef42\" class=\"w-100 d-block \"><\/div>[\/vc_column][\/vc_row][vc_section full_width=&#8221;stretch_row&#8221; pix_over_visibility=&#8221;&#8221; css=&#8221;.vc_custom_1650444445523{padding-top: 80px !important;padding-bottom: 80px !important;background-color: #f8f9fa !important;}&#8221; el_id=&#8221;pix_section_program&#8221;][vc_row full_width=&#8221;stretch_row&#8221; pix_particles_check=&#8221;&#8221;][vc_column content_align=&#8221;text-center&#8221; offset=&#8221;vc_col-lg-offset-0 vc_col-lg-12 vc_col-md-offset-1 vc_col-md-10&#8243;]<div id=\"el1653915008478-dc3e72cb-a384\" class=\"mb-3 text-left \"><h2 class=\"mb-32 pix-sliding-headline font-weight-bold secondary-font\" data-class=\"secondary-font text-heading-default\" data-style=\"\">Disadvantages of Missing Data<\/h2><\/div>[vc_column_text css=&#8221;.vc_custom_1653915187992{padding-top: 40px !important;padding-bottom: 40px !important;}&#8221;]\r\n<p style=\"text-align: left;\">There are multiple disadvantages of having missing data in your datasets. Some of the disadvantages are listed below:<\/p>\r\n\r\n<ol style=\"text-align: left;\">\r\n \t<li>Some data science and machine learning tools, such as scikit-learn, don&#8217;t expect your data to have missing values. 
You have to either remove or otherwise handle missing values before you can feed your data to models from these libraries.<\/li>\r\n \t<li>Data imputation techniques may distort the overall distribution of your data.<\/li>\r\n<\/ol>\r\n<p style=\"text-align: left;\">Enough of the theory. Let\u2019s now see some of the techniques used for handling missing data stored in Pandas dataframes in Python.<\/p>\r\n[\/vc_column_text][\/vc_column][\/vc_row][\/vc_section][vc_row pix_particles_check=&#8221;&#8221;][vc_column]<div id=\"el1653915018756-8cb8aa7c-ea3d\" class=\"w-100 d-block \"><\/div><div  class=\"pix-heading-el text-left \"><div><div class=\"slide-in-container\"><h2 class=\"text-heading-default font-weight-bold heading-text el-title_custom_color mb-12\" style=\"\" data-anim-type=\"\" data-anim-delay=\"0\">Importing and Analysing the Dataset for Missing Values<\/h2><\/div><\/div><\/div>[vc_column_text css=&#8221;.vc_custom_1653915528895{padding-top: 40px !important;padding-right: 20px !important;padding-bottom: px !important;padding-left: 20px !important;}&#8221;]The dataset that you will be using in this article can be downloaded in the form of a CSV file from the following Kaggle link.\r\n\r\n<a href=\"https:\/\/www.kaggle.com\/code\/dansbecker\/handling-missing-values\/data?select=melb_data.csv\">https:\/\/www.kaggle.com\/code\/dansbecker\/handling-missing-values\/data?select=melb_data.csv<\/a>\r\n\r\nThe script below imports the dataset and displays its first five rows.\r\n\r\n&nbsp;\r\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\">import pandas as pd\r\n\r\n# Change data_path to the folder that contains melb_data.csv\r\ndata_path = \"\/home\/manimalik\/Datasets\/\"\r\nhouse_data = pd.read_csv(data_path + \"melb_data.csv\")\r\nhouse_data.head()<\/pre>\r\n&nbsp;\r\n\r\nThe dataset consists of information about houses in Melbourne (a city in Australia). 
Some of the data attributes are Room, Price, Postcode, Building Area, Year Built etc.\r\n\r\n&nbsp;\r\n\r\nOutput:\r\n\r\n<img fetchpriority=\"high\" decoding=\"async\" class=\"aligncenter wp-image-16187 size-full\" src=\"https:\/\/blog.datumo.com\/en\/wp-content\/uploads\/2022\/05\/pasted-image-0-13-e1653915319741.png\" alt=\"\" width=\"1074\" height=\"324\" srcset=\"https:\/\/blog.datumo.com\/en\/wp-content\/uploads\/2022\/05\/pasted-image-0-13-e1653915319741.png 1074w, https:\/\/blog.datumo.com\/en\/wp-content\/uploads\/2022\/05\/pasted-image-0-13-e1653915319741-300x91.png 300w, https:\/\/blog.datumo.com\/en\/wp-content\/uploads\/2022\/05\/pasted-image-0-13-e1653915319741-1024x309.png 1024w, https:\/\/blog.datumo.com\/en\/wp-content\/uploads\/2022\/05\/pasted-image-0-13-e1653915319741-768x232.png 768w\" sizes=\"(max-width: 1074px) 100vw, 1074px\" \/>\r\n\r\n&nbsp;\r\n\r\nLet\u2019s try to see the number of records in our dataset.\r\n\r\nhouse_data.shape\r\n\r\n&nbsp;\r\n\r\nOutput:\r\n\r\n(13580, 21)\r\n\r\n&nbsp;\r\n\r\nThe dataset consists of 13580 records and 21 attributes.\r\n\r\n&nbsp;\r\n\r\nNext, let\u2019s try to see which attributes or columns contain null values.\r\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\">house_data.isnull().sum()<\/pre>\r\n&nbsp;\r\n\r\n&nbsp;\r\n\r\nOutput:\r\n\r\n<img decoding=\"async\" class=\"aligncenter size-full wp-image-16175\" src=\"https:\/\/blog.datumo.com\/en\/wp-content\/uploads\/2022\/05\/pasted-image-0-1-1.png\" alt=\"\" width=\"264\" height=\"418\" srcset=\"https:\/\/blog.datumo.com\/en\/wp-content\/uploads\/2022\/05\/pasted-image-0-1-1.png 264w, https:\/\/blog.datumo.com\/en\/wp-content\/uploads\/2022\/05\/pasted-image-0-1-1-189x300.png 189w\" sizes=\"(max-width: 264px) 100vw, 264px\" \/>\r\n\r\n&nbsp;\r\n\r\nFrom the above output, you can see that Car, BuildingArea, YearBuilt, and CouncilArea are the columns having missing data. 
The BuildingArea attribute has null values in 6450 records, which is about 47.5% of the total dataset.\r\n\r\nLet\u2019s keep only the columns that contain missing values. We will work with these columns only for handling missing data.\r\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\">house_data = house_data.filter([\"Car\", \"BuildingArea\", \"YearBuilt\", \"CouncilArea\"])\r\nhouse_data.head()<\/pre>\r\n&nbsp;\r\n\r\nOutput:\r\n\r\n<img decoding=\"async\" class=\"aligncenter size-full wp-image-16176\" src=\"https:\/\/blog.datumo.com\/en\/wp-content\/uploads\/2022\/05\/pasted-image-0-2-1.png\" alt=\"\" width=\"338\" height=\"184\" srcset=\"https:\/\/blog.datumo.com\/en\/wp-content\/uploads\/2022\/05\/pasted-image-0-2-1.png 338w, https:\/\/blog.datumo.com\/en\/wp-content\/uploads\/2022\/05\/pasted-image-0-2-1-300x163.png 300w\" sizes=\"(max-width: 338px) 100vw, 338px\" \/>\r\n\r\n&nbsp;\r\n\r\nFinally, let\u2019s print the data types of the columns having missing values.\r\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\">house_data.dtypes<\/pre>\r\n&nbsp;\r\n\r\nOutput:\r\n\r\n<img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-16188\" src=\"https:\/\/blog.datumo.com\/en\/wp-content\/uploads\/2022\/05\/pasted-image-0-13-1.png\" alt=\"\" width=\"229\" height=\"106\" \/>\r\n\r\n&nbsp;\r\n\r\nThe above output shows that Car, BuildingArea, and YearBuilt are numeric attributes (of type float), whereas only the CouncilArea attribute contains categorical values.[\/vc_column_text]<div id=\"el1653447584182-343c5b4e-1046\" class=\"w-100 d-block \"><\/div><div id=\"el1650294913061-211813f5-5f2d\" class=\"w-100 d-block \"><\/div>[\/vc_column][\/vc_row][vc_section full_width=&#8221;stretch_row&#8221; pix_over_visibility=&#8221;&#8221; css=&#8221;.vc_custom_1650444445523{padding-top: 80px !important;padding-bottom: 80px !important;background-color: #f8f9fa !important;}&#8221;][vc_row 
full_width=&#8221;stretch_row&#8221; pix_particles_check=&#8221;&#8221;][vc_column content_align=&#8221;text-center&#8221; offset=&#8221;vc_col-lg-offset-0 vc_col-lg-12 vc_col-md-offset-1 vc_col-md-10&#8243;]<div  class=\"pix-heading-el text-left \"><div><div class=\"slide-in-container\"><h2 class=\"text-heading-default font-weight-bold heading-text el-title_custom_color mb-12\" style=\"\" data-anim-type=\"\" data-anim-delay=\"0\">Remove Complete Rows or Columns with Missing Data<\/h2><\/div><\/div><\/div>[vc_column_text css=&#8221;.vc_custom_1653915610586{padding-top: 40px !important;}&#8221;]\r\n<p style=\"text-align: left;\">The first and simplest approach to handling missing data is to remove every record in which any column contains a missing value. Alternatively, you can drop a column if the majority of its values are null.<\/p>\r\n<p style=\"text-align: left;\">Let\u2019s try to remove all the rows containing null values from our sample dataset.<\/p>\r\n\r\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\">house_data_filtered = house_data.dropna()\r\nprint(house_data_filtered.shape)<\/pre>\r\n&nbsp;\r\n<p style=\"text-align: left;\">Output:<\/p>\r\n<p style=\"text-align: left;\">(6196, 21)<\/p>\r\n<p style=\"text-align: left;\">After removing all the rows where any column contains a null or missing value, we are left with only 6196 records.<\/p>\r\n<p style=\"text-align: left;\">Another option is to only remove those rows where a specific column contains missing values. 
For example, the script below removes all the rows where the BuildingArea attribute contains null values.<\/p>\r\n\r\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\">house_data_filtered = house_data[house_data['BuildingArea'].notna()]\r\nprint(house_data_filtered.shape)<\/pre>\r\n&nbsp;\r\n<p style=\"text-align: left;\">Output:<\/p>\r\n<p style=\"text-align: left;\">(7130, 21)<\/p>\r\n&nbsp;\r\n<p style=\"text-align: left;\">The main advantage of removing all the rows containing missing values is that this technique is extremely simple to implement and works for both numeric and categorical data types.<\/p>\r\n<p style=\"text-align: left;\">The main disadvantage of removing rows with missing values is that if a large number of rows contain missing values, you will lose a large chunk of useful data contained in attributes that do not have any missing values.<\/p>\r\n<p style=\"text-align: left;\">As a rule of thumb, handle missing data by removing rows only if less than 5% of the rows contain missing data.<\/p>\r\n<p style=\"text-align: left;\">In addition to removing complete rows, imputation techniques exist that fill in missing data by inferring values from the rest of the dataset. 
In the next sections, you will see imputation techniques for handling missing numeric and categorical data.<\/p>\r\n[\/vc_column_text][\/vc_column][\/vc_row][\/vc_section][vc_row pix_particles_check=&#8221;&#8221;][vc_column]<div id=\"el1650362147064-486b7dc2-a9b3\" class=\"w-100 d-block \"><\/div><div id=\"el1650450433074-0be5e40e-928e\" class=\"w-100 d-block \"><\/div><div  class=\"pix-heading-el text-left \"><div><div class=\"slide-in-container\"><h2 class=\"text-heading-default font-weight-bold heading-text el-title_custom_color mb-12\" style=\"\" data-anim-type=\"\" data-anim-delay=\"0\">Handling Numeric Missing Values using Imputation<\/h2><\/div><\/div><\/div>[vc_column_text css=&#8221;.vc_custom_1653915651682{padding-top: 40px !important;padding-bottom: 0px !important;}&#8221;]The main imputation techniques for handling missing numeric values are:\r\n<ul>\r\n \t<li>Mean\/Median Imputation<\/li>\r\n \t<li>End of Distribution Imputation<\/li>\r\n \t<li>Arbitrary Value Imputation<\/li>\r\n<\/ul>\r\n[\/vc_column_text][vc_column_text css=&#8221;.vc_custom_1653916076430{padding-top: 40px !important;padding-bottom: 0px !important;}&#8221;]\r\n<h3><strong>Mean\/Median Imputation<\/strong><\/h3>\r\n&nbsp;\r\n\r\nIn mean or median imputation, as the name suggests, the numerical missing data for an attribute is replaced by the mean or median of the remaining values for that attribute.\r\n\r\nAs an example, we will perform mean and median imputation for missing values in the BuildingArea attribute of our dataset.\r\n\r\nThe script below calculates the median and mean values for the BuildingArea attribute.\r\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\">median_BuildingArea = house_data.BuildingArea.median()\r\nprint(\"Median of BuildingArea:\", median_BuildingArea)\r\n\r\nmean_BuildingArea = house_data.BuildingArea.mean()\r\nprint(\"Mean of BuildingArea:\", mean_BuildingArea)<\/pre>\r\n&nbsp;\r\n\r\nOutput:\r\n\r\nMedian of BuildingArea: 126.0\r\n\r\nMean of 
BuildingArea: 151.96764988779805\r\n\r\n&nbsp;\r\n\r\nAs a next step, we will create two new columns in our dataset:\r\n<ol>\r\n \t<li><strong>Median_BuildingArea:<\/strong> a copy of the BuildingArea column in which missing or null values are replaced by the median of the BuildingArea attribute.<\/li>\r\n \t<li><strong>Mean_BuildingArea:<\/strong> a copy of the BuildingArea column in which missing or null values are replaced by the mean of the BuildingArea attribute.<\/li>\r\n<\/ol>\r\n&nbsp;\r\n\r\nThe following script performs the mean and median imputation for the BuildingArea attribute.\r\n\r\n&nbsp;\r\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\">import numpy as np\r\n\r\nhouse_data['Median_BuildingArea'] = house_data.BuildingArea.fillna(median_BuildingArea)\r\nhouse_data['Mean_BuildingArea'] = np.round(house_data.BuildingArea.fillna(mean_BuildingArea), 1)\r\nhouse_data.head()<\/pre>\r\n&nbsp;\r\n\r\nOutput:\r\n\r\n<img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-16178\" src=\"https:\/\/blog.datumo.com\/en\/wp-content\/uploads\/2022\/05\/pasted-image-0-4-1.png\" alt=\"\" width=\"621\" height=\"180\" srcset=\"https:\/\/blog.datumo.com\/en\/wp-content\/uploads\/2022\/05\/pasted-image-0-4-1.png 621w, https:\/\/blog.datumo.com\/en\/wp-content\/uploads\/2022\/05\/pasted-image-0-4-1-300x87.png 300w\" sizes=\"(max-width: 621px) 100vw, 621px\" \/>\r\n\r\n&nbsp;\r\n\r\nFrom the above output, you can see that the 1st and 4th rows contain a NaN (or null) value in the BuildingArea column. You can see the median and mean values for these rows in the Median_BuildingArea and Mean_BuildingArea columns, respectively.\r\n\r\nThe main advantage of mean and median imputation is that the values are extremely easy to calculate. Furthermore, mean and median imputation can also be implemented in production. 
Mean and median imputation are also good choices for data missing randomly.\r\n\r\nThe major disadvantage of mean and median imputation is that they distort the original data distribution, particularly by reducing the variance of the data.\r\n\r\n&nbsp;\r\n<h3><strong>End of Distribution Imputation<\/strong><\/h3>\r\n&nbsp;\r\n\r\nFor data that is not missing randomly, mean and median imputation are not good approaches. Instead, end of distribution imputation (also known as end of tail imputation) is the commonly used technique. End of distribution imputation tells the data model that the data is not missing randomly and hence cannot be inferred from the existing data via interpolation.\r\n\r\n&nbsp;\r\n\r\nIt is good practice to remove data outliers before performing end of distribution imputation.\r\n\r\nTo view data outliers, you can plot a box plot.\r\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\">import seaborn as sns\r\n\r\nsns.boxplot(y=house_data['BuildingArea'])<\/pre>\r\n&nbsp;\r\n\r\nOutput:\r\n\r\n<img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-16179\" src=\"https:\/\/blog.datumo.com\/en\/wp-content\/uploads\/2022\/05\/pasted-image-0-5-1.png\" alt=\"\" width=\"573\" height=\"391\" srcset=\"https:\/\/blog.datumo.com\/en\/wp-content\/uploads\/2022\/05\/pasted-image-0-5-1.png 573w, https:\/\/blog.datumo.com\/en\/wp-content\/uploads\/2022\/05\/pasted-image-0-5-1-300x205.png 300w\" sizes=\"(max-width: 573px) 100vw, 573px\" \/>\r\n\r\n&nbsp;\r\n\r\nThe box plot shows that most of our values are below 1000. The dots shown in the above figure are outliers.\r\n\r\nNormally, outliers are removed using the interquartile range (IQR) technique. 
However, for the sake of simplicity, we will simply remove the largest values from our dataset.\r\n\r\nLet\u2019s print the 10 largest values from the BuildingArea column.\r\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\">house_data['BuildingArea'].nlargest(n=10)<\/pre>\r\n&nbsp;\r\n\r\nOutput:\r\n\r\n<img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-16180\" src=\"https:\/\/blog.datumo.com\/en\/wp-content\/uploads\/2022\/05\/pasted-image-0-6-1.png\" alt=\"\" width=\"337\" height=\"216\" srcset=\"https:\/\/blog.datumo.com\/en\/wp-content\/uploads\/2022\/05\/pasted-image-0-6-1.png 337w, https:\/\/blog.datumo.com\/en\/wp-content\/uploads\/2022\/05\/pasted-image-0-6-1-300x192.png 300w\" sizes=\"(max-width: 337px) 100vw, 337px\" \/>\r\n\r\n&nbsp;\r\n\r\nThe output shows that the largest value is 44515. There is a huge gap after that: the second largest value is 6791.\r\n\r\nWe replace all values of 1000 or more in the BuildingArea column with NaN, which removes them from our dataset. 
As a result, 8 values will be removed.\r\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\"># Keep values below 1000 and replace everything else with NaN\r\nhouse_data['BuildingArea'] = np.where(house_data['BuildingArea'] &lt; 1000,\r\n                                      house_data['BuildingArea'],\r\n                                      np.nan)<\/pre>\r\n&nbsp;\r\n\r\nNow if you plot a histogram of values in the BuildingArea column, you will see that the data is approximately normally distributed.\r\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\">house_data['BuildingArea'].hist(bins=50)<\/pre>\r\n&nbsp;\r\n\r\nOutput:\r\n\r\n<img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-16181\" src=\"https:\/\/blog.datumo.com\/en\/wp-content\/uploads\/2022\/05\/pasted-image-0-7-1.png\" alt=\"\" width=\"544\" height=\"389\" srcset=\"https:\/\/blog.datumo.com\/en\/wp-content\/uploads\/2022\/05\/pasted-image-0-7-1.png 544w, https:\/\/blog.datumo.com\/en\/wp-content\/uploads\/2022\/05\/pasted-image-0-7-1-300x215.png 300w\" sizes=\"(max-width: 544px) 100vw, 544px\" \/>\r\n\r\n&nbsp;\r\n\r\nNow you can perform the end of distribution imputation. 
To do so, you have to add three standard deviations to the mean value of the BuildingArea column.\r\n\r\nThe script below finds the end of distribution value for the BuildingArea column.\r\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\">end_of_val = house_data['BuildingArea'].mean() + 3 * house_data['BuildingArea'].std()\r\nprint(end_of_val)<\/pre>\r\n&nbsp;\r\n\r\nOutput:\r\n\r\n392.3636082673994\r\n\r\nFinally, the script below creates a new column, EOD_BuildingArea, that contains the end of distribution value for the rows where the original BuildingArea column contains missing data.\r\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\">import numpy as np\r\n\r\nhouse_data['EOD_BuildingArea'] = house_data.BuildingArea.fillna(end_of_val)\r\nhouse_data.head()<\/pre>\r\n&nbsp;\r\n\r\nOutput:\r\n\r\n<img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-16182\" src=\"https:\/\/blog.datumo.com\/en\/wp-content\/uploads\/2022\/05\/pasted-image-0-8-1.png\" alt=\"\" width=\"738\" height=\"183\" srcset=\"https:\/\/blog.datumo.com\/en\/wp-content\/uploads\/2022\/05\/pasted-image-0-8-1.png 738w, https:\/\/blog.datumo.com\/en\/wp-content\/uploads\/2022\/05\/pasted-image-0-8-1-300x74.png 300w\" sizes=\"(max-width: 738px) 100vw, 738px\" \/>\r\n\r\n&nbsp;\r\n<h3><\/h3>\r\n<h3><strong>Arbitrary Value Imputation<\/strong><\/h3>\r\n&nbsp;\r\n\r\nIn arbitrary value imputation, a totally arbitrary value is selected to replace missing values. The arbitrary value should not occur in the dataset. Rather, it signifies a missing value. A common choice is 99, 999, 9999, or a similar number that lies outside the range of the data. If all values are positive, you can use -1 as the arbitrary value.\r\n\r\nIn our dataset, the BuildingArea column only contains positive values. Therefore, to perform arbitrary value imputation, we create a new column, 
AVE_BuildingArea, which contains -1 for the rows where the original BuildingArea column contained missing data.\r\n\r\nThe following script performs the arbitrary value imputation for the BuildingArea column.\r\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\">import numpy as np\r\n\r\nhouse_data['AVE_BuildingArea'] = house_data.BuildingArea.fillna(-1)\r\nhouse_data.head()<\/pre>\r\n&nbsp;\r\n\r\nOutput:\r\n\r\n<img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-16183\" src=\"https:\/\/blog.datumo.com\/en\/wp-content\/uploads\/2022\/05\/pasted-image-0-9-1.png\" alt=\"\" width=\"861\" height=\"177\" srcset=\"https:\/\/blog.datumo.com\/en\/wp-content\/uploads\/2022\/05\/pasted-image-0-9-1.png 861w, https:\/\/blog.datumo.com\/en\/wp-content\/uploads\/2022\/05\/pasted-image-0-9-1-300x62.png 300w, https:\/\/blog.datumo.com\/en\/wp-content\/uploads\/2022\/05\/pasted-image-0-9-1-768x158.png 768w\" sizes=\"(max-width: 861px) 100vw, 861px\" \/>\r\n\r\nLike end of distribution imputation, arbitrary value imputation is suitable for replacing data that is not missing randomly.[\/vc_column_text]<div id=\"el1650362652282-42ee7789-aa09\" class=\"w-100 d-block \"><\/div>[\/vc_column][\/vc_row][vc_section full_width=&#8221;stretch_row&#8221; pix_over_visibility=&#8221;&#8221; css=&#8221;.vc_custom_1650444445523{padding-top: 80px !important;padding-bottom: 80px !important;background-color: #f8f9fa !important;}&#8221;][vc_row full_width=&#8221;stretch_row&#8221; pix_particles_check=&#8221;&#8221;][vc_column content_align=&#8221;text-center&#8221; offset=&#8221;vc_col-lg-offset-0 vc_col-lg-12 vc_col-md-offset-1 vc_col-md-10&#8243;]<div  class=\"pix-heading-el text-left \"><div><div class=\"slide-in-container\"><h2 class=\"text-heading-default font-weight-bold heading-text el-title_custom_color mb-12\" style=\"\" data-anim-type=\"\" data-anim-delay=\"0\">Handling Categorical Missing Values<\/h2><\/div><\/div><\/div>[vc_column_text css=&#8221;.vc_custom_1653917006934{padding-top: 40px 
!important;}&#8221;]\r\n<p style=\"text-align: left;\">Two main techniques exist for handling categorical missing values:<\/p>\r\n\r\n<ul style=\"text-align: left;\">\r\n \t<li style=\"text-align: left;\">Frequent Category Imputation<\/li>\r\n \t<li style=\"text-align: left;\">Missing Category Imputation<\/li>\r\n<\/ul>\r\n&nbsp;\r\n\r\n&nbsp;\r\n<h3 style=\"text-align: left;\"><strong>Frequent Category Imputation<\/strong><\/h3>\r\n&nbsp;\r\n<p style=\"text-align: left;\">In frequent category imputation, the missing categorical value is replaced by the most frequently occurring category in that column.<\/p>\r\n<p style=\"text-align: left;\">As an example, we will replace missing values in the CouncilArea categorical column using frequent category imputation.<\/p>\r\n<p style=\"text-align: left;\">Let\u2019s find the most frequently occurring value in the CouncilArea column.<\/p>\r\n\r\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\">house_data['CouncilArea'].value_counts()<\/pre>\r\n<p style=\"text-align: left;\">The output below shows that the \u201cMoreland\u201d council area is the most frequently occurring value in the CouncilArea column.<\/p>\r\n&nbsp;\r\n<p style=\"text-align: left;\">Output:<\/p>\r\n<p style=\"text-align: left;\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-16184\" src=\"https:\/\/blog.datumo.com\/en\/wp-content\/uploads\/2022\/05\/pasted-image-0-10-1.png\" alt=\"\" width=\"319\" height=\"648\" srcset=\"https:\/\/blog.datumo.com\/en\/wp-content\/uploads\/2022\/05\/pasted-image-0-10-1.png 319w, https:\/\/blog.datumo.com\/en\/wp-content\/uploads\/2022\/05\/pasted-image-0-10-1-148x300.png 148w\" sizes=\"(max-width: 319px) 100vw, 319px\" \/><\/p>\r\n&nbsp;\r\n<p style=\"text-align: left;\">Another way to get the most frequently occurring value from a categorical column is to use the \u201cmode()\u201d method, as shown in the following script.<\/p>\r\n&nbsp;\r\n<pre class=\"EnlighterJSRAW\" 
data-enlighter-language=\"generic\">house_data['CouncilArea'].mode()\r\n\r\nOutput:\r\n\r\n0\u00a0 \u00a0 Moreland\r\n\r\ndtype: object<\/pre>\r\n&nbsp;\r\n<p style=\"text-align: left;\">Now we know that the value \u201cMoreland\u201d is the most frequently occurring value in the CouncilArea column. We can replace the missing values in the CouncilArea column with this value and store the result in a new column.<\/p>\r\n<p style=\"text-align: left;\">The following script does that and then prints the first five rows with null values in the CouncilArea column along with the replaced values in the Mode_CouncilArea column.<\/p>\r\n&nbsp;\r\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\">import numpy as np\r\n\r\nhouse_data['Mode_CouncilArea'] = house_data.CouncilArea.fillna(\"Moreland\")\r\nhouse_data[house_data['CouncilArea'].isna()].filter([\"CouncilArea\", \"Mode_CouncilArea\"], axis = 1).head()<\/pre>\r\n<p style=\"text-align: left;\">Output:<\/p>\r\n<p style=\"text-align: left;\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-16194\" src=\"https:\/\/blog.datumo.com\/en\/wp-content\/uploads\/2022\/05\/pasted-image-0-14.png\" alt=\"\" width=\"285\" height=\"180\" \/><\/p>\r\n&nbsp;\r\n<p style=\"text-align: left;\">Frequent category imputation should be used when a categorical column contains randomly missing data.<\/p>\r\n\r\n<h3><\/h3>\r\n&nbsp;\r\n\r\n&nbsp;\r\n<h3 style=\"text-align: left;\"><strong>Missing Category Imputation<\/strong><\/h3>\r\n&nbsp;\r\n<p style=\"text-align: left;\">In missing category imputation, missing values in a categorical column are simply replaced by a dummy value that does not exist in that column. 
Missing category imputation is used to tell the data models that the data is not missing at random and should not be replaced by any other value.<\/p>\r\n<p style=\"text-align: left;\">The following script creates a new column \u201cMSA_CouncilArea\u201d where the missing values from the CouncilArea column are replaced by the string \u201cMissing\u201d.<\/p>\r\n&nbsp;\r\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\">import numpy as np\r\n\r\nhouse_data['MSA_CouncilArea'] = house_data.CouncilArea.fillna(\"Missing\")<\/pre>\r\n<p style=\"text-align: left;\">The script below shows the first rows where the original CouncilArea column contains null values, along with the \u201cMode_CouncilArea\u201d column that holds the frequent category imputation values, and the \u201cMSA_CouncilArea\u201d column that holds the missing category imputation values.<\/p>\r\n\r\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\">house_data[house_data['CouncilArea'].isna()].filter([\"CouncilArea\", \"Mode_CouncilArea\", \"MSA_CouncilArea\"], axis = 1).head()<\/pre>\r\n<p style=\"text-align: left;\">Output:<\/p>\r\n<img loading=\"lazy\" decoding=\"async\" class=\"size-full wp-image-16186 alignleft\" src=\"https:\/\/blog.datumo.com\/en\/wp-content\/uploads\/2022\/05\/pasted-image-0-12.png\" alt=\"\" width=\"412\" height=\"177\" srcset=\"https:\/\/blog.datumo.com\/en\/wp-content\/uploads\/2022\/05\/pasted-image-0-12.png 412w, https:\/\/blog.datumo.com\/en\/wp-content\/uploads\/2022\/05\/pasted-image-0-12-300x129.png 300w\" sizes=\"(max-width: 412px) 100vw, 412px\" \/>[\/vc_column_text][\/vc_column][\/vc_row][\/vc_section][vc_row pix_particles_check=&#8221;&#8221;][vc_column]<div id=\"el1653916933602-6ba84952-a220\" class=\"w-100 d-block \"><\/div>[vc_column_text]Missing values can hugely affect the performance of your data models. Therefore, it is immensely important to handle missing values in your datasets. 
To this end, several approaches exist for handling missing data, as you saw in this article. However, the choice of technique depends on the type of the data and the reason it is missing.\r\n\r\nAs a rule of thumb, if less than 5% of records in your dataset contain missing values, you can simply remove those records. Otherwise, if your numeric attribute contains missing data, and the data is missing randomly, you can select mean\/median imputation. On the other hand, if your numeric data is not missing randomly, you should perform end of distribution or arbitrary value imputation.\r\n\r\nIn case of categorical columns, if your data is randomly missing, frequent category imputation is the approach to use. On the other hand, if your categorical data is not missing randomly, you should opt for missing category imputation.[\/vc_column_text][\/vc_column][\/vc_row][vc_row pix_particles_check=&#8221;&#8221;][vc_column width=&#8221;1\/2&#8243;]<div id=\"el1646794934167-c0c94dd3-ea74\" class=\"w-100 d-block \"><\/div><div class=\" mb-3 mb-md-0 \"  ><div class=\"card w-100 h-100 bg-white  vc_custom_1652982865548  pix-hover-item rounded-10 position-relative overflow-hidden2 text-white tilt fancy_card\" ><div class=\"card-img-overlay overflow-visible d-inline-block w-100 pix-img-overlay pix-p-30 d-flex align-items-end text-left\"><div class=\"w-100 \"><h3 class=\"card-title  text-black font-weight-bold mb-0 animate-in\" style=\"\">See what we can do for you.<\/h3><p class=\"card-text pix-pt-10 text-black \" style=\"\">Build smarter AI with us.<\/p><div class=\"card-btn-div mt-4 d-inline-block w-100\"><a  href=\"https:\/\/datumo.com\" class=\"btn mb-2     text-white btn-black d-inline-block      btn-md\" target=\"_blank\" rel=\"noopener\"    ><span class=\"font-weight-bold \" >Learn More<\/span><\/a><\/div><\/div><\/div><\/div><\/div>[\/vc_column][vc_column width=&#8221;1\/2&#8243;]<div id=\"el1646794982519-9a19190b-7fde\" class=\"w-100 d-block 
\"><\/div><div class=\" mb-3 mb-md-0 \"  ><div class=\"card w-100 h-100 bg-black  vc_custom_1653916974837  pix-hover-item rounded-10 position-relative overflow-hidden2 text-white tilt fancy_card\" ><div class=\"card-img-overlay overflow-visible d-inline-block w-100 pix-img-overlay pix-p-30 d-flex align-items-end text-left\"><div class=\"w-100 \"><h3 class=\"card-title  text-white font-weight-bold mb-0 animate-in\" style=\"\">We would like to support the AI industry by sharing.<\/h3><p class=\"card-text pix-pt-10 text-white \" style=\"\"><\/p><div class=\"card-btn-div mt-4 d-inline-block w-100\"><a  href=\"https:\/\/open.datumo.com\/en\" class=\"btn mb-2    vc_custom_1653916974840  btn-primary d-inline-block      btn-md\" target=\"_blank\" rel=\"noopener\"    ><span class=\"font-weight-bold \" >Download Open Datasets<\/span><\/a><\/div><\/div><\/div><\/div><\/div>[\/vc_column][\/vc_row][vc_row pix_particles_check=&#8221;&#8221;][vc_column]<div id=\"el1646799961152-e3ee06c0-4e82\" class=\"w-100 d-block \"><\/div>[\/vc_column][\/vc_row]","protected":false},"excerpt":{"rendered":"[vc_row pix_particles_check=&#8221;&#8221;][vc_column][vc_column_text css=&#8221;.vc_custom_1653915132335{padding-top: 40px !important;padding-bottom: 40px !important;}&#8221;]There can be multiple reasons for having missing data in your observations. Three of the most common reasons are listed below: Sometimes the data is not recorded intentionally. 
For instance, while completing market surveys,&#8230;","protected":false},"author":1,"featured_media":2764,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[131],"tags":[26,149,166,130,127,167,168,165],"class_list":["post-16171","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-tech","tag-ai","tag-data","tag-dataframes","tag-datasets","tag-datumo","tag-missing-data","tag-pandas","tag-python"],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v23.3 - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>Handling Missing Data in Pandas Dataframes in Python - DATUMO<\/title>\n<meta name=\"description\" content=\"In this article, you will study some of the common reasons for having missing data in your observations.\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/blog.datumo.com\/en\/tech\/16171\" \/>\n<meta property=\"og:locale\" content=\"ko_KR\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Handling Missing Data in Pandas Dataframes in Python\" \/>\n<meta property=\"og:description\" content=\"In this article, you will study some of the common reasons for having missing data in your observations.\" \/>\n<meta property=\"og:url\" content=\"https:\/\/blog.datumo.com\/en\/tech\/16171\" \/>\n<meta property=\"og:site_name\" content=\"DATUMO\" \/>\n<meta property=\"article:published_time\" content=\"2022-05-30T12:49:35+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2024-10-22T08:23:36+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/blog.datumo.com\/en\/wp-content\/uploads\/2022\/05\/2.1.webp\" \/>\n\t<meta property=\"og:image:width\" content=\"1080\" \/>\n\t<meta property=\"og:image:height\" content=\"600\" \/>\n\t<meta 
property=\"og:image:type\" content=\"image\/webp\" \/>\n<meta name=\"author\" content=\"DATUMO\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:title\" content=\"Handling Missing Data in Pandas Dataframes in Python\" \/>\n<meta name=\"twitter:description\" content=\"In this article, you will study some of the common reasons for having missing data in your observations.\" \/>\n<meta name=\"twitter:image\" content=\"https:\/\/blog.datumo.com\/en\/wp-content\/uploads\/2022\/05\/2.1.webp\" \/>\n<meta name=\"twitter:label1\" content=\"\uae00\uc4f4\uc774\" \/>\n\t<meta name=\"twitter:data1\" content=\"DATUMO\" \/>\n\t<meta name=\"twitter:label2\" content=\"\uc608\uc0c1 \ub418\ub294 \ud310\ub3c5 \uc2dc\uac04\" \/>\n\t<meta name=\"twitter:data2\" content=\"16\ubd84\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"TechArticle\",\"@id\":\"https:\/\/blog.datumo.com\/en\/tech\/16171#article\",\"isPartOf\":{\"@id\":\"https:\/\/blog.datumo.com\/en\/tech\/16171\"},\"author\":{\"name\":\"DATUMO\",\"@id\":\"https:\/\/blog.datumo.com\/#\/schema\/person\/02ec2d0ba953b146878dab089dc735b6\"},\"headline\":\"Handling Missing Data in Pandas Dataframes in Python\",\"datePublished\":\"2022-05-30T12:49:35+00:00\",\"dateModified\":\"2024-10-22T08:23:36+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\/\/blog.datumo.com\/en\/tech\/16171\"},\"wordCount\":3011,\"publisher\":{\"@id\":\"https:\/\/blog.datumo.com\/#organization\"},\"image\":{\"@id\":\"https:\/\/blog.datumo.com\/en\/tech\/16171#primaryimage\"},\"thumbnailUrl\":\"https:\/\/blog.datumo.com\/en\/wp-content\/uploads\/2020\/05\/portfolio-12.jpg\",\"keywords\":[\"AI\",\"data\",\"dataframes\",\"datasets\",\"datumo\",\"Missing 
Data\",\"Pandas\",\"python\"],\"articleSection\":[\"tech\"],\"inLanguage\":\"ko-KR\"},{\"@type\":\"WebPage\",\"@id\":\"https:\/\/blog.datumo.com\/en\/tech\/16171\",\"url\":\"https:\/\/blog.datumo.com\/en\/tech\/16171\",\"name\":\"Handling Missing Data in Pandas Dataframes in Python - DATUMO\",\"isPartOf\":{\"@id\":\"https:\/\/blog.datumo.com\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\/\/blog.datumo.com\/en\/tech\/16171#primaryimage\"},\"image\":{\"@id\":\"https:\/\/blog.datumo.com\/en\/tech\/16171#primaryimage\"},\"thumbnailUrl\":\"https:\/\/blog.datumo.com\/en\/wp-content\/uploads\/2020\/05\/portfolio-12.jpg\",\"datePublished\":\"2022-05-30T12:49:35+00:00\",\"dateModified\":\"2024-10-22T08:23:36+00:00\",\"description\":\"In this article, you will study some of the common reasons for having missing data in your observations.\",\"breadcrumb\":{\"@id\":\"https:\/\/blog.datumo.com\/en\/tech\/16171#breadcrumb\"},\"inLanguage\":\"ko-KR\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/blog.datumo.com\/en\/tech\/16171\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"ko-KR\",\"@id\":\"https:\/\/blog.datumo.com\/en\/tech\/16171#primaryimage\",\"url\":\"https:\/\/blog.datumo.com\/en\/wp-content\/uploads\/2020\/05\/portfolio-12.jpg\",\"contentUrl\":\"https:\/\/blog.datumo.com\/en\/wp-content\/uploads\/2020\/05\/portfolio-12.jpg\",\"width\":1820,\"height\":1660},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/blog.datumo.com\/en\/tech\/16171#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/blog.datumo.com\/en\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Handling Missing Data in Pandas Dataframes in Python\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/blog.datumo.com\/#website\",\"url\":\"https:\/\/blog.datumo.com\/\",\"name\":\"DATUMO\",\"description\":\"The Data for Smarter 
AI\",\"publisher\":{\"@id\":\"https:\/\/blog.datumo.com\/#organization\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/blog.datumo.com\/?s={search_term_string}\"},\"query-input\":\"required name=search_term_string\"}],\"inLanguage\":\"ko-KR\"},{\"@type\":\"Organization\",\"@id\":\"https:\/\/blog.datumo.com\/#organization\",\"name\":\"DATUMO\",\"url\":\"https:\/\/blog.datumo.com\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"ko-KR\",\"@id\":\"https:\/\/blog.datumo.com\/#\/schema\/logo\/image\/\",\"url\":\"https:\/\/blog.datumo.com\/en\/wp-content\/uploads\/2022\/05\/2.1.webp\",\"contentUrl\":\"https:\/\/blog.datumo.com\/en\/wp-content\/uploads\/2022\/05\/2.1.webp\",\"width\":1080,\"height\":600,\"caption\":\"DATUMO\"},\"image\":{\"@id\":\"https:\/\/blog.datumo.com\/#\/schema\/logo\/image\/\"}},{\"@type\":\"Person\",\"@id\":\"https:\/\/blog.datumo.com\/#\/schema\/person\/02ec2d0ba953b146878dab089dc735b6\",\"name\":\"DATUMO\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"ko-KR\",\"@id\":\"https:\/\/blog.datumo.com\/#\/schema\/person\/image\/\",\"url\":\"https:\/\/secure.gravatar.com\/avatar\/1942a8a63e1c8fa0d9be56cda789edd6c0a866259cd5dca24952597ffa8bab3d?s=96&d=mm&r=g\",\"contentUrl\":\"https:\/\/secure.gravatar.com\/avatar\/1942a8a63e1c8fa0d9be56cda789edd6c0a866259cd5dca24952597ffa8bab3d?s=96&d=mm&r=g\",\"caption\":\"DATUMO\"},\"description\":\"DATUMO, The Data for Smarter AI. We seek to drive impact in the world by providing diverse and high quality data to build smarter AI.\",\"sameAs\":[\"https:\/\/blog.datumo.com\/en\"],\"url\":\"https:\/\/blog.datumo.com\/en\/author\/selectstar\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. 
-->","yoast_head_json":{"title":"Handling Missing Data in Pandas Dataframes in Python - DATUMO","description":"In this article, you will study some of the common reasons for having missing data in your observations.","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/blog.datumo.com\/en\/tech\/16171","og_locale":"ko_KR","og_type":"article","og_title":"Handling Missing Data in Pandas Dataframes in Python","og_description":"In this article, you will study some of the common reasons for having missing data in your observations.","og_url":"https:\/\/blog.datumo.com\/en\/tech\/16171","og_site_name":"DATUMO","article_published_time":"2022-05-30T12:49:35+00:00","article_modified_time":"2024-10-22T08:23:36+00:00","og_image":[{"width":1080,"height":600,"url":"https:\/\/blog.datumo.com\/en\/wp-content\/uploads\/2022\/05\/2.1.webp","type":"image\/webp"}],"author":"DATUMO","twitter_card":"summary_large_image","twitter_title":"Handling Missing Data in Pandas Dataframes in Python","twitter_description":"In this article, you will study some of the common reasons for having missing data in your observations.","twitter_image":"https:\/\/blog.datumo.com\/en\/wp-content\/uploads\/2022\/05\/2.1.webp","twitter_misc":{"\uae00\uc4f4\uc774":"DATUMO","\uc608\uc0c1 \ub418\ub294 \ud310\ub3c5 \uc2dc\uac04":"16\ubd84"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"TechArticle","@id":"https:\/\/blog.datumo.com\/en\/tech\/16171#article","isPartOf":{"@id":"https:\/\/blog.datumo.com\/en\/tech\/16171"},"author":{"name":"DATUMO","@id":"https:\/\/blog.datumo.com\/#\/schema\/person\/02ec2d0ba953b146878dab089dc735b6"},"headline":"Handling Missing Data in Pandas Dataframes in 
Python","datePublished":"2022-05-30T12:49:35+00:00","dateModified":"2024-10-22T08:23:36+00:00","mainEntityOfPage":{"@id":"https:\/\/blog.datumo.com\/en\/tech\/16171"},"wordCount":3011,"publisher":{"@id":"https:\/\/blog.datumo.com\/#organization"},"image":{"@id":"https:\/\/blog.datumo.com\/en\/tech\/16171#primaryimage"},"thumbnailUrl":"https:\/\/blog.datumo.com\/en\/wp-content\/uploads\/2020\/05\/portfolio-12.jpg","keywords":["AI","data","dataframes","datasets","datumo","Missing Data","Pandas","python"],"articleSection":["tech"],"inLanguage":"ko-KR"},{"@type":"WebPage","@id":"https:\/\/blog.datumo.com\/en\/tech\/16171","url":"https:\/\/blog.datumo.com\/en\/tech\/16171","name":"Handling Missing Data in Pandas Dataframes in Python - DATUMO","isPartOf":{"@id":"https:\/\/blog.datumo.com\/#website"},"primaryImageOfPage":{"@id":"https:\/\/blog.datumo.com\/en\/tech\/16171#primaryimage"},"image":{"@id":"https:\/\/blog.datumo.com\/en\/tech\/16171#primaryimage"},"thumbnailUrl":"https:\/\/blog.datumo.com\/en\/wp-content\/uploads\/2020\/05\/portfolio-12.jpg","datePublished":"2022-05-30T12:49:35+00:00","dateModified":"2024-10-22T08:23:36+00:00","description":"In this article, you will study some of the common reasons for having missing data in your 
observations.","breadcrumb":{"@id":"https:\/\/blog.datumo.com\/en\/tech\/16171#breadcrumb"},"inLanguage":"ko-KR","potentialAction":[{"@type":"ReadAction","target":["https:\/\/blog.datumo.com\/en\/tech\/16171"]}]},{"@type":"ImageObject","inLanguage":"ko-KR","@id":"https:\/\/blog.datumo.com\/en\/tech\/16171#primaryimage","url":"https:\/\/blog.datumo.com\/en\/wp-content\/uploads\/2020\/05\/portfolio-12.jpg","contentUrl":"https:\/\/blog.datumo.com\/en\/wp-content\/uploads\/2020\/05\/portfolio-12.jpg","width":1820,"height":1660},{"@type":"BreadcrumbList","@id":"https:\/\/blog.datumo.com\/en\/tech\/16171#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/blog.datumo.com\/en\/"},{"@type":"ListItem","position":2,"name":"Handling Missing Data in Pandas Dataframes in Python"}]},{"@type":"WebSite","@id":"https:\/\/blog.datumo.com\/#website","url":"https:\/\/blog.datumo.com\/","name":"DATUMO","description":"The Data for Smarter AI","publisher":{"@id":"https:\/\/blog.datumo.com\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/blog.datumo.com\/?s={search_term_string}"},"query-input":"required 
name=search_term_string"}],"inLanguage":"ko-KR"},{"@type":"Organization","@id":"https:\/\/blog.datumo.com\/#organization","name":"DATUMO","url":"https:\/\/blog.datumo.com\/","logo":{"@type":"ImageObject","inLanguage":"ko-KR","@id":"https:\/\/blog.datumo.com\/#\/schema\/logo\/image\/","url":"https:\/\/blog.datumo.com\/en\/wp-content\/uploads\/2022\/05\/2.1.webp","contentUrl":"https:\/\/blog.datumo.com\/en\/wp-content\/uploads\/2022\/05\/2.1.webp","width":1080,"height":600,"caption":"DATUMO"},"image":{"@id":"https:\/\/blog.datumo.com\/#\/schema\/logo\/image\/"}},{"@type":"Person","@id":"https:\/\/blog.datumo.com\/#\/schema\/person\/02ec2d0ba953b146878dab089dc735b6","name":"DATUMO","image":{"@type":"ImageObject","inLanguage":"ko-KR","@id":"https:\/\/blog.datumo.com\/#\/schema\/person\/image\/","url":"https:\/\/secure.gravatar.com\/avatar\/1942a8a63e1c8fa0d9be56cda789edd6c0a866259cd5dca24952597ffa8bab3d?s=96&d=mm&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/1942a8a63e1c8fa0d9be56cda789edd6c0a866259cd5dca24952597ffa8bab3d?s=96&d=mm&r=g","caption":"DATUMO"},"description":"DATUMO, The Data for Smarter AI. 
We seek to drive impact in the world by providing diverse and high quality data to build smarter AI.","sameAs":["https:\/\/blog.datumo.com\/en"],"url":"https:\/\/blog.datumo.com\/en\/author\/selectstar"}]}},"_links":{"self":[{"href":"https:\/\/blog.datumo.com\/en\/wp-json\/wp\/v2\/posts\/16171","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/blog.datumo.com\/en\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blog.datumo.com\/en\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blog.datumo.com\/en\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/blog.datumo.com\/en\/wp-json\/wp\/v2\/comments?post=16171"}],"version-history":[{"count":16,"href":"https:\/\/blog.datumo.com\/en\/wp-json\/wp\/v2\/posts\/16171\/revisions"}],"predecessor-version":[{"id":16911,"href":"https:\/\/blog.datumo.com\/en\/wp-json\/wp\/v2\/posts\/16171\/revisions\/16911"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/blog.datumo.com\/en\/wp-json\/wp\/v2\/media\/2764"}],"wp:attachment":[{"href":"https:\/\/blog.datumo.com\/en\/wp-json\/wp\/v2\/media?parent=16171"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blog.datumo.com\/en\/wp-json\/wp\/v2\/categories?post=16171"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blog.datumo.com\/en\/wp-json\/wp\/v2\/tags?post=16171"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}