{"id":16328,"date":"2022-06-22T03:57:03","date_gmt":"2022-06-22T03:57:03","guid":{"rendered":"https:\/\/blog.datumo.com\/en\/?p=16328"},"modified":"2024-10-22T08:51:54","modified_gmt":"2024-10-22T08:51:54","slug":"data-crawling-everything-case-of-social-media","status":"publish","type":"post","link":"https:\/\/blog.datumo.com\/en\/tech\/16328","title":{"rendered":"Data Crawling Everything: Case of Social Media"},"content":{"rendered":"[vc_row pix_particles_check=&#8221;&#8221;][vc_column]<div id=\"el1646799961152-e3ee06c0-4e82\" class=\"w-100 d-block \"><\/div><div class=\"pix-content-box card      vc_custom_1654577545529 custom-responsive-153334890   rounded-lg bg- w-100  \"   ><div class=\"\" style=\"z-index:30;position:relative;\">[vc_column_text]\r\n<p style=\"text-align: left;\"><span style=\"font-size: 14pt;\"><strong>\ud83d\udd11<\/strong> <strong>In 6 minutes you will learn:<\/strong><\/span><\/p>\r\n&nbsp;\r\n<ul>\r\n \t<li>The definition of data scraping<\/li>\r\n \t<li>The process of data scraping, including examples from Twitter and Reddit<\/li>\r\n<\/ul>\r\n[\/vc_column_text]<\/div><\/div>[\/vc_column][\/vc_row][vc_row pix_particles_check=&#8221;&#8221;][vc_column]<div id=\"el1650294698986-a1b962b5-ef42\" class=\"w-100 d-block \"><\/div>[vc_column_text css=&#8221;.vc_custom_1655870344947{padding-top: 40px !important;padding-right: 20px !important;padding-bottom: 40px !important;padding-left: 20px !important;}&#8221;]\r\n<p id=\"361f\" class=\"pw-post-body-paragraph le lf jj bn b lg lh li lj lk ll lm ln lo lp lq lr ls lt lu lv lw lx ly lz ma jc hk\" data-selectable-paragraph=\"\">Data scraping is the process of importing information or data from a website and displaying it in a spreadsheet or a local file on your computer. We will be exploring how one can scrape data from social media sites, particularly Twitter. We will also learn the same for Reddit using its official API. Lastly, we will learn how to generally scrape the content in web pages and pictures from different web pages as well. So, let\u2019s get started!<\/p>\r\n[\/vc_column_text][\/vc_column][\/vc_row][vc_section full_width=&#8221;stretch_row&#8221; pix_over_visibility=&#8221;&#8221; css=&#8221;.vc_custom_1650444445523{padding-top: 80px !important;padding-bottom: 80px !important;background-color: #f8f9fa !important;}&#8221; el_id=&#8221;pix_section_program&#8221;][vc_row full_width=&#8221;stretch_row&#8221; pix_particles_check=&#8221;&#8221;][vc_column content_align=&#8221;text-center&#8221; offset=&#8221;vc_col-lg-offset-0 vc_col-lg-12 vc_col-md-offset-1 vc_col-md-10&#8243;][vc_raw_html]JTNDbWV0YSUyMGh0dHAtZXF1aXYlM0QlMjJyZWZyZXNoJTIyJTIwY29udGVudCUzRCUyMjAlM0IlMjB1cmwlM0RodHRwcyUzQSUyRiUyRmRhdHVtby5jb20lMkZlbiUyRmRhdGEtY3Jhd2xpbmctZXZlcnl0aGluZy1jYXNlLW9mLXNvY2lhbC1tZWRpYSUyRiUyMiUzRQ==[\/vc_raw_html]<div id=\"el1650442503491-f5da6b2f-fa35\" class=\"mb-3 text-left \"><h2 class=\"mb-32 pix-sliding-headline font-weight-bold secondary-font\" data-class=\"secondary-font text-heading-default\" data-style=\"\">Prerequisites<\/h2><\/div>[vc_column_text css=&#8221;.vc_custom_1655878939124{padding-top: 40px !important;padding-bottom: 40px !important;}&#8221;]\r\n<p id=\"7c10\" class=\"pw-post-body-paragraph le lf jj bn b lg lh li lj lk ll lm ln lo lp lq lr ls lt lu lv lw lx ly lz ma jc hk\" style=\"text-align: left;\" data-selectable-paragraph=\"\">Before you go ahead, please note that there are a few prerequisites for this tutorial. 
You should have some prior basic knowledge of machine learning, as well as basic programming knowledge in any language (preferably Python). We will be using Jupyter Notebook for writing our code. If you do not already have it installed, visit [Jupyter Notebook](https://jupyter.org/install.html), or use any other code editor of your liking.

## 1. Scraping tweets from Twitter using Twint

![](https://blog.datumo.com/en/wp-content/uploads/2022/06/0_M39HOoQ1r9cLtshE.jpeg)

There are a number of ways to scrape tweets from Twitter. You can use the Twitter API, but a shortcoming of this is that it limits the number of tweets that can be scraped. Collecting tweets manually is another option, but it takes unnecessary time and effort. This is why we will be using Twint to collect our tweets. **Twint** is a tool that lets you scrape tweets by several criteria, e.g. the tweets of a particular user, tweets containing a particular keyword, or tweets posted after or within a certain time period.
### Installations

You can install **Twint** by running the following command in your terminal:

```
pip install twint
```

### Scraping Twitter tweets using Twint

#### Scraping tweets of a particular user

```python
import twint

config = twint.Config()
# Search tweets tweeted by user 'BarackObama'
config.Username = "BarackObama"
# Limit search results to 20
config.Limit = 20
# Return tweets that were published after this point in 2020
config.Since = "2020-01-01 20:30:15"
# Formatting the tweets
config.Format = "Tweet Id {id}, tweeted at {time}, {date}, by {username} says: {tweet}"
# Storing tweets in a csv file
config.Store_csv = True
config.Output = "Barack Obama"

twint.run.Search(config)
```

**Output:**

```
Tweet Id 1261004586359422979, tweeted at 18:44:56, 2020-05-14, by BarackObama says: Vote.
Tweet Id 1260955716644470784, tweeted at 15:30:44, 2020-05-14, by BarackObama says: Michelle and I want to do our part to give all you parents a break today, so we're reading "The Word Collector" for @chipublib. It's a fun book that vividly illustrates the transformative power of words––and we hope you enjoy it as much as we did. pic.twitter.com/ADYbL6Dzg4
Tweet Id 1260707691900612615, tweeted at 23:05:11, 2020-05-13, by BarackObama says: Despite all the time that's been lost, we can still make real progress against the virus, protect people from the economic fallout, and more safely approach something closer to normal if we start making better policy decisions now. https://www.vox.com/2020/5/13/21248157/testing-quarantine-masks-stimulus …
....
```
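Because `Store_csv` is enabled, the tweets are also written to disk, so you can load them back into pandas for further analysis. Below is a minimal sketch; the exact file path depends on your Twint version and the `Output` setting above, so the name used here is only an assumption you should adjust.

```python
import pandas as pd

# Path is an assumption based on the Output setting above;
# point this at whatever csv file Twint actually created.
df = pd.read_csv("Barack Obama/tweets.csv")

# Quick look at what was collected
print(df.shape)
print(df.head())
```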
#### Scraping tweets with a particular keyword

```python
import twint

# Configure
config = twint.Config()
# Search tweets that mention Taylor Swift
config.Search = "taylor swift"
# Limit search results to 20
config.Limit = 20
# Return tweets that were published after this point in 2020
config.Since = "2020-01-01 20:30:15"
# Formatting the tweets
config.Format = "Tweet Id {id}, tweeted at {time}, {date}, by {username} says: {tweet}"
# Storing tweets in a csv file
config.Store_csv = True
config.Output = "Taylor Swift"

twint.run.Search(config)
```
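Twint's configuration object supports further filters beyond `Username` and `Search`, which is how you restrict results to a time window or a language. The attribute names below are taken from Twint's configuration options but may vary between releases, so treat this as a sketch and verify against your installed version.

```python
import twint

config = twint.Config()
config.Search = "taylor swift"
# Only tweets posted inside this window (assumed Twint config attributes)
config.Since = "2020-01-01"
config.Until = "2020-06-30"
# Only English-language tweets
config.Lang = "en"
config.Limit = 20

twint.run.Search(config)
```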
## 2. Scraping Reddit using the Reddit API

![](https://blog.datumo.com/en/wp-content/uploads/2022/06/0_WLNk1mMKeIk9CvfB.jpeg)

We will be scraping donation requests made on Reddit by using the official Reddit API. To access it, you need to:

1. Go to the official [Reddit](https://www.reddit.com/) website.
2. Log into your Reddit account or create a new one.
3. Go to User Settings.
   ![](https://blog.datumo.com/en/wp-content/uploads/2022/06/다투모1.png)
4. Go to Privacy and Security.
   ![](https://blog.datumo.com/en/wp-content/uploads/2022/06/0_bXJJ5lJBZZk9IoMa.png)
5. Go to App authorization.
   ![](https://blog.datumo.com/en/wp-content/uploads/2022/06/0_e4jLIZoz4DYWroBo.png)
6. Click on 'are you a developer? create an app'.
   ![](https://blog.datumo.com/en/wp-content/uploads/2022/06/0_1bhpVOAeabH8KCYe.png)
7. Create a name for your application and fill in the other relevant credentials. In the redirect URL field, put the URL of your localhost.
8. Click on 'create app'.
   ![](https://blog.datumo.com/en/wp-content/uploads/2022/06/0_b5SkO5YSgRtX0aop.png)
9. Copy the characters underneath 'personal use script' and the characters next to 'secret' and save them in a file or notepad. You will need them to gain access to the API.
   ![](https://blog.datumo.com/en/wp-content/uploads/2022/06/다투모2.png)

### Installations

We will be using a Python framework named **Praw** to easily use the Reddit API. To install it, run the following command in your terminal:

```
pip install praw
```
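Before running the full script, it is worth checking that the credentials you saved in step 9 actually work. A minimal sketch, assuming you have exported the three values as environment variables (the variable names below are purely illustrative), could look like this:

```python
import os
import praw

# Read the credentials saved in step 9 from environment variables,
# so they never end up hard-coded in the script itself.
reddit = praw.Reddit(client_id=os.environ["REDDIT_CLIENT_ID"],
                     client_secret=os.environ["REDDIT_CLIENT_SECRET"],
                     user_agent=os.environ["REDDIT_USER_AGENT"])

# Without a Reddit username/password the session is read-only,
# which is all we need for scraping.
print(reddit.read_only)  # expected: True
```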
### Python Code

```python
import praw
import pandas as pd

# Fill in your own credentials for client_id, client_secret and user_agent.
# The characters under 'personal use script' are your client_id, the characters
# next to 'secret' are your client_secret, and user_agent is the name of your application.
reddit = praw.Reddit(client_id='',
                     client_secret='',
                     user_agent='')

# Subreddits related to donations
subreddits = ['donate', 'Assistance', 'Charity', 'Donation',
              'gofundme', 'RandomKindness', 'donationrequest']

# Get the 10 hottest posts from each subreddit and save them in a list
posts = []
for name in subreddits:
    for post in reddit.subreddit(name).hot(limit=10):
        posts.append([post.title, post.score, post.id, post.subreddit, post.url,
                      post.num_comments, post.selftext, post.created])

posts = pd.DataFrame(posts, columns=['title', 'score', 'id', 'subreddit', 'url',
                                     'num_comments', 'body', 'created'])

# Keep only the columns we need
df = posts[['title', 'score', 'body']]
print(df.head())
print(df.shape)

# Saving donation posts to a csv file
df.to_csv('donations.csv', index=False)
```

**Output:**

![](https://blog.datumo.com/en/wp-content/uploads/2022/06/0_0jUJ9n6z9ChVVMmF.png)
## 3. Scraping contents of a web page

![](https://blog.datumo.com/en/wp-content/uploads/2022/06/0_CXVSBQ04nESiRYTF.jpg)

We will be scraping the text content of the Wikipedia page about [Coronavirus](https://en.wikipedia.org/wiki/Coronavirus) using a simple and powerful Python library named BeautifulSoup. It also helps to be familiar with the basics of [HTML](https://html.com/) for web scraping. First, right-click and open your browser's inspector to inspect the webpage. Hover your cursor over the section whose content you want to scrape, and you should see a blue box surrounding it; if you click it, the corresponding HTML is highlighted in the browser console. The section that we wish to scrape is a div that contains the entire text of the page.

![](https://blog.datumo.com/en/wp-content/uploads/2022/06/0_ekA6YVZ7IjL8ERlp.png)

### Installations

To install BeautifulSoup, run the following command in your terminal:

```
pip install BeautifulSoup4
```

### Python Code

```python
# import libraries
import urllib.request
from bs4 import BeautifulSoup

# specify the url of the webpage whose content you want to scrape
url = "https://en.wikipedia.org/wiki/Coronavirus"
request = urllib.request.Request(url)

# query the website and return the html of the webpage
response = urllib.request.urlopen(request)

# parse the html using BeautifulSoup
soup = BeautifulSoup(response, 'html.parser')

# take out the <div> that holds the article text and get its value
text_box = soup.find('div', attrs={'id': 'bodyContent'})
text = text_box.text.strip()
print(text)
```

**Output:**

```
From Wikipedia, the free encyclopedia
Jump to navigation
Jump to search
This article is about the group of viruses. For the ongoing disease involved in the COVID-19 pandemic, see Coronavirus disease 2019. For the virus that causes this disease, see Severe acute respiratory syndrome coronavirus 2.
Subfamily of viruses in the family Coronaviridae
Orthocoronavirinae
Transmission electron micrograph (TEM) of avian infectious bronchitis virus
Illustration of the morphology of coronaviruses; the club-shaped viral spike peplomers, colored red, create the look of a corona surrounding the virion when observed with an electron microscope.
```
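As the top of the output shows, the `bodyContent` div also includes navigation notes and figure captions. If you only want the article paragraphs, a small variation on the same idea (a sketch under the same page-structure assumptions as above) is to collect the `<p>` tags inside that div instead:

```python
import urllib.request
from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/wiki/Coronavirus"
response = urllib.request.urlopen(urllib.request.Request(url))
soup = BeautifulSoup(response, 'html.parser')

# Collect only the non-empty paragraph tags inside the main content div
body = soup.find('div', attrs={'id': 'bodyContent'})
paragraphs = [p.text.strip() for p in body.find_all('p') if p.text.strip()]

print(paragraphs[0])                      # first paragraph of the article
print(len(paragraphs), "paragraphs in total")
```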
## 4. Scraping images

We will be scraping images in batch through the Fatkun Batch Download Image extension. (A scripted alternative in Python is sketched after the steps below.)

![](https://blog.datumo.com/en/wp-content/uploads/2022/06/0_cNx6SfbNRbFK4mcM.png)

### Prerequisites

You will need the [Google Chrome](https://www.google.com/chrome/) browser along with the [Fatkun Batch Download Image](https://chrome.google.com/webstore/detail/fatkun-batch-download-ima/nnjjahlikiabnchcpehcpkdeckfgnohf/related?hl=en) extension.

### Steps

1. After you have finished the installation, open the website with the pictures that you want to download.
2. Click on the extension's icon.
3. The extension opens a new tab showing all the images it has detected on the page. By default, every image shown in this tab is selected for download; deselect any you do not want, then click on 'save image'.
4. The extension will warn you and ask where to save each file before it is downloaded, so you have to confirm each image.
5. The extension creates a new folder named after the title of the website and downloads all the selected images into it. You can also click on 'more options' to filter the images by link, rename them, or sort them by size.
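If you prefer not to rely on a browser extension, the same batch download can be scripted. Below is a minimal sketch using BeautifulSoup together with the `requests` library (an extra dependency, installable with `pip install requests`); the target URL and output folder are placeholders, and pages that load images via JavaScript will need a different approach.

```python
import os
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/wiki/Coronavirus"   # placeholder page to take images from
out_dir = "downloaded_images"                       # placeholder output folder
os.makedirs(out_dir, exist_ok=True)

# Fetch the page and collect the src attribute of every <img> tag
html = requests.get(url).text
soup = BeautifulSoup(html, "html.parser")
img_urls = [urljoin(url, img["src"]) for img in soup.find_all("img") if img.get("src")]

# Download each image into the output folder, keeping its original extension
for i, img_url in enumerate(img_urls):
    data = requests.get(img_url).content
    ext = os.path.splitext(img_url.split("?")[0])[1] or ".jpg"
    with open(os.path.join(out_dir, f"image_{i}{ext}"), "wb") as f:
        f.write(data)

print(f"Saved {len(img_urls)} images to {out_dir}/")
```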
While crawling gives easy access to many web-based data sources, the data it returns usually comes with heavy noise and contamination and cannot be used as a dataset right away. Companies and researchers therefore need to devote substantial effort to quality control, and having enough human resources for that is always a challenge. It is often more efficient to find a service that does the laborious work, both collection and preprocessing, for you. For that, we could be your perfect solution!

Here at [DATUMO](https://www.datumo.com), we crowdsource our tasks to diverse users located globally to ensure quality and quantity simultaneously. Moreover, our in-house managers double-check the quality of the collected or processed data. Check us out at [datumo.com](https://datumo.com/) for more information!

To sum it all up, we started off with an introduction to data scraping, or data crawling, and then applied it in four different ways: extracting tweets from Twitter without using its API, scraping posts from subreddits using the official Reddit API, scraping content from web pages with Python's BeautifulSoup library, and lastly, downloading images in batch using the Fatkun Batch Download Image extension.
We will&#8230;","protected":false},"author":1,"featured_media":16480,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[131],"tags":[191,127,194,192,193],"class_list":["post-16328","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-tech","tag-data-crawling","tag-datumo","tag-reddit","tag-social-media","tag-twitter"],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v23.3 - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>Data Crawling Everything: Case of Social Media - DATUMO<\/title>\n<meta name=\"description\" content=\"Data scraping is the process of importing information or data from a website and displaying it in a spreadsheet or a local file on your computer.\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/blog.datumo.com\/en\/tech\/16328\" \/>\n<meta property=\"og:locale\" content=\"ko_KR\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Data Crawling Everything: Case of Social Media\" \/>\n<meta property=\"og:description\" content=\"Data scraping is the process of importing information or data from a website and displaying it in a spreadsheet or a local file on your computer.\" \/>\n<meta property=\"og:url\" content=\"https:\/\/blog.datumo.com\/en\/tech\/16328\" \/>\n<meta property=\"og:site_name\" content=\"DATUMO\" \/>\n<meta property=\"article:published_time\" content=\"2022-06-22T03:57:03+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2024-10-22T08:51:54+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/blog.datumo.com\/en\/wp-content\/uploads\/2022\/06\/maxim-berg-6-NP_CdNqtU-unsplash.jpg\" \/>\n\t<meta property=\"og:image:width\" content=\"1920\" \/>\n\t<meta property=\"og:image:height\" content=\"1080\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/jpeg\" \/>\n<meta name=\"author\" content=\"DATUMO\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:title\" content=\"Data Crawling Everything: Case of Social Media\" \/>\n<meta name=\"twitter:description\" content=\"Data scraping is the process of importing information or data from a website and displaying it in a spreadsheet or a local file on your computer.\" \/>\n<meta name=\"twitter:image\" content=\"https:\/\/blog.datumo.com\/en\/wp-content\/uploads\/2022\/06\/maxim-berg-6-NP_CdNqtU-unsplash.jpg\" \/>\n<meta name=\"twitter:label1\" content=\"\uae00\uc4f4\uc774\" \/>\n\t<meta name=\"twitter:data1\" content=\"DATUMO\" \/>\n\t<meta name=\"twitter:label2\" content=\"\uc608\uc0c1 \ub418\ub294 \ud310\ub3c5 \uc2dc\uac04\" \/>\n\t<meta name=\"twitter:data2\" content=\"10\ubd84\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\/\/blog.datumo.com\/en\/tech\/16328#article\",\"isPartOf\":{\"@id\":\"https:\/\/blog.datumo.com\/en\/tech\/16328\"},\"author\":{\"name\":\"DATUMO\",\"@id\":\"https:\/\/blog.datumo.com\/#\/schema\/person\/02ec2d0ba953b146878dab089dc735b6\"},\"headline\":\"Data Crawling Everything: Case of Social 
Media\",\"datePublished\":\"2022-06-22T03:57:03+00:00\",\"dateModified\":\"2024-10-22T08:51:54+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\/\/blog.datumo.com\/en\/tech\/16328\"},\"wordCount\":1721,\"publisher\":{\"@id\":\"https:\/\/blog.datumo.com\/#organization\"},\"image\":{\"@id\":\"https:\/\/blog.datumo.com\/en\/tech\/16328#primaryimage\"},\"thumbnailUrl\":\"https:\/\/blog.datumo.com\/en\/wp-content\/uploads\/2022\/06\/maxim-berg-6-NP_CdNqtU-unsplash.jpg\",\"keywords\":[\"data crawling\",\"datumo\",\"reddit\",\"social media\",\"twitter\"],\"articleSection\":[\"tech\"],\"inLanguage\":\"ko-KR\"},{\"@type\":\"WebPage\",\"@id\":\"https:\/\/blog.datumo.com\/en\/tech\/16328\",\"url\":\"https:\/\/blog.datumo.com\/en\/tech\/16328\",\"name\":\"Data Crawling Everything: Case of Social Media - DATUMO\",\"isPartOf\":{\"@id\":\"https:\/\/blog.datumo.com\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\/\/blog.datumo.com\/en\/tech\/16328#primaryimage\"},\"image\":{\"@id\":\"https:\/\/blog.datumo.com\/en\/tech\/16328#primaryimage\"},\"thumbnailUrl\":\"https:\/\/blog.datumo.com\/en\/wp-content\/uploads\/2022\/06\/maxim-berg-6-NP_CdNqtU-unsplash.jpg\",\"datePublished\":\"2022-06-22T03:57:03+00:00\",\"dateModified\":\"2024-10-22T08:51:54+00:00\",\"description\":\"Data scraping is the process of importing information or data from a website and displaying it in a spreadsheet or a local file on your computer.\",\"breadcrumb\":{\"@id\":\"https:\/\/blog.datumo.com\/en\/tech\/16328#breadcrumb\"},\"inLanguage\":\"ko-KR\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/blog.datumo.com\/en\/tech\/16328\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"ko-KR\",\"@id\":\"https:\/\/blog.datumo.com\/en\/tech\/16328#primaryimage\",\"url\":\"https:\/\/blog.datumo.com\/en\/wp-content\/uploads\/2022\/06\/maxim-berg-6-NP_CdNqtU-unsplash.jpg\",\"contentUrl\":\"https:\/\/blog.datumo.com\/en\/wp-content\/uploads\/2022\/06\/maxim-berg-6-NP_CdNqtU-unsplash.jpg\",\"width\":1920,\"height\":1080},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/blog.datumo.com\/en\/tech\/16328#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/blog.datumo.com\/en\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Data Crawling Everything: Case of Social Media\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/blog.datumo.com\/#website\",\"url\":\"https:\/\/blog.datumo.com\/\",\"name\":\"DATUMO\",\"description\":\"The Data for Smarter AI\",\"publisher\":{\"@id\":\"https:\/\/blog.datumo.com\/#organization\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/blog.datumo.com\/?s={search_term_string}\"},\"query-input\":\"required 
name=search_term_string\"}],\"inLanguage\":\"ko-KR\"},{\"@type\":\"Organization\",\"@id\":\"https:\/\/blog.datumo.com\/#organization\",\"name\":\"DATUMO\",\"url\":\"https:\/\/blog.datumo.com\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"ko-KR\",\"@id\":\"https:\/\/blog.datumo.com\/#\/schema\/logo\/image\/\",\"url\":\"https:\/\/blog.datumo.com\/en\/wp-content\/uploads\/2022\/05\/2.1.webp\",\"contentUrl\":\"https:\/\/blog.datumo.com\/en\/wp-content\/uploads\/2022\/05\/2.1.webp\",\"width\":1080,\"height\":600,\"caption\":\"DATUMO\"},\"image\":{\"@id\":\"https:\/\/blog.datumo.com\/#\/schema\/logo\/image\/\"}},{\"@type\":\"Person\",\"@id\":\"https:\/\/blog.datumo.com\/#\/schema\/person\/02ec2d0ba953b146878dab089dc735b6\",\"name\":\"DATUMO\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"ko-KR\",\"@id\":\"https:\/\/blog.datumo.com\/#\/schema\/person\/image\/\",\"url\":\"https:\/\/secure.gravatar.com\/avatar\/1942a8a63e1c8fa0d9be56cda789edd6c0a866259cd5dca24952597ffa8bab3d?s=96&d=mm&r=g\",\"contentUrl\":\"https:\/\/secure.gravatar.com\/avatar\/1942a8a63e1c8fa0d9be56cda789edd6c0a866259cd5dca24952597ffa8bab3d?s=96&d=mm&r=g\",\"caption\":\"DATUMO\"},\"description\":\"DATUMO, The Data for Smarter AI. We seek to drive impact in the world by providing diverse and high quality data to build smarter AI.\",\"sameAs\":[\"https:\/\/blog.datumo.com\/en\"],\"url\":\"https:\/\/blog.datumo.com\/en\/author\/selectstar\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. -->","yoast_head_json":{"title":"Data Crawling Everything: Case of Social Media - DATUMO","description":"Data scraping is the process of importing information or data from a website and displaying it in a spreadsheet or a local file on your computer.","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/blog.datumo.com\/en\/tech\/16328","og_locale":"ko_KR","og_type":"article","og_title":"Data Crawling Everything: Case of Social Media","og_description":"Data scraping is the process of importing information or data from a website and displaying it in a spreadsheet or a local file on your computer.","og_url":"https:\/\/blog.datumo.com\/en\/tech\/16328","og_site_name":"DATUMO","article_published_time":"2022-06-22T03:57:03+00:00","article_modified_time":"2024-10-22T08:51:54+00:00","og_image":[{"width":1920,"height":1080,"url":"https:\/\/blog.datumo.com\/en\/wp-content\/uploads\/2022\/06\/maxim-berg-6-NP_CdNqtU-unsplash.jpg","type":"image\/jpeg"}],"author":"DATUMO","twitter_card":"summary_large_image","twitter_title":"Data Crawling Everything: Case of Social Media","twitter_description":"Data scraping is the process of importing information or data from a website and displaying it in a spreadsheet or a local file on your computer.","twitter_image":"https:\/\/blog.datumo.com\/en\/wp-content\/uploads\/2022\/06\/maxim-berg-6-NP_CdNqtU-unsplash.jpg","twitter_misc":{"\uae00\uc4f4\uc774":"DATUMO","\uc608\uc0c1 \ub418\ub294 \ud310\ub3c5 \uc2dc\uac04":"10\ubd84"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/blog.datumo.com\/en\/tech\/16328#article","isPartOf":{"@id":"https:\/\/blog.datumo.com\/en\/tech\/16328"},"author":{"name":"DATUMO","@id":"https:\/\/blog.datumo.com\/#\/schema\/person\/02ec2d0ba953b146878dab089dc735b6"},"headline":"Data Crawling Everything: Case of Social 
Media","datePublished":"2022-06-22T03:57:03+00:00","dateModified":"2024-10-22T08:51:54+00:00","mainEntityOfPage":{"@id":"https:\/\/blog.datumo.com\/en\/tech\/16328"},"wordCount":1721,"publisher":{"@id":"https:\/\/blog.datumo.com\/#organization"},"image":{"@id":"https:\/\/blog.datumo.com\/en\/tech\/16328#primaryimage"},"thumbnailUrl":"https:\/\/blog.datumo.com\/en\/wp-content\/uploads\/2022\/06\/maxim-berg-6-NP_CdNqtU-unsplash.jpg","keywords":["data crawling","datumo","reddit","social media","twitter"],"articleSection":["tech"],"inLanguage":"ko-KR"},{"@type":"WebPage","@id":"https:\/\/blog.datumo.com\/en\/tech\/16328","url":"https:\/\/blog.datumo.com\/en\/tech\/16328","name":"Data Crawling Everything: Case of Social Media - DATUMO","isPartOf":{"@id":"https:\/\/blog.datumo.com\/#website"},"primaryImageOfPage":{"@id":"https:\/\/blog.datumo.com\/en\/tech\/16328#primaryimage"},"image":{"@id":"https:\/\/blog.datumo.com\/en\/tech\/16328#primaryimage"},"thumbnailUrl":"https:\/\/blog.datumo.com\/en\/wp-content\/uploads\/2022\/06\/maxim-berg-6-NP_CdNqtU-unsplash.jpg","datePublished":"2022-06-22T03:57:03+00:00","dateModified":"2024-10-22T08:51:54+00:00","description":"Data scraping is the process of importing information or data from a website and displaying it in a spreadsheet or a local file on your computer.","breadcrumb":{"@id":"https:\/\/blog.datumo.com\/en\/tech\/16328#breadcrumb"},"inLanguage":"ko-KR","potentialAction":[{"@type":"ReadAction","target":["https:\/\/blog.datumo.com\/en\/tech\/16328"]}]},{"@type":"ImageObject","inLanguage":"ko-KR","@id":"https:\/\/blog.datumo.com\/en\/tech\/16328#primaryimage","url":"https:\/\/blog.datumo.com\/en\/wp-content\/uploads\/2022\/06\/maxim-berg-6-NP_CdNqtU-unsplash.jpg","contentUrl":"https:\/\/blog.datumo.com\/en\/wp-content\/uploads\/2022\/06\/maxim-berg-6-NP_CdNqtU-unsplash.jpg","width":1920,"height":1080},{"@type":"BreadcrumbList","@id":"https:\/\/blog.datumo.com\/en\/tech\/16328#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/blog.datumo.com\/en\/"},{"@type":"ListItem","position":2,"name":"Data Crawling Everything: Case of Social Media"}]},{"@type":"WebSite","@id":"https:\/\/blog.datumo.com\/#website","url":"https:\/\/blog.datumo.com\/","name":"DATUMO","description":"The Data for Smarter AI","publisher":{"@id":"https:\/\/blog.datumo.com\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/blog.datumo.com\/?s={search_term_string}"},"query-input":"required 
name=search_term_string"}],"inLanguage":"ko-KR"},{"@type":"Organization","@id":"https:\/\/blog.datumo.com\/#organization","name":"DATUMO","url":"https:\/\/blog.datumo.com\/","logo":{"@type":"ImageObject","inLanguage":"ko-KR","@id":"https:\/\/blog.datumo.com\/#\/schema\/logo\/image\/","url":"https:\/\/blog.datumo.com\/en\/wp-content\/uploads\/2022\/05\/2.1.webp","contentUrl":"https:\/\/blog.datumo.com\/en\/wp-content\/uploads\/2022\/05\/2.1.webp","width":1080,"height":600,"caption":"DATUMO"},"image":{"@id":"https:\/\/blog.datumo.com\/#\/schema\/logo\/image\/"}},{"@type":"Person","@id":"https:\/\/blog.datumo.com\/#\/schema\/person\/02ec2d0ba953b146878dab089dc735b6","name":"DATUMO","image":{"@type":"ImageObject","inLanguage":"ko-KR","@id":"https:\/\/blog.datumo.com\/#\/schema\/person\/image\/","url":"https:\/\/secure.gravatar.com\/avatar\/1942a8a63e1c8fa0d9be56cda789edd6c0a866259cd5dca24952597ffa8bab3d?s=96&d=mm&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/1942a8a63e1c8fa0d9be56cda789edd6c0a866259cd5dca24952597ffa8bab3d?s=96&d=mm&r=g","caption":"DATUMO"},"description":"DATUMO, The Data for Smarter AI. We seek to drive impact in the world by providing diverse and high quality data to build smarter AI.","sameAs":["https:\/\/blog.datumo.com\/en"],"url":"https:\/\/blog.datumo.com\/en\/author\/selectstar"}]}},"_links":{"self":[{"href":"https:\/\/blog.datumo.com\/en\/wp-json\/wp\/v2\/posts\/16328","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/blog.datumo.com\/en\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blog.datumo.com\/en\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blog.datumo.com\/en\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/blog.datumo.com\/en\/wp-json\/wp\/v2\/comments?post=16328"}],"version-history":[{"count":20,"href":"https:\/\/blog.datumo.com\/en\/wp-json\/wp\/v2\/posts\/16328\/revisions"}],"predecessor-version":[{"id":16921,"href":"https:\/\/blog.datumo.com\/en\/wp-json\/wp\/v2\/posts\/16328\/revisions\/16921"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/blog.datumo.com\/en\/wp-json\/wp\/v2\/media\/16480"}],"wp:attachment":[{"href":"https:\/\/blog.datumo.com\/en\/wp-json\/wp\/v2\/media?parent=16328"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blog.datumo.com\/en\/wp-json\/wp\/v2\/categories?post=16328"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blog.datumo.com\/en\/wp-json\/wp\/v2\/tags?post=16328"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}