🔑 In 6 minutes you will learn:
- The 2 types of datasets and examples
- How you could make your own image dataset
You will most likely need an image dataset for one of two purposes: a commercial project, or a project that you’re doing out of interest to enhance your machine learning skills. So let’s get to it!
Dataset Types
Before we go any further, you should know that there are two types of datasets available on the Internet:
1. Public Datasets
These datasets are scraped and annotated by groups of people (usually research groups) who were building an application that required a dataset of that kind. Once they’re done with their application, they make the dataset public and free to use for two purposes:
- To enable people reading their papers to reproduce the results they have published
- To allow people to reuse the dataset for different purposes in their own applications
2. Commercial Datasets
These datasets have been gathered by companies or groups for the purpose of making a profit. You can use these datasets, but you have to pay a fee to access them. In this post, we are only going to cover the public ones that you can use for free!
At present, there are millions of public datasets available. If you wish to build a computer vision application, you should consider searching for a public dataset before attempting to create your own, because it can take a lot of time and effort to put together a large, good-quality dataset.
This dataset search tool has gained a lot of popularity since its launch. It gives you access to 25 million datasets! YES, you heard it right. 25 million! However, not all of them are public (though most are), and not all of them are image datasets. It lets you filter results by criteria such as ‘Upload Date’, ‘Download Format’, ‘Usage Rights’, and ‘Free/Paid’, which can be really helpful in fishing the right dataset out of a sea of datasets.
ImageNet
Used For: Classification and Detection
ImageNet is one of the most popular image datasets out there, with 14 million images organized in a hierarchy. To explain that further, ImageNet has images from around 22,000 different categories, and those categories can be grouped together to form 27 higher-level categories.
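To make the idea of a category hierarchy concrete, here is a minimal sketch in Python. The category names and the `PARENT` mapping are invented toy values for illustration, not the dataset's actual categories. The point is only that a fine-grained label can be rolled up into a higher-level group by following parent links.

```python
# Toy hierarchy: each category points to its parent; top-level groups have no parent.
# These names are illustrative assumptions, not the dataset's real category list.
PARENT = {
    "tabby cat": "cat",
    "siamese cat": "cat",
    "cat": "animal",
    "golden retriever": "dog",
    "dog": "animal",
    "sports car": "car",
    "car": "vehicle",
}


def top_level(category, parent=PARENT):
    """Follow parent links until reaching a category that has no parent."""
    while category in parent:
        category = parent[category]
    return category


def group_by_top_level(labels, parent=PARENT):
    """Group fine-grained labels under their higher-level category."""
    groups = {}
    for label in labels:
        groups.setdefault(top_level(label, parent), []).append(label)
    return groups


if __name__ == "__main__":
    print(group_by_top_level(["tabby cat", "golden retriever", "sports car"]))
```

In this toy example, "tabby cat" and "golden retriever" both end up under "animal", while "sports car" ends up under "vehicle".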
Google’s Open Images Dataset
Used For: Classification and Detection
This dataset contains 9 million annotated images spanning 600 different classes, so it is very likely to contain enough images of the object class that you’re looking for.
Labeled Faces in the Wild Dataset
Used For: Recognition
If you’re planning to build a face detection and/or recognition application, this is the dataset for you. It contains more than 13,000 images of human faces that you can use to train your model. You can use the download link below to get the dataset in the format of your preference.
CIFAR-10 and CIFAR-100
Used For: Classification
These two are among the most popular image datasets out there. The 10 and 100 in their names are the number of object classes they contain. CIFAR-10 contains 60,000 images across 10 classes of objects, including airplanes, ships, horses, and dogs. CIFAR-100 has the same number of images, but spread across 100 object classes.
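Since both datasets hold the same number of images, CIFAR-100 has far fewer examples per class than CIFAR-10. The short calculation below illustrates this. It assumes the images are spread evenly across the classes, and `images_per_class` is just a helper name chosen for this sketch.

```python
def images_per_class(total_images, num_classes):
    """Number of images per class, assuming an even spread across classes."""
    return total_images // num_classes


if __name__ == "__main__":
    print("CIFAR-10: ", images_per_class(60_000, 10), "images per class")
    print("CIFAR-100:", images_per_class(60_000, 100), "images per class")
```

With ten times as many classes sharing the same pool of images, each CIFAR-100 class has one tenth of the examples, which is worth keeping in mind when choosing between the two.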
Plant Datasets
Used For: Classification
Here’s another great resource if you’re looking for Image Datasets related to plants. Go to the site and search for the plant type that you’re interested in and proceed to download.
MS COCO
Used For: Object detection and segmentation
MS COCO is also one of the most extensive image datasets available at this time. It has 80 different object categories and more than 200k labeled images, and it is sponsored by Facebook and Microsoft. Plenty of research papers have used it as the standard benchmark for testing their machine learning algorithms and demonstrating their performance.
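Detection datasets such as MS COCO ship their labels as annotation files alongside the images. As a rough sketch, the toy dictionary below mimics the general layout of a COCO-style JSON annotation file (`images`, `categories`, `annotations`). The file names, IDs, and boxes are made up for illustration. The helper `count_objects_per_category` (a name chosen for this sketch) counts how many labeled objects each category has, which is a quick way to check whether a dataset has enough examples of the class you care about.

```python
from collections import Counter

# Made-up sample in the style of a COCO annotation file (all values are illustrative).
toy_coco = {
    "images": [
        {"id": 1, "file_name": "street.jpg", "width": 640, "height": 480},
        {"id": 2, "file_name": "park.jpg", "width": 640, "height": 480},
    ],
    "categories": [
        {"id": 1, "name": "person"},
        {"id": 2, "name": "dog"},
    ],
    "annotations": [
        # bbox is [x, y, width, height], measured from the top-left corner of the image
        {"id": 10, "image_id": 1, "category_id": 1, "bbox": [50, 60, 100, 200]},
        {"id": 11, "image_id": 2, "category_id": 1, "bbox": [300, 80, 90, 210]},
        {"id": 12, "image_id": 2, "category_id": 2, "bbox": [120, 250, 150, 100]},
    ],
}


def count_objects_per_category(coco):
    """Count labeled objects per category name in a COCO-style dictionary."""
    id_to_name = {cat["id"]: cat["name"] for cat in coco["categories"]}
    counts = Counter(id_to_name[ann["category_id"]] for ann in coco["annotations"])
    return dict(counts)


if __name__ == "__main__":
    print(count_objects_per_category(toy_coco))
```

With a real annotation file, you would first read the dictionary from disk with `json.load` and then pass it to the same helper.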
There are thousands of other image datasets available as well, and to explore further options, you can also check out these two articles which list some very useful Image Datasets:
The last resource that we are going to talk about is TensorFlow Datasets. These are easy-to-use datasets stored in the cloud, and you can download them to your system directly through TensorFlow for use in your application. Below is a code snippet showing how you can download and use the MNIST dataset directly through TensorFlow:
import tensorflow.compat.v2 as tf
import tensorflow_datasets as tfds

# Download the MNIST dataset from the cloud (on first use) and load it into the dataset variable
dataset = tfds.load("mnist")
# Pick the predefined train and test splits out of the loaded dataset
train_data, test_data = dataset["train"], dataset["test"]
To see the list of all datasets available through TensorFlow, run the following command:
print(tfds.list_builders()) # print names of all public datasets available in tensorflow datasets
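The list printed above is long, so it can help to filter it by name. Below is a minimal sketch: `find_datasets` and `search_tfds` are hypothetical helpers written for this post (they are not part of TensorFlow Datasets), and the matching is a plain substring check on the dataset names. The `tensorflow_datasets` import is kept inside `search_tfds` so that the filtering logic can be tried without the library installed.

```python
def find_datasets(names, keyword):
    """Return the dataset names that contain `keyword`, ignoring case."""
    keyword = keyword.lower()
    return sorted(name for name in names if keyword in name.lower())


def search_tfds(keyword):
    """Search the TensorFlow Datasets catalog by name (needs tensorflow_datasets)."""
    import tensorflow_datasets as tfds  # imported lazily; requires the package

    return find_datasets(tfds.list_builders(), keyword)


if __name__ == "__main__":
    # Sample names for demonstration; call search_tfds("cifar") to query the real catalog.
    sample = ["mnist", "cifar10", "cifar100", "fashion_mnist"]
    print(find_datasets(sample, "cifar"))
```

Once you have found a promising name, you can pass it to `tfds.load` in the same way as "mnist" above.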
If you are developing an application or writing a research paper, more often than not a public dataset that fits your computer vision model will not be available. In such a scenario, you have to scrape the dataset yourself and then annotate it for your use. Two tools can come in really handy for building your own dataset:
- Fatkun Batch Download Image is a very effective Google Chrome extension for batch downloading all images from a webpage. For instance, suppose you wish to build a dataset of animals. All you would have to do is go to Google Images, type in “fish”, and then let Fatkun download all the images from the results page. You can then do the same for the rest of the animals until you have a sufficiently large dataset.
- Colabeler is a powerful Image Annotation tool to draw bounding boxes around the objects of your interest in an image and save them in a variety of formats. It also gives you the ability to collaborate on annotating the images, so multiple people can work on annotating a single dataset.
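Because annotation tools can save bounding boxes in different layouts, converting between them is a common chore. The sketch below assumes two widely used conventions: corner format `(xmin, ymin, xmax, ymax)` and the `[x, y, width, height]` layout used in COCO-style annotations. The function names are our own, the sketch ignores details such as 0-based versus 1-based pixel indexing, and the exact formats a given tool exports should be checked in its documentation.

```python
def corners_to_xywh(xmin, ymin, xmax, ymax):
    """Convert (xmin, ymin, xmax, ymax) to [x, y, width, height]."""
    if xmax < xmin or ymax < ymin:
        raise ValueError("max coordinates must not be smaller than min coordinates")
    return [xmin, ymin, xmax - xmin, ymax - ymin]


def xywh_to_corners(x, y, width, height):
    """Convert [x, y, width, height] to (xmin, ymin, xmax, ymax)."""
    if width < 0 or height < 0:
        raise ValueError("width and height must be non-negative")
    return (x, y, x + width, y + height)


if __name__ == "__main__":
    box = corners_to_xywh(10, 20, 110, 70)
    print(box)                    # box as [x, y, width, height]
    print(xywh_to_corners(*box))  # back to corner format
```

Converting a box to the other layout and back should return the original values, which makes for an easy sanity check before training on a converted dataset.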
Most of the time, it is very difficult to collect and label the datasets your team specifically needs. One of the best ways to handle this burden is to hire a professional team to do the task for you, and DATUMO is your best solution! Here at DATUMO, we crowdsource our tasks to diverse users located around the globe to deliver the quality and quantity you need on time. Moreover, our in-house managers double-check the quality of the collected or processed data.
In this tutorial, we talked about the importance of a dataset in a Computer Vision application and discussed ways in which we can search for a public image dataset on the Internet. Furthermore, we also explored some popular image datasets which cover dozens of object classes and can be used in our own applications. Lastly, we talked about building our own Image Dataset through scraping and annotation if a public dataset for our application isn’t available.