5 Questions to Ask Yourself before Working with a Dataset

August 10, 2020

Recently I was looking for datasets to analyze and make predictions from for a blog post project. I got stuck because I kept running into datasets that were fun but not good, and that matters: AI projects often struggle in part because of data-related issues.

What makes a dataset good or bad? What are some things you should consider when looking at a dataset? How should you interpret a dataset, from beginning to end? Read on to find out!

5 questions to ask yourself before working with a dataset

1. How was the data compiled? 

Was it aggregated from multiple sources? Did you compile it yourself? If it came from different sources, you may have to reformat the data so that values within a given attribute are consistent: dates, times, addresses, states and cities (are they abbreviated?), numbers like currency amounts, and so on. The data should also be comprehensive, meaning it is complete, readable, and understandable.
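For example, here is a minimal pandas sketch of normalizing dates, state names, and currency strings pulled from different sources. The column names and values are made up for illustration:

```python
import pandas as pd

# Made-up records in which each source formatted things differently.
raw = pd.DataFrame({
    "date":  ["2020-08-10", "08/09/2020", "Aug 8, 2020"],
    "state": ["CA", "California", "ca"],
    "price": ["$19.99", "25", "12.50 USD"],
})

# Parse each date variant individually into one datetime type.
raw["date"] = raw["date"].apply(pd.to_datetime)

# Map full state names and odd casing onto one abbreviation.
raw["state"] = raw["state"].str.lower().map({"california": "CA", "ca": "CA"})

# Strip currency symbols and labels so prices become plain floats.
raw["price"] = raw["price"].str.replace(r"[^0-9.]", "", regex=True).astype(float)

print(raw)
```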

2. Is the data accurate? 

The dataset should not contain data that is already summarized, like averages or totals: each of those single numbers already stands in for a whole series of data points, so they will not help you model the underlying data.

If the data was aggregated from multiple sources, are there any outliers or elements that stand out? You may have to check the values and see whether they make sense: if a basketball player had been shooting poorly, suddenly shot well for one game, and then went back to playing poorly, that one good game could be a bad data point. You may have to modify the data so an outlier or questionable value does not skew your model; one quick way to flag such values is sketched below.
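A common rule of thumb is to flag anything more than 1.5 interquartile ranges outside the middle of the data. Here it is sketched with made-up shooting percentages, where the 0.80 game is the suspicious one:

```python
import pandas as pd

# Hypothetical per-game shooting percentages for one player;
# the 0.80 game is the "one good game" from the example above.
games = pd.Series([0.31, 0.28, 0.35, 0.80, 0.30, 0.33])

# Flag values more than 1.5 IQRs outside the quartiles.
q1, q3 = games.quantile([0.25, 0.75])
iqr = q3 - q1
outliers = games[(games < q1 - 1.5 * iqr) | (games > q3 + 1.5 * iqr)]
print(outliers)  # flags the 0.80 game for a closer look
```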

3. Is the data clean?

Data cleaning is the process of preparing data for analysis by removing or editing parts of it. That may include the following steps (a short pandas sketch after the list walks through each one):

  1. Searching for and removing extra spaces
  2. Converting some strings or text to integers, or vice versa
  3. Finding and removing duplicates
  4. Handling blank cells
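Here is a minimal pandas sketch of those four steps, using a made-up table:

```python
import pandas as pd

# Hypothetical messy table illustrating the four cleanup steps above.
df = pd.DataFrame({
    "name":  ["  Ada ", "Grace", "Grace", None],
    "score": ["90", "85", "85", "70"],
})

df["name"] = df["name"].str.strip()          # 1. remove extra spaces
df["score"] = df["score"].astype(int)        # 2. convert strings to integers
df = df.drop_duplicates()                    # 3. drop duplicate rows
df["name"] = df["name"].fillna("unknown")    # 4. handle blank cells
print(df)
```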

Record sampling is the process of removing records (rows) whose values are missing, incorrect, or unrepresentative, in order to improve your predictions. If there is too much missing data, the dataset is probably not very accurate and will be difficult to model, interpret, and analyze, so you should look for a new one.

However, in machine learning, assumed or approximated values are often better for an algorithm than missing values, so you may consider substituting missing values with dummy values such as "n/a" for categorical data, or 0 or the column mean for numerical data. (Fully fabricated records go a step further and are usually called synthetic data, which is used for both training and testing datasets when there is not enough real data.)
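A small pandas sketch of both substitutions, again with made-up columns:

```python
import pandas as pd

# Hypothetical records with gaps in a numeric and a categorical column.
df = pd.DataFrame({
    "age":  [34, None, 29, None],
    "city": ["Austin", None, "Boston", "Austin"],
})

# Numeric gap: substitute the column mean (or 0, depending on the model).
df["age"] = df["age"].fillna(df["age"].mean())

# Categorical gap: substitute a dummy "n/a" label.
df["city"] = df["city"].fillna("n/a")
print(df)
```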

4. How much data should you have?

This is a common question in machine learning. You need a lot of training data, but is what you have too much, too little, or enough? The amount you need depends on your project; as a rule, working with text, images, or video means you need more data. Where the model will run also matters: a smaller dataset is often enough for a working demo, but putting the model into production requires more data.

Without knowing the context of what someone is using the data for, aim for a few thousand samples, and no fewer than a few hundred. Ideally, most modeling problems should have tens to hundreds of thousands of samples, and tougher deep learning problems should have millions.
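If you already have a candidate model, one way to sanity-check dataset size is a learning curve: train on growing slices of the data and watch the validation score. If the score is still climbing at the largest slice, more data would likely help; if it has flattened, collecting more may not be worth it. A scikit-learn sketch on a built-in toy dataset:

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

# Score a simple classifier on 5 growing slices of the data,
# with 5-fold cross-validation at each slice.
X, y = load_digits(return_X_y=True)
sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=1000), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=5)

for n, score in zip(sizes, val_scores.mean(axis=1)):
    print(f"{n:5d} samples -> validation accuracy {score:.3f}")
```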

You may also wish to perform attribute sampling: if you know the value you want to predict (the target attribute), you can make an educated guess about which attributes will be valuable, keep those, and ignore the rest instead of adding extra dimensions and complexity to the dataset.
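One simple way to do this is univariate feature selection: score each attribute against the target and keep only the most predictive ones. A scikit-learn sketch on a toy dataset, where k=4 is an arbitrary choice for illustration:

```python
from sklearn.datasets import load_diabetes
from sklearn.feature_selection import SelectKBest, f_regression

# Score each attribute against the numeric target and keep the top 4,
# rather than carrying every column into the model.
X, y = load_diabetes(return_X_y=True)
selector = SelectKBest(score_func=f_regression, k=4).fit(X, y)

kept = selector.get_support(indices=True)
print("keeping attribute columns:", kept)
```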

5. Remember why: what problem do you want to tackle?

What do you hope to glean from this dataset? Which column or row is the most important?

  1. Regression predicts a numeric value; example problems include predicting the cost of a house or a vacation.
  2. Ranking problems predict an ordered list of objects, such as ordering query results. This is handy if you want to return a list of items someone is most likely to buy or songs they are most likely to enjoy.
  3. Classification predicts a label. Problems include binary "yes-or-no" questions like "is this picture a corgi or a blueberry muffin?" as well as multi-class problems like "is this good, bad, or average?". With classification, the correct answers need to be labeled so your algorithm can learn from them.
  4. Clustering problems also predict a label, but without ground truth: they group a set of observations into subsets (clusters) so that observations in the same cluster share common aspects. This is handy for companies that want to analyze their customers and see whether some share similarities like age, location, or gender. (A sketch contrasting classification and clustering follows the table below.)

| Classification | Clustering |
| --- | --- |
| Used in supervised learning | Used in unsupervised learning |
| Classifies instances according to their class labels | Groups instances by similarity, without class labels |
| Needs labels, so it also needs training and testing datasets to verify the model | Needs no training or testing datasets |
| Common in the financial sector, e.g. detecting fraud | Common in retail and online shopping, e.g. recommending a series of items |
| More complex | Less complex |
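To make the table concrete, here is a small scikit-learn sketch that runs both approaches on the same toy dataset: a classifier that learns from labels and is verified on a held-out test split, and a clusterer that never sees the labels at all.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Classification (supervised): learns from labels, so we hold out a
# test split to verify the model, as the table notes.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("classification accuracy:", clf.score(X_te, y_te))

# Clustering (unsupervised): groups the same rows by similarity alone,
# never seeing y at all.
clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print("cluster assignments:", clusters[:10])
```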

What's next for datasets?

Hopefully you now know what to look for when examining and working with datasets. Some public commercial datasets cost money, but there are nice free ones on GitHub, in Google's Open Images dataset, and on Kaggle. Looking at some commonly used machine learning algorithms may also help when interpreting a dataset. Let me know in the comments or online what you're working on; I can't wait to see what you build.