How to Create a Dataset with Python?
Sometimes, to test models or run simulations, you may need to create a dataset with Python. Rather than importing real-world data, we will generate this simulated dataset from scratch with a couple of lines of code. In this article, I will show how to create a dataset for regression, classification, and clustering problems using Python.
How to create a dataset for a classification problem with Python?
To create a dataset for a classification problem with Python, we use the make_classification function available in the scikit-learn library. Let’s import the libraries.
from sklearn.datasets import make_regression, make_classification, make_blobs
import pandas as pd
import matplotlib.pyplot as plt
By default, make_classification returns two ndarrays: one containing the variables/features and one containing the target/output. To generate a classification dataset, the function takes the following parameters:
- n_samples: the number of samples/rows.
- n_features: the number of features/columns.
- n_informative: the number of features that actually contribute to predicting the output.
- n_redundant: the number of redundant features, generated as random linear combinations of the informative ones.
- n_repeated: the number of features duplicated from the informative and redundant ones.
- n_classes: the number of classes/labels for the classification problem.
- weights: the proportion of samples assigned to each class. Passing None gives balanced classes.
- random_state: the seed for the random number generator, so the same dataset can be reproduced.
Let’s go ahead and generate the classification dataset using the above parameters.
# How to create a dataset for a classification problem
variables, target = make_classification(
    n_samples = 1000,
    n_features = 12,
    n_informative = 7,
    n_redundant = 3,
    n_repeated = 2,
    n_classes = 4,
    # Class distribution: 20% class 0, 20% class 1,
    # 30% class 2 and 30% class 3
    weights = [.2, .2, .3, .3],
    random_state = 8)
Let’s visualize the variable dataframe.
# View some sample rows of the classification dataset
pd.DataFrame(variables,
    columns=["col_name "+ str(i) for i in range(variables.shape[1])])
col_name 0 col_name 1 col_name 2 col_name 3 col_name 4 col_name 5 col_name 6 col_name 7 col_name 8 col_name 9 col_name 10 col_name 11
0 0.331723 0.509881 -0.175577 1.075973 -0.831905 -6.934965 1.075973 2.287544 -0.963859 -5.114866 -0.831905 -1.146492
1 0.195848 1.620299 1.106739 1.367136 -1.896274 -0.183035 1.367136 -2.046089 -0.047922 -0.678143 -1.896274 0.624624
2 -1.110011 -0.556873 1.688196 -1.093200 -0.784965 -3.192678 -1.093200 0.687897 -1.036923 -1.843979 -0.784965 -1.493613
3 -2.201790 -1.327092 -0.612005 4.771862 1.333772 -4.806040 4.771862 -0.140957 -3.054470 -0.449136 1.333772 -2.776526
4 -0.309355 -1.481214 -0.348062 -0.323696 1.007431 0.398226 -0.323696 0.890199 0.330016 1.649357 1.007431 -0.162269
... ... ... ... ... ... ... ... ... ... ... ... ...
995 2.667335 -0.417175 2.833952 0.914925 0.527134 -3.312023 0.914925 3.440490 1.359939 2.461748 0.527134 2.621739
996 -1.580031 -0.866029 0.304893 4.219823 -2.325976 -8.120002 4.219823 1.534977 4.459700 0.684456 -2.325976 3.559462
997 -4.297403 -0.778561 -1.248011 -2.836641 1.092944 3.943337 -2.836641 -1.993144 0.334962 1.419927 1.092944 -0.862761
998 0.996732 0.883047 5.827281 0.396347 -2.989223 -2.003943 0.396347 -1.810176 -2.486957 -0.346523 -2.989223 -1.164679
999 -0.362272 -1.051120 -1.444873 -0.235483 1.939468 -1.478216 -0.235483 2.507234 0.776497 0.359553 1.939468 0.404604
Let’s visualize the output dataframe.
# View the target column
pd.DataFrame(target).head()
0
0 2
1 0
2 2
3 0
4 1
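If you prefer to work with a single table, you can also combine the features and the target into one DataFrame. Here is a minimal sketch, reusing the variables and target arrays generated above; the "label" column name is just an arbitrary choice:
# Combine the generated features and target into a single DataFrame
df = pd.DataFrame(variables,
    columns=["col_name " + str(i) for i in range(variables.shape[1])])
df["label"] = target  # "label" is an arbitrary column name
df.head()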
Finally, if you have a multilabel classification problem, you can use the make_multilabel_classification function to generate your data. The procedure is similar to the one above.
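For illustration, a minimal sketch of such a call could look like this; the parameter values below are arbitrary choices, not recommendations:
# How to create a dataset for a multilabel classification problem
from sklearn.datasets import make_multilabel_classification

X_ml, y_ml = make_multilabel_classification(
    n_samples = 1000,  # number of samples/rows
    n_features = 10,   # number of features/columns
    n_classes = 3,     # number of possible labels
    n_labels = 2,      # average number of labels per sample
    random_state = 8)

# y_ml is a binary indicator matrix of shape (n_samples, n_classes)
print(X_ml.shape, y_ml.shape)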
How to create a dataset for a regression problem with Python?
Similarly to make_classification, the make_regression function returns by default two ndarrays: one containing the variables/features and one containing the target/output. To generate a regression dataset, the function takes the following parameters:
- n_samples: the number of samples/rows.
- n_features: the number of features/columns.
- n_informative: the number of informative features.
- n_targets: the number of regression targets/outputs. A value of 2 means each sample will have 2 outputs.
- noise: the standard deviation of the Gaussian noise added to the output.
- shuffle: whether to shuffle the samples and the features.
- coef: whether to return the coefficients of the underlying linear model.
- random_state: the seed for the random number generator, so the same dataset can be reproduced.
Let’s go ahead and generate the regression dataset using the above parameters.
# How to create a dataset for a regression problem
variables, target = make_regression(
    n_samples = 1000,
    n_features = 10,
    n_informative = 8,
    n_targets = 1,
    noise = .5,
    random_state = 8)
Let’s visualize the variable dataframe.
# View some sample rows of the regression dataset
pd.DataFrame(variables,
    columns=["col_name "+ str(i) for i in range(variables.shape[1])]).head()
col_name 0 col_name 1 col_name 2 col_name 3 col_name 4 col_name 5 col_name 6 col_name 7 col_name 8 col_name 9
0 -1.840549 1.242176 -1.390003 -0.722689 1.632301 -0.577799 -0.222739 0.405041 -0.571550 -0.553471
1 -0.537855 1.669537 0.903670 0.420327 1.219688 2.431992 1.267145 -0.345046 0.020319 -0.540176
2 -1.142355 -1.221051 0.107340 -1.336418 0.876019 0.055713 0.306274 1.181244 0.479989 0.429356
3 0.178291 0.150575 -0.693476 -0.789158 0.307253 -0.693905 -0.280938 1.329255 0.949204 -1.348216
4 0.669372 1.204146 0.123464 -0.288869 0.012149 0.923918 -0.332575 -0.403318 0.728422 -0.046430
Let’s visualize the output dataframe.
# Visualize the target dataframe
pd.DataFrame(target, columns=['Bills']).head()
Bills
0 37.230636
1 238.246502
2 84.744504
3 14.400635
4 70.771206
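As a quick sanity check, you can fit a simple model on the simulated data and confirm that it picks up the signal. Below is a minimal sketch, reusing the variables and target arrays generated above; LinearRegression is just one possible choice of model:
# Fit a simple linear model on the simulated regression data
from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(variables, target)

# With a low noise level, the R^2 score should be close to 1
print(model.score(variables, target))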
How to create a dataset for a clustering problem with Python?
By default, the make_blobs function returns two ndarrays: one containing the variables/features/columns with the data, and one containing the target/output with the cluster label of each sample. To generate a clustering dataset, the function takes the following parameters:
- n_samples: the number of samples/rows. Passed as an integer, it divides the points equally among the clusters; passed as an array, each element gives the number of samples in the corresponding cluster.
- n_features: the number of features/columns.
- centers: the number of centers to generate, or fixed center locations.
- cluster_std: the standard deviation of each cluster.
- shuffle: whether to shuffle the samples.
- random_state: the seed for the random number generator, so the same dataset can be reproduced.
Let’s go ahead and generate the clustering dataset using the above parameters.
# How to create a dataset for a clustering problem
X, y = make_blobs(
    n_samples = 1000,
    n_features = 2,
    centers = 5,
    cluster_std = 0.5,
    shuffle = True,
    random_state = 9)
Let’s visualize the various clusters.
# Plot the generated clusters, colouring each point by its cluster label
plt.figure(figsize=(16,10))
plt.scatter(X[:,0],
    X[:,1],
    c = y)
plt.show()
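To check that the generated clusters are actually recoverable, you can run a clustering algorithm on the data. Below is a minimal sketch using KMeans, reusing the X and y arrays generated above; KMeans and the value of 5 simply mirror the centers parameter used earlier:
# Cluster the simulated data and compare with the true cluster labels
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

kmeans = KMeans(n_clusters = 5, random_state = 9)
predicted = kmeans.fit_predict(X)

# A score close to 1 means the generated clusters were recovered
print(adjusted_rand_score(y, predicted))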
Bonus on creating your own dataset with Python
The above were the main ways to create a handmade dataset for your data science tests. There are even more ways to generate datasets, and you can also load real-world data for free. Those datasets and functions are all available in the scikit-learn library, under sklearn.datasets. Feel free to check it out.
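As one illustration, here is a minimal sketch that loads the classic Iris dataset bundled with scikit-learn; it assumes a recent scikit-learn version, since the as_frame option was added in 0.23:
# Load a small real-world dataset bundled with scikit-learn
from sklearn.datasets import load_iris

iris = load_iris(as_frame = True)

# iris.frame holds the features together with the target column
print(iris.frame.head())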
If you made it this far in the article, I would like to thank you so much.
I hope it was of use to you.
Feel free to use any information from this page. I’d appreciate it if you could simply link to this article as the source. If you have any additional questions, you can reach out to malick@malicksarr.com or message me on Twitter. If you want more content like this, join my email list to receive the latest articles. I promise I do not spam.
If you liked this article, maybe you will like these too.
Hyperparameter Tuning with Random Search in Python
How to Split your Dataset to Train, Test and Validation sets? [Python]
Hyperparameter Tuning with Grid Search in Python
SQL Data Science: Most Common Queries all Data Scientists should know
Predicting heart diseases with Python