How to Create a Dataset with Python?

September 25, 2021 Python, Data Analysis, Data Science by Malick Sarr

Sometimes, to test models or run simulations, you may need to create a dataset with Python. Rather than importing real-world data, we will generate a simulated (fake) dataset from scratch with a few lines of code. In this article, I will show how to create datasets for regression, classification, and clustering problems using Python.

How to create a dataset for a classification problem with Python?

To create a dataset for a classification problem with Python, we use the make_classification function available in the scikit-learn library. Let’s import the libraries we need.

from sklearn.datasets import make_regression, make_classification, make_blobs
import pandas as pd
import matplotlib.pyplot as plt

By default, make_classification returns two ndarrays: the variables/features and the target/output. The main parameters for generating a classification dataset are:

n_samples: the number of samples/rows.
n_features: the number of features/columns.
n_informative: the number of features that have a role in the prediction of the output.
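n_redundant: the number of redundant features, generated as random linear combinations of the informative features.
n_repeated: the number of duplicated features, drawn at random from the informative and redundant features.
n_classes: the number of classes/labels for the classification problem.
weights: the proportion of samples assigned to each output/class. Passing None (the default) means balanced classes.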

Let’s go ahead and generate the classification dataset using the above parameters.

# How to create a dataset for a classification problem
variables, target  = make_classification(
                    n_samples = 1000,
                    n_features = 12,
                    n_informative = 7,
                    n_redundant = 3,
                    n_repeated = 2,
                    n_classes = 4,
                    # Distribution of classes: 20% each for classes 0 and 1,
                    # 30% each for classes 2 and 3
                    weights = [.2,.2, .3, .3],
                    random_state = 8)

Let’s visualize the variable dataframe.

# View some sample rows of the classification dataset
pd.DataFrame(variables, 
columns=["col_name "+ str(i) for i in range(variables.shape[1])])


     col_name 0	    col_name 1	    col_name 2	    col_name 3	   col_name 4	   col_name 5	   col_name 6	    col_name 7	    col_name 8	    col_name 9	    col_name 10	    col_name 11
0	0.331723	0.509881	-0.175577	1.075973	-0.831905	-6.934965	1.075973	2.287544	-0.963859	-5.114866	-0.831905	-1.146492
1	0.195848	1.620299	1.106739	1.367136	-1.896274	-0.183035	1.367136	-2.046089	-0.047922	-0.678143	-1.896274	0.624624
2	-1.110011	-0.556873	1.688196	-1.093200	-0.784965	-3.192678	-1.093200	0.687897	-1.036923	-1.843979	-0.784965	-1.493613
3	-2.201790	-1.327092	-0.612005	4.771862	1.333772	-4.806040	4.771862	-0.140957	-3.054470	-0.449136	1.333772	-2.776526
4	-0.309355	-1.481214	-0.348062	-0.323696	1.007431	0.398226	-0.323696	0.890199	0.330016	1.649357	1.007431	-0.162269
...	...	...	...	...	...	...	...	...	...	...	...	...
995	2.667335	-0.417175	2.833952	0.914925	0.527134	-3.312023	0.914925	3.440490	1.359939	2.461748	0.527134	2.621739
996	-1.580031	-0.866029	0.304893	4.219823	-2.325976	-8.120002	4.219823	1.534977	4.459700	0.684456	-2.325976	3.559462
997	-4.297403	-0.778561	-1.248011	-2.836641	1.092944	3.943337	-2.836641	-1.993144	0.334962	1.419927	1.092944	-0.862761
998	0.996732	0.883047	5.827281	0.396347	-2.989223	-2.003943	0.396347	-1.810176	-2.486957	-0.346523	-2.989223	-1.164679
999	-0.362272	-1.051120	-1.444873	-0.235483	1.939468	-1.478216	-0.235483	2.507234	0.776497	0.359553	1.939468	0.404604
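The table above hints at what n_repeated does: some columns carry identical values (for instance, col_name 3 and col_name 6). Here is a minimal sketch to check this, assuming the same parameters as in the example above. The names duplicate_pairs and n_distinct_columns are only illustrative, and the exact pairs you get may depend on your scikit-learn version.

```python
import numpy as np
from sklearn.datasets import make_classification

variables, target = make_classification(
    n_samples=1000, n_features=12, n_informative=7, n_redundant=3,
    n_repeated=2, n_classes=4, weights=[.2, .2, .3, .3], random_state=8)

# Collect every pair of columns whose values are exactly identical
n_cols = variables.shape[1]
duplicate_pairs = [
    (i, j)
    for i in range(n_cols)
    for j in range(i + 1, n_cols)
    if np.array_equal(variables[:, i], variables[:, j])
]
print(duplicate_pairs)

# Count how many columns remain once exact copies are collapsed
n_distinct_columns = np.unique(variables, axis=1).shape[1]
print(n_distinct_columns)
```

With 12 features of which 2 are repeated, we would expect 10 distinct columns.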

Let’s visualize the output dataframe.

# View the target column
pd.DataFrame(target).head()
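To check that the weights parameter did its job, we can count how many samples landed in each class. This is a small sketch that assumes the same parameters as above. Note that the proportions are only approximately 20/20/30/30, because make_classification randomly reassigns a small fraction of the labels by default (flip_y=0.01).

```python
import numpy as np
from sklearn.datasets import make_classification

variables, target = make_classification(
    n_samples=1000, n_features=12, n_informative=7, n_redundant=3,
    n_repeated=2, n_classes=4, weights=[.2, .2, .3, .3], random_state=8)

# Count the samples in each class and convert the counts to proportions
counts = np.bincount(target, minlength=4)
proportions = counts / counts.sum()
print(counts)
print(proportions)
```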

One last word: if you have a multilabel classification problem, you can use the make_multilabel_classification function to generate your data. The procedure is similar to the one above.
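As a rough illustration of the multilabel case, here is a minimal sketch. The parameter values are arbitrary choices for this example, not recommendations. The main difference is that the target is a 2D indicator array with one column per class, so a sample can belong to several classes at once.

```python
from sklearn.datasets import make_multilabel_classification

X_multi, Y_multi = make_multilabel_classification(
    n_samples=1000,
    n_features=12,
    n_classes=4,
    n_labels=2,       # average number of labels per sample
    random_state=8)

print(X_multi.shape)  # (1000, 12)
print(Y_multi.shape)  # (1000, 4): one 0/1 column per class
print(Y_multi[:5])
```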

How to create a dataset for regression problems with Python?

Like make_classification, the make_regression function returns by default two ndarrays: the variables/features and the target/output. The main parameters for generating a regression dataset are:

n_samples: the number of samples/rows
n_features: the number of features/columns
n_informative: the number of informative variables
n_targets: the number of regression targets/outputs. A value of 2 means each sample will have 2 outputs.
noise: the standard deviation of the Gaussian noise applied to the output.
shuffle: whether to shuffle the samples and the features.
coef: whether to also return the coefficients of the underlying linear model.
random_state: the seed for the random number generator, so that the same dataset can be reproduced later.

Let’s go ahead and generate the regression dataset using the above parameters.

# How to create a dataset for a regression problem
variables, target = make_regression(n_samples = 1000,
                                    n_features =  10,
                                    n_informative = 8,
                                    n_targets = 1,
                                    noise = .5,
                                    random_state = 8
                                    )
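To see what random_state buys us, the sketch below generates the dataset twice with the same seed and once with a different one. The params dictionary and the variable names are just conveniences for this example.

```python
import numpy as np
from sklearn.datasets import make_regression

params = dict(n_samples=1000, n_features=10, n_informative=8,
              n_targets=1, noise=.5)

X_a, y_a = make_regression(random_state=8, **params)
X_b, y_b = make_regression(random_state=8, **params)
X_c, y_c = make_regression(random_state=9, **params)

# Same seed: identical data. Different seed: different data.
same_seed_matches = np.array_equal(X_a, X_b) and np.array_equal(y_a, y_b)
different_seed_matches = np.array_equal(X_a, X_c)
print(same_seed_matches, different_seed_matches)
```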

Let’s visualize the variable dataframe.

# View some sample rows of the regression dataset
pd.DataFrame(variables,
             columns=["col_name "+ str(i) for i in range(variables.shape[1])]).head()

	col_name 0	col_name 1	col_name 2	col_name 3	col_name 4	col_name 5	col_name 6	col_name 7	col_name 8	col_name 9
0	-1.840549	1.242176	-1.390003	-0.722689	1.632301	-0.577799	-0.222739	0.405041	-0.571550	-0.553471
1	-0.537855	1.669537	0.903670	0.420327	1.219688	2.431992	1.267145	-0.345046	0.020319	-0.540176
2	-1.142355	-1.221051	0.107340	-1.336418	0.876019	0.055713	0.306274	1.181244	0.479989	0.429356
3	0.178291	0.150575	-0.693476	-0.789158	0.307253	-0.693905	-0.280938	1.329255	0.949204	-1.348216
4	0.669372	1.204146	0.123464	-0.288869	0.012149	0.923918	-0.332575	-0.403318	0.728422	-0.046430

Let’s visualize the output dataframe.

# Visualize the target dataframe
pd.DataFrame(target, columns=['Bills']).head()

        Bills
0	37.230636
1	238.246502
2	84.744504
3	14.400635
4	70.771206
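If you also want to know which linear model produced the target, you can pass coef=True. The sketch below assumes the same parameters as above. It checks that only the informative features get a non-zero coefficient, and that what remains after removing the linear part is roughly the noise we asked for.

```python
import numpy as np
from sklearn.datasets import make_regression

X, y, coef = make_regression(
    n_samples=1000, n_features=10, n_informative=8,
    n_targets=1, noise=.5, coef=True, random_state=8)

# Only the informative features have a non-zero coefficient
n_nonzero = np.count_nonzero(coef)
print(coef.shape, n_nonzero)

# The target is X @ coef plus Gaussian noise, so the residuals
# should have a standard deviation close to the noise parameter
residuals = y - X @ coef
print(residuals.std())
```

This is handy when testing a model: you can compare the coefficients it learns against the true ones.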

How to create a dataset for a clustering problem with Python?

By default, the make_blobs function returns two ndarrays: the variables/features containing the data, and the target/output containing the cluster label of each sample. The main parameters for generating a clustering dataset are:

n_samples: the number of samples/rows. Passed as an integer, it is the total number of points, divided equally among the clusters. Passed as an array, each element gives the number of samples for one cluster.
n_features: the number of features/columns
centers: the number of centers to generate, or an array of fixed center locations.
cluster_std: the standard deviation of each cluster
shuffle: whether to shuffle the rows/samples
random_state: the seed for the random number generator used to create the dataset

Let’s go ahead and generate the clustering dataset using the above parameters.

# How to create a dataset for a clustering problem
X, y = make_blobs(
        n_samples = 1000,
        n_features = 2,
        centers = 5,
        cluster_std = 0.5,
        shuffle = True,
        random_state = 9)

Let’s visualize the various clusters.

# Create the figure first, then color each point by its cluster label
plt.figure(figsize=(16,10))
plt.scatter(X[:,0],
            X[:,1],
            c = y)
plt.show()
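As mentioned in the parameter list, n_samples and centers also accept arrays. Here is a minimal sketch with clusters of unequal size around fixed locations. The center coordinates and cluster sizes are arbitrary values picked for illustration.

```python
import numpy as np
from sklearn.datasets import make_blobs

# Hypothetical center locations, one row per cluster
fixed_centers = np.array([[0.0, 0.0], [5.0, 5.0], [-5.0, 5.0]])

X, y = make_blobs(
    n_samples=[100, 200, 300],      # one size per cluster
    centers=fixed_centers,
    cluster_std=[0.5, 1.0, 1.5],    # one spread per cluster
    random_state=9)

cluster_sizes = np.bincount(y)
print(cluster_sizes)  # [100 200 300]

# The mean of each cluster should sit close to its requested center
cluster_means = np.array([X[y == k].mean(axis=0) for k in range(3)])
print(cluster_means)
```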

Bonus on creating your own dataset with Python

These are the main ways to create a handmade dataset for your data science tests. There are more built-in generators for synthetic datasets, and even real-world datasets available for free. Those datasets and functions are all available in the scikit-learn library, under sklearn.datasets. Feel free to check it out.
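To give a taste of what else sklearn.datasets offers, here is a short sketch using two functions I picked as examples: make_moons, another synthetic generator, and load_iris, a small real-world dataset bundled with the library. The column name "species" is my own choice.

```python
import pandas as pd
from sklearn.datasets import load_iris, make_moons

# Two interleaving half circles, useful for testing non-linear classifiers
X_moons, y_moons = make_moons(n_samples=200, noise=0.1, random_state=0)
print(X_moons.shape)  # (200, 2)

# A real-world dataset that ships with scikit-learn (no download needed)
iris = load_iris()
iris_df = pd.DataFrame(iris.data, columns=iris.feature_names)
iris_df["species"] = iris.target
print(iris_df.head())
```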

If you made it this far in the article, thank you so much.

I hope it was of use to you.

Feel free to use any information from this page. I’d appreciate it if you can simply link to this article as the source. If you have any additional questions, you can reach out to malick@malicksarr.com or message me on Twitter. If you want more content like this, join my email list to receive the latest articles. I promise I do not spam.
