Data Pre-processing using Scikit-learn

Rutvi Pan
3 min read · Aug 23, 2021

Dataset Description

Here I use a house pricing dataset. It contains information about each house, such as kitchen, bedroom, house style, area, street, and price.

Data Encoding

Data encoding is the transformation of categorical variables into binary or numerical counterparts. Each category is assigned a unique value. There are two types of data encoding: (1) label encoding and (2) one-hot encoding.

(1) Label encoding

If a column in the dataset contains more than one category, we can use a label encoder to convert those categories into numerical values, with one integer per category.
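As a minimal sketch of the idea above, the snippet below runs scikit-learn's LabelEncoder on a few made-up house style values. The values are hypothetical and are not taken from the actual dataset.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical values for a house style column (assumed for illustration).
house_style = ["1Story", "2Story", "1Story", "SLvl"]

encoder = LabelEncoder()
# fit_transform learns the sorted set of categories and maps each one to an integer.
codes = encoder.fit_transform(house_style)

print(list(encoder.classes_))  # categories in sorted order
print(codes.tolist())          # integer code for each input value
```

Note that scikit-learn documents LabelEncoder as a tool for target labels; for input feature columns, OrdinalEncoder does the same kind of mapping on 2D data.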

(2) One-hot encoding

One-hot encoding does the same thing in a different way. A label encoder assigns a single number to each category, whereas a one-hot encoder creates a new binary column for each category.
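The difference can be seen in a small sketch with scikit-learn's OneHotEncoder. The street values here are hypothetical examples, not rows from the actual dataset.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical values for a street column (assumed for illustration).
# OneHotEncoder expects 2D input: one row per sample, one column per feature.
street = np.array([["Pave"], ["Grvl"], ["Pave"]])

encoder = OneHotEncoder(handle_unknown="ignore")
# The default output is a sparse matrix; toarray() makes it a dense array.
one_hot = encoder.fit_transform(street).toarray()

print(encoder.categories_[0].tolist())  # one output column per category
print(one_hot)                          # each row has a single 1 in its category's column
```

Here handle_unknown="ignore" is a design choice: a category not seen during fitting is encoded as all zeros instead of raising an error.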


[Figures: the data before normalization and after normalization]


Imputing Missing Values

Missing data are values that are not recorded in a dataset. A single value may be missing from one cell, or an entire observation (row) may be missing. Missing data can occur in both continuous variables (e.g. height of students) and categorical variables (e.g. gender of a population).

Simple Imputer
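A minimal sketch of SimpleImputer, assuming a numeric area column with one missing entry. The numbers are made up for illustration, and the mean strategy is only one possible choice.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical area values with one missing entry (np.nan).
area = np.array([[1200.0], [np.nan], [1800.0], [1500.0]])

# strategy="mean" replaces each missing value with the mean of the observed values in its column.
imputer = SimpleImputer(strategy="mean")
filled = imputer.fit_transform(area)

print(imputer.statistics_)  # the value learned for each column
print(filled.ravel())
```

For a categorical column, strategy="most_frequent" fills the gaps with the most common category instead.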


Quantile Discretization Transform

Uniform Discretization Transform

KMeans Discretization Transform
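The three discretization transforms named above can be sketched with scikit-learn's KBinsDiscretizer, which takes the binning method as its strategy parameter. The area values and the choice of three bins are assumptions made for illustration.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Hypothetical area values (assumed for illustration).
area = np.array([[600.0], [800.0], [1000.0], [1200.0], [1500.0], [3000.0]])

results = {}
for strategy in ("quantile", "uniform", "kmeans"):
    # encode="ordinal" returns the bin index (0, 1, 2) instead of one-hot columns.
    discretizer = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy=strategy)
    bins = discretizer.fit_transform(area).ravel().astype(int)
    results[strategy] = bins.tolist()
    print(strategy, results[strategy])
```

Roughly, quantile puts about the same number of samples in each bin, uniform makes all bins equally wide, and kmeans places bin edges between the centers of one-dimensional k-means clusters.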