Data Pre-processing using Scikit-learn

Rutvi Pan
3 min read · Aug 23, 2021

Dataset Description

Here I use a house pricing dataset. It contains information about each house, such as kitchen, bedroom, house style, area, street, and price.

Data Encoding

Data encoding is the transformation of categorical variables into binary or numerical counterparts: every category of a categorical attribute is assigned a unique numerical value. There are two types of data encoding: (1) label encoding and (2) one-hot encoding.

(1) Label encoding

If a column in the dataset has more than one category, we can use a label encoder to convert those categories into numerical values. Each category is mapped to its own integer.
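As a minimal sketch of label encoding, the snippet below uses scikit-learn's LabelEncoder. The house-style values are hypothetical and chosen only for illustration; they are not taken from the actual dataset.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical values for a house-style column (assumed for illustration).
house_styles = ["1Story", "2Story", "1Story", "SLvl"]

encoder = LabelEncoder()
# The distinct categories are sorted, then numbered 0, 1, 2, ...
encoded = encoder.fit_transform(house_styles)

print(encoder.classes_.tolist())  # ['1Story', '2Story', 'SLvl']
print(encoded.tolist())           # [0, 1, 0, 2]
```

Note that the scikit-learn documentation describes LabelEncoder as intended for target labels. For input feature columns, OrdinalEncoder gives the same kind of integer mapping.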

(2) One-hot encoding

A one-hot encoder does the same job in a different way. A label encoder assigns a particular number to each category, whereas a one-hot encoder creates a whole new binary column for each category.
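To make the "one new column per category" idea concrete, here is a small sketch with scikit-learn's OneHotEncoder. The street values are hypothetical and only stand in for a categorical column.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical street types as a single feature column (n_samples x 1).
streets = np.array([["Pave"], ["Grvl"], ["Pave"]])

encoder = OneHotEncoder()
# The default output is a sparse matrix; toarray() makes it dense for display.
one_hot = encoder.fit_transform(streets).toarray()

print(encoder.categories_[0].tolist())  # ['Grvl', 'Pave']
print(one_hot.tolist())                 # [[0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
```

Each row has a 1 in the column of its own category and 0 everywhere else, so no artificial ordering is introduced between categories.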

Normalization

Before Normalization
After Normalization
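A minimal sketch of normalization, assuming that min-max scaling to the range [0, 1] is what is meant here. It uses scikit-learn's MinMaxScaler, and the area values are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical house areas (assumed for illustration), one feature column.
areas = np.array([[1000.0], [1500.0], [2000.0]])

scaler = MinMaxScaler()  # default feature_range is (0, 1)
# Each value becomes (x - min) / (max - min), computed per column.
normalized = scaler.fit_transform(areas)

print(normalized.ravel())  # approximately [0.0, 0.5, 1.0]
```

The smallest value in the column maps to 0 and the largest to 1, with everything else placed proportionally in between.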

Standardization
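Standardization rescales a feature so that it has zero mean and unit variance. The sketch below uses scikit-learn's StandardScaler on hypothetical price values; the numbers are assumptions for illustration only.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical house prices (assumed for illustration), one feature column.
prices = np.array([[100000.0], [150000.0], [200000.0], [350000.0]])

scaler = StandardScaler()
# Each value becomes (x - mean) / standard deviation, computed per column.
standardized = scaler.fit_transform(prices)

# After scaling, the column mean is about 0 and its standard deviation about 1.
print(standardized.mean(axis=0), standardized.std(axis=0))
```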

Imputing Missing Values

Missing data are values that are not recorded in a dataset. They can be a single value missing from a single cell or an entire missing observation (row). Missing data can occur both in a continuous variable (e.g. height of students) and in a categorical variable (e.g. gender of a population).

Simple Imputer
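As a small sketch, scikit-learn's SimpleImputer can replace each missing value with a statistic of its column. The array below is hypothetical, and the mean strategy is just one possible choice.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical numeric data with missing entries marked as NaN.
X = np.array([[1.0, np.nan],
              [3.0, 4.0],
              [np.nan, 8.0]])

# Replace every NaN with the mean of the observed values in its column.
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
filled = imputer.fit_transform(X)

print(filled.tolist())  # [[1.0, 6.0], [3.0, 4.0], [2.0, 8.0]]
```

For categorical columns, strategy="most_frequent" fills gaps with the most common value instead of the mean.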

Discretization

Quantile Discretization Transform
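A minimal sketch of quantile discretization, assuming scikit-learn's KBinsDiscretizer with strategy="quantile". The values and the choice of three bins are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Hypothetical skewed feature values (assumed for illustration).
values = np.array([1, 2, 3, 4, 5, 6, 20, 22, 60], dtype=float).reshape(-1, 1)

# Bin edges are placed at quantiles, so bins hold roughly equal numbers of samples.
quantile_binner = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="quantile")
quantile_bins = quantile_binner.fit_transform(values)

# Count how many samples landed in each of the three bins.
quantile_counts = np.bincount(quantile_bins.ravel().astype(int), minlength=3)
print(quantile_counts.tolist())  # [3, 3, 3]
```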

Uniform Discretization Transform
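A minimal sketch of uniform discretization, again assuming scikit-learn's KBinsDiscretizer, this time with strategy="uniform". The values and the number of bins are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Hypothetical skewed feature values (assumed for illustration).
values = np.array([1, 2, 3, 4, 5, 6, 20, 22, 60], dtype=float).reshape(-1, 1)

# All bins have the same width: (max - min) / n_bins.
uniform_binner = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="uniform")
uniform_bins = uniform_binner.fit_transform(values)

print(uniform_bins.ravel().tolist())
# [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 2.0]
```

Because the bins are equally wide, skewed data can leave most samples in the first bin, as happens here.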

KMeans Discretization Transform
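A minimal sketch of k-means discretization, assuming scikit-learn's KBinsDiscretizer with strategy="kmeans". The values and the number of bins are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Hypothetical skewed feature values (assumed for illustration).
values = np.array([1, 2, 3, 4, 5, 6, 20, 22, 60], dtype=float).reshape(-1, 1)

# A 1D k-means clustering is fitted and bin edges are placed between
# neighbouring cluster centers, so each value joins its nearest cluster.
kmeans_binner = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="kmeans")
kmeans_bins = kmeans_binner.fit_transform(values)

print(kmeans_bins.ravel().tolist())
```

With this strategy the bin boundaries follow the natural groupings in the data instead of fixed widths or fixed counts.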
