Data Preprocessing

Garbage In, Garbage Out

Machine learning and data science are among the most prominent technologies of this decade. Machine learning is the driving force behind artificial intelligence, and everybody wants to apply machine learning algorithms to their business. These algorithms work better with better data: feeding in more data generally helps, but only if that data is clean and well prepared. The data therefore needs to be prepared well before it is fed into a machine learning model.

In this article I will explain the steps involved in data preprocessing with Python. "Data preprocessing means the transformations applied to the data before feeding it into a machine learning algorithm."

Steps Involved

Importing Libraries
Dealing with missing data
Cleaning data

Importing Libraries.

We mainly use three libraries: NumPy, pandas and matplotlib. NumPy is the fundamental package for scientific computing with Python and provides the array operations. pandas is the library we use for importing and manipulating datasets. matplotlib is a 2D Python plotting library: you can generate plots, histograms, power spectra, bar charts, error charts, scatterplots, etc., with just a few lines of code.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
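To make the role of each library concrete, here is a minimal sketch that touches all three. The age values and the output file name age_histogram.png are invented for illustration, and the Agg backend is selected only so the snippet runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; an assumption so no display is needed
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# NumPy: array operations on raw numbers (hypothetical ages).
ages = np.array([44, 27, 30, 38, 40, 35])
print(ages.mean())

# pandas: the same values as a labelled table.
frame = pd.DataFrame({"Age": ages})
print(frame.describe())

# matplotlib: a histogram in a few lines.
fig, ax = plt.subplots()
counts, bin_edges, _ = ax.hist(frame["Age"], bins=3)
ax.set_xlabel("Age")
ax.set_ylabel("Count")
fig.savefig("age_histogram.png")  # hypothetical output file name
```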

Importing Datasets.

Datasets are mainly available in CSV or SQL formats. We use pandas to import the dataset.

datasets = pd.read_csv('Data.csv')
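As a minimal sketch of the import step, the snippet below reads a small CSV and inspects it. The column names and values are made up for illustration, and the CSV text is read from memory through io.StringIO so the example runs without a Data.csv file on disk.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for Data.csv (values are invented).
csv_text = """Country,Age,Salary,Purchased
France,44,72000,No
Spain,27,48000,Yes
Germany,30,,No
Spain,,61000,Yes
"""

# pd.read_csv accepts a file path or any file-like object.
datasets = pd.read_csv(io.StringIO(csv_text))

print(datasets.shape)         # (rows, columns)
print(datasets.head())        # first rows of the table
print(datasets.isna().sum())  # number of missing values per column
```

Checking the missing-value counts right after loading shows which columns will need attention in the later steps.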

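After importing the dataset we need to separate the matrix of features from the dependent variable. We use X for the matrix of features and Y for the dependent variable. To assign X and Y we use iloc from the pandas library, which selects rows and columns by position. The code below assumes the dependent variable is the last column of the dataset.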

X = datasets.iloc[:, :-1].values
Y = datasets.iloc[:, -1].values
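The slicing can be checked on a tiny table. This is a sketch with invented column names and values; the only assumption carried over from the code above is that the dependent variable sits in the last column.

```python
import pandas as pd

# Hypothetical data: three feature columns followed by the dependent variable.
datasets = pd.DataFrame({
    "Country": ["France", "Spain", "Germany"],
    "Age": [44, 27, 30],
    "Salary": [72000, 48000, 54000],
    "Purchased": ["No", "Yes", "No"],
})

# iloc selects by integer position: [rows, columns].
X = datasets.iloc[:, :-1].values  # all rows, every column except the last
Y = datasets.iloc[:, -1].values   # all rows, only the last column

print(X.shape)  # (3, 3)
print(Y.shape)  # (3,)
```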

Missing Data.

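Sometimes the dataset contains missing values. We can either remove the entire row or replace the missing value with the mean or median of its column. For that we use the SimpleImputer class from the sklearn.impute module. (Older tutorials import Imputer from sklearn.preprocessing; that class was removed in scikit-learn 0.22.)

from sklearn.impute import SimpleImputer

The SimpleImputer class takes several arguments, including missing_values and strategy. It always works column by column, so the axis argument of the old Imputer class is no longer needed.

imputer = SimpleImputer(missing_values = np.nan, strategy = "mean")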

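Now we need to fit our imputer object to the columns of the matrix of features that contain missing values (here columns 2 and 3, which must be numeric for the mean strategy), and then replace those columns with the imputed values.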

imputer = imputer.fit(X[:,2:4])
X[:,2:4] = imputer.transform(X[:,2:4])
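The two options for missing data, dropping rows or filling in the column mean, can be compared side by side. The sketch below uses invented values and the SimpleImputer class from sklearn.impute, which is the imputer available in current scikit-learn releases.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical data with one missing Age and one missing Salary.
datasets = pd.DataFrame({
    "Age": [44.0, 27.0, np.nan, 38.0],
    "Salary": [72000.0, 48000.0, 54000.0, np.nan],
})

# Option 1: remove every row that contains a missing value.
dropped = datasets.dropna()

# Option 2: replace each missing value with the mean of its column.
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
filled = imputer.fit_transform(datasets[["Age", "Salary"]].values)

print(dropped.shape)  # only the complete rows remain
print(filled)         # same shape as the input, with no NaN left
```

Dropping rows loses half of this small table, while imputation keeps every row. Passing strategy="median" instead of "mean" switches to the column median, which is less sensitive to outliers.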