Machine Learning A-Z
Introduction
1.1 Applications of ML
- Facebook image/face tagging.
- Kinect motion detection.
- VR headset movement.
- Speech to text, text to speech.
- Swipe keyboard prediction.
- IoT robot dogs.
- Facebook ads.
- Amazon and Netflix use ML for recommender systems.
- Used in the field of medicine for detection.
- Used to explore new areas through space satellites.
- Exploring new areas like Mars.
1.2 ML is the future
- From the dawn of time up until 2005, humans created 130 exabytes of data.
- By 2010, that had grown to 1,200 exabytes.
- By 2015, it was 7,900 exabytes.
- By 2020, 40,900 exabytes of data will have been created.
1.3 Installations
- Install Anaconda.
- Install Anaconda 4.2.0 if you run into compatibility issues.
2 Data preprocessing
2.1 Get the data sets
The first data set contains 4 columns: country, age, salary and purchased.
It is a data set of a company's customers, with each customer's information and whether or not they purchased the company's product. The first 3 columns are the independent variables/features. The last column is the dependent variable/label, which is what we need to predict.
2.3 Importing the libraries
Let’s begin by importing necessary libraries.
import numpy as np # n-dim array math library
import matplotlib.pyplot as plt # plotting/charting library
import pandas as pd # importing data sets and managing data sets
2.4 Importing the dataset
2.4.1 Setting the working directory in spyder
In Spyder's file explorer, click the button that sets the current folder as the working directory. Alternatively, save the .py file in the same folder as the CSV file and run it from there (F5 to run), which sets that folder as the working directory.
dataset = pd.read_csv('data.csv') # importing the CSV file
You can verify the above import statement by looking at the variable explorer in spyder. Now let’s create our matrix of features.
X = dataset.iloc[:, :-1].values # left of the comma selects the rows, right of it the columns; the last column is excluded because it is Y, not X
Y = dataset.iloc[:, 3].values # indexes in Python start at 0, so the 4th column has index 3
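As a minimal sketch of the slicing above, the snippet below builds a small stand-in DataFrame and splits it the same way. The rows are made up for illustration and are not the contents of data.csv; the column names are assumptions based on the description of the data set.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for data.csv; the values are invented for illustration.
dataset = pd.DataFrame({
    "Country": ["France", "Spain", "Germany", "Spain"],
    "Age": [44.0, 27.0, 30.0, np.nan],
    "Salary": [72000.0, 48000.0, np.nan, 61000.0],
    "Purchased": ["No", "Yes", "No", "Yes"],
})

X = dataset.iloc[:, :-1].values  # all rows, every column except the last
Y = dataset.iloc[:, -1].values   # all rows, last column only

print(X.shape)  # (4, 3)
print(Y.shape)  # (4,)
```

Using -1 for the label column instead of 3 gives the same result here and keeps working if more feature columns are added later.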
2.6 Missing data
We shouldn't simply remove the rows that contain missing data, because they may hold other crucial data that we need. A safer method is to fill in the missing values using one of 3 methods: the mean, the median or the mode. If you can't see the full array in the variable explorer, add the code below.
import sys
np.set_printoptions(threshold = sys.maxsize) # print arrays in full; newer NumPy versions reject np.nan as a threshold
Taking care of the missing data
from sklearn.preprocessing import Imputer # note: Imputer was removed in scikit-learn 0.22, where sklearn.impute.SimpleImputer replaces it
# Create an object of the above class
imputer = Imputer(missing_values = 'NaN', strategy = 'mean', axis = 0) # missing_values is 'NaN' because that is what shows up for the missing values in the variable explorer; axis 0 means along the columns
imputer = imputer.fit(X[:, 1:3]) # it's 3 because the upper bound is excluded here
X[:, 1:3] = imputer.transform(X[:, 1:3]) # now we take the transformed values and put them back into the actual data
To inspect any imported method or class, keep the cursor over it and press cmd + i to open up the docs for that method/class.
There are 3 strategies available to fill the missing values: mean, median and most_frequent.
axis = 0 means fill along the columns (each missing value is replaced using the statistic of its own column); axis = 1 means fill along the rows.
We fit the imputer only on the columns of X that contain missing data, hence X[:, 1:3], which means all the rows and columns 1 and 2.
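On newer scikit-learn releases (0.22 and later) the Imputer class is no longer available. The same mean imputation can be sketched with SimpleImputer from sklearn.impute, assuming a recent scikit-learn install. The numbers below are invented for illustration and stand in for the age and salary columns.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical age/salary columns with one missing value each.
X_num = np.array([
    [44.0, 72000.0],
    [27.0, 48000.0],
    [30.0, np.nan],
    [np.nan, 61000.0],
])

# SimpleImputer always works column by column, so it has no axis argument.
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
X_filled = imputer.fit_transform(X_num)

print(imputer.statistics_)  # per-column means learned during fit
print(X_filled)             # NaNs replaced by the mean of their column
```

The strategy argument also accepts "median" and "most_frequent", matching the three strategies listed above.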
To run just a block of code in Spyder, select the block of new code and hit cmd + enter.
2.7 Categorical data
# Encoding categorical data
from sklearn.preprocessing import LabelEncoder
labelencoder_X = LabelEncoder()
X[:, 0] = labelencoder_X.fit_transform(X[:, 0])
We use the fit_transform method because we want to fit and transform at the same time.
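To illustrate, here is a small sketch showing that fit_transform gives the same result as calling fit and then transform. The country values are illustrative and not taken from the data set.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

countries = np.array(["France", "Spain", "Germany", "Spain"])  # illustrative values

# One step: learn the classes and encode them.
encoded_once = LabelEncoder().fit_transform(countries)

# Two steps: fit first, then transform.
encoder = LabelEncoder()
encoder.fit(countries)
encoded_twice = encoder.transform(countries)

print(encoder.classes_)  # the learned classes, in sorted order
print(encoded_once)      # integer code of each country
print((encoded_once == encoded_twice).all())  # True
```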
Now that we have converted the categorical values France, Germany and Spain into the numbers 0, 1 and 2, we need to convert them to a one-hot encoding. Otherwise the ML algorithm will think Germany is greater than France and Spain is greater than Germany, which is not the case.
So let's import the OneHotEncoder from sklearn.preprocessing.
from sklearn.preprocessing import Imputer, OneHotEncoder
...
onehotencoder = OneHotEncoder(categorical_features = [0]) # indices of the columns to one-hot encode; note that this parameter was removed in scikit-learn 0.22
X = onehotencoder.fit_transform(X).toarray()
Now let's apply label encoding to Y.
labelencoder_Y = LabelEncoder()
Y = labelencoder_Y.fit_transform(Y)
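On newer scikit-learn releases, where OneHotEncoder no longer accepts categorical_features, the one-hot encoding of the country column can be sketched with a ColumnTransformer. This assumes a recent scikit-learn install; the feature matrix below is made up for illustration and the transformer name "country" is an arbitrary label.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical feature matrix: country, age, salary.
X = np.array([
    ["France", 44.0, 72000.0],
    ["Spain", 27.0, 48000.0],
    ["Germany", 30.0, 54000.0],
    ["Spain", 38.0, 61000.0],
], dtype=object)

# One-hot encode column 0 and pass the remaining columns through unchanged.
# sparse_threshold=0 asks for a dense array as the result.
ct = ColumnTransformer(
    transformers=[("country", OneHotEncoder(), [0])],
    remainder="passthrough",
    sparse_threshold=0,
)
X_encoded = np.asarray(ct.fit_transform(X), dtype=float)

print(X_encoded.shape)  # (4, 5): 3 one-hot columns + age + salary
print(X_encoded[0])     # France row: one-hot columns first, then age and salary
```

Each country becomes its own 0/1 column, so no ordering between France, Germany and Spain is implied. With this approach the separate LabelEncoder step on the country column is not needed, because OneHotEncoder handles string categories directly.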
2.8 Train set/Test set splitting
# Splitting the dataset into train set and test set
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.2, random_state = 0)
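As a quick sketch of what the split produces, the snippet below runs train_test_split on a small made-up array. The data is hypothetical; only the call itself mirrors the line above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 10 samples with 2 features each, plus binary labels.
X = np.arange(20).reshape(10, 2)
Y = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0, 1])

# test_size=0.2 keeps 20% of the samples for testing;
# random_state fixes the shuffle so the split is reproducible.
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, random_state=0
)

print(X_train.shape, X_test.shape)  # (8, 2) (2, 2)
print(Y_train.shape, Y_test.shape)  # (8,) (2,)
```

Running the call again with the same random_state returns the same rows in each set, which makes results comparable between runs.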