Your First Machine Learning Project in Python Step-By-Step

November 05, 2019

Downloading, Installing and Starting Python SciPy

SciPy (pronounced “Sigh Pie”) is a Python-based ecosystem of open-source software for mathematics, science, and engineering. These are some of its core packages:

NumPy
SciPy library
Matplotlib
IPython
Sympy
pandas

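Ways to install:

python -m pip install --user numpy scipy matplotlib ipython jupyter pandas sympy nose scikit-learn

The --user flag installs the packages for the current user only, so administrator rights are not needed.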

Install System-wide via a Package Manager

sudo apt-get install python-numpy python-scipy python-matplotlib ipython ipython-notebook python-pandas python-sympy python-nose

Check versions of libraries:

# Python version
import sys

print('Python: {}'.format(sys.version))
# scipy
import scipy
print('scipy: {}'.format(scipy.__version__))
# numpy
import numpy
print('numpy: {}'.format(numpy.__version__))
# matplotlib
import matplotlib
print('matplotlib: {}'.format(matplotlib.__version__))
# pandas
import pandas
print('pandas: {}'.format(pandas.__version__))
# scikit-learn
import sklearn
print('sklearn: {}'.format(sklearn.__version__))

Define Problem.

Prepare Data.

Dataset

We are going to use the iris flowers dataset.

150 observations of iris flowers.
3 classes of 50 instances each
Attribute Information

sepal length in cm
sepal width in cm
petal length in cm
petal width in cm
class

4.9,3.1,1.5,0.2,"Iris-setosa"

Load the dataset

Import libraries

import pandas
from pandas.plotting import scatter_matrix
import matplotlib.pyplot as plt
from sklearn import model_selection
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

Load the Dataset

url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/iris.csv"
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']
dataset = pandas.read_csv(url, names=names)

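read_csv loads the CSV file into a pandas DataFrame. The names argument supplies the column labels, because the file has no header row.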
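To see what the names argument changes, here is a minimal sketch that reads two illustrative rows from an in-memory string instead of the URL. The csv_text value and the variable names sample and no_names are assumptions made for this example only.

```python
import io
import pandas

# Two illustrative rows in the same header-less layout as iris.csv.
csv_text = '5.1,3.5,1.4,0.2,Iris-setosa\n4.9,3.0,1.4,0.2,Iris-setosa\n'
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']

# With names=, every line is treated as data and the columns get our labels.
sample = pandas.read_csv(io.StringIO(csv_text), names=names)
print(sample.shape)
print(list(sample.columns))

# Without names=, pandas uses the first line as the header, so one row is lost.
no_names = pandas.read_csv(io.StringIO(csv_text))
print(no_names.shape)
```

Passing names is what keeps the first observation from being consumed as a header.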

Investigate the Dataset

Dimensions of Dataset

# shape
print(dataset.shape)

Look at the first 20 rows of data

# head
print(dataset.head(20))

Summarize all numeric columns

# descriptions
print(dataset.describe())

Class distribution

# class distribution
print(dataset.groupby('class').size())
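As a small illustration of what groupby('class').size() returns, the sketch below counts rows per class in a made-up five-row DataFrame. The toy data is an assumption for the example, not part of the iris file.

```python
import pandas

# Made-up rows: two setosa, two versicolor, one virginica.
toy = pandas.DataFrame({
    'petal-length': [1.4, 1.3, 4.7, 4.5, 6.0],
    'class': ['Iris-setosa', 'Iris-setosa', 'Iris-versicolor',
              'Iris-versicolor', 'Iris-virginica'],
})

# size() returns a Series indexed by class label with the row count per group.
counts = toy.groupby('class').size()
print(counts)
```

On the real dataset the same call should report 50 rows for each of the three classes, as described above.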

Data Visualization

The plot method on a pandas Series or DataFrame is a simple wrapper around matplotlib.pyplot.plot().
On a DataFrame, plot() is a convenient way to plot all of the columns with labels.

The distribution of the input attributes

dataset.plot(kind='box', subplots=True, layout=(2,2), sharex=False, sharey=False)
plt.show()

The histogram of each input variable

dataset.hist()
plt.show()

The interactions between the attributes

scatter_matrix(dataset)
plt.show()

Create a Validation Dataset

We will split the loaded dataset in two: 80% to train our models and 20% held back as a validation dataset.

# retrieve the numpy array
array = dataset.values
# X is the input: all four measurement columns (indices 0 to 3)
X = array[:,0:4]
# Y is the label: the class column
Y = array[:,4]
validation_size = 0.20
seed = 7
X_train, X_validation, Y_train, Y_validation = model_selection.train_test_split(X, Y, test_size=validation_size, random_state=seed)

test_size : float, int or None, optional (default=None)

If float, the proportion of the dataset to include in the test split (between 0.0 and 1.0). If int, the absolute number of test samples.
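The following sketch shows how test_size behaves on an array with 150 rows, the same size as the iris dataset. The arrays X_demo and y_demo are placeholder data invented for this example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with the same number of rows as the iris dataset.
X_demo = np.arange(150 * 4).reshape((150, 4))
y_demo = np.arange(150)

# A float test_size is read as a fraction: 20% of 150 rows is 30.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=7)
print(len(X_tr), len(X_te))

# An int test_size is read as an absolute number of test rows.
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    X_demo, y_demo, test_size=30, random_state=7)
print(len(X_tr2), len(X_te2))
```

Either form leaves 120 rows for training here, and no row appears in both the training and the test part.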

Examples of train_test_split:

>>> import numpy as np
>>> from sklearn.model_selection import train_test_split
>>> X, y = np.arange(10).reshape((5, 2)), range(5)
>>> X
array([[0, 1],
       [2, 3],
       [4, 5],
       [6, 7],
       [8, 9]])
>>> list(y)
[0, 1, 2, 3, 4]

>>> X_train, X_test, y_train, y_test = train_test_split( X, y, test_size=0.33, random_state=42)
>>> X_train
array([[4, 5],
       [0, 1],
       [6, 7]])
>>> y_train
[2, 0, 3]
>>> X_test
array([[2, 3],
       [8, 9]])
>>> y_test
[1, 4]

We now have training data in X_train and Y_train for fitting models, and validation data in X_validation and Y_validation for checking the fitted models.

Evaluate Algorithms.

We will use 10-fold cross-validation: split the training data into 10 parts, train on 9, test on the remaining 1, and repeat so that every part serves as the test set once.
We use the same random number seed for every algorithm so that each is evaluated on exactly the same data splits, which makes the results directly comparable.

seed = 7
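The splitting scheme described above can be sketched with KFold on a small placeholder array. The 20-element samples array is an assumption chosen so that each fold is easy to count. In recent scikit-learn releases, shuffle=True is needed for random_state to be accepted.

```python
import numpy as np
from sklearn.model_selection import KFold

# Placeholder data: 20 samples, so 10 folds hold 2 test samples each.
samples = np.arange(20)
kfold = KFold(n_splits=10, shuffle=True, random_state=7)

test_counts = np.zeros(len(samples), dtype=int)
fold_sizes = []
for train_idx, test_idx in kfold.split(samples):
    # Record (training size, test size) and mark which samples were tested.
    fold_sizes.append((len(train_idx), len(test_idx)))
    test_counts[test_idx] += 1

print(fold_sizes[0])
print(test_counts)
```

Every sample lands in the test part of exactly one fold, and creating another KFold with the same random_state reproduces the same splits.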

Build Models

Let’s evaluate 6 different algorithms:

Logistic Regression (LR)
Linear Discriminant Analysis (LDA)
K-Nearest Neighbors (KNN)
Classification and Regression Trees (CART)
Gaussian Naive Bayes (NB)
Support Vector Machines (SVM)

Let’s build and evaluate our models:

models = []
models.append(('LR', LogisticRegression(solver='liblinear', multi_class='ovr')))
models.append(('LDA', LinearDiscriminantAnalysis()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('CART', DecisionTreeClassifier()))
models.append(('NB', GaussianNB()))
models.append(('SVM', SVC(gamma='auto')))
# evaluate each model in turn
results = []
names = []
for name, model in models:
    # shuffle=True is required for random_state to take effect in recent scikit-learn
    kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=seed)
    cv_results = model_selection.cross_val_score(model, X_train, Y_train, cv=kfold, scoring='accuracy')
    results.append(cv_results)
    names.append(name)
    msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
    print(msg)
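The imports at the top include accuracy_score, confusion_matrix and classification_report, which the code so far does not use. A minimal sketch of how they might be applied to the held-back validation set follows. It assumes scikit-learn's bundled copy of the iris data (load_iris) instead of the CSV download, and it picks KNN arbitrarily rather than as the best model from the comparison.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Assumption: the bundled iris data stands in for the CSV loaded earlier.
X, Y = load_iris(return_X_y=True)
X_train, X_validation, Y_train, Y_validation = train_test_split(
    X, Y, test_size=0.20, random_state=7)

# Fit on the training part only, then predict the unseen validation part.
knn = KNeighborsClassifier()
knn.fit(X_train, Y_train)
predictions = knn.predict(X_validation)

accuracy = accuracy_score(Y_validation, predictions)
# labels= fixes the matrix to 3x3 even if a class were missing from the split.
matrix = confusion_matrix(Y_validation, predictions, labels=[0, 1, 2])
print(accuracy)
print(matrix)
print(classification_report(Y_validation, predictions))
```

The diagonal of the confusion matrix counts the correct predictions, so its sum divided by the number of validation samples equals the accuracy.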

I'm Jay's father
