Python-Jupyter basics tutorial for ML 2018 course

This notebook contains some information and examples for getting started with Jupyter and Python.

Now we can start with the tutorial. The first thing is to import the necessary modules.

In [1]:
# we will use the pandas module because it allows us to work with R-like dataframes
import pandas as pd

# often we need some functions from numpy as well
import numpy as np

# the next two lines force Jupyter to output all results from a cell (by default only the last one is shown)
# a semicolon (;) at the end of a line suppresses that line's output
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
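
To see the effect of these settings, here is a small hypothetical cell you can try: with ast_node_interactivity set to "all", every bare expression is displayed, while a trailing semicolon suppresses that particular line.

x = 5
x + 1   # displayed, because all results are now shown
x + 2;  # the trailing semicolon suppresses this output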

1. Data manipulation with pandas dataframes

A pandas dataframe is a convenient structure for working with data and offers a lot of useful functionality for data analysis. If you are already familiar with R dataframes, this will feel very similar.

1.1 Creating a dataframe

In [2]:
# You can create a dataframe from scratch if you want to
# (although you will probably rarely need to during this course)
# you have to specify column names and data
df = pd.DataFrame({ 'A' : 1.,
                    'B' : pd.Timestamp('20130102'),
                    'C' : pd.Series([1,2,3,4],dtype='int32'), # you can be very specific about the data types
                    'D' : [4,2,3,4], # if you are not, pandas will try to guess
                    'E' : pd.Categorical(["test","train","test","train"]),
                    'F' : ["test", "train", "test", "train"],
                    'G' : 'foo' })
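
Since the comment above mentions that pandas guesses types when you are not specific, it can be useful to check what it actually chose; the dtypes attribute shows the type of each column.

df.dtypes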

1.2 Information about a dataframe

In [3]:
# you can see information about the data in different ways
df.head() # first rows
df.tail(2) # last rows
df.shape # dimensions
df.describe() # summary statistics
Out[3]:
A B C D E F G
0 1.0 2013-01-02 1 4 test test foo
1 1.0 2013-01-02 2 2 train train foo
2 1.0 2013-01-02 3 3 test test foo
3 1.0 2013-01-02 4 4 train train foo
Out[3]:
A B C D E F G
2 1.0 2013-01-02 3 3 test test foo
3 1.0 2013-01-02 4 4 train train foo
Out[3]:
(4, 7)
Out[3]:
A C D
count 4.0 4.000000 4.000000
mean 1.0 2.500000 3.250000
std 0.0 1.290994 0.957427
min 1.0 1.000000 2.000000
25% 1.0 1.750000 2.750000
50% 1.0 2.500000 3.500000
75% 1.0 3.250000 4.000000
max 1.0 4.000000 4.000000

1.3 Series

In [4]:
# we can also create a Series - every dataframe column is a Series as well
s = pd.Series([1,5,2,6,4,5])

# and we can count how many times each value occurs
s.value_counts()

# as already mentioned, we can also do this on dataframe columns
# (these two commands are identical)
df["D"].value_counts()
df.D.value_counts()
Out[4]:
5    2
6    1
4    1
2    1
1    1
dtype: int64
Out[4]:
4    2
3    1
2    1
Name: D, dtype: int64
Out[4]:
4    2
3    1
2    1
Name: D, dtype: int64

1.4 Reading a dataframe from file

In [5]:
# usually we read the dataset from a file (for example a csv); you can download this one from the course webpage
irisdf = pd.read_csv("iris.csv", header=None)

# you can read directly from a URL as well
irisdf = pd.read_csv("https://courses.cs.ut.ee/MTAT.03.227/2018_spring/uploads/Main/iris.csv", header = None)

# we can assign (new) column names
irisdf.columns = ["sepal_length", "sepal_width", "petal_length", "petal_width", "species"]

# see the data
irisdf.head()

# data size
irisdf.shape

# ask for a specific statistic of a column
irisdf.sepal_length.mean()

# or of all columns
irisdf.mean()
Out[5]:
sepal_length sepal_width petal_length petal_width species
0 5.1 3.5 1.4 0.2 Iris-setosa
1 4.9 3.0 1.4 0.2 Iris-setosa
2 4.7 3.2 1.3 0.2 Iris-setosa
3 4.6 3.1 1.5 0.2 Iris-setosa
4 5.0 3.6 1.4 0.2 Iris-setosa
Out[5]:
(150, 5)
Out[5]:
5.843333333333335
Out[5]:
sepal_length    5.843333
sepal_width     3.054000
petal_length    3.777181
petal_width     1.212162
dtype: float64

1.5 Working with NA's

In [6]:
# how many NA elements there are in each column
irisdf.isnull().sum()
Out[6]:
sepal_length    0
sepal_width     0
petal_length    1
petal_width     2
species         0
dtype: int64
In [7]:
# remove rows that have NA's
irisdf = irisdf.dropna()

# now the shape is
irisdf.shape
Out[7]:
(147, 5)
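
Dropping rows is not the only way to handle NA's. A common alternative is to fill them in, for example with the column mean; this is just a sketch for reference, since we already dropped the rows above.

filled = irisdf.fillna(irisdf.mean())  # fills NA's in numeric columns with the column mean
filled.isnull().sum()                  # should now show zero NA's in the numeric columns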
In [8]:
# you can also write data to a file
irisdf.to_csv("iris_no_na.csv")
In [9]:
# we can see that there is a typo in one of the class names; let's fix it
irisdf.species.value_counts()
Out[9]:
Iris-versicolor    50
Iris-virginica     49
Iris-setosa        47
Iris-virginicas     1
Name: species, dtype: int64
In [10]:
# after fixing the typo we have 3 classes, as we should
irisdf.species = irisdf.species.replace("Iris-virginicas", "Iris-virginica")
irisdf.species.value_counts()
Out[10]:
Iris-versicolor    50
Iris-virginica     50
Iris-setosa        47
Name: species, dtype: int64

1.6 Subsetting dataframes

In [11]:
# selecting one or more columns (the head() command is just for convenient printing)
irisdf["sepal_length"].head()
irisdf[["sepal_length", "sepal_width"]].head()
Out[11]:
0    5.1
1    4.9
2    4.7
3    4.6
4    5.0
Name: sepal_length, dtype: float64
Out[11]:
sepal_length sepal_width
0 5.1 3.5
1 4.9 3.0
2 4.7 3.2
3 4.6 3.1
4 5.0 3.6
In [12]:
# selecting rows
irisdf[irisdf["sepal_width"] > 4].head()
irisdf[(irisdf["sepal_width"] > 4) & (irisdf["petal_length"] > 1.4)].head()
irisdf[(irisdf["sepal_width"] > 4) | ((irisdf["species"] == "Iris-versicolor") & (irisdf["petal_length"] > 1.4))].head()
Out[12]:
sepal_length sepal_width petal_length petal_width species
15 5.7 4.4 1.5 0.4 Iris-setosa
32 5.2 4.1 1.5 0.1 Iris-setosa
33 5.5 4.2 1.4 0.2 Iris-setosa
Out[12]:
sepal_length sepal_width petal_length petal_width species
15 5.7 4.4 1.5 0.4 Iris-setosa
32 5.2 4.1 1.5 0.1 Iris-setosa
Out[12]:
sepal_length sepal_width petal_length petal_width species
15 5.7 4.4 1.5 0.4 Iris-setosa
32 5.2 4.1 1.5 0.1 Iris-setosa
33 5.5 4.2 1.4 0.2 Iris-setosa
50 7.0 3.2 4.7 1.4 Iris-versicolor
51 6.4 3.2 4.5 1.5 Iris-versicolor
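
Besides boolean masks, you can also subset by row/column labels with loc or by integer positions with iloc; here is a small sketch of both.

irisdf.loc[0:4, ["sepal_length", "species"]]  # rows by index label (end inclusive), columns by name
irisdf.iloc[0:5, 0:2]                         # rows and columns by integer position (end exclusive)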

1.7 Adding new columns

In [13]:
# adding new columns
irisdf["petal_sum"] = irisdf["petal_length"] + irisdf["petal_width"]
irisdf["petal_max"] = irisdf.apply(lambda row: max([row["petal_length"], row["petal_width"]]), axis=1)
irisdf["flower"] = irisdf.apply(lambda row: "small" if row["sepal_length"] < 5 else "big", axis=1)

irisdf.head()
irisdf.flower.value_counts()
Out[13]:
sepal_length sepal_width petal_length petal_width species petal_sum petal_max flower
0 5.1 3.5 1.4 0.2 Iris-setosa 1.6 1.4 big
1 4.9 3.0 1.4 0.2 Iris-setosa 1.6 1.4 small
2 4.7 3.2 1.3 0.2 Iris-setosa 1.5 1.3 small
3 4.6 3.1 1.5 0.2 Iris-setosa 1.7 1.5 small
4 5.0 3.6 1.4 0.2 Iris-setosa 1.6 1.4 big
Out[13]:
big      128
small     19
Name: flower, dtype: int64
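
The apply calls above run a Python function on every row, which is flexible but slow on large data. For a simple condition like the "flower" column, a vectorized alternative using numpy (already imported above) gives the same result faster.

irisdf["flower"] = np.where(irisdf["sepal_length"] < 5, "small", "big")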

2. Plotting

In [14]:
import matplotlib.pyplot as plt

# makes plots appear inline in the notebook
%matplotlib inline 

# makes inline plots higher quality (you can replace svg with retina as well)
%config InlineBackend.figure_format = 'svg'

# changes the default style of plots - search online for more choices
plt.style.use("ggplot")

2.1 Example 1 - scatterplot with outliers

In [15]:
# a plain scatterplot first
irisdf.plot(x="sepal_length", y="sepal_width", kind="scatter")

# calculating the IQR (difference between the 75% and 25% quantiles)
IQR = irisdf["sepal_width"].quantile(0.75) - irisdf["sepal_width"].quantile(0.25) # interquartile range

# selecting the outliers (rows with "extreme" values - not all of them are necessarily true outliers)
outliers = irisdf[irisdf["sepal_width"] > irisdf["sepal_width"].quantile(0.75) + 1.5*IQR]

# remove outliers from other data
usual = irisdf.drop(outliers.index)

# plot outliers with different color
ax = usual.plot(x="sepal_length", y="sepal_width", kind="scatter")
outliers.plot(x="sepal_length", y="sepal_width", kind="scatter", c="red", ax=ax)
ax.set_title("Cool scatterplot")
ax.set_ylabel("I can change this")
ax.set_xlim([0,10]);
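
Note that the filter above only flags unusually high values. The common 1.5*IQR rule is symmetric, so a version that checks both tails could look like this sketch.

low = irisdf["sepal_width"].quantile(0.25) - 1.5*IQR   # lower fence
high = irisdf["sepal_width"].quantile(0.75) + 1.5*IQR  # upper fence
both_tails = irisdf[(irisdf["sepal_width"] < low) | (irisdf["sepal_width"] > high)]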

2.2 Histogram

In [16]:
irisdf["sepal_length"].hist();

2.3 Barplot for categorical values

In [17]:
irisdf.species.value_counts().plot(kind="bar");

2.4 Grouped barplot

In [18]:
irisdf.groupby(["species", "flower"])["sepal_length"].count().unstack().plot(kind="bar");
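
The chained command above packs several steps into one line; if it looks opaque, the same computation split into steps might be clearer.

counts = irisdf.groupby(["species", "flower"])["sepal_length"].count()  # count per (species, flower) pair
table = counts.unstack()  # move the inner index level (flower) into columns
table.plot(kind="bar");   # one group of bars per species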

2.5 Seaborn module

Seaborn is a plotting module for Python that makes it easy to produce some cool and quite complicated plots. It might not be in the default Anaconda installation, so you may need to install it (conda install seaborn).

In [19]:
import seaborn as sns

sns.pairplot(irisdf[["sepal_length", "petal_length", "species"]], hue="species");

3. Machine Learning with scikit-learn

In [20]:
# we have a dataset with numerical and categorical features
irisdf.head()
Out[20]:
sepal_length sepal_width petal_length petal_width species petal_sum petal_max flower
0 5.1 3.5 1.4 0.2 Iris-setosa 1.6 1.4 big
1 4.9 3.0 1.4 0.2 Iris-setosa 1.6 1.4 small
2 4.7 3.2 1.3 0.2 Iris-setosa 1.5 1.3 small
3 4.6 3.1 1.5 0.2 Iris-setosa 1.7 1.5 small
4 5.0 3.6 1.4 0.2 Iris-setosa 1.6 1.4 big

Scikit-learn only works with numeric matrices, which means we have to convert our categorical values into numeric ones. One way is to use a LabelEncoder, which associates every category with a number. This does not always work well, because it can introduce an order into the data that is not really there (if "dog", "mouse", "cat" became 1, 2, 3, would that mean a mouse is somehow the average of a dog and a cat?). It is usually better to do one-hot encoding and convert every categorical value of a feature into a separate binary feature. The pandas package has a very simple method for doing just that.
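
As a toy illustration of the dog/mouse/cat point (hypothetical data, not part of the iris dataset), one-hot encoding turns each category into its own binary column, so no artificial order is introduced.

pets = pd.Series(["dog", "mouse", "cat"])
pd.get_dummies(pets)  # three binary columns: cat, dog, mouse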

In [21]:
from sklearn.preprocessing import LabelEncoder

# we can use LabelEncoder to convert the column flower into a numeric vector, but this might introduce an unwanted order
enc = LabelEncoder()
flower_num = enc.fit_transform(irisdf["flower"])
flower_num[:10]
Out[21]:
array([0, 1, 1, 1, 0, 0, 1, 0, 1, 1])
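
If you want to know which category each number stands for, the fitted encoder keeps them in its classes_ attribute (the position in the array is the encoded value).

enc.classes_  # classes_[0] is the category encoded as 0, and so on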
In [22]:
# pandas has a method get_dummies that converts the categorical values of a feature into new binary features
irisdf_dummies = pd.get_dummies(irisdf, columns = ["flower"])
irisdf_dummies.head()
Out[22]:
sepal_length sepal_width petal_length petal_width species petal_sum petal_max flower_big flower_small
0 5.1 3.5 1.4 0.2 Iris-setosa 1.6 1.4 1 0
1 4.9 3.0 1.4 0.2 Iris-setosa 1.6 1.4 0 1
2 4.7 3.2 1.3 0.2 Iris-setosa 1.5 1.3 0 1
3 4.6 3.1 1.5 0.2 Iris-setosa 1.7 1.5 0 1
4 5.0 3.6 1.4 0.2 Iris-setosa 1.6 1.4 1 0
In [23]:
# next we can create our training and test sets with the train_test_split method
# here the training set will be 75% of the data and the test set 25%
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(irisdf_dummies.drop(["species"], axis=1),
                                                    irisdf_dummies.species, test_size=0.25, random_state=0)
In [24]:
# the results of the splitting are pandas DataFrames (for X_) and Series (for y_)
X_train.shape, X_test.shape, y_train.shape, y_test.shape
Out[24]:
((110, 8), (37, 8), (110,), (37,))
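
With a small dataset, a random split can leave the classes unevenly represented between train and test. train_test_split accepts a stratify argument that keeps the class proportions roughly equal in both sets; here is a sketch (not used for the results below, so the variable names differ).

Xtr, Xte, ytr, yte = train_test_split(irisdf_dummies.drop(["species"], axis=1),
                                      irisdf_dummies.species, test_size=0.25,
                                      random_state=0, stratify=irisdf_dummies.species)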
In [25]:
from sklearn.neighbors import KNeighborsClassifier

# next we fit our model on the training set; we have chosen a KNN model with 1 neighbor
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X_train, y_train)
Out[25]:
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=1, p=2,
           weights='uniform')
In [26]:
# we can ask the model to calculate the accuracy directly
knn.score(X_train, y_train)
knn.score(X_test, y_test)
Out[26]:
1.0
Out[26]:
0.97297297297297303
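
A single train/test split gives a rather noisy accuracy estimate. To average over several splits you could use cross-validation, for example with cross_val_score (shown here as a sketch, not used for the results below).

from sklearn.model_selection import cross_val_score
scores = cross_val_score(knn, X_train, y_train, cv=5)  # 5-fold cross-validation on the training set
scores.mean()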
In [27]:
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# we can also let the model predict the values for the test set
y_pred = knn.predict(X_test)
print(y_pred[:10])

# and calculate the accuracy
print("Accuracy:", accuracy_score(y_test, y_pred))

# report for other classification measures
print("Classification report:")
print(classification_report(y_test, y_pred))

# and the confusion matrix
print("Confusion matrix:")
print(confusion_matrix(y_test, y_pred))
['Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-virginica'
 'Iris-versicolor' 'Iris-versicolor' 'Iris-setosa' 'Iris-setosa'
 'Iris-virginica' 'Iris-setosa']
Accuracy: 0.972972972973
Classification report:
                 precision    recall  f1-score   support

    Iris-setosa       1.00      1.00      1.00        15
Iris-versicolor       0.93      1.00      0.96        13
 Iris-virginica       1.00      0.89      0.94         9

    avg / total       0.97      0.97      0.97        37

Confusion matrix:
[[15  0  0]
 [ 0 13  0]
 [ 0  1  8]]
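
The rows of the confusion matrix are the true classes and the columns the predicted ones, ordered as in knn.classes_. Wrapping the matrix in a dataframe makes the labels explicit; a small sketch.

pd.DataFrame(confusion_matrix(y_test, y_pred),
             index=knn.classes_, columns=knn.classes_)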