Python-Jupyter (+R) basics tutorial for DM 2017 course

This notebook contains some information and examples for getting started with Jupyter and Python.

We start the tutorial by importing the necessary modules.

In [147]:
# we will use the pandas module because it allows us to work with R-like dataframes
import pandas as pd

# often we need some functions from numpy as well
import numpy as np

# the next two lines make Jupyter show the results of all expressions in a cell (by default only the last one is shown)
# ending a line with a semicolon (;) suppresses the output of that line
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

1. Data manipulation with pandas dataframes

A pandas dataframe is a convenient structure for working with data and has a lot of useful functionality for data analysis. If you are already familiar with R dataframes, this is something very similar.

1.1 Creating a dataframe

In [137]:
# you can create a dataframe from scratch if you want to
# (although you will probably rarely need to during this course)
# you have to specify the column names and the data
df = pd.DataFrame({ 'A' : 1.,
                    'B' : pd.Timestamp('20130102'),
                    'C' : pd.Series([1,2,3,4],dtype='int32'), # you can be very specific about the data types
                    'D' : [4,2,3,4], # if you are not, pandas will try to guess
                    'E' : pd.Categorical(["test","train","test","train"]),
                    'F' : ["test", "train", "test", "train"],
                    'G' : 'foo' })
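
To check which types pandas actually assigned, it can help to look at df.dtypes. The sketch below rebuilds the same dataframe so that it runs on its own. The exact dtype names printed for the timestamp and string columns may differ between pandas versions, so treat the printed names as illustrative.

```python
import pandas as pd

# same dataframe as above, rebuilt so this cell is self-contained
df = pd.DataFrame({'A': 1.,
                   'B': pd.Timestamp('20130102'),
                   'C': pd.Series([1, 2, 3, 4], dtype='int32'),
                   'D': [4, 2, 3, 4],
                   'E': pd.Categorical(["test", "train", "test", "train"]),
                   'F': ["test", "train", "test", "train"],
                   'G': 'foo'})

# one dtype per column; scalars like 1. and 'foo' are repeated for every row
dtypes = df.dtypes
print(dtypes)

# E is a real categorical column, F holds the same values as plain strings
print(isinstance(df["E"].dtype, pd.CategoricalDtype))
print(isinstance(df["F"].dtype, pd.CategoricalDtype))
```

Columns E and F look the same when printed, but only E is stored as a categorical, which is roughly what a factor is in R.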

1.2 Information about a dataframe

In [138]:
# you can see information about the data in different ways
df.head() # first rows
df.tail(2) # last rows
df.shape # dimensions
df.describe() # summary statistics
Out[138]:
A B C D E F G
0 1.0 2013-01-02 1 4 test test foo
1 1.0 2013-01-02 2 2 train train foo
2 1.0 2013-01-02 3 3 test test foo
3 1.0 2013-01-02 4 4 train train foo
Out[138]:
A B C D E F G
2 1.0 2013-01-02 3 3 test test foo
3 1.0 2013-01-02 4 4 train train foo
Out[138]:
(4, 7)
Out[138]:
A C D
count 4.0 4.000000 4.000000
mean 1.0 2.500000 3.250000
std 0.0 1.290994 0.957427
min 1.0 1.000000 2.000000
25% 1.0 1.750000 2.750000
50% 1.0 2.500000 3.500000
75% 1.0 3.250000 4.000000
max 1.0 4.000000 4.000000

1.3 Series

In [139]:
# we can also create series - all dataframe columns are also series
s = pd.Series([1,5,2,6,4,5])

# and we can count different values in there
s.value_counts()

# as mentioned, this also works on the columns of a dataframe
# (these two commands are equivalent; the df.D form only works when the column name is a valid Python name)
df["D"].value_counts()
df.D.value_counts()
Out[139]:
5    2
6    1
4    1
2    1
1    1
dtype: int64
Out[139]:
4    2
3    1
2    1
Name: D, dtype: int64
Out[139]:
4    2
3    1
2    1
Name: D, dtype: int64

1.4 Reading a dataframe from file

In [142]:
# usually we read the dataset from some file (for example csv)
irisdf = pd.read_csv("iris.csv", header=None)

# we can assign (new) column names
irisdf.columns = ["sepal_length", "sepal_width", "petal_length", "petal_width", "species"]

# see the data
irisdf.head()

# data size
irisdf.shape

# ask for specific statistic of a column
irisdf.sepal_length.mean()

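# or of all numeric columns (numeric_only=True skips the text column, which newer pandas versions refuse to average)
irisdf.mean(numeric_only=True)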
Out[142]:
sepal_length sepal_width petal_length petal_width species
0 5.1 3.5 1.4 0.2 Iris-setosa
1 4.9 3.0 1.4 0.2 Iris-setosa
2 4.7 3.2 1.3 0.2 Iris-setosa
3 4.6 3.1 1.5 0.2 Iris-setosa
4 5.0 3.6 1.4 0.2 Iris-setosa
Out[142]:
(150, 5)
Out[142]:
5.843333333333335
Out[142]:
sepal_length    5.843333
sepal_width     3.054000
petal_length    3.777181
petal_width     1.212162
dtype: float64
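
If iris.csv is not at hand, the same reading steps can be tried on a few lines of text. This is a minimal sketch: the three rows are copied from the outputs in this notebook, io.StringIO stands in for the file, and the name toy is only used for illustration. It also shows that the column names can be given directly while reading, through the names argument.

```python
import io
import pandas as pd

# three rows in the same layout as iris.csv (no header line)
csv_text = """5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolor
"""

cols = ["sepal_length", "sepal_width", "petal_length", "petal_width", "species"]

# header=None says the data has no header row; names= sets the column names while reading
toy = pd.read_csv(io.StringIO(csv_text), header=None, names=cols)

# numeric_only=True skips the text column "species"
col_means = toy.mean(numeric_only=True)
print(toy.shape)
print(col_means)
```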

1.5 Working with NA's

In [10]:
# number of NA elements in each column
irisdf.isnull().sum()
Out[10]:
sepal_length    0
sepal_width     0
petal_length    1
petal_width     2
species         0
dtype: int64
In [11]:
# remove rows that contain NAs
irisdf = irisdf.dropna()

# now the shape is
irisdf.shape
Out[11]:
(147, 5)
In [12]:
# you can also write data to a file
irisdf.to_csv("iris_no_na.csv")
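
By default to_csv also writes the row labels (the index) as a first, unnamed column. After dropna() these labels have gaps, so it is often more convenient to leave them out with index=False. The sketch below shows the round trip on a small made-up dataframe and writes to an in-memory buffer instead of a real file.

```python
import io
import pandas as pd

# made-up data with one missing value
toy = pd.DataFrame({"sepal_length": [5.1, None, 4.7],
                    "species": ["Iris-setosa", "Iris-setosa", "Iris-setosa"]})

# dropna() keeps only the complete rows; the original row labels (0 and 2) are kept
clean = toy.dropna()

# write to an in-memory buffer, without the row labels
buffer = io.StringIO()
clean.to_csv(buffer, index=False)

# read it back: the rows are numbered from 0 again
buffer.seek(0)
restored = pd.read_csv(buffer)
print(restored)
```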

1.6 Subsetting dataframes

In [13]:
# selecting only some column/columns (head() command is just for convenient printing)
irisdf["sepal_length"].head()
irisdf[["sepal_length", "sepal_width"]].head()
Out[13]:
0    5.1
1    4.9
2    4.7
3    4.6
4    5.0
Name: sepal_length, dtype: float64
Out[13]:
sepal_length sepal_width
0 5.1 3.5
1 4.9 3.0
2 4.7 3.2
3 4.6 3.1
4 5.0 3.6
In [17]:
# selecting rows
irisdf[irisdf["sepal_width"] > 4].head()
irisdf[(irisdf["sepal_width"] > 4) & (irisdf["petal_length"] > 1.4)].head()
irisdf[(irisdf["sepal_width"] > 4) | ((irisdf["species"] == "Iris-versicolor") & (irisdf["petal_length"] > 1.4))].head()
Out[17]:
sepal_length sepal_width petal_length petal_width species
15 5.7 4.4 1.5 0.4 Iris-setosa
32 5.2 4.1 1.5 0.1 Iris-setosa
33 5.5 4.2 1.4 0.2 Iris-setosa
Out[17]:
sepal_length sepal_width petal_length petal_width species
15 5.7 4.4 1.5 0.4 Iris-setosa
32 5.2 4.1 1.5 0.1 Iris-setosa
Out[17]:
sepal_length sepal_width petal_length petal_width species
15 5.7 4.4 1.5 0.4 Iris-setosa
32 5.2 4.1 1.5 0.1 Iris-setosa
33 5.5 4.2 1.4 0.2 Iris-setosa
50 7.0 3.2 4.7 1.4 Iris-versicolor
51 6.4 3.2 4.5 1.5 Iris-versicolor
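
The row selections above work because each comparison produces one True/False value per row, and the dataframe keeps the rows where the value is True. The parentheses matter: in Python & and | bind more tightly than > and ==, so each comparison has to be wrapped when it is written inline. A small sketch on made-up data (the names toy, wide and long_petal are only for illustration):

```python
import pandas as pd

toy = pd.DataFrame({"sepal_width": [4.4, 3.2, 4.1, 3.0],
                    "petal_length": [1.5, 4.7, 1.4, 1.4],
                    "species": ["Iris-setosa", "Iris-versicolor", "Iris-setosa", "Iris-setosa"]})

# each comparison gives one True/False value per row
wide = toy["sepal_width"] > 4
long_petal = toy["petal_length"] > 1.4

# & is "and", | is "or"
both = toy[wide & long_petal]
either = toy[wide | long_petal]

# .loc accepts the same mask and can pick a column at the same time
species_of_wide = toy.loc[wide, "species"]
print(both)
print(either)
print(species_of_wide)
```

Storing the masks in variables first can make long conditions easier to read than one deeply nested line.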

1.7 Adding new columns

In [24]:
# adding new columns
irisdf["petal_sum"] = irisdf["petal_length"] + irisdf["petal_width"]
irisdf["petal_max"] = irisdf.apply(lambda row: max([row["petal_length"], row["petal_width"]]), axis=1)
irisdf["flower"] = irisdf.apply(lambda row: "small_flower" if row["sepal_length"] < 5 else "big_flower", axis=1)
irisdf.head()
irisdf.flower.value_counts()
Out[24]:
sepal_length sepal_width petal_length petal_width species petal_sum petal_max flower
0 5.1 3.5 1.4 0.2 Iris-setosa 1.6 1.4 big_flower
1 4.9 3.0 1.4 0.2 Iris-setosa 1.6 1.4 small_flower
2 4.7 3.2 1.3 0.2 Iris-setosa 1.5 1.3 small_flower
3 4.6 3.1 1.5 0.2 Iris-setosa 1.7 1.5 small_flower
4 5.0 3.6 1.4 0.2 Iris-setosa 1.6 1.4 big_flower
Out[24]:
big_flower      128
small_flower     19
Name: flower, dtype: int64
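
apply with axis=1 calls the function once per row, which is flexible but can be slow on large data. The same two columns can usually be computed on whole columns at once. The following is a sketch of that alternative on made-up data, not a replacement for the cell above: max(axis=1) takes the row-wise maximum, and np.where picks one of two values depending on a condition.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"sepal_length": [5.1, 4.9, 4.7, 5.0],
                    "petal_length": [1.4, 1.4, 1.3, 1.4],
                    "petal_width": [0.2, 0.2, 0.2, 0.2]})

# row-wise maximum over two columns, without apply
toy["petal_max"] = toy[["petal_length", "petal_width"]].max(axis=1)

# np.where(condition, value_if_true, value_if_false) works on the whole column at once
toy["flower"] = np.where(toy["sepal_length"] < 5, "small_flower", "big_flower")
print(toy)
```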

2. Plotting

In [27]:
import matplotlib.pyplot as plt

# display plots inline in the notebook
%matplotlib inline

# render inline plots as SVG for better quality (you can use 'retina' instead of 'svg')
%config InlineBackend.figure_format = 'svg'

# change the default style of plots - plt.style.available lists the other choices
plt.style.use("ggplot")

2.1 Example 1 - scatterplot with outliers

In [75]:
# usual plot
irisdf.plot(x="sepal_length", y="sepal_width", kind="scatter")

# calculating the IQR (difference between the 75% and 25% quantiles)
IQR = irisdf["sepal_width"].quantile(0.75) - irisdf["sepal_width"].quantile(0.25) # interquartile range

# selecting outliers: rows with unusually high sepal_width (more than 1.5*IQR above the 75% quantile)
# note that such "extreme" rows are not necessarily real outliers
outliers = irisdf[irisdf["sepal_width"] > irisdf["sepal_width"].quantile(0.75) + 1.5*IQR]

# the remaining rows, with the outliers removed
usual = irisdf.drop(outliers.index)

# plot outliers with different color
ax = usual.plot(x="sepal_length", y="sepal_width", kind="scatter")
outliers.plot(x="sepal_length", y="sepal_width", kind="scatter", c="red", ax=ax)
ax.set_title("Cool scatterplot")
ax.set_ylabel("I can change this")
ax.set_xlim([0,10]);
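
The cell above only marks unusually high values. The usual box-plot rule also marks values that are more than 1.5*IQR below the 25% quantile. Here is a minimal sketch of the two-sided version on made-up numbers (no plotting, and the names lower_fence and upper_fence are only for illustration):

```python
import pandas as pd

# made-up values: one clearly high (4.5) and one clearly low (1.0)
values = pd.Series([3.0, 3.1, 2.9, 3.2, 3.0, 2.8, 3.1, 4.5, 1.0])

q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

# anything outside these two limits is treated as "extreme"
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

is_outlier = (values < lower_fence) | (values > upper_fence)
outliers = values[is_outlier]
usual = values[~is_outlier]
print(outliers)
```

The ~ operator negates the mask, which is an alternative to dropping the outlier rows by index.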

2.2 Histogram

In [145]:
irisdf["sepal_length"].hist();

2.3 Barplot for categorical values

In [146]:
irisdf.species.value_counts().plot(kind="bar");

2.4 Grouped barplot

In [106]:
irisdf.groupby(["species", "flower"])["sepal_length"].count().unstack().plot(kind="bar");
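
To see what the grouped barplot is drawn from, it helps to look at the table before plotting it. A small sketch on made-up data: groupby counts the rows for each (species, flower) pair, and unstack() turns the inner level into columns. With kind="bar", each row of this table becomes one group of bars and each column one bar within the group.

```python
import pandas as pd

toy = pd.DataFrame({"species": ["setosa", "setosa", "setosa", "versicolor", "versicolor"],
                    "flower": ["small_flower", "small_flower", "big_flower", "big_flower", "big_flower"],
                    "sepal_length": [4.9, 4.7, 5.1, 7.0, 6.4]})

# count rows per (species, flower) pair: a Series with a two-level index
counts = toy.groupby(["species", "flower"])["sepal_length"].count()

# unstack() moves the inner level (flower) into columns; pairs that never occur become NaN
table = counts.unstack()
print(table)
```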

2.5 Seaborn module

Seaborn is a plotting module for Python that makes some fairly complicated plots easy to produce. If it is not already installed in your environment, install it first (conda install seaborn).

In [158]:
import seaborn as sns

sns.pairplot(irisdf[["sepal_length", "petal_length", "species"]], hue="species");

2.6 Using R in a Python Jupyter notebook

Some functionality is available in R but not in Python, and R's plotting capabilities are sometimes more flexible and easier to use. It can therefore be useful to do the initial data processing in Python and then plot the results in R. In Jupyter we can do both in a single notebook in the following way.

Before running these commands, you need a working R installation (with the ggplot2 package) and the rpy2 Python package. Follow the instructions under steps 20 and 21 here: https://www.dataquest.io/blog/jupyter-notebook-tips-tricks-shortcuts/. The installation might not go smoothly (I had some issues), but once it works, it is very convenient.

In [150]:
# first load what is needed to work with R in a Python Jupyter notebook

%load_ext rpy2.ipython
%R require(ggplot2)
#%Rdevice svg <- you can try something like this to get better-quality R images, but it might not work
The rpy2.ipython extension is already loaded. To reload it, use:
  %reload_ext rpy2.ipython
Out[150]:
array([1], dtype=int32)
  • To write R code in a cell, start the cell with the %%R cell magic (rmagic).
  • To pass existing Python data to R, use the -i option; the -o option sends output from R back to Python.
  • The other options (-w, -h, -u, -r) set the width, height, units and resolution of the plot - you can experiment with them, remove them etc.
  • Read more about the possibilities here: https://ipython.org/ipython-doc/2/config/extensions/rmagic.html.
In [152]:
%%R -i irisdf -w 150 -h 100 -u mm -r 400

ggplot(irisdf, aes(x = sepal_length, y = sepal_width, color = species, size = petal_width)) + geom_point()
In [121]:
%%R -i irisdf -w 150 -h 100 -u mm -r 400

ggplot(irisdf, aes(x = sepal_length, y = sepal_width)) + geom_point() + facet_grid(species ~ .)