This notebook contains some information and examples for getting started with Jupyter and Python.
Now we can start with the tutorial. The first step is to import the necessary modules.
# we will use the pandas module because it allows us to work with R-like dataframes
import pandas as pd
# often we need some functions from numpy as well
import numpy as np
# the next two lines force Jupyter to output the results of all statements in a cell (by default only the last one is shown)
# a semicolon (;) at the end of a line suppresses the output of that line
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
A pandas dataframe is a convenient structure for working with data and offers a lot of useful functionality for data analysis. If you are already familiar with R dataframes, this is very similar.
# You can create a dataframe from scratch if you want to
# (although you will probably rarely need to during this course)
# you have to specify column names and data
df = pd.DataFrame({ 'A' : 1.,
'B' : pd.Timestamp('20130102'),
'C' : pd.Series([1,2,3,4],dtype='int32'), # you can be very specific about the data types
'D' : [4,2,3,4], # if you are not, pandas will try to guess
'E' : pd.Categorical(["test","train","test","train"]),
'F' : ["test", "train", "test", "train"],
'G' : 'foo' })
# you can see information about the data in different ways
df.head() # first rows
df.tail(2) # last rows
df.shape # dimensions
df.describe() # summary statistics
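To check which types pandas actually chose for each column, dtypes can be inspected. The following is a minimal sketch on a smaller frame; the name small is hypothetical and used only for this illustration.

```python
import pandas as pd

# `small` is a hypothetical frame used only for this illustration
small = pd.DataFrame({
    "A": 1.0,                                      # scalar, repeated for every row
    "C": pd.Series([1, 2, 3, 4], dtype="int32"),   # explicit dtype
    "D": [4, 2, 3, 4],                             # dtype inferred by pandas
    "E": pd.Categorical(["test", "train", "test", "train"]),
    "F": ["test", "train", "test", "train"],       # plain strings
})

col_types = small.dtypes   # one dtype per column
print(col_types)
```

The explicitly typed column keeps int32, the categorical column is stored as category, and the remaining types are inferred from the values.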
# we can also create a series - every dataframe column is a series as well
s = pd.Series([1,5,2,6,4,5])
# and we can count different values in there
s.value_counts()
# as already said, we can do this also on the columns of dataframes
# (these two commands are identical)
df["D"].value_counts()
df.D.value_counts()
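value_counts can also report relative frequencies instead of absolute counts. A small sketch, reusing the same example values as the series above:

```python
import pandas as pd

s = pd.Series([1, 5, 2, 6, 4, 5])

counts = s.value_counts()                 # absolute counts, most frequent first
shares = s.value_counts(normalize=True)   # relative frequencies that sum to 1

print(counts)
print(shares)
```

Note that attribute access such as df.D only works when the column name is a valid Python identifier and does not clash with an existing dataframe attribute or method; bracket access such as df["D"] works for any column name.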
# usually we read the dataset from some file (for example csv)
irisdf = pd.read_csv("iris.csv", header=None)
# we can assign (new) column names
irisdf.columns = ["sepal_length", "sepal_width", "petal_length", "petal_width", "species"]
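Column names can also be passed while reading, via the names parameter of read_csv. The sketch below reads from an in-memory string instead of a file so that it runs on its own; the two rows are only illustrative and sample is a hypothetical name.

```python
from io import StringIO

import pandas as pd

# two illustrative rows in the same layout as the iris file (no header line)
raw = StringIO(
    "5.1,3.5,1.4,0.2,Iris-setosa\n"
    "7.0,3.2,4.7,1.4,Iris-versicolor\n"
)

cols = ["sepal_length", "sepal_width", "petal_length", "petal_width", "species"]
sample = pd.read_csv(raw, header=None, names=cols)  # names are set while reading
print(sample)
```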
# see the data
irisdf.head()
# data size
irisdf.shape
# ask for specific statistic of a column
irisdf.sepal_length.mean()
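# or of all numeric columns (numeric_only=True skips the text column "species",
# which would otherwise raise an error in recent pandas versions)
irisdf.mean(numeric_only=True)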
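Several statistics can be computed in one call with agg. A minimal sketch on a hypothetical miniature table (mini and its values are made up for illustration):

```python
import pandas as pd

# hypothetical miniature of the iris table
mini = pd.DataFrame({
    "sepal_length": [5.0, 6.0, 7.0],
    "sepal_width": [3.0, 3.5, 2.5],
    "species": ["a", "b", "b"],
})

means = mini.mean(numeric_only=True)   # skips the text column
summary = mini[["sepal_length", "sepal_width"]].agg(["mean", "median", "std"])
print(means)
print(summary)
```

The result of agg has one row per statistic and one column per selected column.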
# how many NA elements in every column
irisdf.isnull().sum()
# remove rows that contain NAs
irisdf = irisdf.dropna()
# now the shape is
irisdf.shape
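The effect of isnull and dropna is easier to see on a tiny frame. The sketch below uses made-up data (toy is a hypothetical name) and also shows the subset parameter, which restricts the NA check to chosen columns:

```python
import numpy as np
import pandas as pd

# hypothetical frame with two missing values
toy = pd.DataFrame({
    "x": [1.0, np.nan, 3.0, 4.0],
    "y": ["a", "b", None, "d"],
})

na_per_column = toy.isnull().sum()      # number of missing values in each column
complete = toy.dropna()                 # drops every row that has any NA
x_known = toy.dropna(subset=["x"])      # drops only rows where x is missing
print(na_per_column, complete.shape, x_known.shape)
```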
# you can also write data to a file
irisdf.to_csv("iris_no_na.csv")
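By default to_csv also writes the row index as the first column, which then shows up as an extra unnamed column when the file is read back. A small sketch with made-up data, writing to in-memory buffers instead of files so that it is self-contained:

```python
from io import StringIO

import pandas as pd

toy = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

with_index = StringIO()
toy.to_csv(with_index)                  # default: the index is the first column
without_index = StringIO()
toy.to_csv(without_index, index=False)  # only the data columns

print(with_index.getvalue())
print(without_index.getvalue())
```

Passing index=False is usually what you want when the index is just the default row numbering.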
# selecting only some column/columns (head() command is just for convenient printing)
irisdf["sepal_length"].head()
irisdf[["sepal_length", "sepal_width"]].head()
# selecting rows
irisdf[irisdf["sepal_width"] > 4].head()
irisdf[(irisdf["sepal_width"] > 4) & (irisdf["petal_length"] > 1.4)].head()
irisdf[(irisdf["sepal_width"] > 4) | ((irisdf["species"] == "Iris-versicolor") & (irisdf["petal_length"] > 1.4))].head()
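Long conditions are easier to read when each boolean mask gets its own name. A minimal sketch on a hypothetical miniature table (mini, wide and long_petal are names introduced only for this example):

```python
import pandas as pd

# hypothetical miniature of the iris table
mini = pd.DataFrame({
    "sepal_width": [4.2, 3.0, 4.4, 2.8],
    "petal_length": [1.4, 4.5, 1.5, 4.1],
    "species": ["Iris-setosa", "Iris-versicolor", "Iris-setosa", "Iris-versicolor"],
})

wide = mini["sepal_width"] > 4            # boolean series, one value per row
long_petal = mini["petal_length"] > 1.4

both = mini[wide & long_petal]            # rows where both conditions hold
# .loc filters rows and chooses columns at once; ~ negates a mask
picked = mini.loc[wide & ~long_petal, ["species", "sepal_width"]]
print(both)
print(picked)
```

The parentheses in the one-line versions are required because & and | bind more tightly than comparisons such as > and ==.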
# adding new columns
irisdf["petal_sum"] = irisdf["petal_length"] + irisdf["petal_width"]
irisdf["petal_max"] = irisdf.apply(lambda row: max([row["petal_length"], row["petal_width"]]), axis=1)
irisdf["flower"] = irisdf.apply(lambda row: "small_flower" if row["sepal_length"] < 5 else "big_flower", axis=1)
irisdf.head()
irisdf.flower.value_counts()
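Row-wise apply is flexible but slow on large tables. As an alternative sketch, the same two columns can be computed with vectorized operations; the data below is made up and mini is a hypothetical name:

```python
import numpy as np
import pandas as pd

# hypothetical miniature of the iris table
mini = pd.DataFrame({
    "sepal_length": [4.9, 5.0, 6.3],
    "petal_length": [1.4, 1.5, 6.0],
    "petal_width": [0.2, 2.0, 2.5],
})

# row-wise maximum over two columns without apply
mini["petal_max"] = mini[["petal_length", "petal_width"]].max(axis=1)
# element-wise if/else without apply
mini["flower"] = np.where(mini["sepal_length"] < 5, "small_flower", "big_flower")
print(mini)
```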
import matplotlib.pyplot as plt
# displays plots directly in the notebook
%matplotlib inline
# gives inline plots better quality (svg can be replaced with retina as well)
%config InlineBackend.figure_format = 'svg'
# the default plot style can be changed (plt.style.available lists the choices)
plt.style.use("ggplot")
# usual plot
irisdf.plot(x="sepal_length", y="sepal_width", kind="scatter")
# calculating IQR (difference between 75% and 25% quantile)
IQR = irisdf["sepal_width"].quantile(0.75) - irisdf["sepal_width"].quantile(0.25) # interquartile range
# selecting outliers: rows with "extreme" values, here above Q3 + 1.5*IQR (not all of them are actually outliers)
outliers = irisdf[irisdf["sepal_width"] > irisdf["sepal_width"].quantile(0.75) + 1.5*IQR]
# the remaining rows, with the outliers dropped
usual = irisdf.drop(outliers.index)
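The filter above only looks at unusually large values. A two-sided variant, which also flags values below Q1 - 1.5*IQR, could be sketched as follows; iqr_bounds is a hypothetical helper and the measurements are made up:

```python
import pandas as pd

def iqr_bounds(values, k=1.5):
    """Return (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# made-up measurements with one unusually small and one unusually large value
widths = pd.Series([0.5, 2.9, 3.0, 3.0, 3.1, 3.2, 3.3, 5.5])
lower, upper = iqr_bounds(widths)

is_outlier = (widths < lower) | (widths > upper)
outliers = widths[is_outlier]    # values outside the fences
usual = widths[~is_outlier]      # everything else
print(lower, upper)
print(outliers)
```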
# plot outliers with different color
ax = usual.plot(x="sepal_length", y="sepal_width", kind="scatter")
outliers.plot(x="sepal_length", y="sepal_width", kind="scatter", c="red", ax=ax)
ax.set_title("Cool scatterplot")
ax.set_ylabel("I can change this")
ax.set_xlim([0,10]);
irisdf["sepal_length"].hist();
irisdf.species.value_counts().plot(kind="bar");
irisdf.groupby(["species", "flower"])["sepal_length"].count().unstack().plot(kind="bar");
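It helps to look at the table behind the last bar plot before plotting it. A minimal sketch on a hypothetical miniature table; fill_value=0 is an addition here so that combinations that do not occur show 0 instead of NaN:

```python
import pandas as pd

# hypothetical miniature of the iris table
mini = pd.DataFrame({
    "species": ["setosa", "setosa", "setosa", "virginica", "virginica"],
    "flower": ["small_flower", "small_flower", "big_flower", "big_flower", "big_flower"],
    "sepal_length": [4.8, 4.9, 5.2, 6.3, 7.1],
})

table = (
    mini.groupby(["species", "flower"])["sepal_length"]
    .count()                  # rows per (species, flower) pair
    .unstack(fill_value=0)    # flower labels become columns; missing pairs get 0
)
print(table)
```

When such a table is plotted with kind="bar", each row becomes one group of bars and each column one bar within the group.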
Seaborn is a plotting module for Python that makes some cool and fairly complicated plots easy to produce. If it is not already part of your Anaconda installation, you need to install it (conda install seaborn).
import seaborn as sns
sns.pairplot(irisdf[["sepal_length", "petal_length", "species"]], hue="species");
Some functionality is available in R but not in Python, and the plotting capabilities of R are sometimes more flexible and easier to use. It can therefore be useful to do, for example, the initial data processing in Python and then plot the results in R. In Jupyter, both can be done in a single notebook in the following way.
Before running these commands, you need to install the necessary components. Follow the instructions under steps 20 and 21 here: https://www.dataquest.io/blog/jupyter-notebook-tips-tricks-shortcuts/. The installation might not go smoothly (I had some issues), but once it works it is very convenient.
# first load the necessary things to work with R in a Python Jupyter notebook
%load_ext rpy2.ipython
%R require(ggplot2)
#%Rdevice svg <- you can try something like this to get good quality R images, but it might not work
%%R -i irisdf -w 150 -h 100 -u mm -r 400
ggplot(irisdf, aes(x = sepal_length, y = sepal_width, color = species, size = petal_width)) + geom_point()
%%R -i irisdf -w 150 -h 100 -u mm -r 400
ggplot(irisdf, aes(x = sepal_length, y = sepal_width)) + geom_point() + facet_grid(species ~ .)