Data Mining - Courses - Institute of Computer Science

How to choose a programming language for the Data Mining course?

In order to solve common data mining tasks during the course, you need to choose a programming language. In general, you can use any language you are comfortable with, but we strongly recommend using either R or Python since they are widely used in the data mining community and they have a lot of built-in libraries that will make your life much easier.

So which language should you use? If you are serious about doing data science you should know both, but the short descriptions below should help you decide.

Since both languages have their pros and cons, the easiest and fastest way is sometimes to use them together (for example, do the preprocessing in Python and the plotting in R). If you want to do that, there is good news: Jupyter notebooks provide a really simple way. You can just write one cell in Python, another in R, and so on. An example will be uploaded here soon.

R - an all-in-one language for statistical computing and graphics

has usually been the default language of the course
is known to have all the necessary functionality we use in this course (and in general has a very good and large set of existing packages for data mining tasks)
is easy to install, and installing a new package takes one line of code
has a good IDE for exploratory analysis (RStudio)
is known as a statistics programming language and is used for data analysis, not so much for developing software
has a syntax that might be a little unintuitive at first, so there is a learning curve, but once you get used to it, it lets you express complicated things very concisely
has very nice plotting capabilities (ggplot2)

We recommend using R if:

you like R
you want to learn a new language
you do not have much programming experience
you are OK with spending some time getting to know the language (and can later write really short solutions to the exercises)
you want to work in the data mining field and feel like you should know R (because a lot of other people are also working with R)

One way to set up a convenient working environment in R:

The easiest way to use R is to work in RStudio.
You can generate nice reports by writing your report and code in an R Markdown file and then using knitr (Knit PDF) to generate a PDF report from it.
A nice library for plotting is ggplot2.
A good tutorial for R is swirl. It is an R package that helps you learn R: you can select the tutorials you want, or just do them all step by step if R is completely new to you.
To get started with R faster, you are encouraged to try out the rattle package. The package allows you to perform many common data mining tasks via a GUI without entering commands.
You can also run interactive R sessions within Jupyter notebooks. You can enter regular R commands and accompanying Markdown text, and the output will be generated and updated automatically each time you rerun the session. Think of it as R with Word capabilities. These interactive documents are a convenient way to submit your homework.

Feel free to share information about useful packages and R tricks in the Piazza forum.

Python - a widespread programming language with modules for data analysis

has not been used as a default language in this course, but many have used it and said that it is sufficient (and we think so too)
should have most of the packages we use in this course (although not all might be there yet: for example, we haven’t found a decent module for itemset mining, but the algorithms we use are easy to implement and can also be found on various forums)
has modules like pandas that make working with data and plotting as easy and concise as in R
has a growing community of people using it for data mining, so new modules appear constantly
is a widely used general-purpose programming language, not just a tool for data analysis
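
As a rough illustration of the remark above that itemset mining algorithms are easy to implement, here is a minimal sketch of a level-wise (Apriori-style) frequent itemset search in plain Python. It is not the course's reference implementation: the function name frequent_itemsets, the min_support parameter (an absolute transaction count) and the toy baskets are all made up for this example.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Return {itemset: support count} for every itemset contained in at
    least `min_support` transactions (level-wise, Apriori-style search)."""
    transactions = [frozenset(t) for t in transactions]

    def count(candidates):
        # Count the transactions containing each candidate, keep the frequent ones.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        return {c: n for c, n in counts.items() if n >= min_support}

    # Level 1: frequent single items.
    items = {item for t in transactions for item in t}
    current = count({frozenset([item]) for item in items})
    result = dict(current)

    k = 2
    while current:
        # Join step: merge frequent (k-1)-itemsets into candidate k-itemsets.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in current for s in combinations(c, k - 1))
        }
        current = count(candidates)
        result.update(current)
        k += 1
    return result


# Made-up toy transactions, for illustration only.
baskets = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]

found = frequent_itemsets(baskets, min_support=3)
for itemset, support in sorted(found.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), support)
```

With this toy data and min_support=3, the search finds four frequent single items and four frequent pairs, and no frequent triple. The sketch rescans all transactions at every level, which is fine for small exercises but would be slow on large data sets.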

We recommend using Python if:

you like Python
you have a programming background and would like to use a language that is also useful in areas other than data analysis
you are OK with sometimes spending extra time searching for and installing the necessary modules (although Anaconda should be sufficient for this course)

One way to set up a convenient working environment in Python:

The easiest way to use Python is to install Anaconda, which comes with the necessary packages preinstalled. Of course, you could install Python and then the modules separately, but Anaconda probably has most, if not all, of the modules we need.
The modules we recommend using are (all of them except seaborn are in the default installation of Anaconda, no need to install anything separately):
- pandas - stores data in data frames and offers easy plotting and statistics, so you can do R-like analysis (R also uses data frames as its main data structure). See cheat sheet 1 and cheat sheet 2.
- seaborn - advanced and easy plotting (if it is missing from your installation, install it by running: conda install seaborn)
- scikit-learn - machine learning functionality, clustering, etc
- numpy, scipy, matplotlib, etc - modules above use them, but they have some separate functionality that might be useful
- jupyter/ipython
  - allows you to do exploratory analysis (execute code in parts and see intermediate results; a very useful tool, highly recommended)
  - you can use Markdown to write explanations between code and plots, so you basically generate a report together with the code (you can download the report with File -> Download as -> PDF via LaTeX)
  - to run Jupyter Notebook, type jupyter notebook in the folder where your code will be (or the DM home folder, etc).
There are various tutorials about all these modules. We recommend starting with pandas to get familiar with data frames (one pandas tutorial).
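
To give a feel for the R-like data frame workflow mentioned in the module list above, here is a small pandas sketch. The column names and values are invented for illustration. It only uses standard pandas calls (DataFrame, describe, groupby, boolean indexing) and assumes pandas is installed, for example via Anaconda.

```python
import pandas as pd

# Made-up toy data, for illustration only.
df = pd.DataFrame({
    "species": ["a", "a", "b", "b", "b"],
    "length": [1.0, 3.0, 2.0, 4.0, 6.0],
    "width": [0.5, 0.7, 1.0, 1.2, 1.4],
})

# Summary statistics (count, mean, std, quartiles, ...) for the numeric columns.
summary = df[["length", "width"]].describe()

# Mean of each numeric column per group, similar to aggregating by a factor in R.
means = df.groupby("species")[["length", "width"]].mean()

# Filter rows with a boolean condition and add a derived column.
long_rows = df[df["length"] > 2.5].copy()
long_rows["ratio"] = long_rows["length"] / long_rows["width"]

print(summary)
print(means)
print(long_rows)
```

Running the cells one at a time in a Jupyter notebook shows each intermediate table directly below the code that produced it.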

Feel free to share information about useful modules and Python tricks in the Piazza forum.

TL;DR - Python is good for data preprocessing and writing clean code, and it also has good libraries for data analysis (though it might not yet have as much functionality as R). R has all the functionality needed during the course and is probably better for plotting. There are tasks where R is easier and others where Python is easier or more convenient. But if you want to do data science, you should definitely learn the strengths and weaknesses of both and use the one that suits your current task!

Data Mining, spring 2016/17