HW11. Pivot Tables and PCA (07.05)
In this homework we will learn two new topics that were so far not directly covered in the lectures yet. Namely, the pivot table functionality in spreadsheets (which is closely related to data cubes and OLAP in database terms); and the Principal Component Analysis (PCA).
The data that we will use is about all Estonian companies - their paid taxes, paid employer taxes, nr of people, and turnover. XLS data file is available from Tax Authority web page https://www.emta.ee/et/kontaktid-ja-ametist/maksulaekumine-statistika/tasutud-maksud
Main interest is in columns B (name), C (type - business?), E (county), F (field of industry), G (all taxes paid), H (employment taxes paid), I (turnover), J (nr of employees) - in first quarter of 2017.
Your goal: be able to work with this data.
1) Learn to use filters - characterise largest companies in 3 select counties; by different criteria - area of business, nr of employees, paid taxes, highest salaries per employee (employment tax paid per employee). Goal: a short meaningful description of dominant companies in various counties.
2) Learn to use pivot tables. Use online tutorials if you are not familiar. Google Drive Sheet may not be able to take the full dataset "easily". Excel and Open Office should be enough. Make an overview of the above data: characterise all business areas by: Number of entities; Total paid taxes; Average nr of employees; Average paid employment taxes.
3) Subselect all entities that paid taxes 1M-2M EUR. Make a separate data file for yourself. Fill in missing values - to avoid division by zero, make all empty slots or zeroes for example into 0.1 (close enough to 0). Cluster these companies using some clustering algorithm (try varying distance measures).
4) Study the general concept of the Principal Component Analysis (PCA). E.g. videos - ; ; or some other tutorials, textbooks, wikipedia. State briefly in 4-5 sentences what PCA does, and identify some PCA plot where underlying data set had at least 4-5 dimensions (google search, cite the source).
5) Use the same selection of data as from Task 3. Now visualise these data using the PCA plot. Make PCA plots for PC1-PC2; PC1-PC3; PC2-PC3. Look up a few specific companies in that plot; Try to compare to the clustering performed in Task 3. Where do some of these clusters "lie" on the PCA plot.
6) (Bonus 1p) Make some other subselection of data, and apply PCA on that as well - e.g. some large companies from different industries - are industries somewhat different by PCA? Make a compelling story about some county and different companies; or different industries/companies. You may also need to increase the "weight" of nr of employed individuals - perhaps normalise first each column?
7) (Bonus 1p) Finalise your actual project plan - with your team mates. Make sure to have it described in the presentation file of https://drive.google.com/drive/folders/0B5TbAaWYr3OgR2FyVnlIVjBoVDA - please also add there somemore details on data; visualisation idea/sketch; the methods to be used. One image tells more than thousand words - add some illustrations. If you need, use two slides in the slide-deck. For project team it is ok to share the same description in your HW submissions. Just make sure to add all team-mates names as well.