HW10 (due Dec 4th) Mining supermarket shopping baskets
The purpose of this homework is to give you some experience in exploring a slightly larger dataset, which will hopefully prove useful when you start your own projects.
We are going to use the 'instacart' 1M dataset in these tasks (we had a look at it in HW02). We recommend using the `fread` function from the `data.table` package for reading in the data and the `dplyr` package for working with this large dataset.
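For example, a minimal loading sketch (the file name `instacart_1M.csv` is a placeholder; point it at your local copy of the dataset):

```r
library(data.table)
library(dplyr)

# fread is much faster than read.csv on a file of this size;
# the file name below is a placeholder for your local copy
orders <- fread("instacart_1M.csv")
glimpse(orders)   # quick look at columns and types
```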
Exercise 1 (EX1) (1 point)
In this exercise we will attempt to build a very simple recommendation engine. Read in the dataset linked above and consider the first order (the purchase with order_id = 2539329) of the customer with user_id = 1; let us refer to this customer as A.
- (a) Save the product ids that were bought by A during the first visit to the shop (there should be five items). What are their product names?
- (b) To reduce the amount of computation, filter out of the initial dataset all rows that do not contain these product ids. How many rows remain? Hint: use the `filter` function from the `dplyr` package. You should also remove all rows in the initial dataset that correspond to missing product names (`NA` in the product_id column).
- (c) Transform the result of (b) into a matrix that has order ids in rows and product ids (or product names) in columns. Hint: you can use the `dcast` function from the `data.table` package like this: `dcast(your_data_frame, order_id~product_name)`. This matrix should have 1s in cells corresponding to products being present and 0s otherwise (substitute all NAs with 0s). What is the dimensionality of this matrix?
- (d) Now compute the Euclidean distances between the initial order of A and the orders in the matrix (a sketch of steps (a)-(d) appears after this list).
- (e) Select the top N orders closest (in terms of Euclidean distance) to the first order by A. Explain your choice of N.
- (f) Based on these N closest orders, which product should we advertise to A the next time he/she comes to the shop (NB! apart from the ones that were bought the first time)? Why?
- (g) Look at the next shopping basket of A. Do you see this product among the purchases?
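As a rough sketch of steps (a)-(d), assuming the data frame `orders` from above and the usual Instacart column names (`order_id`, `product_id`, `product_name`):

```r
library(dplyr)
library(data.table)

# (a) product ids bought by A in the first order
first_basket <- orders %>%
  filter(order_id == 2539329) %>%
  pull(product_id)

# (b) keep only rows with these products and non-missing product names
small <- orders %>%
  filter(product_id %in% first_basket, !is.na(product_name))

# (c) orders-by-products binary matrix; length() counts occurrences,
# so absent combinations become 0 automatically
wide <- dcast(as.data.table(small), order_id ~ product_name,
              fun.aggregate = length, value.var = "product_id")
mat <- as.matrix(wide, rownames = "order_id")
mat[mat > 1] <- 1   # clip repeated purchases to 0/1

# (d) Euclidean distance from A's first order to every order
a <- mat["2539329", ]
dist_to_a <- sqrt(rowSums(sweep(mat, 2, a)^2))
head(sort(dist_to_a), 10)   # candidates for the N closest orders
```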
This simple method is called collaborative filtering. Which classification method among the ones we have already studied has a very similar strategy? What are more sophisticated/efficient ways of generating recommendations based on user purchase data? Name a few and explain them briefly.
Exercise 2 (EX2) (2 points)
Let us investigate the popularity of products at different hours of the day. Please ignore the day of the week (order_dow) field in this task. For each of the following subtasks provide both the code and the answer. Please read through all the subtask descriptions before starting to solve them, because this might save you some coding effort through code reuse. Also, check out our tutorial on `dplyr` from earlier in the course.
(a) What is the overall most popular product? Can you guess the answer before finding it out? By popularity we mean the number of shopping baskets containing this product. Note that you can obtain the answer by finding the most frequent product in the column product_name.
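With the data frame `orders` from above (an assumption; adapt the name), this can be a one-liner:

```r
# baskets per product, most popular first
orders %>% count(product_name, sort = TRUE) %>% head(3)
```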
(b) What is the most popular product at hour 00? At hour 01? ... At hour 23? Provide the answer as a list or vector of 24 product names, one corresponding to each hour.
(c) The answers to subtask (b) were perhaps not very surprising, because overall popularity can lead to popularity at any hour. However, let us now study which product is the most popular at hour 00 relative to its total sales. Here we define the relative popularity of a product as the ratio of its popularity at hour 00 over its overall popularity. In other words, you need to divide the number of rows with this product at hour 00 by the total number of rows with this product. This is an example of data normalization, a very important set of methods for enabling fair comparisons between items. Please report the product with the highest relative popularity at hour 00, at hour 01, ..., at hour 23, printed as a list of 24 product names. If at some hour there are multiple products with the same relative frequency, choose one of them arbitrarily for reporting.
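One possible `dplyr` route, again assuming the column names `product_name` and `order_hour_of_day` (a sketch, not the only solution):

```r
# relative popularity = share of a product's rows that fall in each hour
rel_pop <- orders %>%
  count(product_name, order_hour_of_day, name = "n_hour") %>%
  group_by(product_name) %>%
  mutate(rel = n_hour / sum(n_hour)) %>%
  ungroup()

# the product with the highest relative popularity at each hour
winners <- rel_pop %>%
  group_by(order_hour_of_day) %>%
  slice_max(rel, n = 1, with_ties = FALSE) %>%
  ungroup()
winners$product_name   # 24 names, one per hour
```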
(d) Please prepare a 24x24 table `X` for the products that you obtained in subtask (c). The rows of `X` should be products and the columns should be hours. For each product P and each hour H the table should give the relative popularity of product P at hour H. Have a look at the table; there is no need to print it out in the report. Discuss whether you have discovered anything interesting in this table. In case the table is very boring, please explain why.
(e) The previous subtasks (c) and (d) found the products with the highest relative popularity at hour 00, ..., hour 23. Please narrow the same analysis down to the 300 overall most popular products. In other words, create a smaller version of the original dataset which only includes data about the 300 products with the highest popularity in the original dataset. Then find the product with the highest relative popularity at hour 00, at hour 01, ..., at hour 23 and report these as a list of 24 product names. Using the same or similar code as in (d), create a 24x24 table `X` with these 24 products and their relative popularity at each hour. Have a look at the table; there is no need to print it out in the report. Discuss whether the result is more interesting than in subtask (d) and why.
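Restricting to the 300 overall most popular products could look like this (a sketch under the same naming assumptions):

```r
top300 <- orders %>%
  count(product_name, sort = TRUE) %>%
  slice_head(n = 300) %>%
  pull(product_name)

orders300 <- orders %>% filter(product_name %in% top300)
# then rerun the code from (c) and (d) on orders300
```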
(f) Let us plot the table from the previous subtask (e) using the command `pheatmap(X, cluster_rows=F, cluster_cols=F)` from the package `pheatmap` (author: Raivo Kolde), which you can install using `install.packages('pheatmap')` and load using `library(pheatmap)`. Before plotting, please assign the names of the products to the rows using `rownames(X) <- put_vector_of_24_names_here` and assign the hours 00, ..., 23 to the columns using `colnames(X) <- sprintf("h%02d", 0:23)`, so that the product names and hours are visible in the figure. Please include the figure in the report. If you have done this correctly, you should be able to see that the descending diagonal stands out from the figure (values slightly higher than those around it). Discuss why this happens. If we provided you with the next 1 million rows from the same supermarket, do you think the same analysis would give a similar result? Why?
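Putting the commands above together (here `X` is the table from (e) and `products24` is a hypothetical vector holding the 24 product names):

```r
library(pheatmap)

rownames(X) <- products24                 # hypothetical vector of 24 names
colnames(X) <- sprintf("h%02d", 0:23)

pdf("heatmap_products.pdf", width = 15, height = 15)  # wide enough for long names
pheatmap(X, cluster_rows = FALSE, cluster_cols = FALSE)
dev.off()
```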
(g) Let us repeat the analysis of subtask (c), but now with aisles. Please study which aisle is the most popular at hour 00 relative to its total sales. Here we define the relative popularity of an aisle as the ratio of its popularity at hour 00 over its overall popularity. In other words, you need to divide the number of rows with this aisle at hour 00 by the total number of rows with this aisle. Please report the aisle with the highest relative popularity at hour 00, at hour 01, ..., at hour 23, printed as a list of 24 aisle names.
(h) Please prepare a 24x24 table `X` for the aisles that you obtained in subtask (g). The rows of `X` should be aisles and the columns should be hours. For each aisle A and each hour H the table should give the relative popularity of aisle A at hour H. Have a look at the table; there is no need to print it out in the report. Note that some aisle name may occur multiple times in this table; this is fine. Assign the names of the aisles and the hours to the rows and columns, respectively, and visualize using the `pheatmap` command as you did in subtask (f). Report 3 interesting facts from the visualization. Can you still see the diagonal standing out from the figure?
(i) Please prepare a 22x24 table `X` for the departments. The rows of `X` should be the 22 departments and the columns should be hours. For each department D and each hour H the table should give the relative popularity of department D at hour H. Have a look at the table; there is no need to print it out in the report. Assign the names of the departments and the hours to the rows and columns, respectively, and visualize using the `pheatmap` command as you did in subtask (h). Report 3 interesting facts from the visualization.
R implementation hints: If you are not able to get the result with simple commands, you can always fall back to writing for-loops. Using `dplyr`, commands like `group_by(product_name, order_hour_of_day)`, `summarise(count=length(product_name))` and `arrange()` might turn out to be useful. This produces a long table which you can transform into a wide table either using the package `tidyr` or alternatively using the `dcast` command from the package `reshape2`, see an example here. The benefit of `dcast` is that you can explicitly state what you want in the rows and columns, such as `dcast(d, product_name~order_hour_of_day, value.var='count')`. Preparing a table with all products in rows and the 24 hours in columns already for subtask (a) might save you time later. A vector of row sums of such a table can be obtained by `s <- apply(d,1,sum)`, and dividing each row by its sum can be done by `p <- d/s`. When using `pheatmap` it can happen that the plotting space is too narrow to fit the heatmap together with the long product names. Then try increasing the width of the plotting window, or alternatively save directly to pdf by calling something like `pdf('filename.pdf', width=15, height=15)`, followed by the `pheatmap` function call and finally `dev.off()` to close the file.
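Assembled into one possible pipeline (a sketch under the same column-name assumptions as above):

```r
library(dplyr)
library(reshape2)

# long table: one row per (product, hour) with its count
long <- orders %>%
  group_by(product_name, order_hour_of_day) %>%
  summarise(count = length(product_name), .groups = "drop")

# wide table: products in rows, the 24 hours in columns
d <- dcast(long, product_name ~ order_hour_of_day,
           value.var = "count", fill = 0)
rownames(d) <- d$product_name
d <- d[, -1]

# divide each row by its sum to get relative popularities
s <- apply(d, 1, sum)
p <- d / s   # rows of p now sum to 1
```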
Exercise 3 (EX3) (1 point)
The objective of this exercise is to let you think about potential data mining questions and goals, similar to what you will soon need to do in your project. Please come up with 3 data mining goals, either for the same Instacart dataset or for some other dataset of your choice. If you choose something other than Instacart, please briefly describe your chosen dataset and, if possible, provide a link to it. Each suggested goal should be broad enough that achieving it requires answering multiple questions related to the goal, with a total size at least as big as one home exercise (but it can be even broader). For each of your 3 suggested data mining goals please describe the following:
- (a) Briefly explain the data mining goal (such as identifying products which are particularly popular during some specific hour of the day) and why you think it matters for the business (such as hypothetically informing the supermarket about which products to put on a new shelf of popular breakfast-related products).
- (b) List the subtasks that you think should be carried out to achieve the goal.
- (c) What data mining techniques would you use to achieve the goal?
If you choose Instacart, please make sure that your 3 suggested goals differ from the ideas presented in EX1 and EX2. It is fine to use the dataset that you are planning to choose for your project. You can choose any other existing dataset as well (for example, if the dataset that you envisioned in EX4 of homework HW01 exists, then that is also fine). Hint: there is no need to implement any code at all in this exercise.
Bonus Exercise 1 (BEX1) (2 points)
Implement one of the three ideas that you came up with in EX3. The result should form a nice story and should be presented in a similar way to EX2. This means that you should provide code, visualizations of the results with discussion, and conclusions about what the analysis has revealed about the data.