Homework 3 (HW03) - Computational statistics
Exercise 1 (EX1) (1 point)
In a big country there are exactly four political parties and every citizen supports exactly one party. You are carrying out an opinion poll and ask N uniformly randomly chosen people which party they support. How big should N be if you want the errors on all 4 percentages to be less than 0.5% with probability at least 95%?
We want you to estimate N computationally under the assumption that you know the true distribution for the support of the parties (which you normally wouldn’t know, but this task is about calculating the sample size N that gets us “close enough” to the truth). Use two true distributions: a) 10%, 20%, 30% and 40% and b) 25%, 25%, 25% and 25%.
Hints for the solution:
We recommend solving this task by following this algorithm:
- Fix a sample size N and choose one true probability distribution (a or b).
- Repeat the following four steps 100 times (you can use a for loop):
  - Generate a sample of size N (check what `sample(c(1,2,3), 20, replace = T, prob = c(0.1, 0.4, 0.5))` does).
  - Calculate the percentages of support in the sample (check out the `table` function).
  - Calculate the errors (differences from the true probabilities).
  - Find out if all of the errors are small enough (< 0.5% if you work in the 0% to 100% range, or < 0.005 if you work with probabilities from 0 to 1).
- Count how many times all of the errors were small enough.
We recommend writing this into a function that takes in the sample size N and true probabilities as input and returns the percentage of times where all of the errors were small enough.
```r
sample_n <- function(N, probs) {
  #############
  # YOUR CODE #
  #############
  return(result)
}
```
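For illustration only (this is a sketch, not the required solution), one possible way to fill in the skeleton, assuming 100 repetitions and errors compared on the 0-to-1 scale, could look like this:

```r
sample_n <- function(N, probs) {
  reps <- 100               # number of simulated polls (as suggested in the hints)
  ok <- 0
  for (r in 1:reps) {
    s <- sample(seq_along(probs), N, replace = TRUE, prob = probs)
    # Support proportions in this sample; factor() keeps parties with zero count
    est <- as.numeric(table(factor(s, levels = seq_along(probs)))) / N
    errors <- abs(est - probs)
    if (all(errors < 0.005)) ok <- ok + 1   # all four errors below 0.5%
  }
  result <- 100 * ok / reps  # percentage of runs where all errors were small enough
  return(result)
}
```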
Then you can run this function systematically with different N’s and report when the conditions are met. Report all of the N’s you tried with the corresponding percentages (for both distributions). Don’t forget to answer the initial question!
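For example (using the sketch above; the candidate N values below are arbitrary placeholders, not a hint at the answer), the sweep over sample sizes could look like this:

```r
Ns <- c(10000, 20000, 40000, 80000)    # example candidate sample sizes
probs_a <- c(0.10, 0.20, 0.30, 0.40)
probs_b <- c(0.25, 0.25, 0.25, 0.25)
sapply(Ns, sample_n, probs = probs_a)  # percentage of "good" polls for each N, distribution a
sapply(Ns, sample_n, probs = probs_b)  # same for distribution b
```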
Finally, discuss if there are differences between the two different true distributions.
Exercise 2 (EX2) (1 point)
You used a fully computational approach to solve the previous task. However, such approaches can sometimes be computationally very expensive. That is why it is good to involve some knowledge from statistics, which allows you to do fewer calculations. In this exercise you need to solve exactly the same problem as before, but this time you can use a "statistical shortcut" for calculating the errors.
Before, the entire sample generation process had to be carried out in order to estimate the percentages of all parties, from which you got the errors (differences from the true probabilities). From the lecture you learned that you can actually estimate these errors more directly, without even generating a sample (assuming that the errors in all four percentages are independent). All you need to know is that the error for each party has a normal distribution with mean 0 and standard deviation equal to the square root of the count of that party (e.g. when N is 1000 and we use the true probabilities of version a, then the counts are 100, 200, 300 and 400).
Your task is to modify the code from EX1 so that instead of generating 100 times new samples and calculating the errors, you generate the errors right away without generating the sample, using the information provided above.
Hints:
- To generate 3 numbers from a normal distribution with mean 1 and different standard deviations 1, 2 and 4, you can use the code `rnorm(3, mean = 1, sd = c(1, 2, 4))`.
- You have to normalize the errors in order to compare them with 0.5% (or 0.005) (e.g. if the generated error is 20 and our sample size N is 1000, then the normalized error is 0.02).
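Purely as an illustration of this shortcut (a sketch under the stated assumption that each error is normal with mean 0 and standard deviation equal to the square root of the party's count; N and the number of repetitions are arbitrary examples):

```r
N <- 10000                               # example sample size
probs <- c(0.1, 0.2, 0.3, 0.4)           # true distribution, version a
reps <- 100
ok <- 0
for (r in 1:reps) {
  counts <- N * probs                                               # counts per party
  raw_errors <- rnorm(length(probs), mean = 0, sd = sqrt(counts))   # errors on the count scale
  norm_errors <- abs(raw_errors) / N                                # normalized to the 0-to-1 scale
  if (all(norm_errors < 0.005)) ok <- ok + 1
}
100 * ok / reps   # percentage of runs with all errors below 0.5%
```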
Report and answer the same questions as before. Comment on the differences between EX1 and EX2 in terms of sample size and speed (number of operations).
Exercise 3 (EX3) (1 point)
Consider the following file. This file has 100 items and 100 attributes (when you read it in, make sure you are not using the index column as one of the attributes). We have created this file such that most of the attributes are independent and any correlations in them are just due to random chance. There are only a few pairs of attributes which are genuinely correlated. How many correlated pairs did we make? Justify the answer by constructing a permutation test. It is fine if your reported number of pairs is not exactly correct, we will grade your reasoning more than the answer.
Hints:
- To calculate all correlation pairs you can use the `cor` function.
- The diagonal is an attribute's correlation with itself, which you don't need. The correlation matrix is symmetric, which means you only need one half of it, otherwise you duplicate each pair. To solve both of these issues, check out the `lower.tri` and `upper.tri` functions (and the parameter `diag`).
- For the permutation test it is OK to take the data and shuffle each column independently (you can do it once or multiple times, choose yourself). Then you can calculate the correlations and use them (e.g. the min and max correlation, or something else) to come up with a claim about how many pairs were correlated in the original data. You can do this by estimating the highest (and the lowest) correlation between two attributes in the case of randomly permuted data, later comparing it to the correlations you recorded in the original data, and drawing conclusions.
- Remember, pairs can be both negatively and positively correlated.
- To get a shuffled version of one column of a data frame, you can do the following: `data[sample(nrow(data)), i]`, where `i` is the index of the column being shuffled.
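To make the hints concrete, here is one possible sketch of such a test (it assumes the file has already been read into a data frame called `data` with the index column removed; using a single shuffle and the most extreme permuted correlations as thresholds is just one of several valid choices):

```r
# Correlations in the original data (upper triangle only, diagonal excluded)
orig_cor <- cor(data)
orig_pairs <- orig_cor[upper.tri(orig_cor)]

# Shuffle each column independently to destroy any real correlations
shuffled <- as.data.frame(lapply(data, function(col) col[sample(length(col))]))
perm_cor <- cor(shuffled)
perm_pairs <- perm_cor[upper.tri(perm_cor)]

# Use the most extreme correlations seen under permutation as thresholds
lo <- min(perm_pairs)
hi <- max(perm_pairs)
sum(orig_pairs < lo | orig_pairs > hi)   # pairs more extreme than anything seen by chance
```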
Explain how you have built the test and report how many pairs were significantly correlated according to it.
Exercise 4 (EX4) (1 point)
Replicate the experiment from John Rauser's talk, first using the t-test in R (function `t.test`) and then using the permutation test. Show and comment on the process and the results. Were they the same as in the talk?
The data from the talk:
```r
beer  <- c(27, 19, 20, 20, 23, 17, 21, 24, 31, 26, 28, 20, 27, 19, 25, 31, 24, 28, 24, 29, 21, 21, 18, 27, 20)
water <- c(21, 19, 13, 22, 15, 22, 15, 22, 20, 12, 24, 24, 21, 19, 18, 16, 23, 20)
```
Hint: after previous tasks you should know all the necessary functions and the talk provides you with the experiment setup. If you need more hints, then ask us! :)
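If it helps, a rough sketch of both tests using the data above (the permutation details, such as 10000 permutations and the difference of group means as the statistic, are our assumptions, not necessarily the exact setup from the talk):

```r
# Classical two-sample t-test
t.test(beer, water)

# Permutation test on the difference of group means
observed <- mean(beer) - mean(water)
pooled <- c(beer, water)
n_beer <- length(beer)
perm_diffs <- replicate(10000, {
  shuffled <- sample(pooled)                      # random reassignment to the two groups
  mean(shuffled[1:n_beer]) - mean(shuffled[-(1:n_beer)])
})
mean(abs(perm_diffs) >= abs(observed))            # two-sided permutation p-value
```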
Bonus exercise 1 (BEX1) (2 points)
You are organising a one-day kayaking trip for 100 people and it is your responsibility to give everyone a life vest of the right size. The sizes are: Large (above 90kg), Medium (70kg-90kg) and Small (below 70kg). Unfortunately, there is no way of knowing in advance who will come and what sizes of vests they need. You find out that it is possible to rent the vests at a very high price once the people arrive and tell you their required size. However, if you rent the vests in advance on the previous day, then you get a discount of 30%. You figure out that it is probably best to rent most of the vests at the cheap price, risking that some of them will not be used, and rent some extra at the higher price only in case you have too few vests of some size.

The only information that you have is that a week earlier, on a smaller trip with 20 people, they used 7 L, 10 M and 3 S. How many vests of each size would you order in advance, assuming that you want to minimise the expected cost?

You get 1 point if you solve the task by assuming that each of the 100 people has size L with probability 7/20, M with probability 10/20 and S with probability 3/20. You get 2 points if you additionally model the uncertainty in these probabilities due to estimation from a sample of only 20 people. Note that we do not expect you to provide the mathematically optimal answer. It is good enough if you explore many different possible numbers for the advance rental. The simplest answer of 35 L, 50 M, 15 S is not optimal and will not be considered correct.
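As a starting point only, here is a sketch of how the expected cost of one advance order could be estimated by simulation for the 1-point version (the absolute prices are arbitrary placeholders, since only the 30% discount matters; the order being evaluated is just an example, not a recommended answer):

```r
full_price  <- 1.0     # placeholder unit price for last-minute rental
cheap_price <- 0.7     # 30% discount for renting in advance
probs <- c(L = 7/20, M = 10/20, S = 3/20)

expected_cost <- function(order, n_people = 100, n_sim = 10000) {
  demand <- rmultinom(n_sim, size = n_people, prob = probs)  # 3 x n_sim matrix of vest demands
  shortage <- pmax(demand - order, 0)                        # vests that must be rented at full price
  mean(cheap_price * sum(order) + full_price * colSums(shortage))
}

expected_cost(c(L = 38, M = 54, S = 17))   # example order; try many different ones
```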