Homework 1 (due Feb 22nd) - Introduction (probability, R, business)
1. Read the first chapter on probability theory on the MathWiki website: http://mathwiki.cs.ut.ee/start. We recommend solving all the given exercises for practice. Play with the dice simulation. Explain how increasing the number of rolls changes the observed distribution of the outcomes.
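The effect of the number of rolls can also be explored outside the MathWiki simulation. The following is a minimal sketch in R, not the MathWiki implementation; the function name roll_frequencies, the seed, and the chosen numbers of rolls are arbitrary assumptions made for illustration.

```r
# Simulate rolls of a fair six-sided die and compare the observed
# relative frequencies with the theoretical probability 1/6.
set.seed(1)  # arbitrary seed, only for reproducibility

# Hypothetical helper: relative frequency of each face after n_rolls rolls
roll_frequencies <- function(n_rolls) {
  rolls <- sample(1:6, size = n_rolls, replace = TRUE)
  # factor() with fixed levels keeps faces that never occurred (count 0)
  table(factor(rolls, levels = 1:6)) / n_rolls
}

for (n in c(10, 100, 10000)) {
  freq <- roll_frequencies(n)
  cat("n =", n, " largest deviation from 1/6:",
      round(max(abs(freq - 1/6)), 4), "\n")
}
```

Calling barplot(roll_frequencies(n)) for several values of n gives a visual version of the same comparison.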
2. Solve the following tasks from MathWiki, describe your solutions in detail (not just the answer), and make sure you can explain them to an audience:
2.1. A company makes computer discs. It tested a random sample of discs from a large batch and found that the probability of any disc being defective is 0.025. Bob buys two discs. Calculate the probability that
- both discs are defective;
- only one disc is defective.
The company found 4 defective discs in the sample they tested. How many discs were likely tested?
2.2. At an exam, the probability that a student has prepared is 0.8 and the probability that the student has not prepared is 0.2. Prepared students succeed with probability 0.7; unprepared students succeed with probability 0.4. What is the probability that a randomly selected student will succeed?
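A hand-derived answer to task 2.2 can be sanity-checked by simulation. The sketch below is one possible way to do this in R; the number of simulated students and the seed are arbitrary assumptions, and the simulation does not replace the written explanation the task asks for.

```r
# Monte Carlo check for task 2.2: simulate many students and count successes.
set.seed(2)                  # arbitrary seed, only for reproducibility
n <- 100000                  # number of simulated students (assumption)

prepared  <- runif(n) < 0.8                  # TRUE with probability 0.8
p_success <- ifelse(prepared, 0.7, 0.4)      # success probability per student
success   <- runif(n) < p_success            # TRUE if the student succeeds

mean(success)                # simulated estimate of P(success)
```

The estimate should be close to the value obtained by conditioning on whether the student has prepared.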
3. What is the probability of getting 7 or 8 heads when you toss a fair coin 10 times? What is the probability of getting 70 or more heads when you toss a fair coin 100 times? Conduct a computational experiment by generating 10,000 sequences of 10 coin tosses and 10,000 sequences of 100 coin tosses.
Hint: recall the R tutorial practice on sampling and provide a solution written in R.
(You may also use other languages to get the task done, but using R whenever possible will help you in the future.)
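One way such an experiment could be set up is sketched below. The helper name simulate_heads and the seed are assumptions made for illustration; the exact binomial probabilities from dbinom and pbinom are included only as a cross-check of the simulated frequencies.

```r
set.seed(3)              # arbitrary seed, only for reproducibility
n_experiments <- 10000   # number of simulated sequences, as in the task

# Hypothetical helper: number of heads (coded as 1) in each simulated
# sequence of n_tosses fair coin tosses
simulate_heads <- function(n_tosses) {
  replicate(n_experiments,
            sum(sample(c(0, 1), size = n_tosses, replace = TRUE)))
}

heads10  <- simulate_heads(10)
heads100 <- simulate_heads(100)

# Simulated estimates
mean(heads10 %in% c(7, 8))    # share of sequences with 7 or 8 heads
mean(heads100 >= 70)          # share of sequences with 70 or more heads

# Exact binomial probabilities for comparison
sum(dbinom(7:8, size = 10, prob = 0.5))
pbinom(69, size = 100, prob = 0.5, lower.tail = FALSE)
```

Because 70 or more heads in 100 tosses is a very rare event, 10,000 simulated sequences may contain no such sequence at all, which is itself worth discussing in the report.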
4. Read the data from the course page into R: bodyfat.txt - https://courses.cs.ut.ee/MTAT.03.183/2015_spring/uploads/Main/bodyfat.txt (the description of the dataset is here: https://courses.cs.ut.ee/MTAT.03.183/2015_spring/uploads/Main/bodyfat_description.txt) and answer the following questions:
- What are the column names of the bodyfat dataset?
- How many observations (i.e. rows) are in this data frame?
- Print the first 3 lines from the dataset. What are the ages of the printed persons?
- Extract the last 2 rows of the data frame. What is the hipcirc of these persons?
- What is the value of DEXfat in the 31st row?
- How many missing values are in the age column?
- What is the mean of the waistcirc column? Exclude missing values (NA) from this calculation.
- Extract the subset of rows of the data frame where elbowbreadth values are above 6.1 and kneebreadth values are below 9.4. What is the mean of DEXfat in this subset?
- What is the most frequent age value?
- What is the minimum of DEXfat when age is equal to 49?
- Make a scatterplot with the DEXfat variable on the y-axis and waistcirc variable on the x-axis. Based on this plot, describe the relationship between the two variables.
Copy the R commands and the respective output into your report. Attend this week's R practice session to make sure you get the help you need :)
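The kinds of commands these questions call for can be tried on a small toy data frame first. In the sketch below all values are invented for illustration and only the column names come from the questions above; reading the real file with read.table and header = TRUE is an assumption about the file layout that should be checked against the file itself.

```r
# Toy data frame with INVENTED values; only the column names are taken
# from the questions. With the real file one would instead use
# something like read.table("bodyfat.txt", header = TRUE), assuming the
# file is whitespace-separated with a header row.
toy <- data.frame(
  age          = c(49, 52, NA, 49, 61),
  DEXfat       = c(30.1, 25.4, 41.0, 35.2, 28.7),
  waistcirc    = c(90, NA, 105, 98, 88),
  hipcirc      = c(100, 98, 112, 108, 99),
  elbowbreadth = c(6.0, 6.3, 6.5, 6.2, 5.9),
  kneebreadth  = c(9.0, 9.1, 9.8, 9.3, 8.8)
)

names(toy)                          # column names
nrow(toy)                           # number of observations
head(toy, 3)                        # first 3 rows
tail(toy, 2)$hipcirc                # hipcirc of the last 2 rows
toy$DEXfat[2]                       # value in a given row (row 2 here)
sum(is.na(toy$age))                 # number of missing values in a column
mean(toy$waistcirc, na.rm = TRUE)   # mean ignoring NA

# Rows satisfying two conditions at once
sub <- subset(toy, elbowbreadth > 6.1 & kneebreadth < 9.4)
mean(sub$DEXfat)

names(which.max(table(toy$age)))        # most frequent value (first on ties)
min(toy$DEXfat[which(toy$age == 49)])   # which() drops rows where age is NA

# Scatterplot with waistcirc on the x-axis and DEXfat on the y-axis
plot(toy$waistcirc, toy$DEXfat, xlab = "waistcirc", ylab = "DEXfat")
```

The same calls apply unchanged to the real data frame once it has been read in; only the object name and the row index differ.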
5. Selver is one of the main grocery store chains in Estonia. It belongs to the Kaubamaja group. Describe all the data or information that is potentially collected about the customers of Selver and their "Partnerkaart" loyalty card. Make some clear assumptions and estimate the amount of all this data collected during one year.
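A back-of-the-envelope estimate of this kind can be written down as a short calculation so that each assumption is visible and easy to change. Every number in the sketch below is a placeholder assumption, not a real figure about Selver or Partnerkaart; replace them with your own justified assumptions.

```r
# All inputs are placeholder assumptions, NOT real figures about Selver.
cardholders     <- 500000   # assumed number of active loyalty cards
visits_per_year <- 100      # assumed store visits per cardholder per year
items_per_visit <- 15       # assumed receipt lines (items) per visit
bytes_per_line  <- 100      # assumed storage needed per receipt line

total_lines <- cardholders * visits_per_year * items_per_visit
total_bytes <- total_lines * bytes_per_line
total_gb    <- total_bytes / 1e9    # gigabytes, using 1 GB = 10^9 bytes

cat("Receipt lines per year:", format(total_lines, big.mark = ","), "\n")
cat("Approximate volume per year:", total_gb, "GB\n")
```

Purchase records are only one data source; other sources you identify (for example customer profiles) can be added as further terms in the same way.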
6. (Bonus 2p) Formulate 6 business questions and goals for the analysis of all that data, and explain how answering them could make Selver/Kaubamaja more profitable.