The data comes from a publication http://www.pnas.org/content/102/5/1572.full (reading it is optional).
The data describes N = 109 yeast cells that come from a cross of two yeast strains (BY and RM). There are M ≈ 3000 SNPs that differ between the parent strains. Each offspring can carry either the BY or the RM variant of any given SNP.
File description:
genotype.csv: (N+3 rows, M+1 columns)
Line 1. - SNP Title (nnn_mm; nnn - location on the chr, mm - chr number)
Line 2. - Chromosome number for the SNP
Line 3. - Location of the SNP on the chromosome
All other lines:
Column 1. - Cell ID
All other columns - individual SNP genotypes; 0 = BY, 1 = RM, NA = not measured. SNPs can be identified using the heading rows (lines 1-3).
expression.csv: (N+1 rows, G+1 columns)
Line 1. - Gene Title (mm_kkk_lll; mm - chr number, kkk - start coordinate of the gene on the chr, lll - end coordinate of the gene on the chr)
All other rows:
Column 1. - Cell ID
All other columns - normalised expression values for yeast genes. Genes are identified by the title row (location on the genome).
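The layout described above can be parsed along the following lines. This is only a minimal sketch using the Python standard library: it assumes the files are comma-separated, that chromosome numbers and SNP locations are integers, and that unmeasured values appear as the text NA. The function names (parse_cell, read_genotype, read_expression) and the small demo tables are made up for illustration and are not part of the data set.

```python
import csv
import io


def parse_cell(text, cast):
    """Convert one table cell with `cast`; 'NA' or an empty cell becomes None."""
    text = text.strip()
    return None if text in ("NA", "") else cast(text)


def read_genotype(handle):
    """Return (snps, genotypes) from a genotype.csv-style table.

    snps      - list of (title, chromosome, position), one per SNP column
    genotypes - dict mapping cell ID to a list of 0 / 1 / None values
    """
    rows = [row for row in csv.reader(handle) if row]  # skip blank lines
    titles = rows[0][1:]                          # line 1: SNP titles
    chromosomes = [int(v) for v in rows[1][1:]]   # line 2: chromosome numbers
    positions = [int(v) for v in rows[2][1:]]     # line 3: SNP locations
    snps = list(zip(titles, chromosomes, positions))
    genotypes = {row[0]: [parse_cell(v, int) for v in row[1:]] for row in rows[3:]}
    return snps, genotypes


def read_expression(handle):
    """Return (genes, expression) from an expression.csv-style table."""
    rows = [row for row in csv.reader(handle) if row]
    genes = rows[0][1:]                           # line 1: gene titles
    expression = {row[0]: [parse_cell(v, float) for v in row[1:]] for row in rows[1:]}
    return genes, expression


# Tiny made-up tables in the documented layout (not real data).
demo_genotype = io.StringIO(
    "id,100_1,250_1,75_2\n"
    "chr,1,1,2\n"
    "pos,100,250,75\n"
    "cell_1,0,1,NA\n"
    "cell_2,1,NA,0\n"
)
demo_expression = io.StringIO(
    "id,1_500_900,2_40_310\n"
    "cell_1,0.25,-1.5\n"
    "cell_2,-0.75,2.0\n"
)

snps, genotypes = read_genotype(demo_genotype)
genes, expression = read_expression(demo_expression)
print(snps[2])              # ('75_2', 2, 75)
print(genotypes["cell_1"])  # [0, 1, None]
print(expression["cell_2"]) # [-0.75, 2.0]
```

For the real files, the StringIO objects would be replaced by open("genotype.csv", newline="") and open("expression.csv", newline=""); the delimiter should be checked against the actual files first.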
Tasks:
- read in the files.
- choose 2 - 5 genes (columns in expression.csv file) and plot their expression distribution.
- choose 2 - 5 individual yeast IDs (rows in genotype.csv file) and plot SNP values.
- implement LOD score calculation (formulas available here).
- calculate LOD score for each SNP for 5 genes. For each gene, choose the SNP with the largest LOD score and plot the expression distribution separately for xi=0 and for xi=1 (BY and RM genotype at that SNP).
- calculate LOD score for each SNP for all genes. Draw heat map.
5 tasks out of 6 for full credit.
Properly annotate figures. Submit in PDF format.
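As a starting point for the LOD tasks, here is a hedged sketch of a single-SNP LOD score. The formulas referenced in the task list are not reproduced in this text, so the sketch assumes the common marker-regression form LOD = (n/2) * log10(RSS0 / RSS1), where RSS0 is the residual sum of squares around the overall mean expression and RSS1 is the residual sum of squares around the separate BY and RM means. If the referenced formulas differ, they take precedence. The function names and the demo numbers are invented for illustration; genotypes are expected as 0, 1, or None for unmeasured values.

```python
import math


def rss(values):
    """Residual sum of squares of `values` around their own mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def lod_score(genotype, expression):
    """LOD score of one SNP for one gene (assumed formula, see the text above).

    genotype   - sequence of 0 (BY), 1 (RM) or None (not measured), one per cell
    expression - sequence of expression values (or None), in the same cell order
    """
    pairs = [(x, y) for x, y in zip(genotype, expression)
             if x is not None and y is not None]
    by = [y for x, y in pairs if x == 0]
    rm = [y for x, y in pairs if x == 1]
    if not by or not rm:
        return 0.0                      # only one genotype observed: no contrast
    rss0 = rss(by + rm)                 # null model: one common mean
    rss1 = rss(by) + rss(rm)            # alternative: one mean per genotype
    if rss1 == 0.0:
        return math.inf if rss0 > 0.0 else 0.0
    n = len(by) + len(rm)
    # Clamp at 0 to absorb tiny negative values caused by rounding.
    return max(0.0, (n / 2) * math.log10(rss0 / rss1))


def scan_snps(genotype_rows, expression):
    """LOD score of every SNP column for one gene; rows are cells."""
    n_snps = len(genotype_rows[0])
    return [lod_score([row[j] for row in genotype_rows], expression)
            for j in range(n_snps)]


# Made-up example: 6 cells, 2 SNPs, one gene.
demo_rows = [[0, 1], [0, 0], [0, 1], [1, 0], [1, None], [1, 1]]
demo_expr = [1.0, 2.0, 3.0, 11.0, 12.0, 13.0]

scores = scan_snps(demo_rows, demo_expr)
best = scores.index(max(scores))
print(round(scores[0], 2))  # about 4.76
print(best)                 # 0
```

Cells with a missing genotype or expression value are dropped per SNP here, so n can differ between SNPs; that is a design choice of this sketch, not something the task statement prescribes.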