Topics based on published papers:
Proposals from Balaji:
- Identification and assembly of genomes and genetic elements in complex metagenomic samples without using a reference genome. Given a metagenome, how can the reads be clustered by genome? Large genomic fragments can be used to probe deeper questions of microbial evolution, diversity, and function. http://www.nature.com/nbt/journal/v32/n8/full/nbt.2939.html Open question: could a single deeply sequenced sample be randomly partitioned into virtual samples for a similar analysis? (source: a blog)
- Comparing gene set enrichment tools (g:Profiler, DAVID and maybe some others) and assessing the accuracy of gene ID conversion tools, e.g. g:Convert. How can g:Profiler be improved?
- Gene set enrichment analysis and pathway reconstruction of metagenomic data to understand the function and phenotype of the biome.
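For the enrichment-tool comparison above, it can help to have a small reference calculation to check tool output against. The following is a minimal sketch of a one-sided hypergeometric over-representation test, a test commonly used for gene set enrichment. The function name and the toy numbers are assumptions made for illustration. Real tools add details such as multiple-testing correction and custom background sets.

```python
from math import comb


def enrichment_pvalue(population_size, annotated_in_population,
                      query_size, annotated_in_query):
    """One-sided hypergeometric p-value: probability of seeing at least
    `annotated_in_query` annotated genes in a random query of `query_size`
    genes drawn without replacement from the population."""
    upper = min(query_size, annotated_in_population)
    not_annotated = population_size - annotated_in_population
    total = comb(population_size, query_size)
    tail = sum(
        comb(annotated_in_population, k) * comb(not_annotated, query_size - k)
        for k in range(annotated_in_query, upper + 1)
    )
    return tail / total


# Toy numbers (hypothetical): 10 genes in total, 3 of them annotated to a
# term, and a query of 3 genes that are all annotated to that term.
p_value = enrichment_pvalue(10, 3, 3, 3)
print(p_value)
```

Running the same gene list and background through each tool and comparing against such a hand-computed value is one way to spot differences caused by ID conversion or background choice.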
Proposals from Hedi:
1. Protein microarrays: setups and applications
- http://onlinelibrary.wiley.com/doi/10.1002/elsc.201300052/abstract
- http://omicsonline.org/open-access/protein-microarrays-in-proteomewide-applications-jpb.S12-001.pdf
- http://link.springer.com/protocol/10.1007/978-1-4939-0992-6_14
- http://www.sciencedirect.com/science/article/pii/S1570963914000508
- http://abbs.oxfordjournals.org/content/43/3/161.full.pdf
2. Tissue similarities based on gene expression in known protein complexes http://nar.oxfordjournals.org/content/41/18/e171.full
3. Omics integration & targetable pathways http://www.nature.com/ncomms/2013/131018/ncomms3617/full/ncomms3617.html
4. eQTLs http://www.plosgenetics.org/article/info%3Adoi%2F10.1371%2Fjournal.pgen.1003486 http://www.sciencedirect.com/science/article/pii/S0092867412015565
Proposals from Lemps:
- Review of BioJS: "see here":http://bioinformatics.oxfordjournals.org/content/early/2013/02/23/bioinformatics.btt100.short. Several examples can be found there. Experimenting is advisable (for example, porting g:Profiler).
Proposals from Tauno:
- How to improve the power of GSEA by using multiple datasets: Large scale Gene Set Enrichment Analysis: http://www.biomedcentral.com/1471-2164/15/S1/S6
- Linking signaling pathways and expression: http://genome.cshlp.org/content/early/2014/09/02/gr.173039.114.abstract
- Compare multiple JavaScript visualisation libraries in the rCharts R package (http://ramnathv.github.io/rCharts/) for drawing a PCA plot. Consider unique and common features (e.g. which libraries allow deselecting groups by clicking on legend items) and performance with large datasets. R Shiny (http://shiny.rstudio.com/) can be used for testing them. (Anti Alman)
Proposals from Chitra:
1. A review of methods to identify differentially methylated regions, based on:
- Analysing and interpreting DNA methylation data http://www.nature.com/nrg/journal/v13/n10/full/nrg3273.html
- Bump hunting to identify differentially methylated regions in epigenetic epidemiology studies http://ije.oxfordjournals.org/content/41/1/200.short
- QDMR: a quantitative method for identification of differentially methylated regions by entropy http://nar.oxfordjournals.org/cgi/pmidlookup?view=long&pmid=21306990
- IMA: An R package for high-throughput analysis of Illumina’s 450K Infinium methylation data http://bioinformatics.oxfordjournals.org/content/early/2012/01/16/bioinformatics.bts013
2. A review of co-methylation studies, based on:
- Extracting coordinated patterns of DNA methylation and gene expression in ovarian cancer. http://www.ncbi.nlm.nih.gov/pubmed/23599224
- A network-based, integrative approach to identify genes with aberrant co-methylation in colorectal cancer http://pubs.rsc.org/en/Content/ArticleLanding/2014/MB/c3mb70270g#!divAbstract
Research questions that can also be taken in the Data Mining seminar.
Proposals from Kostya:
- Ancestry determination using the 1000 genomes data.
- Motivation: It would be great to make Estonian Genome Centre's data more accessible to the people. For example, designing a 23andMe-like interface on top of it would be great. In order to make that happen we need to find out what useful phenotype information we can derive from the genotype. Although the "interesting" phenotypes are believed to be those related to disease risks, before dealing with them it would be nice to see whether we can detect something seemingly as simple as, say, the nationality of the person in question. People at EGC have recently published a paper on "Ancestry determination"; however, their method was applied to an internal dataset, which can't be shared. I believe replicating this type of work on a publicly available dataset, perhaps ending up with an open solution for ancestry (and other phenotype) determination, would be useful, educational and maybe even publishable.
- Project vision: I have not researched the available literature in depth, but I believe there is a lot to build upon, so the project could end up being anything from a rerun of someone else's model, to a review of the state of the art, to a usable tool implementation.
- Detection of constituent metabolites from a mass-spectrum.
- Motivation: Ursel's group has a great mass spectrometer and wants to use it for various analyses. So far we have used the spectra as is; however, it would be great to try to convert them to a representation that indicates the amounts of certain metabolites present in the spectrum. I tried some trivial approaches, but the topic might be worthy of a more dedicated study.
- Project vision: Devise (and preferably implement) a method for converting mass-spec data into a list of compounds putatively present in the data. The baseline solution is a list of peaks.
- Using Map-Reduce-based approaches for bioinformatics.
- Motivation: Hadoop and the related map-reduce techniques are quite fashionable in large-scale data analysis, as they promise nearly linear scalability with the number of machines. However, they do not seem to be too popular in the bioinformatics community (e.g. google "The Genome Analysis Toolkit"). At BIIT we have no expertise in this at all. Consequently, a project that demonstrates the deployment and application of Hadoop for some practical use cases (e.g. fitting a linear model on the EGC methylation/SNP data, or any of the above-mentioned examples) would be very promising for us and, depending on the amount of background research performed, perhaps even publishable.
- Project vision: Implementation of a simple (or complicated, if skills/luck/time/etc. permit) analysis with a write-up that would help us do similar stuff in the future.
- Design and implementation of a data model and a query language for biological data available at BIIT.
- Motivation: We have a lot of useful biological data, both internal/EGC and public. Unfortunately, it is not organized in any way and access to it is always painful. Coming up with a decent data/query model along with some guidelines for keeping our data in the future is something that has to be done.
- Project vision: Documentation of the data types that are currently used in BIIT, along with recommendations for their storage and a list of supported queries. Not sure this is something publishable or usable for a thesis, but, if executed well, it would be extremely valuable.
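For the Map-Reduce topic above, the idea of fitting a linear model in a map-reduce style can be tried out without a cluster. The sketch below is plain Python, not Hadoop: it only imitates the pattern. Each chunk of (x, y) pairs is mapped to a small tuple of sums, the tuples are reduced by addition, and the least-squares line is computed from the combined sums. The function names and the toy data are assumptions made for illustration.

```python
from functools import reduce


def map_chunk(chunk):
    """Map step: summarise one chunk of (x, y) pairs as
    (count, sum x, sum y, sum x*x, sum x*y)."""
    return (
        len(chunk),
        sum(x for x, _ in chunk),
        sum(y for _, y in chunk),
        sum(x * x for x, _ in chunk),
        sum(x * y for x, y in chunk),
    )


def combine(left, right):
    """Reduce step: partial sums from two chunks simply add up."""
    return tuple(a + b for a, b in zip(left, right))


def fit_line(chunks):
    """Least-squares fit of y = slope * x + intercept from per-chunk sums."""
    n, sx, sy, sxx, sxy = reduce(combine, map(map_chunk, chunks))
    denominator = n * sxx - sx * sx
    if denominator == 0:
        raise ValueError("all x values are identical; slope is undefined")
    slope = (n * sxy - sx * sy) / denominator
    intercept = (sy - slope * sx) / n
    return slope, intercept


# Toy data (hypothetical) lying on y = 2x + 1, split into two chunks as if
# they were stored on two machines.
chunks = [[(0, 1), (1, 3)], [(2, 5), (3, 7)]]
slope, intercept = fit_line(chunks)
print(slope, intercept)
```

Because each chunk is reduced to five numbers before anything is combined, the amount of data exchanged between workers does not grow with the chunk size, which is where the promised scalability would come from. In a real deployment the map and reduce functions would be handed to the framework rather than called locally.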
Proposals from Elena:
- Comparison of the protein-protein interaction (PPI) ranking algorithms.
- Motivation: In many cases, when studying a biological process or disease, we want to understand which protein interactions play the most important role. Identifying such protein interactions will help to develop new drug targets to treat diseases.
- The aim of the project is to find different PPI ranking methods (potentially network-based), run these algorithms on the provided dataset, and compare the results.
- Network-based data integration.
- Motivation: Various types of biological experiments are currently being carried out. As a result, we have many different data types (gene expression, protein-protein interactions, DNA methylation, etc.). Although analysing these data types separately brings new knowledge about a biological process or disease, the joint effect of the different regulators is not taken into account, which can potentially result in misleading conclusions about the underlying regulatory program.
- The aim of the project is to find network-based method(s) for the integration of different data types and to run them on the selected data sets.
- Inferring regulatory program in stem cells using gene expression and histone methylation.
- Motivation: Epigenetic mechanisms such as histone methylation play a major role in early development, when a stem cell "chooses" the cell type (lineage) that it wants to become. We want to understand how histone methylation "regulates" changes in gene expression when a cell chooses its fate.
- The aim of the project is to find methods/tools that combine histone methylation and gene expression data sets, compare the tools, and run them on stem-cell-derived data sets.
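For the PPI ranking topic above, a trivial network-based baseline can be useful as a point of comparison for published methods. The sketch below scores each interaction by the Jaccard index of its two proteins' neighbourhoods, so interactions whose partners share many other partners rank higher. This is only one naive topological score chosen for illustration, not a method taken from the proposals. The function name and the protein identifiers are hypothetical.

```python
from collections import defaultdict


def rank_interactions(edges):
    """Rank undirected interactions by the Jaccard index of the two
    endpoints' neighbour sets (each endpoint excluded from the other's set)."""
    neighbours = defaultdict(set)
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)

    scored = []
    for a, b in edges:
        around_a = neighbours[a] - {b}
        around_b = neighbours[b] - {a}
        union = around_a | around_b
        score = len(around_a & around_b) / len(union) if union else 0.0
        scored.append(((a, b), score))

    # Highest score first; sorted() is stable, so ties keep input order.
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Toy network with hypothetical protein identifiers.
toy_edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D")]
ranked = rank_interactions(toy_edges)
for interaction, score in ranked:
    print(interaction, score)
```

Any method under comparison can be wrapped to return the same list of (interaction, score) pairs, which makes rank correlations between methods straightforward to compute.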