Bioinformatics seminar - Courses - Institute of Computer Science

Topics & Articles

Here are the topics, with scientific papers, that we believe would make good seminar presentations. Each topic has two articles that introduce methods for solving a similar problem in different ways. The aim of a presentation is to compare those two methods. A combined presentation would therefore look as follows: the first presenter introduces the first method, the second presenter introduces the second method, and then the reviewers give their opinion. It would be nice if the presenters discussed the methods beforehand and adjusted their presentations so that the problem, and the tools intended to solve it, are introduced to the audience in a clear and coherent manner. If there are articles that benchmark those tools, it is useful to check them out, since they give a better and wider understanding of the topic. If you have trouble finding benchmarking articles (or something introductory on a given topic), please ask us and we will gladly help!

If you are interested in another method or even another topic, please let us know and we will add it here (provided, of course, that it is suitable).

Sequence alignment

Here you can find programs that are used for aligning short reads (segments) of DNA sequences to a long reference sequence (e.g. the human genome). This process is also known as read mapping and is usually among the first steps of almost every sequence analysis pipeline.

Method 1: BWA
Method 2: Bowtie 2
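
To make the idea of read mapping concrete, here is a minimal sketch that looks up every exact occurrence of each read in a reference string. The reference, the reads and the function name `map_reads` are made up for illustration. BWA and Bowtie 2 do not work this way: they build an index based on the Burrows-Wheeler transform and tolerate mismatches and gaps, whereas this brute-force search only finds perfect matches on the forward strand.

```python
def map_reads(reference, reads):
    """Return {read: [0-based start positions]} for exact forward-strand matches."""
    hits = {}
    for read in reads:
        positions = []
        start = reference.find(read)
        while start != -1:
            positions.append(start)
            # Continue searching one base further to catch overlapping hits.
            start = reference.find(read, start + 1)
        hits[read] = positions
    return hits


# Hypothetical toy data, not taken from any real genome.
reference = "ACGTTAGCCGATAGGCTTAACGTTAGC"
reads = ["TAGCCG", "ACGTTAGC", "GGGGGG"]
mapping = map_reads(reference, reads)

for read, positions in mapping.items():
    print(read, positions)
```

A read with two hits illustrates why repeats in the genome make mapping ambiguous, and a read with no hits shows why real tools need to allow for sequencing errors.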

RNA-Seq alignment

The methods in this section serve a similar purpose to those in the previous topic, but now the short reads originate from mRNA (the product of a gene), not DNA. Why is a separate program needed for mRNA? BWA and Bowtie are not aware of splicing, an mRNA processing step: nascent mRNA contains segments called exons and introns, the introns are usually spliced out, and the processed mRNA contains only exons. Therefore, reads from mRNA cannot always be mapped directly onto DNA, and the absence of introns in mRNA has to be dealt with. kallisto is somewhat different: it does "pseudoalignment". To find out what that is, check the article!

Method 1: HISAT Khatia Kilanava
Method 2: kallisto Natia Doliashvili
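
The splicing problem described above can be sketched with a toy example: a read that spans an exon-exon junction does not occur contiguously in the genome, but it does match once it is split into two parts separated by a gap (the intron). The sequences and the function `map_spliced` are hypothetical, and this is not how HISAT or kallisto work internally; it only shows why a splice-unaware exact search fails.

```python
def map_spliced(genome, read, min_anchor=3):
    """Return a list of (start, end) genome blocks covering the read.

    One block means a contiguous match, two blocks mean a spliced match,
    and an empty list means the read could not be placed.
    """
    pos = genome.find(read)
    if pos != -1:
        return [(pos, pos + len(read))]
    # Try every split point that leaves at least min_anchor bases on each side.
    for cut in range(min_anchor, len(read) - min_anchor + 1):
        left, right = read[:cut], read[cut:]
        left_pos = genome.find(left)
        if left_pos == -1:
            continue
        # The right part must start after the left part, leaving a gap of >= 1 base.
        right_pos = genome.find(right, left_pos + cut + 1)
        if right_pos != -1:
            return [(left_pos, left_pos + cut), (right_pos, right_pos + len(right))]
    return []


# Hypothetical gene: two exons separated by an intron that starts with GT and ends with AG.
exon1, intron, exon2 = "ATGGCCAAC", "GTAAGTTTTTCAG", "CTCGATTGA"
genome = exon1 + intron + exon2
read = exon1[-5:] + exon2[:5]  # read taken from the spliced mRNA

blocks = map_spliced(genome, read)
print(blocks)
```

The gap between the two returned blocks corresponds to the intron. Real spliced aligners additionally have to handle mismatches, short anchors and ambiguous junction positions.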

Full-length transcript assembly from RNA-Seq data

Now that the reads are mapped (see the previous topic), they need to be assembled into longer sequences called transcripts (i.e. to reconstruct the actual mRNAs that were the source of the reads) in order to understand which genes were active. To understand how strongly those genes were expressed, it is also important to estimate transcript abundance. Both tasks are performed by the following tools:

Method 1: Cufflinks Simona Micevska
Method 2: StringTie Hristijan Sardjoski
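
As a rough illustration of what "transcript abundance" means, the sketch below converts read counts into TPM (transcripts per million), a commonly used abundance unit that corrects for transcript length. The transcript names, counts and lengths are invented, and the sketch assumes each read has already been assigned to exactly one transcript. Deciding how to split reads between overlapping isoforms is the hard part that the assembly tools solve with statistical models, and it is not shown here.

```python
def tpm(read_counts, lengths_bp):
    """Convert per-transcript read counts to TPM.

    read_counts and lengths_bp are dicts keyed by transcript name.
    """
    # Reads per base: longer transcripts produce more reads at equal abundance.
    rates = {name: read_counts[name] / lengths_bp[name] for name in read_counts}
    total_rate = sum(rates.values())
    # Scale so that all values sum to one million.
    return {name: rate / total_rate * 1_000_000 for name, rate in rates.items()}


# Hypothetical transcripts.
counts = {"tx_A": 500, "tx_B": 500, "tx_C": 1000}
lengths = {"tx_A": 1000, "tx_B": 2000, "tx_C": 2000}
abundance = tpm(counts, lengths)

for name, value in abundance.items():
    print(name, round(value))
```

Note that tx_A and tx_B have the same read count, yet tx_A comes out as more abundant because it is half as long.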

Alternative expression

In the topic "RNA-Seq alignment" we mentioned that mRNAs are spliced (introns are excluded and exons are preserved). For many genes, multiple mRNA isoforms can be produced: some introns can be retained and some exons may be omitted. It is also possible that the mRNA starts or ends in a different place. This process is called alternative splicing and it generates variability. This variability can alter a gene's function and enables the gene to perform more roles, which is why it is relevant to identify those isoforms. So, transcripts that originate from one gene vary (some are a bit longer or do not have exactly the same sequence), and the following tools were developed to identify them.

Method 1: LeafCutter Chau Anh
Method 2: MISO Nurlan Kerimov
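
One simple way to quantify whether an exon is included or omitted is the "percent spliced in" (PSI) ratio: the fraction of informative reads that support inclusion of the exon. The sketch below is a naive read-count version with made-up numbers and a hypothetical function name. It ignores corrections such as the differing number of junctions that support each isoform, and the tools in this topic use statistical models rather than a plain ratio.

```python
def percent_spliced_in(inclusion_reads, exclusion_reads):
    """Fraction of reads supporting exon inclusion, or None without coverage."""
    total = inclusion_reads + exclusion_reads
    if total == 0:
        return None
    return inclusion_reads / total


# Hypothetical counts for one exon in two conditions.
psi_control = percent_spliced_in(inclusion_reads=30, exclusion_reads=10)
psi_treated = percent_spliced_in(inclusion_reads=10, exclusion_reads=30)
print(psi_control, psi_treated)
```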

Colocalisation between eQTLs and complex disease associations

There are many SNPs (Single Nucleotide Polymorphisms): specific positions in the genome that can vary between people. For example, the majority of people may have A at a particular position while a minority have C, which can change transcription factor binding or an amino acid of a protein. GWASs (Genome-Wide Association Studies) aim to find SNPs that are associated with a disease (e.g. schizophrenia) or another complex trait (e.g. height, weight). The next challenge is to understand the molecular basis of those associations. One way to do this is to find out which of the associated SNPs affect mRNA expression levels (i.e. the activity of genes). This means that GWAS datasets are integrated with datasets from eQTL studies (Expression Quantitative Trait Loci: genomic positions or regions that contribute to variation in mRNA expression levels). The following articles show how this can be done.

Method 1: Regulatory trait concordance (RTC)
Method 2: coloc
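
To illustrate what an eQTL is, the sketch below estimates the effect of a SNP on gene expression as the slope of a simple linear regression of expression on allele dosage (0, 1 or 2 copies of the minor allele). The data and the function name `eqtl_slope` are invented. This only covers the eQTL side of the analysis; deciding whether a GWAS signal and an eQTL signal point to the same variant is the actual colocalisation problem that the two articles address, and it is not shown here.

```python
def eqtl_slope(dosages, expression):
    """Least-squares slope of expression on allele dosage."""
    n = len(dosages)
    mean_g = sum(dosages) / n
    mean_e = sum(expression) / n
    cov = sum((g - mean_g) * (e - mean_e) for g, e in zip(dosages, expression))
    var = sum((g - mean_g) ** 2 for g in dosages)
    return cov / var


# Hypothetical data: six individuals, minor-allele count and expression of one gene.
dosages = [0, 0, 1, 1, 2, 2]
expression = [5.0, 5.4, 6.1, 5.9, 7.0, 7.2]
slope = eqtl_slope(dosages, expression)
print(f"expression change per minor allele: {slope:.2f}")
```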

ChIP-seq peak calling

ChIP-Seq (Chromatin Immunoprecipitation Sequencing) is a method for identifying DNA binding sites of transcription factors (TFs) and other proteins. ChIP-Seq produces reads (sequences) that cover only DNA regions that interact with the proteins or are very close to them. Peak calling is a computational method for identifying regions of the genome that are enriched with aligned reads from a ChIP-Seq experiment. Those regions include TF binding sites (but also binding sites of other proteins). The following methods show how to identify candidate peaks (regions) and test them for statistical significance.

Method 1: MACS2
Method 2: BCP
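
The two steps mentioned above, finding candidate regions and testing them for significance, can be sketched with fixed-size windows and a Poisson background model. The window counts, the threshold `alpha` and the function names are assumptions made for this example. Using the overall mean count as the background rate is a strong simplification, and real peak callers estimate the background more carefully (for example locally or from a control sample).

```python
import math


def poisson_sf(k, lam):
    """P(X >= k) for X ~ Poisson(lam)."""
    if k <= 0:
        return 1.0
    cdf = sum(math.exp(-lam) * lam ** i / math.factorial(i) for i in range(k))
    return max(0.0, 1.0 - cdf)


def call_peaks(window_counts, alpha=1e-3):
    """Return (window index, read count, p-value) for significantly enriched windows."""
    background = sum(window_counts) / len(window_counts)  # crude background rate
    peaks = []
    for index, count in enumerate(window_counts):
        p_value = poisson_sf(count, background)
        if p_value < alpha:
            peaks.append((index, count, p_value))
    return peaks


# Hypothetical read counts in ten consecutive genomic windows.
window_counts = [3, 2, 4, 3, 25, 2, 3, 4, 2, 2]
peaks = call_peaks(window_counts)

for index, count, p_value in peaks:
    print(f"window {index}: {count} reads, p = {p_value:.2e}")
```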

Genotype imputation

Let's take a SNP (described in the colocalisation topic) on chromosome 1 that can be either A or C at a particular position. In humans, each cell normally contains two sets of 23 chromosomes, for a total of 46 (one set is inherited from the mother and the other from the father). This means that we have two copies of chromosome 1, so we can have A on one of them and C on the other, or the same letter on both. The possibilities are therefore AC, AA and CC, and they are called genotypes. Some SNPs are located very close to each other. Let's take a second SNP that is very close to the first one but can be either G or T. Some combinations of those SNPs tend to be more frequent than others, e.g. a chromosome that has A at the first SNP much more frequently has T at the second SNP. This information can be used to infer unobserved genotypes, and this process is called imputation. To illustrate, a hypothetical experiment goes as follows: a biologist genotypes 2000 SNPs, and there are 1000 SNPs close to them that were not determined experimentally. However, information about which letters (nucleotides) at which positions tend to occur together is available from other projects, and it can be used to impute the missing SNPs. After imputation we have 3000 SNPs and can test associations with all of them (e.g. perform a GWAS analysis, see the colocalisation topic again). The following articles show how the imputation of missing genotypes can be done.

Method 1: IMPUTE
Method 2: STITCH
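
The example with the two neighbouring SNPs can be turned into a small sketch: given a reference panel of chromosomes where both SNPs are known, the unobserved allele at the second SNP is guessed as the one that most often accompanies the observed allele at the first SNP. The panel frequencies and the function name `impute_allele` are invented for illustration. Real imputation methods work with many SNPs at once and with probabilistic haplotype models, so this is only the core intuition.

```python
from collections import Counter


def impute_allele(reference_haplotypes, observed_allele):
    """Guess the SNP2 allele from the SNP1 allele.

    reference_haplotypes is a list of (snp1_allele, snp2_allele) pairs, one per
    reference chromosome. Returns (most likely allele, estimated probability).
    """
    counts = Counter(
        snp2 for snp1, snp2 in reference_haplotypes if snp1 == observed_allele
    )
    if not counts:
        return None, 0.0
    allele, n = counts.most_common(1)[0]
    return allele, n / sum(counts.values())


# Hypothetical reference panel of 100 chromosomes: A mostly travels with T, C with G.
reference_haplotypes = (
    [("A", "T")] * 45 + [("A", "G")] * 5 + [("C", "G")] * 40 + [("C", "T")] * 10
)

guess_for_a = impute_allele(reference_haplotypes, "A")
guess_for_c = impute_allele(reference_haplotypes, "C")
print(guess_for_a, guess_for_c)
```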

DNA-binding motifs

Here you can find tools for the discovery and analysis of sequence motifs: short DNA sequences of biological significance, e.g. because proteins such as transcription factors (proteins that activate mRNA production from a gene) bind to them.

Method 1: DREME Vladyslav Fediukov
Method 2: RSAT Sofiya Demchuk
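
A very reduced version of motif discovery is to count, for every short word (k-mer), in how many sequences it occurs, and to compare a set of protein-bound sequences with a set of control sequences. The sequences, the word length and the function name below are made up for this sketch. Real motif finders allow degenerate positions and assess statistical significance, neither of which is done here.

```python
from collections import Counter


def kmer_sequence_counts(sequences, k):
    """Count in how many sequences each k-mer appears at least once."""
    counts = Counter()
    for seq in sequences:
        kmers = {seq[i:i + k] for i in range(len(seq) - k + 1)}
        counts.update(kmers)
    return counts


# Hypothetical sequences: the bound set shares a planted 7-letter word.
bound = ["AATGACTCAGG", "CCTGACTCATT", "GTGACTCAAC"]
control = ["AACCGGTTAGG", "CCATATGCATT", "GTTTGGCCAAC"]

bound_counts = kmer_sequence_counts(bound, k=7)
control_counts = kmer_sequence_counts(control, k=7)
best_kmer, best_count = bound_counts.most_common(1)[0]
print(best_kmer, best_count, control_counts[best_kmer])
```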

DNA methylation

DNA methylation is a process by which methyl groups (CH3) are added to DNA, more specifically to the nucleotide C in the CpG context (p stands for the phosphate bond and stresses that the C and G nucleotides are adjacent on the same strand). The main purpose of DNA methylation is to modify gene expression (e.g. when CpGs in a promoter, i.e. the regulatory region that initiates production of mRNA, are methylated, gene expression is repressed). To identify methylated Cs, a specific laboratory procedure is carried out that converts unmethylated Cs to Ts. The following methods aim to map the treated reads (obtained from this experiment) to the genome and find out which positions were methylated.

Method 1: BSMAP Kaur Karus
Method 2: BWA-meth Sander Tars
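
Once a treated read has been placed on the genome, reading off the methylation state is straightforward, as the sketch below shows: a reference C in CpG context that is still C in the read was protected by methylation, while one that appears as T was unmethylated and got converted. The sequences, the `offset` parameter and the function name are hypothetical, and only the forward strand is considered. The sketch assumes the alignment is already known; finding that alignment despite the C-to-T differences is the difficult step the mapping tools take care of.

```python
def call_methylation(reference, read, offset=0):
    """Classify CpG cytosines covered by an already aligned, converted read.

    Returns {reference position: "methylated" or "unmethylated"}.
    """
    calls = {}
    for i, base in enumerate(read):
        pos = offset + i
        # Only consider cytosines followed by G (CpG context) in the reference.
        if reference[pos:pos + 2] != "CG":
            continue
        if base == "C":
            calls[pos] = "methylated"      # protected from conversion
        elif base == "T":
            calls[pos] = "unmethylated"    # converted C -> T
    return calls


# Hypothetical reference and a converted read aligned at position 0.
reference = "ACGTCGATCCGA"
read = "ACGTTGATTCGA"
calls = call_methylation(reference, read)
print(calls)
```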

Functional enrichment analysis

Many experiments identify hundreds or even thousands of genes (e.g. an experiment that looks for genes differentially expressed in cancer compared to normal tissue). The next logical question is whether there is something "interesting" about this group of genes. One way to answer it (and to better understand the biology behind the gene set) is to perform functional enrichment analysis. This method aims to identify classes of genes that are over-represented in a large set of genes. It compares the input gene set to terms in the Gene Ontology (a representation of genes through attributes such as the biological processes they are involved in) and performs a statistical test to see which terms are enriched among the input genes. Gene Set Enrichment Analysis (GSEA) and g:Profiler pursue the same goal, but g:Profiler uses a discrete hypergeometric test while GSEA uses a continuous test of enrichment.

Method 1: GSEA
Method 2: g:Profiler
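
The hypergeometric test mentioned above asks: if we drew our gene list at random from all genes, how likely would it be to contain at least as many genes annotated with a given term as we actually observe? The sketch below computes that tail probability with the standard library. The numbers and the function name are made up, and no multiple-testing correction is applied, which any real analysis over many terms would need.

```python
from math import comb


def hypergeom_enrichment_p(population, annotated, sample, overlap):
    """P(X >= overlap) when drawing `sample` genes from `population` genes,
    of which `annotated` carry the term of interest."""
    max_overlap = min(annotated, sample)
    favourable = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(overlap, max_overlap + 1)
    )
    return favourable / comb(population, sample)


# Hypothetical example: 20 genes in total, 5 annotated with the term,
# and 3 of the 5 genes in our list carry the annotation.
p_value = hypergeom_enrichment_p(population=20, annotated=5, sample=5, overlap=3)
print(f"p = {p_value:.4f}")
```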

Image phenotypes

Nowadays, deep learning is often used for the analysis of microscopy images (e.g. to detect whether a drug worked, i.e. caused the desired change in the cell, or to diagnose skin cancer). Here you can find some examples:

Method 1: Accurate Classification of Protein Subcellular Localization from High-Throughput Microscopy Images Using Deep Learning Mikhail Papkov
Method 2: Automated analysis of high-content microscopy data with deep learning Muhammad Uzair

Mendelian Randomization

This method is used for finding causal associations. Imagine an experiment in which we want to know whether a specific biomarker (e.g. an inflammation marker in the blood) is associated with a disease (let's say Alzheimer's). We run the experiment and observe this relationship (an increased level of the biomarker is associated with Alzheimer's disease). Okay, there is an association, but that does not necessarily mean that the relationship is causal. All of us have seen correlations between strange things, e.g. ice cream consumption correlates with drowning. This example is most likely confounded by another factor, such as good weather. Good weather correlates with both the consumption of ice cream and the risk of drowning (because people swim more in good weather). So, how do we find out whether our observed association is causal? Many GWAS studies (see the colocalisation and genotype imputation topics) report associations between SNPs and all sorts of traits. If we find a study that shows SNPs associated with our biomarker of interest, then we have two correlations: one between the biomarker and Alzheimer's, the other between the SNPs and the biomarker. Now we can check whether there is an association between those SNPs and Alzheimer's, and if there is, there might be a causal relationship. Of course, this does not rule out a confounder of the observed association; it only suggests that, despite any confounding, the biomarker has a real effect as well. This method is called Mendelian Randomization and it makes several assumptions. Firstly, the selected SNPs should not affect the disease in any way other than through the biomarker. Secondly, the relationship between the SNPs and the biomarker should be unidirectional (which it is, because we all get our gene variants pretty much at random). Here you can find articles that discuss this method:

Article 1: Fulfilling the promise of Mendelian randomization
Article 2: Mendelian randomization: a premature burial?
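
The reasoning above can be checked on simulated data. In the sketch below a hidden confounder influences both the biomarker and the disease score, so the plain biomarker-disease slope overestimates the true effect, while the ratio of the SNP-disease association to the SNP-biomarker association (often called the Wald ratio) stays close to it. All effect sizes, the allele frequency and the continuous "disease score" are assumptions chosen for this simulation, and the SNP is constructed so that it satisfies the assumptions listed above.

```python
import random


def covariance(xs, ys):
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n


def simulate(n=20000, causal_effect=0.5, seed=1):
    """Simulate SNP dosage, biomarker level and a continuous disease score."""
    rng = random.Random(seed)
    snp, biomarker, disease = [], [], []
    for _ in range(n):
        # Allele count 0, 1 or 2 with an assumed allele frequency of 0.3.
        g = int(rng.random() < 0.3) + int(rng.random() < 0.3)
        u = rng.gauss(0, 1)  # unobserved confounder (the "good weather")
        x = 1.0 * g + u + rng.gauss(0, 1)            # biomarker
        y = causal_effect * x + u + rng.gauss(0, 1)  # disease score
        snp.append(g)
        biomarker.append(x)
        disease.append(y)
    return snp, biomarker, disease


snp, biomarker, disease = simulate()

# Naive estimate: slope of disease on biomarker, biased by the confounder.
observational = covariance(biomarker, disease) / covariance(biomarker, biomarker)
# Mendelian randomization estimate: SNP-disease effect over SNP-biomarker effect.
wald_ratio = covariance(snp, disease) / covariance(snp, biomarker)

print(f"observational slope: {observational:.2f}")
print(f"Wald ratio estimate: {wald_ratio:.2f} (simulated causal effect: 0.5)")
```

If the SNP also influenced the disease through another route, the ratio would be biased too, which is why the first assumption matters so much in practice.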

PS: If you need some extra biological information to better understand your article, there is a good chance that you will find it here: https://www.nature.com/scitable/topic/gene-expression-and-regulation-15

Bioinformatics seminar 2017/18 spring