Distributed Systems Seminar - Courses - Institute of Computer Science

List of possible topics

1. Parallel Scientific Applications and Concurrent Computing (Eero Vainikko)

Overview of open-source numerical modelling tools for solving large problems in science and engineering with partial differential equations. The overview should focus on the possibilities for parallel computing, including cluster computing and GPGPU capabilities. Examples of such tools include OpenFOAM, FEniCS, Deal.II, Elmer FEM, GetFEM++, FreeFem++, CalculiX, MFEM, PyLith.
Parallel computations using Jupyter
- The Jupyter environment is a popular way to perform computations, execute code directly via a web page, and create interactive content. The aim is to explore recent libraries for performing parallel computations directly on the page. A survey of existing solutions, with testing and evaluation, will be performed. Of particular interest is how to use Jupyter in teaching parallel computing, as well as how to debug parallel programs written in a Jupyter notebook.
Numerical methods utilizing mixed-precision arithmetic
- Abdelfattah, A., H. Anzt, E. Boman, E. Carson, T. Cojean, J. Dongarra et al., A Survey of Numerical Methods Utilizing Mixed Precision Arithmetic, Innovative Computing Laboratory University of Tennessee, Tech. Report, 2020.
- Also, in particular:
  - Multiprecision arithmetic in preconditioning techniques for iterative solvers
  - Multiprecision arithmetic for power-saving purposes on different devices
  - Posit Arithmetic vs Interval Arithmetic - Gustafson and Yonemoto, "Beating Floating Point at its Own Game: Posit Arithmetic" http://dx.doi.org/10.14529/jsfi170206 (+ other papers)
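One pattern that often appears in mixed-precision work is iterative refinement: solve in low precision, compute the residual in high precision, and apply a correction. The sketch below is only an illustration of that idea and is not code from the survey. It assumes a small dense system, emulates binary32 by rounding through `struct`, and repeats the full low-precision solve in every step instead of reusing a factorization. The names `f32`, `solve_low`, `residual` and `refine` are hypothetical.

```python
import struct


def f32(x):
    """Round a value to the nearest binary32 number (emulated low precision)."""
    return struct.unpack("f", struct.pack("f", float(x)))[0]


def solve_low(A, b):
    """Gaussian elimination with partial pivoting; every result is rounded to binary32."""
    n = len(b)
    # Augmented matrix [A | b], stored in emulated binary32.
    M = [[f32(v) for v in row] + [f32(rhs)] for row, rhs in zip(A, b)]
    for k in range(n):
        p = max(range(k, n), key=lambda i: abs(M[i][k]))
        M[k], M[p] = M[p], M[k]
        for i in range(k + 1, n):
            factor = f32(M[i][k] / M[k][k])
            for j in range(k, n + 1):
                M[i][j] = f32(M[i][j] - f32(factor * M[k][j]))
    x = [0.0] * n
    for i in reversed(range(n)):
        s = M[i][n]
        for j in range(i + 1, n):
            s = f32(s - f32(M[i][j] * x[j]))
        x[i] = f32(s / M[i][i])
    return x


def residual(A, x, b):
    """r = b - A x, accumulated in binary64 (ordinary Python floats)."""
    return [bi - sum(a * xj for a, xj in zip(row, x)) for row, bi in zip(A, b)]


def refine(A, b, steps=5):
    """Low-precision solve followed by high-precision residual corrections."""
    x = solve_low(A, b)
    history = [max(abs(r) for r in residual(A, x, b))]
    for _ in range(steps):
        r = residual(A, x, b)       # high precision
        d = solve_low(A, r)         # low precision correction
        x = [xi + di for xi, di in zip(x, d)]
        history.append(max(abs(ri) for ri in residual(A, x, b)))
    return x, history
```

Calling `refine([[4.0, 1.0, 2.0], [1.0, 3.0, 0.0], [2.0, 0.0, 5.0]], [1.0, 2.0, 3.0])` returns the refined solution together with the maximum residual after each step, which makes it easy to see how quickly the low-precision error is removed for a well-conditioned matrix.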
Parallel programming environments, languages and programming practices
- Parallel profiling tools and best practices
- Recent hot topics in Distributed Systems development
- etc.
Reversibility in neural networks (Eero Vainikko and Stefan Kuhn)
- Neural networks are massively parallel applications, typically executed on parallel hardware like GPUs. Saving time, memory, and energy with these is very important, not only to make the applications feasible but also for ecological reasons. https://link.springer.com/chapter/10.1007/978-3-031-38100-3_7 suggests using reversibility techniques to achieve this. Practical implementations are missing. A prototype implementation, to get some idea of what is really possible, would be a nice project. Case studies and data we could use for that exist.

2. Topics by Ulrich Norbisrath can be found here

3. Distributed Systems, Network Applications (Artjom Lind)

Covering the topics related to development and applied research of distributed computing and network protocols.

Example topics:
- SUMO Simulator: add support for concurrent execution
  - Simulator of Urban Mobility (SUMO) is an open-source, highly portable, microscopic and continuous multi-modal traffic simulation package designed to handle large networks. The current implementation performs all road-network-related routines (simulation, calibration, etc.) in a single thread and hence under-utilizes multi-core CPUs. The objective is to achieve better CPU utilization by allowing multiple concurrent threads within one simulation.
- DASK: Scalable analytics in Python
  - Evaluate the DASK distributed computing framework with respect to various scientific computing tasks.
Individual topic -> Contact me!

4. Applied Computer Vision (CV) (Artjom Lind)

These topics mostly relate to the application of the latest results in CV. In this area, we mostly use the OpenCV library, which is recommended but not obligatory. Topics we can focus on include:

Structure from motion
Object detection/classification
Object tracking
Optical Character Recognition (OCR)
Augmented Reality
Example topics:
- Stateful Masking of Dynamic Objects for Visual Simultaneous Localization and Mapping
  - The aim is to reduce the time complexity of masking moving objects in the Visual SLAM input. Previous research has shown that MaskRCNN is accurate but can hardly achieve 10 FPS. The objective is to employ state estimation techniques to track the moving objects and update the masking information faster than the actual detection rate.
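To make the idea of updating masks faster than the detector concrete, here is a minimal sketch that uses an alpha-beta filter, one of the simplest state estimators, to extrapolate the centre of a mask between detections. This is an illustrative choice and not the method prescribed by the topic. The class name `AlphaBetaTracker`, the gains, and the constant-velocity assumption are all hypothetical.

```python
class AlphaBetaTracker:
    """Constant-velocity alpha-beta filter for the centre of one moving object."""

    def __init__(self, center, alpha=0.5, beta=0.1):
        self.pos = [float(c) for c in center]
        self.vel = [0.0 for _ in center]
        self.alpha = alpha
        self.beta = beta
        self.since_update = 0.0  # time elapsed since the last detection

    def step(self, dt, measurement=None):
        """Advance by dt; correct the state only when a detection is available."""
        # Prediction: runs on every frame, which is cheap compared to detection.
        self.pos = [p + v * dt for p, v in zip(self.pos, self.vel)]
        self.since_update += dt
        if measurement is not None:
            # Correction: blend the prediction with the slow detector's output.
            for i, z in enumerate(measurement):
                r = z - self.pos[i]
                self.pos[i] += self.alpha * r
                self.vel[i] += self.beta * r / self.since_update
            self.since_update = 0.0
        return tuple(self.pos)


# Hypothetical usage: the detector delivers a result only on every third frame.
tracker = AlphaBetaTracker((0.0, 0.0))
for frame in range(1, 31):
    detection = (2.0 * frame, 1.0 * frame) if frame % 3 == 0 else None
    center = tracker.step(1.0, detection)  # mask centre estimate for this frame
```

A Kalman filter with a full bounding-box state would be the more usual choice in practice; the alpha-beta form is used here only because it fits in a few lines.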
Individual topic -> Contact me!

5. Parallel Machine learning algorithms (Artjom Lind, Amnir Hadachi)

Optical Character Recognition (OCR) algorithms for Estonian and non-Latin scripts such as Arabic / Cyrillic / Chinese / Farsi / Hebrew / Hindi / Japanese / Korean
Road type recognition and detection
Object detection and recognition

6. Modelling and analyzing semantic trajectories (Amnir Hadachi)

Trajectory filtering
Map-matching
Movement episode detection
Conceptual modelling
Semantic modelling

7. Mobility data modelling and representation (Amnir Hadachi)

Trajectories and their representation
Trajectory collection and reconstruction
Uncertainty in mobility data
Data mining and human mobility behaviour
Visual analytics of mobility

8. Topics in Green Learning with the focus on multi-object tracking (Amnir Hadachi)

9. GPGPU: OpenCL for General-Purpose GPU Computing (Mohammad Anagreh)

iDash computation protocol for genomic data
Scientific computing: Parallel prefix sum, prefix minimum, and various other operations [14]
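As a rough illustration of the data-parallel pattern behind prefix sum and prefix minimum, the following sketch simulates a Hillis-Steele inclusive scan in plain Python. It is not OpenCL and is not taken from [14]. The function name `hillis_steele_scan` is hypothetical, and the list comprehension stands in for what would be one work-item per element on a GPU.

```python
import operator


def hillis_steele_scan(values, op=operator.add):
    """Inclusive scan in ceil(log2(n)) rounds; op must be associative."""
    data = list(values)
    n = len(data)
    offset = 1
    while offset < n:
        # One round: every index i is independent of the others, so on a GPU
        # each i would be a separate work-item. Reading from `data` and
        # writing to a new list mimics double buffering.
        data = [
            data[i] if i < offset else op(data[i - offset], data[i])
            for i in range(n)
        ]
        offset *= 2
    return data


print(hillis_steele_scan([3, 1, 4, 1, 5, 9, 2, 6]))       # [3, 4, 8, 9, 14, 23, 25, 31]
print(hillis_steele_scan([3, 1, 4, 1, 5, 9, 2, 6], min))  # [3, 1, 1, 1, 1, 1, 1, 1]
```

The same routine yields a prefix minimum by passing `min` as the operator. The scan performs O(n log n) operations in total, so a work-efficient variant (for example the Blelloch scan) may be preferable for large inputs.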

10. Autonomous Driving (Naveed Muhammad)

GNSS-free localisation for autonomous vehicles
Air-flow sensing applications in autonomous driving

11. A Reversible Debugger for MPI Applications (Stefan Kuhn, Eero Vainikko)

Reversible debuggers allow a user to step not only forwards, but also backwards in the program execution. This can be very helpful for debugging, but is not easy to implement. In a distributed system, it gets even more difficult. We have started to implement a reversible debugger for MPI. Further work on this is needed to make it usable in practice.

12. Chemoinformatics (Stefan Kuhn)

This is an overview of projects in the area of nuclear magnetic resonance (NMR) and chemoinformatics in general, partly using optimization and machine learning. They can be tackled without a specific background in chemoinformatics. They can be done on their own or combined.

Survey of mixture analysis methods

There have been many publications on mixture analysis recently. A few examples are [1,4,5,8,13]. Most or all of those papers contain a) some new method and b) a dataset the method was applied to. Reusing the same dataset across papers seems uncommon, although it would make comparison easier. In addition, there are application papers that just provide a dataset. It would be valuable to see how those methods perform in comparison on one or several datasets. This would require identifying relevant methods and finding good data. In addition to the results of the application, an overview of methods could be a side product of this. It has potential for a nice review paper. There are potential obstacles, including availability of data and methods needing different experiments. Those should be documented, which can be valuable data in itself.

Conventional methods and optimization

[1] uses clustering to do some separation of compounds in a mixture. Using the review, can we develop this further using non-AI technologies? Any other techniques which could be used should be tested, of course taking inspiration from the literature. Also, this could be considered an optimization problem: Given a ranked list of candidates and their assigned spectra, what is the minimum set of compounds which cover the maximum of peaks in the spectra and are high up in the ranked candidate list? This is a really huge search space, but it should at least be tried.
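The optimization question above resembles a set cover problem, for which a greedy heuristic is a natural first baseline. The sketch below is only one possible starting point and not the approach of [1]. It assumes that peaks can be represented as hashable identifiers and that candidates arrive best-ranked first. The function name `greedy_cover` and the toy data are hypothetical.

```python
def greedy_cover(candidates, observed_peaks):
    """Pick compounds greedily until no candidate explains further peaks.

    candidates: list of (name, predicted_peaks), ordered best-ranked first.
    Returns the chosen names and the peaks left unexplained.
    """
    uncovered = set(observed_peaks)
    remaining = list(enumerate(candidates))  # keep the rank next to each entry
    chosen = []
    while uncovered and remaining:
        # Prefer the candidate covering most uncovered peaks; ties are broken
        # in favour of the better (smaller) rank.
        rank, (name, peaks) = max(
            remaining,
            key=lambda item: (len(uncovered & set(item[1][1])), -item[0]),
        )
        gain = uncovered & set(peaks)
        if not gain:
            break  # nobody explains anything new
        chosen.append(name)
        uncovered -= gain
        remaining = [item for item in remaining if item[0] != rank]
    return chosen, uncovered


# Toy example with made-up peak identifiers.
ranked = [("A", {1, 2, 3}), ("B", {3, 4}), ("C", {1, 2, 3, 4}), ("D", {5})]
print(greedy_cover(ranked, {1, 2, 3, 4, 5, 6}))  # (['C', 'D'], {6})
```

Greedy selection does not guarantee a minimum set, so exact methods (for example integer programming) or metaheuristics would be natural things to compare against.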

Image segmentation in spectra

In [7] we showed that deep learning is able to identify whether spectra come from a compound with a certain substructure. This was image classification. It could be extended into an image segmentation task by trying to identify the peaks that belong to the substructure. This would be restricted to pure substances as a first step, but should be tried on mixtures as well.

Identify spectra of compounds using substructures

Assuming we have managed to identify substructures (topic 3), can we use this information to identify compounds? It seems reasonable to assume that if some information is there, it should reduce the problem. From the optimization point of view (topic 2), it should reduce the search space.

CASE tools (computer-aided structure elucidation)

These are tools to find the matching structure for a set of spectra. This is normally done for single compounds and is a type of optimization problem. [2] could be a starting point, but there is a lot of work here. Perhaps this could be scaled up to a small number of compounds, for example: what is the best structural fit for two or three compounds? Also, perhaps there are some new methods in the area of optimization not yet applied here. This could be a nice first step towards applying some new optimization method in a CASE framework.

NMR and MS

Another thing which I think deserves attention is the interaction of NMR and MS. Molecular networking is an established technique in MS [6] - of course results may go into NMR analysis later, for example as a candidate list, but that is an indirect link. So could this become more closely linked? Somehow consider NMR whilst exploring the network? Or do something similar for NMR?

Using raw data

[3] uses the raw data to identify functional groups. This is in contrast to [7], which uses spectral images. Would the substructures work from the raw data? What about mixtures? Or use this in combination with MS? Have a neural network process MS and NMR at the same time?

Learning structure elucidation

Then there is the AI side. Roughly speaking, a spectroscopist will be able to make some suggestions when looking at a spectrum - not 100% accurate and not always, but a set of NMR spectra is more than blobs. Now it should in theory be possible to reproduce that knowledge in AI. The most common way to do this is supervised learning. In [7], we made an attempt at this. The same approach should work for learning more. The issues are mainly around a) how do we get data to train? b) what is it in those data we can learn (say we did substructures, but perhaps we could learn functional groups or some other property, like does the molecule contain nitrogens?) c) what is the right input type for that? d) what sort of system to use (a neural network? which one? or a support vector machine? or...) e) how to optimize this?

Siamese networks

Siamese networks are used in SMART [9] and in [10] to compare spectra. Would that work with our fragments? Train a Siamese network to find substructures. Or functional groups. Also would image segmentation work with Siamese networks?

Literature (for all proposals)

[1] Kuhn, S, Colreavy-Donnelly, S, Santana de Souza, J, Borges, RM (2019). An integrated approach for mixture analysis using MS and NMR techniques. Faraday Discuss, 218:339-353.

[2] Jayaseelan, K.V., Steinbeck, C. Building blocks for automated elucidation of metabolites: natural product-likeness for candidate ranking. BMC Bioinformatics 15, 234 (2014).

[3] Li C, Cong Y, Deng W. Identifying molecular functional groups of organic compounds by deep learning of NMR data. Magn Reson Chem. 2022 Jun 8. doi: 10.1002/mrc.5292. Epub ahead of print.

[4] Bin Yuan, Zhiming Zhou, Bin Jiang, Ghulam Mustafa Kamal, Xu Zhang, Conggang Li, Xin Zhou, and Maili Liu: NMR for Mixture Analysis: Concentration-Ordered Spectroscopy, Analytical Chemistry 2021 93 (28), 9697-9703.

[5] Jeannerat Damien and Furrer Julien, NMR Experiments for the Analysis of Mixtures: Beyond 1D 1H Spectra, Combinatorial Chemistry & High Throughput Screening 2012; 15(1).

[6] Leao TF, Clark CM, Bauermeister A, Elijah EO, Gentry EC, Husband M, Oliveira MF, Bandeira N, Wang M, Dorrestein PC. Quick-start infrastructure for untargeted metabolomics analysis in GNPS. Nat Metab. 2021 Jul;3(7):880-882.

[7] Kuhn, S., Tumer, E., Colreavy-Donnelly, S., Moreira Borges, R., A pilot study for fragment identification using 2D NMR and deep learning, Magn Reson Chem 2021, 1.

[8] A. Bakiri, B. Plainchont, V. de Paulo Emerenciano, R. Reynaud, J. Hubert, J.-H. Renault, J.-M. Nuzillard, Computer-aided Dereplication and Structure Elucidation of Natural Products at the University of Reims, Mol. Inf. 2017, 36, 1700027.

[9] Zhang, C.; Idelbayev, Y.; Roberts, N.; Tao, Y.; Nannapaneni, Y.; Duggan, B.M.; Min, J.; Lin, E.C.; Gerwick, E.C.; Cottrell, G.W.; et al. Small Molecule Accurate Recognition Technology (SMART) to Enhance Natural Products Research. Sci. Rep. 2017, 7, 14243

[10] Wei, W.; Liao, Y.; Wang, Y.; Wang, S.; Du, W.; Lu, H.; Kong, B.; Yang, H.; Zhang, Z. Deep Learning-Based Method for Compound Identification in NMR Spectra of Mixtures. Molecules 2022, 27, 3653. https://doi.org/10.3390/molecules27123653

[11] S. Kuhn, R. M. Borges, F. Venturini and M. Sansotera, "Dataset Size and Machine Learning - Open NMR Databases as a Case Study," 2022 IEEE 46th Annual Computers, Software, and Applications Conference (COMPSAC), 2022, pp. 1632-1636, doi: 10.1109/COMPSAC54236.2022.00259.

[12] Markus Fischer, Benedikt Schwarze, Nikola Ristic, Holger A. Scheidt, Predicting 2H NMR acyl chain order parameters with graph neural networks, Computational Biology and Chemistry, Volume 100, 2022, 107750, ISSN 1476-9271, https://doi.org/10.1016/j.compbiolchem.2022.107750.

[13] Carlos L. Zani and Anthony R. Carroll, Database for Rapid Dereplication of Known Natural Products Using Data from MS and Fast NMR Experiments, Journal of Natural Products 2017 80 (6), 1758-1766, DOI: 10.1021/acs.jnatprod.6b01093

[14] Anagreh, M., Laud, P. and Vainikko, E.: 2021. Parallel Privacy-Preserving Shortest Path Algorithms. Cryptography, 5(4), p.27, https://doi.org/10.3390/cryptography5040027

Distributed Systems Seminar, autumn 2024/25