Project Topics

Here are some suggested project topics. Topics marked with a tick (√) have already been chosen by somebody, but most topics are broad enough to be shared by two students. You are also welcome to propose additional topics of your own.

  • Definition, approaches and applications of Text Mining
  • Machine learning for automatic translation of words and short sentences.
  • Statistical Alignment and Machine Translation
  • Clustering
  • Information Retrieval
  • Text Categorization
  • Pattern Discovery in bioinformatics or other fields √

  • Definition, approaches and applications of Text Mining
    • Text mining is a diverse discipline covering a wide array of methods and application areas. The idea of the project is to write an easily readable, reasonably short overview paper that gives a newcomer a good initial understanding of:
      • What is text mining, and how does it relate to other kinds of "mining"?
      • What are the typical problems that require text mining?
      • What are the typical approaches? Describe in brief the main ideas of the major algorithms.
      • What are the most popular "industrial" applications? Describe some cases of actual use, show some programs that the reader could download and try out, etc.
    • In addition, prepare a clear presentation that gets the main points across.
    • Literature:

  • Machine learning for automatic translation of words and short sentences.
    • You are given a set of short words and phrases in language A (for example, the labels and messages used in an application). Half of them have been translated into language B. Implement an algorithm that translates the remaining half (possibly in a semi-automatic way).

  • Statistical Alignment and Machine Translation
    • Machine translation, the automatic translation of text or speech from one language to another, is one of the most important applications of Natural Language Processing. Unfortunately, it remains a hard problem. Simple word-for-word translation does not give good results, because languages differ in syntactic structure, use idiomatic phrases, have words with multiple meanings, etc. Now that parallel text, i.e. the same text in different languages (e.g. EU proceedings), has become available, several statistical approaches have appeared in which texts are aligned at different levels (e.g. paragraphs, sentences, phrases, words).
    • The project would involve the following topics:
      • Short description of different (non-statistical) approaches to machine translation
      • What do we mean by 'text alignment' and what are the principal approaches to the task?
      • Noisy channel model in machine translation (and possibly also other approaches)
    • Literature:
      • Chapter 13 of "Foundations of Statistical Natural Language Processing" by Christopher D. Manning and Hinrich Schütze
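
As a pointer for the noisy channel topic above, the model is usually written as the decision rule below. The symbols are assumptions introduced here for illustration: f is the observed foreign-language sentence, e ranges over candidate translations, P(e) is a language model and P(f | e) is a translation model.

```latex
\hat{e} \;=\; \arg\max_{e} P(e \mid f)
        \;=\; \arg\max_{e} \frac{P(e)\, P(f \mid e)}{P(f)}
        \;=\; \arg\max_{e} P(e)\, P(f \mid e)
```

The denominator P(f) can be dropped because it does not depend on e. Alignment enters through the translation model, which is typically estimated from aligned parallel text.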

  • Clustering
    • Clustering algorithms partition a set of objects into groups or clusters. Objects are described and clustered using a set of features and values. The goal is to place similar objects in the same group and to assign dissimilar objects to different groups.
      • Describe various clustering algorithms and some applications
    • Literature:
      • Chapter 14 of "Foundations of Statistical Natural Language Processing" by Christopher D. Manning and Hinrich Schütze
      • Chapter 5, "The Text Mining Handbook: Advanced Approaches in Analysing Unstructured Data" by Ronen Feldman and James Sanger
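
To make the idea of grouping similar objects concrete, here is a minimal sketch of one common algorithm, k-means, on plain numeric feature tuples. It is only an illustration, not a required approach; the function names (`assign`, `kmeans`) and the toy data are made up for this example.

```python
import math
import random


def assign(points, centroids):
    """Return, for each point, the index of its nearest centroid (Euclidean)."""
    return [
        min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
        for p in points
    ]


def kmeans(points, k, iterations=20, seed=0):
    """Plain k-means on a list of equal-length numeric tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct data points
    for _ in range(iterations):
        labels = assign(points, centroids)
        new_centroids = []
        for i in range(k):
            members = [p for p, label in zip(points, labels) if label == i]
            if members:
                # New centroid is the coordinate-wise mean of the members.
                new_centroids.append(
                    tuple(sum(c) / len(members) for c in zip(*members))
                )
            else:
                # Keep the old centroid if a cluster happens to be empty.
                new_centroids.append(centroids[i])
        if new_centroids == centroids:
            break  # assignments can no longer change
        centroids = new_centroids
    return assign(points, centroids), centroids


if __name__ == "__main__":
    data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
    labels, centroids = kmeans(data, k=2)
    print(labels, centroids)
```

For text, the feature tuples would typically be term-weight vectors of documents, and the distance would often be replaced by a cosine-based measure.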

  • Information Retrieval
    • Information retrieval (IR) research is concerned with developing algorithms and models for retrieving information from document repositories. Strictly speaking it is not part of Text Mining, even though it is often regarded as such. While the strict meaning of text mining is processing texts to discover new, previously unknown information, IR deals with extracting relevant information from a large pool of texts.
      • Describe the models used in IR and illustrate them with examples from actual applications.
    • Literature:
      • Chapter 15 of "Foundations of Statistical Natural Language Processing" by Christopher D. Manning and Hinrich Schütze
      • Chapter 6, "The Text Mining Handbook: Advanced Approaches in Analysing Unstructured Data" by Ronen Feldman and James Sanger
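
As one example of an IR model, the sketch below ranks documents against a query with the vector space model, using TF-IDF weights and cosine similarity. It is a simplified illustration under stated assumptions: whitespace tokenization, raw term frequency, and idf = log(N / df). The function names and the sample documents are invented for this example.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def build_index(docs):
    """Return idf weights and one sparse TF-IDF vector (a dict) per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(docs)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log(n_docs / df) for term, df in doc_freq.items()}
    vectors = [
        {term: tf * idf[term] for term, tf in Counter(tokens).items()}
        for tokens in tokenized
    ]
    return idf, vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank(query, docs):
    """Return (score, document index) pairs, best match first."""
    idf, vectors = build_index(docs)
    query_tf = Counter(tokenize(query))
    # Query terms that never occur in the collection are ignored.
    query_vec = {t: tf * idf[t] for t, tf in query_tf.items() if t in idf}
    scores = [(cosine(query_vec, vec), i) for i, vec in enumerate(vectors)]
    return sorted(scores, reverse=True)


if __name__ == "__main__":
    docs = [
        "the cat sat on the mat",
        "dogs chase cats in the park",
        "statistical machine translation aligns parallel text",
    ]
    for score, index in rank("machine translation", docs):
        print(f"{score:.3f}  {docs[index]}")
```

Real systems add stemming, stop-word handling and an inverted index so that only documents sharing a term with the query are scored.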

  • Text Categorization
    • Classification or categorization is the task of assigning objects from a universe to two or more predefined classes or categories. The goal of text categorization is to classify the topic or theme of a document. One application of text categorization is to filter a stream of news for a particular interest group. In general, the problem of statistical classification can be characterized as follows: we have a training set of objects, each labeled with one or more classes, which we encode via a data representation model.
      • Describe text classification techniques and some actual applications.
    • Literature:
      • Chapter 16 of "Foundations of Statistical Natural Language Processing" by Christopher D. Manning and Hinrich Schütze
      • Chapter 4, "The Text Mining Handbook: Advanced Approaches in Analysing Unstructured Data" by Ronen Feldman and James Sanger
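
The training-set formulation above can be illustrated with a small multinomial Naive Bayes classifier over a bag-of-words representation, with add-one smoothing. This is a sketch of one possible technique, not a prescribed one; the function names, the tiny training set and its labels are made up for the example.

```python
import math
from collections import Counter, defaultdict


def train(labeled_docs):
    """Collect per-class document counts, per-class word counts and the vocabulary."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in labeled_docs:
        class_counts[label] += 1
        for word in text.lower().split():
            word_counts[label][word] += 1
            vocab.add(word)
    return class_counts, word_counts, vocab


def classify(text, model):
    """Return the class with the highest log posterior for the given text."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocab:
                continue  # words never seen in training are skipped
            # Add-one (Laplace) smoothing avoids zero probabilities.
            score += math.log(
                (word_counts[label][word] + 1) / (total_words + len(vocab))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


if __name__ == "__main__":
    training = [
        ("stocks fall as markets react", "finance"),
        ("bank raises interest rates", "finance"),
        ("team wins the final match", "sport"),
        ("striker scores in cup final", "sport"),
    ]
    model = train(training)
    print(classify("interest rates and markets", model))
    print(classify("cup final match", model))
```

The same train/classify split applies to other techniques such as decision trees or k-nearest neighbours; only the model behind `classify` changes.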