Download the demo code vis.py used to generate the plots below. NB! The downloaded file will have a .txt extension; remove it after downloading.
The code demonstrates loading the JSON-encoded hinnavaatlus.ee and teachers.ee datasets. It also shows how to turn the documents into bag-of-words and TF-IDF feature spaces, two commonly used representations of textual features in natural language processing tasks. Finally, it uses principal component analysis (PCA) to reduce the dimensionality to a 2D coordinate space suitable for plotting.
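As a rough sketch of the loading step, a JSON-encoded dataset of rated comments might be parsed like this. The record layout below (a `text` field and a `rating` field) is hypothetical; the actual fields in the vis.py datasets may differ:

```python
import json

# Hypothetical record layout; the real datasets may use different keys.
raw = '[{"text": "fast delivery, good price", "rating": 5}]'
records = json.loads(raw)

docs = [r["text"] for r in records]      # raw comment texts
ratings = [r["rating"] for r in records]  # numeric user ratings 1-5
print(docs[0], ratings[0])
```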
Example 1: BOW + PCA (hinnavaatlus)
The bag-of-words (BOW) representation converts each document into a vector in which every element denotes the frequency of a single lemma. Note that word or character n-grams can be used instead of lemmas; see the CountVectorizer documentation for more details.
The following example does just that: it converts the documents into the BOW representation, applies PCA and plots the transformed data. The colors represent positive (ratings 4 and 5), neutral (rating 3) and negative (ratings 1 and 2) user ratings. The plot is useful for estimating how useful the extracted features are for classification tasks.
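A minimal sketch of such a BOW + PCA pipeline in scikit-learn (the toy documents below are made up; vis.py works on the real hinnavaatlus.ee comments):

```python
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy documents standing in for user comments.
docs = [
    "great seller fast delivery",
    "fast delivery good price",
    "terrible seller never again",
]

# Each document becomes a vector of term counts;
# ngram_range=(1, 2) would add word bigrams as extra features.
vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(docs)  # sparse (n_docs, n_terms) matrix

# PCA needs a dense array; project onto the first two components.
pca = PCA(n_components=2)
coords = pca.fit_transform(bow.toarray())
print(coords.shape)  # (3, 2) -- one 2D point per document, ready to plot
```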
Example 2: TF-IDF + PCA (hinnavaatlus)
Here we replace the BOW representation with TF-IDF. The resulting separation of positive and negative comments seems slightly better, but it is hard to say whether it would actually aid classification.
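Swapping TF-IDF in for raw counts only changes the vectorizer; a sketch with the same kind of made-up toy documents:

```python
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy documents, not the real dataset.
docs = [
    "great seller fast delivery",
    "fast delivery good price",
    "terrible seller never again",
    "slow shipping bad packaging",
]

# TF-IDF down-weights terms that occur in many documents,
# so very common words contribute less than distinctive ones.
tfidf = TfidfVectorizer().fit_transform(docs)
coords = PCA(n_components=2).fit_transform(tfidf.toarray())
print(coords.shape)  # (4, 2)
```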
Example 3: TF-IDF + CHI2 TEST BASED FEATURE SELECTION + PCA (hinnavaatlus)
In this example, we additionally employ a chi-squared test to measure how strongly each feature depends on the class (positive, neutral or negative). This helps select the features that are most discriminative between the three classes. The plot shows that this approach works much better than the two previous ones: most positive examples are well separated, although it is still hard to distinguish neutral documents from negative ones.
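One way to sketch this step is scikit-learn's SelectKBest with the chi2 scoring function, which keeps the k features most dependent on the class labels. The documents, labels and k below are all invented for illustration:

```python
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2

# Hypothetical comments with made-up ratings mapped to three classes.
docs = [
    "great product works well",
    "excellent value highly recommend",
    "broken on arrival terrible",
    "awful quality waste of money",
    "average nothing special",
    "neither good nor bad overall",
]
labels = [2, 2, 0, 0, 1, 1]  # 2 = positive, 1 = neutral, 0 = negative

X = TfidfVectorizer().fit_transform(docs)

# Keep only the k features with the highest chi-squared statistic
# against the labels, i.e. the most class-dependent terms.
X_best = SelectKBest(chi2, k=4).fit_transform(X, labels)

coords = PCA(n_components=2).fit_transform(X_best.toarray())
print(coords.shape)  # (6, 2)
```

chi2 requires non-negative feature values, which both BOW counts and TF-IDF weights satisfy.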
Example 4: TF-IDF + PCA (hinnavaatlus vs teachers)
This last example shows that the language used in the teachers dataset is clearly different from that of the hinnavaatlus dataset: the two are easy to tell apart in the plot even without any special feature selection.
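Visualizing two corpora together only requires fitting a single vectorizer on their union and coloring the points by source. A sketch, with made-up stand-ins for the two datasets:

```python
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical stand-ins for the two corpora.
forum = ["cheap monitor fast shipping", "warranty claim denied bad seller"]
teachers = ["explains the material very clearly", "lectures are well structured"]

docs = forum + teachers
source = [0] * len(forum) + [1] * len(teachers)  # 0 = hinnavaatlus, 1 = teachers

# One shared vocabulary across both corpora, then project to 2D.
X = TfidfVectorizer().fit_transform(docs)
coords = PCA(n_components=2).fit_transform(X.toarray())

# A scatter plot of `coords` colored by `source` shows two separate
# clusters when the corpora's vocabularies barely overlap.
print(coords.shape)  # (4, 2)
```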