Text/Multimedia Mining and Semantic Technologies, ICT3

Data Mining and Knowledge Discovery — Text, Web, and Multimedia Mining

Using the TextGarden tools

Using the provided data and the TextGarden software, perform the following operations:

Generate bag-of-words file using the Txt2Bow utility.
Perform k-means clustering for two different values of k using the BowKMeans utility and analyse the results.
Perform classification using the BowTrainBinSVM and BowClassify utilities for two frequent and two rare categories. If you trained on dataset n, use documents from the dataset n + 1 for classification. Compute precision, recall, F₁, accuracy, plot a precision/recall curve (how? see below for details) and calculate the breakeven point.
- Note: BowClassify crashes if the model was trained on a category that had no negative training examples; this can happen e.g. with GCAT in some of our datasets, so don't train on that category.
- Note: BowClassify can only classify one document at a time, so you'll need to split the file containing your test corpus into individual documents.
Repeat the previous step with cubic kernels (-t:polynomial -ker_p:3) instead of the default linear kernels (-t:linear).
For each of the selected categories, find an article on the internet that is positively classified into it.

Prepare a presentation of the results in a 5–10 page report and a 10–15 minute presentation (all in English).

Deadline: send the reports and slides to janez.brank@ijs.si by 22 January 2018. The presentations will take place on 30 January 2018.

TextGarden software [more documentation]
Datasets:
- 1 Darko Dujič, David Gojo
- 2 Adem Kikaj
- 3 Ilin Tolovski
- 4 Jasmin Bogatinovski
- 5 Ana Kostovska
- 6
- 7
These documents are taken from the Reuters RCV1 corpus; more information about it.
Category names.

Examples of TextGarden usage:

> Txt2Bow.exe -inlndoc:news.txt -o:news.bow -stopword:none -stemmer:none -ngramlen:1
> BowKMeans.exe -i:news.bow -clusts:5
> BowTrainBinSVM.exe -i:news.bow -o:news.bowmd -cat:GSPO
> BowClassify.exe -ibow:news.bow -imd:news.bowmd -qs:"olympic games"
> BowClassify.exe -ibow:news.bow -imd:news.bowmd -qh:article1.txt

Enrycher example:

Note: you don't have to use Java for this part of the assignment. All you need is to generate suitable HTTP POST requests to enrycher.ijs.si and process the XML data from the responses (you need to generate the same requests as would be generated e.g. by the web form at http://enrycher.ijs.si/, so see that page for details). The Enrycher Java API helps you with that, but you can also do it by yourself in any other programming language you prefer.

public static void main(String[] args) {
  // input document
  String docString = "Tiger Woods emerged from a traffic jam of his " +
    "own making to thrill thousands of fans with a six-under 66 at " +
    "the $1.4 million Australian Masters on Thursday.";
  // URL of Enrycher web service
  URL pipelineUrl = new URL("http://enrycher.ijs.si/run");
  // convert input document to input stream
  InputStream docStream = new ByteArrayInputStream(docString.getBytes());
  // call Enrycher web service
  Document doc = EnrycherWebExecuter.processSync(pipelineUrl, docStream);
  // iterate over all the annotations
  for (Annotation ann : doc.getAnnotations()) {
    // list all persons
    if (ann.isPerson()) {
      System.out.println("Person: " + ann.getDisplayName());
      // get sentences in which it occurs
      for (Instance inst : ann.getInstances()) {
        int sentenceId = inst.getSentenceId(0);
        Paragraph paragraph = doc.getParagraph(sentenceId);
        Sentence sentence = paragraph.getSentence(sentenceId);
        System.out.println(inst.getDisplayName() + ": " + sentence.getPlainText());
      }

      // list all attributes
      for (Attribute attr : ann.getAttributes()) {
        if (attr.isLiterl()) {
          System.out.println(attr.getType() + " : " + attr.getLiteral());
        } else if (attr.isResource()){
          System.out.println(attr.getType() + " : " + attr.getResource());
        }
      }
    }
  }
}

> javac si/ijs/enrycher/Example.java
> java si.ijs.enrycher.Example

Example of computing the TF-IDF vectors and cosine similarity:

(potentially useful for the exam)

Excel worksheet showing how to compute TF-IDF vectors of two documents and their cosine similarity.

Example of how to plot a precision/recall curve:

Many binary classifiers, including SVM, don't just output a binary prediction (positive/negative). Instead, the prediction they initially compute is a real number and they obtain a binary prediction by comparing that real number to a threshold (for SVM, the default threshold is 0).

In this example, let's say we have five documents with the following class membership and real-valued predictions:

Document	True class label	Prediction
d₁	positive	+1.5
d₂	negative	+1.1
d₃	positive	+0.3
d₄	negative	−0.7
d₅	negative	−1.2

Note that we sorted the documents in decreasing order of predictions — that's a good first step when preparing the precision/recall curve.

To obtain binary predictions, we'd have to choose a threshold and then predict all documents above that threshold as positive and all documents below that threshold as negative; then we'd be able to compute TP, FP, TN, FN and from these also precision and recall.

Note that there are no documents whose predictions fall between +0.3 and +1.1. Thus, if we choose any threshold in the range (0.3, 1.1), the outcome will be exactly the same: the first two documents will be predicted positive, the remaining three will be predicted negative. (In this case we'd have TP = 1, FP = 1, TN = 2, FN = 3, and thus precision = 1/2 and recall = 1/2.)

Similarly, if we choose any threshold in the range (1.1, 1.5), the outcome is the same regardless of the exact value of the threshold: the first document is predicted positive, the other four documents are predicted negative. (This would give us TP = 1, FP = 0, TN = 3, FN = 1, and thus precision = 1 and recall = 1/2.)

By this line of thinking, we need just one threshold from each of the following ranges: (−∞ , −1.2), (−1.2, −0.7), (−0.7, 0.3), (0.3, 1.1), (1.1, 1.5), (1.5, ∞). For each of these thresholds, we should compute the resulting precision and recall, and plot it on a graph (recall on the x-axis, precision on the y-axis); connecting these points gives us the precision/recall curve.

We can compute all this very efficiently in the following manner. Let's start with a very high value of the threshold — higher than all the predictions on our documents (in our example, that means higher than 1.5). This means that all documents are predicted negative; we have TP = 0, FP = 0, TN = number_of_negative_documents and FN = number_of_positive_documents. (This gives us recall = 0 and precision = 0/0, which we conventionally define to be = 1 for the sake of continuity.) In our example this would mean TN = 3 and FN = 2.

Now if we gradually decrease the threshold, what happens when the threshold drops below 1.5? The first document, d₁, is now predicted positive rather than negative. This affects the contingency table in the following way: since d₁ is positive, it used to count towards FN when it was predicted as negative; but now that it is predicted as positive, it counts towards TP. So we have to decrease FN by 1 and increase TP by 1; we can then calculate new precision and recall. (In our example we now have TP = 1, FP = 0, TN = 3, FN = 1.)

If we decrease the threshold still further, it eventually drops below 1.1 and now d₂ also begins to be predicted as positive instead of as negative. How does this affect the contingency table? Note that in our example, d₂ is actually negative. Thus, while it was still being predicted as negative, it counted towards TN; but now that it is predicted as positive, it counts towards FP. So we have to decrease TN by 1 and increase FP by one. (In our example we now have TP = 1, FP = 1, TN = 2, FN = 1.)

We can continue in this way until the threshold drops below the predictions for all the documents. At that point we'd have TP = number_of_positive_documents, FP = number_of_negative_documents, TN = 0 and FN = 0.

Now that we have the contingency table for every possible value of the threshold, we just have to compute the precision and recall for each of them and plot them on a chart. You can do all this very easily in Excel, or write a program/script in any programming language, etc.

The breakeven point is the point on the precision/recall curve at which precision and recall are equal. If there are several such points, we use the one that is closest to the origin of the coordinate space.