Using the provided data and the TextGarden software, perform the following operations:
Present the results in a 5–10 page report and a 10–15 minute presentation (both in English).
Deadline: send the reports and slides to janez.brank@ijs.si by 22 January 2018. The presentations will take place on 30 January 2018.
These documents are taken from the Reuters RCV1 corpus; more information about the corpus is available online.
```
> Txt2Bow.exe -inlndoc:news.txt -o:news.bow -stopword:none -stemmer:none -ngramlen:1
> BowKMeans.exe -i:news.bow -clusts:5
> BowTrainBinSVM.exe -i:news.bow -o:news.bowmd -cat:GSPO
> BowClassify.exe -ibow:news.bow -imd:news.bowmd -qs:"olympic games"
> BowClassify.exe -ibow:news.bow -imd:news.bowmd -qh:article1.txt
```
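Judging from the executable names and flags (an informal reading, not official TextGarden documentation): Txt2Bow.exe converts the documents in news.txt into a bag-of-words file, with stop-word removal, stemming, and multi-word n-grams disabled; BowKMeans.exe clusters the resulting vectors into 5 clusters with k-means; BowTrainBinSVM.exe trains a binary SVM model for the GSPO category; and BowClassify.exe applies the trained model either to the query string "olympic games" or to the file article1.txt.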
(potentially useful for the exam)
Many binary classifiers, including SVM, don't just output a binary prediction (positive/negative). Instead, the prediction they initially compute is a real number and they obtain a binary prediction by comparing that real number to a threshold (for SVM, the default threshold is 0).
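As a minimal illustration (this is a generic sketch, not TextGarden code, and the function name is made up):

```python
def binary_prediction(score, threshold=0.0):
    """Turn a real-valued classifier score into a binary prediction.

    For SVM the score is the signed distance from the separating
    hyperplane, and the default threshold is 0.
    """
    return "positive" if score > threshold else "negative"

# Example: a score of +1.1 is predicted positive with the default
# threshold, but negative if we raise the threshold to 1.2.
print(binary_prediction(1.1))       # positive
print(binary_prediction(1.1, 1.2))  # negative
```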
In this example, let's say we have five documents with the following class membership and real-valued predictions:
| Document | True class label | Prediction |
|---|---|---|
| d1 | positive | +1.5 |
| d2 | negative | +1.1 |
| d3 | positive | +0.3 |
| d4 | negative | −0.7 |
| d5 | negative | −1.2 |
Note that we sorted the documents in decreasing order of predictions — that's a good first step when preparing the precision/recall curve.
To obtain binary predictions, we'd have to choose a threshold and then predict all documents above that threshold as positive and all documents below that threshold as negative; then we'd be able to compute TP, FP, TN, FN and from these also precision and recall.
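(Recall that precision = TP / (TP + FP) and recall = TP / (TP + FN).)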
Note that there are no documents whose predictions fall between +0.3 and +1.1. Thus, if we choose any threshold in the range (0.3, 1.1), the outcome will be exactly the same: the first two documents will be predicted positive, the remaining three will be predicted negative. (In this case we'd have TP = 1, FP = 1, TN = 2, FN = 1, and thus precision = 1/2 and recall = 1/2.)
Similarly, if we choose any threshold in the range (1.1, 1.5), the outcome is the same regardless of the exact value of the threshold: the first document is predicted positive, the other four documents are predicted negative. (This would give us TP = 1, FP = 0, TN = 3, FN = 1, and thus precision = 1 and recall = 1/2.)
By this line of thinking, we need just one threshold from each of the following ranges: (−∞, −1.2), (−1.2, −0.7), (−0.7, 0.3), (0.3, 1.1), (1.1, 1.5), (1.5, ∞). For each of these thresholds, we should compute the resulting precision and recall, and plot it on a graph (recall on the x-axis, precision on the y-axis); connecting these points gives us the precision/recall curve.
We can compute all this very efficiently in the following manner. Let's start with a very high value of the threshold — higher than all the predictions on our documents (in our example, that means higher than 1.5). This means that all documents are predicted negative; we have TP = 0, FP = 0, TN = number_of_negative_documents and FN = number_of_positive_documents. (This gives us recall = 0 and precision = 0/0, which we conventionally define to be 1 for the sake of continuity.) In our example this would mean TN = 3 and FN = 2.
Now if we gradually decrease the threshold, what happens when the threshold drops below 1.5? The first document, d1, is now predicted positive rather than negative. This affects the contingency table in the following way: since d1 is positive, it used to count towards FN when it was predicted as negative; but now that it is predicted as positive, it counts towards TP. So we have to decrease FN by 1 and increase TP by 1; we can then calculate new precision and recall. (In our example we now have TP = 1, FP = 0, TN = 3, FN = 1.)
If we decrease the threshold still further, it eventually drops below 1.1 and now d2 also begins to be predicted as positive instead of as negative. How does this affect the contingency table? Note that in our example, d2 is actually negative. Thus, while it was still being predicted as negative, it counted towards TN; but now that it is predicted as positive, it counts towards FP. So we have to decrease TN by 1 and increase FP by one. (In our example we now have TP = 1, FP = 1, TN = 2, FN = 1.)
We can continue in this way until the threshold drops below the predictions for all the documents. At that point we'd have TP = number_of_positive_documents, FP = number_of_negative_documents, TN = 0 and FN = 0.
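Working through the five-document example in this way, the sweep yields the following contingency tables and precision/recall values (one row per threshold range):

| Threshold range | TP | FP | TN | FN | Precision | Recall |
|---|---|---|---|---|---|---|
| (1.5, ∞) | 0 | 0 | 3 | 2 | 1 (by convention) | 0 |
| (1.1, 1.5) | 1 | 0 | 3 | 1 | 1 | 1/2 |
| (0.3, 1.1) | 1 | 1 | 2 | 1 | 1/2 | 1/2 |
| (−0.7, 0.3) | 2 | 1 | 2 | 0 | 2/3 | 1 |
| (−1.2, −0.7) | 2 | 2 | 1 | 0 | 1/2 | 1 |
| (−∞, −1.2) | 2 | 3 | 0 | 0 | 2/5 | 1 |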
Now that we have the contingency table for every possible value of the threshold, we just have to compute the precision and recall for each of them and plot them on a chart. You can do all this very easily in Excel, or write a program/script in any programming language, etc.
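For instance, here is a minimal Python sketch of the sweep (the data is just the five-document example above, the names are made up for illustration, and prediction ties, which don't occur here, would have to be processed together):

```python
# Threshold sweep described above: documents are sorted by decreasing
# prediction; lowering the threshold past each prediction moves one document
# from the "predicted negative" side to the "predicted positive" side.

docs = [  # (prediction, is_truly_positive): the five-document example
    (1.5, True), (1.1, False), (0.3, True), (-0.7, False), (-1.2, False),
]
docs.sort(key=lambda d: d[0], reverse=True)  # decreasing order of predictions

tp, fp = 0, 0
fn = sum(1 for _, positive in docs if positive)  # all positives start as FN
tn = len(docs) - fn                              # all negatives start as TN

points = [(0.0, 1.0)]  # (recall, precision); precision 0/0 is defined as 1
for _, positive in docs:
    if positive:
        tp += 1   # a positive document moves from FN to TP
        fn -= 1
    else:
        fp += 1   # a negative document moves from TN to FP
        tn -= 1
    points.append((tp / (tp + fn), tp / (tp + fp)))  # (recall, precision)

for recall, precision in points:
    print(f"recall = {recall:.2f}, precision = {precision:.2f}")
```

Plotting these points with recall on the x-axis and precision on the y-axis, and connecting them, gives the precision/recall curve.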
The breakeven point is the point on the precision/recall curve at which precision and recall are equal. If there are several such points, we use the one that is closest to the origin of the coordinate space.
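Continuing the Python sketch above, a rough way to locate the breakeven point among the computed points (the exact breakeven may fall between two computed points, in which case one would interpolate along the curve):

```python
# Reuses the `points` list from the sketch above. Pick the computed
# (recall, precision) point where the two values are closest; ties go to
# the point nearest the origin, as described above.
breakeven = min(points, key=lambda rp: (abs(rp[1] - rp[0]), rp[0]**2 + rp[1]**2))
print("approximate breakeven:", breakeven)  # (0.5, 0.5) in this example
```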