Python code
https://github.com/shenshutao/Machine-Learning/tree/master/Text%20Mining/Text%20Classification
1. Brief
A text categorization system automatically assigns a category to a piece of text.
2. Common Usage
- Assigning subject categories to documents
- Email spam detection
For example, Gmail blocks a lot of junk mail for us.
- Medical diagnosis
- Identifying a language (before further processing)
- Etc.
3. How does automatic text categorization work?
Phase one - Training – creating the text “classifier” (automatic categorization engine)
- You need a set of documents, already categorized
- Divide the set into training (typically 70%) and testing (30%)
- Build your classifier so that it classifies the training set of documents accurately enough for your level of comfort
- The “level of comfort” depends on how hard the task is!
- Evaluate your classifier on the test set to ensure sufficient accuracy
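
The training phase can be sketched in plain Python. This is only a minimal illustration: the helper names `split_documents` and `accuracy`, the toy corpus, and the stand-in keyword classifier are all assumptions made for the example, not part of the linked repository.

```python
import random

def split_documents(labeled_docs, train_fraction=0.7, seed=0):
    """Shuffle labeled documents and split them into training and test sets."""
    docs = list(labeled_docs)
    random.Random(seed).shuffle(docs)
    cut = round(len(docs) * train_fraction)
    return docs[:cut], docs[cut:]

def accuracy(classifier, labeled_docs):
    """Fraction of documents whose predicted category matches the true one."""
    correct = sum(1 for text, label in labeled_docs if classifier(text) == label)
    return correct / len(labeled_docs)

# Toy corpus of (text, category) pairs; real corpora are far larger.
corpus = [("cheap pills buy now", "spam")] * 5 + [("meeting at noon", "ham")] * 5
train_set, test_set = split_documents(corpus)

# Stand-in classifier: a single keyword rule, only here to exercise the loop.
def keyword_classifier(text):
    return "spam" if "buy" in text else "ham"

print(len(train_set), len(test_set))
print(accuracy(keyword_classifier, test_set))
```

In a real project the stand-in rule would be replaced by a classifier fitted on `train_set`, and only `test_set` would be used to report accuracy.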
Phase two - Running – using your classifier on new sets of documents
- You will not know how well it performs, because the new documents have no known categories
- You need to “audit” the results occasionally (use an assessor)
- Assess a random sample of the documents against the predicted categories
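
The audit step can be sketched as follows. The function name `audit_sample` and the toy data are hypothetical; the assessor is modelled here as a simple lookup, whereas in practice it would be a person reviewing each sampled document.

```python
import random

def audit_sample(predictions, assess, sample_size, seed=0):
    """Estimate accuracy by having an assessor check a random sample.

    predictions: list of (document, predicted_category) pairs.
    assess: callable returning the category the assessor would assign.
    """
    rng = random.Random(seed)
    sample = rng.sample(predictions, min(sample_size, len(predictions)))
    agreed = sum(1 for doc, predicted in sample if assess(doc) == predicted)
    return agreed / len(sample)

# Toy data: four classified documents and the assessor's own judgments.
predictions = [("doc1", "sports"), ("doc2", "politics"),
               ("doc3", "sports"), ("doc4", "sports")]
assessor_labels = {"doc1": "sports", "doc2": "politics",
                   "doc3": "politics", "doc4": "sports"}

estimate = audit_sample(predictions, assessor_labels.get, sample_size=4)
print(estimate)
```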
4. Classifiers
Hand-coded classifiers (the “good old days!”)
If (condition) then (category) else NOT (category)
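
A hand-coded rule of this kind might look like the sketch below. The phrase list and the function name `is_spam` are made up for illustration; a real rule set would be written and maintained by a domain expert.

```python
# Hypothetical hand-picked phrases; not taken from any real filter.
SPAM_PHRASES = ("free money", "act now", "winner")

def is_spam(text):
    """Return True if any hand-picked phrase occurs in the text, else False."""
    lowered = text.lower()
    if any(phrase in lowered for phrase in SPAM_PHRASES):
        return True   # then: spam
    return False      # else: NOT spam

print(is_spam("You are a WINNER"))
print(is_spam("Lunch at noon?"))
```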
Probabilistic Classifiers
Naïve Bayes
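
As a rough sketch of the idea, a multinomial Naïve Bayes classifier with add-one (Laplace) smoothing can be written with the standard library alone. The function names and the four-document corpus are assumptions for the example; whitespace tokenization is a deliberate simplification.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(labeled_docs):
    """Collect per-category word counts and document counts."""
    word_counts = defaultdict(Counter)
    doc_counts = Counter()
    vocabulary = set()
    for text, label in labeled_docs:
        words = text.lower().split()
        word_counts[label].update(words)
        doc_counts[label] += 1
        vocabulary.update(words)
    return word_counts, doc_counts, vocabulary

def classify_naive_bayes(model, text):
    """Pick the category with the highest log posterior (add-one smoothing)."""
    word_counts, doc_counts, vocabulary = model
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label in doc_counts:
        score = math.log(doc_counts[label] / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocabulary:
                continue  # ignore words never seen in training
            count = word_counts[label][word]
            score += math.log((count + 1) / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

training = [
    ("buy cheap pills now", "spam"),
    ("cheap offer buy today", "spam"),
    ("meeting agenda for monday", "ham"),
    ("lunch meeting today", "ham"),
]
model = train_naive_bayes(training)
print(classify_naive_bayes(model, "buy cheap pills"))
print(classify_naive_bayes(model, "monday meeting"))
```

Working in log space avoids numeric underflow when many small probabilities are multiplied together.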
Decision Tree Classifiers
Decide if a name is male or female.
![]()
From: http://nltk.googlecode.com/svn/trunk/doc/book/ch06.html
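
The name example can be approximated with a one-level decision tree (a "decision stump") that branches on the last letter of the name. This is a simplified sketch, not the NLTK implementation: a full decision tree learner would choose which feature to split on, for instance by information gain, and would grow deeper branches. The function names and the handful of names are toy assumptions.

```python
from collections import Counter, defaultdict

def last_letter(name):
    """The single feature used by this stump."""
    return name[-1].lower()

def train_stump(labeled_names):
    """Learn a one-level tree: one leaf per last letter, labeled by majority vote."""
    votes = defaultdict(Counter)
    overall = Counter()
    for name, gender in labeled_names:
        votes[last_letter(name)][gender] += 1
        overall[gender] += 1
    leaves = {letter: counts.most_common(1)[0][0] for letter, counts in votes.items()}
    default = overall.most_common(1)[0][0]  # used for unseen last letters
    return leaves, default

def classify_name(stump, name):
    leaves, default = stump
    return leaves.get(last_letter(name), default)

# Toy training data, far too small for real use.
names = [("Anna", "female"), ("Maria", "female"), ("Julia", "female"),
         ("John", "male"), ("Kevin", "male"), ("Mark", "male"), ("Peter", "male")]
stump = train_stump(names)
print(classify_name(stump, "Laura"))
print(classify_name(stump, "Martin"))
```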
The Rocchio Classifiers
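
A simplified sketch of a Rocchio-style classifier is shown below: each category is represented by the centroid (mean vector) of its training documents, and a new document goes to the category whose centroid is most similar by cosine similarity. This is the nearest-centroid variant; the classic formulation also subtracts a weighted centroid of negative examples, which is omitted here. Function names and documents are assumptions for the example.

```python
import math
from collections import Counter, defaultdict

def vectorize(text):
    """Bag-of-words term-frequency vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as mappings."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def train_rocchio(labeled_docs):
    """Compute one centroid (mean vector) per category."""
    sums = defaultdict(Counter)
    counts = Counter()
    for text, label in labeled_docs:
        sums[label].update(vectorize(text))
        counts[label] += 1
    return {label: {w: v / counts[label] for w, v in vec.items()}
            for label, vec in sums.items()}

def classify_rocchio(centroids, text):
    """Assign the category whose centroid is closest in cosine similarity."""
    vec = vectorize(text)
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))

training = [
    ("the team won the match", "sports"),
    ("the striker scored a goal", "sports"),
    ("the senate passed the bill", "politics"),
    ("parliament voted on the law", "politics"),
]
centroids = train_rocchio(training)
print(classify_rocchio(centroids, "goal scored by the team"))
```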
Support Vector Machines (SVMs)
Before deep learning conquered the world, the SVM was king.
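
To give a feel for how a linear SVM is fitted, the sketch below minimizes hinge loss plus an L2 penalty by stochastic subgradient descent. It is a teaching toy under stated assumptions: the hyperparameters, function names, and 2-D points are made up, and real text classification would use high-dimensional bag-of-words vectors and a tuned library implementation.

```python
def train_linear_svm(points, labels, epochs=200, learning_rate=0.1, reg=0.01):
    """Fit weights and bias by subgradient descent on hinge loss + L2 penalty."""
    weights = [0.0] * len(points[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(points, labels):  # y is +1 or -1
            margin = y * (sum(w * xi for w, xi in zip(weights, x)) + bias)
            # The L2 penalty shrinks the weights on every step.
            weights = [w - learning_rate * reg * w for w in weights]
            if margin < 1:  # point is inside the margin or misclassified
                weights = [w + learning_rate * y * xi for w, xi in zip(weights, x)]
                bias += learning_rate * y
    return weights, bias

def predict_svm(model, x):
    """Return +1 or -1 depending on the side of the separating hyperplane."""
    weights, bias = model
    score = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if score >= 0 else -1

# Toy, linearly separable data.
points = [(2, 2), (3, 3), (3, 2), (-2, -2), (-3, -3), (-2, -3)]
labels = [1, 1, 1, -1, -1, -1]
svm_model = train_linear_svm(points, labels)
print([predict_svm(svm_model, p) for p in points])
```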
5. Evaluation
Confusion Matrix.
![]()
Look at the confusion matrices from the two classifiers below. Which classifier is better?
![]()
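The answer: it depends. Different businesses weigh the types of error differently; a spam filter, for example, may tolerate missed spam more readily than blocked legitimate mail.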
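
A confusion matrix, and the precision and recall derived from it, can be computed as in this sketch. The label lists are invented for the example and the helper names are assumptions.

```python
from collections import Counter

def confusion_matrix(actual, predicted):
    """Count how often each (actual, predicted) label pair occurs."""
    return Counter(zip(actual, predicted))

def precision_recall(matrix, positive):
    """Precision and recall for one category treated as the positive class."""
    tp = matrix[(positive, positive)]
    fp = sum(n for (a, p), n in matrix.items() if p == positive and a != positive)
    fn = sum(n for (a, p), n in matrix.items() if a == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

actual    = ["spam", "spam", "spam", "ham", "ham", "ham", "ham", "ham"]
predicted = ["spam", "spam", "ham", "spam", "ham", "ham", "ham", "ham"]

matrix = confusion_matrix(actual, predicted)
precision, recall = precision_recall(matrix, "spam")
print(matrix)
print(precision, recall)
```

Precision penalizes false positives (legitimate mail flagged as spam) and recall penalizes false negatives (spam that gets through), which is why two classifiers can each look better depending on which error matters more.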
6. Running the classifier
- Avoid overfitting
- Hard and Soft Categorization
Soft Categorization as below:

| Rank | Category | Probability |
|------|----------|-------------|
| 1    | Z        | 0.76        |
| 2    | X        | 0.72        |
| 3    | Y        | 0.54        |
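
The difference between the two can be sketched in a few lines, using the probabilities from the ranking above: hard categorization returns only the single best category, while soft categorization returns every category ranked by score. The function names are hypothetical.

```python
def soft_categorize(probabilities):
    """Return every category ranked by descending probability."""
    return sorted(probabilities.items(), key=lambda item: item[1], reverse=True)

def hard_categorize(probabilities):
    """Return only the single most probable category."""
    return max(probabilities, key=probabilities.get)

scores = {"X": 0.72, "Y": 0.54, "Z": 0.76}
print(soft_categorize(scores))  # [('Z', 0.76), ('X', 0.72), ('Y', 0.54)]
print(hard_categorize(scores))  # Z
```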
7. Unsupervised text categorization
The input has no labels. No categories are given to the machine; it has to figure out the groupings itself.
Clustering, covered below, is one example.
8. Document clustering
Samples on YouTube:
- https://www.youtube.com/watch?v=CHlrx4gsoJI
- https://www.youtube.com/watch?v=Z-4S7kIoHa8
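
As a minimal sketch of document clustering, the code below runs plain k-means over bag-of-words vectors. To keep the result reproducible, the initial centroids are passed in explicitly (here, the first two documents), which is an assumption made for the example; practical implementations usually pick starting points at random or with a smarter seeding scheme, and use TF-IDF weights rather than raw counts.

```python
from collections import Counter

def vectorize(text):
    """Bag-of-words term-frequency vector."""
    return Counter(text.lower().split())

def squared_distance(a, b):
    """Squared Euclidean distance between two sparse vectors."""
    return sum((a.get(w, 0) - b.get(w, 0)) ** 2 for w in set(a) | set(b))

def mean_vector(vectors):
    """Component-wise average of a non-empty list of sparse vectors."""
    total = Counter()
    for vec in vectors:
        total.update(vec)
    return {w: v / len(vectors) for w, v in total.items()}

def kmeans(vectors, initial_centroids, iterations=10):
    """Assign each vector to its nearest centroid, then re-average; repeat."""
    centroids = list(initial_centroids)
    assignments = []
    for _ in range(iterations):
        assignments = [
            min(range(len(centroids)),
                key=lambda i: squared_distance(vec, centroids[i]))
            for vec in vectors
        ]
        for i in range(len(centroids)):
            members = [vec for vec, a in zip(vectors, assignments) if a == i]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[i] = mean_vector(members)
    return assignments

documents = [
    "cats and dogs are pets",
    "stocks and bonds are investments",
    "dogs and cats play",
    "bonds and stocks fall",
]
vectors = [vectorize(doc) for doc in documents]
clusters = kmeans(vectors, initial_centroids=vectors[:2])
print(clusters)
```

No labels are supplied anywhere: the grouping emerges only from which words the documents share.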
9. Topic modeling
Sample on YouTube: classifying Wikipedia topics from the article content:
- https://www.youtube.com/watch?v=3mHy4OSyRf0