Python code
1. Brief
A text categorization system automatically assigns a category to a piece of text.
2. Common Usage
- Assigning subject categories to documents
- Email spam detection
For example, Gmail blocks a lot of junk mail for us.
- Medical diagnosis
- Identifying a language (before further processing)
- Etc.
3. How does automatic text categorization work?
Phase one - Training – creating the text “classifier” (automatic categorization engine)
- You need a set of documents, already categorized
- Divide the set into training (typically 70%) and testing (30%)
- Build your classifier so that it accurately classifies the training set of documents to your level of comfort
- The “level of comfort” depends on how hard the task is!
- Evaluate your classifier on the test set to ensure sufficient accuracy
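The training phase can be sketched in a few lines of plain Python. This is only an illustration: the toy corpus, the function names and the keyword-based `classify` placeholder are assumptions made for the example, not part of any particular library.

```python
import random

def split_train_test(labeled_docs, train_fraction=0.7, seed=0):
    """Shuffle labeled documents and split them into training and test sets."""
    docs = list(labeled_docs)
    random.Random(seed).shuffle(docs)
    cut = round(len(docs) * train_fraction)
    return docs[:cut], docs[cut:]

def accuracy(classify, labeled_docs):
    """Fraction of documents whose predicted category matches the known one."""
    correct = sum(1 for text, label in labeled_docs if classify(text) == label)
    return correct / len(labeled_docs)

# Hypothetical toy corpus of (text, category) pairs.
corpus = [(f"cheap pills offer {i}", "spam") for i in range(10)] + \
         [(f"meeting notes agenda {i}", "ham") for i in range(10)]

train_set, test_set = split_train_test(corpus)

# Placeholder classifier (an assumption): a real one would be built from train_set.
def classify(text):
    return "spam" if "pills" in text else "ham"

print(len(train_set), len(test_set))
print(accuracy(classify, test_set))
```

The test set is kept away from training so that the accuracy figure reflects documents the classifier has not seen.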
Phase two - Running – using your classifier on new sets of documents
- You will not know how well it performs
- Need to “audit” the results occasionally (use an assessor)
- Assess a random sample of the documents against the predicted categories
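A minimal sketch of such an audit, assuming the predictions are stored as (document id, predicted category) pairs and the assessor's judgments as a dictionary. Both data layouts and all names are hypothetical.

```python
import random

def audit_sample(predictions, sample_size, seed=0):
    """Pick a random sample of (doc_id, predicted_category) pairs for manual review."""
    rng = random.Random(seed)
    return rng.sample(predictions, min(sample_size, len(predictions)))

def audited_accuracy(sample, assessor_labels):
    """Share of sampled documents where the assessor agrees with the prediction."""
    agree = sum(1 for doc_id, predicted in sample
                if assessor_labels[doc_id] == predicted)
    return agree / len(sample)

# Hypothetical predictions from a running classifier.
predictions = [(doc_id, "spam" if doc_id % 2 == 0 else "ham") for doc_id in range(100)]

# Hypothetical assessor judgments: the assessor disagrees on every tenth document.
assessor_labels = {doc_id: ("ham" if doc_id % 10 == 0 else predicted)
                   for doc_id, predicted in predictions}

sample = audit_sample(predictions, sample_size=20)
print(audited_accuracy(sample, assessor_labels))
```

The sampled accuracy is only an estimate of the true accuracy. A larger sample gives a tighter estimate at the cost of more assessor time.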
4. Classifiers
Hand-coded classifiers (the “good old days!”)
Rules are written by hand from if / then / else conditions and Boolean operators such as NOT.
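A hand-coded classifier might look like the sketch below. The keywords and category names are invented for illustration.

```python
def classify_by_rules(text):
    """Hand-written if / then / else rules over the words of a document."""
    words = set(text.lower().split())
    if "invoice" in words and "free" not in words:
        return "billing"
    elif "free" in words or "winner" in words:
        return "spam"
    else:
        return "other"

print(classify_by_rules("Your invoice is attached"))
print(classify_by_rules("You are a winner claim your free prize"))
```

Rules like these are easy to read, but every new category or exception has to be added and maintained by hand.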
Probabilistic Classifiers
Naïve Bayes
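As a rough sketch, a multinomial Naïve Bayes classifier with add-one smoothing can be written with the standard library alone. The function names and the four training sentences are assumptions for the example. A production system would more likely use an existing library.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(labeled_docs):
    """Count documents per category and word occurrences per category."""
    doc_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in labeled_docs:
        words = text.lower().split()
        doc_counts[label] += 1
        word_counts[label].update(words)
        vocabulary.update(words)
    return doc_counts, word_counts, vocabulary

def classify_naive_bayes(model, text):
    """Return the category with the highest log posterior (add-one smoothing)."""
    doc_counts, word_counts, vocabulary = model
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label in doc_counts:
        score = math.log(doc_counts[label] / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocabulary:
                continue  # ignore words never seen in training
            count = word_counts[label][word]
            score += math.log((count + 1) / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical training data.
model = train_naive_bayes([
    ("win cash prize now", "spam"),
    ("cheap pills win money", "spam"),
    ("meeting agenda for monday", "ham"),
    ("project meeting notes", "ham"),
])
print(classify_naive_bayes(model, "win cheap prize"))
print(classify_naive_bayes(model, "monday project meeting"))
```

Working with log probabilities avoids numeric underflow when many small probabilities are multiplied together.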
Decision Tree Classifiers
Example: decide whether a name is male or female.
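The name example can be sketched with a one-level decision tree (a decision stump). The features, the ten training names and the function names below are assumptions chosen for illustration. A full decision tree would keep splitting each branch recursively.

```python
from collections import Counter, defaultdict

def name_features(name):
    """Hypothetical features of a name; the last letter is a common choice."""
    name = name.lower()
    return {"last_letter": name[-1], "first_letter": name[0], "long_name": len(name) > 5}

def train_stump(labeled_names):
    """Pick the single feature whose per-value majority vote makes the fewest training errors."""
    default = Counter(label for _, label in labeled_names).most_common(1)[0][0]
    best = None
    for feature in name_features(labeled_names[0][0]):
        votes = defaultdict(Counter)
        for name, label in labeled_names:
            votes[name_features(name)[feature]][label] += 1
        # One leaf per feature value, labeled with the majority class.
        leaves = {value: counts.most_common(1)[0][0] for value, counts in votes.items()}
        errors = sum(1 for name, label in labeled_names
                     if leaves[name_features(name)[feature]] != label)
        if best is None or errors < best[0]:
            best = (errors, feature, leaves)
    _, feature, leaves = best
    return feature, leaves, default

def classify_name(stump, name):
    """Follow the single split; fall back to the majority class for unseen values."""
    feature, leaves, default = stump
    return leaves.get(name_features(name)[feature], default)

# Hypothetical training names.
stump = train_stump([
    ("Anna", "female"), ("Maria", "female"), ("Julia", "female"),
    ("Sophie", "female"), ("Emily", "female"),
    ("John", "male"), ("Peter", "male"), ("David", "male"),
    ("Mark", "male"), ("Tom", "male"),
])
print(stump[0])
print(classify_name(stump, "Laura"), classify_name(stump, "Kevin"))
```

With so few names the stump simply memorizes last letters, so it is a toy. The point is only to show how a tree chooses the feature to split on.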
The Rocchio Classifiers
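A simplified Rocchio classifier represents each category by the centroid (average vector) of its training documents and assigns a new document to the most similar centroid. The sketch below assumes plain term-frequency vectors and cosine similarity, and it leaves out the negative-example term of the full Rocchio formula. All names and the sample documents are illustrative.

```python
import math
from collections import Counter, defaultdict

def vectorize(text):
    """Bag-of-words term-frequency vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dict-like objects."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def train_rocchio(labeled_docs):
    """Compute one centroid per category."""
    sums, counts = defaultdict(Counter), Counter()
    for text, label in labeled_docs:
        sums[label].update(vectorize(text))
        counts[label] += 1
    return {label: {t: v / counts[label] for t, v in vec.items()}
            for label, vec in sums.items()}

def classify_rocchio(centroids, text):
    """Assign the category whose centroid is most similar to the document."""
    vec = vectorize(text)
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))

# Hypothetical training documents.
centroids = train_rocchio([
    ("goal match team score", "sports"),
    ("team wins match", "sports"),
    ("election vote parliament", "politics"),
    ("vote on new law", "politics"),
])
print(classify_rocchio(centroids, "the team scored a goal"))
print(classify_rocchio(centroids, "parliament vote today"))
```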
Support Vector Machines (SVMs)
Before deep learning conquered the world, the SVM was king.
5. Evaluation
Confusion Matrix.
Compare the confusion matrices from two classifiers: which classifier is better?
The answer: it depends. Different businesses care about different kinds of errors.
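To make the comparison concrete, here is a small sketch that builds confusion matrices for two hypothetical spam classifiers and derives precision and recall from them. The labels and predictions are made up for the example.

```python
from collections import Counter

def confusion_matrix(actual, predicted):
    """Count how often each (actual, predicted) label pair occurs."""
    return Counter(zip(actual, predicted))

def precision_recall(matrix, positive):
    """Precision and recall for one category, read off the confusion matrix."""
    tp = matrix[(positive, positive)]
    fp = sum(n for (a, p), n in matrix.items() if p == positive and a != positive)
    fn = sum(n for (a, p), n in matrix.items() if a == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical ground truth: 4 spam mails followed by 6 legitimate ones.
actual = ["spam"] * 4 + ["ham"] * 6

# Classifier A is cautious: it misses spam but never flags a legitimate mail.
predicted_a = ["spam", "spam", "ham", "ham"] + ["ham"] * 6
# Classifier B is aggressive: it catches all spam but flags two legitimate mails.
predicted_b = ["spam"] * 4 + ["spam", "spam"] + ["ham"] * 4

matrix_a = confusion_matrix(actual, predicted_a)
matrix_b = confusion_matrix(actual, predicted_b)
print(precision_recall(matrix_a, "spam"))
print(precision_recall(matrix_b, "spam"))
```

Both classifiers get 8 of 10 mails right, yet A has perfect precision and B has perfect recall. Which one is "better" depends on whether a lost legitimate mail or a delivered spam mail costs the business more.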
6. Running the classifier
- Avoid overfitting
- Hard and Soft Categorization
Soft Categorization as below:
Rank  Category  Probability
1     Z         0.76
2     X         0.72
3     Y         0.54
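The difference between the two can be sketched as follows, reusing the scores from the table above. The function names are hypothetical, and "hard" is taken here to mean committing to the single best category.

```python
def soft_categorize(scores):
    """Return (rank, category, probability) rows, best first."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [(rank, category, prob)
            for rank, (category, prob) in enumerate(ranked, start=1)]

def hard_categorize(scores):
    """Commit to the single highest-scoring category."""
    return max(scores, key=scores.get)

# Scores taken from the soft categorization table.
scores = {"X": 0.72, "Y": 0.54, "Z": 0.76}

for row in soft_categorize(scores):
    print(row)
print(hard_categorize(scores))
```

Soft categorization leaves the final decision to the caller, which is useful when the top scores are close, as with Z and X here.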
7. Unsupervised text categorization
The input variables come without labels. No categories are given to the machine; it has to figure them out by itself.
Clustering, covered below, is one example.
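As an illustration of grouping documents without labels, here is a bare-bones k-means sketch. It assumes each document has already been turned into a numeric vector (here, hypothetical counts of two words) and uses the first k points as starting centroids, which is a simplification.

```python
import math

def kmeans(points, k, iterations=10):
    """Plain k-means: assign each point to its nearest centroid, then recompute centroids."""
    centroids = [list(p) for p in points[:k]]  # naive initialization (an assumption)
    assignments = [0] * len(points)
    for _ in range(iterations):
        assignments = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                       for p in points]
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return assignments

# Hypothetical document vectors: (count of "goal", count of "vote").
documents = [(5, 0), (0, 4), (4, 1), (1, 5), (6, 1), (0, 6)]
clusters = kmeans(documents, k=2)
print(clusters)
```

The algorithm only returns cluster numbers. Deciding that one cluster means "sports" and the other "politics" is still up to a human.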
8. Document clustering
Samples on YouTube:
9. Topic modeling
Sample on YouTube: classifying Wikipedia topics from the article content, link as below.