Target: Preparing Textual Data for Analysis
Step 1: File Preprocessing
Convert files from source formats such as HTML, Word, PDF, XML, TXT, and Excel into TXT, JSON, or XML, which can be easily imported by a program.
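As an illustration, here is a minimal sketch of one such conversion, extracting plain text from an HTML file with BeautifulSoup (the library choice and file names are assumptions, not part of the original notes):

```python
from bs4 import BeautifulSoup  # pip install beautifulsoup4

# Read an HTML source file (hypothetical name).
with open("report.html", encoding="utf-8") as f:
    soup = BeautifulSoup(f, "html.parser")

# get_text() strips the markup, leaving only the textual content.
text = soup.get_text(separator=" ", strip=True)

# Write the result as plain TXT, ready to be imported by an analysis program.
with open("report.txt", "w", encoding="utf-8") as f:
    f.write(text)
```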
Step 2: From Text to Words
1. Tokenization
Break a stream of characters into tokens by identifying token delimiters (a minimal sketch follows this list):
- Whitespace characters such as space, tab, newline
- Punctuation characters like ( ) < > ! ? “ ”
- Other characters such as . , : ‐ ‘ ’ etc.
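A regular expression is often enough for this kind of delimiter-based tokenization (the pattern below is an assumption; production tokenizers such as those in NLTK or spaCy handle many more edge cases):

```python
import re

def tokenize(text):
    # Keep runs of letters/digits, allowing an internal apostrophe
    # (e.g. "don't"); everything else acts as a delimiter.
    return re.findall(r"[A-Za-z0-9]+(?:'[A-Za-z]+)?", text)

print(tokenize("Break a stream of characters into tokens!"))
# ['Break', 'a', 'stream', 'of', 'characters', 'into', 'tokens']
```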
2. Lemmatization / Stemming
- Inflectional stemming: Grammatical variants of the same word (no change of POS)
- Plural form of nouns
  classes -> class
  stories -> story
- Verbs in different tense and aspect
  likes, liked, liking -> like
- Derivational stemming: Forming new words from another word or stem by adding prefixes and/or suffixes (changes the POS, so it is riskier)
- Can change the syntactic category of a root
  production -> produce
- Can cause a change of meaning
  unhappy -> happy
- Other normalisation
- Case normalisation: lowercase
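A sketch of how these operations look in practice, using NLTK's Porter stemmer and WordNet lemmatizer (the library choice is an assumption; any comparable toolkit would do):

```python
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download("wordnet", quiet=True)  # lexical data used by the lemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# Case normalisation, then inflectional reduction with the lemmatizer.
print(lemmatizer.lemmatize("Classes".lower()))  # class
print(lemmatizer.lemmatize("stories"))          # story
print(lemmatizer.lemmatize("liked", pos="v"))   # like (verbs need a POS hint)

# Derivational suffixes are where stemmers get aggressive: the output
# may not even be a dictionary word.
print(stemmer.stem("production"))               # product, not produce
```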
3. Stopword Removal
Some words are extremely common. They appear in almost all documents and carry little meaning. They are of limited use in text analytics applications.
- Functional words (conjunctions, prepositions, determiners, or pronouns) like the, of, to, and, it, etc.
- A stopword list can be constructed to exclude them from analysis.
- Depending on the domain, other words may need to be included in the stopword list.
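A minimal sketch of stopword filtering with NLTK's English stopword list (a hand-built, domain-specific list would be used the same way):

```python
import nltk
from nltk.corpus import stopwords

nltk.download("stopwords", quiet=True)

# Start from a standard list, then extend it with domain-specific words.
stop = set(stopwords.words("english"))
stop.update({"dataset"})  # hypothetical domain-specific addition

tokens = ["the", "analysis", "of", "textual", "data", "and", "it"]
content = [t for t in tokens if t not in stop]
print(content)  # ['analysis', 'textual', 'data']
```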
Step 3: Indexing
Many text mining applications are based on a vector representation of documents (the term-document matrix) using the “bag-of-words” approach.
Usually only content words (adjectives, adverbs, nouns, and verbs) are used as vector features.
- Two Ways of Term Weighting
- Binary
  0 or 1, simply indicating whether a word has occurred in the document (but that's not very helpful).
- Frequency-based*
  Term frequency, the frequency of words in the document, which provides additional information that can be used to contrast with other documents.
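Both weighting schemes can be sketched with scikit-learn's CountVectorizer by toggling its binary flag (scikit-learn and the toy corpus are assumptions):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat on the mat", "the cat likes the cat"]

# Frequency-based weighting: raw term counts per document.
freq = CountVectorizer()
print(freq.fit_transform(docs).toarray())

# Binary weighting: 1 if the term occurs in the document, else 0.
binary = CountVectorizer(binary=True)
print(binary.fit_transform(docs).toarray())

print(freq.get_feature_names_out())  # the column (term) order of the matrix
```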
- Frequent Word List
- With a frequency-based DTM (document-term matrix), a list of words and their frequencies in the corpus can be generated.
- Sorted by frequency, this list can give us a rough idea of what the corpus is about.
- A word cloud is a nice visualization tool for it.
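Such a frequent word list can be sketched directly with Python's Counter over the pooled corpus tokens (the tokens here are hypothetical):

```python
from collections import Counter

# Tokens pooled from the whole corpus after the preprocessing steps above
# (tokenization, stemming/lemmatization, stopword removal).
corpus_tokens = ["text", "mining", "text", "data", "mining", "text"]

freq_list = Counter(corpus_tokens).most_common()
print(freq_list)  # [('text', 3), ('mining', 2), ('data', 1)]
```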
- Other Weighting Methods
- Normalized frequency
To deal with varied document lengths, since a long document naturally has more occurrences of terms than a short document (a sketch follows this list).
normalized_frequency = frequency of a term in a document / total number of terms in the document
- tf-idf*
To weight the frequency of a word in a document by the perceived importance of that word (the inverse document frequency); widely used in information retrieval.
- When a word appears in many documents, it's considered unimportant.
- When the word is relatively unique and appears in few documents, it's important.
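The normalized frequency defined above is a one-line computation (the counts in this sketch are hypothetical):

```python
# Normalized frequency: term count divided by document length in tokens.
def normalized_frequency(term_count, total_terms):
    return term_count / total_terms

# e.g. a term occurring 4 times in a 200-token document
print(normalized_frequency(4, 200))  # 0.02
```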
- tf‐idf Indexing
Example:
Note that in this example, stopwords and very common words are not removed, and terms are not reduced to their root forms.
- Get tf table
- Calculate IDF
log(D/Df) is the log (base 10 here) of the total number of documents divided by the number of documents that contain the term.
For the term "sliver":
- we have three documents, so D = 3
- the number of documents containing the term "sliver" is one, so Df = 1
- so D/Df = 3/1 = 3
- so IDF = log(D/Df) = log(3) = 0.4771
- using TF * IDF, we get the tf-idf row values 0*0.4771, 2*0.4771, 0*0.4771, i.e. 0, 0.9542, 0
- repeat this step for all terms
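The whole calculation can be sketched in a few lines; the per-document counts for "sliver" are taken from the example above, using the base-10 log it implies:

```python
import math

# Term frequencies of "sliver" in the three documents (from the example).
tf = [0, 2, 0]

D = len(tf)                         # total number of documents: 3
Df = sum(1 for t in tf if t > 0)    # documents containing the term: 1
idf = round(math.log10(D / Df), 4)  # log10(3) = 0.4771

tfidf = [round(t * idf, 4) for t in tf]
print(idf, tfidf)  # 0.4771 [0.0, 0.9542, 0.0]
```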