Hierarchical clustering example
> dist_rain
> # Reclassify distances as hierarchical cluster object
> hc <- hclust(dist_rain)
> # Plot dendrogram with city labels
> plot(hc, labels = rain$city)
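The clustering steps on this slide can be sketched end to end in base R. The `rain` data frame below is hypothetical (the city names and rainfall values are made up for illustration), and `cutree()` is added only to show how the tree splits into groups; it is not part of the slide.

```r
# Hypothetical rainfall data; cities and values are illustrative assumptions.
rain <- data.frame(
  city = c("Cleveland", "Portland", "Boston", "New Orleans"),
  rainfall = c(39.1, 39.1, 43.8, 62.5)
)

# Pairwise Euclidean distances on the numeric column only
dist_rain <- dist(rain[, "rainfall", drop = FALSE])

# Reclassify distances as a hierarchical cluster object (complete linkage by default)
hc <- hclust(dist_rain)

# Plot dendrogram with city labels
plot(hc, labels = rain$city)

# Cut the tree into two groups to see which cities cluster together
groups <- cutree(hc, k = 2)
```

With these made-up values, the city with much higher rainfall ends up in its own group, which is the kind of separation the dendrogram makes visible.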
Intro to Text Mining: Bag of Words
Dendrogram aesthetics
> # Load dendextend package
> library(dendextend)
> # Convert distance matrix to dendrogram
> hc <- hclust(dist_rain)
> hcd <- as.dendrogram(hc)
> # Color branches
> hcd
> # Plot dendrogram with some aesthetics
> plot(hcd, main = "Better Dendrogram")
> rect.dendrogram(hcd, k = 2, border = "grey50")
Getting past single words
Unigrams, bigrams, trigrams, oh my!
> # Use only first 2 coffee tweets
> tweets$text[1:2]
[1] @ayyytylerb that is so true drink lots of coffee
[2] RT @bryzy_brib: Senior March tmw morning at 7:25 A.M. in the SENIOR lot. Get up early, make yo coffee/breakfast, cus this will only happen…
> # Make a unigram DTM on first 2 coffee tweets
> unigram_dtm
<<DocumentTermMatrix (documents: 2, terms: 18)>>
Non-/sparse entries: 18/18
Sparsity           : 50%
Maximal term length: 15
Weighting          : term frequency (tf)
Unigrams, bigrams, trigrams, oh my!
> # Load RWeka package
> library(RWeka)
> # Define bigram tokenizer
> tokenizer
> # Make a bigram TDM
> bigram_tdm
<<TermDocumentMatrix (terms: 21, documents: 2)>>
Non-/sparse entries: 21/21
Sparsity           : 50%
Maximal term length: 19
Weighting          : term frequency (tf)
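To make the difference between unigram and bigram tokens concrete, here is a minimal base R sketch. It is not the RWeka tokenizer from the slide: `ngram_tokens` is a hypothetical helper that splits on whitespace and joins adjacent words, applied to the text of the first tweet with its handle removed.

```r
# Hypothetical helper: return all n-word tokens from a single string.
ngram_tokens <- function(text, n = 1) {
  words <- strsplit(text, "\\s+")[[1]]
  words <- words[nzchar(words)]            # drop empty strings from leading spaces
  if (length(words) < n) return(character(0))
  starts <- seq_len(length(words) - n + 1)
  # Join each run of n adjacent words into one token
  vapply(starts,
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

tweet <- "that is so true drink lots of coffee"

unigrams <- ngram_tokens(tweet, 1)  # single words
bigrams  <- ngram_tokens(tweet, 2)  # adjacent word pairs
```

A text of eight words yields eight unigrams but only seven bigrams, and almost every bigram is distinct. That is why a bigram matrix tends to have more terms and longer maximal term lengths than the unigram matrix built from the same documents.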
Different frequency criteria
Term weights
● Default term frequency = simple word count
● Frequent words can mask insights
Term frequency-inverse document frequency
● Adjust term weighting via TfIdf
● Words appearing in many documents are penalized
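The penalty described above can be worked through by hand on a toy matrix. The sketch below is base R and uses made-up counts. It assumes the common definition of raw term count multiplied by log2(number of documents / number of documents containing the term). The `weightTfIdf` function in the tm package also normalizes counts by document length by default, which this sketch leaves out.

```r
# Toy term-document matrix (rows = terms, columns = documents); counts are made up.
tf <- matrix(c(2, 1, 3,
               1, 0, 0,
               0, 2, 0),
             nrow = 3, byrow = TRUE,
             dimnames = list(c("coffee", "drink", "breakfast"),
                             c("d1", "d2", "d3")))

n_docs   <- ncol(tf)
doc_freq <- rowSums(tf > 0)          # number of documents containing each term
idf      <- log2(n_docs / doc_freq)  # inverse document frequency per term

# Each row (term) is scaled by its own idf value
tf_idf <- tf * idf
```

In this example "coffee" appears in every document, so its idf is log2(3/3) = 0 and its weight drops to zero everywhere, while "breakfast" appears in one document only and keeps a positive weight.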
Term weights
> # Standard term weighting
> tf_tdm
> tf_idf_tdm
> tf_idf_tdm_m
> tf_tdm_m

> # Create mapping to metadata
> custom_reader
> # Create VCorpus including metadata
> test_corpus
> # Clean and view results
> text_corpus
> text_corpus[[1]][1]
$content
[1] "ayyytylerb true drink lots coffee"
> text_corpus[[1]][2]
$meta
  id      : 1
  author  : thejennagibson
  date    : 8/9/2013 2:43
  language: en
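One way to keep metadata alongside each document is sketched below. This is not the `custom_reader` mapping from the slide: it assumes a recent version of the tm package (0.7 or later), where `DataframeSource` expects the first two columns to be named `doc_id` and `text` and treats any further columns as document-level metadata. The second row of the data frame is made up for illustration.

```r
library(tm)

# First row reuses values shown on the slide; the second row is hypothetical.
tweets_df <- data.frame(
  doc_id = c("1", "2"),
  text   = c("ayyytylerb true drink lots coffee",
             "make coffee breakfast early"),
  author = c("thejennagibson", "example_user"),
  date   = c("8/9/2013 2:43", "8/9/2013 2:45"),
  stringsAsFactors = FALSE
)

# Create VCorpus including metadata (columns after doc_id and text)
meta_corpus <- VCorpus(DataframeSource(tweets_df))

# View the text of the first document
first_text <- as.character(content(meta_corpus[[1]]))

# Document-level metadata comes back as a data frame, one row per document
doc_meta <- meta(meta_corpus)
```

Keeping author and date attached to each document makes it possible to subset or compare documents by those fields after cleaning.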