Intro to Text Mining: Bag of Words


Simple word clustering

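Hierarchical clustering example
> # Calculate distances (assumes the numeric values are in column 2 of rain)
> dist_rain <- dist(rain[, 2])
> # Reclassify distances as hierarchical cluster object
> hc <- hclust(dist_rain)
> # Plot dendrogram with city labels
> plot(hc, labels = rain$city)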
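A self-contained sketch of the same steps, for readers without the course's rain data. The data frame below is invented for illustration (the city names and rainfall values are assumptions, not the course data), and only base R functions are used.

```r
# Toy stand-in for the course's rain data (values are invented for illustration)
rain <- data.frame(
  city     = c("Cleveland", "Portland", "Boston", "New Orleans"),
  rainfall = c(39.1, 36.0, 43.8, 62.5)
)

# Pairwise Euclidean distances between the rainfall values
dist_rain <- dist(rain$rainfall)

# Agglomerative hierarchical clustering (complete linkage is hclust()'s default)
hc <- hclust(dist_rain)

# Dendrogram with city names as leaf labels
plot(hc, labels = rain$city)

# Cut the tree into two groups to see which cities cluster together
groups <- cutree(hc, k = 2)
split(rain$city, groups)
```

With these made-up values, the three cities with similar rainfall should land in one group and the wettest city in the other.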

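Dendrogram aesthetics
> # Load dendextend package
> library(dendextend)
> # Convert distance matrix to dendrogram (dist_rain from the previous example is assumed)
> hc <- hclust(dist_rain)
> hcd <- as.dendrogram(hc)
> # Color branches (color_branches() with k = 2 is an assumed choice)
> hcd <- color_branches(hcd, k = 2)
> # Plot dendrogram with some aesthetics
> plot(hcd, main = "Better Dendrogram")
> rect.dendrogram(hcd, k = 2, border = "grey50")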
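A minimal runnable sketch of the dendextend workflow, assuming the dendextend package is installed. The rainfall values are invented, and color_branches() is one possible way to color branches; the slide's exact coloring call is not known.

```r
library(dendextend)  # provides color_branches() and rect.dendrogram()

# Toy rainfall values (invented) with city names as labels
rainfall <- c(39.1, 36.0, 43.8, 62.5)
names(rainfall) <- c("Cleveland", "Portland", "Boston", "New Orleans")

# Cluster on the pairwise distances
hc <- hclust(dist(rainfall))

# Convert the hclust object to a dendrogram so dendextend can style it
hcd <- as.dendrogram(hc)

# Color the branches of the two top-level clusters
hcd <- color_branches(hcd, k = 2)

# Plot, then outline the same two clusters with grey rectangles
plot(hcd, main = "Better Dendrogram")
rect.dendrogram(hcd, k = 2, border = "grey50")
```

Using the same k in color_branches() and rect.dendrogram() keeps the colors and the rectangles in agreement.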

Let’s practice!

Getting past single words

Unigrams, bigrams, trigrams, oh my!
> # Use only first 2 coffee tweets
> tweets$text[1:2]
[1] @ayyytylerb that is so true drink lots of coffee
[2] RT @bryzy_brib: Senior March tmw morning at 7:25 A.M. in the SENIOR lot. Get up early, make yo coffee/breakfast, cus this will only happen…
> # Make a unigram DTM on first 2 coffee tweets (text_corp is an assumed name for the cleaned corpus)
> unigram_dtm <- DocumentTermMatrix(text_corp)
> unigram_dtm
Non-/sparse entries: 18/18
Sparsity           : 50%
Maximal term length: 15
Weighting          : term frequency (tf)
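A small sketch of building a unigram document-term matrix, assuming the tm package is installed. The two documents are made up (they are not the course's coffee tweets), and the wordLengths setting is an added choice so that two-letter words are kept.

```r
library(tm)

# Two short made-up documents (not the course's coffee tweets)
docs <- c("drink lots of coffee", "make coffee early")

# Build a volatile corpus from the character vector
corp <- VCorpus(VectorSource(docs))

# Rows are documents, columns are single-word terms;
# wordLengths = c(1, Inf) keeps short words such as "of"
unigram_dtm <- DocumentTermMatrix(corp, control = list(wordLengths = c(1, Inf)))

# Summary of the matrix, then the raw counts
unigram_dtm
as.matrix(unigram_dtm)
```

Each cell of the matrix is the number of times a single word occurs in one document, which is what "unigram" means here.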

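Unigrams, bigrams, trigrams, oh my!
> # Load RWeka package
> library(RWeka)
> # Define bigram tokenizer
> tokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
> # Make a bigram TDM (text_corp is an assumed name for the cleaned corpus)
> bigram_tdm <- TermDocumentMatrix(text_corp, control = list(tokenize = tokenizer))
> bigram_tdm
Non-/sparse entries: 21/21
Sparsity           : 50%
Maximal term length: 19
Weighting          : term frequency (tf)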
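A sketch of bigram tokenization that avoids RWeka's Java dependency. It swaps NGramTokenizer() for ngrams() from the NLP package, which is attached together with tm. The documents and the name bigram_tokenizer are illustrative assumptions.

```r
library(tm)  # also attaches NLP, which provides ngrams() and words()

# Two short made-up documents
docs <- c("drink lots of coffee", "make coffee early")
corp <- VCorpus(VectorSource(docs))

# Split a document into words, then paste each adjacent pair into one term
bigram_tokenizer <- function(x) {
  unlist(lapply(NLP::ngrams(words(x), 2), paste, collapse = " "),
         use.names = FALSE)
}

# Rows are two-word terms, columns are documents
bigram_tdm <- TermDocumentMatrix(corp, control = list(tokenize = bigram_tokenizer))

Terms(bigram_tdm)
```

A document with n words yields n - 1 bigrams, so the term list grows quickly and the matrix becomes sparser than the unigram version.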

Let’s practice!

Different frequency criteria

Term weights
● Default term frequency = simple word count
● Frequent words can mask insights
● Term frequency-inverse document frequency
● Adjust term weighting via TfIdf
● Words appearing in many documents are penalized
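To make the penalty concrete, here is a hand-rolled TF-IDF calculation in base R. The counts are invented, and the formula used (count times log2 of documents over document frequency) is a simplified sketch; tm's weightTfIdf() also normalizes counts by document length by default, which is left out here.

```r
# Toy term-document counts (invented): rows are terms, columns are documents
tf <- matrix(
  c(2, 1, 3,   # "coffee" appears in every document
    1, 0, 0,   # "drink" appears only in doc1
    0, 2, 0),  # "early" appears only in doc2
  nrow = 3, byrow = TRUE,
  dimnames = list(c("coffee", "drink", "early"), c("doc1", "doc2", "doc3"))
)

# Document frequency: number of documents containing each term
df <- rowSums(tf > 0)

# Inverse document frequency: zero for terms present in all documents
idf <- log2(ncol(tf) / df)

# Weight each term's counts by its idf (idf is recycled down the rows)
tf_idf <- tf * idf
tf_idf
```

The "coffee" row drops to zero because the word occurs in every document, while the rarer terms keep a positive weight.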

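Term weights
> # Standard term weighting (text_corp is an assumed name for the cleaned corpus)
> tf_tdm <- TermDocumentMatrix(text_corp)
> tf_tdm_m <- as.matrix(tf_tdm)
> # TfIdf weighting
> tf_idf_tdm <- TermDocumentMatrix(text_corp, control = list(weighting = weightTfIdf))
> tf_idf_tdm_m <- as.matrix(tf_idf_tdm)

> # Create mapping to metadata (readTabular() from older tm releases; the tweets column names are assumptions)
> custom_reader <- readTabular(mapping = list(content = "text", id = "num", author = "screenName", date = "created"))
> # Create VCorpus including metadata
> text_corpus <- VCorpus(DataframeSource(tweets), readerControl = list(reader = custom_reader))
> # Clean and view results (clean_corpus() is assumed to be a user-defined cleaning function)
> text_corpus <- clean_corpus(text_corpus)
> text_corpus[[1]][1]
$content
[1] "ayyytylerb true drink lots coffee"
> text_corpus[[1]][2]
$meta
  id      : 1
  author  : thejennagibson
  date    : 8/9/2013 2:43
  language: en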
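A sketch of keeping metadata with recent tm releases, where a custom reader mapping is not needed. It assumes a tm version in which DataframeSource() expects the first two columns to be named doc_id and text and treats any further columns as document-level metadata. The table and the author values are made up.

```r
library(tm)

# Made-up tweets table; doc_id and text must be the first two columns
tweets <- data.frame(
  doc_id = c("1", "2"),
  text   = c("drink lots of coffee", "make coffee early"),
  author = c("user_a", "user_b"),
  stringsAsFactors = FALSE
)

# Columns after doc_id and text are carried along as document-level metadata
text_corpus <- VCorpus(DataframeSource(tweets))

# Text of the first document
content(text_corpus[[1]])

# Data frame of document-level metadata, one row per document
meta(text_corpus)
```

Keeping the author column attached to the corpus makes it possible to filter or group documents by metadata after cleaning.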

Let’s practice!