r - What's the first element in my trigrams? -


I am using a trigram tokenizer built on the RWeka package:

> TrigramTokenizer <- function(x) { NGramTokenizer(x, Weka_control(min = 3, max = 3)) }

I tokenized my corpus. Inspecting the result shows trigrams like this:

> inspect(tdm_trigram[1:10, 1:3])
A term-document matrix (10 terms, 3 documents)

Non-/sparse entries: 10/20
Sparsity           : 67%
Maximal term length: 17
Weighting          : term frequency (tf)

                        Docs
Terms                    en_us.blogs.capped.txt en_us.news.capped.txt
  \u0097 age believe                          0                     1
  \u0095 tradeable                            0                     1
  \u0093 amazing feat\u0094                   0                     1
  \u0097 appear poised                        0                     1
  \u0096 areas muslim                         0                     1

What is \u0097? I preprocessed the corpus with the usual methods from the tm library (stripWhitespace, removePunctuation, and so on).

Should I perhaps read the files in using a different encoding?

These are Unicode control characters that have been interpreted as words.

In older versions of Unicode they were named:

  • U+0097 END OF GUARDED AREA
  • U+0095 MESSAGE WAITING
  • U+0093 SET TRANSMIT STATE
  • U+0096 START OF GUARDED AREA

You may want to strip them out before building the trigrams.
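One way to strip them is sketched below in base R. The helper name `strip_c1_controls` and the sample strings are made up for illustration and are not part of tm or RWeka. It replaces every character in the C1 control range (U+0080 to U+009F) with a space and then collapses the leftover whitespace. On the encoding question: in Windows-1252, the bytes 0x93 to 0x97 are curly quotes, a bullet, and dashes, so these characters may be punctuation that was decoded with the wrong encoding. That is only a guess about the source files.

```r
# Hypothetical sample text containing C1 control characters.
samples <- c("\u0097 age believe", "\u0093 amazing feat\u0094", "plain text here")

# Illustrative helper (not a tm or RWeka function): replace every
# character in U+0080..U+009F with a space, then tidy the whitespace.
strip_c1_controls <- function(text) {
  cleaned <- gsub("[\u0080-\u009f]", " ", text, perl = TRUE)
  trimws(gsub("\\s+", " ", cleaned))
}

cleaned_samples <- strip_c1_controls(samples)
print(cleaned_samples)
```

With a tm corpus, the same helper could probably be applied before tokenizing via `tm_map(corpus, content_transformer(strip_c1_controls))`, where `corpus` stands in for your own corpus object.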

