R - What's the first element in my trigrams?

- January 15, 2011

I am using a trigram tokenizer built on the RWeka package:

> TrigramTokenizer <- function(x) {NGramTokenizer(x, Weka_control(min = 3, max = 3))}

I tokenized the corpus. Inspecting the result shows trigrams like this:

> inspect(tdm_trigram[1:10, 1:3])
A term-document matrix (10 terms, 3 documents)

Non-/sparse entries: 10/20
Sparsity           : 67%
Maximal term length: 17
Weighting          : term frequency (tf)

                            Docs
Terms                       en_us.blogs.capped.txt en_us.news.capped.txt
  \u0097 age believe                             0                     1
  \u0095 tradeable                               0                     1
  \u0093 amazing feat\u0094                      0                     1
  \u0097 appear poised                           0                     1
  \u0096 areas muslim                            0                     1

What is \u0097? I preprocessed the corpus with the usual methods from the tm library (stripWhitespace, removePunctuation, and so on).

Should I perhaps read the files in using a different encoding?

These are Unicode control characters that have been interpreted as words.

In older versions of Unicode they were named:

U+0097 END OF GUARDED AREA
U+0095 MESSAGE WAITING
U+0093 SET TRANSMIT STATE
U+0096 START OF GUARDED AREA

You may want to strip them out before building the trigrams.
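One way to strip them, as a minimal sketch: remove every character in the C1 control range (U+0080 to U+009F) with a regular expression before tokenizing. The helper name `strip_c1_controls` and the variable `corpus` are hypothetical, not part of tm or RWeka, and the sketch assumes the text is held as UTF-8 strings.

```r
# Hypothetical helper (not part of tm or RWeka): delete C1 control
# characters (U+0080 to U+009F) from every element of a character vector.
strip_c1_controls <- function(x) {
  gsub("[\u0080-\u009f]", "", x, perl = TRUE)
}

# Two of the terms from the inspect() output, written with \u escapes.
terms <- c("\u0097 age believe", "\u0093 amazing feat\u0094")

# The control characters are removed; trimws() drops the space left behind.
cleaned <- trimws(strip_c1_controls(terms))
print(cleaned)

# Assumed usage with tm, where `corpus` stands for your own corpus object:
# corpus <- tm_map(corpus, content_transformer(strip_c1_controls))
```

Running this step before stripWhitespace lets the leftover spaces be collapsed by the existing pipeline. On the encoding question: these four code points line up with the Windows-1252 bytes for the em dash (0x97), bullet (0x95), curly quotes (0x93, 0x94) and en dash (0x96), so the files may have been read with the wrong encoding, but nothing in the output shown here confirms that.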
