R - What's the first element in my trigrams?
I'm using a trigram tokenizer from the RWeka package:
> TrigramTokenizer <- function(x) {NGramTokenizer(x, Weka_control(min = 3, max = 3))}
I tokenized the corpus. Inspection shows trigrams like this:
> inspect(tdm_trigram[1:10, 1:3])
A term-document matrix (10 terms, 3 documents)

Non-/sparse entries: 10/20
Sparsity           : 67%
Maximal term length: 17
Weighting          : term frequency (tf)

                            Docs
Terms                       en_us.blogs.capped.txt en_us.news.capped.txt
  \u0097 age believe                             0                     1
  \u0095 tradeable                               0                     1
  \u0093 amazing feat\u0094                      0                     1
  \u0097 appear poised                           0                     1
  \u0096 areas muslim                            0                     1
What's \u0097? I preprocessed the corpus with the usual methods from the tm library (stripWhitespace, remove punctuation, and so on).
Should I perhaps read it in using a different encoding?
These are Unicode control characters that have been interpreted as words.
In older versions of Unicode:
- U+0097 End of Guarded Area
- U+0095 Message Waiting
- U+0093 Set Transmit State
- U+0096 Start of Guarded Area
You may want to strip them out before building the trigrams.
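One way to do that stripping, as a minimal sketch in base R: the helper name strip_c1_controls is hypothetical (it is not part of tm or RWeka), and it assumes the text is, or can be converted to, UTF-8. It replaces every character in the C1 control range (U+0080 to U+009F) with a space, then collapses and trims whitespace so neighbouring words are not glued together.

```r
# Hypothetical helper, not from tm or RWeka: remove C1 control
# characters (U+0080 to U+009F) from a character vector.
strip_c1_controls <- function(x) {
  # Assumption: the input can be represented as UTF-8.
  x <- enc2utf8(x)
  # Replace each C1 control character with a space.
  x <- gsub("[\u0080-\u009F]", " ", x, perl = TRUE)
  # Collapse runs of whitespace and trim both ends.
  trimws(gsub("\\s+", " ", x, perl = TRUE))
}

# Sample strings modelled on the terms shown in the question.
terms <- c("\u0097 age believe", "amazing feat\u0094", "\u0096 areas muslim")
cleaned <- strip_c1_controls(terms)
print(cleaned)
```

With tm, something like tm_map(corpus, content_transformer(strip_c1_controls)) could apply this before tokenizing, where corpus stands for your own corpus object. These code points also coincide with the Windows-1252 byte values for dashes, bullets and curly quotes, so re-reading the files with that encoding declared may be worth trying; that is a guess about the input files, not something the output above confirms.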