string - R: producing a list of near matches with stringdist and stringdistmatrix -

- July 15, 2013

I discovered the excellent package "stringdist" and want to use it to compute string distances. In particular, I have a set of words, and I want to print out the near-matches, where a "near match" is decided through some algorithm like the Levenshtein distance.

I have extremely slow working code in a shell script, and I was able to load the words into stringdist and produce a matrix of metrics. I want to boil that matrix down to a smaller one that has only the near matches, e.g. where the metric is non-zero but less than a threshold.

    library(stringdist)

    kp <- c('leaflet','leafletr','lego','levenshtein-distance','logo')
    kpm <- stringdistmatrix(kp,useNames="strings",method="lv")
    > kpm
                         leaflet leafletr lego levenshtein-distance
    leafletr                   1
    lego                       5        6
    levenshtein-distance      16       16   18
    logo                       6        7    1                   19
    m = as.matrix(kpm)
    close = apply(m, 1, function(x) x>0 & x<5)
    > close
                         leaflet leafletr  lego levenshtein-distance  logo
    leaflet                FALSE     TRUE FALSE                FALSE FALSE
    leafletr                TRUE    FALSE FALSE                FALSE FALSE
    lego                   FALSE    FALSE FALSE                FALSE  TRUE
    levenshtein-distance   FALSE    FALSE FALSE                FALSE FALSE
    logo                   FALSE    FALSE  TRUE                FALSE FALSE

OK, I have a (big) dist, but how do I reduce it to a list where the output is something like

    leafletr,leaflet,1
    logo,lego,1

for cases where the metric is non-zero and less than n=5? I found that "apply()" lets me do the test, but I need to sort out how to use it.

The problem is not specific to stringdist and stringdistmatrix and is very elementary R, but still I'm stuck. I suspect the answer involves subset(), but I don't know how to transform a "dist" into something else.

You can do this:

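    library(reshape2)
    lower <- m
    lower[upper.tri(lower, diag = TRUE)] <- NA  # blank the diagonal and the mirrored half
    d <- melt(lower, na.rm = TRUE)
    out <- subset(d, value > 0 & value < 5)

Here, melt brings the matrix into long form (two columns with the string names and one column with the value). Because m is symmetric, every pair appears in it twice in mirrored order, and unique cannot collapse the two copies since their name columns are swapped. Blanking the upper triangle before melting keeps each pair exactly once.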
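For comparison, the same reduction can be sketched in base R alone, without reshape2. This is an illustration rather than part of the original answer: it assumes adist() from the utils package as a stand-in for stringdistmatrix(..., method = "lv") so that the block runs without extra packages, and it reads only the lower triangle so each pair is reported once. The names idx and near are made up for the example.

```r
# Base-R sketch; adist() stands in for stringdistmatrix(method = "lv").
kp <- c("leaflet", "leafletr", "lego", "levenshtein-distance", "logo")
m <- adist(kp)                  # full symmetric Levenshtein matrix
dimnames(m) <- list(kp, kp)

# Row/column positions of lower-triangle cells that are non-zero and
# below the threshold. lower.tri() already excludes the diagonal.
idx <- which(lower.tri(m) & m > 0 & m < 5, arr.ind = TRUE)

# Long form: one row per near-matching pair.
near <- data.frame(
  a = kp[idx[, "row"]],
  b = kp[idx[, "col"]],
  distance = m[idx],
  stringsAsFactors = FALSE
)
print(near)
```

With the five words from the question, near holds the two pairs asked for: leafletr/leaflet and logo/lego, each at distance 1.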

Another way is to use dplyr (since all the cool kids are using dplyr with pipes now):

    library(dplyr)
    library(reshape2)
    library(magrittr)

    lower <- m
    lower[upper.tri(lower, diag = TRUE)] <- NA  # keep each pair once
    out <- melt(lower, na.rm = TRUE) %>% filter(value > 0 & value < 5)

This second option may be faster, but I have not timed it.
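To get from a data frame like out to the comma-separated lines shown in the question, something along these lines should work. It is only a sketch: the small data frame below is a hand-written stand-in for out (its column names Var1, Var2 and value are the ones melt assigns by default to a matrix with unnamed dimensions), and the variable name lines is made up for the example.

```r
# Hand-written stand-in for `out`, so the block runs on its own.
out <- data.frame(
  Var1 = c("leafletr", "logo"),
  Var2 = c("leaflet", "lego"),
  value = c(1, 1)
)

# One "a,b,distance" line per pair.
lines <- paste(out$Var1, out$Var2, out$value, sep = ",")
writeLines(lines)

# write.table() produces the same text and can also write to a file
# when given a `file` argument.
write.table(out, sep = ",", quote = FALSE,
            row.names = FALSE, col.names = FALSE)
```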
