string - R: producing a list of near matches with stringdist and stringdistmatrix
I discovered the excellent package "stringdist" and now want to use it to compute string distances. In particular, I have a set of words, and I want to print out the near-matches, where a "near match" is determined through some algorithm such as the Levenshtein distance.
I have extremely slow working code in a shell script, and I was able to load stringdist and produce a matrix of metrics. Now I want to boil that matrix down to a smaller one that has only the near matches, e.g. where the metric is non-zero but less than some threshold.
    kp <- c('leaflet','leafletr','lego','levenshtein-distance','logo')
    kpm <- stringdistmatrix(kp, useNames="strings", method="lv")
    > kpm
                         leaflet leafletr lego levenshtein-distance
    leafletr                   1
    lego                       5        6
    levenshtein-distance      16       16   18
    logo                       6        7    1                   19

    m = as.matrix(kpm)
    close = apply(m, 1, function(x) x>0 & x<5)
    > close
                         leaflet leafletr  lego levenshtein-distance  logo
    leaflet                FALSE     TRUE FALSE                FALSE FALSE
    leafletr                TRUE    FALSE FALSE                FALSE FALSE
    lego                   FALSE    FALSE FALSE                FALSE  TRUE
    levenshtein-distance   FALSE    FALSE FALSE                FALSE FALSE
    logo                   FALSE    FALSE  TRUE                FALSE FALSE
OK, so now I have a (big) dist object. How do I reduce it to a list where the output looks like

    leafletr,leaflet,1
    logo,lego,1

for just the cases where the metric is non-zero and less than n=5? I found "apply()", which lets me do the test, but now I need to sort out how to use it.
The problem is not specific to stringdist and stringdistmatrix and is very elementary R, but still I'm stuck. I suspect the answer involves subset(), but I don't know how to transform a "dist" into something else.
You can do this:
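    library(reshape2)
    lower <- m
    lower[upper.tri(lower)] <- NA        # blank one triangle so each pair appears once
    d <- melt(lower, na.rm = TRUE)       # long form: Var1, Var2, value
    out <- subset(d, value > 0 & value < 5)

Here, melt brings the matrix into long form (2 columns with the string names and 1 column with the value). However, since m is symmetric, melting it as-is lists every pair twice, once in each order. unique() alone does not remove these mirrored rows, because it only drops rows that are identical in every column, so the upper triangle is set to NA first and melt(..., na.rm = TRUE) skips those cells.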
Another way is to use dplyr (since all the cool kids are using dplyr with pipes now):

    library(dplyr)
    library(reshape2)
    library(magrittr)
    lower <- m
    lower[upper.tri(lower)] <- NA        # keep each pair once
    out <- melt(lower, na.rm = TRUE) %>%
      filter(value > 0 & value < 5)
This second option is probably faster, but I have not timed it.
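For comparison, the same reduction can also be sketched in base R alone, without reshape2 or dplyr. This is only an illustration under stated assumptions, not part of the answer above. To run without extra packages it rebuilds the distance matrix with base R's adist(), which also computes Levenshtein distances, instead of stringdistmatrix(). The names n, idx and near are made up for the example. It uses lower.tri() so that each pair is reported only once.

```r
# Example words from the question
kp <- c("leaflet", "leafletr", "lego", "levenshtein-distance", "logo")

# Assumption: adist() (base R, package utils) stands in for
# as.matrix(stringdistmatrix(kp, useNames = "strings", method = "lv"))
m <- adist(kp)
dimnames(m) <- list(kp, kp)

# Threshold: keep distances that are non-zero and below n
n <- 5

# Row/column positions in the lower triangle that satisfy the test;
# restricting to the lower triangle reports each pair once
idx <- which(lower.tri(m) & m > 0 & m < n, arr.ind = TRUE)

# Turn the positions back into names plus the distance
near <- data.frame(
  a    = rownames(m)[idx[, "row"]],
  b    = colnames(m)[idx[, "col"]],
  dist = m[idx],
  stringsAsFactors = FALSE
)

# Print as comma-separated lines, one pair per line
write.table(near, sep = ",", quote = FALSE,
            row.names = FALSE, col.names = FALSE)
```

With the five example words this should give one line for leafletr/leaflet and one for logo/lego, each with distance 1. If m already exists from as.matrix(kpm), the two lines that build it can be skipped.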