string - R: Producing a list of near matches with stringdist and stringdistmatrix
I discovered the excellent package "stringdist" and want to use it to compute string distances. In particular, I have a set of words, and I want to print out the near-matches, where "near match" is defined through some algorithm like the Levenshtein distance.
I have extremely slow working code in a shell script, and I was able to load stringdist and produce a matrix of metrics. Now I want to boil that matrix down to a smaller one that has only the near matches, e.g. where the metric is non-zero but less than some threshold.
    kp <- c('leaflet','leafletr','lego','levenshtein-distance','logo')
    kpm <- stringdistmatrix(kp, useNames="strings", method="lv")
    > kpm
                         leaflet leafletr lego levenshtein-distance
    leafletr                   1
    lego                       5        6
    levenshtein-distance      16       16   18
    logo                       6        7    1                   19
    m = as.matrix(kpm)
    close = apply(m, 1, function(x) x>0 & x<5)
    > close
                         leaflet leafletr  lego levenshtein-distance  logo
    leaflet                FALSE     TRUE FALSE                FALSE FALSE
    leafletr                TRUE    FALSE FALSE                FALSE FALSE
    lego                   FALSE    FALSE FALSE                FALSE  TRUE
    levenshtein-distance   FALSE    FALSE FALSE                FALSE FALSE
    logo                   FALSE    FALSE  TRUE                FALSE FALSE

OK, now I have a (big) dist. How do I reduce it to a list where the output would be something like

    leafletr,leaflet,1
    logo,lego,1

for only the cases where the metric is non-zero and less than n=5? I found that "apply()" lets me do the test; now I need to sort out how to use it.
The problem is not specific to stringdist and stringdistmatrix, and it is very elementary R, but I'm still stuck. I suspect the answer involves subset(), but I don't know how to transform a "dist" into something else.
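You can do this:

    library(reshape2)
    d <- unique(melt(m))
    out <- subset(d, value > 0 & value < 5)

Here, melt brings m into long form (two columns with the string names and one column with the value). Since we've melted a symmetric matrix, every pair shows up twice, once in each order. Note that unique only drops rows that are identical in all three columns, so it does not collapse these mirrored pairs; restrict to one triangle of m if you need each pair exactly once.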
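If each pair should appear only once, as in the output requested in the question, one option is to look only below the diagonal of the matrix. The following is a minimal base R sketch, not the answer's own method. It assumes that `adist()` from base R is an acceptable stand-in for `stringdistmatrix(method="lv")` so that the example runs without extra packages; the names `idx` and `near` are made up for illustration.

```r
# Same five words as in the question.
kp <- c('leaflet', 'leafletr', 'lego', 'levenshtein-distance', 'logo')

# Assumption: utils::adist() stands in for stringdistmatrix(method = "lv");
# with its default costs it computes the Levenshtein distance.
m <- adist(kp)
dimnames(m) <- list(kp, kp)

# Look only below the diagonal so each unordered pair is visited once,
# then apply the threshold from the question (non-zero and less than 5).
idx <- which(lower.tri(m) & m > 0 & m < 5, arr.ind = TRUE)

# Turn the row/column positions back into word pairs plus their distance.
near <- data.frame(
  a = rownames(m)[idx[, "row"]],
  b = colnames(m)[idx[, "col"]],
  distance = m[idx],
  stringsAsFactors = FALSE
)

# Print in the comma-separated form asked for in the question.
write.table(near, sep = ",", quote = FALSE, row.names = FALSE, col.names = FALSE)
```

With the matrix `m` from the question (built by `as.matrix(kpm)`), the `adist()` lines can be dropped and the rest should work unchanged, since it only relies on `m` being a square matrix with row and column names.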
Another way is to use dplyr (since all the cool kids are using dplyr with pipes now):

    library(dplyr)
    library(reshape2)
    library(magrittr)
    out <- melt(m) %>% distinct() %>% filter(value > 0 & value < 5)

This second option is probably faster, but I have not timed it.
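To check the speed claim yourself, a rough sketch with `system.time()` could look like the following. The random test words are hypothetical data, not from the post, and `adist()` from base R is again assumed as a stand-in for `stringdistmatrix(method="lv")`. Each variant is timed only if the package it needs is installed.

```r
# Hypothetical test data: random 6-letter "words" (not from the original post).
set.seed(1)
words <- unique(replicate(300, paste(sample(letters, 6, replace = TRUE), collapse = "")))

# Assumption: adist() stands in for stringdistmatrix(method = "lv").
m <- adist(words)
dimnames(m) <- list(words, words)

timings <- list()
if (requireNamespace("reshape2", quietly = TRUE)) {
  # First variant from the answer: melt + unique + subset.
  timings$subset <- system.time(
    out1 <- subset(unique(reshape2::melt(m)), value > 0 & value < 5)
  )
  if (requireNamespace("dplyr", quietly = TRUE)) {
    # Second variant from the answer: melt + distinct + filter.
    timings$dplyr <- system.time(
      out2 <- dplyr::filter(dplyr::distinct(reshape2::melt(m)), value > 0 & value < 5)
    )
  }
}
print(timings)
```

A single `system.time()` call is noisy, so treat the numbers as a rough indication only and repeat the measurement on data of your real size.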