Conditional join in R
I want to conditionally join two data tables together:

library(data.table)
set.seed(1)
key.table <- data.table(
  out = (0:10)/10,
  keyz = sort(runif(11))
)
large.tbl <- data.table(
  ab = rnorm(1e6),
  cd = runif(1e6)
)
according to the following rule: match the smallest value of out in key.table whose keyz value is larger than cd. I have the following:
library(dplyr)
large.tbl %>%
  rowwise %>%
  mutate(out = min(key.table$out[key.table$keyz > cd]))
which provides the correct output. The problem is that the rowwise operation seems too expensive for the large.tbl I am using, crashing unless run on a particular computer. Are there less memory-expensive operations? The following seems faster, but not by enough to solve the problem I have.
large.tbl %>%
  group_by(cd) %>%
  mutate(out = min(key.table$out[key.table$keyz > cd]))
This smells like a problem for a data.table answer, though the answer does not have to use that package.

What you want is:
setkey(large.tbl, cd)
setkey(key.table, keyz)
key.table[large.tbl, roll = -Inf]
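As a quick sanity check on a smaller sample (a hypothetical snippet; small.tbl, ref, and res are names introduced here, not from the question), the rolling join can be compared against the rowwise computation:

```r
library(data.table)
set.seed(1)
key.table <- data.table(out = (0:10)/10, keyz = sort(runif(11)))
# small sample kept strictly below max(keyz) so every cd has some keyz above it
small.tbl <- data.table(cd = runif(100, 0, max(key.table$keyz)))
setkey(small.tbl, cd)

# rowwise reference answer: smallest out whose keyz exceeds cd
ref <- sapply(small.tbl$cd, function(x) min(key.table$out[key.table$keyz > x]))

# rolling join: for each cd, carry the next keyz row backwards (NOCB)
setkey(key.table, keyz)
res <- key.table[small.tbl, roll = -Inf]

all.equal(ref, res$out)  # should be TRUE
```

Because out increases with keyz here, "the smallest out whose keyz exceeds cd" is exactly the row the backward roll picks.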
See ?data.table under roll:

"Applies to the last join column, generally a date but can be any ordered variable, irregular and including gaps. If roll=TRUE and i's row matches to all but the last x join column, and its value in the last i join column falls in a gap (including after the last observation in x for that group), then the prevailing value in x is rolled forward. This operation is particularly fast using a modified binary search. The operation is also known as last observation carried forward (LOCF). Usually, there should be no duplicates in x's key, the last key column is a date (or time, or datetime) and all the columns of x's key are joined to. A common idiom is to select a contemporaneous regular time series (dts) across a set of identifiers (ids): DT[CJ(ids,dts),roll=TRUE] where DT has a 2-column key (id,date) and CJ stands for cross join. When roll is a positive number, this limits how far values are carried forward. roll=TRUE is equivalent to roll=+Inf. When roll is a negative number, values are rolled backwards; i.e., next observation carried backwards (NOCB). Use -Inf for unlimited roll back. When roll is 'nearest', the nearest value is joined to."
(To be fair, I think that passage could use some elucidation; it's pretty dense.)
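The roll directions are easier to see on a toy example (data invented here, not from the question) contrasting the options:

```r
library(data.table)
x <- data.table(t = c(2, 4, 6), val = c("a", "b", "c"))
setkey(x, t)
y <- data.table(t = c(3.4, 5.9, 7))
setkey(y, t)

x[y, roll = TRUE]$val       # LOCF: "a" "b" "c"
x[y, roll = -Inf]$val       # NOCB: "b" "c" NA (nothing after t = 6 to roll back)
x[y, roll = "nearest"]$val  # "b" "c" "c"
```

The question's key.table[large.tbl, roll = -Inf] is the NOCB case: each cd picks up the out from the next keyz above it.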