apache spark - How to partition data by multiple fields?


Let's say each record has 4 identifier variables: var1, var2, var3, var4, plus an additional variable: var5.

I want to perform a reduce operation on records that have the same value in at least one of the identifier fields. I'm trying to figure out how to implement this kind of solution with a minimal amount of shuffling.

Is there an option to tell Spark to put records that have at least one match in the identifier variables on the same partition? I know there is the option of a custom partitioner, but I'm not sure if it's possible to implement one in a way that supports this use case.

Well, quite a lot depends on the structure of your data in general and how much a priori knowledge you have.

In the worst case scenario, when the data is relatively dense and uniformly distributed as below, and you perform a one-off analysis, the only way to achieve your goal seems to be to put everything into one partition.

    [1 0 1 0]
    [0 1 0 1]
    [1 0 0 1]
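As a minimal sketch of that fallback (the RDD name `records` is illustrative):

    // Degenerate fallback: collapse everything into a single partition,
    // so any pair of matching records is trivially co-located.
    val singlePartition = records.coalesce(1)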

Obviously this is not a useful approach. One thing you can try instead is to analyze at least a subset of the data to get insight into its structure, and use this knowledge to build a custom partitioner that assures relatively low traffic and a reasonable distribution over the cluster at the same time.
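To give a rough idea, such a partitioner could look like the sketch below. The lookup table `bucketOf` is assumed to come out of the analysis described next; keys are individual identifier values (var1..var4 emitted separately), so records sharing an identifier value can land on the same partition.

    import org.apache.spark.Partitioner

    // Sketch only: routes identifier values through a precomputed lookup
    // table, falling back to plain hash partitioning for unseen values.
    class IdentifierPartitioner(
        override val numPartitions: Int,
        bucketOf: Map[String, Int]) extends Partitioner {

      private def nonNegativeMod(x: Int): Int =
        ((x % numPartitions) + numPartitions) % numPartitions

      def getPartition(key: Any): Int = key match {
        case s: String => bucketOf.getOrElse(s, nonNegativeMod(s.hashCode))
        case other     => nonNegativeMod(other.hashCode)
      }
    }

It would then be used via something like `keyed.partitionBy(new IdentifierPartitioner(8, bucketOf))`, where `keyed` is a pair RDD keyed by identifier value.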

As a general framework, you can try one of these:

Hashing

  • Choose a number of buckets
  • For each row, create a binary vector with length equal to the number of buckets
    • For each feature in the row
      • Hash the feature to a bucket
        • If the value at that bucket is 0, flip it to 1
        • Otherwise do nothing
  • Sum the computed vectors to get summary statistics
  • Use an optimization technique of your choice to create a hash-to-partition function (see the sketch after this list)
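A minimal sketch of the vector-building and summing steps, assuming each row arrives as a `Seq[String]` of its identifier values (`numBuckets`, `MurmurHash3`, and the RDD name `rows` are illustrative choices):

    import org.apache.spark.rdd.RDD
    import scala.util.hashing.MurmurHash3

    val numBuckets = 16  // assumed; tune to your data

    // Build a binary vector marking which buckets a row's features hash into.
    def bucketVector(row: Seq[String]): Array[Long] = {
      val v = Array.fill(numBuckets)(0L)
      row.foreach { feature =>
        val b = ((MurmurHash3.stringHash(feature) % numBuckets) + numBuckets) % numBuckets
        v(b) = 1L  // flip the bucket to 1; repeated hits change nothing
      }
      v
    }

    // Element-wise sum of all row vectors: a per-bucket occupancy profile
    // that the partition-assignment optimization can work from.
    def bucketProfile(rows: RDD[Seq[String]]): Array[Long] =
      rows.map(bucketVector).reduce((a, b) => a.zip(b).map { case (x, y) => x + y })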

Frequent itemsets

  • Use one of the frequent-itemset algorithms (Apriori, closed patterns, max patterns, e.g. FP-growth) on a sample of the data
  • Compute the distribution of itemsets over the sample
  • Use optimization to compute the hash-to-partition function as above (see the sketch after this list)
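Spark's MLlib ships an FP-growth implementation (`org.apache.spark.mllib.fpm.FPGrowth`), so the sampling step might look roughly like this; the sample fraction and support threshold are placeholder values:

    import org.apache.spark.mllib.fpm.FPGrowth
    import org.apache.spark.rdd.RDD

    // Mine frequent identifier combinations from a sample of the data.
    // `rows` is an assumed RDD[Seq[String]] of var1..var4 values per record.
    def frequentIdentifierSets(rows: RDD[Seq[String]]): Unit = {
      // FP-growth requires unique items per transaction, hence `distinct`.
      val transactions: RDD[Array[String]] =
        rows.sample(withReplacement = false, fraction = 0.1)  // assumed fraction
            .map(_.distinct.toArray)

      val model = new FPGrowth()
        .setMinSupport(0.05)  // assumed threshold
        .setNumPartitions(4)
        .run(transactions)

      // The itemset distribution over the sample feeds the optimization step.
      model.freqItemsets.collect().foreach { itemset =>
        println(s"${itemset.items.mkString("[", ",", "]")} -> ${itemset.freq}")
      }
    }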

Both solutions are computationally intensive and require quite a lot of work to implement, so it is probably not worth the fuss for ad hoc analytics, but if you have a reusable pipeline it may be worth trying.

