apache spark - How to partition data by multiple fields?


Let's say each record has 4 identifier variables: var1, var2, var3, var4, plus an additional variable: var5.

I want to perform a reduce operation on records that have the same value in at least one of the identifier fields. I'm trying to figure out how to implement this kind of solution with a minimal amount of shuffling.

Is there an option to tell Spark to put records that have at least one match in the identifier variables on the same partition? I know there is the option of a custom partitioner, but I'm not sure if it's possible to implement one in a way that supports this use case.

Well, quite a lot depends on the structure of your data in general and how much a priori knowledge you have.

In the worst case scenario, when the data is relatively dense and uniformly distributed as below, and you perform a one-off analysis, the only way to achieve your goal seems to be to put everything into one partition.

    [1 0 1 0]
    [0 1 0 1]
    [1 0 0 1]
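As a minimal sketch of that fallback (the RDD name `records` is illustrative):

    // Degenerate fallback: collapse everything into a single partition,
    // so any pair of matching records is trivially co-located.
    val singlePartition = records.coalesce(1)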

Obviously this is not a useful approach. One thing you can try instead is to analyze at least a subset of the data to get insight into its structure, and use this knowledge to build a custom partitioner that assures relatively low traffic and a reasonable distribution over the cluster at the same time.
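To give a rough idea, such a partitioner could look like the sketch below. The lookup table `bucketOf` is assumed to come out of the analysis described next; keys are individual identifier values (var1..var4 emitted separately), so records sharing an identifier value can land on the same partition.

    import org.apache.spark.Partitioner

    // Sketch only: routes identifier values through a precomputed lookup
    // table, falling back to plain hash partitioning for unseen values.
    class IdentifierPartitioner(
        override val numPartitions: Int,
        bucketOf: Map[String, Int]) extends Partitioner {

      private def nonNegativeMod(x: Int): Int =
        ((x % numPartitions) + numPartitions) % numPartitions

      def getPartition(key: Any): Int = key match {
        case s: String => bucketOf.getOrElse(s, nonNegativeMod(s.hashCode))
        case other     => nonNegativeMod(other.hashCode)
      }
    }

It would then be used via something like `keyed.partitionBy(new IdentifierPartitioner(8, bucketOf))`, where `keyed` is a pair RDD keyed by identifier value.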

As a general framework, you can try one of these:

Hashing

  • Choose a number of buckets
  • For each row, create a binary vector with length equal to the number of buckets
    • For each feature in the row
      • Hash the feature to a bucket
        • If the value at that bucket is 0, flip it to 1
        • Otherwise do nothing
  • Sum the computed vectors to get summary statistics
  • Use an optimization technique of your choice to create a hash-to-partition function (see the sketch after this list)
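A minimal sketch of the vector-building and summing steps, assuming each row arrives as a `Seq[String]` of its identifier values (`numBuckets`, `MurmurHash3`, and the RDD name `rows` are illustrative choices):

    import org.apache.spark.rdd.RDD
    import scala.util.hashing.MurmurHash3

    val numBuckets = 16  // assumed; tune to your data

    // Build a binary vector marking which buckets a row's features hash into.
    def bucketVector(row: Seq[String]): Array[Long] = {
      val v = Array.fill(numBuckets)(0L)
      row.foreach { feature =>
        val b = ((MurmurHash3.stringHash(feature) % numBuckets) + numBuckets) % numBuckets
        v(b) = 1L  // flip the bucket to 1; repeated hits change nothing
      }
      v
    }

    // Element-wise sum of all row vectors: a per-bucket occupancy profile
    // that the partition-assignment optimization can work from.
    def bucketProfile(rows: RDD[Seq[String]]): Array[Long] =
      rows.map(bucketVector).reduce((a, b) => a.zip(b).map { case (x, y) => x + y })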

Frequent itemsets

  • Use one of the frequent-itemset algorithms (Apriori, closed patterns, max patterns, e.g. FP-growth) on a sample of the data
  • Compute the distribution of itemsets over the sample
  • Use optimization to compute the hash-to-partition function as above (see the sketch after this list)
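Spark's MLlib ships an FP-growth implementation (`org.apache.spark.mllib.fpm.FPGrowth`), so the sampling step might look roughly like this; the sample fraction and support threshold are placeholder values:

    import org.apache.spark.mllib.fpm.FPGrowth
    import org.apache.spark.rdd.RDD

    // Mine frequent identifier combinations from a sample of the data.
    // `rows` is an assumed RDD[Seq[String]] of var1..var4 values per record.
    def frequentIdentifierSets(rows: RDD[Seq[String]]): Unit = {
      // FP-growth requires unique items per transaction, hence `distinct`.
      val transactions: RDD[Array[String]] =
        rows.sample(withReplacement = false, fraction = 0.1)  // assumed fraction
            .map(_.distinct.toArray)

      val model = new FPGrowth()
        .setMinSupport(0.05)  // assumed threshold
        .setNumPartitions(4)
        .run(transactions)

      // The itemset distribution over the sample feeds the optimization step.
      model.freqItemsets.collect().foreach { itemset =>
        println(s"${itemset.items.mkString("[", ",", "]")} -> ${itemset.freq}")
      }
    }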

Both solutions are computationally intensive and require quite a lot of work to implement, so it is probably not worth the fuss for ad hoc analytics, but if you have a reusable pipeline it may be worth trying.

