pandas - alternative to recursive sql / dedupe columns -
hi have data set 1 many relations. want reduce 1-1 concatenating distinct values each column
here example:
i have data set called customer , product affiliation (pa). 1 customer can have multiple pa different times.
here input:
╔════════╦═══════╦══════╦══════╗ ║ cust ║pa1 ║ pa2 ║ pa3 ║ ╠════════╬═══════╬══════╬══════╣ ║ ║ h ║ m ║ l ║ ║ ║ h ║ l ║ m ║ ║ ║ h ║ m ║ h ║ ╚════════╩═══════╩══════╩══════╝ desired output:
╔════════╦══════╦══════════╦═══════╗ ║ cust ║ pa1 ║ pa2 ║ pa3 ║ ╠════════╬══════╬══════════╬═══════╣ ║ ║ h ║ m&l ║ l&m&h ║ ╚════════╩══════╩══════════╩═══════╝ i have multiple pa columns. want know if there generic python panda code can apply thank minc
here's expanded version of sample data make more clear how code operates on groups rather whole dataframe:
df = pd.dataframe({ 'cust':list('aaabbb'),'pa1':list('hhhmmh'), 'pa2': list('mlmmmm'),'pa3':[1,2,3,3,3,1] }) cust pa1 pa2 pa3 0 h m 1 1 h l 2 2 h m 3 3 b m m 3 4 b m m 3 5 b h m 1 the first step create new dataframe 1 row per customer. quick , dirty way fine example @ hand might want differently depending on want other columns.
df2 = df.groupby('cust').sum() pa1 pa2 pa3 cust hhh mlm 123 b mmh mmm 331 with template desired output, can loop on relevant columns of original dataframe , use groupby/apply join/unique replace values in existing columns:
for col in ['pa1','pa2','pa3']: df2[col] = df.groupby('cust')[col].apply( lambda x: '&'.join( x.astype(str).unique() )) pa1 pa2 pa3 cust h m&l 1&2&3 b m&h m 3&1
Comments
Post a Comment