pandas - alternative to recursive sql / dedupe columns -
hi have data set
1 many relations. want reduce 1-1 concatenating
distinct values each column
here example:
i have data set called customer
, product affiliation (pa)
. 1 customer
can have multiple pa
different times.
here input:
╔════════╦═══════╦══════╦══════╗ ║ cust ║pa1 ║ pa2 ║ pa3 ║ ╠════════╬═══════╬══════╬══════╣ ║ ║ h ║ m ║ l ║ ║ ║ h ║ l ║ m ║ ║ ║ h ║ m ║ h ║ ╚════════╩═══════╩══════╩══════╝
desired output:
╔════════╦══════╦══════════╦═══════╗ ║ cust ║ pa1 ║ pa2 ║ pa3 ║ ╠════════╬══════╬══════════╬═══════╣ ║ ║ h ║ m&l ║ l&m&h ║ ╚════════╩══════╩══════════╩═══════╝
i have multiple pa columns. want know if there generic python panda code can apply thank minc
here's expanded version of sample data make more clear how code operates on groups rather whole dataframe:
df = pd.dataframe({ 'cust':list('aaabbb'),'pa1':list('hhhmmh'), 'pa2': list('mlmmmm'),'pa3':[1,2,3,3,3,1] }) cust pa1 pa2 pa3 0 h m 1 1 h l 2 2 h m 3 3 b m m 3 4 b m m 3 5 b h m 1
the first step create new dataframe 1 row per customer. quick , dirty way fine example @ hand might want differently depending on want other columns.
df2 = df.groupby('cust').sum() pa1 pa2 pa3 cust hhh mlm 123 b mmh mmm 331
with template desired output, can loop on relevant columns of original dataframe , use groupby/apply
join/unique
replace values in existing columns:
for col in ['pa1','pa2','pa3']: df2[col] = df.groupby('cust')[col].apply( lambda x: '&'.join( x.astype(str).unique() )) pa1 pa2 pa3 cust h m&l 1&2&3 b m&h m 3&1
Comments
Post a Comment