Groupby a column containing duplicates but also preserving the duplicate information - python

I have the following dataframe:
df=pd.DataFrame({'id':['A','A','B','C','D'],'Name':['apple','apricot','banana','orange','citrus'], 'count':[2,3,6,5,12]})
id Name count
0 A apple 2
1 A apricot 3
2 B banana 6
3 C orange 5
4 D citrus 12
I am trying to group the dataframe by the 'id' column, but also preserve the duplicated names as separate columns. Below is the expected output:
id sum(count) id1 id2
0 A 5 apple apricot
1 B 6 banana na
2 C 5 orange na
3 D 12 citrus na
I tried grouping by the id column using the following statement but that removes the name column completely.
df.groupby(['id'], as_index=False).sum()
I would appreciate any suggestions/ help.

You can use DataFrame.pivot_table for this:
g = df.groupby('id')
# Generate the new columns of the pivoted dataframe
col = g.Name.cumcount()
# Sum of count grouped by id
sum_count = g['count'].sum()
(df.pivot_table(values='Name', index='id', columns = col, aggfunc='first')
.add_prefix('id')
.assign(sum_count = sum_count))
id0 id1 sum_count
id
A apple apricot 5
B banana NaN 6
C orange NaN 5
D citrus NaN 12

Related

Pandas replace columns by merging another dataframe

I have a dataframe df1 looks like this:
id A B
0 1 10 5
1 1 11 6
2 2 10 7
3 2 11 8
And another dataframe df2:
id A
0 1 3
1 2 4
Now I want to replace A column in df1 with the value of A in df2 based on id, so the result should look like this:
id A B
0 1 3 5
1 1 3 6
2 2 4 7
3 2 4 8
There's a way that I can drop column A in df1 first and merge df2 to df1 on id like df1 = df1.drop(['A'], axis=1).merge(df2, how='left', on='id'), but if there're like 10 columns in df2, it will be pretty hard. Is there a more elegant way to do so?
here is one way to do it, by making use of pd.update. However, it requires to set the index on the id, so it can match the two df
df.set_index('id', inplace=True)
df2.set_index('id', inplace=True)
df.update(df2)
df['A'] = df['A'].astype(int) # value by default was of type float
df.reset_index()
id A B
0 1 3 5
1 1 3 6
2 2 4 7
3 2 4 8
Merge just the id column from df to df2, and then combine_first it to the original DataFrame:
df = df[['id']].merge(df2).combine_first(df)
print(df)
Output:
A B id
0 3 5 1
1 3 6 1
2 4 7 2
3 4 8 2

How to compare 2 non-identical dataframes in python

I have two dataframes with the same column order but different column names and different rows. df2 rows vary from df1 rows.
df1= col_id num name
0 1 3 linda
1 2 4 James
df2= id no name
0 1 2 granpa
1 2 6 linda
2 3 7 sam
This is the output I need. Outputs rows with same, OLD and NEW values so the user can clearly see what changed between two dataframes:
result col_id num name
0 1 was 3| now 2 was linda| now granpa
1 2 was 4| now 6 was James| now linda
2 was | now 3 was | now 7 was | now sam
Since your goal is just to compare differences, use DataFrame.compare instead of aggregating into strings.
However,
DataFrame.compare can only compare identically-labeled (i.e. same shape, identical row and column labels) DataFrames
So we just need to align the row/column indexes, either via merge or reindex.
Align via merge
Outer-merge the two dfs:
merged = df1.merge(df2, how='outer', left_on='col_id', right_on='id')
# col_id num name_x id no name_y
# 0 1 3 linda 1 2 granpa
# 1 2 4 james 2 6 linda
# 2 NaN NaN NaN 3 7 sam
Divide the merged frame into left/right frames and align their columns with set_axis:
cols = df1.columns
left = merged.iloc[:, :len(cols)].set_axis(cols, axis=1)
# col_id num name
# 0 1 3 linda
# 1 2 4 james
# 2 NaN NaN NaN
right = merged.iloc[:, len(cols):].set_axis(cols, axis=1)
# col_id num name
# 0 1 2 granpa
# 1 2 6 linda
# 2 3 7 sam
compare the aligned left/right frames (use keep_equal=True to show equal cells):
left.compare(right, keep_shape=True, keep_equal=True)
# col_id num name
# self other self other self other
# 0 1 1 3 2 linda granpa
# 1 2 2 4 6 james linda
# 2 NaN 3 NaN 7 NaN sam
left.compare(right, keep_shape=True)
# col_id num name
# self other self other self other
# 0 NaN NaN 3 2 linda granpa
# 1 NaN NaN 4 6 james linda
# 2 NaN 3 NaN 7 NaN sam
Align via reindex
If you are 100% sure that one df is a subset of the other, then reindex the subsetted rows.
In your example, df1 is a subset of df2, so reindex df1:
df1.assign(id=df1.col_id) # copy col_id (we need original col_id after reindexing)
.set_index('id') # set index to copied id
.reindex(df2.id) # reindex against df2's id
.reset_index(drop=True) # remove copied id
.set_axis(df2.columns, axis=1) # align column names
.compare(df2, keep_equal=True, keep_shape=True)
# col_id num name
# self other self other self other
# 0 1 1 3 2 linda granpa
# 1 2 2 4 6 james linda
# 2 NaN 3 NaN 7 NaN sam
Nullable integers
Normally int cannot mix with nan, so pandas converts to float. To keep the int values as int (like the examples above):
Ideally we'd convert the int columns to nullable integers with astype('Int64') (capital I).
However, there is currently a comparison bug with Int64, so just use astype(object) for now.
If I understand correctly, you want something like this:
new_df = df1.drop(['name', 'num'], axis=1).merge(df2.rename({'id': 'col_id'}, axis=1), how='outer')
Output:
>>> new_df
col_id no name
0 1 2 granpa
1 2 6 linda
2 3 7 sam

Handling values with multiple items in dataframe

Supposed my dataframe
Name Value
0 A apple
1 A banana
2 A orange
3 B grape
4 B apple
5 C apple
6 D apple
7 D orange
8 E banana
I want to show the items of each name.
(By removing duplicates)
output what I want
Name Values
0 A apple, banana, orange
1 B grape, apple
2 C apple
3 D apple, orange
4 E banana
thank you for reading
Changed sample data with duplicates:
print (df)
Name Value
0 A apple
1 A apple
2 A banana
3 A banana
4 A orange
5 B grape
6 B apple
7 C apple
8 D apple
9 D orange
10 E banana
If duplicates per both columns is necessary remove first use DataFrame.drop_duplicates and then aggregate join:
df1 = (df.drop_duplicates(['Name','Value'])
.groupby('Name')['Value']
.agg(','.join)
.reset_index())
print (df1)
Name Value
0 A apple,banana,orange
1 B grape,apple
2 C apple
3 D apple,orange
4 E banana
If not removed duplicates output is:
df2 = (df.groupby('Name')['Value']
.agg(','.join)
.reset_index())
print (df2)
Name Value
0 A apple,apple,banana,banana,orange
1 B grape,apple
2 C apple
3 D apple,orange
4 E banana

update pandas groupby group with column value

I have a test df like this:
df = pd.DataFrame({'A': ['Apple','Apple', 'Apple','Orange','Orange','Orange','Pears','Pears'],
'B': [1,2,9,6,4,3,2,1]
})
A B
0 Apple 1
1 Apple 2
2 Apple 9
3 Orange 6
4 Orange 4
5 Orange 3
6 Pears 2
7 Pears 1
Now I need to add a new column with the respective %differences in col 'B'. How is this possible. I cannot get this to work.
I have looked at
update column value of pandas groupby().last()
Not sure that it is pertinent to my problem.
And this which looks promising
Pandas Groupby and Sum Only One Column
I need to find and insert into the col maxpercchng (all rows in group) the maximum change in col (B) per group of col 'A'.
So I have come up with this code:
grouppercchng = ((df.groupby['A'].max() - df.groupby['A'].min())/df.groupby['A'].iloc[0])*100
and try to add it to the group col 'maxpercchng' like so
group['maxpercchng'] = grouppercchng
Or like so
df_kpi_hot.groupby(['A'], as_index=False)['maxpercchng'] = grouppercchng
Does anyone know how to add to all rows in group the maxpercchng col?
I believe you need transform for Series with same size like original DataFrame filled by aggregated values:
g = df.groupby('A')['B']
df['maxpercchng'] = (g.transform('max') - g.transform('min')) / g.transform('first') * 100
print (df)
A B maxpercchng
0 Apple 1 800.0
1 Apple 2 800.0
2 Apple 9 800.0
3 Orange 6 50.0
4 Orange 4 50.0
5 Orange 3 50.0
6 Pears 2 50.0
7 Pears 1 50.0
Or:
g = df.groupby('A')['B']
df1 = ((g.max() - g.min()) / g.first() * 100).reset_index()
print (df1)
A B
0 Apple 800.0
1 Orange 50.0
2 Pears 50.0

how to map multiple records to one unique id

I have 2 data sets with a common unique ID(duplicates in 2nd data frame)
I want to map all records with respect to each ID.
df1
id
1
2
3
4
5
df2
id col1
1 mango
2 melon
1 straw
3 banana
3 papaya
i want the out put like
df1
id col1
1 mango
straw
2 melon
3 banana
papaya
4 not available
5 not available
Thanks in advance
You're looking to do an outer df.merge:
df1 = df1.merge(df2, how='outer').set_index('id').fillna('not available')
>>> df1
col1
id
1 mango
1 straw
2 melon
3 banana
3 papaya
4 not available
5 not available

Categories