How to groupby multiple columns in a dataframe, except one, in Python

I have the following dataframe:
    ID Code  Color  Value
0  111  AAA   Blue     23
1  111  AAA    Red     43
2  111  AAA  Green      4
3  121  ABA  Green     45
4  121  ABA  Green     23
5  121  ABA    Red     75
6  122  AAA    Red     52
7  122  ACA   Blue     24
8  122  ACA   Blue     53
9  122  ACA  Green     14
...
I want to group this dataframe by the "ID" and "Code" columns and sum the values from the "Value" column, while excluding the "Color" column from the grouping. In other words, I want to group by all columns other than "Color" and "Value", and then sum the "Value" column. I am using Python for this.
What I am thinking of doing is creating a list of all column names that are not "Color" or "Value", calling it "column_list", and then simply running:
df.groupby['column_list'].sum()
However, this will not work. How might I adjust this code so that I can properly group by as intended?
EDIT:
This code works:
from IPython.display import display, HTML  # needed outside Jupyter for display/HTML below

bins = df.groupby([df.columns[0],
                   df.columns[1],
                   df.columns[2]]).count()
bins["Weight"] = bins / bins.groupby(df.columns[0]).sum()
bins.reset_index(inplace=True)
bins['Weight'] = bins['Weight'].round(4)
display(HTML(bins.to_html()))
Full code that is not working:
column_list = [c for c in df.columns if c not in ['Value']]
bins = df.groupby(column_list, as_index=False)['Value'].count()
bins["Weight"] = bins / bins.groupby(df.columns[0]).sum()
bins.reset_index(inplace=True)
bins['Weight'] = bins['Weight'].round(4)
display(HTML(bins.to_html()))

You can pass a list to groupby and specify the column to aggregate with sum:
column_list = [c for c in df.columns if c not in ['Color','Value']]
df1 = df.groupby(column_list, as_index=False)['Value'].sum()
Or:
column_list = list(df.columns.difference(['Color','Value'], sort=False))
df1 = df.groupby(column_list, as_index=False)['Value'].sum()
With the sample data this works out the same as:
df1 = df.groupby(['ID','Code'], as_index=False)['Value'].sum()
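For reference, a minimal runnable sketch that reconstructs the question's sample data and applies the first approach (the expected sums follow directly from the sample rows):
import pandas as pd

df = pd.DataFrame({'ID': [111, 111, 111, 121, 121, 121, 122, 122, 122, 122],
                   'Code': ['AAA', 'AAA', 'AAA', 'ABA', 'ABA', 'ABA',
                            'AAA', 'ACA', 'ACA', 'ACA'],
                   'Color': ['Blue', 'Red', 'Green', 'Green', 'Green', 'Red',
                             'Red', 'Blue', 'Blue', 'Green'],
                   'Value': [23, 43, 4, 45, 23, 75, 52, 24, 53, 14]})

column_list = [c for c in df.columns if c not in ['Color', 'Value']]
print(df.groupby(column_list, as_index=False)['Value'].sum())
#     ID Code  Value
# 0  111  AAA     70
# 1  121  ABA    143
# 2  122  AAA     52
# 3  122  ACA     91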
EDIT: Yes, this also works:
column_list = [c for c in df.columns if c not in ['Color']]
df1 = df.groupby(column_list, as_index=False).sum()
The reason is that sum by default drops non-numeric columns, and if Value is not specified it sums all remaining numeric columns.
So if Color is numeric, it is summed too:
print (df)
    ID Code  Color  Value
0  111  AAA      1     23
1  111  AAA      2     43
2  111  AAA      3      4
3  121  ABA      1     45
4  121  ABA      1     23
5  121  ABA      2     75
6  122  AAA      1     52
7  122  ACA      2     24
8  122  ACA      1     53
9  122  ACA      2     14
column_list = [c for c in df.columns if c not in ['Color']]
df1 = df.groupby(column_list, as_index=False).sum()
print (df1)
    ID Code  Value  Color
0  111  AAA      4      3
1  111  AAA     23      1
2  111  AAA     43      2
3  121  ABA     23      1
4  121  ABA     45      1
5  121  ABA     75      2
6  122  AAA     52      1
7  122  ACA     14      2
8  122  ACA     24      2
9  122  ACA     53      1
column_list = [c for c in df.columns if c not in ['Color']]
df1 = df.groupby(column_list, as_index=False)['Value'].sum()
print (df1)
    ID Code  Value
0  111  AAA      4
1  111  AAA     23
2  111  AAA     43
3  121  ABA     23
4  121  ABA     45
5  121  ABA     75
6  122  AAA     52
7  122  ACA     14
8  122  ACA     24
9  122  ACA     53
EDIT: If you need a MultiIndex in bins, remove as_index=False and the column selection after groupby:
bins = df.groupby([df.columns[0],
                   df.columns[1],
                   df.columns[2]]).count()
should be changed to:
column_list = [c for c in df.columns if c not in ['Value']]
bins = df.groupby(column_list).count()
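And if the asker's Weight column is meant to be each bin's share of its ID's total row count (an assumption on my part), a hedged sketch using transform, which broadcasts the per-ID sums back to each row by index alignment and avoids the alignment problem in the original bins / bins.groupby(...).sum() division:
column_list = [c for c in df.columns if c not in ['Value']]
bins = df.groupby(column_list).count()  # MultiIndex (ID, Code, Color); 'Value' holds counts

# share of each bin within its ID; transform keeps the original shape
bins['Weight'] = (bins['Value'] / bins.groupby(level='ID')['Value'].transform('sum')).round(4)
bins.reset_index(inplace=True)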

Related

Extract corresponding df value with reference from another df

There are 2 dataframes with a 1-to-1 correspondence. I can retrieve the idxmax from all value columns in df1.
Input:
df1 = pd.DataFrame({'ref':[2,4,6,8,10,12,14],'value1':[76,23,43,34,0,78,34],'value2':[1,45,8,0,76,45,56]})
df2 = pd.DataFrame({'ref':[2,4,6,8,10,12,14],'value1_pair':[0,0,0,0,180,180,90],'value2_pair':[0,0,0,0,90,180,90]})
df = df1.loc[df1.iloc[:, 1:].idxmax(), 'ref']
Output: df1, df2 and df
   ref  value1  value2
0    2      76       1
1    4      23      45
2    6      43       8
3    8      34       0
4   10       0      76
5   12      78      45
6   14      34      56

   ref  value1_pair  value2_pair
0    2            0            0
1    4            0            0
2    6            0            0
3    8            0            0
4   10          180           90
5   12          180          180
6   14           90           90

5    12
4    10
Name: ref, dtype: int64
Now I want to create a df with 3 columns.
Desired output df:
ref  max value  corresponding value
 12         78                  180
 10         76                   90
What are the best options to extract the corresponding values from df2?
Your main problem is matching the columns between df1 and df2. Let's rename them properly, melt both dataframes, merge and extract:
(df1.melt('ref')
    .merge(df2.rename(columns={'value1_pair': 'value1',
                               'value2_pair': 'value2'})
              .melt('ref'),
           on=['ref', 'variable'])
    .sort_values('value_x')
    .groupby('variable').last()
)
Output:
          ref  value_x  value_y
variable
value1     12       78      180
value2     10       76       90
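Since the two frames correspond 1-to-1 by row, a more literal sketch built on the question's own idxmax idea also works (column names taken from the question; the loop pairs each value column with its *_pair column by hand):
import pandas as pd

df1 = pd.DataFrame({'ref': [2, 4, 6, 8, 10, 12, 14],
                    'value1': [76, 23, 43, 34, 0, 78, 34],
                    'value2': [1, 45, 8, 0, 76, 45, 56]})
df2 = pd.DataFrame({'ref': [2, 4, 6, 8, 10, 12, 14],
                    'value1_pair': [0, 0, 0, 0, 180, 180, 90],
                    'value2_pair': [0, 0, 0, 0, 90, 180, 90]})

rows = []
for col, pair in [('value1', 'value1_pair'), ('value2', 'value2_pair')]:
    i = df1[col].idxmax()                     # row label of the max; same row in df2
    rows.append({'ref': df1.loc[i, 'ref'],
                 'max value': df1.loc[i, col],
                 'corresponding value': df2.loc[i, pair]})
print(pd.DataFrame(rows))
#    ref  max value  corresponding value
# 0   12         78                  180
# 1   10         76                   90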

Smart pandas merge

Hi everyone!
I've got a problem. I want to merge two pandas DataFrames on a shared column, where the first DataFrame's column contains all the values found in the second DataFrame's column. In the result I want to keep the values from the second DataFrame where they exist, and the values from the first DataFrame where they don't. Like this:
1st:
_ col_1 col_2
0 123 100
1 124 200
2 125 150
3 126 250
4 127 300
2nt:
_ col_1 col_2
0 123 10
1 125 20
2 127 30
And i want to get next one:
_ col_1 col_2
0 123 10
1 124 200
2 125 20
3 126 250
4 127 30
Use concat with DataFrame.drop_duplicates and DataFrame.sort_values:
df = (pd.concat([df2, df1], ignore_index=True)
        .drop_duplicates('col_1')
        .sort_values('col_1'))
print (df)
   col_1  col_2
0    123     10
4    124    200
1    125     20
6    126    250
2    127     30
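An alternative sketch uses Series.combine_first, which takes df2's values where present and falls back to df1's; note the result comes back as float64 after alignment, so cast if integers are needed:
df = (df2.set_index('col_1')['col_2']
         .combine_first(df1.set_index('col_1')['col_2'])
         .reset_index())
df['col_2'] = df['col_2'].astype(int)  # alignment introduces NaN internally, hence the cast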

How to merge DataFrames based on one column while adding another

I have the following mock DataFrames:
df1:
ID FILLER1 FILLER2 QUANTITY
01 123 132 12
02 123 132 5
03 123 132 10
df2:
ID FILLER1 FILLER2 QUANTITY
01 123 132 +1
02 123 132 -1
which should result in df1's QUANTITY becoming 13, 4 and 10.
Thanks in advance for any help provided!
The question is not super clear, but if I understand what you're trying to do, here is a way:
# A left join and filling 0 instead of NaN for that third row
In [19]: merged = df1.merge(df2, on=['ID', 'FILLER1', 'FILLER2'], how='left').fillna(0)
In [20]: merged
Out[20]:
   ID  FILLER1  FILLER2  QUANTITY_x  QUANTITY_y
0   1      123      132          12         1.0
1   2      123      132           5        -1.0
2   3      123      132          10         0.0
# Adding new quantity column
In [21]: merged['QUANTITY'] = merged['QUANTITY_x'] + merged['QUANTITY_y']
In [22]: merged
Out[22]:
   ID  FILLER1  FILLER2  QUANTITY_x  QUANTITY_y  QUANTITY
0   1      123      132          12         1.0      13.0
1   2      123      132           5        -1.0       4.0
2   3      123      132          10         0.0      10.0
# Removing _x and _y columns
In [23]: merged = merged[['ID', 'FILLER1', 'FILLER2', 'QUANTITY']]
In [24]: merged
Out[24]:
   ID  FILLER1  FILLER2  QUANTITY
0   1      123      132      13.0
1   2      123      132       4.0
2   3      123      132      10.0
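A shorter equivalent sketch uses Series.add with fill_value, which treats the row missing from df2 as 0 (this assumes df2's QUANTITY is already numeric, i.e. +1/-1 as integers):
keys = ['ID', 'FILLER1', 'FILLER2']
merged = (df1.set_index(keys)['QUANTITY']
             .add(df2.set_index(keys)['QUANTITY'], fill_value=0)
             .reset_index())
# QUANTITY becomes 13.0, 4.0, 10.0, as in the merged result above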

pandas merge two dataframes without cross-references and with NaN's for uneven number of rows

EDITED 3/5/19:
Tried different ways to merge and/or join the data below but couldn't wrap my head around how to do that correctly.
Initially I have a data like this:
index unique_id group_name id name
0 100 ABC 20 aaa
1 100 ABC 21 bbb
2 100 DEF 22 ccc
3 100 DEF 23 ddd
4 100 DEF 24 eee
5 100 DEF 25 fff
6 101 ABC 30 ggg
7 101 ABC 31 hhh
8 101 ABC 32 iii
9 101 DEF 33 jjj
The goal is to reshape it by merging on unique_id so that the result looks like this:
index unique_id group_name_x id_x name_x group_name_y id_y name_y
0 100 ABC 20 aaa DEF 22 ccc
1 100 ABC 21 bbb DEF 23 ddd
2 100 NaN NaN NaN DEF 24 eee
3 100 NaN NaN NaN DEF 25 fff
4 101 ABC 30 ggg DEF 33 jjj
5 101 ABC 31 hhh NaN NaN NaN
6 101 ABC 32 iii NaN NaN NaN
How can I do this in pandas? The best I could think of is to split the data into two dataframes by group name (ABC and DEF) and then merge them with how='outer', on='unique_id', but that way it cross-references every pair of records (2 ABC x 4 DEF = 8 records) without any NaN's.
pd.concat with axis=1 mentioned in answers doesn't align the data per unique_id and doesn't create any NaN's.
As you said, split the dataframe, then concat the two dataframes column-wise (axis=1) after resetting both indexes.
Working code:
df = pd.read_clipboard()
req_cols = ['group_name', 'id', 'name']
df_1 = df[df['group_name'] == 'ABC'].reset_index(drop=True)
df_2 = df[df['group_name'] == 'DEF'].reset_index(drop=True)
df_1 = df_1.rename(columns=dict(zip(df_1[req_cols].columns.values,
                                    df_1[req_cols].add_suffix('_x'))))
df_2 = df_2.rename(columns=dict(zip(df_2[req_cols].columns.values,
                                    df_2[req_cols].add_suffix('_y'))))
req_cols_x = [val + '_x' for val in req_cols]
print(pd.concat([df_2, df_1[req_cols_x]], axis=1))
Output:
   index  unique_id group_name_y  id_y name_y group_name_x  id_x name_x
0      2        100          DEF    22    ccc          ABC  20.0    aaa
1      3        100          DEF    23    ddd          ABC  21.0    bbb
2      4        100          DEF    24    eee          NaN   NaN    NaN
3      5        100          DEF    25    fff          NaN   NaN    NaN
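Note that, as the asker observed, the concat above aligns rows only by their reset positional index, not per unique_id. A hedged sketch that does align per unique_id pairs rows by their position within each (unique_id, group_name) block via cumcount before an outer merge, and reproduces the desired output:
import pandas as pd

df = pd.DataFrame({
    'unique_id': [100, 100, 100, 100, 100, 100, 101, 101, 101, 101],
    'group_name': ['ABC', 'ABC', 'DEF', 'DEF', 'DEF', 'DEF',
                   'ABC', 'ABC', 'ABC', 'DEF'],
    'id': [20, 21, 22, 23, 24, 25, 30, 31, 32, 33],
    'name': ['aaa', 'bbb', 'ccc', 'ddd', 'eee', 'fff',
             'ggg', 'hhh', 'iii', 'jjj']})

# position of each row within its (unique_id, group_name) block
df['pos'] = df.groupby(['unique_id', 'group_name']).cumcount()

abc = df[df['group_name'] == 'ABC']
other = df[df['group_name'] == 'DEF']

out = (abc.merge(other, on=['unique_id', 'pos'], how='outer', suffixes=('_x', '_y'))
          .sort_values(['unique_id', 'pos'])
          .drop(columns='pos'))
print(out)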

Remove duplicates based on a few columns and sum the other columns

The following is my data:
name id junk date time value value2
abc 1 1 1/1/2017 18:07:54 5 10
abc 1 2 1/1/2017 19:07:54 10 15
abc 2 3 2/1/2017 20:07:54 15 20
abc 2 4 2/1/2017 21:07:54 20 25
def 3 5 3/1/2017 22:07:54 25 30
def 3 6 3/1/2017 23:07:54 30 35
def 4 7 4/1/2017 12:07:54 35 40
def 4 8 4/1/2017 13:07:54 40 45
I want to remove the duplicates based on three columns, name, id and date and take the first value. I tried the following command:
data.drop_duplicates(subset=['name', 'id', 'date'],keep = 'first')
I also want to group by these three columns and take the sum of the value and value2 columns, and I tried the following:
data[['name', 'id', 'date', 'value']].groupby(['name', 'id', 'date']).sum()
data[['name', 'id', 'date', 'value2']].groupby(['name', 'id', 'date']).sum()
Now I want to join all three data frames and take the columns. I am thinking there should be a better way to do this? The following is the output I am looking for:
name id junk date time value value2
abc 1 1 1/1/2017 18:07:54 15 25
abc 2 3 2/1/2017 20:07:54 35 45
def 3 5 3/1/2017 22:07:54 55 65
def 4 7 4/1/2017 12:07:54 75 85
That is, I want to remove duplicates based on the name, id and date columns, take the first value of the junk and time columns, and sum the value and value2 columns.
Can anybody help me with this?
You need groupby with agg:
df = (df.groupby(['name', 'id', 'date'])
        .agg({'value': 'sum', 'value2': 'sum', 'time': 'first', 'junk': 'first'})
        .reset_index())
print (df)
  name  id      date  value2      time  junk  value
0  abc   1  1/1/2017      25  18:07:54     1     15
1  abc   2  2/1/2017      45  20:07:54     3     35
2  def   3  3/1/2017      65  22:07:54     5     55
3  def   4  4/1/2017      85  12:07:54     7     75
Dynamic solution:
g_cols = ['name','id','date']
sum_cols = ['value','value2']
#columns that are neither groupby keys nor sum columns
cols = df.columns[~df.columns.isin(sum_cols + g_cols)]
print (cols)
Index(['junk', 'time'], dtype='object')
#dict comprehension for sum columns
d_sum = {col:'sum' for col in sum_cols}
#dict comprehension for first columns
d = {col:'first' for col in cols}
#add dicts together
d.update(d_sum)
print (d)
{'value2': 'sum', 'time': 'first', 'junk': 'first', 'value': 'sum'}
df = df.groupby(g_cols).agg(d).reset_index()
print (df)
  name  id      date  value2      time  junk  value
0  abc   1  1/1/2017      25  18:07:54     1     15
1  abc   2  2/1/2017      45  20:07:54     3     35
2  def   3  3/1/2017      65  22:07:54     5     55
3  def   4  4/1/2017      85  12:07:54     7     75
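On pandas 0.25+ the same dynamic idea can be written with named aggregation, which also lets you pin the output column order (a hedged variant of the solution above):
g_cols = ['name', 'id', 'date']
sum_cols = ['value', 'value2']
first_cols = [c for c in df.columns if c not in g_cols + sum_cols]

# output column -> (input column, aggregation function)
spec = {c: (c, 'first') for c in first_cols}
spec.update({c: (c, 'sum') for c in sum_cols})

df = df.groupby(g_cols, as_index=False).agg(**spec)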
