I have a DataFrame df:
A B C date
0 4 5 5 2019-06-2
1 3 5 2 2019-06-2
2 3 2 1 2019-06-2
3 4 4 3 2019-06-3
4 5 4 6 2019-06-3
5 2 3 7 2019-06-3
Now I can group by one column using the following code:
df.groupby('date')['A'].apply(list)
date
2019-06-2    [4, 3, 3]
2019-06-3    [4, 5, 2]
Name: A, dtype: object
but what if I want to group by multiple columns? I've tried something like this, but it doesn't seem to work:
df.groupby('date')[['A','B','C']].apply(list)
The final DataFrame should look like this:
A B C date
0 [4,3,3] [5,5,2] [5,2,1] 2019-06-2
1 [4,5,2] [4,4,3] [3,6,7] 2019-06-3
Use GroupBy.agg instead of GroupBy.apply:
df1 = df.groupby('date')[['A','B','C']].agg(list).reset_index()
print (df1)
date A B C
0 2019-06-2 [4, 3, 3] [5, 5, 2] [5, 2, 1]
1 2019-06-3 [4, 5, 2] [4, 4, 3] [3, 6, 7]
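As a small variation, as_index=False should produce the same frame without the separate reset_index call:
df1 = df.groupby('date', as_index=False)[['A','B','C']].agg(list)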
EDIT: If you want to apply more aggregation functions, pass them as a list:
df2 = df.groupby('date')[['A','B','C']].agg(['mean','min','max', list])
print (df2)
A B C \
mean min max list mean min max list mean
date
2019-06-2 3.333333 3 4 [4, 3, 3] 4.000000 2 5 [5, 5, 2] 2.666667
2019-06-3 3.666667 2 5 [4, 5, 2] 3.666667 3 4 [4, 4, 3] 5.333333
min max list
date
2019-06-2 1 5 [5, 2, 1]
2019-06-3 3 7 [3, 6, 7]
Then the MultiIndex columns can be flattened:
df2 = df.groupby('date')[['A','B','C']].agg(['mean','min','max', list])
df2.columns = df2.columns.map(lambda x: f'{x[0]}_{x[1]}')
df2 = df2.reset_index()
print (df2)
date A_mean A_min A_max A_list B_mean B_min B_max \
0 2019-06-2 3.333333 3 4 [4, 3, 3] 4.000000 2 5
1 2019-06-3 3.666667 2 5 [4, 5, 2] 3.666667 3 4
B_list C_mean C_min C_max C_list
0 [5, 5, 2] 2.666667 1 5 [5, 2, 1]
1 [4, 4, 3] 5.333333 3 7 [3, 6, 7]
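The same flattening can also be written with a list comprehension over the column tuples:
df2.columns = [f'{a}_{b}' for a, b in df2.columns]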
I have a similar DataFrame to the one below and need it reshaped as per the expected output.
import pandas as pd

df = pd.DataFrame({
    'col1': ['A', 'A', 'A', 'B', 'B', 'B'],
    'col2': [1, 3, 5, 7, 9, 11],
    'col3': [2, 4, 6, 8, 10, 12]
})
col1 col2 col3
0 A 1 2
1 A 3 4
2 A 5 6
3 B 7 8
4 B 9 10
5 B 11 12
Expected Output
df_expected = pd.DataFrame({
    'A': [1, 2, 3, 4, 5, 6],
    'B': [7, 8, 9, 10, 11, 12]
})
A B
0 1 7
1 2 8
2 3 9
3 4 10
4 5 11
5 6 12
So far I have tried stack, unstack & pivot without getting the desired result.
Thanks for your help!
A one-liner almost works with df.groupby('col1').agg(list).T.sum(), but that concatenates col2 and col3 end to end ([1, 3, 5, 2, 4, 6]) rather than interleaving them. Stacking first keeps each row's col2/col3 pair in order:
pd.DataFrame(df.set_index('col1').stack().groupby(level=0).apply(list).to_dict())
Use NumPy to reshape the data, then package it back up into a DataFrame.
import numpy as np

# Interleave col2 and col3 row-wise: [[1, 2], [3, 4], ...]
cols = (df['col2'], df['col3'])
data = np.stack(cols, axis=1).reshape(len(cols), len(df))  # one row per col1 group
dft = pd.DataFrame(data, index=df['col1'].unique()).T  # groups become columns
print(dft)
Result
A B
0 1 7
1 2 8
2 3 9
3 4 10
4 5 11
5 6 12
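Note that this relies on the frame being sorted by col1 and on every group contributing the same number of values; np.stack and reshape know nothing about the groups themselves.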
I have two dataframes, something like this:
df1 = pd.DataFrame({
    'Identity': [3, 4, 5, 6, 7, 8, 9],
    'Value': [1, 2, 3, 4, 5, 6, 7],
    'Notes': ['a', 'b', 'c', 'd', 'e', 'f', 'g'],
})
df2 = pd.DataFrame({
    'Identity': [4, 8],
    'Value': [0, 128],
})
In[3]: df1
Out[3]:
Identity Value Notes
0 3 1 a
1 4 2 b
2 5 3 c
3 6 4 d
4 7 5 e
5 8 6 f
6 9 7 g
In[4]: df2
Out[4]:
Identity Value
0 4 0
1 8 128
I'd like to use df2 to overwrite df1 but only where values exist in df2, so I end up with:
Identity Value Notes
0 3 1 a
1 4 0 b
2 5 3 c
3 6 4 d
4 7 5 e
5 8 128 f
6 9 7 g
I've been searching through the various merge, combine, and join functions, but I can't seem to find one that does what I want. Is there a simple way of doing this?
Use:
df1['Value'] = df1['Identity'].map(df2.set_index('Identity')['Value']).fillna(df1['Value'])
Or try set_index with reindex, then reset_index and fillna:
df1['Value'] = (df2.set_index('Identity')
                   .reindex(df1['Identity'])
                   .reset_index(drop=True)['Value']
                   .fillna(df1['Value']))
>>> df1
Identity Value Notes
0 3 1.0 a
1 4 0.0 b
2 5 3.0 c
3 6 4.0 d
4 7 5.0 e
5 8 128.0 f
6 9 7.0 g
The reindex fills Identity values that are missing from df2 with NaN, and fillna then replaces those NaNs with the original df1 values.
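For completeness, DataFrame.update can do the same overwrite in place, aligning on the index (a sketch; as above, the alignment may leave Value as float):
df1 = df1.set_index('Identity')
df1.update(df2.set_index('Identity'))  # overwrite only where df2 has a value
df1 = df1.reset_index()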
I have a DataFrame in which the index is a datetime and columns A and B are objects. I need to see the unique values of A and B per week.
I managed to get the unique value count per week (I am using pd.Grouper for that), but I am struggling to get the unique values themselves.
This code gives me the unique value counts per week:
df_unique = pd.DataFrame(df.groupby(pd.Grouper(freq="W"))['A', 'B'].nunique())
However, the code below does not give me the unique values themselves per week:
df_unique_list = pd.DataFrame(df.groupby(pd.Grouper(freq="W"))['A', 'B'].unique())
This code gives me the following error message:
AttributeError: 'DataFrameGroupBy' object has no attribute 'unique'
Use a lambda function with Series.unique, converting the result to a list:
import numpy as np
import pandas as pd

np.random.seed(123)
rng = pd.date_range('2017-04-03', periods=20)
df = pd.DataFrame({'A': np.random.choice([1,2,3,4,5,6], size=20),
                   'B': np.random.choice([1,2,3,4,5,6,7,8], size=20)}, index=rng)
print (df)
A B
2017-04-03 6 1
2017-04-04 3 5
2017-04-05 5 2
2017-04-06 3 8
2017-04-07 2 4
2017-04-08 4 3
2017-04-09 3 5
2017-04-10 4 8
2017-04-11 2 3
2017-04-12 2 5
2017-04-13 1 8
2017-04-14 2 1
2017-04-15 2 6
2017-04-16 1 1
2017-04-17 1 8
2017-04-18 2 2
2017-04-19 4 4
2017-04-20 6 5
2017-04-21 5 5
2017-04-22 1 5
df_unique_list = df.groupby(pd.Grouper(freq="W"))['A', 'B'].agg(lambda x: list(x.unique()))
print (df_unique_list)
A B
2017-04-09 [6, 3, 5, 2, 4] [1, 5, 2, 8, 4, 3]
2017-04-16 [4, 2, 1] [8, 3, 5, 1, 6]
2017-04-23 [1, 2, 4, 6, 5] [8, 2, 4, 5]
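Note for newer pandas (2.0+): selecting several columns from a groupby with a bare ['A', 'B'] raises an error, so the column names must be wrapped in a list:
df_unique_list = df.groupby(pd.Grouper(freq="W"))[['A', 'B']].agg(lambda x: list(x.unique()))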
I have two DataFrames:
import pandas as pd

df1 = pd.DataFrame({'A': [3, 2, 5, 1, 6], 'B': [4, 8, 5, 6, 2], 'C': [4, 8, 3, 8, 0], 'D': [1, 5, 2, 5, 7], 'zebra': [5, 7, 2, 4, 8]})
df2 = pd.DataFrame({'B': [7, 3, 5, 8, 8], 'D': [4, 5, 8, 5, 3]})
print(df1)
print(df2)
A B C D zebra
0 3 4 4 1 5
1 2 8 8 5 7
2 5 5 3 2 2
3 1 6 8 5 4
4 6 2 0 7 8
B D
0 7 4
1 3 5
2 5 8
3 8 5
4 8 3
This is a simple example; the real df1 has 1000k+ rows and 10+ columns, while df2 has only 24 rows and fewer columns. For every row in df2, I would like to compare the specific columns (here 'B' and 'D') against the same-named columns in df1. If the values in B and D match a row of df1, assign that row's zebra value to a new zebra column in the same row of df2; if no match is found, assign 0 or NaN.
B D zebra
0 7 4 nan
1 3 5 nan
2 5 8 nan
3 8 5 7
4 8 3 nan
In the example, only row index 3 in df2 ('B': 8, 'D': 5) matches a row in df1 (index 1; note that row indices should not matter in the comparison), so that row is assigned the corresponding zebra value 7.
A merge would do
df2.merge(df1[['B', 'D', 'zebra']], on = ['B', 'D'], how = 'left')
B D zebra
0 7 4 NaN
1 3 5 NaN
2 5 8 NaN
3 8 5 7.0
4 8 3 NaN
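One caveat, since the real df1 is much larger: if it contains repeated (B, D) pairs, the left merge will duplicate rows of df2. Dropping duplicates first keeps one match per row (a sketch keeping the first occurrence):
df2.merge(df1.drop_duplicates(['B', 'D'])[['B', 'D', 'zebra']], on=['B', 'D'], how='left')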
I have two DataFrames of different sizes and I would like to compare values across four columns (two sets of two).
Essentially, where df1['A'] == df2['A'] and df1['B'] == df2['B'], I want to return df1['C']'s value plus df2['C']'s value.
import pandas as pd
df1 = pd.DataFrame({"A": [1, 2, 3, 4, 3], "B": [2, 5, 4, 7, 5], "C": [1, 2, 8, 0, 0]})
df2 = pd.DataFrame({"A": [1, 3, 2, 4, 8], "B": [5, 5, 4, 9, 1], "C": [1, 3, 3, 4, 6]})
df1:
A B C
0 1 2 1
1 2 5 2
2 3 4 8
3 4 7 0
4 3 5 0
...
df2:
A B C
0 1 5 1
1 3 4 3
2 2 5 4
3 4 9 4
4 8 1 6
...
in: where df1['A'] == df2['A'] and df1['B'] == df2['B']:
df1['D'] = df1['C'] + df2['C']
out: df1:
A B C D
0 1 2 1 nan
1 2 5 2 6
2 3 4 8 11
3 4 7 0 nan
4 3 5 0 nan
My actual DataFrames are much larger (about 120,000 rows, with values in both 'A' columns ranging from 1 to 700 and in 'B' from 1 to 300), so I know it might be a longer process.
You can merge the two DataFrames on columns A and B. Since you want to keep all values from df1, do a left merge of df1 and df2. The merged column C from df2 will be null wherever A and B don't match. After the merge, it's just a matter of renaming the merged column and doing a sum.
# Do a left merge, keeping df1 column names unchanged.
df1 = pd.merge(df1, df2, how='left', on=['A', 'B'], suffixes=('', '_2'))
# Add the two columns, fill locations that don't match with zero, and rename.
df1['C_2'] = df1['C_2'].add(df1['C']).fillna(0)
df1.rename(columns={'C_2': 'D'}, inplace=True)
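For the sample frames above this should give (the NaN fill leaves D as floats):
   A  B  C     D
0  1  2  1   0.0
1  2  5  2   6.0
2  3  4  8  11.0
3  4  7  0   0.0
4  3  5  0   0.0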
You could first merge the two dataframes
In [145]: dff = pd.merge(df1, df2, on=['A', 'B'], how='left')
In [146]: dff
Out[146]:
A B C_x C_y
0 1 2 1 NaN
1 2 5 2 4
2 3 4 8 3
3 4 7 0 NaN
4 3 5 0 NaN
Then take a row-wise sum over the C_-prefixed columns with skipna=False, so rows where a value is missing sum to NaN, and fill those NaNs with zero.
In [147]: dff['C'] = dff.filter(regex='C_').sum(skipna=False, axis=1).fillna(0)
In [148]: dff
Out[148]:
A B C_x C_y C
0 1 2 1 NaN 0
1 2 5 2 4 6
2 3 4 8 3 11
3 4 7 0 NaN 0
4 3 5 0 NaN 0
And you can drop or pick the required columns.
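For example, to keep just the keys and the summed column:
dff = dff.drop(columns=['C_x', 'C_y'])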