Python Pandas Dataframe select row by max date in group with aggregate - python

I have a dataframe as follows:
df = pd.DataFrame({'id': ['A', 'A', 'B', 'B', 'C'],
                   'date': ['2021-01-01T14:54:42.000Z',
                            '2021-01-01T14:54:42.000Z',
                            '2021-01-01T14:55:42.000Z',
                            '2021-04-01T15:51:42.000Z',
                            '2021-03-01T15:51:42.000Z'],
                   'foo': ['apple', 'orange', 'apple', 'banana', 'pepper'],
                   'count': [3, 2, 4, 2, 1]})
I want to group the dataframe by id and date so that foo and count per date are aggregated lists. I then want to take the row with the most recent date per id.
Expected outcome
id date foo count
A '2021-01-01T14:54:42.000Z' ['apple', 'orange'] [3, 2]
B '2021-04-01T15:51:42.000Z' ['banana'] [2]
C '2021-03-01T15:51:42.000Z' ['pepper'] [1]
I've tried
df = df.sort_values(['id', 'date'], ascending=(True, False))
test_df = df.groupby(['id', 'date'], as_index=False)['foo', 'count'].agg(list).head(1).reset_index(drop=True)
but this only gives me the first row of the df. .first() gives me a TypeError. Any help is greatly appreciated.

In your case
df.groupby('id',as_index=False).agg({'date':'max','foo':list,'count':list})
Out[178]:
id date foo count
0 A 2021-01-01T14:54:42.000Z [apple, orange] [3, 2]
1 B 2021-04-01T15:51:42.000Z [apple, banana] [4, 2]
2 C 2021-03-01T15:51:42.000Z [pepper] [1]
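Note that this aggregates foo and count over all rows of each id, so B keeps 'apple' from the older date. To match the expected output exactly (lists built only from the most recent date), one sketch is to aggregate per (id, date) first and then keep the max-date row per id; the variable names below are my own:

```python
import pandas as pd

df = pd.DataFrame({'id': ['A', 'A', 'B', 'B', 'C'],
                   'date': ['2021-01-01T14:54:42.000Z',
                            '2021-01-01T14:54:42.000Z',
                            '2021-01-01T14:55:42.000Z',
                            '2021-04-01T15:51:42.000Z',
                            '2021-03-01T15:51:42.000Z'],
                   'foo': ['apple', 'orange', 'apple', 'banana', 'pepper'],
                   'count': [3, 2, 4, 2, 1]})

# aggregate foo/count into lists per (id, date) first
agg = df.groupby(['id', 'date'], as_index=False).agg({'foo': list, 'count': list})

# ISO-8601 strings sort chronologically, so keep the last date per id
out = (agg.sort_values('date')
          .drop_duplicates('id', keep='last')
          .sort_values('id')
          .reset_index(drop=True))
```

This keeps only ['banana'] for B, as in the question's expected outcome.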

Related

Replace values of a dataframe with the value of another dataframe

I have two pandas dataframes
df1 = pd.DataFrame({'A': [1, 3, 5], 'B': [3, 4, 5]})
df2 = pd.DataFrame({'A': [1, 2, 3, 4, 5], 'B': [8, 9, 10, 11, 12], 'C': ['K', 'D', 'E', 'F', 'G']})
The key column of both dataframes is 'A'.
How to replace the values of df1's column 'B' with the values of df2 column 'B'?
RESULT of df1:
A B
1 8
3 10
5 12
Maybe dataframe.isin() is what you're searching for:
df1['B'] = df2[df2['A'].isin(df1['A'])]['B'].values
print(df1)
Prints:
A B
0 1 8
1 3 10
2 5 12
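Note that the .values assignment relies on the filtered df2 rows coming back in the same order and count as df1's keys. A sketch of an order-independent variant (not from the answer above) aligns by key with map:

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 3, 5], 'B': [3, 4, 5]})
df2 = pd.DataFrame({'A': [1, 2, 3, 4, 5], 'B': [8, 9, 10, 11, 12], 'C': ['K', 'D', 'E', 'F', 'G']})

# look up each df1 key in df2, regardless of row order in either frame
df1['B'] = df1['A'].map(df2.set_index('A')['B'])
```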
One of possible solutions:
wrk = df1.set_index('A').B
wrk.update(df2.set_index('A').B)
df1 = wrk.reset_index()
The result is:
A B
0 1 8
1 3 10
2 5 12
Another solution, based on merge:
df1 = df1.merge(df2[['A', 'B']], how='left', on='A', suffixes=['_x', ''])\
.drop(columns=['B_x'])
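Run end to end on the sample frames, the merge approach gives the same result; the empty suffix keeps df2's 'B' under its original name while df1's copy becomes 'B_x' and is dropped:

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 3, 5], 'B': [3, 4, 5]})
df2 = pd.DataFrame({'A': [1, 2, 3, 4, 5], 'B': [8, 9, 10, 11, 12], 'C': ['K', 'D', 'E', 'F', 'G']})

# left-join on 'A'; df1's overlapping 'B' gets suffix '_x' and is discarded
df1 = (df1.merge(df2[['A', 'B']], how='left', on='A', suffixes=['_x', ''])
          .drop(columns=['B_x']))
```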

How to dcast a pandas dataframe and then map the column names when dropping the index

I have the following dataframe
import pandas as pd
df = pd.DataFrame({'fmc': [1, 2],
                   'id_r': [1, 1],
                   'id_b': ['a', 'b'],
                   'id_c': ['br', 'br'],
                   'fmc_': ['aa_bb', 'cc_dd']})
I would like to dcast this df with index=['id_r', 'id_b', 'id_c'] and values='fmc'.
I would like the output to be like
import numpy as np
dff = pd.DataFrame({'id_r': [1, 1],
                    'id_b': ['a', 'b'],
                    'id_c': ['br', 'br'],
                    'fmc_aa_bb': [1, np.nan],
                    'fmc_cc_dd': [np.nan, 2]})
I followed a previous question of mine
df = df.pivot_table(index=['id_r', 'id_b', 'id_c'], columns='fmc_', values='fmc')
df.columns = df.columns.map('_'.join)
df = df.reset_index()
But it does not give the desired output. Any help?
Modify your code with add_prefix
s=df.pivot_table(index=['id_r', 'id_b', 'id_c'], columns='fmc_', values='fmc').add_prefix('fmc_').reset_index()
Out[190]:
fmc_ id_r id_b id_c fmc_aa_bb fmc_cc_dd
0 1 a br 1.0 NaN
1 1 b br NaN 2.0
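The reason the question's '_'.join attempt misbehaves here is that this pivot produces plain string column names ('aa_bb', 'cc_dd'), so join iterates over the characters of each name; add_prefix is all that's needed. A full runnable sketch (out is my own variable name), which also clears the leftover 'fmc_' columns label from the output above:

```python
import pandas as pd

df = pd.DataFrame({'fmc': [1, 2],
                   'id_r': [1, 1],
                   'id_b': ['a', 'b'],
                   'id_c': ['br', 'br'],
                   'fmc_': ['aa_bb', 'cc_dd']})

# pivot, prefix the plain-string column names, flatten the index back to columns
out = (df.pivot_table(index=['id_r', 'id_b', 'id_c'], columns='fmc_', values='fmc')
         .add_prefix('fmc_')
         .reset_index())
out.columns.name = None  # drop the residual 'fmc_' axis label
```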

Pandas groupby get row with max in multiple columns

Looking to get the row of a group that has the maximum value across multiple columns:
pd.DataFrame([{'grouper': 'a', 'col1': 1, 'col2': 3, 'uniq_id': 1}, {'grouper': 'a', 'col1': 2, 'col2': 4, 'uniq_id': 2}, {'grouper': 'a', 'col1': 3, 'col2': 2, 'uniq_id': 3}])
col1 col2 grouper uniq_id
0 1 3 a 1
1 2 4 a 2
2 3 2 a 3
In the above, I'm grouping by the "grouper" column. Within the "a" group, I want the row that has the max of col1 and col2; in this case that is the row with uniq_id 2, because it holds the highest value across col1/col2 with 4. So the outcome would be:
col1 col2 grouper uniq_id
1 2 4 a 2
In my actual example, I'm using timestamps, so I don't actually expect ties. But in the case of a tie, I am indifferent to which row I select within the group, so taking the first row of the group would be fine.
One more way you can try:
# find row wise max value
df['row_max'] = df[['col1','col2']].max(axis=1)
# filter rows from groups
df.loc[df.groupby('grouper')['row_max'].idxmax()]
col1 col2 grouper uniq_id row_max
1 2 4 a 2 4
Later you can drop row_max using df.drop('row_max', axis=1)
IIUC, use transform('max'), then compare with the original dataframe:
g=df.groupby('grouper')
s1=g.col1.transform('max')
s2=g.col2.transform('max')
s=pd.concat([s1,s2],axis=1).max(1)
df.loc[df[['col1','col2']].eq(s, axis=0).any(axis=1)]
Out[89]:
col1 col2 grouper uniq_id
1 2 4 a 2
Interesting approaches all around. Adding another one just to show the power of apply (which I'm a big fan of) and using some of the other mentioned methods.
import pandas as pd
df = pd.DataFrame(
[
{"grouper": "a", "col1": 1, "col2": 3, "uniq_id": 1},
{"grouper": "a", "col1": 2, "col2": 4, "uniq_id": 2},
{"grouper": "a", "col1": 3, "col2": 2, "uniq_id": 3},
]
)
def find_max(grp):
# find max value per row, then find index of row with max val
max_row_idx = grp[["col1", "col2"]].max(axis=1).idxmax()
return grp.loc[max_row_idx]
df.groupby("grouper").apply(find_max)
value = pd.concat([df['col1'], df['col2']], axis = 0).max()
df.loc[(df['col1'] == value) | (df['col2'] == value), :]
col1 col2 grouper uniq_id
1 2 4 a 2
This probably isn't the fastest way, but it will work in your case. Concat both the columns and find the max, then search the df for where either column equals the value.
You can use numpy and pandas as follows:
import numpy as np
import pandas as pd
df = pd.DataFrame({'col1': [1, 2, 3],
                   'col2': [3, 4, 2],
                   'grouper': ['a', 'a', 'a'],
                   'uniq_id': [1, 2, 3]})
df['temp'] = np.max([df.col1.values, df.col2.values],axis=0)
idx = df.groupby('grouper')['temp'].idxmax()
df.loc[idx].drop('temp', axis=1)
col1 col2 grouper uniq_id
1 2 4 a 2
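The approaches above can also be condensed into one chained expression; a sketch combining the row-wise max with groupby/idxmax (ties resolve to the first row, matching the asker's requirement):

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3],
                   'col2': [3, 4, 2],
                   'grouper': ['a', 'a', 'a'],
                   'uniq_id': [1, 2, 3]})

# row-wise max across both columns, then the index of the max row per group
idx = df[['col1', 'col2']].max(axis=1).groupby(df['grouper']).idxmax()
out = df.loc[idx]
```

This avoids adding and dropping a temporary column.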

Pandas: renaming columns that have the same name

I have a dataframe that has duplicated column names a, b and b. I would like to rename the second b into c.
df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6], "b1": [7, 8, 9]})
df.rename(index=str, columns={'b1' : 'b'})
Trying this with no success:
df.rename(index=str, columns={2 : "c"})
try:
>>> df.columns = ['a', 'b', 'c']
>>> df
a b c
0 1 4 7
1 2 5 8
2 3 6 9
You can always just manually rename all the columns.
df.columns = ['a', 'b', 'c']
You can simply do:
df.columns = ['a','b','c']
If your columns are ordered and you want lettered columns, don't type names out manually. This is prone to error.
You can use string.ascii_lowercase, assuming you have a maximum of 26 columns:
from string import ascii_lowercase
df = pd.DataFrame(columns=['a', 'b', 'b1'])
df.columns = list(ascii_lowercase[:len(df.columns)])
print(df.columns)
Index(['a', 'b', 'c'], dtype='object')
These solutions don't address the case of many columns.
Here is a solution that, regardless of the number of columns, renames columns sharing the same name to unique names:
df.columns = ['name'+str(col[0]) if col[1] == 'name' else col[1] for col in enumerate(df.columns)]
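A more general sketch deduplicates any repeated names, not just a fixed one, using groupby/cumcount; the suffix scheme below is my own choice:

```python
import pandas as pd

# frame with a genuinely duplicated column name
df = pd.DataFrame([[1, 4, 7], [2, 5, 8]])
df.columns = ['a', 'b', 'b']

s = pd.Series(df.columns)
dup_no = s.groupby(s).cumcount()  # 0 for the first occurrence, then 1, 2, ...
df.columns = [name if n == 0 else f'{name}_{n}' for name, n in zip(s, dup_no)]
```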

Select columns of pandas dataframe using a dictionary list value

I have column names in a dictionary and would like to select those columns from a dataframe.
In the example below, how do I select dictionary values 'b', 'c' and save it in to df1?
import pandas as pd
ds = {'cols': ['b', 'c']}
d = {'a': [2, 3], 'b': [3, 4], 'c': [4, 5]}
df_in = pd.DataFrame(data=d)
print(ds)
print(df_in)
df_out = df_in[[ds['cols']]]
print(df_out)
TypeError: unhashable type: 'list'
Remove nested list - []:
df_out = df_in[ds['cols']]
print(df_out)
b c
0 3 4
1 4 5
According to ref, you just need to drop one set of brackets.
df_out = df_in[ds['cols']]
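If some names in the dictionary might not exist in the frame, plain indexing raises a KeyError; a defensive sketch (my own addition) keeps only the requested columns that are present, via Index.intersection:

```python
import pandas as pd

ds = {'cols': ['b', 'c', 'z']}  # 'z' is not a column of df_in
d = {'a': [2, 3], 'b': [3, 4], 'c': [4, 5]}
df_in = pd.DataFrame(data=d)

# select only the requested columns that actually exist in the frame
df_out = df_in[df_in.columns.intersection(ds['cols'])]
```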
