Is there a way to tell the groupby() call to pass the group name to the apply() lambda function?
Similar to how, when iterating through groups, I can get the group key via the following tuple decomposition:
for group_name, subdf in temp_dataframe.groupby(level=0, axis=0):
    print(group_name)
...is there a way to also get the group name in the apply function, such as:
temp_dataframe.groupby(level=0, axis=0).apply(lambda group_name, subdf: foo(group_name, subdf))
How can I get the group name as an argument for the apply lambda function?
I think you should be able to use the name attribute:
temp_dataframe.groupby(level=0,axis=0).apply(lambda x: foo(x.name, x))
This should work. Example:
In [132]:
df = pd.DataFrame({'a':list('aabccc'), 'b':np.arange(6)})
df
Out[132]:
a b
0 a 0
1 a 1
2 b 2
3 c 3
4 c 4
5 c 5
In [134]:
df.groupby('a').apply(lambda x: print('name:', x.name, '\nsubdf:',x))
name: a
subdf: a b
0 a 0
1 a 1
name: b
subdf: a b
2 b 2
name: c
subdf: a b
3 c 3
4 c 4
5 c 5
Out[134]:
Empty DataFrame
Columns: []
Index: []
(The empty frame is just the collected return values: the lambda only prints, so it returns None for every group.)
For those who came looking for an answer to the question
"Including the group name in the transform function pandas python"
and ended up in this thread, please read on.
Given the following input:
import numpy as np
import pandas as pd

df = pd.DataFrame(data={'col1': list('aabccc'),
                        'col2': np.arange(6),
                        'col3': np.arange(6)})
Data:
col1 col2 col3
0 a 0 0
1 a 1 1
2 b 2 2
3 c 3 3
4 c 4 4
5 c 5 5
We can access the group name (which is visible from the scope of the calling apply function) like this:
(df.groupby('col1')
   .apply(lambda frame: frame
          .transform(lambda col: col + 3 if frame.name == 'a' and col.name == 'col2' else col)))
Output:
col1 col2 col3
0 a 3 0
1 a 4 1
2 b 2 2
3 c 3 3
4 c 4 4
5 c 5 5
Note that the call to apply is needed in order to obtain a reference to the sub pandas.core.frame.DataFrame (i.e. frame) which holds the name attribute of the corresponding sub group. The name attribute of the argument of transform (i.e. col) refers to the column/series name.
Alternatively, one could also loop over the groups and then, within each group, over the columns:
for grp_name, sub_df in df.groupby('col1'):
for col in sub_df:
if grp_name == 'a' and col == 'col2':
df.loc[df.col1 == grp_name, col] = sub_df[col] + 3
My use case is quite rare and this was the only way to achieve my goal (as of pandas v0.24.2). However, I'd recommend exploring the pandas documentation thoroughly because there most likely is an easier vectorised solution to what you may need this construct for.
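For this particular toy example, a plain boolean mask gives the same result without the nested apply/transform. A minimal sketch (not part of the original answer):

import numpy as np
import pandas as pd

df = pd.DataFrame(data={'col1': list('aabccc'),
                        'col2': np.arange(6),
                        'col3': np.arange(6)})

# Add 3 to 'col2' only where 'col1' equals 'a' -- same output as above.
df.loc[df['col1'] == 'a', 'col2'] += 3
print(df)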
Related
In pandas I have a dataframe like the one below:
import pandas as pd

data = [['A','B',3], ['A','C',4], ['A','D',5], ['B','A',4], ['B','C',4], ['C','D',1]]
df = pd.DataFrame(data, columns=['Col1', 'Col2', 'Value'])
df
Col1  Col2  Value
A     B     3
A     C     4
A     D     5
B     C     4
C     D     1
B     A     4
I want to convert it to the following:
A:B  A:C  A:D  B:C  C:D
7    4    5    4    1
Note: the first column A:B has value 7 because the pair occurs in both orders: (A:B) = 3 and (B:A) = 4 sum to 7.
Please suggest a quick method.
Use sorted with join across both columns, aggregate with sum, and finally transpose:
df1 = (df.groupby(df[['Col1', 'Col2']]
                  .agg(lambda x: ':'.join(sorted(x)), axis=1))[['Value']]
         .sum()
         .T
         .reset_index(drop=True))
print(df1)
A:B A:C A:D B:C C:D
0 7 4 5 4 1
You can do:
df.groupby([':'.join(sorted(t))
            for t in zip(df['Col1'], df['Col2'])])['Value'].sum().to_frame().T
output:
A:B A:C A:D B:C C:D
Value 7 4 5 4 1
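As a side note on this approach: groupby accepts any list-like of labels with one entry per row (not just column names), which is exactly what the list comprehension builds here, so no helper column is needed.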
You can use set to make the order irrelevant, or sorted to guarantee a consistent ordering.
result = (df['Value']
          .groupby(df[['Col1', 'Col2']]
                   .apply(set, axis=1)
                   .apply(':'.join))
          .sum())
print(result['B:A'])
print(result)
But that gets you a Series:
B:A 7
C:A 4
C:B 4
D:A 5
D:C 1
Name: Value, dtype: int64
If you want it the other way, you need to make two small changes:
result = (df[['Value']]
          .groupby(df[['Col1', 'Col2']]
                   .apply(set, axis=1)
                   .apply(':'.join))
          .sum()
          .T)
print(result)
print(result['B:A']['Value'])
Note that set does not care about order, so a key might come out as either A:B or B:A.
If you need the keys consistently sorted so they always read 'A:B' (you should have specified that as part of your requirements), use sorted instead of set, as shown below:
result = (df['Value']
          .groupby(df[['Col1', 'Col2']]
                   .apply(sorted, axis=1)
                   .apply(':'.join))
          .sum())
print(result['A:B']) # always sorted...
print(result)
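A vectorised variant of the same idea (my sketch, not from the answers above) sorts the two key columns row-wise with numpy before joining:

import numpy as np
import pandas as pd

data = [['A','B',3], ['A','C',4], ['A','D',5], ['B','A',4], ['B','C',4], ['C','D',1]]
df = pd.DataFrame(data, columns=['Col1', 'Col2', 'Value'])

# Sort each (Col1, Col2) pair row-wise so that A:B and B:A share one key.
pairs = np.sort(df[['Col1', 'Col2']].to_numpy(), axis=1)
keys = [':'.join(p) for p in pairs]
print(df.groupby(keys)['Value'].sum())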
I have two pandas data frames (df1 and df2):
# df1
ID COL
1 A
2 F
2 A
3 A
3 S
3 D
4 D
# df2
ID VAL
1 1
2 0
3 0
3 1
4 0
My goal is to append the corresponding VAL from df2 to each ID in df1. However, the relationship is not one-to-one (this is my client's fault and there's nothing I can do about this). To solve this problem, I want to sort df1 by df2['ID'] such that df1['ID'] is identical to df2['ID'].
So basically, for any row i in 0 to len(df2):
if df1.loc[i, 'ID'] == df2.loc[i, 'ID'] then keep row i in df1.
if df1.loc[i, 'ID'] != df2.loc[i, 'ID'] then drop row i from df1 and repeat.
The desired result is:
ID COL
1 A
2 F
3 A
3 S
4 D
This way, I can use pandas.concat([df1, df2['VAL']], axis=1) to assign df2['VAL'] to df1.
Is there a standardized way to do this? Does pandas.merge() have a method for doing this?
Before this gets voted as a duplicate, please realize that len(df1) != len(df2), so threads like this are not quite what I'm looking for.
This can be done with merge on both ID and the order within each ID:
(df1.assign(idx=df1.groupby('ID').cumcount())
.merge(df2.assign(idx=df2.groupby('ID').cumcount()),
on=['ID','idx'],
suffixes=['','_drop'])
[df1.columns]
)
Output:
ID COL
0 1 A
1 2 F
2 3 A
3 3 S
4 4 D
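To see why cumcount works as the tie-breaker here, note that it numbers the rows within each group. A quick illustration, assuming the df1 from the question:

import pandas as pd

df1 = pd.DataFrame({'ID': [1, 2, 2, 3, 3, 3, 4],
                    'COL': ['A', 'F', 'A', 'A', 'S', 'D', 'D']})

# cumcount numbers the rows within each ID group (0, 1, 2, ...),
# so the pair (ID, idx) uniquely identifies a row in both frames.
print(df1.assign(idx=df1.groupby('ID').cumcount()))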
The simplest way I can see of getting the result you want is:
# Add a count for each repetition of the ids to temporary frames
x = df1.assign(id_counter=df1.groupby('ID').cumcount())
y = df2.assign(id_counter=df2.groupby('ID').cumcount())
# Merge using the ID and the repetition counter
df1 = pd.merge(x, y, how='right', on=['ID', 'id_counter']).drop('id_counter', axis=1)
Which would produce this output:
ID COL VAL
0 1 A 1
1 2 F 0
2 3 A 0
3 3 S 1
4 4 D 0
I have a dataframe (df1) with one categorical column:
df1 = pd.DataFrame({'COL1': ['AA','AB','BC','AC','BA','BB','BB','CA','CB','CD','CE']})
I have another dataframe (df2) which has two columns:
df2 = pd.DataFrame({'Category': ['AA','AB','AC','BA','BB','BC','CA','CB','CC','CD','CE','CF'],
                    'general_mapping': ['A','A','A','B','B','B','C','C','C','C','C','C']})
I need to modify df1 using df2 so that it finally looks like:
df1 ->> ({'COL1': ['A','A','B','A','B','B','B','C','C','C','C']})
You can use pd.Series.map after setting Category as index using df.set_index.
df1['COL1'] = df1['COL1'].map(df2.set_index('Category')['general_mapping'])
df1
COL1
0 A
1 A
2 B
3 A
4 B
5 B
6 B
7 C
8 C
9 C
10 C
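If some values of COL1 had no match in df2, map would produce NaN for them. A common fallback pattern (a sketch, assuming you want to keep the original value in that case):

import pandas as pd

df1 = pd.DataFrame({'COL1': ['AA','AB','BC','AC','BA','BB','BB','CA','CB','CD','CE']})
df2 = pd.DataFrame({'Category': ['AA','AB','AC','BA','BB','BC','CA','CB','CC','CD','CE','CF'],
                    'general_mapping': ['A','A','A','B','B','B','C','C','C','C','C','C']})

mapping = df2.set_index('Category')['general_mapping']
# Unmatched keys become NaN; fall back to the original value.
df1['COL1'] = df1['COL1'].map(mapping).fillna(df1['COL1'])
print(df1)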
I have a dictionary:
d = {"A":1, "B":2, "C":3}
I also have a pandas dataframe:
col1
A
G
E
B
C
I'd like to create a new column by mapping the dictionary onto col1. Simultaneously I'd like to set the values in another column to indicate whether the value in that row has been mapped. The desired output would look like this:
col1 col2 col3
A 1 1
G NaN 0
E NaN 0
B 2 1
C 3 1
I know that col2 can be created using df.col1.map(d), but how can I simultaneously create col3?
You can create both columns in one assign call - the first by map and the second by isin for a boolean mask, cast to integer:
df = df.assign(col2=df.col1.map(d), col3=df.col1.isin(d.keys()).astype(int))
print (df)
col1 col2 col3
0 A 1.0 1
1 G NaN 0
2 E NaN 0
3 B 2.0 1
4 C 3.0 1
Another two-step solution with a different boolean mask - checking for non-missing values:
df['col2'] = df.col1.map(d)
df['col3'] = df['col2'].notnull().astype(int)
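An equivalent for col3 (my sketch, not part of the answer) builds the flag directly from the membership mask with numpy.where:

import numpy as np
import pandas as pd

d = {"A": 1, "B": 2, "C": 3}
df = pd.DataFrame({'col1': ['A', 'G', 'E', 'B', 'C']})

df['col2'] = df['col1'].map(d)
# np.where turns the boolean membership mask straight into 0/1 integers.
df['col3'] = np.where(df['col1'].isin(d.keys()), 1, 0)
print(df)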
I have two dataframes df1 and df2 that are defined like so:
df1 df2
Out[69]: Out[70]:
A B A B
0 2 a 0 5 q
1 1 s 1 6 w
2 3 d 2 3 e
3 4 f 3 1 r
My goal is to concatenate the dataframes by alternating the rows so that the resulting dataframe is like this:
dff
Out[71]:
A B
0 2 a <--- belongs to df1
0 5 q <--- belongs to df2
1 1 s <--- belongs to df1
1 6 w <--- belongs to df2
2 3 d <--- belongs to df1
2 3 e <--- belongs to df2
3 4 f <--- belongs to df1
3 1 r <--- belongs to df2
As you can see the first row of dff corresponds to the first row of df1 and the second row of dff is the first row of df2. The pattern repeats until the end.
I tried to reach my goal by using the following lines of code:
import pandas as pd
df1 = pd.DataFrame({'A':[2,1,3,4], 'B':['a','s','d','f']})
df2 = pd.DataFrame({'A':[5,6,3,1], 'B':['q','w','e','r']})
dfff = pd.DataFrame()
for i in range(0, 4):
    dfx = pd.concat([df1.iloc[i].T, df2.iloc[i].T])
    dfff = pd.concat([dfff, dfx])
However, this approach doesn't work because df1.iloc[i] and df2.iloc[i] are automatically reshaped into columns instead of rows, and I cannot revert the process (even by using .T).
Question: Can you please suggest me a nice and elegant way to reach my goal?
Optional: Can you also provide an explanation about how to convert a column back to row?
I'm unable to comment on the accepted answer, but note that the sort operation is unstable by default, so you must choose a stable sorting algorithm to preserve the interleaving:
pd.concat([df1, df2]).sort_index(kind='mergesort')
IIUC
In [64]: pd.concat([df1, df2]).sort_index()
Out[64]:
A B
0 2 a
0 5 q
1 1 s
1 6 w
2 3 d
2 3 e
3 4 f
3 1 r
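If the two frames have equal length, an index-free alternative (my sketch, not from the answers above) interleaves purely by position:

import numpy as np
import pandas as pd

df1 = pd.DataFrame({'A': [2, 1, 3, 4], 'B': ['a', 's', 'd', 'f']})
df2 = pd.DataFrame({'A': [5, 6, 3, 1], 'B': ['q', 'w', 'e', 'r']})

# Stack the two frames, then take rows in interleaved positional order:
# 0, 4, 1, 5, 2, 6, 3, 7 for two 4-row frames.
stacked = pd.concat([df1, df2], ignore_index=True)
n = len(df1)
order = np.ravel(np.column_stack([np.arange(n), np.arange(n) + n]))
print(stacked.iloc[order].reset_index(drop=True))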