I have a dictionary:
d = {"A":1, "B":2, "C":3}
I also have a pandas dataframe:
col1
A
G
E
B
C
I'd like to create a new column by mapping the dictionary onto col1. Simultaneously I'd like to set the values in another column to indicate whether the value in that row has been mapped. The desired output would look like this:
col1 col2 col3
A    1    1
G    NaN  0
E    NaN  0
B    2    1
C    3    1
I know that col2 can be created using df.col1.map(d), but how can I simultaneously create col3?
You can create both columns in one assign call: the first with map, and the second with isin, which produces a boolean mask that is cast to integers:
df = df.assign(col2=df.col1.map(d), col3=df.col1.isin(d.keys()).astype(int))
print(df)
  col1  col2  col3
0    A   1.0     1
1    G   NaN     0
2    E   NaN     0
3    B   2.0     1
4    C   3.0     1
Another, two-step solution uses a different boolean mask, checking for non-missing values:
df['col2'] = df.col1.map(d)
df['col3'] = df['col2'].notnull().astype(int)
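For reference, a minimal, self-contained sketch that reproduces the question's data and runs the one-step solution:
import pandas as pd

d = {"A": 1, "B": 2, "C": 3}
df = pd.DataFrame({"col1": ["A", "G", "E", "B", "C"]})

# map the dict, and flag mapped rows via membership in the dict's keys
df = df.assign(col2=df.col1.map(d), col3=df.col1.isin(d.keys()).astype(int))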
I imported an Excel file and now I need to multiply certain values, but if the value in the first column is NaN, Python should use another column for the calculation. I have the following code:
if pd['Column1'] == 'NaN':
    pd['Column2'] * pd['Column3']
else:
    pd['Column1'] * pd['Column3']
Thank you for your help.
You can use isna() together with any() or all(). Here is an example:
import pandas as pd
import numpy as np
# generating test data, assuming all the values in Col1 are NaN
df = pd.DataFrame({'Col1':[np.nan,np.nan,np.nan,np.nan], 'Col2':[1,2,3,4], 'Col3':[2,3,4,5]})
if df['Col1'].isna().all():  # you can also use any() instead of all()
    df['Col4'] = df['Col2'] * df['Col3']
else:
    df['Col4'] = df['Col1'] * df['Col3']
print(df)
Output:
   Col1  Col2  Col3  Col4
0   NaN     1     2     2
1   NaN     2     3     6
2   NaN     3     4    12
3   NaN     4     5    20
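If the decision is actually needed per row rather than once for the whole column (an assumption about the asker's intent), a sketch using np.where makes the choice element-wise:
import numpy as np
import pandas as pd

# hypothetical mixed column: some rows NaN, some not
df = pd.DataFrame({'Col1': [np.nan, 2, np.nan, 4],
                   'Col2': [1, 2, 3, 4],
                   'Col3': [2, 3, 4, 5]})

# per row: fall back to Col2 only where Col1 is NaN
df['Col4'] = np.where(df['Col1'].isna(),
                      df['Col2'] * df['Col3'],
                      df['Col1'] * df['Col3'])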
I have a data frame like this,
df
col1 col2 col3
1    ab   4
     hn
     pr
2    ff   3
3    ty   3
     rt
4    ym   6
Now I want to create one data frame from the above: whenever both the col1 and col3 values are empty (''), concatenate that row's col2 onto the row above where both col1 and col3 are present.
So the final data frame will look like,
df
col1 col2   col3
1    abhnpr 4
2    ff     3
3    tyrt   3
4    ym     6
I could do this using a for loop, comparing each row with the previous one, but the execution time would be high, so I am looking for a shortcut (a Pythonic way) to do the same task efficiently.
Replace the empty values with missing values and forward fill them, then aggregate with a string join via GroupBy.agg, and finally restore the column order with DataFrame.reindex:
import numpy as np

c = ['col1','col3']
df[c] = df[c].replace('', np.nan).ffill()
df = df.groupby(c)['col2'].agg(''.join).reset_index().reindex(df.columns, axis=1)
print(df)
  col1    col2 col3
0    1  abhnpr    4
1    2      ff    3
2    3    tyrt    3
3    4      ym    6
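To see what the forward fill does before the grouping, here is a self-contained sketch of the intermediate state (assuming the blanks in the question are empty strings):
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': ['1', '', '', '2', '3', '', '4'],
                   'col2': ['ab', 'hn', 'pr', 'ff', 'ty', 'rt', 'ym'],
                   'col3': ['4', '', '', '3', '3', '', '6']})

c = ['col1', 'col3']
# after replace + ffill, each blank row inherits the keys of the row above,
# so grouping by ['col1', 'col3'] collects exactly the rows to concatenate
df[c] = df[c].replace('', np.nan).ffill()
print(df)
#   col1 col2 col3
# 0    1   ab    4
# 1    1   hn    4
# 2    1   pr    4
# 3    2   ff    3
# 4    3   ty    3
# 5    3   rt    3
# 6    4   ym    6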
I need my results to have 1:1 cardinality, so I need to test whether a value in COL1 is matched with more than one value in COL2.
COL1 COL2
A    1
B    2
B    2
B    3
C    4
D    5
E    5
E    5
Using Python (preferably pandas, unless a better way exists), I want to see all rows where a value in COL1 has more than one match in COL2. In the example above, I want to know when COL1 = B has more than one match in COL2 (i.e. the cardinality is violated because COL1 = B matches/joins with COL2 = 2 and also 3).
If you just want the rows that violate this condition, use groupby and check with nunique:
df[df.groupby('COL1').COL2.transform('nunique') > 1]
Or, with groupby, nunique, and map:
df[df.COL1.map(df.groupby('COL1').COL2.nunique()) > 1]
  COL1  COL2
1    B     2
2    B     2
3    B     3
If you want a mapping of COL1 value to COL2 values, you can use an additional groupby and apply:
df[df.groupby('COL1').COL2.transform('nunique') > 1].groupby('COL1').COL2.apply(set)
COL1
B    {2, 3}
Name: COL2, dtype: object
And finally, if all you want is the "cardinality" of the COL1 values with more than one match, use
df.groupby('COL1').COL2.nunique().to_frame().query('COL2 > 1')
      COL2
COL1
B        2
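A self-contained sketch of the first approach, with the sample data reconstructed from the question:
import pandas as pd

df = pd.DataFrame({'COL1': ['A', 'B', 'B', 'B', 'C', 'D', 'E', 'E'],
                   'COL2': [1, 2, 2, 3, 4, 5, 5, 5]})

# transform('nunique') broadcasts each group's distinct COL2 count back to
# its rows, so the mask keeps every row whose COL1 value has > 1 match
print(df[df.groupby('COL1').COL2.transform('nunique') > 1])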
I have this dataframe
   col1 col2  col3
0     2    A     1
1     1    A   100
2     3    B    12
3     4    B     2
I want to select the row with the highest col1 value among all rows with A in col2, then the row with the highest col1 among all rows with B, etc. This is the desired output:
   col1 col2  col3
0     2    A     1
3     4    B     2
I know I need some kind of groupby('col2'), but I don't know what to use after that.
Is that what you want?
In [16]: df.groupby('col2').max().reset_index()
Out[16]:
  col2  col1
0    A     2
1    B     4
Use groupby('col2'), then idxmax to get the index of the max value within each group. Finally, use these index values to slice the original dataframe.
df.loc[df.groupby('col2').col1.idxmax()]
Notice that the index values of the original dataframe are preserved.
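A runnable sketch of the idxmax approach with the question's data, showing that whole rows (including col3) are kept intact:
import pandas as pd

df = pd.DataFrame({'col1': [2, 1, 3, 4],
                   'col2': ['A', 'A', 'B', 'B'],
                   'col3': [1, 100, 12, 2]})

# idxmax returns the original index label of each group's max col1,
# so .loc slices whole rows and col3 stays consistent with col1
print(df.loc[df.groupby('col2').col1.idxmax()])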
Is there a way to specify that the groupby() call should pass the group name to the apply() lambda function?
Similar to how, when I iterate through the groups, I can get the group key via the following tuple decomposition:
for group_name, subdf in temp_dataframe.groupby(level=0, axis=0):
    print(group_name)
...is there a way to also get the group name in the apply function, such as:
temp_dataframe.groupby(level=0, axis=0).apply(lambda group_name, subdf: foo(group_name, subdf))
How can I get the group name as an argument for the apply lambda function?
I think you should be able to use the name attribute:
temp_dataframe.groupby(level=0,axis=0).apply(lambda x: foo(x.name, x))
should work, example:
In [132]:
df = pd.DataFrame({'a':list('aabccc'), 'b':np.arange(6)})
df
Out[132]:
   a  b
0  a  0
1  a  1
2  b  2
3  c  3
4  c  4
5  c  5
In [134]:
df.groupby('a').apply(lambda x: print('name:', x.name, '\nsubdf:', x))
name: a
subdf:    a  b
0  a  0
1  a  1
name: b
subdf:    a  b
2  b  2
name: c
subdf:    a  b
3  c  3
4  c  4
5  c  5
Out[134]:
Empty DataFrame
Columns: []
Index: []
For those who came looking for an answer to the question "Including the group name in the transform function pandas python" and ended up in this thread, please read on.
Given the following input:
import numpy as np
import pandas as pd

df = pd.DataFrame(data={'col1': list('aabccc'),
                        'col2': np.arange(6),
                        'col3': np.arange(6)})
Data:
  col1  col2  col3
0    a     0     0
1    a     1     1
2    b     2     2
3    c     3     3
4    c     4     4
5    c     5     5
We can access the group name (which is visible from the scope of the calling apply function) like this:
df.groupby('col1') \
.apply(lambda frame: frame \
.transform(lambda col: col + 3 if frame.name == 'a' and col.name == 'col2' else col))
Output:
  col1  col2  col3
0    a     3     0
1    a     4     1
2    b     2     2
3    c     3     3
4    c     4     4
5    c     5     5
Note that the call to apply is needed in order to obtain a reference to the sub-DataFrame (i.e. frame), which holds the name attribute of the corresponding subgroup. The name attribute of the argument of transform (i.e. col) refers to the column/series name.
Alternatively, one could also loop over the groups and then, within each group, over the columns:
for grp_name, sub_df in df.groupby('col1'):
    for col in sub_df:
        if grp_name == 'a' and col == 'col2':
            df.loc[df.col1 == grp_name, col] = sub_df[col] + 3
My use case is quite rare and this was the only way to achieve my goal (as of pandas v0.24.2). However, I'd recommend exploring the pandas documentation thoroughly because there most likely is an easier vectorised solution to what you may need this construct for.
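For this particular toy example, the easier vectorised solution alluded to above would be a plain boolean-mask assignment:
# equivalent to the groupby/transform construct for this toy case:
# add 3 to col2 only in rows where col1 == 'a'
df.loc[df['col1'] == 'a', 'col2'] += 3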