Apply function to first element in a group by and then remerging - python

(Sorry about the title; I realise it isn't very descriptive.)
Given a data set such as the following:
word entity
0 Charlie 1
1 p. 1
2 Nelson 1
3 loves None
4 Dana 2
5 c. 2
6 anderson 2
7 and None
8 james 3
I want to apply a function (e.g. get_gender()) to the first element of each entity (I imagine this needs a groupby of some sort)
so as to get something like this:
word entity gender
0 Charlie 1 m
1 p. 1 None
2 Nelson 1 None
3 loves None None
4 Dana 2 f
5 c. 2 None
6 anderson 2 None
7 and None None
8 james 3 m
and lastly fill in the remaining rows of each entity to get:
word entity gender
0 Charlie 1 m
1 p. 1 m
2 Nelson 1 m
3 loves None None
4 Dana 2 f
5 c. 2 f
6 anderson 2 f
7 and None None
8 james 3 m
Here is some code for generating the above data frame
import pandas as pd
df = pd.DataFrame([("Charlie", "p.", "Nelson", "loves", "Dana", "c.", "anderson", "and", "james"), (1,1,1, None, 2,2,2, None, 3)]).transpose()
df.columns = ["word", "entity"]
The current 'solution' I am using is:
import gender_guesser.detector as gender
d = gender.Detector()
# Detect the gender of the names in word. However, this is applied to every word of an entity (including last names), so one entity can end up with multiple genders (depending on e.g. the middle name).
has_entity = df['entity'].notnull()
df.loc[has_entity, 'gender'] = df.loc[has_entity, 'word'].apply(lambda string: d.get_gender(string.lower().capitalize()))

Instead of relying on the position of the first element within each group, you can group by entity, select the non-None value from each group, and then merge the result with the original DataFrame.
df = pd.DataFrame([
("Charlie", "p.", "Nelson", "loves", "Dana", "c.", "anderson", "and", "james")
, (1,1,1, None, 2,2,2, None, 3)
, ('m', None, None, None, 'f', None, None, None, 'm')]).transpose()
df.columns = ["word", "entity", "gender"]
df_g = df.groupby('entity').agg({'gender': lambda x: max(filter(None, x))}).reset_index()
pd.merge(df, df_g, on='entity', suffixes=('_x', ''))[['word', 'entity', 'gender']]
But note that the rows whose entity is None are dropped by the groupby, and therefore disappear from the merged result as well.
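For the original question (apply a function to the first word of each entity, then spread the result over the whole entity), a minimal sketch could look like the following. The get_gender function here is a hypothetical stand-in backed by a small lookup table, used only so the example runs without gender_guesser; in practice it would be replaced by d.get_gender.

```python
import pandas as pd

df = pd.DataFrame({
    "word": ["Charlie", "p.", "Nelson", "loves", "Dana", "c.", "anderson", "and", "james"],
    "entity": [1, 1, 1, None, 2, 2, 2, None, 3],
})

def get_gender(name):
    # Hypothetical stand-in for d.get_gender(); the lookup table is made up.
    lookup = {"Charlie": "m", "Dana": "f", "James": "m"}
    return lookup.get(name.lower().capitalize())

# First word of each entity (rows with a missing entity are left out by groupby),
# then one gender per entity.
gender_per_entity = df.groupby("entity")["word"].first().map(get_gender)

# Map the per-entity result back onto every row; rows without an entity stay NaN.
df["gender"] = df["entity"].map(gender_per_entity)
print(df)
```

Because the result is mapped back through the entity column, the rows without an entity are kept (with a missing gender) rather than dropped.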

Related

Split text expanding rows in Pandas

I have this dataset:
mydf = pd.DataFrame({'source':['a','b','a','b'],
'text':['November rain','Sweet child omine','Paradise City','Patience']})
mydf
source text
0 a November rain
1 b Sweet child omine
2 a Paradise City
3 b Patience
And I want to split the text inside column text. This is the expected result:
source text
0 a November
1 a rain
2 b Sweet
3 b child
4 b omine
5 a Paradise
6 a City
7 b Patience
This is what I have tried:
mydf['text'] = mydf['text'].str.split(expand=True)
But it raises an error:
ValueError: Columns must be same length as key
What am I doing wrong? Is there a way to do this without creating an index?
str.split(expand=True) returns a dataframe, normally with more than one column, so you can't assign back to your original column:
# output of `str.split(expand=True)`
0 1 2
0 November rain None
1 Sweet child omine
2 Paradise City None
3 Patience None None
I think you mean:
# expand=False is default
mydf['text'] = mydf['text'].str.split()
mydf = mydf.explode('text')
You can also chain with assign:
mydf.assign(text=mydf['text'].str.split()).explode('text')
Output:
source text
0 a November
0 a rain
1 b Sweet
1 b child
1 b omine
2 a Paradise
2 a City
3 b Patience
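The output above keeps the original index labels, so they repeat (0, 0, 1, 1, 1, ...). If a fresh 0..7 index as in the expected result is wanted, one option is to reset the index after exploding. A small sketch using the question's own data:

```python
import pandas as pd

mydf = pd.DataFrame({'source': ['a', 'b', 'a', 'b'],
                     'text': ['November rain', 'Sweet child omine',
                              'Paradise City', 'Patience']})

# Split each string into a list of words, emit one row per word,
# then discard the repeated index labels so rows are numbered from 0.
result = (mydf.assign(text=mydf['text'].str.split())
              .explode('text')
              .reset_index(drop=True))
print(result)
```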

Get column name where value match with multiple condition python

I have been looking for a solution to my problem for an entire day and cannot find the answer. I'm trying to follow the example in this topic: Get column name where value is something in pandas dataframe
to make a version with multiple conditions.
I want to extract the column names (as a list) where:
value == 4 and/or value == 3
+
Only if there is no 4 and/or 3, then extract the column names where value == 2
Example:
data = {'Name': ['Tom', 'Joseph', 'Krish', 'John'], 'acne': [1, 4, 1, 2], 'wrinkles': [1, 3, 4, 4],'darkspot': [2, 2, 3, 4] }
df1 = pd.DataFrame(data)
df1
'''
Name acne wrinkles darkspot
0 Tom 1 1 2
1 Joseph 4 3 2
2 Krish 1 4 3
3 John 2 4 4
'''
The result I'm looking for:
df2
'''
Name acne wrinkles darkspot problem
0 Tom 1 1 2 [darkspot]
1 Joseph 4 3 2 [acne, wrinkles]
2 Krish 1 4 3 [wrinkles, darkspot]
3 John 2 4 4 [wrinkles, darkspot]
'''
I tried the apply function with a lambda as detailed in the topic I mentioned above, but it can only take one argument.
Many thanks for your answers if somebody can help me :)
You can use boolean mask:
problems = ['acne', 'wrinkles', 'darkspot']
m1 = df1[problems].isin([3, 4]) # main condition
m2 = df1[problems].eq(2) # fallback condition
mask = m1 | (m1.loc[~m1.any(axis=1)] | m2)
df1['problem'] = mask.mul(problems).apply(lambda x: [i for i in x if i], axis=1)
Output:
>>> df1
Name acne wrinkles darkspot problem
0 Tom 1 1 2 [darkspot]
1 Joseph 4 3 2 [acne, wrinkles]
2 Krish 1 4 3 [wrinkles, darkspot]
3 John 2 4 4 [wrinkles, darkspot]
You can use a Boolean mask to figure out which columns you need.
First check if any of the values are 3 or 4, and then if not, check if any of the values are 2. Form the composite mask (variable m below) with an | (or) between those two conditions.
Finally you can NaN the False values, that way when you stack and groupby.agg(list) you're left with just the column labels for the Trues.
cols = ['acne', 'wrinkles', 'darkspot']
m1 = df1[cols].isin([3, 4])
# If no `3` or `4` on the rows, check if there is a `2`
m2 = pd.DataFrame((~m1.any(axis=1)).to_numpy()[:, None] & df1[cols].eq(2).to_numpy(),
index=m1.index, columns=m1.columns)
m = (m1 | m2)
# acne wrinkles darkspot
#0 False False True
#1 True True False
#2 False True True
#3 False True True
# Assignment aligns on original DataFrame index, i.e. `'level_0'`
df1['problem'] = m.where(m).stack().reset_index().groupby('level_0')['level_1'].agg(list)
print(df1)
Name acne wrinkles darkspot problem
0 Tom 1 1 2 [darkspot]
1 Joseph 4 3 2 [acne, wrinkles]
2 Krish 1 4 3 [wrinkles, darkspot]
3 John 2 4 4 [wrinkles, darkspot]
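The same "main condition, else fallback" logic can also be sketched with plain NumPy: pick the 3/4 mask for rows that have any 3 or 4, and the 2 mask for the other rows. This is only an alternative illustration of the idea, not the method of either answer; the names main, fallback and chosen are made up for the example.

```python
import numpy as np
import pandas as pd

data = {'Name': ['Tom', 'Joseph', 'Krish', 'John'],
        'acne': [1, 4, 1, 2], 'wrinkles': [1, 3, 4, 4], 'darkspot': [2, 2, 3, 4]}
df1 = pd.DataFrame(data)
cols = ['acne', 'wrinkles', 'darkspot']

main = df1[cols].isin([3, 4]).to_numpy()   # main condition: value is 3 or 4
fallback = df1[cols].eq(2).to_numpy()      # fallback condition: value is 2

# Row by row: use the main mask if the row has any 3 or 4, else the fallback mask.
chosen = np.where(main.any(axis=1)[:, None], main, fallback)

# Turn each boolean row into the list of matching column names.
df1['problem'] = pd.Series(
    [[c for c, flag in zip(cols, row) if flag] for row in chosen],
    index=df1.index,
)
print(df1)
```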

How to record if a specific change occurred across columns using pandas?

Here is the code I used to create my dataframe:
data = [['Anna',1,1,2,2,3],['Bob',2,2,3,1,1],['Chloe',1,1,2,3,4],
['David',1,2,2,2,1]]
df = pd.DataFrame(data, columns = ['Name', 'A','B','C','D','E'])
I want to create a column which would state if a specific change occurred across the table. For example for this dataset I would like the column to express whether either the person went from '1 to 2 to 3' or '1 to 2 to 3 to 4'. So for this specific dataframe both Anna and Chloe would have an indicator in that column to convey that they went through these changes.
The expected outcome should have the following column to the dataframe:
df['Column'] = ['1-2-3','NA','1-2-3-4','NA']
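You can take the approach below (m is assumed here to be the numeric columns of df, since the snippet as posted does not define it):
import numpy as np
m = df.drop(columns='Name')  # assumed definition: the numeric columns only
cond=(~m.diff(axis=1).lt(0).any(axis=1))
df=df.assign(new_col=np.where(cond,
    m.apply(lambda x: '-'.join(map(str,(dict.fromkeys(x).keys()))),axis=1),'NA'))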
print(df)
Name A B C D E new_col
0 Anna 1 1 2 2 3 1-2-3
1 Bob 2 2 3 1 1 NA
2 Chloe 1 1 2 3 4 1-2-3-4
3 David 1 2 2 2 1 NA
Here you go:
def path(input):
    nodes = []
    for i in input:
        if len(nodes):
            if i < nodes[-1]:
                return np.nan
        if i not in nodes:
            nodes.append(i)
    return '-'.join(str(i) for i in nodes)
df['path'] = [path(row) for row in np.array(df[['A', 'B', 'C', 'D', 'E']])]
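Both answers report any non-decreasing path, so a row such as 2, 2, 3, 3, 3 would be labelled '2-3'. If only the two sequences named in the question should be flagged, one possible sketch is to collapse consecutive repeats and then keep the label only when it is in an allowed set. The names ALLOWED and collapse are introduced for this example.

```python
import pandas as pd

data = [['Anna', 1, 1, 2, 2, 3], ['Bob', 2, 2, 3, 1, 1],
        ['Chloe', 1, 1, 2, 3, 4], ['David', 1, 2, 2, 2, 1]]
df = pd.DataFrame(data, columns=['Name', 'A', 'B', 'C', 'D', 'E'])

# The only sequences the question asks to flag.
ALLOWED = {'1-2-3', '1-2-3-4'}

def collapse(values):
    """Join the values with '-' after dropping consecutive repeats."""
    out = []
    for v in values:
        if not out or v != out[-1]:
            out.append(v)
    return '-'.join(str(v) for v in out)

paths = df[['A', 'B', 'C', 'D', 'E']].apply(collapse, axis=1)

# Keep the label where it is one of the allowed sequences, otherwise 'NA'.
df['Column'] = paths.where(paths.isin(ALLOWED), 'NA')
print(df)
```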

Pandas sort multiple columns independently

I've been struggling to sort every column of my df. My code only seems to sort by the first column ('Name') and rearranges the rest of the columns along with it, as shown here:
Index Name Age Education Country
0 W 2 BS C
1 V 1 PhD F
2 R 9 MA A
3 A 8 MA A
4 D 7 PhD B
5 C 4 BS C
df.sort_values(by=['Name', 'Age', 'Education', 'Country'],ascending=[True,True, True, True])
Here's what I'm hoping to get:
Index Name Age Education Country
0 A 1 BS A
1 C 2 BS A
2 D 4 MA B
3 R 7 MA C
4 V 8 PhD C
5 W 9 PhD F
Instead, I'm getting the following:
Index Name Age Education Country
3 A 8 MA A
5 C 4 BS C
4 D 7 PhD B
2 R 9 MA A
1 V 1 PhD F
0 W 2 BS C
Could you please shed some light on this issue? Many thanks in advance.
Cheers,
R.
Your code is sorting by name, then age, then education, then country: the later columns are only used to break ties in the earlier ones, and whole rows are moved together.
To get what you want, you can sort each column on its own, one column at a time. For example,
for col in df.columns:
    df[col]=sorted(df[col])
But are you sure that's what you want to do? A DataFrame is designed so that each row corresponds to a single entry, e.g. a person, and the columns correspond to attributes like 'name' and 'age'. So you usually don't want to sort the names and ages separately, because each person's name and age would get mismatched.
You can use np.sort along the 0th axis:
df[:] = np.sort(df.values, axis=0)
df
Index Name Age Education Country
0 0 A 1 BS A
1 1 C 2 BS A
2 2 D 4 MA B
3 3 R 7 MA C
4 4 V 8 PhD C
5 5 W 9 PhD F
Of course, you should beware that sorting columns independently breaks the link between the values in each row and may render your data meaningless.
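Another way to sort each column independently, sketched here on the question's data, is to rebuild the frame column by column. Since every column is sorted on its own, each one keeps its dtype (Age stays an integer column). The same caveat applies: the rows no longer describe the same person.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['W', 'V', 'R', 'A', 'D', 'C'],
    'Age': [2, 1, 9, 8, 7, 4],
    'Education': ['BS', 'PhD', 'MA', 'MA', 'PhD', 'BS'],
    'Country': ['C', 'F', 'A', 'A', 'B', 'C'],
})

# Sort every column by itself and drop the original index labels,
# so the sorted values line up positionally from 0.
sorted_df = pd.DataFrame(
    {col: df[col].sort_values().to_numpy() for col in df.columns}
)
print(sorted_df)
```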

Extracting number from inside parentheses in column list

I have a pandas dataframe column of lists and want to extract numbers from list strings and add them to their own separate column.
Column A
0 [ FUNNY (1), CARING (1)]
1 [ Gives good feedback (17), Clear communicator (2)]
2 [ CARING (3), Gives good feedback (3)]
3 [ FUNNY (2), Clear communicator (1)]
4 []
5 []
6 [ CARING (1), Clear communicator (1)]
I would like the output to look as follows:
FUNNY CARING Gives good feedback Clear communicator
1 1 None None
None None 17 2
None 3 3 None
2 None None 1
None None None None
etc...
Let's use apply with pd.Series, then extract and reshape with set_index and unstack:
df['Column A'].apply(pd.Series).stack().str.extract(r'(\w+)\((\d+)', expand=True)\
.reset_index(1, drop=True).set_index(0, append=True)[1]\
.unstack(1)
Output:
0 Authentic Caring Classy Funny
0 1 3 None 2
1 2 None 1 2
Edit with new input data set:
df['Column A'].apply(pd.Series).stack().str.extract(r'(\w+).*\((\d+)', expand=True)\
.reset_index(1, drop=True)\
.set_index(0, append=True)[1]\
.unstack(1)
0 CARING Clear FUNNY Gives
0 1 None 1 None
1 None 2 None 17
2 3 None None 3
3 None 1 2 None
6 1 1 None None
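Note that the pattern above only keeps the first word of each label ("Gives", "Clear"). A sketch that captures the full label and keeps the empty rows, assuming each cell of Column A is a Python list of strings such as ' FUNNY (1)' (the question does not show the exact element type), might look like this:

```python
import pandas as pd

df = pd.DataFrame({'Column A': [
    [' FUNNY (1)', ' CARING (1)'],
    [' Gives good feedback (17)', ' Clear communicator (2)'],
    [' CARING (3)', ' Gives good feedback (3)'],
    [' FUNNY (2)', ' Clear communicator (1)'],
    [],
    [],
    [' CARING (1)', ' Clear communicator (1)'],
]})

# One row per list element; empty lists become NaN and are dropped here.
items = df['Column A'].explode().dropna()

# Capture everything before the parentheses as the label, and the digits inside as the count.
parts = items.str.extract(r'^\s*(?P<label>.+?)\s*\((?P<count>\d+)\)\s*$')
parts['count'] = parts['count'].astype(int)

# Pivot labels into columns, then restore the rows that had empty lists.
wide = (parts.set_index('label', append=True)['count']
             .unstack()
             .reindex(df.index))
print(wide)
```

This relies on each label appearing at most once per row; a repeated label in the same list would make the unstack step fail.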
