I have a pandas dataframe column of lists and want to extract the numbers from the list strings and add them to their own separate columns.
Column A
0 [ FUNNY (1), CARING (1)]
1 [ Gives good feedback (17), Clear communicator (2)]
2 [ CARING (3), Gives good feedback (3)]
3 [ FUNNY (2), Clear communicator (1)]
4 []
5 []
6 [ CARING (1), Clear communicator (1)]
I would like the output to look as follows:
FUNNY CARING Gives good feedback Clear communicator
1 1 None None
None None 17 2
None 3 3 None
2 None None 1
None None None None
etc...
Let's use apply with pd.Series, then extract and reshape with set_index and unstack:
df['Column A'].apply(pd.Series).stack().str.extract(r'(\w+)\((\d+)', expand=True)\
.reset_index(1, drop=True).set_index(0, append=True)[1]\
.unstack(1)
Output (with the question's original data set, before the edit below):
0 Authentic Caring Classy Funny
0 1 3 None 2
1 2 None 1 2
Edit with new input data set:
df['Column A'].apply(pd.Series).stack().str.extract(r'(\w+).*\((\d+)', expand=True)\
.reset_index(1, drop=True)\
.set_index(0, append=True)[1]\
.unstack(1)
0 CARING Clear FUNNY Gives
0 1 None 1 None
1 None 2 None 17
2 3 None None 3
3 None 1 2 None
6 1 1 None None
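On pandas 0.25 or newer, here is a sketch of the same reshape with explode; a lazy (.+?) group also keeps multi-word labels like Gives good feedback whole instead of truncating them to their first word as above. This is just a sketch, assuming 'Column A' holds actual lists of strings such as "FUNNY (1)":
parts = df['Column A'].explode().str.extract(r'\s*(.+?)\s*\((\d+)\)')
out = (parts.dropna()                      # items from empty lists become NaN
            .set_index(0, append=True)[1]  # index (row, label), values = counts
            .unstack()                     # labels become columns
            .reindex(df.index))            # restore the all-empty rows 4 and 5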
I have this dataset:
mydf = pd.DataFrame({'source':['a','b','a','b'],
'text':['November rain','Sweet child omine','Paradise City','Patience']})
mydf
source text
0 a November rain
1 b Sweet child omine
2 a Paradise City
3 b Patience
And I want to split the text inside column text. This is the expected result:
source text
0 a November
1 a rain
2 b Sweet
3 b child
4 b omine
5 a Paradise
6 a City
7 b Patience
This is what I have tried:
mydf['text'] = mydf['text'].str.split(expand=True)
But it returns me an error:
ValueError: Columns must be same length as key
What am I doing wrong? Is there a way to do this without creating an index?
str.split(expand=True) returns a dataframe, normally with more than one column, so you can't assign back to your original column:
# output of `str.split(expand=True)`
0 1 2
0 November rain None
1 Sweet child omine
2 Paradise City None
3 Patience None None
I think you mean:
# expand=False is default
mydf['text'] = mydf['text'].str.split()
mydf = mydf.explode('text')
You can also chain with assign:
mydf.assign(text=mydf['text'].str.split()).explode('text')
Output:
source text
0 a November
0 a rain
1 b Sweet
1 b child
1 b omine
2 a Paradise
2 a City
3 b Patience
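If you want the fresh 0..n index shown in the question's expected result, explode accepts ignore_index on pandas 1.1+ (on older versions, chain reset_index(drop=True) instead):
mydf.assign(text=mydf['text'].str.split()).explode('text', ignore_index=True)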
I load in data from a CSV, one of the columns has this format:
!Color1:Color2:Color3!
!White:Green:Black!
!Green:Blue:Yellow!!Red:Brown:Blue!!White:Green:Black!
!Green:Blue:Yellow!!White:Green:Black!
!Red:Brown:Blue!!White:Green:Black!
I want to discard all of the other columns, pick this one out, then split this one into this:
0 1 2
0 White:Green:Black None None
1 Green:Blue:Yellow Red:Brown:Blue White:Green:Black
2 Green:Blue:Yellow White:Green:Black None
3 Red:Brown:Blue White:Green:Black None
Below is how I tried to do it:
df = pd.read_csv(csv_path, index_col=False)
new_df = df['!Color1:Color2:Color3!'].str.split('!', expand=True)
But it ends up like this:
0 1 2 3 4 5
0 None White:Green:Black None None None None
1 None Green:Blue:Yellow None Red:Brown:Blue None White:Green:Black
2 None Green:Blue:Yellow None White:Green:Black None None
3 None Red:Brown:Blue None White:Green:Black None None
So the leading "!" produces an empty field of its own, and the "!!" between the "parts" produces empty fields as well.
Bonus question:
After that is achieved, how do I pick out the middle color in each column, like this?:
0 1 2
0 Green None None
1 Blue Brown Green
2 Blue Green None
3 Brown Green None
Add Series.str.strip to avoid the first and last columns being filled with empty strings, and split on the regex !{1,} to handle one or more consecutive !:
new_df = df['!Color1:Color2:Color3!'].str.strip('!').str.split('!{1,}', expand=True)
print (new_df)
0 1 2
0 White:Green:Black None None
1 Green:Blue:Yellow Red:Brown:Blue White:Green:Black
2 Green:Blue:Yellow White:Green:Black None
3 Red:Brown:Blue White:Green:Black None
Also, if you need the second value from each part after splitting by :, use a custom lambda function:
new_df = (df['!Color1:Color2:Color3!']
.str.strip('!')
.str.split('!{1,}', expand=True)
.apply(lambda x: x.str.split(':').str[1]))
print (new_df)
0 1 2
0 Green None None
1 Blue Brown Green
2 Blue Green None
3 Brown Green None
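As a side note, the split pattern can equivalently be written !+, and on pandas 1.4+ you can pass regex=True to make the regex interpretation explicit (a sketch, same behavior as above):
new_df = df['!Color1:Color2:Color3!'].str.strip('!').str.split('!+', expand=True, regex=True)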
(sorry about the title I realise it isn't very descriptive)
Given a data set such of the following:
word entity
0 Charlie 1
1 p. 1
2 Nelson 1
3 loves None
4 Dana 2
5 c. 2
6 anderson 2
7 and None
8 james 3
I want to apply a function (e.g. get_gender()) to the first element of each entity (I imagine using a groupby of some sort) so as to get something like this:
word entity gender
0 Charlie 1 m
1 p. 1 None
2 Nelson 1 None
3 loves None None
4 Dana 2 f
5 c. 2 None
6 anderson 2 None
7 and None None
8 james 3 m
and lastly populate the missing rows of each entity to get
word entity gender
0 Charlie 1 m
1 p. 1 m
2 Nelson 1 m
3 loves None None
4 Dana 2 f
5 c. 2 f
6 anderson 2 f
7 and None None
8 james 3 m
Here is some code for generating the above data frame
import pandas as pd
df = pd.DataFrame([("Charlie", "p.", "Nelson", "loves", "Dana", "c.", "anderson", "and", "james"), (1,1,1, None, 2,2,2, None, 3)]).transpose()
df.columns = ["word", "entity"]
The current 'solution' I am using is:
import gender_guesser.detector as gender
d = gender.Detector()
# Detect the gender of the names in word. The problem: this applies to every
# row of an entity (including last names), so a single entity can end up with
# multiple genders (depending on e.g. a middle name).
mask = df['entity'].notnull()
df.loc[mask, 'gender'] = df.loc[mask, 'word'].apply(lambda string: d.get_gender(string.lower().capitalize()))
Once the first element of each entity has its gender filled in (as in your current attempt), you can group by entity, select the non-None value from each group, and then join with the original DataFrame.
df = pd.DataFrame([
("Charlie", "p.", "Nelson", "loves", "Dana", "c.", "anderson", "and", "james")
, (1,1,1, None, 2,2,2, None, 3)
, ('m', None, None, None, 'f', None, None, None, 'm')]).transpose()
df.columns = ["word", "entity", "gender"]
df_g = df.groupby('entity').agg({'gender': lambda x: max(filter(None, x))}).reset_index()
pd.merge(df, df_g, on='entity', suffixes=('_x', ''))[['word', 'entity', 'gender']]
But notice that after the groupby and merge, the rows whose entity is None disappear.
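For completeness, here is a sketch that computes the gender itself and spreads it, starting from the question's original df (word and entity only). It assumes rows keep their original order within each entity and that gender_guesser is installed; note that get_gender returns labels like 'male' rather than 'm', and that rows whose entity is None keep gender None:
import gender_guesser.detector as gender

d = gender.Detector()

# Mark the first row of each entity; cumcount() numbers the rows within each group.
first = df['entity'].notnull() & (df.groupby('entity').cumcount() == 0)

df.loc[first, 'gender'] = df.loc[first, 'word'].apply(
    lambda s: d.get_gender(s.lower().capitalize()))

# Spread each entity's gender to its remaining rows;
# rows whose entity is None stay None.
df['gender'] = df.groupby('entity')['gender'].transform('first')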
I have a pandas data frame in the form of:
user_id referral_code referred_by
1 A None
2 B A
3 C B
5 None None
6 E B
7 None None
....
What I want to do is create another column, weights, for each user id, containing the total number of referrals the user has made to others plus the number of times the user was referred. That is, I have to check how often a user's referral_code appears in the referred_by column, count that frequency, and also add 1 if the referred_by column has an entry for that user.
Expected output is:
user_id referral_code referred_by weights
1 A None 1
2 B A 3
3 C B 1
5 None None None
6 E B 1
7 None None None
The approaches I have tried used df.groupby along with size and count, but nothing gives the expected output.
You want to build a new conditional column. If the conditions are simple enough, you can do it with np.where; I suggest you have a look at this post.
Here the conditions are quite complex; there may be a solution with np.where, but it is not an obvious one. In this case, you can use the apply method, which lets you write conditions as complex as you want. Keep in mind that apply is less efficient than np.where, since it runs a Python function per row; whether that matters depends on your dataset and the complexity of your conditions.
Here an example with apply:
df = pd.DataFrame(
[[1, "A" , None],
[2 , "B" , "A"],
[3 , "C" , "B"],
[5 , None, None],
[6 , "E" , "B"],
[7 , None , None]],
columns = 'user_id referral_code referred_by'.split(' ')
)
print(df)
# user_id referral_code referred_by
# 0 1 A None
# 1 2 B A
# 2 3 C B
# 3 5 None None
# 4 6 E B
# 5 7 None None
weight_refered_by = df.referred_by.value_counts()
print(weight_refered_by)
# B 2
# A 1
def countWeight(row):
    count = 0
    if row['referral_code'] in weight_refered_by.index:
        count = weight_refered_by[row.referral_code]
    if row["referred_by"] is not None:
        count += 1
    # If referral_code is None the result is None:
    # a user without a referral_code gets no weight at all.
    if row["referral_code"] is None:
        count = None
    return count
df["weights"] = df.apply(countWeight, axis=1)
print(df)
# user_id referral_code referred_by weights
# 0 1 A None 1.0
# 1 2 B A 3.0
# 2 3 C B 1.0
# 3 5 None None NaN
# 4 6 E B 1.0
# 5 7 None None NaN
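For reference, the same weights can be computed without apply, using map and value_counts (a vectorized sketch, assuming the df built above):
counts = df['referred_by'].value_counts()
df['weights'] = (df['referral_code'].map(counts).fillna(0)  # times this user's code was used
                 + df['referred_by'].notnull().astype(int)  # +1 if the user was referred
                ).where(df['referral_code'].notnull())      # None when the user has no code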
Hope that helps!
What you can do is use weights = df.referred_by.value_counts()['myword'] + 1 and then add it to your df in the weights column!
I need to find a word in any circumstance where it would appear in a line.
I would need to find Apple (case insensitive) in:
to much applesauce
apple_computers
*applesdf
or any other way that I may come across the word apple.
What I have so far:
(?i)^.*?(apple).*?
Update: I'm trying to accomplish this in Python pandas: for a particular column, keep only the rows whose value contains the word apple anywhere within it.
If I had this data frame:
A B C D E F
0 1 no apple 1 3 test foo
1 1 retrain 1 3 train foo
2 1 applesfas 1 3 test foo
3 1 fit 1 3 train foo
I would get back something like this:
A B C D E F
0 1 no apple 1 3 test foo
2 1 applesfas 1 3 test foo
For the filtering I know I would use something like this:
appleFilter = data['B'].str.contains('\bApple\b')
str.contains has a case flag, to be case insensitive:
In [11]: df["B"].str.contains("apple", case=False)
Out[11]:
0 True
1 False
2 True
3 False
Name: B, dtype: bool
In [12]: df[df["B"].str.contains("apple", case=False)]
Out[12]:
A B C D E F
0 1 no apple 1 3 test foo
2 1 applesfas 1 3 test foo
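Two side notes on the attempt in the question: '\bApple\b' needs a raw string (otherwise \b is a backspace character), and the word boundaries would exclude applesfas anyway, so a plain substring match is what you want here. Also, if the column can contain NaN, pass na=False to keep the mask boolean; an inline (?i) flag is an alternative to case=False (a sketch):
In [13]: df[df["B"].str.contains("apple", case=False, na=False)]  # na=False guards NaN cells

In [14]: df[df["B"].str.contains("(?i)apple")]  # inline flag, same result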