I load in data from a CSV, one of the columns has this format:
!Color1:Color2:Color3!
!White:Green:Black!
!Green:Blue:Yellow!!Red:Brown:Blue!!White:Green:Black!
!Green:Blue:Yellow!!White:Green:Black!
!Red:Brown:Blue!!White:Green:Black!
I want to discard all of the other columns, keep this one, and then split it into this:
0 1 2
0 White:Green:Black None None
1 Green:Blue:Yellow Red:Brown:Blue White:Green:Black
2 Green:Blue:Yellow White:Green:Black None
3 Red:Brown:Blue White:Green:Black None
Below is how I tried to do it:
df = pd.read_csv(csv_path, index_col=False)
new_df = df['!Color1:Color2:Color3!'].str.split('!', expand=True)
But it ends up like this:
0 1 2 3 4 5
0 None White:Green:Black None None None None
1 None Green:Blue:Yellow None Red:Brown:Blue None White:Green:Black
2 None Green:Blue:Yellow None White:Green:Black None None
3 None Red:Brown:Blue None White:Green:Black None None
So the split treats the text before the first "!" as an (empty) field of its own, and the doubled "!!" separators produce empty fields between the "parts".
Bonus question:
After that is achieved, how do I pick out the middle color in each column, like this?:
0 1 2
0 Green None None
1 Blue Brown Green
2 Blue Green None
3 Brown Green None
Add Series.str.strip to remove the leading and trailing !, so the first and last columns are not filled with empty strings, and split on the regex !{1,} to treat one or more consecutive ! as a single separator:
new_df = df['!Color1:Color2:Color3!'].str.strip('!').str.split('!{1,}', expand=True)
print (new_df)
0 1 2
0 White:Green:Black None None
1 Green:Blue:Yellow Red:Brown:Blue White:Green:Black
2 Green:Blue:Yellow White:Green:Black None
3 Red:Brown:Blue White:Green:Black None
If you also need the second of the :-separated values in each cell, apply a custom lambda to each column:
new_df = (df['!Color1:Color2:Color3!']
.str.strip('!')
.str.split('!{1,}', expand=True)
.apply(lambda x: x.str.split(':').str[1]))
print (new_df)
0 1 2
0 Green None None
1 Blue Brown Green
2 Blue Green None
3 Brown Green None
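As a one-pass variant for the bonus question (just a sketch, assuming the same column layout), Series.str.extractall can capture the middle token of every !-delimited triple directly and pivot the matches into columns:
# Group 1 of the pattern is the middle color of each Color:Color:Color run.
mid = (df['!Color1:Color2:Color3!']
       .str.extractall(r'!?([^:!]+):([^:!]+):([^:!]+)!')[1]
       .unstack('match'))
print(mid)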
I have searched multiple threads and cannot seem to figure this out. Any help would be appreciated. Suppose I have the following data set (I have simplified it for the sake of this question)
I want to group together all rows that contain the same value in COL1 then search in COL2 for the string "red" for those specific rows. If at least one of the rows in that group contains "red", then I want to keep all of those rows. Thus, for this dataset, the output should look like this:
Any help would be greatly appreciated. I am working in python. Thank you!
You can select the col1 values whose group contains 'red' and filter with isin:
df[df['col1'].isin(df[df['col2'] == 'red']['col1'])]
col1 col2
0 1 red
1 1 yellow
2 1 green
7 3 red
8 3 pink
9 3 green
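Reading it inside out, the same logic in steps:
red_rows = df[df['col2'] == 'red']        # rows where col2 is 'red'
keep = df['col1'].isin(red_rows['col1'])  # True for rows whose col1 group has a red
print(df[keep])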
Do you mean that 'red' appears with values 1 and 3 in Col1, so you would like to keep all rows whose Col1 is 1 or 3? You can try this:
df[df['Col1'].isin(df['Col1'][df['Col2']=='red'])]
To explain, I'm building the list of Col1 values whose rows contain 'red', then using it to extract the relevant rows:
red_groups = df['Col1'][df['Col2']=='red']  # here: [1, 3]
df1 = df[df['Col1'].isin(red_groups)]
print(df1)
Output
Col1 Col2
0 1 red
1 1 yellow
2 1 green
6 3 red
7 3 pink
8 3 green
Use groupby and check whether any row's COL2 value within each group is "red":
df[df.groupby("COL1").COL2.transform(lambda x: x.eq("red").any())]
Output
COL1 COL2
0 1 red
1 1 yellow
2 1 green
7 3 red
8 3 pink
9 3 green
Explanation
mask = df.groupby("COL1").COL2.transform(lambda x: x.eq("red").any())
mask is True for every row whose COL1 group contains at least one "red" in COL2:
0 True
1 True
2 True
3 False
4 False
5 False
6 False
7 True
8 True
9 True
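For reference, here is a minimal reproduction; the four COL1 == 2 rows are assumed filler, since the question's sample data was not shown:
import pandas as pd
df = pd.DataFrame({
    "COL1": [1, 1, 1, 2, 2, 2, 2, 3, 3, 3],
    "COL2": ["red", "yellow", "green",
             "blue", "white", "black", "orange",  # assumed rows without "red"
             "red", "pink", "green"],
})
mask = df.groupby("COL1").COL2.transform(lambda x: x.eq("red").any())
print(df[mask])  # keeps rows 0-2 and 7-9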
I have a series that looks like this:
index
1 [{'id': 1, 'primary': True, 'source': None}, {'id': 2, 'primary': False, 'source': 'email'}]
2 [{'id': 2234, 'primary': True, 'source': None}, {'id': 234, 'primary': False, 'source': 'email'}]
3 [{'id': 32, 'primary': False, 'source': None}]
I want this to be a dataframe that looks like this:
index id primary source
1 1 True None
1 2 False email
2 2234 True None
2 234 False email
3 32 False None
I tried running this:
df_phone_numbers = df_phone_numbers.drop("phone_numbers", axis =1).join(pd.DataFrame(df_phone_numbers["phone_numbers"].to_dict()).T)
But I get an error message "All arrays must be of the same length"
Any advice?
Try converting the exploded series:
k = s.explode()
pd.DataFrame(k.tolist(), k.index)
Output:
id primary source
1 1 True None
1 2 False email
2 2234 True None
2 234 False email
3 32 False None
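For reference, a self-contained version; the series construction below is an assumption based on the sample shown:
import pandas as pd
s = pd.Series(
    [[{'id': 1, 'primary': True, 'source': None},
      {'id': 2, 'primary': False, 'source': 'email'}],
     [{'id': 2234, 'primary': True, 'source': None},
      {'id': 234, 'primary': False, 'source': 'email'}],
     [{'id': 32, 'primary': False, 'source': None}]],
    index=[1, 2, 3])
k = s.explode()  # one dict per row, index labels repeated
print(pd.DataFrame(k.tolist(), k.index))  # each dict becomes a row of columns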
Given a data set such as the following:
word entity
0 Charlie 1
1 p. 1
2 Nelson 1
3 loves None
4 Dana 2
5 c. 2
6 anderson 2
7 and None
8 james 3
I want to apply a function (e.g. get_gender()) to the first element of each entity (I imagine using a groupby of some sort)
so as to get something like this:
word entity gender
0 Charlie 1 m
1 p. 1 None
2 Nelson 1 None
3 loves None None
4 Dana 2 f
5 c. 2 None
6 anderson 2 None
7 and None None
8 james 3 m
and lastly fill in the missing rows of each entity to get:
word entity gender
0 Charlie 1 m
1 p. 1 m
2 Nelson 1 m
3 loves None None
4 Dana 2 f
5 c. 2 f
6 anderson 2 f
7 and None None
8 james 3 m
Here is some code for generating the above data frame
import pandas as pd
df = pd.DataFrame([("Charlie", "p.", "Nelson", "loves", "Dana", "c.", "anderson", "and", "james"), (1,1,1, None, 2,2,2, None, 3)]).transpose()
df.columns = ["word", "entity"]
The current 'solution' I am using is:
import gender_guesser.detector as gender
d = gender.Detector()
# Detect the gender of the names in word. However, this runs on every row of an entity
# (including last names), so one entity can end up with multiple genders (depending on e.g. a middle name).
df.loc[df['entity'].notnull(), 'gender'] = df.loc[df['entity'].notnull(), 'word'].apply(lambda string: d.get_gender(string.lower().capitalize()))
A groupby aggregation collapses each group to a single row, so here you can group by entity, pick the non-None gender out of each group, and then join the result back onto the original DataFrame:
df = pd.DataFrame([
("Charlie", "p.", "Nelson", "loves", "Dana", "c.", "anderson", "and", "james")
, (1,1,1, None, 2,2,2, None, 3)
, ('m', None, None, None, 'f', None, None, None, 'm')]).transpose()
df.columns = ["word", "entity", "gender"]
df_g = df.groupby('entity').agg({'gender': lambda x: max(filter(None, x))}).reset_index()
pd.merge(df, df_g, on='entity', suffixes=('_x', ''))[['word', 'entity', 'gender']]
But notice that the rows whose entity is None are dropped by the groupby/merge.
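As a variant (a sketch, assuming gender_guesser is installed and that the first row of each entity holds the given name), you can run get_gender on just each entity's first word and broadcast the result with map:
import gender_guesser.detector as gender
d = gender.Detector()
# First word of each entity, in order of appearance (rows with entity None are excluded).
first_words = df.dropna(subset=['entity']).groupby('entity')['word'].first()
genders = first_words.apply(lambda w: d.get_gender(w.lower().capitalize()))
# Broadcast each entity's gender to all of its rows; entity None stays None.
df['gender'] = df['entity'].map(genders)
Note that get_gender returns labels like 'male'/'female'/'unknown' rather than the 'm'/'f' shorthand in the example above.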
I have a pandas data frame in the form of:
user_id referral_code referred_by
1 A None
2 B A
3 C B
5 None None
6 E B
7 None none
....
What I want to do is create another column, weights, for each user id, containing the total number of referrals the user has made plus 1 if the user was themselves referred. That is, I check whether the referral_code of a user id is present in the referred_by column and count its frequency there, and also add 1 if the user's own referred_by column has an entry.
Expected output is:
user_id referral_code referred_by weights
1 A None 1
2 B A 3
3 C B 1
5 None None None
6 E B 1
7 None none none
The approaches I have tried use df.groupby along with size and count, but nothing gives the expected output.
You want to build a new conditional column. If the conditions are simple enough, you can do it with np.where. I suggest you have a look at this post.
Here it's quite complex; there should be a solution with np.where, but it is not obvious. In this case you can use the apply method, which gives you the opportunity to write conditions as complex as you want. Using apply is less efficient than np.where, since each row goes through a Python-level function call; which one to prefer depends on your dataset and the complexity of your conditions.
Here is an example with apply:
df = pd.DataFrame(
[[1, "A" , None],
[2 , "B" , "A"],
[3 , "C" , "B"],
[5 , None, None],
[6 , "E" , "B"],
[7 , None , None]],
columns = 'user_id referral_code referred_by'.split(' ')
)
print(df)
# user_id referral_code referred_by
# 0 1 A None
# 1 2 B A
# 2 3 C B
# 3 5 None None
# 4 6 E B
# 5 7 None None
weight_refered_by = df.referred_by.value_counts()
print(weight_refered_by)
# B 2
# A 1
def countWeight(row):
    count = 0
    if row['referral_code'] in weight_refered_by.index:
        count = weight_refered_by[row['referral_code']]
    if row['referred_by'] is not None:
        count += 1
    # If referral_code is None, the result is None,
    # because referred_by is included in referral_code
    if row['referral_code'] is None:
        count = None
    return count
df["weights"] = df.apply(countWeight, axis=1)
print(df)
# user_id referral_code referred_by weights
# 0 1 A None 1.0
# 1 2 B A 3.0
# 2 3 C B 1.0
# 3 5 None None NaN
# 4 6 E B 1.0
# 5 7 None None NaN
Hope that helps!
What you can also do is use weights = df.referred_by.value_counts()['myword'] + 1 and then add it to your df in the weights column!
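Building on that idea, a vectorized sketch (assuming the same sample df as above): map the value_counts of referred_by onto referral_code and add 1 wherever the user was referred:
counts = df['referred_by'].value_counts()  # how often each code was used by others
df['weights'] = (df['referral_code'].map(counts).fillna(0)  # referrals made
                 + df['referred_by'].notna().astype(int))   # +1 if the user was referred
df.loc[df['referral_code'].isna(), 'weights'] = None  # no code -> no weight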
I have a pandas dataframe column of lists and want to extract numbers from list strings and add them to their own separate column.
Column A
0 [ FUNNY (1), CARING (1)]
1 [ Gives good feedback (17), Clear communicator (2)]
2 [ CARING (3), Gives good feedback (3)]
3 [ FUNNY (2), Clear communicator (1)]
4 []
5 []
6 [ CARING (1), Clear communicator (1)]
I would like the output to look as follows:
FUNNY CARING Gives good feedback Clear communicator
1 1 None None
None None 17 2
None 3 3 None
2 None None 1
None None None None
etc...
Let's use apply with pd.Series, then stack, extract the labels and counts, and reshape with set_index and unstack:
df['Column A'].apply(pd.Series).stack().str.extract(r'(\w+)\((\d+)', expand=True)\
.reset_index(1, drop=True).set_index(0, append=True)[1]\
.unstack(1)
Output (this was based on an earlier revision of the question's sample data):
0 Authentic Caring Classy Funny
0 1 3 None 2
1 2 None 1 2
Edit, with the new input data set:
df['Column A'].apply(pd.Series).stack().str.extract(r'(\w+).*\((\d+)', expand=True)\
.reset_index(1, drop=True)\
.set_index(0, append=True)[1]\
.unstack(1)
0 CARING Clear FUNNY Gives
0 1 None 1 None
1 None 2 None 17
2 3 None None 3
3 None 1 2 None
6 1 1 None None
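As a further variant (a sketch; it assumes each cell of Column A is an actual list of strings), you can keep full labels such as "Gives good feedback" instead of only their first word by using explode and a wider capture group:
k = df['Column A'].explode()  # one label per row; empty lists become NaN
parts = k.str.extract(r'\s*([^(]+?)\s*\((\d+)\)').dropna()
print(parts.set_index(0, append=True)[1].unstack(1))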