I have a dataframe df with two columns called 'MovieName' and 'Actors'. It looks like:
MovieName Actors
lights out Maria Bello
legend Tom Hardy*Emily Browning*Christopher Eccleston*David Thewlis
Please note that different actor names are separated by '*'. I have another csv file called gender.csv which has the gender of all actors based on their first names. gender.csv looks like -
ActorName Gender
Tom male
Emily female
Christopher male
I want to add two columns in my dataframe 'female_actors' and 'male_actors' which contains the count of female and male actors in that particular movie respectively.
How do I achieve this task using both df and gender.csv in pandas?
Please note that -
If a particular name isn't present in gender.csv, don't count it in the total.
If there is just one actor in a movie, and it isn't present in gender.csv, then its count should be zero.
Result of above example should be -
MovieName Actors male_actors female_actors
lights out Maria Bello 0 0
legend Tom Hardy*Emily Browning*Christopher Eccleston*David Thewlis 2 1
import pandas as pd
df1 = pd.DataFrame({'MovieName': ['lights out', 'legend'], 'Actors':['Maria Bello', 'Tom Hardy*Emily Browning*Christopher Eccleston*David Thewlis']})
df2 = pd.DataFrame({'ActorName': ['Tom', 'Emily', 'Christopher'], 'Gender':['male', 'female', 'male']})
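def func(actors, gender):
    # Take the first name of each '*'-separated actor.
    actors = [act.split()[0] for act in actors.split('*')]
    # Count the gender.csv rows that match both the requested gender and one of those first names.
    n_gender = ((df2.Gender == gender) & df2.ActorName.isin(actors)).sum()
    return n_gender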
df1['male_actors'] = df1.Actors.apply(lambda x: func(x, 'male'))
df1['female_actors'] = df1.Actors.apply(lambda x: func(x, 'female'))
df1.to_csv('res.csv', index=False)
print(df1)
Output
Actors,MovieName,male_actors,female_actors
Maria Bello,lights out,0,0
Tom Hardy*Emily Browning*Christopher Eccleston*David Thewlis,legend,2,1
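As an alternative to calling a function once per row and per gender, the same counts can be sketched with explode and map. This is only a sketch under the assumptions of the example above (names separated by '*', gender looked up by first name, unique first names in gender.csv); the variable names gender_by_name, first_names and genders are made up for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({
    'MovieName': ['lights out', 'legend'],
    'Actors': ['Maria Bello',
               'Tom Hardy*Emily Browning*Christopher Eccleston*David Thewlis'],
})
df2 = pd.DataFrame({'ActorName': ['Tom', 'Emily', 'Christopher'],
                    'Gender': ['male', 'female', 'male']})

# Lookup table: first name -> gender (assumes first names are unique in df2).
gender_by_name = df2.set_index('ActorName')['Gender']

# One row per actor (the movie's index label is repeated), reduced to the first name.
first_names = df1['Actors'].str.split('*').explode().str.split().str[0]

# Names missing from the lookup map to NaN and therefore match neither gender.
genders = first_names.map(gender_by_name)

# Count matches per original row; the result aligns back to df1 by index.
df1['male_actors'] = genders.eq('male').groupby(level=0).sum()
df1['female_actors'] = genders.eq('female').groupby(level=0).sum()

print(df1)
```

Note one difference from the isin-based function: if a movie lists two actors with the same first name, this version counts both of them, whereas isin counts the matching gender.csv row only once.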
I have two dataframes with actor names (their types are object) that look like the following:
df = pd.DataFrame({'Actors': ['Christian Bale', 'Ben Kingsley', 'Halley Bailey', 'Aaron Paul', ...]})
df2 = pd.DataFrame({'id': ['Halley Bailey - 1998', 'Coco Jones – 1998', ...]})
Normally I would use the following code to check whether an item is present in another dataframe, but because of the numbers in the second dataframe I get 0 matches. Is there a smart way around this?
df.assign(indf=df.Actors.isin(df_actor_list.id).astype(int))
Obviously, the code above did not work.
You can extract the actor names from df2['id'] and check if df['Actors'] is in it:
df.assign(indf=df['Actors'].isin(df2['id'].str.extract(r'(.*)(?=\s[-–])', expand=False)).astype(int))
output:
Actors indf
0 Christian Bale 0
1 Ben Kingsley 0
2 Halley Bailey 1
3 Aaron Paul 0
Another, more generic, approach relying on a regex:
import re
regex = '|'.join(map(re.escape, df['Actors']))
# 'Christian\\ Bale|Ben\\ Kingsley|Halley\\ Bailey|Aaron\\ Paul'
actors = df2['id'].str.extract(f'({regex})', expand=False).dropna()
df.assign(indf=df['Actors'].isin(actors).astype(int))
used inputs:
df = pd.DataFrame({'Actors': ['Christian Bale', 'Ben Kingsley', 'Halley Bailey', 'Aaron Paul']})
df2 = pd.DataFrame({'id': ['Halley Bailey - 1998', 'Coco Jones – 1998']})
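If the suffix always has the shape "separator, then a four-digit year" as in the sample ids, another option is to strip it before comparing. The sketch below assumes exactly that shape (a hyphen or en dash followed by four digits at the end of the string); clean_names and result are illustrative names, not part of the question.

```python
import pandas as pd

df = pd.DataFrame({'Actors': ['Christian Bale', 'Ben Kingsley', 'Halley Bailey', 'Aaron Paul']})
df2 = pd.DataFrame({'id': ['Halley Bailey - 1998', 'Coco Jones – 1998']})

# Remove a trailing " - 1998" / " – 1998" style suffix (assumed format), then trim whitespace.
clean_names = (
    df2['id']
    .str.replace(r'\s*[-–]\s*\d{4}$', '', regex=True)
    .str.strip()
)

# Flag actors whose name appears among the cleaned ids.
result = df.assign(indf=df['Actors'].isin(clean_names).astype(int))
print(result)
```

Ids that do not follow the assumed format are left unchanged, so they simply will not match unless they already equal an actor name.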
I have the following input file in csv:
INPUT
ID,GroupID,Person,Parent
ID_001,A001,John Doe,Yes
ID_002,A001,Mary Jane,No
ID_003,A001,James Smith;John Doe,Yes
ID_004,B003,Nathan Drake,Yes
ID_005,B003,Troy Baker,No
The desired output is the following:
DESIRED OUTPUT
ID,GroupID,Person
ID_001,A001,John Doe;Mary Jane;James Smith
ID_003,A001,John Doe;Mary Jane;James Smith
ID_004,B003,Nathan Drake;Troy Baker
Basically, I want to group by the same GroupID and then concatenate all the values present in Person column that belong to that group. Then, in my output, for each group I want to return the ID(s) of those rows where the Parent column is "Yes", the GroupID, and the concatenated person values for each group.
I am able to concatenate all person values for a particular group and remove any duplicate values from the person column in my output. Here is what I have so far:
import pandas as pd
inputcsv = path to the input csv file
outputcsv = path to the output csv file
colnames = ['ID', 'GroupID', 'Person', 'Parent']
df1 = pd.read_csv(inputcsv, names = colnames, header = None, skiprows = 1)
#First I do a groupby on GroupID, concatenate the values in the Person column, and finally remove the duplicate person values from the output before saving the df to a csv.
df2 = df1.groupby('GroupID')['Person'].apply(';'.join).str.split(';').apply(set).apply(';'.join).reset_index()
df2.to_csv(outputcsv, sep=',', index=False)
This yields the following output:
GroupID,Person
A001,John Doe;Mary Jane;James Smith
B003,Nathan Drake;Troy Baker
I can't figure out how to include the ID column and include all rows in a group where the Parent is "Yes" (as shown in the desired output above).
IIUC:
# First split each Person string into a list of names.
df.Person = df.Person.str.split(';')
# Then use transform, which writes the same de-duplicated, joined result to every row of the group
# (see https://stackoverflow.com/questions/27517425/apply-vs-transform-on-a-group-object).
df['Person'] = df.groupby(['GroupID']).Person.transform(lambda x: ';'.join(set(sum(x, []))))
# Finally, keep only the rows where Parent is 'Yes'.
df = df.loc[df.Parent.eq('Yes')]
df
Out[239]:
ID GroupID Person Parent
0 ID_001 A001 James Smith;John Doe;Mary Jane Yes
2 ID_003 A001 James Smith;John Doe;Mary Jane Yes
3 ID_004 B003 Troy Baker;Nathan Drake Yes
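Because set does not preserve order, the names can come out in a different order than in the desired output. A minimal sketch of an order-preserving variant follows, assuming the sample input from the question; the helper unique_people is a made-up name, and dict.fromkeys is used only to drop duplicates while keeping first-seen order.

```python
import pandas as pd

df = pd.DataFrame({
    'ID': ['ID_001', 'ID_002', 'ID_003', 'ID_004', 'ID_005'],
    'GroupID': ['A001', 'A001', 'A001', 'B003', 'B003'],
    'Person': ['John Doe', 'Mary Jane', 'James Smith;John Doe',
               'Nathan Drake', 'Troy Baker'],
    'Parent': ['Yes', 'No', 'Yes', 'Yes', 'No'],
})

def unique_people(people):
    # Flatten the ';'-separated cells of one group into a single list of names.
    names = [name for cell in people for name in cell.split(';')]
    # dict.fromkeys removes duplicates while keeping first-seen order.
    return ';'.join(dict.fromkeys(names))

# transform broadcasts the group's joined string back to every row of that group.
df['Person'] = df.groupby('GroupID')['Person'].transform(unique_people)

# Keep only the parent rows and the columns wanted in the output.
out = df.loc[df['Parent'].eq('Yes'), ['ID', 'GroupID', 'Person']]
print(out.to_csv(index=False))
```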
I am using the Python Package names to generate some first names for QA testing.
The names package contains the function names.get_first_name(gender) which allows either the string male or female as the parameter. Currently I have the following DataFrame:
Marital Gender
0 Single Female
1 Married Male
2 Married Male
3 Single Male
4 Married Female
I have tried the following:
df.loc[df.Gender == 'Male', 'FirstName'] = names.get_first_name(gender = 'male')
df.loc[df.Gender == 'Female', 'FirstName'] = names.get_first_name(gender = 'female')
But all I get in return is just two names:
Marital Gender FirstName
0 Single Female Kathleen
1 Married Male David
2 Married Male David
3 Single Male David
4 Married Female Kathleen
Is there a way to call this function separately for each row so not all males/females have the same exact name?
You need apply:
df['Firstname']=df['Gender'].str.lower().apply(names.get_first_name)
You can use a list comprehension:
df['Firstname']= [names.get_first_name(gender) for gender in df['Gender'].str.lower()]
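The reason the original attempt produces only two names is that the right-hand side of each df.loc assignment is evaluated once, and that single string is then broadcast to every matching row. The sketch below illustrates the per-row call with a hypothetical stand-in, fake_first_name, and made-up name pools, so that it runs without the names package installed; it is not the package's own function.

```python
import random
import pandas as pd

# Hypothetical name pools and stand-in function (NOT the names package).
POOLS = {
    'male': ['David', 'Alfred', 'Stanley'],
    'female': ['Kathleen', 'Debbie', 'Harriett'],
}
calls = []  # records every call so the per-row behaviour is visible

def fake_first_name(gender):
    calls.append(gender)
    return random.choice(POOLS[gender])

df = pd.DataFrame({
    'Marital': ['Single', 'Married', 'Married', 'Single', 'Married'],
    'Gender': ['Female', 'Male', 'Male', 'Male', 'Female'],
})

# apply calls the function once per row, so each row gets its own random draw.
df['FirstName'] = df['Gender'].str.lower().apply(fake_first_name)

print(df)
print(len(calls))  # one call per row
```

With such small pools, two rows can still draw the same name by chance; the point is only that the function runs once for every row.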
And here is a hack that reads all of the names by gender (together with their probabilities), and then randomly samples.
import names
import pandas as pd

def get_names(gender):
    if not isinstance(gender, str) or gender.lower() not in ('male', 'female'):
        raise ValueError('Invalid gender')
    # names.FILES is used here to locate the package's first-name file for this gender.
    with open(names.FILES['first:{}'.format(gender.lower())], 'r') as fin:
        first_names = []
        probs = []
        for line in fin:
            first_name, prob, dummy, dummy = line.strip().split()
            first_names.append(first_name)
            probs.append(float(prob) / 100)
    return pd.DataFrame({'first_name': first_names, 'probability': probs})

def get_random_first_names(n, first_names_by_gender):
    # Sample n names with replacement, weighted by each name's probability.
    first_names = (
        first_names_by_gender
        .sample(n, replace=True, weights='probability')
        .loc[:, 'first_name']
        .tolist()
    )
    return first_names

first_names = {gender: get_names(gender) for gender in ('Male', 'Female')}
>>> get_random_first_names(3, first_names['Male'])
['RICHARD', 'EDWARD', 'HOMER']
>>> get_random_first_names(4, first_names['Female'])
['JANICE', 'CAROLINE', 'DOROTHY', 'DIANE']
If speed matters, use map:
list(map(names.get_first_name, df.Gender.str.lower()))
Out[51]: ['Harriett', 'Parker', 'Alfred', 'Debbie', 'Stanley']
#df['FN']=list(map(names.get_first_name, df.Gender.str.lower()))
So I have all the WECO rules I wrote, and they work on a single column of data.
But now I want to group by 'name' and then score the 'score' column.
My problem is using groupby and trying to output to a new df2.
-This will be for many datasets with 5 - 40+ names
-This is one of the rules:
WECO_A = ["N"]
UCL = .2
lastPoint=df.groupby('name').iloc[0]['score']
if lastPoint > UCL:
    WECO_A = "Y"
if WECO_A == "Y":
    df2['weco'] = df.groupby('name') + 'RULE_A'
else:
    df2['weco'] = df.groupby('name') + 'OK'
df:
name score
bob 0.2849
sue 0.1960
ken 0.8427
bob 0.2844
sue 0.2507
ken 0.9904
...etc
and I am looking for this
df2:
name weco
bob RULE_A
sue OK
ken RULE_A
OR even a single column,
df2:
weco
bob RULE_A
sue OK
ken RULE_A
-just an example not sure what real scores would be??
AND, thanks in advance as always..
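One possible way to apply this rule per name is sketched below. It assumes that "lastPoint" means the first row of each name, as iloc[0] in the attempt suggests, which reproduces the expected df2 for the sample data; if the most recent row is meant instead, .last() could be used in place of .first(). The name first_scores is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['bob', 'sue', 'ken', 'bob', 'sue', 'ken'],
    'score': [0.2849, 0.1960, 0.8427, 0.2844, 0.2507, 0.9904],
})
UCL = 0.2

# First score of each name; sort=False keeps names in order of first appearance.
first_scores = df.groupby('name', sort=False)['score'].first()

# Rule A (as assumed here): flag the name when that point lies above the UCL.
df2 = (
    first_scores.gt(UCL)
    .map({True: 'RULE_A', False: 'OK'})
    .rename('weco')
    .reset_index()
)
print(df2)
```

Dropping the final reset_index() leaves a single 'weco' column indexed by name, which corresponds to the second layout shown above.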
I have a dataframe with 4 columns. 3 of these columns contain string values (people's names) and the 4th one has an int value (salary for a job done).
The string values are not unique either, the same string will show up several times in each column, but never more than once per row.
data = {
'worker1': ['Sam', 'Jack', 'Matt', 'Paul', 'Tim'],
'worker2': ['Alex', 'Amy', 'Sam', 'Alice', 'Amanda'],
'worker3': ['Alice', 'Aaron', 'Tony', 'Jack', 'Sam'],
'earnings': [4564552, 4573547, 3567567, 6357653, 7648576]}
df = pd.DataFrame(data, columns = ['worker1', 'worker2', 'worker3', 'earnings'])
print(df)
worker1 worker2 worker3 earnings
'Sam' 'Alex' 'Alice' 4564552
'Jack' 'Amy' 'Aaron' 4573547
'Matt' 'Sam' 'Tony' 3567567
'Paul' 'Alice' 'Jack' 6357653
'Tim' 'Amanda' 'Sam' 7648576
So what I need is to sum all the earnings associated to the specific name, regardless if it shows on column1, 2 or 3. I'm not sure if I should use a groupby function for this, build a dictionary or go another route.
This would be what I'm trying to accomplish:
workers total_earnings
Sam 15780695
Alex 4564552
Alice 10922205
Jack 10931200
Amy 4573547
Aaron 4573547
Matt 3567567
Tony 3567567
Paul 6357653
Tim 7648576
Amanda 7648576
I'm quite new to pandas so I'm at a place where I'm not familiar with which functions I can use for something like this. I've mostly tried to use a groupby function but that was a disaster.
Any help would be highly appreciated.
A bit lengthy, but does what you want:
>>> df1 = pd.concat([df.groupby('worker1')[['earnings']].sum(), df.groupby('worker2')[['earnings']].sum(), df.groupby('worker3')[['earnings']].sum()])
>>> df1.groupby(df1.index).sum()
earnings
Aaron 4573547
Alex 4564552
Alice 10922205
Amanda 7648576
Amy 4573547
Jack 10931200
Matt 3567567
Paul 6357653
Sam 15780695
Tim 7648576
Tony 3567567
The difficulty here comes from the way the data frame has been constructed. All worker names should have been in one column and their respective earnings in a second column. The term "tidy data" is worth reading up on: https://en.wikipedia.org/wiki/Tidy_data
The solution below rearranges the data frame and once this has been achieved the total earnings for a given name are easily calculated with a groupby.
df_list = []
columns = df.columns.tolist()
for i in range(3):
    df_i = df.loc[:, [columns[i], 'earnings']]
    df_i.columns = ['worker', 'earnings']
    df_list.append(df_i)
df_1 = pd.concat(df_list)
earnings = df_1.groupby(['worker']).sum()
earnings
Out[50]:
earnings
worker
Aaron 4573547
Alex 4564552
Alice 10922205
Amanda 7648576
Amy 4573547
Jack 10931200
Matt 3567567
Paul 6357653
Sam 15780695
Tim 7648576
Tony 3567567
I managed to do what I wanted with the following code. It does work, but I don't know if this is the right approach or the most efficient way to do this. Having some validation from someone with more experience on whether this is a proper way to tackle this problem would be beneficial. Thank you for all the help you've provided in this!
df1 = df[['worker1', 'worker2', 'worker3', 'earnings']].copy()
df1.dropna(subset=['earnings'], inplace=True)
df1.reset_index(drop=True, inplace=True)
df1 = pd.melt(df1, id_vars = ['earnings'], value_name = 'workers', value_vars = ['worker1', 'worker2', 'worker3'])
df1.drop('variable', axis=1, inplace=True)
df1 = df1.groupby('workers')['earnings'].sum()
df1 = pd.DataFrame({'workers':df1.index, 'Earnings':df1.values})
I really like your approach. There are some lines you can do without, at least for the data frame defined in your question above. Interestingly, if you use groupby the way it is coded in my other answer, you get a data frame rather than a series, and you can then chain the reset_index method on the same line.
df1 = pd.melt(df, id_vars = ['earnings'], value_name = 'workers', value_vars = ['worker1', 'worker2', 'worker3'])
df1 = df1.drop('variable', axis=1).groupby('workers').sum().reset_index()