Create Names column in Pandas DataFrame - python

I am using the Python Package names to generate some first names for QA testing.
The names package contains the function names.get_first_name(gender) which allows either the string male or female as the parameter. Currently I have the following DataFrame:
   Marital  Gender
0   Single  Female
1  Married    Male
2  Married    Male
3   Single    Male
4  Married  Female
I have tried the following:
df.loc[df.Gender == 'Male', 'FirstName'] = names.get_first_name(gender = 'male')
df.loc[df.Gender == 'Female', 'FirstName'] = names.get_first_name(gender = 'female')
But all I get in return are just two names:
   Marital  Gender FirstName
0   Single  Female  Kathleen
1  Married    Male     David
2  Married    Male     David
3   Single    Male     David
4  Married  Female  Kathleen
Is there a way to call this function separately for each row so not all males/females have the same exact name?

You need apply:
df['Firstname']=df['Gender'].str.lower().apply(names.get_first_name)
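The attempt in the question calls the function once and broadcasts that single string to every selected row, whereas apply calls it once per element. A minimal sketch of the difference, using a hypothetical stand-in `fake_first_name` and a made-up `NAME_POOL` (neither is part of the names package) so it runs without that package installed:

```python
import random

import pandas as pd

# Hypothetical stand-in for names.get_first_name, so the sketch runs
# without the names package installed.
NAME_POOL = {
    "male": ["David", "Alfred", "Stanley", "Parker"],
    "female": ["Kathleen", "Harriett", "Debbie", "Janice"],
}

def fake_first_name(gender):
    return random.choice(NAME_POOL[gender])

df = pd.DataFrame({
    "Marital": ["Single", "Married", "Married", "Single", "Married"],
    "Gender": ["Female", "Male", "Male", "Male", "Female"],
})

# Scalar assignment: the function runs once and its single result is
# broadcast to every selected row.
df.loc[df.Gender == "Male", "Broadcast"] = fake_first_name("male")

# apply: the function runs once per row, so each row gets its own draw.
df["PerRow"] = df["Gender"].str.lower().apply(fake_first_name)
print(df)
```

Rows can still share a name by chance, but each one now comes from its own call.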

You can use a list comprehension:
df['Firstname']= [names.get_first_name(gender) for gender in df['Gender'].str.lower()]
And here is a hack that reads all of the names by gender (together with their probabilities), and then randomly samples.
import names
import pandas as pd

def get_names(gender):
    if not isinstance(gender, str) or gender.lower() not in ('male', 'female'):
        raise ValueError('Invalid gender')
    with open(names.FILES['first:{}'.format(gender.lower())]) as fin:
        first_names = []
        probs = []
        for line in fin:
            # each line has four whitespace-separated fields; only the first two are used
            first_name, prob, dummy, dummy = line.strip().split()
            first_names.append(first_name)
            probs.append(float(prob) / 100)
    return pd.DataFrame({'first_name': first_names, 'probability': probs})

def get_random_first_names(n, first_names_by_gender):
    # sample with replacement, weighted by each name's probability
    first_names = (
        first_names_by_gender
        .sample(n, replace=True, weights='probability')
        .loc[:, 'first_name']
        .tolist()
    )
    return first_names

first_names = {gender: get_names(gender) for gender in ('Male', 'Female')}
>>> get_random_first_names(3, first_names['Male'])
['RICHARD', 'EDWARD', 'HOMER']
>>> get_random_first_names(4, first_names['Female'])
['JANICE', 'CAROLINE', 'DOROTHY', 'DIANE']

If speed matters, use map:
list(map(names.get_first_name,df.Gender))
Out[51]: ['Harriett', 'Parker', 'Alfred', 'Debbie', 'Stanley']
#df['FN']=list(map(names.get_first_name,df.Gender))

Related

Conditionally merge two dataframe columns

I have two columns in my data frame for gender derived from first name and middle name. I want to create a third column for overall gender. As such, where there is male or female in either column, it should override the unknown in the other column. I've written the following function but end up with the following error:
# Assign gender for names so that number of female names can be counted.
d = gender.Detector()
trnY['Gender_first'] = trnY['first'].map(lambda x: d.get_gender(x))
trnY['Gender_mid'] = trnY['middle'].map(lambda x: d.get_gender(x))
# merge the two gender columns:
def gender(x):
    if ['Gender_first'] == male or ['Gender_mid'] == male:
        return male
    elif ['Gender_first'] == female or ['Gender_mid'] == female:
        return female
    else:
        return unknown
trnY['Gender'] = trnY.apply(gender)
trnY
Error:
--> 50 trnY['Gender'] = trnY.apply(gender)
ValueError: Unable to coerce to Series, the length must be 21: given 1
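If you want to use apply() on rows, you should pass it the parameter axis=1, and look the values up on the row that is passed in (x['Gender_first'] rather than ['Gender_first']). In your case:
# male, female and unknown are assumed to be defined as in the question
def gender(x):
    # with axis=1, x is a single row of trnY
    if x['Gender_first'] == male or x['Gender_mid'] == male:
        return male
    elif x['Gender_first'] == female or x['Gender_mid'] == female:
        return female
    else:
        return unknown
trnY['Gender'] = trnY.apply(gender, axis=1)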
This should solve your problem.
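For larger frames the same rule can also be written without a row-wise apply. This is only a sketch of an alternative, not the method used above. It assumes the two columns hold the plain strings 'male', 'female' and 'unknown', and the sample rows are made up for illustration. Any other label ends up in the default.

```python
import numpy as np
import pandas as pd

# Made-up sample data; the string labels are an assumption.
trnY = pd.DataFrame({
    "Gender_first": ["male", "unknown", "unknown", "female"],
    "Gender_mid": ["unknown", "female", "unknown", "male"],
})

is_male = (trnY["Gender_first"] == "male") | (trnY["Gender_mid"] == "male")
is_female = (trnY["Gender_first"] == "female") | (trnY["Gender_mid"] == "female")

# np.select uses the first condition that matches, so "male" wins over
# "female" when the two columns disagree, like the if/elif chain does.
trnY["Gender"] = np.select([is_male, is_female], ["male", "female"],
                           default="unknown")
print(trnY)
```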

Unknown values when iterating over dataframe

I am using the gender-guesser library to guess gender from a first name.
import gender_guesser.detector as gender
d = gender.Detector()
print(d.get_gender(u"Bob"))
male
gen = ['Alice', 'Bob', 'Kattie', "Jean", "Gabriel"]
female
male
female
male
male
But when I try to iterate over a pandas DataFrame column, I get unknown as the output:
for name in df1['first_name'].iteritems():
    print(d.get_gender(name))
One way to go is using map.
df1['gender'] = df1['first_name'].map(lambda x: d.get_gender(x))
It will create a new column named "gender". I think it should be faster than iteritems.
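The likely reason for the unknown results is that iteritems() yields (index, value) pairs, so a tuple rather than a name is passed to get_gender. A small sketch of that, using a hypothetical stand-in `toy_get_gender` and a made-up `KNOWN` lookup instead of the real detector, and Series.items() because iteritems() was removed in pandas 2.0:

```python
import pandas as pd

# Hypothetical stand-in for Detector.get_gender: anything it does not
# recognise maps to "unknown".
KNOWN = {"Alice": "female", "Bob": "male"}

def toy_get_gender(name):
    return KNOWN.get(name, "unknown")

df1 = pd.DataFrame({"first_name": ["Alice", "Bob"]})

# items() yields (index, value) pairs, so each item is a tuple, not a name.
pairs = [toy_get_gender(item) for item in df1["first_name"].items()]

# Unpacking the pair passes the actual string to the function.
unpacked = [toy_get_gender(name) for _, name in df1["first_name"].items()]

# map applies the function to each value and returns an aligned Series.
df1["gender"] = df1["first_name"].map(toy_get_gender)
print(pairs, unpacked, df1["gender"].tolist())
```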

Nested if statements with .loc in pandas / python

I am using if in a conditional statement like the below code. If address is NJ then the value of name column is changed to 'N/A'.
df1.loc[df1.Address.isin(['NJ']), 'name'] = 'N/A'
How do I do the same, if I have 'nested if statements' like below?
# this not code just representing the logic
if address isin ('NJ', 'NY'):
    if name1 isin ('john', 'bob'):
        name1 = 'N/A'
    if name2 isin ('mayer', 'dylan'):
        name2 = 'N/A'
Can I achieve above logic using df.loc? Or is there any other way to do it?
Separate assignments, as shown by @MartijnPieters, are a good idea for a small number of conditions.
For a large number of conditions, consider using numpy.select to separate your conditions and choices. This should make your code more readable and easier to maintain.
For example:
import pandas as pd, numpy as np
df = pd.DataFrame({'address': ['NY', 'CA', 'NJ', 'NY', 'WS'],
                   'name1': ['john', 'mayer', 'dylan', 'bob', 'mary'],
                   'name2': ['mayer', 'dylan', 'mayer', 'bob', 'bob']})
address_mask = df['address'].isin(('NJ', 'NY'))
conditions = [address_mask & df['name1'].isin(('john', 'bob')),
              address_mask & df['name2'].isin(('mayer', 'dylan'))]
choices = ['Option 1', 'Option 2']
df['result'] = np.select(conditions, choices)
print(df)
  address  name1  name2    result
0      NY   john  mayer  Option 1
1      CA  mayer  dylan         0
2      NJ  dylan  mayer  Option 2
3      NY    bob    bob  Option 1
4      WS   mary    bob         0
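The 0 in the unmatched rows is the default value that np.select falls back to. A sketch that passes default explicitly, with 'No match' as an arbitrary label chosen for illustration. Some recent NumPy releases may also refuse to mix string choices with the integer default, so a string default looks like the safer choice:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'address': ['NY', 'CA', 'NJ', 'NY', 'WS'],
                   'name1': ['john', 'mayer', 'dylan', 'bob', 'mary'],
                   'name2': ['mayer', 'dylan', 'mayer', 'bob', 'bob']})

address_mask = df['address'].isin(('NJ', 'NY'))
conditions = [address_mask & df['name1'].isin(('john', 'bob')),
              address_mask & df['name2'].isin(('mayer', 'dylan'))]
choices = ['Option 1', 'Option 2']

# Rows matching no condition get the explicit string default instead of 0.
df['result'] = np.select(conditions, choices, default='No match')
print(df)
```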
Use separate assignments. You have different conditions to filter on, so combine the address filter with each of the two name* filters using & (and put parentheses around each test):
df1.loc[(df1.Address.isin(['NJ'])) & (df1.name1.isin(['john', 'bob'])), 'name1'] = 'N/A'
df1.loc[(df1.Address.isin(['NJ'])) & (df1.name2.isin(['mayer', 'dylan'])), 'name2'] = 'N/A'
You can always store the boolean filters in a variable first:
nj_address = df1.Address.isin(['NJ'])
name1_filter = df1.name1.isin(['john', 'bob'])
name2_filter = df1.name2.isin(['mayer', 'dylan'])
df1.loc[nj_address & name1_filter, 'name1'] = 'N/A'
df1.loc[nj_address & name2_filter, 'name2'] = 'N/A'

python: splitting a file based on a key word

I have this file:
GSENumber Species Platform Sample Age Tissue Sex Count
GSE11097 Rat GPL1355 GSM280267 4 Liver Male Count
GSE11097 Rat GPL1355 GSM280268 4 Liver Female Count
GSE11097 Rat GPL1355 GSM280269 6 Liver Male Count
GSE11097 Rat GPL1355 GSM280409 6 Liver Female Count
GSE11291 Mouse GPL1261 GSM284967 5 Heart Male Count
GSE11291 Mouse GPL1261 GSM284968 5 Heart Male Count
GSE11291 Mouse GPL1261 GSM284969 5 Heart Male Count
GSE11291 Mouse GPL1261 GSM284970 5 Heart Male Count
GSE11291 Mouse GPL1261 GSM284975 10 Heart Male Count
GSE11291 Mouse GPL1261 GSM284976 10 Heart Male Count
GSE11291 Mouse GPL1261 GSM284987 5 Muscle Male Count
GSE11291 Mouse GPL1261 GSM284988 5 Muscle Female Count
GSE11291 Mouse GPL1261 GSM284989 30 Muscle Male Count
GSE11291 Mouse GPL1261 GSM284990 30 Muscle Male Count
GSE11291 Mouse GPL1261 GSM284991 30 Muscle Male Count
You can see there are two series here (GSE11097 and GSE11291), and I want a summary for each series. The output should be a dictionary like this, for each "GSE" number:
Series Species Platform AgeRange Tissue Sex Count
GSE11097 Rat GPL1355 4-6 Liver Mixed Count
GSE11291 Mouse GPL1261 5-10 Heart Male Count
GSE11291 Mouse GPL1261 5-30 Muscle Mixed Count
So I know one way to do this would be:
Read in the file and make a list of all the GSE numbers.
Then read in the file again and parse based on GSE number.
e.g.
import sys
list_of_series = list(set([line.strip().split()[0] for line in open(sys.argv[1])]))
list_of_dicts = []
for each_list in list_of_series:
    temp_dict = {"species": "", "platform": "", "age": [], "tissue": "", "sex": [], "count": ""}
    for line in open(sys.argv[1]).readlines()[1:]:
        line = line.strip().split()
        if line[0] == each_list:
            temp_dict["species"] = line[1]
            temp_dict["platform"] = line[2]
            temp_dict["age"].append(line[4])
            temp_dict["tissue"] = line[5]
            temp_dict["sex"].append(line[6])
            temp_dict["count"] = line[7]
I think this is messy in two ways:
I've to read in the whole file twice (in reality, file much bigger than example here)
This method keeps re-writing over the same dictionary entry with the same word.
Also, there's a problem with the sex: if both male and female occur, I want to put "mixed" in the dict; otherwise, "male" or "female".
I can make this code work, but I'm wondering about quick tips to make the code cleaner/more pythonic?
I agree with Max Paymar that this should be done in a query language. If you really want to do it in Python, the pandas module will help a lot.
import pandas as pd
## column widths of the fixed width format
fwidths = [12, 8, 8, 12, 4, 8, 8, 5]
## read fixed width format into pandas data frame
## (header=0 replaces the file's own header row with the given names)
df = pd.read_fwf("your_file.txt", widths=fwidths, header=0,
                 names=["GSENumber", "Species", "Platform", "Sample",
                        "Age", "Tissue", "Sex", "Count"])
## drop "Sample" column as it is not needed in the output
df = df.drop("Sample", axis=1)
## group by GSENumber
grouped = df.groupby(df.GSENumber)
## aggregate columns for each group
aggregated = grouped.agg({'Species': lambda x: list(x.unique()),
                          'Platform': lambda x: list(x.unique()),
                          'Age': lambda x: "%d-%d" % (min(x), max(x)),
                          'Tissue': lambda x: list(x.unique()),
                          'Sex': lambda x: "Mixed" if x.nunique() > 1 else list(x.unique()),
                          'Count': lambda x: list(x.unique())})
print(aggregated)
This produces pretty much the result you asked for and is much cleaner than parsing the file in pure Python.
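The expected output in the question has one row per series and tissue, while grouping on GSENumber alone gives one row per series. A sketch of grouping on both, assuming the file is whitespace-delimited; a few rows from the question are inlined so it runs on its own, and `sex_label` is a helper name made up for this example:

```python
import io

import pandas as pd

# A few rows from the question, inlined so the sketch is self-contained.
raw = """GSENumber Species Platform Sample Age Tissue Sex Count
GSE11097 Rat GPL1355 GSM280267 4 Liver Male Count
GSE11097 Rat GPL1355 GSM280268 4 Liver Female Count
GSE11097 Rat GPL1355 GSM280269 6 Liver Male Count
GSE11291 Mouse GPL1261 GSM284967 5 Heart Male Count
GSE11291 Mouse GPL1261 GSM284975 10 Heart Male Count
GSE11291 Mouse GPL1261 GSM284988 5 Muscle Female Count
GSE11291 Mouse GPL1261 GSM284989 30 Muscle Male Count
"""

df = pd.read_csv(io.StringIO(raw), sep=r"\s+")

def sex_label(s):
    # "Mixed" when both sexes occur in the group, otherwise the single value.
    return "Mixed" if s.nunique() > 1 else s.iloc[0]

summary = (
    df.groupby(["GSENumber", "Species", "Platform", "Tissue"], as_index=False)
      .agg(AgeRange=("Age", lambda a: f"{a.min()}-{a.max()}"),
           Sex=("Sex", sex_label),
           Count=("Count", "first"))
)
print(summary)
```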
import sys

def main():
    with open(sys.argv[1]) as infile:
        data = read_data(infile)
    result = process_rows(data)
    format_and_print(result, sys.argv[2])

def read_data(file):
    # split every line of the already opened file into its columns
    data = [line.strip().split() for line in file]
    data.pop(0)  # remove header
    return data

def process_rows(data):
    data_dict = {}
    for row in data:
        process_row(row, data_dict)
    return data_dict

def process_row(row, data_dict):
    composite_key = row[0] + row[1] + row[5]  # assuming this is how you are grouping the entries
    if composite_key in data_dict:
        data_dict[composite_key]['age_range'].add(row[4])
        # a different sex than the one seen so far makes the group mixed
        if row[6] != data_dict[composite_key]['sex']:
            data_dict[composite_key]['sex'] = 'Mixed'
        # do you need to accumulate the counts? data_dict[composite_key]['count'] += row[7]
    else:
        data_dict[composite_key] = {
            'series': row[0],
            'species': row[1],
            'platform': row[2],
            'age_range': set([row[4]]),
            'tissue': row[5],
            'sex': row[6],
            'count': row[7]
        }

def format_and_print(data_dict, outfile):
    pass
    # you can implement this one :)

if __name__ == "__main__":
    main()
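One possible way to fill in the formatting step is sketched below. The function name `format_rows` and the example entry are made up for illustration; the entry only mirrors the shape of the dictionaries built in process_row, and writing the lines to a file is left out.

```python
def format_rows(data_dict):
    """Return a header line plus one summary line per grouped entry."""
    lines = ["Series Species Platform AgeRange Tissue Sex Count"]
    for entry in data_dict.values():
        # Ages were read as strings, so convert before taking min and max.
        ages = [int(age) for age in entry["age_range"]]
        lines.append(" ".join([
            entry["series"], entry["species"], entry["platform"],
            "{}-{}".format(min(ages), max(ages)),
            entry["tissue"], entry["sex"], entry["count"],
        ]))
    return lines

# Made-up example entry in the shape built by process_row.
example = {
    "GSE11291MouseMuscle": {
        "series": "GSE11291", "species": "Mouse", "platform": "GPL1261",
        "age_range": {"5", "30"}, "tissue": "Muscle",
        "sex": "Mixed", "count": "Count",
    }
}
print("\n".join(format_rows(example)))
```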

Count based on other csv file

I have a dataframe df with two columns called 'MovieName' and 'Actors'. It looks like:
MovieName Actors
lights out Maria Bello
legend Tom Hardy*Emily Browning*Christopher Eccleston*David Thewlis
Please note that different actor names are separated by '*'. I have another csv file called gender.csv which has the gender of all actors based on their first names. gender.csv looks like -
ActorName Gender
Tom male
Emily female
Christopher male
I want to add two columns in my dataframe 'female_actors' and 'male_actors' which contains the count of female and male actors in that particular movie respectively.
How do I achieve this task using both df and gender.csv in pandas?
Please note that -
If particular name isn't present in gender.csv, don't count it in the total.
If there is just one actor in a movie and it isn't present in gender.csv, then its count should be zero.
Result of above example should be -
MovieName Actors male_actors female_actors
lights out Maria Bello 0 0
legend Tom Hardy*Emily Browning*Christopher Eccleston*David Thewlis 2 1
import pandas as pd
df1 = pd.DataFrame({'MovieName': ['lights out', 'legend'], 'Actors':['Maria Bello', 'Tom Hardy*Emily Browning*Christopher Eccleston*David Thewlis']})
df2 = pd.DataFrame({'ActorName': ['Tom', 'Emily', 'Christopher'], 'Gender':['male', 'female', 'male']})
def func(actors, gender):
    # first name of every actor in the '*'-separated string
    actors = [act.split()[0] for act in actors.split('*')]
    # rows of df2 with the requested gender whose name appears among the actors
    n_gender = ((df2.Gender == gender) & df2.ActorName.isin(actors)).sum()
    return n_gender
df1['male_actors'] = df1.Actors.apply(lambda x: func(x, 'male'))
df1['female_actors'] = df1.Actors.apply(lambda x: func(x, 'female'))
df1.to_csv('res.csv', index=False)
print(df1)
Output
Actors,MovieName,male_actors,female_actors
Maria Bello,lights out,0,0
Tom Hardy*Emily Browning*Christopher Eccleston*David Thewlis,legend,2,1
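A variation on the same idea is sketched here with a plain dictionary lookup, so that every actor is counted individually (two actors sharing a first name count twice). The names `gender_of` and `count_gender` are made up for this example, and in practice df2 would presumably come from pd.read_csv('gender.csv'):

```python
import pandas as pd

df1 = pd.DataFrame({
    "MovieName": ["lights out", "legend"],
    "Actors": ["Maria Bello",
               "Tom Hardy*Emily Browning*Christopher Eccleston*David Thewlis"],
})
df2 = pd.DataFrame({"ActorName": ["Tom", "Emily", "Christopher"],
                    "Gender": ["male", "female", "male"]})

# First name -> gender lookup; names missing from it are simply not counted.
gender_of = dict(zip(df2["ActorName"], df2["Gender"]))

def count_gender(actors, gender):
    first_names = (actor.split()[0] for actor in actors.split("*"))
    return sum(gender_of.get(name) == gender for name in first_names)

df1["male_actors"] = df1["Actors"].apply(lambda a: count_gender(a, "male"))
df1["female_actors"] = df1["Actors"].apply(lambda a: count_gender(a, "female"))
print(df1)
```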
