I am using the gender-guesser library to guess gender from a first name.
import gender_guesser.detector as gender
d = gender.Detector()
print(d.get_gender(u"Bob"))
male
gen = ['Alice', 'Bob', 'Kattie', "Jean", "Gabriel"]
for name in gen:
    print(d.get_gender(name))
female
male
female
male
male
But when I iterate over a column of a pandas DataFrame, every result is unknown:
for name in df1['first_name'].iteritems():
    print(d.get_gender(name))
One way to go is to use map:
df1['gender'] = df1['first_name'].map(lambda x: d.get_gender(x))
This creates a new column named "gender". I think it should also be faster than looping with iteritems.
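As for why the loop in the question prints unknown: iterating a Series with iteritems() (called items() in current pandas, where iteritems was removed in 2.0) yields (index, value) pairs, so the detector most likely receives a tuple instead of a name. A minimal sketch of the difference, using a made-up guess function as a stand-in for d.get_gender so that it runs without gender_guesser installed:

```python
import pandas as pd

# Hypothetical stand-in for d.get_gender; not part of gender_guesser.
KNOWN = {"Alice": "female", "Bob": "male"}

def guess(name):
    # Anything that is not a known string name comes back as "unknown".
    if not isinstance(name, str):
        return "unknown"
    return KNOWN.get(name, "unknown")

df1 = pd.DataFrame({"first_name": ["Alice", "Bob"]})

# Passing the whole (index, value) pair, as the loop in the question does.
wrong = [guess(pair) for pair in df1["first_name"].items()]

# Unpacking the pair so that only the name is passed.
right = [guess(name) for _, name in df1["first_name"].items()]

print(wrong)
print(right)
```

Unpacking the pair (or iterating over the column directly) should make the original loop work as well; map is still the tidier option.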
So, I have a pandas data frame where one column contains the description of the nationality of a user and I want to replace this whole description with the country he's from.
My inputs are the df and the list of countries:
| Description | ID |
|:------------|:---|
| I am from Atlantis | 1 |
| My family comes from Narnia | 2 |
["narnia","uzbekistan","Atlantis",...]
I know that:
I only have one country per description.
The description either contains the name of the country or it does not. There is no need to infer the country from what the user says; I only want to map [phrase containing name of country] to [country].
If I had only one country to replace I could use something like
df.loc[df['description'].str.contains('Atlantis', case=False), 'description'] = 'Atlantis'
I know that, because the country names are organised in a list, I could cycle through it and apply this to all the elements, something like:
for country in country_list:
    df.loc[df['description'].str.contains(country, case=False), 'description'] = country
but that seems quite unpythonic to me, so I was wondering if anyone could help me find a better way (which I'm sure exists).
The output should be:
| Description | ID |
|:------------|:---|
| Atlantis | 1 |
| Narnia | 2 |
You can use pd.Series.str.extract:
import re
import pandas as pd

country_list = ["narnia","uzbekistan","Atlantis"]
df = pd.DataFrame({'Description': {0: 'I am from Atlantis',
                                   1: 'My family comes from Narnia'},
                   'ID': {0: 1, 1: 2}})
# Build one alternation pattern from the list and match case-insensitively.
print(df["Description"].str.extract(f"({'|'.join(country_list)})", flags=re.I))
          0
0  Atlantis
1    Narnia
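If the result should replace the original column, and the country names might contain regex metacharacters, a slightly more defensive variant could look like the sketch below. The use of re.escape and expand=False is my addition, not something the question requires:

```python
import re
import pandas as pd

country_list = ["narnia", "uzbekistan", "Atlantis"]
df = pd.DataFrame({
    "Description": ["I am from Atlantis", "My family comes from Narnia"],
    "ID": [1, 2],
})

# Escape each name so characters such as "." or "(" are matched literally.
pattern = "(" + "|".join(re.escape(c) for c in country_list) + ")"

# expand=False returns a Series; rows without a match become NaN.
df["Description"] = df["Description"].str.extract(pattern, flags=re.I, expand=False)
print(df)
```

Note that the extracted text keeps the spelling found in the description ("Narnia"), not the spelling used in the list ("narnia").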
I am working on a python script that automates some phone calls for me. I have a tool to test with that I can interact with REST API. I need to select a specific carrier based on which country code is entered. So let's say my user enters 12145221414 in my excel document, I want to choose AT&T as the carrier. How would I accept input from the first column of the table and then output what's in the 2nd column?
Obviously this can get a little tricky, since I would need to match up to 3-4 digits on the front of a phone number. My plan is to write a function that then takes the initial number and then plugs the carrier that needs to be used for that country.
Any idea how I could extract this data from the table? How would I make it so that if you entered Barbados (1246), then Lime is selected instead of AT&T?
Here's my code thus far and tables. I'm not sure how I can read one table and then pull data from that table to use for my matching function.
testlist.xlsx
| Number |
|:------------|
|8155555555|
|12465555555|
|12135555555|
|96655555555|
|525555555555|
carriers.xlsx
| countryCode | Carrier |
|:------------|:--------|
|1246|LIME|
|1|AT&T|
|81|Softbank|
|52|Telmex|
|966|Zain|
import pandas as pd
import os
FILE_PATH = "C:/temp/testlist.xlsx"
xl_1 = pd.ExcelFile(FILE_PATH)
num_df = xl_1.parse('Numbers')
FILE_PATH = "C:/temp/carriers.xlsx"
xl_2 = pd.ExcelFile(FILE_PATH)
car_df = xl_2.parse('Carriers')
for index, row in num_df.iterrows():
Any idea how I could extract this data from the table? How would I
make it so that if you entered Barbados (1246), then Lime is selected
instead of AT&T?
carriers.xlsx
| countryCode | Carrier |
|:------------|:--------|
|1246|LIME|
|1|AT&T|
|81|Softbank|
|52|Telmex|
|966|Zain|
script.py
import pandas as pd
FILE_PATH = "./carriers.xlsx"
df = pd.read_excel(FILE_PATH)
rows_list = df.to_dict('records')
code_carrier_map = {}
for row in rows_list:
    code_carrier_map[row["countryCode"]] = row["Carrier"]
print(type(code_carrier_map), code_carrier_map)
print(f"{code_carrier_map.get(1)=}")
print(f"{code_carrier_map.get(1246)=}")
print(f"{code_carrier_map.get(52)=}")
print(f"{code_carrier_map.get(81)=}")
print(f"{code_carrier_map.get(966)=}")
Output
$ python3 script.py
<class 'dict'> {1246: 'LIME', 1: 'AT&T', 81: 'Softbank', 52: 'Telmex', 966: 'Zain'}
code_carrier_map.get(1)='AT&T'
code_carrier_map.get(1246)='LIME'
code_carrier_map.get(52)='Telmex'
code_carrier_map.get(81)='Softbank'
code_carrier_map.get(966)='Zain'
Then, if you want to parse phone numbers, don't reinvent the wheel: use the phonenumbers library.
Code
import phonenumbers
num = "+12145221414"
phone_number = phonenumbers.parse(num)
print(f"{num=}")
print(f"{phone_number.country_code=}")
print(f"{code_carrier_map.get(phone_number.country_code)=}")
Output
num='+12145221414'
phone_number.country_code=1
code_carrier_map.get(phone_number.country_code)='AT&T'
Let's assume the following input:
>>> df1
         Number
0    8155555555
1   12465555555
2   12135555555
3   96655555555
4  525555555555
>>> df2
   countryCode   Carrier
0         1246      LIME
1            1      AT&T
2           81  Softbank
3           52    Telmex
4          966      Zain
First we need to rework df2 a bit: sort countryCode in descending order, convert it to strings, and set it as the index.
The trick for later is the descending sort. It ensures that a longer country code such as "1246" is tried before a shorter one like "1".
>>> df2 = df2.sort_values(by='countryCode', ascending=False).astype(str).set_index('countryCode')
>>> df2
              Carrier
countryCode
1246             LIME
966              Zain
81           Softbank
52             Telmex
1                AT&T
Finally, we use a regex (here '1246|966|81|52|1' using '|'.join(df2.index)) made from the country codes in descending order to extract the longest code, and we map it to the carrier:
(df1.astype(str)['Number']
    .str.extract('^(%s)'%'|'.join(df2.index))[0]
    .map(df2['Carrier'])
)
output:
0    Softbank
1        LIME
2        AT&T
3        Zain
4      Telmex
Name: 0, dtype: object
NB. to add it to the initial dataframe:
df1['carrier'] = (df1.astype(str)['Number']
    .str.extract('^(%s)'%'|'.join(df2.index))[0]
    .map(df2['Carrier'])
)
output:
         Number   carrier
0    8155555555  Softbank
1   12465555555      LIME
2   12135555555      AT&T
3   96655555555      Zain
4  525555555555    Telmex
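To see why the order of the codes in the pattern matters, here is a small standalone check with the re module. Regex alternation takes the first alternative that matches, so listing "1" before "1246" would pick the wrong code for a Barbados number:

```python
import re

number = "12465555555"

# Longer code listed first: the Barbados prefix wins.
longest_first = re.match(r"^(1246|966|81|52|1)", number).group(1)

# Shorter code listed first: "1" matches and "1246" is never tried.
shortest_first = re.match(r"^(1|52|81|966|1246)", number).group(1)

print(longest_first, shortest_first)
```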
If I understand it correctly, you just want to get the first characters from the input column (Number) and then match this with the second dataframe from carriers.xlsx.
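Extract the first characters of the Number column. Hint: nbr_of_chars should be the maximum length of the countryCode values in carriers.xlsx.
nbr_of_chars = 4
# Cast to string first so that .str slicing also works on a numeric column.
df.loc[df['Number'].notnull(), 'FirstCharsColumn'] = df['Number'].astype(str).str[:nbr_of_chars]
Then the matching can be done with dataframe joins. Because the codes vary in length, a fixed-length prefix only matches codes of exactly that length, so the shorter prefixes have to be tried as well.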
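One way to handle codes of different lengths with this prefix idea is to try each prefix length from longest to shortest and only fill in rows that are still unmatched. This is a rough sketch using the sample data from the question; the names df_numbers, df_carriers and code_to_carrier are made up for illustration, and both columns are assumed to hold strings:

```python
import pandas as pd

df_numbers = pd.DataFrame({"Number": ["8155555555", "12465555555", "12135555555",
                                      "96655555555", "525555555555"]})
df_carriers = pd.DataFrame({"countryCode": ["1246", "1", "81", "52", "966"],
                            "Carrier": ["LIME", "AT&T", "Softbank", "Telmex", "Zain"]})

# Lookup table from country code to carrier.
code_to_carrier = df_carriers.set_index("countryCode")["Carrier"]
max_len = int(df_carriers["countryCode"].str.len().max())

# Start with an all-NaN column and fill it in pass by pass.
carrier = pd.Series(index=df_numbers.index, dtype="object")
# Try the longest prefix first so that "1246" is preferred over "1".
for n in range(max_len, 0, -1):
    candidate = df_numbers["Number"].str[:n].map(code_to_carrier)
    carrier = carrier.fillna(candidate)

df_numbers["Carrier"] = carrier
print(df_numbers)
```

This needs one pass per possible code length, which is cheap when codes are at most four digits long.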
I can think only of an inefficient solution.
First, sort the data frame of carriers in reverse alphabetical order of country codes, with the codes cast to strings. That way, a longer code is listed before any shorter code that is a prefix of it.
codes = car_df.astype({'countryCode': str}).sort_values('countryCode', ascending=False)
Next, define a function that matches a number with each country code in the second data frame and finds the index of the first match, if any (remember, that match is the longest).
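import numpy as np

def cc2carrier(num):
    # True for every country code that is a prefix of num (both must be strings).
    matches = codes['countryCode'].apply(lambda x: num.startswith(x))
    if not matches.any():  # Not found
        return np.nan
    # idxmax returns the first True, which is the longest matching code.
    return codes.loc[matches.idxmax(), 'Carrier']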
Now, apply the function to the numbers dataframe:
num_df['Number'].astype(str).apply(cc2carrier)
#1 Softbank
#2 LIME
#3 AT&T
#4 Zain
#5 Telmex
#Name: Number, dtype: object
I am using the Titanic dataset to practice filtering data. I need to find the youngest passengers who didn't survive. This is what I have so far:
df_kids = df[(df["Survived"] == 0)][["Age","Name","Sex"]].sort_values("Age").head(10)
df_kids
Now I need to say how many of them are male and how many are female. I tried a loop, but it gives me zero for both lists every time. I don't know what I am doing wrong:
list_m = list()
list_f = list()
for i in df_kids:
    if [["Sex"] == "male"]:
        list_m.append(i)
    else:
        list_f.append(i)
len(list_m)
len(list_f)
Could you help me, please?
Thanks a lot!
You can create a boolean mask. For example:
male_mask = df_kids['Sex'] == 'male'
And use it:
male = df_kids[male_mask]
female = df_kids[~male_mask] # Assuming Sex is either male or female
You can now use the shape attribute if you are only interested in the counts.
print(male.shape[0])
print(female.shape[0])
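If only the counts are needed, value_counts on the Sex column gets there in one step. Iterating over a DataFrame directly, as the loop in the question does, yields the column labels rather than the rows, which is why a loop is awkward here. A small sketch with made-up rows standing in for the ten youngest non-survivors:

```python
import pandas as pd

# Made-up sample; the real df_kids comes from the Titanic data.
df_kids = pd.DataFrame({
    "Age": [1.0, 2.0, 2.0, 3.0, 4.0],
    "Name": ["A", "B", "C", "D", "E"],
    "Sex": ["male", "female", "female", "male", "female"],
})

# One count per distinct value in the column.
counts = df_kids["Sex"].value_counts()
print(counts)

# Iterating over the DataFrame itself gives the column labels.
labels = [item for item in df_kids]
print(labels)
```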
I am using the Python package names to generate some first names for QA testing.
The names package contains the function names.get_first_name(gender) which allows either the string male or female as the parameter. Currently I have the following DataFrame:
Marital Gender
0 Single Female
1 Married Male
2 Married Male
3 Single Male
4 Married Female
I have tried the following:
df.loc[df.Gender == 'Male', 'FirstName'] = names.get_first_name(gender = 'male')
df.loc[df.Gender == 'Female', 'FirstName'] = names.get_first_name(gender = 'female')
But all I get in return is just two names:
Marital Gender FirstName
0 Single Female Kathleen
1 Married Male David
2 Married Male David
3 Single Male David
4 Married Female Kathleen
Is there a way to call this function separately for each row so not all males/females have the same exact name?
You need apply:
df['Firstname']=df['Gender'].str.lower().apply(names.get_first_name)
You can use a list comprehension:
df['Firstname']= [names.get_first_name(gender) for gender in df['Gender'].str.lower()]
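The reason the attempt in the question produces only two names is that the right-hand side is evaluated once, and the single resulting string is then broadcast to every selected row. A small sketch that counts the calls, using a made-up fake_first_name function as a stand-in for names.get_first_name so that it runs without the names package:

```python
import random
import pandas as pd

# Hypothetical stand-in for names.get_first_name; the name pools are invented.
POOL = {"male": ["David", "John", "Paul", "Mark"],
        "female": ["Kathleen", "Mary", "Ann", "Jane"]}
calls = []

def fake_first_name(gender):
    calls.append(gender)  # record every call
    return random.choice(POOL[gender])

df = pd.DataFrame({"Gender": ["Female", "Male", "Male", "Male", "Female"]})

# One call, one value, broadcast to every selected row.
df.loc[df["Gender"] == "Male", "Broadcast"] = fake_first_name("male")
calls_after_loc = len(calls)

# One call per row.
df["PerRow"] = df["Gender"].str.lower().apply(fake_first_name)
calls_after_apply = len(calls) - calls_after_loc

print(calls_after_loc, calls_after_apply)
```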
And here is a hack that reads all of the names by gender (together with their probabilities) and then samples randomly.
import names
import pandas as pd

def get_names(gender):
    if not isinstance(gender, str) or gender.lower() not in ('male', 'female'):
        raise ValueError('Invalid gender')
    # Open in text mode so that each line splits into str fields.
    with open(names.FILES['first:{}'.format(gender.lower())]) as fin:
        first_names = []
        probs = []
        for line in fin:
            # Only the name and its probability (in percent) are used.
            first_name, prob, _, _ = line.strip().split()
            first_names.append(first_name)
            probs.append(float(prob) / 100)
    return pd.DataFrame({'first_name': first_names, 'probability': probs})

def get_random_first_names(n, first_names_by_gender):
    # Sample n names with replacement, weighted by probability.
    first_names = (
        first_names_by_gender
        .sample(n, replace=True, weights='probability')
        .loc[:, 'first_name']
        .tolist()
    )
    return first_names
first_names = {gender: get_names(gender) for gender in ('Male', 'Female')}
>>> get_random_first_names(3, first_names['Male'])
['RICHARD', 'EDWARD', 'HOMER']
>>> get_random_first_names(4, first_names['Female'])
['JANICE', 'CAROLINE', 'DOROTHY', 'DIANE']
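If speed matters, use map (lowercase the values first, since the function expects 'male' or 'female'; the names returned are random):
list(map(names.get_first_name, df.Gender.str.lower()))
Out[51]: ['Harriett', 'Parker', 'Alfred', 'Debbie', 'Stanley']
#df['FN']=list(map(names.get_first_name, df.Gender.str.lower()))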
I have a dataframe df with two columns called 'MovieName' and 'Actors'. It looks like:
MovieName    Actors
lights out   Maria Bello
legend       Tom Hardy*Emily Browning*Christopher Eccleston*David Thewlis
Please note that different actor names are separated by '*'. I have another csv file called gender.csv which has the gender of all actors based on their first names. gender.csv looks like -
ActorName    Gender
Tom          male
Emily        female
Christopher  male
I want to add two columns in my dataframe 'female_actors' and 'male_actors' which contains the count of female and male actors in that particular movie respectively.
How do I achieve this task using both df and gender.csv in pandas?
Please note that -
If a particular name isn't present in gender.csv, don't count it in the total.
If there is just one actor in a movie and that name isn't present in gender.csv, then its count should be zero.
Result of above example should be -
MovieName    Actors                                                         male_actors  female_actors
lights out   Maria Bello                                                    0            0
legend       Tom Hardy*Emily Browning*Christopher Eccleston*David Thewlis   2            1
import pandas as pd

df1 = pd.DataFrame({'MovieName': ['lights out', 'legend'], 'Actors':['Maria Bello', 'Tom Hardy*Emily Browning*Christopher Eccleston*David Thewlis']})
df2 = pd.DataFrame({'ActorName': ['Tom', 'Emily', 'Christopher'], 'Gender':['male', 'female', 'male']})

def func(actors, gender):
    # Take the first name of each '*'-separated actor.
    actors = [act.split()[0] for act in actors.split('*')]
    # Count the rows of df2 with this gender whose name appears in the movie.
    n_gender = df2.Gender[(df2.Gender == gender) & df2.ActorName.isin(actors)].count()
    return n_gender

df1['male_actors'] = df1.Actors.apply(lambda x: func(x, 'male'))
df1['female_actors'] = df1.Actors.apply(lambda x: func(x, 'female'))
df1.to_csv('res.csv', index=False)
print(df1)
Output
Actors,MovieName,male_actors,female_actors
Maria Bello,lights out,0,0
Tom Hardy*Emily Browning*Christopher Eccleston*David Thewlis,legend,2,1
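One caveat: func counts matching rows of df2, so two actors who share a first name in the same movie would be counted once. If every actor should count, a variation is to look up each first name in a dictionary and sum the hits. The names name_to_gender and count_gender below are made up for this sketch:

```python
import pandas as pd

df1 = pd.DataFrame({
    "MovieName": ["lights out", "legend"],
    "Actors": ["Maria Bello",
               "Tom Hardy*Emily Browning*Christopher Eccleston*David Thewlis"],
})
df2 = pd.DataFrame({"ActorName": ["Tom", "Emily", "Christopher"],
                    "Gender": ["male", "female", "male"]})

# First name -> gender lookup built from the gender table.
name_to_gender = df2.set_index("ActorName")["Gender"].to_dict()

def count_gender(actors, gender):
    # The first token of each "*"-separated actor is treated as the first name.
    first_names = [actor.split()[0] for actor in actors.split("*")]
    # Names missing from the lookup return None and are not counted.
    return sum(name_to_gender.get(name) == gender for name in first_names)

df1["male_actors"] = df1["Actors"].apply(lambda a: count_gender(a, "male"))
df1["female_actors"] = df1["Actors"].apply(lambda a: count_gender(a, "female"))
print(df1)
```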