Conditionally merge two dataframe columns - python

I have two columns in my data frame for gender derived from first name and middle name. I want to create a third column for overall gender. As such, where there is male or female in either column, it should override the unknown in the other column. I've written the following function but end up with the following error:
# Assign gender for names so that number of female names can be counted.
d = gender.Detector()
trnY['Gender_first'] = trnY['first'].map(lambda x: d.get_gender(x))
trnY['Gender_mid'] = trnY['middle'].map(lambda x: d.get_gender(x))
# merge the two gender columns:
def gender(x):
    if ['Gender_first'] == male or ['Gender_mid'] == male:
        return male
    elif ['Gender_first'] == female or ['Gender_mid'] == female:
        return female
    else:
        return unknown
trnY['Gender'] = trnY.apply(gender)
trnY
Error:
--> 50 trnY['Gender'] = trnY.apply(gender)
ValueError: Unable to coerce to Series, the length must be 21: given 1

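If you want to use apply() on rows, you should pass it the parameter axis=1. The function also has to index the row it receives (x), and compare against the strings 'male', 'female' and 'unknown' rather than bare names. In your case:
def gender(x):
    # x is one row of trnY because apply is called with axis=1
    if x['Gender_first'] == 'male' or x['Gender_mid'] == 'male':
        return 'male'
    elif x['Gender_first'] == 'female' or x['Gender_mid'] == 'female':
        return 'female'
    else:
        return 'unknown'
trnY['Gender'] = trnY.apply(gender, axis=1)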
This should solve your problem.
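The same rule can also be written without a row-wise apply. The following is a minimal sketch using numpy.select, assuming the two columns only hold the strings 'male', 'female' and 'unknown'; the sample frame is made up for illustration and is not the asker's data.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data standing in for the asker's trnY frame.
trnY = pd.DataFrame({
    "Gender_first": ["male", "unknown", "unknown", "female"],
    "Gender_mid": ["unknown", "female", "unknown", "male"],
})

first, mid = trnY["Gender_first"], trnY["Gender_mid"]

# Conditions are checked in order, so 'male' wins when both appear,
# matching the if/elif order of the function above.
conditions = [
    (first == "male") | (mid == "male"),
    (first == "female") | (mid == "female"),
]
trnY["Gender"] = np.select(conditions, ["male", "female"], default="unknown")
print(trnY)
```

As in the function above, a row with 'female' in one column and 'male' in the other ends up as 'male', because the first matching condition is used.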

Related

Pandas - How can I iterate through a column to put respondents into appropriate bins?

I have a DataFrame called df3 with 2 columns - 'fan' and 'Household Income' as seen below. I'm trying to iterate through the 'Household Income' column and if the value of the column is '$0 - $24,999', add it to bin 'low_inc'. If the value of the column is '$25,000 - $49,999', add it to bin 'lowmid_inc', etc. But I'm getting an error saying 'int' object is not iterable.
df3 = df_hif.dropna(subset=['Household Income', 'fan'],how='any')
low_inc = []
lowmid_inc = []
mid_inc = []
midhigh_inc = []
high_inc = []
for inc in df3['Household Income']:
    if inc == '$0 - $24,999':
        low_inc += 1
    elif inc == '$25,000 - $49,999':
        lowmid_inc += 1
    elif inc == '$50,000 - $99,999':
        mid_inc += 1
    elif inc == '$100,000 - $149,999':
        midhigh_inc += 1
    else:
        high_inc += 1
#print(low_inc)
Here is a sample of 5 rows of the df used:
       Household Income  fan
774   $25,000 - $49,999  Yes
290   $50,000 - $99,999   No
795   $50,000 - $99,999  Yes
926           $150,000+   No
1017          $150,000+  Yes
The left column (774, 290, etc.) is the index, showing the respondent's ID. The 5 ranges of the 'Household Income' column are listed above in my if/else statement, but I'm receiving an error when I try to print out the bins.
For each respondent, I'm trying to add 1 to the buckets 'low_bin', 'high_bin', etc. So I'm trying to count the number of respondents that have a household income between 0-24999, 25000-49000, etc. How can I iterate through a column to count the number of respondents into the appropriate bins?
Iterating over rows in pandas is rarely the best approach.
You can split the rows into separate dataframes:
low_inc = df3[df3['Household Income'] == '$0 - $24,999']
lowmid_inc = df3[df3['Household Income'] == '$25,000 - $49,999']
etc...
len(low_inc), for example, then gives you the number of rows in that dataframe.
Alternatively, try groupby:
df3.groupby('Household Income').count()
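If only the per-bracket counts are needed, value_counts on the column is another option. This is a small sketch built on the five sample rows from the question; the variable names income_counts and low_count are illustrative.

```python
import pandas as pd

# The five sample rows from the question, indexed by respondent ID.
df3 = pd.DataFrame(
    {
        "Household Income": [
            "$25,000 - $49,999",
            "$50,000 - $99,999",
            "$50,000 - $99,999",
            "$150,000+",
            "$150,000+",
        ],
        "fan": ["Yes", "No", "Yes", "No", "Yes"],
    },
    index=[774, 290, 795, 926, 1017],
)

# One count per distinct income bracket that occurs in the column.
income_counts = df3["Household Income"].value_counts()

# Brackets that never occur are absent from the result, so use .get with a default.
low_count = int(income_counts.get("$0 - $24,999", 0))
print(income_counts)
print(low_count)
```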
If the incomes were stored as numbers rather than range strings, I would simply use a histogram:
income = df3['Household Income']
bins = int((max(income) - min(income)) / 25000)
out = income.hist(bins=bins)
Finally, sum the results over the related bins. For example, 25000-50000 corresponds to 1 bin, whereas 50000-100000 spans 2 bins.

can anyone explain to me why this apply() method isn't working?

This doesn't work:
def rator(row):
    if row['country'] == 'Canada':
        row['stars'] = 3
    elif row['points'] >= 95:
        row['stars'] = 3
    elif row['points'] >= 85:
        row['stars'] = 2
    else:
        row['stars'] = 1
    return row
with_stars = reviews.apply(rator, axis='columns')
But this works:
def rator(row):
    if row['country'] == 'Canada':
        return 3
    elif row['points'] >= 95:
        return 3
    elif row['points'] >= 85:
        return 2
    else:
        return 1
with_stars = reviews.apply(rator, axis='columns')
I'm practicing on Kaggle, and reading through their tutorial as well as the documentation. I am a bit confused by the concept.
I understand that the apply() method acts on an entire row of a DataFrame, while map() acts on each element in a column. And that it's supposed to return a DataFrame, while map() returns a Series.
Just not sure how the mechanics work here, since it's not letting me return rows inside the function...
some of the data:
country description designation points price province region_1 region_2 taster_name taster_twitter_handle title variety winery
0 Italy Aromas include tropical fruit, broom, brimston... Vulkà Bianco -1.447138 NaN Sicily & Sardinia Etna NaN Kerin O’Keefe #kerinokeefe Nicosia 2013 Vulkà Bianco (Etna) White Blend Nicosia
1 Portugal This is ripe and fruity, a wine that is smooth... Avidagos -1.447138 15.0 Douro NaN NaN Roger Voss #vossroger Quinta dos Avidagos 2011 Avidagos Red (Douro) Portuguese Red Quinta dos Avidagos
Index(['country', 'description', 'designation', 'points', 'price', 'province',
'region_1', 'region_2', 'taster_name', 'taster_twitter_handle', 'title',
'variety', 'winery'],
dtype='object')
https://www.kaggle.com/residentmario/summary-functions-and-maps
When you use apply, the function is applied iteratively to each row (or column, depending on the axis parameter). When the function returns a scalar, as in your second piece of code, the return value of apply is not a DataFrame but a Series built from those return values. That means that your second piece of code returns the star rating of each row, which is used to build a new Series. So a better name for storing the return value is star_ratings instead of with_stars.
If you want to append this Series to your original dataframe you can use:
star_ratings = reviews.apply(rator, axis='columns')
reviews['stars'] = star_ratings
or, more succinctly:
reviews['stars'] = reviews.apply(rator, axis='columns')
As for why your first piece of code does not work, it is because you are trying to add a new column by modifying the row: you are not supposed to mutate the passed object. The official docs state:
Functions that mutate the passed object can produce unexpected behavior or errors and are not supported.
To better understand the differences between map and apply please see the different responses to this question, as they present many different and correct viewpoints.
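To see what apply hands back in practice, here is a small sketch on a made-up two-row frame (not the Kaggle data). It shows that a function returning a scalar yields a Series, while a function returning the row itself yields a DataFrame.

```python
import pandas as pd

# Hypothetical miniature version of the reviews frame.
reviews = pd.DataFrame({"country": ["Canada", "Italy"], "points": [80, 96]})

def rator(row):
    # Returns a single number per row and leaves the row untouched.
    if row["country"] == "Canada" or row["points"] >= 95:
        return 3
    if row["points"] >= 85:
        return 2
    return 1

# Scalar return values are collected into a Series with one entry per row.
star_ratings = reviews.apply(rator, axis="columns")

# Returning the (unmodified) row makes apply assemble a DataFrame instead.
same_rows = reviews.apply(lambda row: row, axis="columns")

print(type(star_ratings).__name__, type(same_rows).__name__)
```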
You shouldn't use apply with a function that modifies the input. You could change your code to this:
def rator(row):
    new_row = row.copy()
    if row['country'] == 'Canada':
        new_row['stars'] = 3
    elif row['points'] >= 95:
        new_row['stars'] = 3
    elif row['points'] >= 85:
        new_row['stars'] = 2
    else:
        new_row['stars'] = 1
    return new_row
with_stars = reviews.apply(rator, axis='columns')
However, it's simpler to just return the column you care about rather than returning an entire dataframe just to change one column. If you write rator to return just one column, but you want to have an entire dataframe, you can do with_stars = reviews.copy() and then with_stars['stars'] = reviews.apply(rator, axis='columns'). Also, if an if branch ends with a return, you can do just if after it rather than elif. You can also simplify your code with cut.
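As a rough illustration of the cut suggestion, the sketch below bins the points with pd.cut and then overrides the rating for Canada. The four-row frame is made up, and the thresholds (85 and 95) and the Canada rule are taken from the rator function above.

```python
import numpy as np
import pandas as pd

# Hypothetical sample rows; only the two columns used by rator are included.
reviews = pd.DataFrame({
    "country": ["Canada", "Italy", "Portugal", "France"],
    "points": [80, 96, 88, 70],
})

# right=False makes the bins [-inf, 85), [85, 95), [95, inf),
# which matches the >= comparisons in rator.
stars = pd.cut(
    reviews["points"],
    bins=[-np.inf, 85, 95, np.inf],
    labels=[1, 2, 3],
    right=False,
).astype(int)

# Keep the binned value except for Canadian wines, which always get 3 stars.
with_stars = reviews.copy()
with_stars["stars"] = stars.where(reviews["country"] != "Canada", 3)
print(with_stars)
```

Working on a copy keeps the original reviews frame unchanged, in line with the advice above.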

multi query a dataframe table

I have a dataframe as follows:
student_id  gender  major      admitted
35377       female  Chemistry  False
56105       male    Physics    True
etc.
How do I find the admission rate for females?
I have tried:
df.loc[(df['gender'] == "female") & (df['admitted'] == "True")].sum()
But this returns an error:
TypeError: invalid type comparison
I guess the last column is Boolean. Can you try this:
df[df['gender'] == "female"]['admitted'].sum()
This counts the admitted females; use .mean() instead of .sum() to get the admission rate.
Remove that .loc and, since admitted is Boolean, use the column itself as the mask instead of comparing it with the string "True": len(df[(df['gender'] == "female") & df['admitted']])
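For the admission rate itself, the mean of a Boolean column gives the fraction of True values. A minimal sketch, assuming admitted is stored as Boolean; the first two rows come from the question and the other two are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "student_id": [35377, 56105, 11111, 22222],  # last two IDs are hypothetical
    "gender": ["female", "male", "female", "female"],
    "major": ["Chemistry", "Physics", "Physics", "Chemistry"],
    "admitted": [False, True, True, False],
})

# Fraction of female applicants who were admitted.
female_admission_rate = df.loc[df["gender"] == "female", "admitted"].mean()

# The same calculation for every gender at once.
rate_by_gender = df.groupby("gender")["admitted"].mean()

print(female_admission_rate)
print(rate_by_gender)
```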

How to add a header name next to its cell value in python

I have this table as an input and I would like to add the name of the header to its corresponding cells before converting it to a dataframe.
I am generating association rules after converting the table to a dataframe, and it is not clear which column each antecedent/consequent in a rule belongs to.
Example for the first column of my desired table:
Age
Age = 45
Age = 30
Age = 45
Age = 80
.
.
and so on for the rest of the columns. What is the best way to access each column and rewrite them? And is there a better solution to reference my values after generating association rules other than adding the name of the header to each cell?
Here is one way to add the column names to all cells:
import pandas as pd
df = pd.DataFrame({'age':[1,2],'sex':['M','F']})
# convert every cell to a string so it can be combined with the column name
df = df.astype(str)
for c in df.columns:
    df[c] = df[c].apply(lambda s: "{} = {}".format(c,s))
This yields:
       age      sex
0  age = 1  sex = M
1  age = 2  sex = F
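The loop can also be collapsed into a single column-wise apply. This is a sketch of one possible variant on the same two-column example; the name labeled is illustrative.

```python
import pandas as pd

df = pd.DataFrame({"age": [1, 2], "sex": ["M", "F"]})

# apply passes each column as a Series whose .name is the column header,
# so the header can be prepended to every (string-converted) cell at once.
labeled = df.astype(str).apply(lambda col: col.name + " = " + col)
print(labeled)
```

The original df is left untouched, which makes it easy to map a rule item such as "age = 1" back to the raw value.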

Create Names column in Pandas DataFrame

I am using the Python Package names to generate some first names for QA testing.
The names package contains the function names.get_first_name(gender) which allows either the string male or female as the parameter. Currently I have the following DataFrame:
Marital Gender
0 Single Female
1 Married Male
2 Married Male
3 Single Male
4 Married Female
I have tried the following:
df.loc[df.Gender == 'Male', 'FirstName'] = names.get_first_name(gender = 'male')
df.loc[df.Gender == 'Female', 'FirstName'] = names.get_first_name(gender = 'female')
But all I get in return is just two names:
Marital Gender FirstName
0 Single Female Kathleen
1 Married Male David
2 Married Male David
3 Single Male David
4 Married Female Kathleen
Is there a way to call this function separately for each row so not all males/females have the same exact name?
You need apply:
df['Firstname']=df['Gender'].str.lower().apply(names.get_first_name)
You can use a list comprehension:
df['Firstname']= [names.get_first_name(gender) for gender in df['Gender'].str.lower()]
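The reason the original attempt produced only two names is that the right-hand side is evaluated once and the single result is broadcast to every matching row. The sketch below illustrates the per-row call with a stand-in generator; fake_first_name and NAME_POOLS are hypothetical and merely take the place of names.get_first_name so that the example runs without the names package.

```python
import random

import pandas as pd

# Hypothetical stand-in for names.get_first_name: picks a random name per call.
NAME_POOLS = {
    "male": ["David", "Alfred", "Stanley"],
    "female": ["Kathleen", "Debbie", "Harriett"],
}

def fake_first_name(gender):
    return random.choice(NAME_POOLS[gender])

df = pd.DataFrame({
    "Marital": ["Single", "Married", "Married", "Single", "Married"],
    "Gender": ["Female", "Male", "Male", "Male", "Female"],
})

# apply calls the function once per row, so each row gets its own random draw.
df["FirstName"] = df["Gender"].str.lower().apply(fake_first_name)
print(df)
```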
And here is a hack that reads all of the names by gender (together with their probabilities), and then randomly samples.
import names
import pandas as pd
def get_names(gender):
    # Validate the input before building the lookup key.
    if not isinstance(gender, str) or gender.lower() not in ('male', 'female'):
        raise ValueError('Invalid gender')
    first_names = []
    probs = []
    # Each line holds a name, a probability (divided by 100 below), and two fields that are ignored.
    with open(names.FILES['first:{}'.format(gender.lower())], 'r') as fin:
        for line in fin:
            first_name, prob, _, _ = line.strip().split()
            first_names.append(first_name)
            probs.append(float(prob) / 100)
    return pd.DataFrame({'first_name': first_names, 'probability': probs})
def get_random_first_names(n, first_names_by_gender):
    # Sample n names with replacement, weighted by their probability.
    first_names = (
        first_names_by_gender
        .sample(n, replace=True, weights='probability')
        .loc[:, 'first_name']
        .tolist()
    )
    return first_names
first_names = {gender: get_names(gender) for gender in ('Male', 'Female')}
>>> get_random_first_names(3, first_names['Male'])
['RICHARD', 'EDWARD', 'HOMER']
>>> get_random_first_names(4, first_names['Female'])
['JANICE', 'CAROLINE', 'DOROTHY', 'DIANE']
If speed matters, use map:
list(map(names.get_first_name,df.Gender))
Out[51]: ['Harriett', 'Parker', 'Alfred', 'Debbie', 'Stanley']
#df['FN']=list(map(names.get_first_name,df.Gender))
