How can I add the gender column? - python

I have this dataset with 5 columns and lots of rows. I've been asked to get the total number of males and females. The values are strings and I can't figure it out. I also have to use numpy.
Please help.
Thanks
ls = gender.values.tolist()
top = []
for i in ls:
    if i == 'M':
        top.append(i)
        print(i)
I need the counts of male and female in the above dataset.

You can use value_counts from pandas to see the total number of occurrences of each value in a categorical column.
In your case, it would be:
df['gender'].value_counts()
It will return the counts of male and female.
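Since the question says numpy is required, here is a minimal sketch using np.unique with return_counts=True. The gender array below is made-up sample data; in the question it would come from the dataset's gender column:

```python
import numpy as np

# Hypothetical gender data; in practice this would be the dataset's column.
gender = np.array(['M', 'F', 'F', 'M', 'M'])

# np.unique with return_counts=True returns each distinct value
# together with how many times it occurs.
values, counts = np.unique(gender, return_counts=True)
print(dict(zip(values, counts)))  # {'F': 2, 'M': 3}
```

This gives the same result as value_counts, but stays inside numpy.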

Related

Get info about other columns in same row from pandas search

I have a .csv file that looks like the following:
Country Number
United 19
Ireland 17
Afghan 20
My goal is to use python-pandas to find the row with the smallest number, and get the country name of that row.
I know I can use this to get the value of the smallest number.
min = df['Number'].min()
How can I get the country name at the smallest number?
I couldn't figure out how to put in the variable "min" in an expression.
I would use a combination of finding the min and loc:
df = pd.DataFrame(data)
min_number = df['Number'].min()
min_index = df.loc[df['Number'] == min_number].index.values[0]
df.loc[min_index, 'Country']
The only downside to this is if you have multiple countries with the same minimal number; in that case you would have to provide more specs to determine the desired country.
If you expect the minimal value to be unique, use idxmin:
df.loc[df['Number'].idxmin(), 'Country']
Output: Ireland
If there are multiple minima, this will yield the first one.
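A self-contained sketch of the idxmin approach, using the sample data from the question:

```python
import pandas as pd

df = pd.DataFrame({
    'Country': ['United', 'Ireland', 'Afghan'],
    'Number': [19, 17, 20],
})

# idxmin returns the index label of the first minimal value;
# .loc then looks up the Country in that same row.
print(df.loc[df['Number'].idxmin(), 'Country'])  # Ireland
```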

Finding number of survived people in Titanic Dataset in Python

From the Titanic Dataset from Kaggle, I'm trying to extract how many people survived and how many died from the survived column. To do this, I imported the pandas library and saved the dataset in the variable dataframe and used the following code:
dataframe['survived'].value_counts()
Which gave me the output as
0 809
1 500
Name: survived, dtype: int64
From this, how do I print just the number of people who survived? Like if I want the count of 1, I need the output as 500. Same thing for when I want just the count of 0.
I tried the following code only to get a SyntaxError
dataframe['survived'].value_counts().1
I'm new to pandas, so I'd really appreciate it if anyone could help me with this!
For your case, you can use sum instead of value_counts because you have a binary column: 1 for survived, 0 for died, so the sum gives you the number of survivors:
>>> dataframe['survived'].sum()
500
If your column is not binary, you can use:
# 1 stands for the survivors here
>>> dataframe['survived'].eq(1).sum()
500
Here is an answer with more human-like logic.
Ask the data frame for all observations (rows) with survivors in it.
# some datasets would use 'yes', 'si', 'alive', ..
alive = 1
# eq() means equal; like ==
survivors = dataframe[dataframe.survived.eq(alive)]
And then count the observations (rows).
print(len(survivors))
You can use:
dataframe['survived'].value_counts()[0]
or:
dataframe['survived'].value_counts().loc[0]
The .column_name/.index_name attribute syntax is not recommended, as it restricts you to names that are valid Python identifiers. A name starting with a digit, such as 1, is not a valid identifier, which is why you get a SyntaxError.
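A minimal sketch tying the suggestions together, on a made-up survived column shaped like the one in the question (1 = survived, 0 = died):

```python
import pandas as pd

# Hypothetical data standing in for the Titanic survived column.
dataframe = pd.DataFrame({'survived': [1, 0, 1, 1, 0]})

print(dataframe['survived'].sum())                  # 3 survivors (binary column)
print(dataframe['survived'].eq(1).sum())            # 3, also works for non-binary columns
print(dataframe['survived'].value_counts().loc[0])  # 2 who died
```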

Finding the difference between two data-frames

I'm trying to find the largest income difference between male and female workers. But I'm not sure how to implement the code. I need some assistance.
aa=industries.F_weekly.max()
bb=industries.M_weekly.max()
cc = (nf.loc[nf['M_weekly'] == bb]) - (nf.loc[nf['F_weekly'] == aa])
cc.max()
cc.min()
Let's say your DataFrame is called df.
First, calculate the absolute value of the salary difference, then print its max. This can also be done in one line.
df['salary_delta'] = (df['M_weekly'] - df['F_weekly']).abs()
print(max(df['salary_delta']))
In case you want to find the row where salary difference is the highest then try:
df.loc[df['salary_delta'].idxmax()]
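A runnable sketch on made-up weekly earnings (the M_weekly/F_weekly column names come from the question; the numbers are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    'industry': ['Tech', 'Retail', 'Health'],
    'M_weekly': [1500, 700, 1100],
    'F_weekly': [1300, 720, 850],
})

# Absolute per-industry pay gap, then the largest one.
df['salary_delta'] = (df['M_weekly'] - df['F_weekly']).abs()
print(df['salary_delta'].max())             # 250
print(df.loc[df['salary_delta'].idxmax()])  # the full row with the largest gap
```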

Pandas groupby with minimal group size

I have a dataframe df of shape (450000, 15) containing information about users. Each line is a different user, with 13 characteristics (age, gender, home town, ...) and 1 boolean variable: whether or not the user has a car.
I would like to regroup my users to find out which groups have the most cars, but I need to keep at least 2500 users in a group to keep it statistically relevant.
test= df.groupby(['Gender'])
test.size() # check the groups size
Gender
Female 150000
Male 300000
dtype: int64
So far so good, I have way more than 2500 users per group. So I added another grouping criterion:
test2= df.groupby(['Gender','Age'])
test2.size()
Gender Age
Female <30 15022
30-90 134960
90+ 18
Male <20 5040
20-90 291930
90+ 3030
dtype: int64
As expected, I now have groups with very few users...
I would like to have something like this :
Gender
Female 150000 # Don't split here because groups will be too small
# Here I can split, because group size > 2500 :
Gender Age
Male <20 5040
20-90 291930
90+ 3030
dtype: int64
I didn't find a way to split a groupby dataframe based on a size criterion, so I was wondering what would be a pythonic way to handle this.
As I have 13 characteristics to group my users, I am also wondering about the grouping order : should I group by gender first and then by age, or the opposite? This has an impact when using multiple variables with a stop condition.
I don't need to use pandas, but I thought it would be appropriate. The output should look like :
name_of_group group_size
Female 150000
Male, <20 5040
Male, 20-90 291930
Male, 90+ 3030
groupby has to group on a "key" which must be separately computable for each row. That is, there's no way to group by some criterion that depends on aggregate characteristics that you won't know until after the group is created (like its size). You could write code that tries different groupings and uses some heuristic to decide which is "best", but there's nothing built in for this.
Do you want all the groups to have at least 2500 users?
You could do something like this:
# List of all sets of categories you want to test
group_ids_list = [['Gender'], ['Age'], ['Gender', 'Age']]
# Will be filled with groups that pass your test
valid_groups = []
group_sizes = {}
for group_ids in group_ids_list:
    grouped_df = df.groupby(group_ids)
    for key, group in grouped_df:
        if len(group) >= 2500:
            valid_groups.append(group)
            group_sizes[key] = len(group)
group_sizes = pd.Series(group_sizes)
And then you can work with only the valid groups.
Hope the pseudo-code helps; otherwise please provide a reproducible example.
I think FLab's answer is probably more complete and correct. But if you're after a quick fix:
column = 'Gender'
minimum_size = 2500
valid_groups = [g for g in set(df[column]) if sum(df[column] == g) >= minimum_size]
mask = df[column].isin(valid_groups)
df[mask].groupby(column)
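A small runnable sketch of the threshold idea on toy data. The Gender/Age values and the threshold of 3 (standing in for the 2500 in the question) are illustrative only:

```python
import pandas as pd

df = pd.DataFrame({
    'Gender': ['F'] * 6 + ['M'] * 4,
    'Age':    ['<30'] * 5 + ['30+'] + ['<30'] * 2 + ['30+'] * 2,
})
MIN_SIZE = 3  # stand-in for the 2500 in the question

# Try the finer grouping first; keep only groups that are large enough,
# and fall back to the coarser grouping for the rest.
fine = df.groupby(['Gender', 'Age']).size()
print(fine[fine >= MIN_SIZE])    # only (F, <30) survives the cut
coarse = df.groupby('Gender').size()
print(coarse[coarse >= MIN_SIZE])  # both F and M are large enough
```

This doesn't answer the grouping-order question, but it shows how to filter a set of candidate groupings by size.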

Calculating a probability based on several variables in a Pandas dataframe

I'm still very new to Python and Pandas, so bear with me...
I have a dataframe of passengers on a ship that sunk. I have broken this down into other dataframes by male and female, and also by class to create probabilities for survival. I made a function that compares one dataframe to a dataframe of only survivors, and calculates the probability of survival among this group:
def survivability(total_pass_df, column, value):
    survivors = sum(did_survive[column] == value)
    total = len(total_pass_df)
    survival_prob = round((survivors / total), 2)
    return survival_prob
But now I'm trying to compare survivability among smaller groups - male first class passengers vs female third class passengers for example. I did make dataframes for both of these groups, but I still can't use my survivability function because I'm comparing two different columns - sex and class - rather than just one.
I know exactly how I'd do it with Python - loop through the 'survived' column (which is either a 1 or 0), in the dataframe, if it equals 1, then add one to an index value, and once all the data has been gone through, divide the index value by the length of the dataframe to get the probability of survival....
But I'm supposed to use Pandas for this, and I can't for the life of me work out in my head how to do it....
:/
Without a sample of the data frames you're working with, I can't be sure if I understand your question correctly. But based on your description of the pure-Python procedure,
I know exactly how I'd do it with Python - loop through the 'survived' column (which is either a 1 or 0), in the dataframe, if it equals 1, then add one to an index value, and once all the data has been gone through, divide the index value by the length of the dataframe to get the probability of survival....
you can do this in Pandas by simply writing
dataframe['survived'].mean()
That's it. Given that all the values are either 1 or 0, the mean will be the number of 1's divided by the total number of rows.
If you start out with a data frame that has columns like survived, sex, class, and so on, you can elegantly combine this with Pandas' boolean indexing to pick out the survival rates for different groups. Let me use the Socialcops Titanic passengers data set as an example to demonstrate. Assuming the DataFrame is called df, if you want to analyze only male passengers, you can get those records as
df[df['sex'] == 'male']
and then you can take the survived column of that and get the mean.
>>> df[df['sex'] == 'male']['survived'].mean()
0.19198457888493475
So 19% of male passengers survived. If you want to narrow down to male second-class passengers, you'll need to combine the conditions using &, like this:
>>> df[(df['sex'] == 'male') & (df['pclass'] == 2)]['survived'].mean()
0.14619883040935672
This is getting a little unwieldy, but there's an easier way that actually lets you do multiple categories at once. (The catch is that this is a somewhat more advanced Pandas technique and it might take a while to understand it.) Using the DataFrame.groupby() method, you can tell Pandas to group the rows of the data frame according to their values in certain columns. For example,
df.groupby('sex')
tells Pandas to group the rows by their sex: all male passengers' records are in one group, and all female passengers' records are in another group. The thing you get from groupby() is not a DataFrame, it's a special kind of object that lets you apply aggregation functions - that is, functions which take a whole group and turn it into one number (or something). So, for example, if you do this
>>> df.groupby('sex').mean()
pclass survived age sibsp parch fare \
sex
female 2.154506 0.727468 28.687071 0.652361 0.633047 46.198097
male 2.372479 0.190985 30.585233 0.413998 0.247924 26.154601
body
sex
female 166.62500
male 160.39823
you see that for each column, Pandas takes the average of that column's values over the male passengers' records, and likewise over the female passengers' records. All you care about here is the survival rate, so just use
>>> df.groupby('sex').mean()['survived']
sex
female 0.727468
male 0.190985
One big advantage of this is that you can give more than one column to group by, if you want to look at small groups. For example, sex and class:
>>> df.groupby(['sex', 'pclass']).mean()['survived']
sex pclass
female 1 0.965278
2 0.886792
3 0.490741
male 1 0.340782
2 0.146199
3 0.152130
(you have to give groupby a list of column names if you're giving more than one)
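A self-contained sketch of the same pattern on toy data (the sex/pclass/survived column names follow the answer's Titanic example; the values are made up):

```python
import pandas as pd

df = pd.DataFrame({
    'sex':      ['male', 'male', 'female', 'female', 'male'],
    'pclass':   [1, 2, 1, 2, 2],
    'survived': [1, 0, 1, 1, 0],
})

# The mean of a 0/1 column is exactly the survival rate per group.
print(df.groupby('sex')['survived'].mean())
print(df.groupby(['sex', 'pclass'])['survived'].mean())
```

Selecting the 'survived' column before calling mean() avoids computing the mean of every numeric column only to discard most of them.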
Have you tried merging the two dataframes by passenger ID and then doing a pivot table in Pandas with whatever row subtotals and aggfunc=numpy.mean?
import pandas as pd
import numpy as np
# Passenger List
p_list = pd.DataFrame()
p_list['ID'] = [1,2,3,4,5,6]
p_list['Class'] = ['1','2','2','1','2','1']
p_list['Gender'] = ['M','M','F','F','F','F']
# Survivor List
s_list = pd.DataFrame()
s_list['ID'] = [1,2,3,4,5,6]
s_list['Survived'] = [1,0,0,0,1,0]
# Merge the datasets
merged = pd.merge(p_list,s_list,how='left',on=['ID'])
# Pivot to get sub means
result = pd.pivot_table(merged,index=['Class','Gender'],values=['Survived'],aggfunc=np.mean, margins=True)
# Reset the index
for x in range(result.index.nlevels - 1, -1, -1):
    result.reset_index(level=x, inplace=True)
print(result)
Class Gender Survived
0 1 F 0.000000
1 1 M 1.000000
2 2 F 0.500000
3 2 M 0.000000
4 All 0.333333
