Calculating a probability based on several variables in a Pandas dataframe - python

I'm still very new to Python and Pandas, so bear with me...
I have a dataframe of passengers on a ship that sunk. I have broken this down into other dataframes by male and female, and also by class to create probabilities for survival. I made a function that compares one dataframe to a dataframe of only survivors, and calculates the probability of survival among this group:
def survivability(total_pass_df, column, value):
    survivors = sum(did_survive[column] == value)
    total = len(total_pass_df)
    survival_prob = round((survivors / total), 2)
    return survival_prob
But now I'm trying to compare survivability among smaller groups - male first-class passengers vs. female third-class passengers, for example. I did make dataframes for both of these groups, but I still can't use my survivability function because I'm comparing two different columns - sex and class - rather than just one.
I know exactly how I'd do it with Python - loop through the 'survived' column (which is either a 1 or 0), in the dataframe, if it equals 1, then add one to an index value, and once all the data has been gone through, divide the index value by the length of the dataframe to get the probability of survival....
But I'm supposed to use Pandas for this, and I can't for the life of me work out in my head how to do it....
:/

Without a sample of the data frames you're working with, I can't be sure if I understand your question correctly. But based on your description of the pure-Python procedure,
I know exactly how I'd do it with Python - loop through the 'survived' column (which is either a 1 or 0), in the dataframe, if it equals 1, then add one to an index value, and once all the data has been gone through, divide the index value by the length of the dataframe to get the probability of survival....
you can do this in Pandas by simply writing
dataframe['survived'].mean()
That's it. Given that all the values are either 1 or 0, the mean will be the number of 1's divided by the total number of rows.
If you start out with a data frame that has columns like survived, sex, class, and so on, you can elegantly combine this with Pandas' boolean indexing to pick out the survival rates for different groups. Let me use the Socialcops Titanic passengers data set as an example to demonstrate. Assuming the DataFrame is called df, if you want to analyze only male passengers, you can get those records as
df[df['sex'] == 'male']
and then you can take the survived column of that and get the mean.
>>> df[df['sex'] == 'male']['survived'].mean()
0.19198457888493475
So 19% of male passengers survived. If you want to narrow down to male second-class passengers, you'll need to combine the conditions using &, like this:
>>> df[(df['sex'] == 'male') & (df['pclass'] == 2)]['survived'].mean()
0.14619883040935672
This is getting a little unwieldy, but there's an easier way that actually lets you do multiple categories at once. (The catch is that this is a somewhat more advanced Pandas technique and it might take a while to understand it.) Using the DataFrame.groupby() method, you can tell Pandas to group the rows of the data frame according to their values in certain columns. For example,
df.groupby('sex')
tells Pandas to group the rows by their sex: all male passengers' records are in one group, and all female passengers' records are in another group. The thing you get from groupby() is not a DataFrame, it's a special kind of object that lets you apply aggregation functions - that is, functions which take a whole group and turn it into one number (or something). So, for example, if you do this
>>> df.groupby('sex').mean()
          pclass  survived        age     sibsp     parch       fare  \
sex
female  2.154506  0.727468  28.687071  0.652361  0.633047  46.198097
male    2.372479  0.190985  30.585233  0.413998  0.247924  26.154601

             body
sex
female  166.62500
male    160.39823
you see that, for each column, Pandas takes the average of that column's values over all the male passengers' records, and likewise over all the female passengers' records. All you care about here is the survival rate, so just use
>>> df.groupby('sex').mean()['survived']
sex
female 0.727468
male 0.190985
One big advantage of this is that you can give more than one column to group by, if you want to look at small groups. For example, sex and class:
>>> df.groupby(['sex', 'pclass']).mean()['survived']
sex pclass
female 1 0.965278
2 0.886792
3 0.490741
male 1 0.340782
2 0.146199
3 0.152130
(You have to give groupby a list of column names if you're grouping by more than one.)
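To see the whole pipeline end to end, here is a minimal, self-contained sketch with made-up records (the data is invented; only the column names mirror the Titanic set discussed above):

```python
import pandas as pd

# Toy passenger records: 'survived' is 1 or 0, so its mean is a survival rate
df = pd.DataFrame({
    'sex':      ['male', 'male', 'female', 'female', 'male', 'female'],
    'pclass':   [1, 2, 1, 2, 2, 1],
    'survived': [1, 0, 1, 1, 0, 0],
})

# Overall survival rate: mean of a 0/1 column
overall = df['survived'].mean()

# Boolean indexing: survival rate for one subgroup
male_rate = df[df['sex'] == 'male']['survived'].mean()

# groupby: survival rate for every sex/class combination at once
by_group = df.groupby(['sex', 'pclass'])['survived'].mean()

print(overall)                  # 3 survivors out of 6 -> 0.5
print(by_group[('female', 1)])  # 1 of 2 first-class women -> 0.5
```

Note that `df.groupby(...)['survived'].mean()` selects the column before aggregating, which computes the same numbers as `.mean()['survived']` but avoids averaging every other column only to throw the results away.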

Have you tried merging the two dataframes on passenger ID and then building a pivot table in Pandas with whatever row groupings you need and aggfunc=numpy.mean?
import pandas as pd
import numpy as np

# Passenger list
p_list = pd.DataFrame()
p_list['ID'] = [1, 2, 3, 4, 5, 6]
p_list['Class'] = ['1', '2', '2', '1', '2', '1']
p_list['Gender'] = ['M', 'M', 'F', 'F', 'F', 'F']

# Survivor list
s_list = pd.DataFrame()
s_list['ID'] = [1, 2, 3, 4, 5, 6]
s_list['Survived'] = [1, 0, 0, 0, 1, 0]

# Merge the datasets
merged = pd.merge(p_list, s_list, how='left', on=['ID'])

# Pivot to get sub-group means
result = pd.pivot_table(merged, index=['Class', 'Gender'], values=['Survived'],
                        aggfunc=np.mean, margins=True)

# Reset the index
for x in range(result.index.nlevels - 1, -1, -1):
    result.reset_index(level=x, inplace=True)
print(result)
  Class Gender  Survived
0     1      F  0.000000
1     1      M  1.000000
2     2      F  0.500000
3     2      M  0.000000
4   All         0.333333
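For what it's worth, the same per-group means (minus the margins row) can also be read straight off the merged frame with a groupby; a sketch using the same toy data as above:

```python
import pandas as pd

p_list = pd.DataFrame({'ID': [1, 2, 3, 4, 5, 6],
                       'Class': ['1', '2', '2', '1', '2', '1'],
                       'Gender': ['M', 'M', 'F', 'F', 'F', 'F']})
s_list = pd.DataFrame({'ID': [1, 2, 3, 4, 5, 6],
                       'Survived': [1, 0, 0, 0, 1, 0]})
merged = pd.merge(p_list, s_list, how='left', on=['ID'])

# Same per-(Class, Gender) means as the pivot table body
rates = merged.groupby(['Class', 'Gender'])['Survived'].mean()
print(rates)
```

The pivot-table version earns its keep mainly through `margins=True`, which adds the overall "All" row for free.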

Related

How to use pandas to count rows in which two columns must have one specific string from a specified set of strings for each column?

I have a dataset that includes, among other things, a column for level of education and one for yearly salary (represented, for some godforsaken reason, as >50k, >=50k, <50k, etc.). I need to figure out how many people with higher education - AKA a bachelors, masters, or doctorate - make more than 50k. That means I need to select the rows in which the education column is doctorate, bachelors, or masters, AND the first character of the salary column is '>'. What is the proper syntax for that? Will give more information if needed. Please help.
To select only people with higher education you can use isin, passing the list of education degrees. For the yearly salary, note that if you test only against > (e.g. str.startswith('>')), you will also pick up the rows where Year_Salary is >=50k, which can include people making exactly 50k.
import pandas as pd
import numpy as np

# setup
np.random.seed(42)
d = {
    'Year_Salary': np.random.choice(['>50k', '>=50k', '<50k'], size=(50,)),
    'Education': np.random.choice(['doctorate', 'bachelors', 'masters', 'undergraduate'], size=(50,))
}
df = pd.DataFrame(d)

# code
filtered_df = df[df['Education'].isin(['doctorate', 'bachelors', 'masters'])
                 & df['Year_Salary'].str.startswith('>')]
print(filtered_df)
print(filtered_df.shape[0])  # 20 (number of matches)
Output from filtered_df
Year_Salary Education
1 >50k doctorate
4 >50k bachelors
7 >=50k masters
14 >=50k masters
...
To get only the rows where Year_Salary is strictly greater than 50k you could use str.match with the regex ^>\d+ - a string that starts with a literal > followed by one or more digits.
df[df['Education'].isin(['doctorate','bachelors','masters']) & (df['Year_Salary'].str.match(r'^>\d+'))]
You can use a statement like the one below to filter the dataframe on both conditions:
newdf = df[(df.val > 0.5) & (df.val2 == 1)]
OR
you can iterate through the rows and update the column. Refer to the code below:
for index, row in df.iterrows():
    ....
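For completeness, the elided iterrows loop might look like the sketch below (toy data invented for the demo). It gives the same count as the vectorized filter, but row-by-row iteration is much slower and generally discouraged for this kind of job:

```python
import pandas as pd

# Tiny made-up frame with the same two columns as the question
df = pd.DataFrame({'Education': ['masters', 'undergraduate', 'doctorate'],
                   'Year_Salary': ['>50k', '>50k', '<50k']})

count = 0
for index, row in df.iterrows():
    # Same two conditions as the vectorized version, applied one row at a time
    if (row['Education'] in ('doctorate', 'bachelors', 'masters')
            and row['Year_Salary'].startswith('>')):
        count += 1

print(count)  # only the 'masters' row satisfies both conditions -> 1
```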

How to append dataframe with selected columns having higher feature score

Hi, I am new to Python; let me know if the question is not clear.
Here is my dataframe:
df = pd.DataFrame(df_test)
age bmi children charges
0 19 27.900 0 16884.92400
1 18 33.770 1 1725.55230
2 28 33.000 3 4449.46200
3 33 22.705 0 21984.47061
I am applying SelectKBest feature selection using the chi-squared test to this numerical data:
from sklearn.feature_selection import SelectKBest, chi2

X_clf = numeric_data.iloc[:, 0:(col_len - 1)]
y_clf = numeric_data.iloc[:, -1]
bestfeatures = SelectKBest(score_func=chi2, k=2)
fit = bestfeatures.fit(X_clf, y_clf)
dfscores = pd.DataFrame(fit.scores_)
dfcolumns = pd.DataFrame(X_clf.columns)
featureScores = pd.concat([dfcolumns, dfscores], axis=1)
This is my output:
Feature Score
0 age 6703.764216
1 bmi 1592.481991
2 children 1752.136519
I now wish to reduce my dataframe so that it contains only the features with the 2 highest scores. However, I wish to do so without hardcoding the column names.
I have tried storing the column names in a list and keeping those with the highest scores, but I am getting a ValueError. Is there any method/function I could try that stores the selected columns and then keeps them based on their scores?
Expected output - column 'bmi' is gone as it has the lowest of the 3 scores:
age children charges
0 19 0 16884.92400
1 18 1 1725.55230
2 28 3 4449.46200
3 33 0 21984.47061
So first you want to find out which features have the largest scores, and from that, the names of the columns you do not want to keep. Using the featureScores frame built in the question:
colToDrop = featureScores.nsmallest(len(featureScores) - 2, 'Score')['Feature'].values
Next we just filter the original df, dropping those columns from its column list:
df[df.columns.drop(colToDrop)]
I believe you need to work on the dataframe featureScores to keep the first 2 features with the highest Score and then use this values as a list to filter the columns in the original dataframe. Something along the lines of:
important_features = featureScores.sort_values('Score',ascending=False)['Feature'].values.tolist()[:2] + ['charges']
filtered_df = df[important_features]
The sort_values() call makes sure the features (in case there are more) are sorted from highest score to lowest. We then take the first 2 values of the Feature column (already sorted) with .values.tolist()[:2]. Since you also want to include the charges column in your output, we append it manually with + ['charges'] to our list of important_features.
Finally, we create filtered_df by selecting only the important_features columns from the original df.
Edit based on comments:
If you can guarantee charges will be the last column in your original df then you can simply do:
important_features = featureScores.sort_values('Score',ascending=False)['Feature'].values.tolist()[:2] + [df.columns[-1]]
filtered_df = df[important_features]
I see you have previously defined your y column with y_clf = numeric_data.iloc[:,-1]; since that is a Series, you can then use [y_clf.name] or [df.columns[-1]] - either should work fine.
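Putting that together on a toy version of the data (feature scores copied from the question's output, rounded):

```python
import pandas as pd

df = pd.DataFrame({'age': [19, 18, 28],
                   'bmi': [27.9, 33.77, 33.0],
                   'children': [0, 1, 3],
                   'charges': [16884.924, 1725.5523, 4449.462]})
featureScores = pd.DataFrame({'Feature': ['age', 'bmi', 'children'],
                              'Score': [6703.76, 1592.48, 1752.14]})

# Top-2 features by score, plus the target column, with nothing hardcoded
important_features = (featureScores.sort_values('Score', ascending=False)
                      ['Feature'].values.tolist()[:2] + [df.columns[-1]])
filtered_df = df[important_features]

print(list(filtered_df.columns))  # ['age', 'children', 'charges'] - 'bmi' dropped
```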

How do I groupby a dataframe based on values that are common to multiple columns?

I am trying to aggregate a dataframe based on values that are found in two columns. I am trying to aggregate the dataframe such that the rows that have some value X in either column A or column B are aggregated together.
More concretely, I am trying to do something like this. Let's say I have a dataframe gameStats:
awayTeam homeTeam awayGoals homeGoals
Chelsea Barca 1 2
R. Madrid Barca 2 5
Barca Valencia 2 2
Barca Sevilla 1 0
... and so on
I want to construct a dataframe such that among my rows I would have something like:
team goalsFor goalsAgainst
Barca 10 5
One obvious solution, since the set of unique elements is small, is something like this:
for team in teamList:
    aggregateDf = gameStats[(gameStats['homeTeam'] == team) | (gameStats['awayTeam'] == team)]
    # do other manipulations of the data, then append it to a final dataframe
However, going through a loop seems less elegant. And since I have had this problem before with many unique identifiers, I was wondering if there was a way to do this without using a loop as that seems very inefficient to me.
The solution has two steps: first compute the goals for each team when they are home and when they are away, then combine them. Something like:
goals_when_away = gameStats.groupby(['awayTeam'])[['awayGoals', 'homeGoals']].agg('sum').reset_index().sort_values('awayTeam')
goals_when_home = gameStats.groupby(['homeTeam'])[['homeGoals', 'awayGoals']].agg('sum').reset_index().sort_values('homeTeam')
then combine them:
np_result = goals_when_away.iloc[:, 1:].values + goals_when_home.iloc[:, 1:].values
pd_result = pd.DataFrame(np_result, columns=['goal_for', 'goal_against'])
result = pd.concat([goals_when_away.iloc[:, :1], pd_result], axis=1, ignore_index=True)
Note the .values when summing, to get the result as a numpy array, and ignore_index=True in the concat; both avoid the pandas trap of aligning by column and index names. (This also assumes every team appears at least once both home and away, so the two sorted frames line up row for row.)
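An alternative that avoids the positional-alignment trap entirely is to reshape each game into two team-centric rows and do a single groupby; a sketch on the sample data from the question:

```python
import pandas as pd

gameStats = pd.DataFrame({
    'awayTeam':  ['Chelsea', 'R. Madrid', 'Barca', 'Barca'],
    'homeTeam':  ['Barca', 'Barca', 'Valencia', 'Sevilla'],
    'awayGoals': [1, 2, 2, 1],
    'homeGoals': [2, 5, 2, 0],
})

# One row per (game, team): rename each half so the columns line up
home = gameStats.rename(columns={'homeTeam': 'team', 'homeGoals': 'goalsFor',
                                 'awayGoals': 'goalsAgainst'})[['team', 'goalsFor', 'goalsAgainst']]
away = gameStats.rename(columns={'awayTeam': 'team', 'awayGoals': 'goalsFor',
                                 'homeGoals': 'goalsAgainst'})[['team', 'goalsFor', 'goalsAgainst']]

result = pd.concat([home, away]).groupby('team').sum()
print(result.loc['Barca'])  # goalsFor 10, goalsAgainst 5, matching the expected output
```

Because every row is keyed by the team name before grouping, teams that only ever play home (or only away) are handled correctly with no index gymnastics.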

Pandas groupby with minimal group size

I have a dataframe df of shape (450000, 15) containing information about users; each line is a different user with 13 characteristics (age, gender, home town...) and 1 boolean variable: whether or not the user has a car.
I would like to group my users to find out which groups have the most cars, but I need to keep at least 2500 users per group for the result to stay statistically relevant.
test= df.groupby(['Gender'])
test.size() # check the groups size
Gender
Female 150000
Male 300000
dtype: int64
So far so good, I have way more than 2500 users per group. So I added another grouping criterion:
test2= df.groupby(['Gender','Age'])
test2.size()
Gender Age
Female <30 15022
30-90 134960
90+ 18
Male <20 5040
20-90 291930
90+ 3030
dtype: int64
As we can expect, I now have groups with very few users...
I would like to have something like this :
Gender
Female 150 000 # Don't split here because groups will be too small
# Here I can split, because group size > 2500 :
Gender Age
Male <20 5040
20-90 291930
90+ 3030
dtype: int64
I didn't find a way to group a groupby dataframe based on a size criterion, so I was wondering what a pythonic way to handle this would be.
As I have 13 characteristics to group my users, I am also wondering about the grouping order : should I group by gender first and then by age, or the opposite? This has an impact when using multiple variables with a stop condition.
I don't need to use pandas, but I thought it would be appropriate. The output should look like :
name_of_group group_size
Female 150000
Male, <20 5040
Male, 20-90 291930
Male, 90+ 3030
groupby has to group on a "key" which must be separately computable for each row. That is, there's no way to group by some criterion that depends on aggregate characteristics that you won't know until after the group is created (like its size). You could write code that tries different groupings and uses some heuristic to decide which is "best", but there's nothing built in for this.
Do you want all the groups to have at least 2500 users?
You could do something like this:
# List of all sets of categories you want to test
group_ids_list = [['Gender'], ['Age'], ['Gender', 'Age']]

# Will be filled with groups that pass your test
valid_groups = []
group_sizes = {}

for group_ids in group_ids_list:
    grouped_df = df.groupby(group_ids)
    for key, group in grouped_df:
        if len(group) >= 2500:
            valid_groups.append(group)
            group_sizes[key] = len(group)

group_sizes = pd.Series(group_sizes)
And then you can work using only the valid groupers.
Hope the pseudo-code helps, otherwise please provide a reproducible example.
I think FLab's answer is probably more complete and correct. But if you're after a quick fix;
column = 'Gender'
minimum_size = 2500
valid_groups = [g for g in set(df[column]) if sum(df[column] == g) >= minimum_size]
mask = df[column].isin(valid_groups)
df[mask].groupby(column)
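A related built-in worth knowing is GroupBy.filter, which drops whole groups failing a size test in one line; a sketch on made-up data with a lowered threshold so the effect is visible:

```python
import pandas as pd

# Toy frame: 5 female users, 2 male users (data invented for the demo)
df = pd.DataFrame({'Gender': ['F'] * 5 + ['M'] * 2,
                   'HasCar': [1, 0, 1, 1, 0, 1, 0]})
minimum_size = 3

# Keep only the rows whose Gender group has at least `minimum_size` members
big_enough = df.groupby('Gender').filter(lambda g: len(g) >= minimum_size)

print(sorted(big_enough['Gender'].unique()))  # ['F'] - the 2-row 'M' group is dropped
```

The result is a plain DataFrame containing only rows from qualifying groups, which you can then group again for the actual car-rate statistics.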

Add multiple columns to multiple data frames

I have a number of small dataframes, each with a date and the price of a given stock. Someone showed me how to loop through them so they are contained in a list called all_dfs: all_dfs[0] is a dataframe with Date and IBM US Equity, all_dfs[1] is Date and MMM US Equity, etc. (example shown below). The Date column is always the same, but the stock names, and the numbers in each stock column, are all different. So when you call all_dfs[1].head(), this is the dataframe you would see:
IDX Date MMM US equity
0 1/3/2000 47.19
1 1/4/2000 45.31
2 1/5/2000 46.63
3 1/6/2000 50.38
I want to add the same additional columns to EVERY dataframe. So I was trying to loop through them and add the columns. The numbers in the stock name columns are the basis for the calculations that make the other columns.
There are more columns to add, but I think they will all loop through the same way, so this is a sample of the columns I want to add:
Column 1 to add >>> df['P_CHG1D'] = df['Stock name #1'].pct_change(1) * 100
Column 2 to add >>> df['PCHG_SIG'] = P_CHG1D > 3
Column 3 to add>>> df['PCHG_SIG']= df['PCHG_SIG'].map({True:1,False:0})
This is the code that I have so far, but it is returning a syntax error on the all_dfs[i] part:
for i in range(len(df.columns)):
    for all_dfs[i]:
        df['P_CHG1D'] = df.loc[:, 0].pct_change(1) * 100
So I also have 2 problems that I cannot figure out:
I don't know how to add columns to every dataframe in the loop. I would have to do something like all_dfs[i]['ADD COLUMN NAME'] = df['Stock Name 1'].pct_change(1) * 100
The second part after the =, the df['Stock Name 1'] reference, keeps changing: in this example the column is called MMM US Equity, but for the next dataframe it could be IBM US Equity. Since each dataframe has a different stock name, I don't know how to refer to that column properly inside the loop.
I am new to python/pandas so if I am thinking about this the wrong way let me know if there is a better solution.
Consider iterating through the length of all_dfs and referencing each element in the loop by its index. For the first new column, select the stock column by its position of 2 (third column) - .iloc does this in current pandas (the old .ix accessor has been removed):
for i in range(len(all_dfs)):
    all_dfs[i] = all_dfs[i].copy()  # avoids SettingWithCopyWarning
    all_dfs[i]['P_CHG1D'] = all_dfs[i].iloc[:, 2].pct_change(1) * 100
    all_dfs[i]['PCHG_SIG'] = all_dfs[i]['P_CHG1D'] > 3
    all_dfs[i]['PCHG_SIG_VAL'] = all_dfs[i]['PCHG_SIG'].map({True: 1, False: 0})
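To make the mechanics concrete, here is the loop run end to end on a made-up two-frame list (the prices are invented; only the IDX/Date/stock column layout matches the question):

```python
import pandas as pd

all_dfs = [
    pd.DataFrame({'IDX': [0, 1, 2],
                  'Date': ['1/3/2000', '1/4/2000', '1/5/2000'],
                  'MMM US equity': [47.19, 45.31, 46.63]}),
    pd.DataFrame({'IDX': [0, 1, 2],
                  'Date': ['1/3/2000', '1/4/2000', '1/5/2000'],
                  'IBM US equity': [100.0, 110.0, 99.0]}),
]

for i in range(len(all_dfs)):
    stock = all_dfs[i].iloc[:, 2]                       # third column = the price series
    all_dfs[i]['P_CHG1D'] = stock.pct_change(1) * 100   # one-day % change
    all_dfs[i]['PCHG_SIG'] = all_dfs[i]['P_CHG1D'] > 3
    all_dfs[i]['PCHG_SIG_VAL'] = all_dfs[i]['PCHG_SIG'].map({True: 1, False: 0})

print(all_dfs[1]['PCHG_SIG_VAL'].tolist())  # [0, 1, 0]: only the +10% day fires
```

Selecting the price series by position rather than by name is what sidesteps the "different column name in every frame" problem.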
