Finding the difference between two data-frames - python

I'm trying to find the largest income difference between male and female workers. But I'm not sure how to implement the code. I need some assistance.
aa=industries.F_weekly.max()
bb=industries.M_weekly.max()
cc = (nf.loc[nf['M_weekly'] == bb]) - (nf.loc[nf['F_weekly'] == aa])
cc.max()
cc.min()

Let's say your DataFrame is called df.
First calculate the absolute value of the salary difference, then print its max. This can also be done in one line.
df['salary_delta'] = (df['M_weekly'] - df['F_weekly']).abs()
print(df['salary_delta'].max())
In case you want to find the row where salary difference is the highest then try:
df.loc[df['salary_delta'].idxmax()]
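A minimal end-to-end sketch of this approach, using a small hypothetical table in place of the real industries data:
import pandas as pd

# Hypothetical stand-in for the industries data
df = pd.DataFrame({'industry': ['Mining', 'Retail', 'Finance'],
                   'M_weekly': [1200, 700, 1500],
                   'F_weekly': [1000, 650, 1100]})

df['salary_delta'] = (df['M_weekly'] - df['F_weekly']).abs()
print(df['salary_delta'].max())             # 400, the largest gap
print(df.loc[df['salary_delta'].idxmax()])  # the full Finance row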

Pandas - using group by and including value counts which are larger than n

I have a table which includes salary and company_location.
I was trying to calculate the mean salary per country, and it works:
wage = df.groupby('company_location').mean()['salary']
However, many company_location values have fewer than 5 entries, and I would like to exclude those from the report.
I know how to find the 5 locations with the most entries:
Top_5 = df['company_location'].value_counts().head(5)
I am just having a problem connecting those two variables into one and making a graph out of it...
Thank you.
You can remove rows whose value occurrence is below a threshold:
df = df[df.groupby('company_location')['company_location'].transform('size') > 5]
You can do the following to only apply the groupby and aggregation to those with more than 5 records:
mask = (df['company_location'].map(df['company_location'].value_counts()) > 5)
wage = df[mask].groupby('company_location')['salary'].mean()
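If you also want the graph mentioned in the question, a possible sketch (assuming matplotlib is installed and the column names above):
import matplotlib.pyplot as plt

# Keep only locations with more than 5 rows, then plot mean salary per location
mask = df['company_location'].map(df['company_location'].value_counts()) > 5
wage = df[mask].groupby('company_location')['salary'].mean()
wage.sort_values(ascending=False).plot(kind='bar')
plt.ylabel('mean salary')
plt.show()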

How to calculate relative frequency of an event from a dataframe?

I have a dataframe with temperature data for a certain period. With this data, I want to calculate the relative frequency of the month of August being warmer than 20° as well as of January being colder than 2°. I have already managed to extract these two columns into a separate dataframe, to get the count of each temperature value, and to use the normalize option to get the frequency of each value in percent (see code).
df_temp1[df_temp1.aug >=20]
df_temp1[df_temp1.jan <= 2]
df_temp1['aug'].value_counts()
df_temp1['jan'].value_counts()
df_temp1['aug'].value_counts(normalize=True)*100
df_temp1['jan'].value_counts(normalize=True)*100
What I haven't managed is to calculate the relative frequency for aug>=20, jan<=2, as well as aug>=20 AND jan<=2 and aug>=20 OR jan<=2.
Maybe someone could help me with this problem. Thanks.
I would try something like this:
proportion_of_augusts_above_20 = (df_temp1['aug'] >= 20).mean()
proportion_of_januaries_below_2 = (df_temp1['jan'] <= 2).mean()
This calculates it in two steps. First, df_temp1['aug'] >= 20 creates a boolean array, with True for Augusts at or above 20° and False for those that are not.
Then, mean() reinterprets True and False as 1 and 0, so the average is the proportion of months that fulfill the criterion (the percentage divided by 100).
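For the combined events the question asks about, the same boolean masks can be joined with & (AND) and | (OR) before taking the mean; a sketch using the columns above:
aug_hot = df_temp1['aug'] >= 20
jan_cold = df_temp1['jan'] <= 2

freq_and = (aug_hot & jan_cold).mean()  # relative frequency of both events together
freq_or = (aug_hot | jan_cold).mean()   # relative frequency of at least one event
print(freq_and * 100, freq_or * 100)    # as percentages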
As an aside, I would recommend posting your data in a question, which allows people answering to check whether their solution works.

Calculate weighted average in pandas with unique condition

I'm trying to calculate the weighted average of the "prices" column in the following dataframe for each zone, regardless of hour. I want to essentially sum the quantities that match A, divide each individual quantity row by that amount (to get the weights) and then multiply it by the price.
There are about 200 zones, I'm having a hard time writing something that will generically detect that the Zones match, and not have to write df['ZONE'] = 'A' etc. Please help my lost self =)
HOUR:     1, 2, 3, 1, 2, 3, 1, 2, 3
ZONE:     A, A, A, B, B, B, C, C, C
PRICE:    12, 15, 16, 17, 12, 11, 12, 13, 15
QUANTITY: 5, 6, 1, 5, 7, 9, 6, 3, 2
I'm not sure if you can write something fully generic, but I thought: what if I wrote a function where x is my zone, created a list of possible zones, and looped over it? Here's the function I wrote; it doesn't really work, and I'm trying to figure out how else to make it work:
def wavgp(x):
    df.loc[df['ZONE'].isin([str(x)])] = x
Here is a possible solution using groupby operation:
weighted_price = df.groupby('ZONE').apply(lambda x: (x['PRICE'] * x['QUANTITY']).sum()/x['QUANTITY'].sum())
Explanation
First we group by ZONE; for each block of rows with the same zone, we multiply the price by the quantity and sum these values, then divide that result by the sum of the quantity to get your desired result.
ZONE
A 13.833333
B 12.761905
C 12.818182
dtype: float64
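An equivalent variant, if NumPy is available, is numpy.average with its weights argument, which reads closer to "weighted average":
import numpy as np

weighted_price = df.groupby('ZONE').apply(
    lambda g: np.average(g['PRICE'], weights=g['QUANTITY']))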

How to access values in groupby dataframe with multiple labels?

I am trying to find the values inside dataframe that has been grouped.
I grouped payment data with time the person borrowed the money and months it took for the person to pay and summed the amount they paid. My goal is to find the list of months it took for people to pay back.
For example, how can I know the list of 'month_taken' when start_yyyymm is 201807?
payment_sum_monthly = payment_data.groupby(['start_yyyymm','month_taken'])[['amount']].sum()
If I use R and put the payment data in data.table form, I can find out the list of month_taken by
payment_sum_monthly[start_yyyymm == '201807',month_taken]
How can I do this in Python? Thanks.
is_date = payment_data['start_yyyymm'] == "201807"
This gives you a boolean mask for all rows whose 'start_yyyymm' is 201807. Then, to work with those rows, you can write the following:
date_set = payment_data[is_date].copy()
payment_sum_monthly = date_set.groupby('month_taken').aggregate('sum')
payment_sum_monthly
And if you need one more condition you can do following:
condition2 = payment_data['column name'] == condition
payment_data[is_date & condition2]
I hope I got your question right and it helps
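As an aside, the R one-liner has a fairly direct pandas equivalent: the grouped result has a MultiIndex, so you can select one start_yyyymm with .loc or .xs. A sketch, assuming the payment_sum_monthly frame from the question:
# Rows for start_yyyymm == '201807'; the remaining index level is month_taken
subset = payment_sum_monthly.loc['201807']
months = subset.index.tolist()  # the list of month_taken values
# or, naming the level explicitly:
subset = payment_sum_monthly.xs('201807', level='start_yyyymm')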

Calculating a probability based on several variables in a Pandas dataframe

I'm still very new to Python and Pandas, so bear with me...
I have a dataframe of passengers on a ship that sunk. I have broken this down into other dataframes by male and female, and also by class to create probabilities for survival. I made a function that compares one dataframe to a dataframe of only survivors, and calculates the probability of survival among this group:
def survivability(total_pass_df, column, value):
    survivors = sum(did_survive[column] == value)
    total = len(total_pass_df)
    survival_prob = round((survivors / total), 2)
    return survival_prob
But now I'm trying to compare survivability among smaller groups - male first class passengers vs female third class passengers, for example. I did make dataframes for both of these groups, but I still can't use my survivability function because I'm comparing two different columns - sex and class - rather than just one.
I know exactly how I'd do it with Python - loop through the 'survived' column (which is either a 1 or 0), in the dataframe, if it equals 1, then add one to an index value, and once all the data has been gone through, divide the index value by the length of the dataframe to get the probability of survival....
But I'm supposed to use Pandas for this, and I can't for the life of me work out in my head how to do it....
:/
Without a sample of the data frames you're working with, I can't be sure if I understand your question correctly. But based on your description of the pure-Python procedure,
I know exactly how I'd do it with Python - loop through the 'survived' column (which is either a 1 or 0), in the dataframe, if it equals 1, then add one to an index value, and once all the data has been gone through, divide the index value by the length of the dataframe to get the probability of survival....
you can do this in Pandas by simply writing
dataframe['survived'].mean()
That's it. Given that all the values are either 1 or 0, the mean will be the number of 1's divided by the total number of rows.
If you start out with a data frame that has columns like survived, sex, class, and so on, you can elegantly combine this with Pandas' boolean indexing to pick out the survival rates for different groups. Let me use the Socialcops Titanic passengers data set as an example to demonstrate. Assuming the DataFrame is called df, if you want to analyze only male passengers, you can get those records as
df[df['sex'] == 'male']
and then you can take the survived column of that and get the mean.
>>> df[df['sex'] == 'male']['survived'].mean()
0.19198457888493475
So 19% of male passengers survived. If you want to narrow down to male second-class passengers, you'll need to combine the conditions using &, like this:
>>> df[(df['sex'] == 'male') & (df['pclass'] == 2)]['survived'].mean()
0.14619883040935672
This is getting a little unwieldy, but there's an easier way that actually lets you do multiple categories at once. (The catch is that this is a somewhat more advanced Pandas technique and it might take a while to understand it.) Using the DataFrame.groupby() method, you can tell Pandas to group the rows of the data frame according to their values in certain columns. For example,
df.groupby('sex')
tells Pandas to group the rows by their sex: all male passengers' records are in one group, and all female passengers' records are in another group. The thing you get from groupby() is not a DataFrame, it's a special kind of object that lets you apply aggregation functions - that is, functions which take a whole group and turn it into one number (or something). So, for example, if you do this
>>> df.groupby('sex').mean()
          pclass  survived        age     sibsp     parch       fare  \
sex
female  2.154506  0.727468  28.687071  0.652361  0.633047  46.198097
male    2.372479  0.190985  30.585233  0.413998  0.247924  26.154601

             body
sex
female  166.62500
male    160.39823
you see that for each column, Pandas takes the average of that column's values over the male passengers' records, and likewise over the female passengers' records. All you care about here is the survival rate, so just use
>>> df.groupby('sex').mean()['survived']
sex
female 0.727468
male 0.190985
One big advantage of this is that you can give more than one column to group by, if you want to look at small groups. For example, sex and class:
>>> df.groupby(['sex', 'pclass']).mean()['survived']
sex     pclass
female  1         0.965278
        2         0.886792
        3         0.490741
male    1         0.340782
        2         0.146199
        3         0.152130
(you have to give groupby a list of column names if you're giving more than one)
Have you tried merging the two dataframes by passenger ID and then doing a pivot table in Pandas with whatever row subtotals and aggfunc=numpy.mean?
import pandas as pd
import numpy as np

# Passenger list
p_list = pd.DataFrame()
p_list['ID'] = [1,2,3,4,5,6]
p_list['Class'] = ['1','2','2','1','2','1']
p_list['Gender'] = ['M','M','F','F','F','F']

# Survivor list
s_list = pd.DataFrame()
s_list['ID'] = [1,2,3,4,5,6]
s_list['Survived'] = [1,0,0,0,1,0]

# Merge the datasets on passenger ID
merged = pd.merge(p_list, s_list, how='left', on=['ID'])

# Pivot to get sub-group means; margins=True adds the 'All' subtotal row
result = pd.pivot_table(merged, index=['Class','Gender'], values=['Survived'],
                        aggfunc=np.mean, margins=True)

# Move the MultiIndex levels back into regular columns
for x in range(result.index.nlevels-1, -1, -1):
    result.reset_index(level=x, inplace=True)

print(result)
  Class Gender  Survived
0     1      F  0.000000
1     1      M  1.000000
2     2      F  0.500000
3     2      M  0.000000
4   All         0.333333
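A shorter route to a similar table, if it helps, is pd.crosstab, which also accepts values, aggfunc and margins; this sketch (using the merged frame above) lays Class out as rows and Gender as columns instead of stacking both in the index:
rate = pd.crosstab(merged['Class'], merged['Gender'],
                   values=merged['Survived'], aggfunc='mean', margins=True)
print(rate)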
