Python pandas: generate a table for multiple output variables

I have accident data, and part of this data includes the year of the accident, the degree of injury, and the age of the injured person. This is an example of the DataFrame:
df = pd.DataFrame({'Year': ['2010', '2010', '2010', '2010', '2010', '2011', '2011', '2011', '2011'],
                   'Degree_injury': ['no_injury', 'death', 'first_aid', 'minor_injury', 'disability',
                                     'disability', 'disability', 'death', 'first_aid'],
                   'Age': [50, 31, 40, 20, 45, 29, 60, 18, 48]})
print(df)
I want three output variables grouped in a table by year when the age is less than 40: counts of disabilities, deaths, and minor injuries.
The output should be a table with one row per year and a count column for each of these.
I generated the three variables (num_disability, num_death, num_minor_injury) for Age < 40 as shown below.
disability_filt = (df['Degree_injury'] == 'disability') & (df['Age'] < 40)
num_disability = df[disability_filt].groupby('Year')['Degree_injury'].count()
death_filt = (df['Degree_injury'] == 'death') & (df['Age'] < 40)
num_death = df[death_filt].groupby('Year')['Degree_injury'].count()
minor_injury_filt = (df['Degree_injury'] == 'minor_injury') & (df['Age'] < 40)
num_minor_injury = df[minor_injury_filt].groupby('Year')['Degree_injury'].count()
How can I combine these variables into one table, with one row per year and one column per injury type?
Thank you in advance.

Use pivot_table after filtering your rows according to your condition:
out = df[df['Age'].lt(40)].pivot_table(index='Year', columns='Degree_injury',
                                       values='Age', aggfunc='count', fill_value=0)
print(out)
# Output:
Degree_injury  death  disability  minor_injury
Year
2010               1           0             1
2011               1           1             0
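pd.crosstab gives the same counts without having to pick a values column; a minimal sketch, assuming the same df as above:
print(pd.crosstab(df.loc[df['Age'] < 40, 'Year'],
                  df.loc[df['Age'] < 40, 'Degree_injury']))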

# prep data: count injuries per Year for ages under 40
df2 = (df.loc[df.Age < 40]
         .groupby("Year").Degree_injury.value_counts()
         .to_frame()
         .reset_index(level=0))
df2 = df2.rename(columns={'Degree_injury': 'Count'})
df2['Degree_injury'] = df2.index
df2
# Year Count Degree_injury
# death 2010 1 death
# minor_injury 2010 1 minor_injury
# death 2011 1 death
# disability 2011 1 disability
# pivot result
df2.pivot(index='Year',columns='Degree_injury')
# death disability minor_injury
# Year
# 2010 1.0 NaN 1.0
# 2011 1.0 1.0 NaN
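The intermediate frame and the extra Degree_injury column can be skipped; a shorter sketch of the same idea, unstacking the value counts directly (assuming the same df):
print(df[df.Age < 40]
      .groupby('Year')['Degree_injury']
      .value_counts()
      .unstack(fill_value=0))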

merging dataframes with multiple conditions

I am trying to merge two large data frames based on two common columns. There is a small attempt and debate here, but no promising solution. The merge conditions are:
df1.year <= df2.year (the same or a later year of manufacture)
df1.maker == df2.maker AND df1.location == df2.location
I prepared a small mock dataset to explain:
first data frame:
data = np.array([[2014, "toyota", "california", "corolla"],
                 [2015, "honda", "california", "civic"],
                 [2020, "hyndai", "florida", "accent"],
                 [2017, "nissan", "NaN", "sentra"]])
df = pd.DataFrame(data, columns=['year', 'make', 'location', 'model'])
df
second data frame:
data2 = np.array([[2012, "toyota", "california", "airbag"],
                  [2017, "toyota", "california", "wheel"],
                  [2022, "hyndai", "newyork", "seat"],
                  [2017, "nissan", "london", "light"]])
df2 = pd.DataFrame(data2, columns=['year', 'make', 'location', 'id'])
df2
desired output:
data3 = np.array([[2017, "toyota", "corolla", "california", "wheel"]])
df3 = pd.DataFrame(data3, columns=['year', 'make', 'model', 'location', 'id'])
df3
I tried the approach below, but it is too slow and also not very accurate:
df4= pd.merge(df,df2, on=['location','make'], how='outer')
df4=df4.dropna()
df4['year'] = df4.apply(lambda x : x['year_y'] if x['year_y'] >= x['year_x'] else "0", axis=1)
You can achieve it with a merge_asof (one to one left merge) and dropna:
# ensure numeric year
df['year'] = pd.to_numeric(df['year'])
df2['year'] = pd.to_numeric(df2['year'])
(pd.merge_asof(df.sort_values('year'),
               df2.sort_values('year')
                  .assign(year2=df2['year']),
               on='year', by=['make', 'location'],
               direction='forward')
   .dropna(subset='id')
   .convert_dtypes()
 )
NB. The intermediate is the size of df.
Output:
year make location model id year2
0 2014 toyota california corolla wheel 2017
one to many
As merge_asof is a one to one left join, if you want a one to many left join (or right join), you can invert the inputs and the direction.
I added an extra row for 2017 to demonstrate the difference.
year make location id
0 2012 toyota california airbag
1 2017 toyota california wheel
2 2017 toyota california windshield
3 2022 hyndai newyork seat
4 2017 nissan london light
Right join:
(pd.merge_asof(df2.sort_values('year'),
               df.sort_values('year'),
               on='year', by=['make', 'location'],
               direction='backward')
   .dropna(subset='model')
 )
NB. The intermediate is the size of df2.
Output:
year make location id model
1 2017 toyota california wheel corolla
2 2017 toyota california windshield corolla
This should work:
df4 = pd.merge(df, df2, on=['location', 'make'], how='inner')
df4.where(df4.year_x <= df4.year_y).dropna()
Output:
year_x make location model year_y id
1 2014 toyota california corolla 2017 wheel
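A plain boolean filter on the merged frame gives the same rows without materializing the NaN rows produced by where; a sketch using the same df4:
df4[df4.year_x <= df4.year_y]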
Try this code (here 'make' and 'location' are the common columns):
df_merged = pd.merge(df, df2, on=['make', 'location'], how='inner')
df3 = df_merged[df_merged['year_x'] <= df_merged['year_y']]

Is there a way to count and calculate mean for text columns using groupby?

I have been using pandas.groupby to pivot data and create descriptive charts and tables for my data. While doing groupby for three variables, I keep running into a DataError: No numeric types to aggregate error while working with the cancelled column.
To describe my data: Year and Month contain yearly and monthly data for multiple columns (multiple years, all months), Type contains the type of order item (Clothes, Appliances, etc.), and cancelled contains 'yes' or 'no' string values indicating whether an order was cancelled.
I am hoping to plot a graph and show a table of the cancellation rate (and success rate) by order item. This is what I'm using so far:
df.groupby(['Year', 'Month', 'Type'])['cancelled'].mean()
But this doesn't seem to be working.
Sample
Year Month Type cancelled
2012 1 electronics yes
2012 10 fiber yes
2012 9 clothes no
2013 4 vegetables yes
2013 5 appliances no
2016 3 fiber no
2017 1 clothes yes
Use:
df = pd.DataFrame({
    'Year': [2020] * 6,
    'Month': [7, 8, 7, 8, 7, 8],
    'cancelled': ['yes', 'no'] * 3,
    'Type': list('aaaaba')
})
print (df)
Get counts per Year, Month, Type columns:
df1 = df.groupby(['Year', 'Month', 'Type','cancelled']).size().unstack(fill_value=0)
print (df1)
cancelled no yes
Year Month Type
2020 7 a 0 2
b 0 1
8 a 3 0
And then divide by the column totals to get the ratio as a percentage:
df2 = df1.div(df1.sum()).mul(100)
print (df2)
cancelled no yes
Year Month Type
2020 7 a 0.0 66.666667
b 0.0 33.333333
8 a 100.0 0.000000
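Depending on which rate is wanted: df2 above gives each group's share of all 'no'/'yes' answers (each column sums to 100). For the cancellation rate within each Year/Month/Type group, dividing by the row sums may be closer; a sketch with the same df1 (df3 is just an illustrative name):
df3 = df1.div(df1.sum(axis=1), axis=0).mul(100)
print (df3)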
It's possible I have misunderstood what you want your output to look like, but to find the cancellation rate for each item type, you could do something like this:
# change 'cancelled' to numeric values
df.loc[df['cancelled'] == 'yes', 'cancelled'] = 1
df.loc[df['cancelled'] == 'no', 'cancelled'] = 0
# get the mean of 'cancelled' for each item type
res = {}
for t in df['Type'].unique():
    res[t] = df.loc[df['Type'] == t, 'cancelled'].mean()
# if desired, put it into a dataframe
results = pd.DataFrame([res], index=['Rate']).T
Output:
Rate
electronics 1.0
fiber 0.5
clothes 0.5
vegetables 1.0
appliances 0.0
Note: If you want to specify specific years or months, you can do that with loc as well, but given that your example data did not have any repeats within a given year or month, this would return your original dataframe for your given example.
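The same per-type rate can also be computed without mutating the column, assuming cancelled still holds the original 'yes'/'no' strings; a minimal sketch:
# boolean cancelled flag; its mean per Type is the cancellation rate
rates = df['cancelled'].eq('yes').groupby(df['Type']).mean()
print(rates)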

Pandas groupby mean of only positive values

How to get mean of only positive values after groupby in pandas?
MWE:
import numpy as np
import pandas as pd
flights = pd.read_csv('https://github.com/bhishanpdl/Datasets/blob/master/nycflights13.csv?raw=true')
print(flights.shape)
print(flights.iloc[:2,:4])
print()
not_cancelled = flights.dropna(subset=['dep_delay','arr_delay'])
df = (not_cancelled.groupby(['year','month','day'])['arr_delay']
.mean().reset_index()
)
df['avg_delay2'] = df[df.arr_delay>0]['arr_delay'].mean()
print(df.head())
This gives all avg_delay2 values as 16.66.
(336776, 19)
year month day dep_time
0 2013 1 1 517.0
1 2013 1 1 533.0
year month day arr_delay avg_delay2
0 2013 1 1 12.651023 16.665681
1 2013 1 2 12.692888 16.665681
2 2013 1 3 5.733333 16.665681
3 2013 1 4 -1.932819 16.665681
4 2013 1 5 -1.525802 16.665681
Which is WRONG.
# sanity check
a = not_cancelled.query(""" year==2013 & month ==1 & day ==1 """)['arr_delay']
a = a[a>0]
a.mean() # 32.48156182212581
When I do the same thing in R:
library(nycflights13)
library(dplyr)
not_cancelled = flights %>%
  filter(!is.na(dep_delay), !is.na(arr_delay))
df = not_cancelled %>%
  group_by(year, month, day) %>%
  summarize(
    # average delay
    avg_delay1 = mean(arr_delay),
    # average positive delay
    avg_delay2 = mean(arr_delay[arr_delay > 0]))
head(df)
It gives correct output for avg_delay2.
year month day avg_delay1 avg_delay2
2013 1 1 12.651023 32.48156
2013 1 2 12.692888 32.02991
2013 1 3 5.733333 27.66087
2013 1 4 -1.932819 28.30976
2013 1 5 -1.525802 22.55882
2013 1 6 4.236429 24.37270
How to do this in Pandas?
I would filter the positive values before the groupby:
df = (not_cancelled[not_cancelled.arr_delay >0].groupby(['year','month','day'])['arr_delay']
.mean().reset_index()
)
df.head()
because, as in your code, df is a separate dataframe after the groupby operation has completed, and
df['avg_delay2'] = df[df.arr_delay>0]['arr_delay'].mean()
assigns the same scalar value to every row of df['avg_delay2'].
Edit: Similar to R, you can do both in one shot using agg:
def mean_pos(x):
    return x[x > 0].mean()
df = (not_cancelled.groupby(['year', 'month', 'day'])['arr_delay']
      .agg({'arr_delay': 'mean', 'arr_delay_2': mean_pos})
      )
df.head()
Note that since pandas 0.23, using a dictionary in groupby agg is deprecated and will be removed in the future, so we cannot use that method.
Warning
df = (not_cancelled.groupby(['year', 'month', 'day'])['arr_delay']
      .agg({'arr_delay': 'mean', 'arr_delay_2': mean_pos})
      )
FutureWarning: using a dict on a Series for aggregation
is deprecated and will be removed in a future version.
So, to tackle that problem in this specific case, I came up with another idea.
Create a new column making all non-positive values nans, then do the usual groupby.
import numpy as np
import pandas as pd
# read data
flights = pd.read_csv('https://github.com/bhishanpdl/Datasets/blob/master/nycflights13.csv?raw=true')
# select flights that are not cancelled
df = flights.dropna(subset=['dep_delay','arr_delay'])
# create new column to fill non-positive with nans
df['arr_delay_pos'] = df['arr_delay']
df.loc[df.arr_delay_pos <= 0,'arr_delay_pos'] = np.nan
df.groupby(['year','month','day'])[['arr_delay','arr_delay_pos']].mean().reset_index().head()
It gives:
year month day arr_delay arr_delay_pos
0 2013 1 1 12.651023 32.481562
1 2013 1 2 12.692888 32.029907
2 2013 1 3 5.733333 27.660870
3 2013 1 4 -1.932819 28.309764
4 2013 1 5 -1.525802 22.558824
Sanity check
# sanity check
a = not_cancelled.query(""" year==2013 & month ==1 & day ==1 """)['arr_delay']
a = a[a>0]
a.mean() # 32.48156182212581
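On newer pandas (0.25+), named aggregation avoids the deprecated dict syntax and reproduces the R result in one shot; a sketch, assuming the same not_cancelled frame:
res = (not_cancelled.groupby(['year', 'month', 'day'])
       .agg(avg_delay1=('arr_delay', 'mean'),
            avg_delay2=('arr_delay', lambda s: s[s > 0].mean()))
       .reset_index())
res.head()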

Group By, Agg, Reset Index where filter returns null continued

Here is a sample data set:
customer_number ethnicity fiscal_quarter fiscal_year
1 231 Black Quarter 1 2016
2 451 White Quarter 1 2016
3 345 White Quarter 1 2016
I would like to check for 'Asian' in the ethnicity column, group by fiscal_year and fiscal_quarter, and count unique customer_number values. But if there is no result for 'Asian', return a dataframe like below.
customer_number fiscal_quarter fiscal_year
1 0 Quarter 1 2016
short answer
# make column `Categorical`, include `'Asian'` as one of the categories
e = df.ethnicity
df['ethnicity'] = pd.Categorical(e, categories=np.append('Asian', e.unique()))
# simple function to be applied; performs the 2nd level of `groupby`
def f(df):
    s = df.groupby('ethnicity').customer_number.nunique()
    return s.loc['Asian']
# initial `groupby`
d = df.groupby(['fiscal_year', 'fiscal_quarter']).apply(f)
d.reset_index(name='nunique')
fiscal_year fiscal_quarter nunique
0 2016 Quarter 1 0
explanation
a convenient way to groupby and produce aggregated results over groups that don't exist in the data is to make the grouping column 'Categorical' and define its categories to include the thing that is missing. pandas will include that category in the aggregated results.
However, in this case, I could not groupby with 3 different columns and maintain that same convenience. I had to break up the grouping into 2.
1. groupby the columns that aren't 'Categorical', namely ['fiscal_year', 'fiscal_quarter'].
2. apply to the groupby from step 1 a function that performs a simple groupby over just ethnicity. This maintains the desired behavior and reports on all categories regardless of whether they are represented in the data.
leaving all categories
e = df.ethnicity
df['ethnicity'] = pd.Categorical(
    e, categories=np.append(['Asian', 'Hispanic'], e.unique()))
def f(df):
    return df.groupby('ethnicity').customer_number.nunique()
d = df.groupby(['fiscal_year', 'fiscal_quarter']).apply(f)
d.stack().reset_index(name='nunique')
fiscal_year fiscal_quarter ethnicity nunique
0 2016 Quarter 1 Asian 0
1 2016 Quarter 1 Hispanic 0
2 2016 Quarter 1 Black 1
3 2016 Quarter 1 White 1
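If the ethnicity column still holds plain strings, a reindex-based sketch gives a similar result without Categorical (count unique customers per group, then add any missing ethnicity columns filled with zeros):
counts = (df.groupby(['fiscal_year', 'fiscal_quarter', 'ethnicity'])
            .customer_number.nunique()
            .unstack('ethnicity', fill_value=0))
# guarantee the columns of interest exist even when absent from the data
wanted = ['Asian', 'Hispanic']
counts = counts.reindex(columns=counts.columns.union(wanted), fill_value=0)
counts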
If I understand correctly what you're looking for, the following should do:
import pandas as pd
# Generate data
d = {'customer_number': [231, 451, 345, 236, 457, 354],
     'ethnicity': ['Black', 'White', 'White', 'Black', 'White', 'White'],
     'fiscal_quarter': ['Quarter 1', 'Quarter 1', 'Quarter 1', 'Quarter 3', 'Quarter 3', 'Quarter 1'],
     'fiscal_year': [2016, 2016, 2016, 2015, 2015, 2017]}
df = pd.DataFrame(d)
# Helper function to determine subset of
# dataframe that meets ethnicity condition
def find_ethnicity(dff, ethnicity):
    count = dff.customer_number[dff.ethnicity.eq(ethnicity)].nunique()
    if count == 0:
        dff = dff.head(1).copy()
    else:
        dff = dff[dff.ethnicity.eq(ethnicity)].copy().head(1)
    dff['ethnicity'] = ethnicity
    dff['customer_number'] = count
    return dff
# Test with ethnicity 'Black' grouping by fiscal_year and fiscal_quarter
print(df.groupby(['fiscal_year', 'fiscal_quarter'], as_index=False).apply(find_ethnicity, 'Black').reset_index(drop=True))
# customer_number ethnicity fiscal_quarter fiscal_year
# 0 1 Black Quarter 3 2015
# 1 1 Black Quarter 1 2016
# 2 0 Black Quarter 1 2017
# Test with ethnicity 'Asian' grouping by fiscal_year and fiscal_quarter
print(df.groupby(['fiscal_year', 'fiscal_quarter'], as_index=False).apply(find_ethnicity, 'Asian').reset_index(drop=True))
# customer_number ethnicity fiscal_quarter fiscal_year
# 0 0 Asian Quarter 3 2015
# 1 0 Asian Quarter 1 2016
# 2 0 Asian Quarter 1 2017
# Test with ethnicity 'White' grouping by fiscal_year and fiscal_quarter
print(df.groupby(['fiscal_year', 'fiscal_quarter'], as_index=False).apply(find_ethnicity, 'White').reset_index(drop=True))
# customer_number ethnicity fiscal_quarter fiscal_year
# 0 1 White Quarter 3 2015
# 1 2 White Quarter 1 2016
# 2 1 White Quarter 1 2017
# Test with ethnicity 'Latino' grouping by fiscal_year and fiscal_quarter
print(df.groupby(['fiscal_year', 'fiscal_quarter'], as_index=False).apply(find_ethnicity, 'Latino').reset_index(drop=True))
# customer_number ethnicity fiscal_quarter fiscal_year
# 0 0 Latino Quarter 3 2015
# 1 0 Latino Quarter 1 2016
# 2 0 Latino Quarter 1 2017
# Test with ethnicity 'Asian' without grouping
print(find_ethnicity(df, 'Asian'))
# customer_number ethnicity fiscal_quarter fiscal_year
# 0 0 Asian Quarter 1 2016
I hope this proves useful.

How to change value of second duplicate in row

I have not been able to find an answer to the following online. Your valuable help would be much appreciated.
I have a DataFrame like this with 20k rows:
ID Date Country
2008-0001 2008-01-02 India
2008-0001 2008-01-02 France
2008-0002 2008-01-03 USA
I want to take all the duplicates in ID, such as in rows 1 and 2, and renumber the second occurrence past the highest number after the dash.
So, for instance, because there is already 2008-0002 (assume that 0002 is the highest number after the dash in that column for that year), I want to increment to one above that, so one of the duplicate ID values 2008-0001 would become 2008-0003.
I can identify and drop duplicates using the following code
drop_duplicate_df = train_df.drop_duplicates(['ID'])
but this is not what I need.
I believe this will get it done. It assumes the ID has first been split into a year part ID1 and an integer sequence part ID2 (the split is added here so the snippet runs on the sample):
df[['ID1', 'ID2']] = df['ID'].str.split('-', expand=True)
df['ID2'] = df['ID2'].astype(int)
isdup = df.duplicated(subset=['ID1', 'ID2'])
dups, uniques = df[isdup], df[~isdup]
ids = ['ID1', 'ID2']
for i, row in dups.iterrows():
    # bump the sequence number until the (ID1, ID2) pair is no longer taken
    while (row[ids] == uniques[ids]).all(axis=1).any():
        row.loc['ID2'] += 1
    uniques = pd.concat([uniques, row.to_frame().T])
# rebuild the string ID from the two parts
id1 = uniques.ID1.astype(str)
id2 = uniques.ID2.astype(str).str.zfill(4)
uniques.loc[:, 'ID'] = id1 + '-' + id2
uniques.drop(['ID1', 'ID2'], axis=1, inplace=True)
print(uniques.sort_index())
ID Date Country
0 2008-0001 2008-01-02 India
1 2008-0003 2008-01-02 France
2 2008-0002 2008-01-03 USA
The below works with the sample data, and assumes you have data for several years that you all want to relabel according to the same logic:
df.Date = pd.to_datetime(df.Date)    # to datetime to extract years
years = df.groupby(df.Date.dt.year)  # analysis per year
new_df = pd.DataFrame()
for year, data in years:
    # relabel duplicated IDs within the year as (current max suffix for that year) + 1
    # note: every duplicate within a year receives the same new number
    new_num = int(data.ID.max().split('-')[1]) + 1
    data.loc[data.duplicated(subset='ID'), 'ID'] = '{0}-{1}'.format(year, str(new_num).zfill(4))
    new_df = pd.concat([new_df, data])
to get:
ID Date Country
0 2008-0001 2008-01-02 India
1 2008-0003 2008-01-02 France
2 2008-0002 2008-01-03 USA
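For a 20k-row frame, a vectorized sketch avoids the per-row loop; it renumbers each duplicate past its year's current maximum suffix (assuming the part before the dash is the year and the part after is a zero-padded counter):
import pandas as pd
df = pd.DataFrame({'ID': ['2008-0001', '2008-0001', '2008-0002'],
                   'Date': ['2008-01-02', '2008-01-02', '2008-01-03'],
                   'Country': ['India', 'France', 'USA']})
# split the ID into its year prefix and numeric suffix
parts = df['ID'].str.split('-', expand=True)
year_part = parts[0]
num_part = parts[1].astype(int)
# rank repeated (year, number) pairs in order of appearance: 0 for the first, then 1, 2, ...
dup_rank = df.groupby([year_part, num_part]).cumcount()
is_dup = dup_rank > 0
# current maximum suffix per year, broadcast back to every row
year_max = num_part.groupby(year_part).transform('max')
# duplicates get consecutive numbers starting at year_max + 1, counted per year
offset = is_dup.groupby(year_part).cumsum()
new_num = num_part.where(~is_dup, year_max + offset)
df['ID'] = year_part + '-' + new_num.astype(int).astype(str).str.zfill(4)
print(df)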
