pandas complicated stacked barplot - python

I have the following data:
Year LandUse Region Area
0 2005 Corn LP 2078875
1 2005 Corn UP 149102.4
2 2005 Open Lands LP 271715
3 2005 Open Lands UP 232290.1
4 2005 Soybeans LP 1791342
5 2005 Soybeans UP 50799.12
6 2005 Other Ag LP 638010.4
7 2005 Other Ag UP 125527.2
8 2005 Forests/Wetlands LP 69629.86
9 2005 Forests/Wetlands UP 26511.43
10 2005 Developed LP 10225.56
11 2005 Developed UP 1248.442
12 2010 Corn LP 2303999
13 2010 Corn UP 201977.2
14 2010 Open Lands LP 131696.3
15 2010 Open Lands UP 45845.81
16 2010 Soybeans LP 1811186
17 2010 Soybeans UP 66271.21
18 2010 Other Ag LP 635332.9
19 2010 Other Ag UP 257439.9
20 2010 Forests/Wetlands LP 48124.43
21 2010 Forests/Wetlands UP 23433.76
22 2010 Developed LP 7619.853
23 2010 Developed UP 707.4816
How do I use pandas to make a stacked bar plot that shows Area on the y-axis, uses 'Region' to build the stacks, and puts both Year and LandUse on the x-axis?

The main thing with pandas plots is figuring out which shape pandas expects the data to be in. If we reshape so that Year is in the index and different regions are in different columns:
# Assuming that we want to sum the areas for different
# LandUse's within each region
plot_table = df.pivot_table(index='Year', columns='Region',
                            values='Area', aggfunc='sum')
plot_table
Out[39]:
Region LP UP
Year
2005 4859797.820 585478.6920
2010 4937958.483 595675.3616
The plotting happens pretty straightforwardly:
plot_table.plot(kind='bar', stacked=True)
Having both Year and LandUse on the x-axis doesn't require much extra work: put both in the index when creating the table for plotting:
plot_table = df.pivot_table(index=['Year', 'LandUse'],
                            columns='Region',
                            values='Area', aggfunc='sum')
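A minimal sketch of the plotting step, assuming plot_table was built as above from the question's df:
import matplotlib.pyplot as plt

# Each (Year, LandUse) pair becomes one group of stacked bars, split by Region
ax = plot_table.plot(kind='bar', stacked=True, figsize=(12, 6))
ax.set_ylabel('Area')
plt.tight_layout()
plt.show()
The x-axis tick labels then read like (2005, Corn), (2005, Developed), and so on, one bar per Year/LandUse combination.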

Related

How to get years without starting with df=df.set_index

I have this dataframe:
I can obtain the values that are 15% greater than the mean with:
df[df['Interest']>(df["Interest"].mean()*1.15)].Interest.to_string()
This gives me all the Interest values that are 15% greater than the mean.
The question is how I get the year in which these values occurred without first doing:
df = df.set_index('Year')
since the expression above would then require me to look up the year values with df.iloc.
How do I get the year in which these values occurred without starting with df.set_index('Year')?
Use .loc:
>>> df
Year Dividends Interest Other Types Rent Royalties Trade Income
0 2007 7632014 4643033 206207 626668 89715 18654926
1 2008 6718487 4220161 379049 735494 58535 29677697
2 2009 1226858 5682198 482776 1015181 138083 22712088
3 2010 978925 2229315 565625 1260765 146791 15219378
4 2011 1500621 2452712 675770 1325025 244073 19697549
5 2012 308064 2346778 591180 1483543 378998 33030888
6 2013 275019 4274425 707344 1664747 296136 17503798
7 2014 226634 3124281 891466 1807172 443671 16023363
8 2015 2171559 3474825 1144862 1858838 585733 16778858
9 2016 767713 4646350 2616322 1942102 458543 13970498
10 2017 759016 4918320 1659303 2001220 796343 9730659
11 2018 687308 6057191 1524474 2127583 1224471 19570540
>>> df.loc[df['Interest']>(df["Interest"].mean()*1.15), ['Year', 'Interest']]
Year Interest
0 2007 4643033
2 2009 5682198
9 2016 4646350
10 2017 4918320
11 2018 6057191
This will return a DataFrame with Year and the Interest values that match your condition:
df[df['Interest']>(df["Interest"].mean()*1.15)][['Year', 'Interest']]
This will return just the Year:
df.loc[df["Interest"]>df["Interest"].mean()*1.15]["Year"]
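A small variation on the same idea, assuming the df shown above: compute the condition once as a boolean mask, then pull out whichever columns you need with .loc:
mask = df['Interest'] > df['Interest'].mean() * 1.15
df.loc[mask, 'Year']            # just the years, as a Series
df.loc[mask, 'Year'].tolist()   # e.g. [2007, 2009, 2016, 2017, 2018]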

How to multiply two dataframes of different shapes

I have two dataframes:
The first dataframe, df1, looks like this:
variable value
0 plastic 5774
2 glass 42
4 ferrous metal 642
6 non-ferrous metal 14000
8 paper 4000
Here is the head of the second dataframe df2:
waste_type total_waste_recycled_tonne year energy_saved
non-ferrous metal 160400.0 2015 NaN
glass 14600.0 2015 NaN
ferrous metal 15200 2015 NaN
plastic 766800 2015 NaN
I want to update energy_saved in the second dataframe df2 by multiplying each row's total_waste_recycled_tonne by the matching value from df1, storing the result in the energy_saved column of df2.
For example:
For plastic, 5774 will be multiplied with every waste_type 'plastic' row's total_waste_recycled_tonne in df2,
i.e.:
energy_saved = 5774 * 766800
Here is what I tried:
df2["energy_saved"] = df1[df1["variable"]=="plastic"]["value"].values[0] * df2["total_waste_recycled_tonne"][df2["waste_type"]=="plastic"]
However, when I repeat this for the other waste types, the rows I set previously revert to NaN. Is there a better approach to handle this?
Use map:
df2['energy_saved'] = (df2['waste_type']
                       .map(df1.set_index('variable')['value'])
                       .mul(df2['total_waste_recycled_tonne']))
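To see what map is doing here, a self-contained sketch built from the question's data (frames reconstructed by hand, so the numbers are the ones shown above):
import pandas as pd

df1 = pd.DataFrame({'variable': ['plastic', 'glass', 'ferrous metal', 'non-ferrous metal', 'paper'],
                    'value': [5774, 42, 642, 14000, 4000]})
df2 = pd.DataFrame({'waste_type': ['non-ferrous metal', 'glass', 'ferrous metal', 'plastic'],
                    'total_waste_recycled_tonne': [160400.0, 14600.0, 15200.0, 766800.0],
                    'year': [2015] * 4})

# df1 as a lookup Series: variable -> value
lookup = df1.set_index('variable')['value']

# map() translates each waste_type into its multiplier, row by row: 14000, 42, 642, 5774
df2['energy_saved'] = df2['waste_type'].map(lookup) * df2['total_waste_recycled_tonne']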
Try via merge() and pass how='right':
df = df1[['variable','value']].merge(df2[['waste_type','total_waste_recycled_tonne']],
                                     left_on='variable', right_on='waste_type',
                                     how='right')
Finally:
df2["energy_saved"]=df['value'].mul(df['total_waste_recycled_tonne'])
Output of df2:
waste_type total_waste_recycled_tonne year energy_saved
0 non-ferrous metal 160400.0 2015 2.245600e+09
1 glass 14600.0 2015 6.132000e+05
2 ferrous metal 15200.0 2015 9.758400e+06
3 plastic 766800.0 2015 4.427503e+09
4 plastic 762700.0 2015 4.403830e+09
A set_index + reindex option:
df2['energy_saved'] = (
    df1.set_index('variable').reindex(df2['waste_type'])['value'] *
    df2.set_index('waste_type')['total_waste_recycled_tonne']
).values
df2:
waste_type total_waste_recycled_tonne year energy_saved
0 non-ferrous metal 160400.0 2015 2.245600e+09
1 glass 14600.0 2015 6.132000e+05
2 ferrous metal 15200.0 2015 9.758400e+06
3 plastic 766800.0 2015 4.427503e+09
4 plastic 762700.0 2015 4.403830e+09
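A note on the trailing .values in the reindex option: the intermediate product is a Series indexed by the waste_type labels, and assigning it directly to df2 (which has a default integer index) would align on the index and produce all NaN. Converting to a plain array makes the assignment positional instead.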

I have multiIndexes for my dataframe, how do I calculate the sum for one level?

Hi everyone, I want to calculate the sum of the Violent_type counts per year. For example, the total count across all Violent_type values for the year 2013, which is 18728 + 121662 + 1035. But I don't know how to select the data when there is a MultiIndex. Any advice will be appreciated. Thanks.
The level argument in pandas.DataFrame.groupby() is what you are looking for.
level int, level name, or sequence of such, default None
If the axis is a MultiIndex (hierarchical), group by a particular level or levels.
To answer your question, you only need:
df.groupby(level=[0, 1]).sum()
# or
df.groupby(level=['district', 'year']).sum()
To see the effect
import pandas as pd
iterables = [['001', 'SST'], [2013, 2014], ['Dangerous', 'Non-Violent', 'Violent']]
index = pd.MultiIndex.from_product(iterables, names=['district', 'year', 'Violent_type'])
df = pd.DataFrame(list(range(0, len(index))), index=index, columns=['count'])
'''
print(df)
count
district year Violent_type
001 2013 Dangerous 0
Non-Violent 1
Violent 2
2014 Dangerous 3
Non-Violent 4
Violent 5
SST 2013 Dangerous 6
Non-Violent 7
Violent 8
2014 Dangerous 9
Non-Violent 10
Violent 11
'''
print(df.groupby(level=[0, 1]).sum())
'''
count
district year
001 2013 3
2014 12
SST 2013 21
2014 30
'''
print(df.groupby(level=['district', 'year']).sum())
'''
count
district year
001 2013 3
2014 12
SST 2013 21
2014 30
'''
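If you only want the per-year totals across all districts (e.g. the 2013 total the question asks about), group by just the year level; a small sketch using the same toy df:
print(df.groupby(level='year').sum())
# For the toy frame above this gives 24 for 2013 (0+1+2+6+7+8)
# and 42 for 2014 (3+4+5+9+10+11)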

How to calculate Cumulative Average Revenue? Python

I want to create a graph that will display the cumulative average revenue for each 'Year Onboarded' (first customer transaction) over a period of time. But I am making mistakes when grouping the information I need.
Toy Data:
import pandas as pd

dataset = {'ClientId': [1,2,3,1,2,3,1,2,3,1,2,3,4,4,4,4,4,4,4],
           'Year Onboarded': [2018,2019,2020,2018,2019,2020,2018,2019,2020,2018,2019,2020,2016,2016,2016,2016,2016,2016,2016],
           'Year': [2019,2019,2020,2019,2019,2020,2018,2020,2020,2020,2019,2020,2016,2017,2018,2019,2020,2017,2018],
           'Revenue': [100,50,25,30,40,50,60,100,20,40,100,20,5,5,8,4,10,20,8]}
df = pd.DataFrame(data=dataset)
Explanation: Customers have a designated 'Year Onboarded' and they make a transaction every 'Year' mentioned.
Then I calculate the years that have elapsed since the clients onboarded in order to make my graph visually more appealing.
df['Yearsdiff'] = df['Year']-df['Year Onboarded']
To calculate the Cumulative Average Revenue I tried the following methods:
First try:
df = df.join(df.groupby(['Year']).expanding().agg({'Revenue': 'mean'})
               .reset_index(level=0, drop=True)
               .add_suffix('_roll'))
df.groupby(['Year Onboarded', 'Year']).last().drop(columns=['Revenue'])
The output starts to be cumulative but the last row isn't cumulative anymore (not sure why).
Second Try:
df.groupby(['Year Onboarded','Year']).agg('mean') \
  .groupby(level=[1]) \
  .agg({'Revenue':np.cumsum})
But it doesn't work properly; I tried other ways as well but didn't achieve good results.
To visualize the cumulative average revenue I simply use sns.lineplot
My goal is to get a graph similar as the one below but for that I first need to group my data correctly.
Expected output plot
The Years that we can see on the graph represent the 'Year Onboarded' not the 'Year'.
Can someone help me calculate a Cumulative Average Revenue that works in order to plot a graph similar to the one above? Thank you
Also the data provided in the toy dataset will surely not give something similar to the example plot but the idea should be there.
This is how I would do it; since the toy data is not the same, some changes will probably be needed, but all in all:
import seaborn as sns
df1 = df.copy()
df1['Yearsdiff'] = df1['Year']-df1['Year Onboarded']
df1['Revenue'] = df.groupby(['Year Onboarded'])['Revenue'].transform('mean')
#Find the average revenue per Year Onboarded
df1['Revenue'] = df1.groupby(['Yearsdiff'])['Revenue'].transform('cumsum')
#Calculate the cumulative sum of Revenue (Which is now the average per Year Onboarded) per Yearsdiff (because this will be our X-axis in the plot)
sns.lineplot(x=df1['Yearsdiff'],y=df1['Revenue'],hue=df1['Year'])
#Finally plot the data, using the column 'Year' as hue to account for the different years.
You can create a rolling mean like this:
df['rolling_mean'] = df.groupby(['Year Onboarded'])['Revenue'].apply(lambda x: x.rolling(10, 1).mean())
df
# ClientId Year Onboarded Year Revenue rolling_mean
# 0 1 2018 2019 100 100.000000
# 1 2 2019 2019 50 50.000000
# 2 3 2020 2020 25 25.000000
# 3 1 2018 2019 30 65.000000
# 4 2 2019 2019 40 45.000000
# 5 3 2020 2020 50 37.500000
# 6 1 2018 2018 60 63.333333
# 7 2 2019 2020 100 63.333333
# 8 3 2020 2020 20 31.666667
# 9 1 2018 2020 40 57.500000
# 10 2 2019 2019 100 72.500000
# 11 3 2020 2020 20 28.750000
# 12 4 2016 2016 5 5.000000
# 13 4 2016 2017 5 5.000000
# 14 4 2016 2018 8 6.000000
# 15 4 2016 2019 4 5.500000
# 16 4 2016 2020 10 6.400000
# 17 4 2016 2017 20 8.666667
# 18 4 2016 2018 8 8.571429
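Since every group here has at most 7 rows, rolling(10, 1) behaves like an ever-growing window, i.e. a running average. If that is the intent, pandas can express it directly with an expanding mean; a sketch assuming the same df (the new column name is just illustrative):
df['cum_avg_revenue'] = (df.groupby(['Year Onboarded'])['Revenue']
                           .expanding().mean()
                           .reset_index(level=0, drop=True))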

Matplotlib Plot confusion

I'm not sure what is going on here but I have two seemingly similar bits of code designed to produce graphs in the same format:
apple_fcount = apple_f1.groupby("Year")["Domain Category"].nunique("count")
plt.figure(1); apple_fcount.plot(figsize=(12,6))
plt.xlabel('Year')
plt.ylabel('Number of Fungicides used')
plt.title('Number of fungicides used on apples in the US')
plt.savefig('C:/Users/User/Documents/Work/Year 3/Project/Plots/Apple/apple fcount')
This one produces the graph how I would like it to be seen: the y-axis shows the number of fungicides and the x-axis shows the respective years. However, the following code, on a different dataset, produces a usable graph, but the x-axis shows the years as '1, 2, 3, ...' instead of the actual years.
apple_yplot = apple_y1.groupby('Year')['Value'].sum()
plt.figure(3); apple_yplot.plot(figsize=(12,6))
plt.xlabel('Year')
plt.ylabel('Yield / lb per acre')
plt.title('Graph of apple yield in the US over time')
plt.savefig('C:/Users/User/Documents/Work/Year 3/Project/Plots/Apple/Yield.png')
The only discernible difference I see in the code is that the first counts .nunique() datapoints, whilst the second is a .sum() of all data in the year. I can't imagine that's the reason behind this issue, though. Both .groupby() lines print in the same format, with the year being correctly displayed there.
apple_fcount:
Year
1991 19
1993 19
1995 21
1997 26
1999 28
2001 27
2003 31
2005 37
2007 30
2009 35
2011 32
Name: Domain Category, dtype: int64
apple_yplot:
Year
2007 405399
2008 541180
2009 483130
2010 473150
2011 468120
2012 417710
2013 529470
2014 510700
Name: Value, dtype: float64
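The question is left open here, but a first diagnostic step (a hedged sketch, not a confirmed cause) is to compare what each Series actually carries as its index, since .plot() takes the x-axis from the index:
print(apple_fcount.index)
print(apple_yplot.index)
# If the second index turns out not to hold the year values (or holds them as strings),
# forcing it to integers before plotting is a reasonable thing to try:
apple_yplot.index = apple_yplot.index.astype(int)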
