Plotting pandas groupby - python

I have a dataframe with some car data - the structure is pretty simple. I have an ID, the year of production, the kilometers, the price and the fuel type (petrol/diesel).
In [106]:
stack.head()
Out[106]:
year km price fuel
0 2003 165.286 2.350 petrol
1 2005 195.678 3.350 diesel
2 2002 125.262 2.450 petrol
3 2002 161.000 1.999 petrol
4 2002 164.851 2.599 diesel
I am trying to produce a chart with pylab/matplotlib where the x-axis is the year and, using groupby, to get two plots (one for each fuel type) showing the averages by year (mean function) for price and km.
Any help would be appreciated.

Maybe there's a more straightforward way to do it, but I would do the following. First, group by and take the mean for price:
meanprice = df.groupby(['year','fuel'])['price'].mean().reset_index()
and for km:
meankm = df.groupby(['year','fuel'])['km'].mean().reset_index()
Then I would merge the two resulting dataframes to get all data in one:
d = pd.merge(meanprice,meankm,on=['year','fuel']).set_index('year')
Setting the index to year makes plotting with pandas easy. The resulting dataframe is:
fuel price km
year
2002 diesel 2.5990 164.851
2002 petrol 2.2245 143.131
2003 petrol 2.3500 165.286
2005 diesel 3.3500 195.678
At the end you can plot, filtering by fuel:
d[d['fuel']=='diesel'].plot(kind='bar')
d[d['fuel']=='petrol'].plot(kind='bar')
obtaining one bar chart per fuel type. I don't know if that's the kind of plot you expected, but you can easily modify it with the kind keyword. Hope that helps.
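If you'd rather skip the merge step, a single groupby over both columns produces the same table in one pass; a minimal runnable sketch (sample values taken from the question's head()):

```python
import pandas as pd

# Sample data matching the question's structure
df = pd.DataFrame({
    'year': [2003, 2005, 2002, 2002, 2002],
    'km': [165.286, 195.678, 125.262, 161.000, 164.851],
    'price': [2.350, 3.350, 2.450, 1.999, 2.599],
    'fuel': ['petrol', 'diesel', 'petrol', 'petrol', 'diesel'],
})

# One groupby over both columns replaces the two groupbys plus the merge
means = df.groupby(['year', 'fuel'])[['km', 'price']].mean().reset_index()
print(means)

# Plotting then works the same way, filtering by fuel:
# for fuel, grp in means.groupby('fuel'):
#     grp.set_index('year')[['km', 'price']].plot(kind='bar', title=fuel)
```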

Related

Pandas select rows by multiple conditions on columns

I would like to reduce my code: instead of 2 lines, I would like to select rows by 3 conditions on 2 columns.
My DataFrame contains countries' populations between 2000 and 2018, by granularity (Total, Female, Male, Urban, Rural):
Zone Granularity Year Value
0 Afghanistan Total 2000 20779.953
1 Afghanistan Male 2000 10689.508
2 Afghanistan Female 2000 10090.449
3 Afghanistan Rural 2000 15657.474
4 Afghanistan Urban 2000 4436.282
20909 Zimbabwe Total 2018 14438.802
20910 Zimbabwe Male 2018 6879.119
20911 Zimbabwe Female 2018 7559.693
20912 Zimbabwe Rural 2018 11465.748
20913 Zimbabwe Urban 2018 5447.513
I would like all rows of the Year 2017 with granularity Total AND Urban.
I tried something like the line below, but it doesn't work, even though each condition works well on its own.
df.loc[(df['Granularity'].isin(['Total', 'Urban'])) & (df['Year'] == '2017')]
Thanks for any tips.
Very likely, you're using the wrong type for the year. I imagine these are integers.
You should try:
df.loc[(df['Granularity'].isin(['Total', 'Urban'])) & df['Year'].eq(2017)]
output (for the Year 2018 as 2017 is missing from the provided data):
Zone Granularity Year Value
20909 Zimbabwe Total 2018 14438.802
20913 Zimbabwe Urban 2018 5447.513
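To illustrate the dtype point above, here is a minimal runnable sketch (rows reconstructed from the question's sample): comparing an integer Year column to the string '2018' silently matches nothing, while comparing to an integer works.

```python
import pandas as pd

# Minimal reproduction of the dtype pitfall
df = pd.DataFrame({
    'Zone': ['Zimbabwe'] * 5,
    'Granularity': ['Total', 'Male', 'Female', 'Rural', 'Urban'],
    'Year': [2018] * 5,  # integer dtype, not strings
    'Value': [14438.802, 6879.119, 7559.693, 11465.748, 5447.513],
})

# Comparing an int column to a string matches nothing
assert df.loc[df['Year'] == '2018'].empty

# Compare to an integer instead
out = df.loc[df['Granularity'].isin(['Total', 'Urban']) & df['Year'].eq(2018)]
print(out)
```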

How to sum the values of one column with respect to a value of another column in python

Suppose I have data like this
Year Population
2016 1000
2016 1200
2017 1400
2017 1500
2018 1600
2018 1600
Now I need to aggregate the data like this, depending on the year values:
Year Population
2016 2200
2017 2900
Here I don't need the values for 2018; I only need the sums for 2016 and 2017. How can I achieve this?
There are just so many ways to achieve this.
You could do:
df.groupby('Year').sum().drop(2018).reset_index()
or:
df.query('Year != 2018').groupby('Year', as_index=False).sum()
output:
Year Population
0 2016 2200
1 2017 2900
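A runnable sketch of the second approach, using the question's sample data:

```python
import pandas as pd

# The question's sample data
df = pd.DataFrame({'Year': [2016, 2016, 2017, 2017, 2018, 2018],
                   'Population': [1000, 1200, 1400, 1500, 1600, 1600]})

# Filter first, then aggregate: only 2016 and 2017 survive
result = df.query('Year != 2018').groupby('Year', as_index=False).sum()
print(result)
```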

Comparing Values in Multiple Columns and Returning all Declining Regions

I have a data frame that is similar to the following, and lets say I have sales amounts for different regions for two different years:
Company    2021 Region 1 Sales  2021 Region 2 Sales  2020 Region 1 Sales  2020 Region 2 Sales
Company 1  300000               150000               250000               149000
Company 2  10000                17000                100000               80000
Company 3  12000                20000                22000                90000
I would like to compare each region for each year to determine which regions have declined in 2021. One caveat is that the regional sales have to be at least $25,000 to be counted. Therefore, I am looking to add a new column with all of the region names that had less than $25,000 in sales in 2021, but more than $25,000 in 2020. The output would look like this, although there will be more columns or "regions" to compare than 2.
Company    2021 Region 1 Sales  2021 Region 2 Sales  2020 Region 1 Sales  2020 Region 2 Sales  2021 Lost Regions
Company 1  300000               150000               250000               149000               None
Company 2  10000                17000                100000               80000                Region 1; Region 2
Company 3  12000                20000                22000                90000                Region 2
Thank you in advance for any assistance, and no rush on this. Hopefully there is a concise way to do this without using if-then and writing out a lot of combinations.
number_of_regions = 2  # You have to change this

def find_declined_regions(row):
    result = []
    for i in range(1, number_of_regions + 1):
        if row[f"2021 Region {i} Sales"] < 25000 and row[f"2020 Region {i} Sales"] > 25000:
            result.append(f"Region {i}")
    return "; ".join(result)

df.apply(find_declined_regions, axis=1)
df is your DataFrame and you have to change number_of_regions based on your problem.
EDIT:
If the column names are all different, there are two cases:
1- You have a list of all regions, so you can do this:
for region in all_regions:
    if row[f"2021 {region} Sales"] < 25000 and row[f"2020 {region} Sales"] > 25000:
2- You don't have a list of all regions, so you have to create one:
all_regions = [col[5:-6] for col in df.columns[1:int(len(df.columns)/2)+1]]
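Putting the pieces together, a runnable sketch with the question's data (the region list is derived from the column labels, as in case 2, rather than sliced by fixed positions):

```python
import pandas as pd

# The question's data
df = pd.DataFrame({
    'Company': ['Company 1', 'Company 2', 'Company 3'],
    '2021 Region 1 Sales': [300000, 10000, 12000],
    '2021 Region 2 Sales': [150000, 17000, 20000],
    '2020 Region 1 Sales': [250000, 100000, 22000],
    '2020 Region 2 Sales': [149000, 80000, 90000],
})

# Derive region names from the column labels: "2021 Region 1 Sales" -> "Region 1"
all_regions = sorted({col[5:-6] for col in df.columns if col.endswith('Sales')})

def find_declined_regions(row):
    """Regions above $25,000 in 2020 that fell below $25,000 in 2021."""
    result = [region for region in all_regions
              if row[f"2021 {region} Sales"] < 25000 < row[f"2020 {region} Sales"]]
    return "; ".join(result) or None

df['2021 Lost Regions'] = df.apply(find_declined_regions, axis=1)
print(df)
```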

Incorrect CAGR output using python in a Pandas dataframe

Apologies if this is an easy fix, but I can't figure out where my problem is - I am a relatively new programmer and have tried to find solutions elsewhere to no luck.
The issue:
I am trying to calculate CAGR in a Pandas Dataframe, but the resultant metric does not match the calculation output in excel and also a third party check.
The DataFrame: simply a listing of countries (rows, e.g. 'Afghanistan', 'Albania', ...) and a listing of years (columns, e.g. '1913', '1914', ...), with GDP in the body of the table.
The code:
df_gdp['CAGR'] = ((df_gdp['2013']/df_gdp['1913'])**(1/(100)-1)*100)
The result:
I have added a column at the end with the Excel-calculated results, which shows the differences. Even in the first two rows (Afghanistan and Albania) the CAGR calc looks incorrect, as Albania has clearly grown more than Afghanistan.
1913 2013 CAGR Excel
country
Afghanistan 4,920,000,000 65,800,000,000 7.673647 2.627
Albania 1,470,000,000 30,700,000,000 4.936023 3.086
Algeria 22,600,000,000 479,000,000,000 4.864466 3.101
Angola 3,230,000,000 152,000,000,000 2.208439 3.927
The problem was the parentheses in the formula:
df_gdp['CAGR1'] = ((df_gdp['2013']/df_gdp['1913'])**(1/100)-1) * 100
print (df_gdp)
1913 2013 CAGR Excel CAGR1
Afghanistan 4920000000 65800000000 7.673647 2.627 2.627230
Albania 1470000000 30700000000 4.936023 3.086 3.085649
Algeria 22600000000 479000000000 4.864466 3.101 3.100856
Angola 3230000000 152000000000 2.208439 3.927 3.926526
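To double-check the corrected formula against the Excel column, a small runnable sketch with the question's values:

```python
import pandas as pd

# GDP values from the question (1913 and 2013)
df_gdp = pd.DataFrame({'1913': [4.92e9, 1.47e9, 2.26e10, 3.23e9],
                       '2013': [6.58e10, 3.07e10, 4.79e11, 1.52e11]},
                      index=['Afghanistan', 'Albania', 'Algeria', 'Angola'])

# CAGR over n = 100 years: (end/start)**(1/n) - 1, as a percentage.
# The original bug placed the "- 1" inside the exponent: (1/(100) - 1).
n = 100
df_gdp['CAGR'] = ((df_gdp['2013'] / df_gdp['1913']) ** (1 / n) - 1) * 100
print(df_gdp.round(3))
```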

Rolling mean with varying window length in Python

I am working with NLSY79 data and I am trying to construct a 'smoothed' income variable that averages over a period of 4 years. Between 1979 and 1994, the NLSY conducted surveys annually, while after 1996 the survey was conducted biennially. This means that my smoothed income variable will average four observations prior to 1994 and only two after 1996.
I would like my smoothed income variable to satisfy the following criteria:
1) It should be an average of 4 income observations from 1979 to 1994 and only 2 from 1996 onward
2) The window should START from a given observation rather than be centered at it. Therefore, my smoothed income variable should tell me the average income over the four years starting from that date
3) It should ignore NaNs
It should, therefore, look like the following (note that I only computed values for 'smoothed income' that could be computed with the data I have provided.)
id year income 'smoothed income'
1 1979 20,000 21,250
1 1980 22,000
1 1981 21,000
1 1982 22,000
...
1 2014 34,000 34,500
1 2016 35,000
2 1979 28,000 28,333
2 1980 NaN
2 1981 28,000
2 1982 29,000
I am relatively new to dataframe manipulation with pandas, so here is what I have tried:
smooth = DATA.groupby('id')['income'].rolling(window=4, min_periods=1).mean()
DATA['smoothIncome'] = smooth.reset_index(level=0, drop=True)
This code accounts for NaNs, but otherwise does not accomplish objectives 1) and 2).
Any help would be much appreciated
OK, I've modified the code provided by ansev to make it work. Filling in NaNs was causing the problems.
Here's the modified code:
df.set_index('year').groupby('id').income.apply(
    lambda x: x.reindex(range(x.index.min(), x.index.max() + 1))
               .rolling(4, min_periods=1).mean().shift(-3)
).reset_index()
The only problem I have now is that the mean is not calculated when there are fewer than 4 years remaining (e.g. from 2014 onward, because my data goes until 2016). Is there a way of shortening the window length after 2014?
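One way to let the window shrink near the end of the data, instead of shifting a fixed backward window, is a forward-looking window via pandas' FixedForwardWindowIndexer (available since pandas 1.1); a sketch with made-up rows in the question's shape (not actual NLSY data):

```python
import pandas as pd
from pandas.api.indexers import FixedForwardWindowIndexer

# Rows in the question's shape; values are assumptions for illustration
df = pd.DataFrame({
    'id': [1, 1, 1, 1, 2, 2, 2, 2],
    'year': [1979, 1980, 1981, 1982, 1979, 1980, 1981, 1982],
    'income': [20000, 22000, 21000, 22000, 28000, None, 28000, 29000],
})

# A forward-looking 4-observation window that shrinks at the end of each
# group, so the last years still get an average instead of NaN
indexer = FixedForwardWindowIndexer(window_size=4)

def smooth(group):
    # Reindex to consecutive years so biennial gaps count as missing years
    g = group.set_index('year')['income']
    g = g.reindex(range(g.index.min(), g.index.max() + 1))
    return g.rolling(indexer, min_periods=1).mean()

smoothed = df.groupby('id')[['year', 'income']].apply(smooth)
print(smoothed)
```

With min_periods=1, NaN incomes (and the reindexed gap years) are simply ignored, matching the question's objective 3).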
