Incorrect CAGR output using python in a Pandas dataframe - python

Apologies if this is an easy fix, but I can't figure out where my problem is. I am a relatively new programmer and have tried to find solutions elsewhere, with no luck.
The issue:
I am trying to calculate CAGR in a pandas DataFrame, but the resulting metric does not match the output of the same calculation in Excel, nor a third-party check.
The DataFrame: rows are countries (e.g. 'Afghanistan', 'Albania', …) and columns are years (e.g. '1913', '1914', …), with GDP values in the body of the table.
The code:
df_gdp['CAGR'] = ((df_gdp['2013']/df_gdp['1913'])**(1/(100)-1)*100)
The result:
I have added a column at the end with the Excel-calculated results, which shows the differences. Even the first two rows (Afghanistan and Albania) look wrong: Albania has clearly grown more than Afghanistan, yet its CAGR comes out lower.
1913 2013 CAGR Excel
country
Afghanistan 4,920,000,000 65,800,000,000 7.673647 2.627
Albania 1,470,000,000 30,700,000,000 4.936023 3.086
Algeria 22,600,000,000 479,000,000,000 4.864466 3.101
Angola 3,230,000,000 152,000,000,000 2.208439 3.927

The problem was the placement of parentheses in the formula; the `- 1` belongs outside the exponent, not inside it:
df_gdp['CAGR1'] = ((df_gdp['2013']/df_gdp['1913'])**(1/100)-1) * 100
print (df_gdp)
1913 2013 CAGR Excel CAGR1
Afghanistan 4920000000 65800000000 7.673647 2.627 2.627230
Albania 1470000000 30700000000 4.936023 3.086 3.085649
Algeria 22600000000 479000000000 4.864466 3.101 3.100856
Angola 3230000000 152000000000 2.208439 3.927 3.926526
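The exponent generalizes to 1/n, where n is the number of years between the two observations (here 2013 − 1913 = 100). A small sketch with a hypothetical helper function:

```python
def cagr(start_value, end_value, years):
    """Compound annual growth rate between two values, in percent."""
    return ((end_value / start_value) ** (1 / years) - 1) * 100

# Afghanistan: 4.92e9 in 1913 grows to 6.58e10 by 2013 (100 years)
print(round(cagr(4.92e9, 6.58e10, 100), 3))  # 2.627
```

Parameterizing the year span avoids the hard-coded `1/100` if the column pair ever changes.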

Related

In Python, how do I prevent a name from repeating in a dataset and sum or average the information given in the data?

I am analyzing COVID-19 (coronavirus) data published by Our World in Data on GitHub. The data will be presented in an organized table format.
Link to data
https://raw.githubusercontent.com/owid/covid-19-data/master/public/data/owid-covid-data.csv
Project Goals:
List every country in alphabetical order
List the total number of tests conducted by each country
List the total number of vaccinated peoples in each country
List the total number of deaths of each country
Show the average age of death to Covid-19 in each country
Here is an example of the data
iso_code,continent,location,date,total_cases,new_cases,new_cases_smoothed,total_deaths,new_deaths,new_deaths_smoothed,total_cases_per_million,new_cases_per_million,new_cases_smoothed_per_million,total_deaths_per_million,new_deaths_per_million,new_deaths_smoothed_per_million,reproduction_rate,icu_patients,icu_patients_per_million,hosp_patients,hosp_patients_per_million,weekly_icu_admissions,weekly_icu_admissions_per_million,weekly_hosp_admissions,weekly_hosp_admissions_per_million,total_tests,new_tests,total_tests_per_thousand,new_tests_per_thousand,new_tests_smoothed,new_tests_smoothed_per_thousand,positive_rate,tests_per_case,tests_units,total_vaccinations,people_vaccinated,people_fully_vaccinated,total_boosters,new_vaccinations,new_vaccinations_smoothed,total_vaccinations_per_hundred,people_vaccinated_per_hundred,people_fully_vaccinated_per_hundred,total_boosters_per_hundred,new_vaccinations_smoothed_per_million,new_people_vaccinated_smoothed,new_people_vaccinated_smoothed_per_hundred,stringency_index,population_density,median_age,aged_65_older,aged_70_older,gdp_per_capita,extreme_poverty,cardiovasc_death_rate,diabetes_prevalence,female_smokers,male_smokers,handwashing_facilities,hospital_beds_per_thousand,life_expectancy,human_development_index,population,excess_mortality_cumulative_absolute,excess_mortality_cumulative,excess_mortality,excess_mortality_cumulative_per_million
AFG,Asia,Afghanistan,2020-02-24,5.0,5.0,,,,,0.122,0.122,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,8.33,54.422,18.6,2.581,1.337,1803.987,,597.029,9.59,,,37.746,0.5,64.83,0.511,41128772.0,,,,
AFG,Asia,Afghanistan,2020-02-25,5.0,0.0,,,,,0.122,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,8.33,54.422,18.6,2.581,1.337,1803.987,,597.029,9.59,,,37.746,0.5,64.83,0.511,41128772.0,,,,
AFG,Asia,Afghanistan,2020-02-26,5.0,0.0,,,,,0.122,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,8.33,54.422,18.6,2.581,1.337,1803.987,,597.029,9.59,,,37.746,0.5,64.83,0.511,41128772.0,,,,
AFG,Asia,Afghanistan,2020-02-27,5.0,0.0,,,,,0.122,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,8.33,54.422,18.6,2.581,1.337,1803.987,,597.029,9.59,,,37.746,0.5,64.83,0.511,41128772.0,,,,
Overall Question:
How do I aggregate by country so the information does not repeat? I want each country to appear once, with the sum of its vaccination figures next to it. Example of how
I want the data to display:
Country Vaccinations
Afghanistan 30235
Albania 15032
Andorra 2352
I have tried to import the data with pandas and sum it, but I am not quite sure how to get the total for one specific country. I want to display a single country by itself, but I end up creating new tables and confusing myself. I am a beginner here and this dataset is very large.
df = pd.read_csv(r'https://raw.githubusercontent.com/owid/covid-19-data/master/public/data/owid-covid-data.csv')
df[['location', 'total_vaccinations']].groupby('location').sum().reset_index()
location total_vaccinations
0 Afghanistan 4.013568e+08
1 Africa 1.853572e+11
2 Albania 3.587615e+08
3 Algeria 2.983973e+08
4 Andorra 4.593697e+06
.. ... ...
243 Western Sahara 0.000000e+00
244 World 4.857404e+12
245 Yemen 2.126927e+07
246 Zambia 5.435039e+08
247 Zimbabwe 3.211422e+09
[248 rows x 2 columns]
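One caveat worth flagging: in the OWID file, `total_vaccinations` appears to be a cumulative running total reported per date, so summing it across rows double-counts heavily (which is why the figures above are so large). Taking the latest value per country, e.g. with `max()`, is usually closer to what is wanted. A sketch on a small synthetic frame standing in for the real data:

```python
import pandas as pd

# Synthetic stand-in: cumulative totals reported on successive dates
df = pd.DataFrame({
    'location': ['Afghanistan'] * 3 + ['Albania'] * 2,
    'total_vaccinations': [100, 250, 400, 50, 90],
})

# sum() would give 750 and 140; max() gives the latest cumulative total
latest = df.groupby('location')['total_vaccinations'].max().reset_index()
print(latest)
```

The same `groupby` pattern applies to the real CSV once it is loaded with `pd.read_csv`.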

Create a column that divides the other 2 columns using Pandas Apply()

Given a dataset -
country year cases population
Afghanistan 1999 745 19987071
Brazil 1999 37737 172006362
China 1999 212258 1272915272
Afghanistan 2000 2666 20595360
Brazil 2000 80488 174504898
China 2000 213766 1280428583
The task is to get the ratio of cases to population using the pandas apply function, in a new column called "prevalence"
This is what I have written
def calc_prevalence(G):
    assert 'cases' in G.columns and 'population' in G.columns
    G_copy = G.copy()
    G_copy['prevalence'] = G_copy['cases','population'].apply(lambda x: (x['cases']/x['population']))
    display(G_copy)
but I am getting a
KeyError: ('cases', 'population')
Here is a solution that applies a named function to the dataframe without using lambda:
def calculate_ratio(row):
    return row['cases']/row['population']

df['prevalence'] = df.apply(calculate_ratio, axis=1)
print(df)
#output:
country year cases population prevalence
0 Afghanistan 1999 745 19987071 0.000037
1 Brazil 1999 37737 172006362 0.000219
2 China 1999 212258 1272915272 0.000167
3 Afghanistan 2000 2666 20595360 0.000129
4 Brazil 2000 80488 174504898 0.000461
5 China 2000 213766 1280428583 0.000167
First, unless you've been explicitly told to use an apply function here for some reason, you can perform the operation on the columns themselves, resulting in a much faster vectorized operation, i.e.:
G_copy['prevalence']=G_copy['cases']/G_copy['population']
Alternatively, if you must use apply for some reason, apply it to the DataFrame instead of the two Series:
G_copy['prevalence'] = G_copy.apply(lambda row: row['cases']/row['population'], axis=1)
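For completeness, here is a self-contained version of the vectorized approach; the DataFrame literal below is reconstructed from the sample rows in the question:

```python
import pandas as pd

df = pd.DataFrame({
    'country': ['Afghanistan', 'Brazil', 'China'],
    'year': [1999, 1999, 1999],
    'cases': [745, 37737, 212258],
    'population': [19987071, 172006362, 1272915272],
})

# Element-wise division of two Series; no apply needed
df['prevalence'] = df['cases'] / df['population']
print(df)
```

Dividing the Series directly also sidesteps the `KeyError`, which came from `G_copy['cases','population']` being interpreted as a single (tuple) column label rather than two columns.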

Discrepancy in data values while opening .csv file manually and by using python query

Data Source: https://www.kaggle.com/worldbank/world-development-indicators
Folder: 'world-development-indicators'
When I manually check the database by opening the csv file in MS-Excel, I find the number of years to be from 1960 to 1980 (min year 1960 and max year 1980).
However, when I run the commands below in Python, I get years from 1960 to 2015: the max year is 2015 (the min year is still 1960).
data = pd.read_csv('./world-development-indicators/Indicators.csv')
years = data['Year'].unique().tolist()
len(years)
o/p: 56
min(years)
o/p: 1960
max(years)
o/p: 2015
If the maximum year in the .csv file when opened manually is 1980, why am I getting 2015 as the maximum value of the Year column when querying in Python?
Has anyone faced such an issue? Can anyone please help?
The file you have mentioned contains 5.65 million records, far more than the 1,048,576 rows a modern Excel worksheet can hold. I have tested this in MS-Excel as well as LibreOffice on Linux; both give an error message that not all rows could be loaded. Hence, you see records only until 1980.
I did a:
data.describe()
And found the min and max to be 1960 and 2015. Also, the Year column increases through the file, so the truncated rows are all recent. If you do a data.head(5) and data.tail(5), you will notice the following:
data.tail(5)
Out[109]:
CountryName CountryCode ... Year Value
5656453 Zimbabwe ZWE ... 2015 36.0
5656454 Zimbabwe ZWE ... 2015 90.0
5656455 Zimbabwe ZWE ... 2015 242.0
5656456 Zimbabwe ZWE ... 2015 3.3
5656457 Zimbabwe ZWE ... 2015 32.8
[5 rows x 6 columns]
data.head(5)
Out[110]:
CountryName CountryCode ... Year Value
0 Arab World ARB ... 1960 1.335609e+02
1 Arab World ARB ... 1960 8.779760e+01
2 Arab World ARB ... 1960 6.634579e+00
3 Arab World ARB ... 1960 8.102333e+01
4 Arab World ARB ... 1960 3.000000e+06
PS: If you use Spyder, you can open the Variable Explorer section, double click on data, and you should see all the records. I prefer this over opening in Excel because Excel usually truncates the records at the bottom if the file is large.
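To double-check what pandas actually loaded, note that the Series `min()`/`max()` are methods and must be called with parentheses; on a plain Python list, use the built-ins `min()`/`max()` instead. A sketch on synthetic data standing in for Indicators.csv:

```python
import pandas as pd

# Synthetic stand-in for Indicators.csv: Year increases through the file
data = pd.DataFrame({'Year': [1960] * 3 + [1980] * 3 + [2015] * 3})

years = data['Year'].unique().tolist()
print(len(years))                               # 3
print(min(years), max(years))                   # built-ins work on a plain list
print(data['Year'].min(), data['Year'].max())   # Series methods must be called
```

On the real file, `len(data)` exceeds Excel's row limit while pandas reads every row, which is exactly the discrepancy described above.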

Plotting pandas groupby

I have a dataframe with some car data - the structure is pretty simple. I have an ID, the year of production, the kilometers, the price and the fuel type (petrol/diesel).
In [106]:
stack.head()
Out[106]:
year km price fuel
0 2003 165.286 2.350 petrol
1 2005 195.678 3.350 diesel
2 2002 125.262 2.450 petrol
3 2002 161.000 1.999 petrol
4 2002 164.851 2.599 diesel
I am trying to produce a chart with pylab/matplotlib where the x-axis will be the year and then, using groupby, to have two plots (one for each fuel type) with averages by year (mean function) for price and km.
Any help would be appreciated.
Maybe there's a more direct way to do it, but I would do the following. First, group by and take the means for price:
meanprice = df.groupby(['year','fuel'])['price'].mean().reset_index()
and for km:
meankm = df.groupby(['year','fuel'])['km'].mean().reset_index()
Then I would merge the two resulting dataframes to get all data in one:
d = pd.merge(meanprice,meankm,on=['year','fuel']).set_index('year')
Setting the index to year makes plotting with pandas easy. The resulting dataframe is:
fuel price km
year
2002 diesel 2.5990 164.851
2002 petrol 2.2245 143.131
2003 petrol 2.3500 165.286
2005 diesel 3.3500 195.678
At the end you can plot, filtering by fuel:
d[d['fuel']=='diesel'].plot(kind='bar')
d[d['fuel']=='petrol'].plot(kind='bar')
obtaining one bar chart per fuel type. I don't know if it is the kind of plot you expected, but you can easily modify them with the kind keyword. Hope that helps.
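An alternative that avoids the merge: aggregate both columns in a single groupby and `unstack` the fuel level into columns, which pandas can plot directly. A sketch on the sample rows from the question:

```python
import pandas as pd

df = pd.DataFrame({
    'year': [2003, 2005, 2002, 2002, 2002],
    'km': [165.286, 195.678, 125.262, 161.000, 164.851],
    'price': [2.350, 3.350, 2.450, 1.999, 2.599],
    'fuel': ['petrol', 'diesel', 'petrol', 'petrol', 'diesel'],
})

# One groupby over both columns, then pivot fuel into the column axis
means = df.groupby(['year', 'fuel'])[['price', 'km']].mean().unstack('fuel')
print(means)

# With matplotlib available, one chart per metric with a bar per fuel:
# means['price'].plot(kind='bar'); means['km'].plot(kind='bar')
```

Missing year/fuel combinations show up as NaN rather than being dropped, which keeps the two fuel series aligned on the same year axis.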

Getting unique rows conditioned on year pandas python dataframe

I have a dataframe of this form; in my final dataframe, however, I'd like to keep only the rows with unique values per year.
Name Org Year
4 New York University doclist[1] 2004
5 Babson College doclist[2] 2008
6 Babson College doclist[5] 2008
So ideally, my dataframe will look like this instead
4 New York University doclist[1] 2004
5 Babson College doclist[2] 2008
What I've done so far: I've grouped by year, and I seem to be able to get the unique names per year. However, I am stuck because I lose all the other information, such as the "Org" column. Advice appreciated!
#how to get unique rows per year?
q = z.groupby(['Year'])
#print q.head()
#q.reset_index(level=0, drop=True)
q.Name.apply(lambda x: np.unique(x))
For this I get the following output. How do I include the other columns' information as well as remove the secondary index (e.g. 6, 68, 66, 72)?
Year
2008 6 Babson College
68 European Economic And Social Committee
66 European Union
72 Ewing Marion Kauffman Foundation
If all you want to do is keep the first entry for each name, you can use drop_duplicates. Note that this keeps the first entry based on however your data is sorted, so you may want to sort first if you want to keep a specific entry.
In [98]: q.drop_duplicates(subset='Name')
Out[98]:
Name Org Year
0 New York University doclist[1] 2004
1 Babson College doclist[2] 2008
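Since the goal is uniqueness per year rather than globally, deduplicating on the (Name, Year) pair may be closer to what is asked: a school that appears again in a later year would then survive. A sketch on the sample rows:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['New York University', 'Babson College', 'Babson College'],
    'Org': ['doclist[1]', 'doclist[2]', 'doclist[5]'],
    'Year': [2004, 2008, 2008],
}, index=[4, 5, 6])

# Keep the first row for each (Name, Year) combination; Org is preserved
out = df.drop_duplicates(subset=['Name', 'Year'])
print(out)
```

Because `drop_duplicates` returns whole rows, the other columns come along for free, which addresses the "losing the Org column" problem with the groupby approach.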
