using numpy to calculate mean - python

I am trying to calculate the mean GNP for each country from 2006 to 2015. But when I apply the aggregation with the mean function, it does not calculate the mean over 2006 to 2015. Instead, it just displays the values for each year. Please tell me what went wrong. I am able to sort by country, but the mean just won't work on the data.
wb_indicator = 'NY.GNP.ATLS.CD'
start_year = 2006
end_year = 2015
df_ex = wb.download(indicator = wb_indicator,
country = ['all'],
start = start_year,
end = end_year)
df_ex1 = df_ex.reset_index()
df_ex1.groupby(['country']).agg({'NY.GNP.ATLS.CD': [np.mean]})
df_ex1.head(20)
Output:
    country                 year  NY.GNP.ATLS.CD
0   Arab World              2015  2.767920e+12
1   Arab World              2014  2.897113e+12
2   Arab World              2013  2.832769e+12
3   Arab World              2012  2.590610e+12
4   Arab World              2011  2.190786e+12
5   Arab World              2010  2.055967e+12
6   Arab World              2009  1.932056e+12
7   Arab World              2008  1.858270e+12
8   Arab World              2007  1.547924e+12
9   Arab World              2006  1.312967e+12
10  Caribbean small states  2015  6.680302e+10
11  Caribbean small states  2014  6.664219e+10

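groupby(...).agg(...) returns a new DataFrame and leaves df_ex1 untouched, so df_ex1.head(20) still prints the original rows. Assign the aggregated result to a variable and inspect that instead. This should work: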
import pandas as pd
import wbdata as wb
import datetime
wb_indicator = 'NY.GNP.ATLS.CD'
data_date = (datetime.datetime(2006, 1, 1), datetime.datetime(2015, 1, 1))
data = wb.get_data(wb_indicator, data_date=data_date, pandas=True)
gnp_means = data.reset_index().groupby('country').mean()
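The point about assigning the result can be checked without a World Bank download. The following is only a minimal sketch: the small frame, its values, and the country labels "A" and "B" are made up for illustration and are not part of the original data.

```python
import pandas as pd

# Hypothetical stand-in for the downloaded data: two countries, three years each.
df_ex1 = pd.DataFrame({
    "country": ["A", "A", "A", "B", "B", "B"],
    "year": [2013, 2014, 2015, 2013, 2014, 2015],
    "NY.GNP.ATLS.CD": [1.0, 2.0, 3.0, 10.0, 20.0, 30.0],
})

# groupby(...).mean() returns a new object; df_ex1 itself is left unchanged,
# so the result has to be bound to a name before it can be inspected.
gnp_means = df_ex1.groupby("country")["NY.GNP.ATLS.CD"].mean()

print(gnp_means)      # one mean per country
print(len(df_ex1))    # the original frame still has all six rows
```

Calling df_ex1.head() after the groupby would show the same six rows as before, which is what the question observed.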

Related

Dataframes' subtraction and assignment gives back NAs

Let's suppose that I have a dataset (df_data) such as the following:
Time Geography Population
2016 England and Wales 58381200
2017 England and Wales 58744600
2016 Northern Ireland 1862100
2017 Northern Ireland 1870800
2016 Scotland 5404700
2017 Scotland 5424800
2016 Wales 3113200
2017 Wales 3125200
If I do the following:
df_nireland = df_data[df_data['Geography']=='Northern Ireland']
df_wales = df_data[df_data['Geography']=='Wales']
df_scotland = df_data[df_data['Geography']=='Scotland']
df_engl_n_wales = df_data[df_data['Geography']=='England and Wales']
df_england = df_engl_n_wales
df_england['Population'] = df_engl_n_wales['Population'] - df_wales['Population']
then df_england has NA values in the Population column.
How can I fix this?
By the way, I have read the relevant posts, but nothing worked for me (.loc, .copy, etc.).
This is really an organization problem. If you pivot, the subtraction becomes easy and is guaranteed to align on Time:
df_pop = df_data.pivot(index='Time', columns='Geography', values='Population')
df_pop['England'] = df_pop['England and Wales'] - df_pop['Wales']
Output df_pop:
Geography England and Wales Northern Ireland Scotland Wales England
Time
2016 58381200 1862100 5404700 3113200 55268000
2017 58744600 1870800 5424800 3125200 55619400
If you need to get back to your original format, then you can do:
df_pop.stack().to_frame('Population').reset_index()
# Time Geography Population
#0 2016 England and Wales 58381200
#1 2016 Northern Ireland 1862100
#2 2016 Scotland 5404700
#3 2016 Wales 3113200
#4 2016 England 55268000
#5 2017 England and Wales 58744600
#6 2017 Northern Ireland 1870800
#7 2017 Scotland 5424800
#8 2017 Wales 3125200
#9 2017 England 55619400
I simply had to do the following:
df_nireland = df_data[df_data['Geography']=='Northern Ireland'].reset_index(drop=True)
df_wales = df_data[df_data['Geography']=='Wales'].reset_index(drop=True)
df_scotland = df_data[df_data['Geography']=='Scotland'].reset_index(drop=True)
df_engl_n_wales = df_data[df_data['Geography']=='England and Wales'].reset_index(drop=True)
df_england = df_engl_n_wales
df_england['Population'] = df_engl_n_wales['Population'] - df_wales['Population']
or, better in principle since it retains the indices of the initial dataframe, the following:
df_nireland = df_data[df_data['Geography']=='Northern Ireland']
df_wales = df_data[df_data['Geography']=='Wales']
df_scotland = df_data[df_data['Geography']=='Scotland']
df_engl_n_wales = df_data[df_data['Geography']=='England and Wales']
df_england = df_engl_n_wales
df_england['Population'] = df_engl_n_wales['Population'] - df_wales['Population'].values
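The NA values come from index alignment: pandas subtracts Series by row label, and the filtered subsets keep the row labels of df_data, so no label appears on both sides. A minimal sketch of this, using a cut-down version of the question's data (the names misaligned and england are illustrative, not from the original posts):

```python
import pandas as pd

df_data = pd.DataFrame({
    "Time": [2016, 2017, 2016, 2017],
    "Geography": ["England and Wales", "England and Wales", "Wales", "Wales"],
    "Population": [58381200, 58744600, 3113200, 3125200],
})

# .copy() makes the subsets independent of df_data.
engl_n_wales = df_data[df_data["Geography"] == "England and Wales"].copy()
wales = df_data[df_data["Geography"] == "Wales"].copy()

# Row labels are 0, 1 on one side and 2, 3 on the other, so nothing lines up
# and every entry of the result is NaN.
misaligned = engl_n_wales["Population"] - wales["Population"]
print(misaligned)

# Aligning explicitly on Time gives the intended subtraction.
england = (engl_n_wales.set_index("Time")["Population"]
           - wales.set_index("Time")["Population"])
print(england)
```

Subtracting `.values` (or resetting the index) works for the same reason: it removes the mismatched labels, but it relies on both subsets being in the same row order, whereas aligning on Time does not.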

pandas groupby include a column in final result

cast year revenue title
id
135397 Chris Pratt 2015 1.392446e+09 Jurassic World
135397 Bryce Dallas Howard 2015 1.392446e+09 Jurassic World
135397 Irrfan Khan 2015 1.392446e+09 Jurassic World
135397 Nick Robinson 2015 1.392446e+09 Jurassic World
Given the above DataFrame, I would like to find the highest earning actors per year (based on the combined revenue of movies they acted in that year). This is what I have so far :
#get the total revenue associated with each cast for each year
f ={'revenue':sum}
#revenue by year for each cast
df_actor_yr = df_actor_yr.groupby(['year', 'cast']).agg(f)
df_actor_yr
year cast
1960 Anthony Perkins 2.359350e+08
Charles Laughton 4.423780e+08
Fred MacMurray 1.843242e+08
Jack Kruschen 1.843242e+08
Jean Simmons 4.423780e+08
John Gavin 2.359350e+08
Kirk Douglas 4.423780e+08
Vera Miles 2.359350e+08
1961 Anthony Quayle 2.108215e+08
Anthony Quinn 2.108215e+08
Ben Wright 1.574815e+09
Betty Lou Gerson 1.574815e+09
...
Next to get the highest earning cast member for each year I did the following
df_actor_yr.reset_index(inplace=True)
g ={"revenue" : max }
df_actor_yr = df_actor_yr.groupby('year').agg(g)
df_actor_yr
revenue
year
1960 4.423780e+08
1961 1.574815e+09
1962 5.045914e+08
1963 5.617734e+08
1964 8.780804e+08
1965 1.129535e+09
1967 1.345551e+09
1968 4.187094e+08
1969 6.081511e+08
...
This only gives me the year and the maximum revenue for that year. I would also like to get the name of the cast member associated with that revenue. How do I go about doing this?
You can split your logic into 2 steps. First sum by cast and year using GroupBy + sum. Then find the maximum revenue by year using GroupBy + idxmax:
# sum by cast and year
df_summed = df.groupby(['cast', 'year'])['revenue'].sum().reset_index()
# maximums by year
res = df_summed.loc[df_summed.groupby('year')['revenue'].idxmax()]
print(res)
cast year revenue
3 NickRobinson 2012 3.401340e+09
0 BryceDallasHoward 2015 1.568978e+09
For the above output, I've used more interesting data:
id cast year revenue title
135397 ChrisPratt 2015 1.392446e+09 JurassicWorld
135397 BryceDallasHoward 2015 1.568978e+09 SomeMovie
135397 IrrfanKhan 2012 1.392446e+09 JurassicWorld
135397 NickRobinson 2012 1.046987e+09 JurassicWorld
135398 NickRobinson 2012 2.354353e+09 SomeOtherMovie
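For anyone who wants to run the two steps end to end, here is a self-contained sketch that rebuilds the sample rows above as a DataFrame. The construction of df is an assumption about how the data would be loaded; the two groupby steps are the ones from the answer.

```python
import pandas as pd

# The "more interesting" sample rows, typed in by hand.
df = pd.DataFrame({
    "id": [135397, 135397, 135397, 135397, 135398],
    "cast": ["ChrisPratt", "BryceDallasHoward", "IrrfanKhan",
             "NickRobinson", "NickRobinson"],
    "year": [2015, 2015, 2012, 2012, 2012],
    "revenue": [1.392446e9, 1.568978e9, 1.392446e9, 1.046987e9, 2.354353e9],
    "title": ["JurassicWorld", "SomeMovie", "JurassicWorld",
              "JurassicWorld", "SomeOtherMovie"],
})

# Step 1: total revenue per cast member per year.
df_summed = df.groupby(["cast", "year"])["revenue"].sum().reset_index()

# Step 2: idxmax returns the row label of the largest total in each year;
# .loc then pulls the full rows, so the cast column comes along.
res = df_summed.loc[df_summed.groupby("year")["revenue"].idxmax()]
print(res)
```

Note that idxmax returns the first occurrence when two cast members tie for a year, so only one of them is reported.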

Consolidating data based on two conditions

I have four columns of data that I am trying to consolidate based on two conditions. The data are formatted as follows:
CountyName Year Oil Gas
ANDERSON 2010 1358 0
ANDERSON 2010 621746 4996766
ANDERSON 2011 1587 0
ANDERSON 2011 633120 5020877
ANDERSON 2012 55992 387685
ANDERSON 2012 1342 0
ANDERSON 2013 635572 3036578
ANDERSON 2013 4873 0
ANDERSON 2014 656440 2690333
ANDERSON 2014 12332 0
ANDERSON 2015 608454 2836272
ANDERSON 2015 23339 0
ANDERSON 2016 551728 2682261
ANDERSON 2016 12716 0
ANDERSON 2017 132466 567874
ANDERSON 2017 1709 0
ANDREWS 2010 25701725 1860063
ANDREWS 2010 106351 0
ANDREWS 2011 97772 0
ANDREWS 2011 28818329 1377865
ANDREWS 2012 105062 0
...
I'm interested in combining the respective oil and then gas values for entries that are repeated. For example, I'd like to add all the oil entries for Anderson County for the year 2010 and have that value replace the existing entries in just one row. The code I am using now is summing all the values in the respective county regardless of year, giving me a condensed output like this:
CountyName Year Oil Gas
ANDERSON 3954774
ANDREWS 206472698
...
Here's the code I am using:
import csv
with open('Texas.csv', 'r') as Texas: #opening Texas csv file
    TexasReader = csv.reader(Texas)
    counties = {}
    years = {}
    index = 0 and 1
    for row in TexasReader:
        if index == 0 and 1:
            header = row
        else:
            county = row[0]
            year = row[1]
            oil = row[2]
            gas = row[3]
            if county in counties:
                counties[county] += int(oil)
            else:
                counties[county] = int(oil)
        index += 1
with open('TexasConsolidated.csv', 'w') as csvfile:
    writer = csv.DictWriter(csvfile, fieldnames=header, delimiter=',', lineterminator='\n')
    writer.writeheader()
    for k, v in counties.items():
        writer.writerow({header[0]: k, header[2]: v})
These are the lines doing what you complain of. The dict is keyed by county alone, so every year is added to the same total:
if county in counties:
    counties[county] += int(oil)
If you want a dict that stores sums over two keys then both values need to be in the dict key.
Add the line
counties_years = {}
then sum like this, using the tuple (county,year) as the key:
if (county,year) in counties_years:
    counties_years[(county,year)] += int(oil)
else:
    counties_years[(county,year)] = int(oil)
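Putting the tuple key to work for both columns, a complete pass might look like the sketch below. It is written under assumptions: an in-memory string stands in for Texas.csv, the file is taken to be comma-separated with the header shown in the question, and the names sample, totals and out are hypothetical.

```python
import csv
import io

# Hypothetical sample standing in for Texas.csv.
sample = """CountyName,Year,Oil,Gas
ANDERSON,2010,1358,0
ANDERSON,2010,621746,4996766
ANDERSON,2011,1587,0
ANDERSON,2011,633120,5020877
"""

totals = {}  # (county, year) -> [oil_total, gas_total]
reader = csv.DictReader(io.StringIO(sample))  # DictReader consumes the header row
for row in reader:
    key = (row["CountyName"], row["Year"])
    sums = totals.setdefault(key, [0, 0])
    sums[0] += int(row["Oil"])
    sums[1] += int(row["Gas"])

# Write one consolidated row per (county, year) pair.
out = io.StringIO()
writer = csv.writer(out, lineterminator="\n")
writer.writerow(["CountyName", "Year", "Oil", "Gas"])
for (county, year), (oil, gas) in totals.items():
    writer.writerow([county, year, oil, gas])

print(out.getvalue())
```

With real files, io.StringIO(sample) and out would be replaced by open('Texas.csv', newline='') and open('TexasConsolidated.csv', 'w', newline='').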

Pandas Split Column String and Plot unique values

I have a dataframe Df that looks like this:
Country Year
0 Australia, USA 2015
1 USA, Hong Kong, UK 1982
2 USA 2012
3 USA 1994
4 USA, France 2013
5 Japan 1988
6 Japan 1997
7 USA 2013
8 Mexico 2000
9 USA, UK 2005
10 USA 2012
11 USA, UK 2014
12 USA 1980
13 USA 1992
14 USA 1997
15 USA 2003
16 USA 2004
17 USA 2007
18 USA, Germany 2009
19 Japan 2006
20 Japan 1995
I want to make a bar chart for the Country column. If I try this
Df.Country.value_counts().plot(kind='bar')
I get this plot
which is incorrect because it doesn't separate the countries. My goal is to obtain a bar chart that plots the count of each country in the column, but to achieve that, first I have to somehow split the string in each row (if needed) and then plot the data. I know I can use Df.Country.str.split(', ') to split the strings, but if I do this I can't plot the data.
Does anyone have an idea how to solve this problem?
You could use the vectorized Series.str.split method to split the countries:
In [163]: df['Country'].str.split(r',\s+', expand=True)
Out[163]:
0 1 2
0 Australia USA None
1 USA Hong Kong UK
2 USA None None
3 USA None None
4 USA France None
...
If you stack this DataFrame to move all the values into a single column, then you can apply value_counts and plot as before:
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame(
{'Country': ['Australia, USA', 'USA, Hong Kong, UK', 'USA', 'USA', 'USA, France', 'Japan', 'Japan', 'USA', 'Mexico', 'USA, UK', 'USA', 'USA, UK', 'USA', 'USA', 'USA', 'USA', 'USA', 'USA', 'USA, Germany', 'Japan', 'Japan'],
'Year': [2015, 1982, 2012, 1994, 2013, 1988, 1997, 2013, 2000, 2005, 2012, 2014, 1980, 1992, 1997, 2003, 2004, 2007, 2009, 2006, 1995]})
counts = df['Country'].str.split(r',\s+', expand=True).stack().value_counts()
counts.plot(kind='bar')
plt.show()
from collections import Counter
# split on ', ' (comma plus space) so that ' USA' and 'USA' are not counted separately
c = pd.Series(Counter(df.Country.str.split(', ').sum()))
c.plot(kind='bar', title='Country Count')
new_df = pd.concat([pd.Series(row['Year'], row['Country'].split(', ')) for _, row in DF.iterrows()]).reset_index()
(DF is your old DataFrame.)
This will give you one data point for each country name.
Hope this helps.
Cheers!
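Another way to get one row per country is Series.explode, which is available in pandas 0.25 and later (that version requirement is an assumption about your environment). This is only a sketch on the question's data; the plotting call is left as a comment so the counting part runs without matplotlib.

```python
import pandas as pd

df = pd.DataFrame({
    "Country": ["Australia, USA", "USA, Hong Kong, UK", "USA", "USA",
                "USA, France", "Japan", "Japan", "USA", "Mexico", "USA, UK",
                "USA", "USA, UK", "USA", "USA", "USA", "USA", "USA", "USA",
                "USA, Germany", "Japan", "Japan"],
})

counts = (
    df["Country"]
    .str.split(",")    # each cell becomes a list of country names
    .explode()         # one row per list element
    .str.strip()       # drop the space left after each comma
    .value_counts()
)
print(counts)

# counts.plot(kind="bar") would draw the bar chart, as in the answers above.
```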

Find key from value for Pandas Series

I have a dictionary whose values are in a pandas series. I want to make a new series that will look up a value in a series and return a new series with associated key. Example:
import pandas as pd
df = pd.DataFrame({'season' : ['Nor 2014', 'Nor 2013', 'Nor 2013', 'Norv 2013',
'Swe 2014', 'Swe 2014', 'Swe 2013',
'Swe 2013', 'Sven 2013', 'Sven 2013', 'Norv 2014']})
nmdict = {'Norway' : [s for s in list(set(df.season)) if 'No' in s],
'Sweden' : [s for s in list(set(df.season)) if 'S' in s]}
Desired result with df['country'] as the new column name:
season country
0 Nor 2014 Norway
1 Nor 2013 Norway
2 Nor 2013 Norway
3 Norv 2013 Norway
4 Swe 2014 Sweden
5 Swe 2014 Sweden
6 Swe 2013 Sweden
7 Swe 2013 Sweden
8 Sven 2013 Sweden
9 Sven 2013 Sweden
10 Norv 2014 Norway
Due to the nature of my data I must manually make the nmdict as shown. I've tried this but couldn't reverse my nmdict, as the lists are not the same length.
More importantly, I think my approach may be wrong. I'm coming from Excel and thinking of a vlookup solution, but according to this answer, I shouldn't be using the dictionary in this way.
Any answers appreciated.
I've done it in a verbose manner to allow you to follow through.
First, let's define a function that determines the value 'country'
In [4]: def get_country(s):
   ...:     if 'Nor' in s:
   ...:         return 'Norway'
   ...:     if 'S' in s:
   ...:         return 'Sweden'
   ...:     # return 'Default Country' # if you get unmatched values
In [5]: get_country('Sven')
Out[5]: 'Sweden'
In [6]: get_country('Norv')
Out[6]: 'Norway'
We can use map to run get_country on every row (in Python 3, wrap the call in list(), because map returns a lazy iterator rather than a list). Pandas also has an apply() which works similarly*.
In [7]: list(map(get_country, df['season']))
Out[7]:
['Norway',
'Norway',
'Norway',
'Norway',
'Sweden',
'Sweden',
'Sweden',
'Sweden',
'Sweden',
'Sweden',
'Norway']
Now we assign that result to the column called 'country'
In [8]: df['country'] = list(map(get_country, df['season']))
Let's view the final result:
In [9]: df
Out[9]:
season country
0 Nor 2014 Norway
1 Nor 2013 Norway
2 Nor 2013 Norway
3 Norv 2013 Norway
4 Swe 2014 Sweden
5 Swe 2014 Sweden
6 Swe 2013 Sweden
7 Swe 2013 Sweden
8 Sven 2013 Sweden
9 Sven 2013 Sweden
10 Norv 2014 Norway
*With apply() here's how it would look:
In [16]: df['country'] = df['season'].apply(get_country)
In [17]: df
Out[17]:
season country
0 Nor 2014 Norway
1 Nor 2013 Norway
2 Nor 2013 Norway
3 Norv 2013 Norway
4 Swe 2014 Sweden
5 Swe 2014 Sweden
6 Swe 2013 Sweden
7 Swe 2013 Sweden
8 Sven 2013 Sweden
9 Sven 2013 Sweden
10 Norv 2014 Norway
A more scalable country matcher
pseudo-code only :)
# Modify this as needed
country_matchers = {
    'Norway': ['Nor', 'Norv'],
    'Sweden': ['S', 'Swed'],
}
def get_country(s):
    """
    Run the passed string s against "matchers" for each country
    Return the first matched country
    """
    for country, matchers in country_matchers.items():
        for matcher in matchers:
            if matcher in s:
                return country
IIUC, I would do the following:
df['country'] = df['season'].apply(lambda x: 'Norway' if 'No' in x else 'Sweden' if 'S' in x else x)
You could create the country dictionary using a dictionary comprehension:
country_id = df.season.str.split().str.get(0).drop_duplicates()
country_dict = {c: ('Norway' if c.startswith('N') else 'Sweden') for c in country_id.values}
to get:
{'Nor': 'Norway', 'Swe': 'Sweden', 'Sven': 'Sweden', 'Norv': 'Norway'}
This works fine for two countries, otherwise you can apply a self-defined function in similar way:
def country_dict(country_id):
    if country_id.startswith('S'):
        return 'Sweden'
    elif country_id.startswith('N'):
        return 'Norway'
    elif country_id.startswith('XX'):
        return ...
    else:
        return 'default'
Either way, map the dictionary to the country_id part of the season column, extracted using pandas string methods:
df['country'] = df.season.str.split().str.get(0).map(country_dict)
season country
0 Nor 2014 Norway
1 Nor 2013 Norway
2 Nor 2013 Norway
3 Norv 2013 Norway
4 Swe 2014 Sweden
5 Swe 2014 Sweden
6 Swe 2013 Sweden
7 Swe 2013 Sweden
8 Sven 2013 Sweden
9 Sven 2013 Sweden
10 Norv 2014 Norway
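If the hand-built nmdict from the question has to stay as it is, it can still be reversed even though its lists differ in length: flatten it so that every season string points back to its key, then map the column through that flat dictionary. A minimal sketch, assuming each season string appears under only one country (season_to_country is an illustrative name):

```python
import pandas as pd

df = pd.DataFrame({"season": ["Nor 2014", "Nor 2013", "Nor 2013", "Norv 2013",
                              "Swe 2014", "Swe 2014", "Swe 2013",
                              "Swe 2013", "Sven 2013", "Sven 2013", "Norv 2014"]})

nmdict = {"Norway": [s for s in set(df.season) if "No" in s],
          "Sweden": [s for s in set(df.season) if "S" in s]}

# Invert the dictionary: one entry per season string, pointing to its country.
season_to_country = {season: country
                     for country, seasons in nmdict.items()
                     for season in seasons}

# Series.map with a dict is the closest equivalent of a vlookup.
df["country"] = df["season"].map(season_to_country)
print(df)
```

Any season missing from nmdict would come back as NaN, which makes unmatched values easy to spot.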
