Is there a way to group plots based on matching row values? - python

I have a data frame like shown below.
Country Type 2011 2012 2013
Afghanistan Estimate -1.63 -1.57 -1.41
Afghanistan Sources 5 8 7
Afghanistan Percentile 0.95 0.94 2.36
.
.
.
Zambia Estimate 1.63 1.57 1.41
Zambia Sources 7 10 8
Zambia Percentile 0.88 0.77 1.54
I am hoping to generate plots (preferably line graphs) for each country (Type will be used as legend). Is there a way to group plots for each country? I am relatively new and don't know where to begin.

I'm afraid you can't get away without at least some transformations.
If it's OK to use Seaborn for plotting, it could look something like this:
import pandas as pd
import seaborn as sns
from io import StringIO
df = pd.read_csv(StringIO('''
Country,Type,2011,2012,2013
Afghanistan,Estimate,-1.63,-1.57,-1.41
Afghanistan,Sources,5,8,7
Afghanistan,Percentile,0.95,0.94,2.36
Zambia,Estimate,1.63,1.57,1.41
Zambia,Sources,7,10,8
Zambia,Percentile,0.88,0.77,1.54
'''), dtype={'Country' : 'string',
             'Type' : 'string',
             '2011' : 'float',
             '2012' : 'float',
             '2013' : 'float'})
# Country Type 2011 2012 2013
# 0 Afghanistan Estimate -1.63 -1.57 -1.41
# 1 Afghanistan Sources 5.00 8.00 7.00
# 2 Afghanistan Percentile 0.95 0.94 2.36
# 3 Zambia Estimate 1.63 1.57 1.41
# 4 Zambia Sources 7.00 10.00 8.00
# 5 Zambia Percentile 0.88 0.77 1.54
# transform to long format
df = df.melt(id_vars=['Country', 'Type'],
             value_vars=['2011','2012','2013'],
             var_name='Year')
# df after melt:
# Country Type Year value
# 0 Afghanistan Estimate 2011 -1.63
# 1 Afghanistan Sources 2011 5.00
# 2 Afghanistan Percentile 2011 0.95
# 3 Zambia Estimate 2011 1.63
# 4 Zambia Sources 2011 7.00
# 5 Zambia Percentile 2011 0.88
# 6 Afghanistan Estimate 2012 -1.57
# 7 Afghanistan Sources 2012 8.00
# 8 Afghanistan Percentile 2012 0.94
# 9 Zambia Estimate 2012 1.57
# 10 Zambia Sources 2012 10.00
# 11 Zambia Percentile 2012 0.77
# 12 Afghanistan Estimate 2013 -1.41
# 13 Afghanistan Sources 2013 7.00
# 14 Afghanistan Percentile 2013 2.36
# 15 Zambia Estimate 2013 1.41
# 16 Zambia Sources 2013 8.00
# 17 Zambia Percentile 2013 1.54
sns.relplot(data=df, kind='line', x='Year',
            y='value', hue='Type', col="Country")
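If Seaborn is not an option, the same grouping can be sketched with pandas alone. This is only an illustration: the `tables` dictionary is a name introduced here, and the data is the six sample rows from the question. It builds one Year-by-Type table per country.

```python
import pandas as pd

# Sample rows copied from the question.
df = pd.DataFrame({
    "Country": ["Afghanistan"] * 3 + ["Zambia"] * 3,
    "Type": ["Estimate", "Sources", "Percentile"] * 2,
    "2011": [-1.63, 5, 0.95, 1.63, 7, 0.88],
    "2012": [-1.57, 8, 0.94, 1.57, 10, 0.77],
    "2013": [-1.41, 7, 2.36, 1.41, 8, 1.54],
})

# One table per country: rows are years, columns are the Type values.
tables = {
    country: (group.drop(columns="Country")
                   .set_index("Type")
                   .T
                   .rename_axis(index="Year"))
    for country, group in df.groupby("Country")
}

print(tables["Zambia"])
```

Calling `.plot()` on each table (this needs matplotlib installed) should then draw one line per Type, with the Type names in the legend.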

Related

<bound method NDFrame.head error on Jupyter notebook

I am getting this error <bound method NDFrame.head of .
It is not showing my data frame properly, what should I do?
My code is basic, here it is:
import pandas as pd
df = pd.read_csv("/Users/shloak/Desktop/Pandas/Avacado/avocado.csv")
albany_df = df[ df['region'] == "Albany"]
albany_df.head
This is my output
<bound method NDFrame.head of Unnamed: 0 Date AveragePrice Total Volume 4046 4225 \
0 0 2015-12-27 1.33 64236.62 1036.74 54454.85
1 1 2015-12-20 1.35 54876.98 674.28 44638.81
2 2 2015-12-13 0.93 118220.22 794.70 109149.67
3 3 2015-12-06 1.08 78992.15 1132.00 71976.41
4 4 2015-11-29 1.28 51039.60 941.48 43838.39
... ... ... ... ... ... ...
17608 7 2018-02-04 1.52 4124.96 118.38 420.36
17609 8 2018-01-28 1.32 6987.56 433.66 374.96
17610 9 2018-01-21 1.54 3346.54 14.67 253.01
17611 10 2018-01-14 1.47 4140.95 7.30 301.87
17612 11 2018-01-07 1.54 4816.90 43.51 412.17
4770 Total Bags Small Bags Large Bags XLarge Bags type \
0 48.16 8696.87 8603.62 93.25 0.0 conventional
1 58.33 9505.56 9408.07 97.49 0.0 conventional
2 130.50 8145.35 8042.21 103.14 0.0 conventional
3 72.58 5811.16 5677.40 133.76 0.0 conventional
4 75.78 6183.95 5986.26 197.69 0.0 conventional
... ... ... ... ... ... ...
17608 0.00 3586.22 3586.22 0.00 0.0 organic
17609 0.00 6178.94 6178.94 0.00 0.0 organic
17610 0.00 3078.86 3078.86 0.00 0.0 organic
17611 0.00 3831.78 3831.78 0.00 0.0 organic
17612 0.00 4361.22 4357.89 3.33 0.0 organic
year region
0 2015 Albany
1 2015 Albany
2 2015 Albany
3 2015 Albany
4 2015 Albany
... ... ...
17608 2018 Albany
17609 2018 Albany
17610 2018 Albany
17611 2018 Albany
17612 2018 Albany
[338 rows x 14 columns]>
What is the reason for this? I have Python 3.9 and Pandas 1.1.3
head is a method, so you need to call it, like this: albany_df.head().
What you see is not an error: without the parentheses, the notebook displays the method object itself instead of the result of calling it.
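A minimal sketch of the difference, using a tiny illustrative frame rather than the avocado CSV (the rows and the names `method` and `preview` are assumptions made for this example):

```python
import pandas as pd

# Illustrative rows only; not the real avocado data.
df = pd.DataFrame({"region": ["Albany", "Albany", "Boston"],
                   "AveragePrice": [1.33, 1.35, 1.10]})
albany_df = df[df["region"] == "Albany"]

method = albany_df.head       # no parentheses: the bound method object
preview = albany_df.head()    # parentheses: a DataFrame with the first rows

print(callable(method))
print(type(preview).__name__)
```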

Merging csv files with some columns same and others different in python

I am new to coding and I'm having an issue merging csv files. I have searched similar questions and haven't found a fix. Just to include some relevant details:
CSV files are cancer types over the period of 1950 - 2017 for different countries (lung cancer, colorectal cancer, stomach cancer, liver cancer and breast cancer)
Below is an example of the layout of lung cancer.
dlung.describe(include='all')
dlung
Year Cancer Country Gender ASR SE
0 1950 Lung Australia Male 13.89 0.56
1 1951 Lung Australia Male 14.84 0.57
2 1952 Lung Australia Male 17.19 0.61
3 1953 Lung Australia Male 18.21 0.62
4 1954 Lung Australia Male 19.05 0.63
5 1955 Lung Australia Male 20.65 0.65
6 1956 Lung Australia Male 22.05 0.67
7 1957 Lung Australia Male 23.93 0.69
8 1958 Lung Australia Male 23.77 0.68
9 1959 Lung Australia Male 26.12 0.71
10 1960 Lung Australia Male 27.08 0.72
I am interested in joining all cancer types into one dataframe based on shared columns (year, country).
I have tried different methods, but they all seem to duplicate Year and Country (as below)
This one wasn't bad, but I have two columns for year and country
df_lung_colorectal = pd.concat([dlung, dcolorectal], axis = 1)
df_lung_colorectal
Year Cancer Country Gender ASR SE Year Cancer Country Gender ASR SE
If I continue like this, I will end up with 5 identical columns for YEAR and 5 for COUNTRY.
Any ideas on how to merge all values that are independent (Cancer type and associated ASR (standardized risk), and SE values) with only one column for YEAR, COUNTRY (and GENDER) if possible?
Yes, it is possible if you use DataFrame.set_index, but then the other column names are duplicated:
print (dlung)
Year Cancer Country Gender ASR SE
0 1950 Lung Australia Male 13.89 0.56
1 1951 Lung Australia Male 14.84 0.57
2 1952 Lung Australia Male 17.19 0.61
3 1953 Lung Australia Male 18.21 0.62
4 1954 Lung Australia Male 19.05 0.63
print (dcolorectal)
Year Cancer Country Gender ASR SE
6 1950 colorectal Australia Male 22.05 0.67
7 1951 colorectal Australia Male 23.93 0.69
8 1952 colorectal Australia Male 23.77 0.68
9 1953 colorectal Australia Male 26.12 0.71
10 1954 colorectal Australia Male 27.08 0.72
df_lung_colorectal = pd.concat([dlung.set_index(['Year','Country','Gender']),
                                dcolorectal.set_index(['Year','Country','Gender'])], axis = 1)
print (df_lung_colorectal)
Cancer ASR SE Cancer ASR SE
Year Country Gender
1950 Australia Male Lung 13.89 0.56 colorectal 22.05 0.67
1951 Australia Male Lung 14.84 0.57 colorectal 23.93 0.69
1952 Australia Male Lung 17.19 0.61 colorectal 23.77 0.68
1953 Australia Male Lung 18.21 0.62 colorectal 26.12 0.71
1954 Australia Male Lung 19.05 0.63 colorectal 27.08 0.72
But I think it is better to first concat all the DataFrames together with axis=0 (the default value, so it can be omitted) and then reshape with DataFrame.set_index and DataFrame.unstack:
df = pd.concat([dlung, dcolorectal]).set_index(['Year','Country','Gender','Cancer']).unstack()
df.columns = df.columns.map('_'.join)
df = df.reset_index()
print (df)
Year Country Gender ASR_Lung ASR_colorectal SE_Lung SE_colorectal
0 1950 Australia Male 13.89 22.05 0.56 0.67
1 1951 Australia Male 14.84 23.93 0.57 0.69
2 1952 Australia Male 17.19 23.77 0.61 0.68
3 1953 Australia Male 18.21 26.12 0.62 0.71
4 1954 Australia Male 19.05 27.08 0.63 0.72
Concat with axis=0 to stack the frames row-wise.
With axis=1 you are asking it to concat them side by side.
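To make the two directions concrete, here is a small sketch with two rows per cancer type taken from the frames above (the names `stacked` and `side_by_side` are introduced here for illustration):

```python
import pandas as pd

dlung = pd.DataFrame({"Year": [1950, 1951], "Cancer": "Lung",
                      "Country": "Australia", "Gender": "Male",
                      "ASR": [13.89, 14.84], "SE": [0.56, 0.57]})
dcolorectal = pd.DataFrame({"Year": [1950, 1951], "Cancer": "colorectal",
                            "Country": "Australia", "Gender": "Male",
                            "ASR": [22.05, 23.93], "SE": [0.67, 0.69]})

# axis=0 (the default): rows are appended, columns stay the same.
stacked = pd.concat([dlung, dcolorectal], axis=0, ignore_index=True)

# axis=1: frames are placed next to each other, so every column appears twice.
side_by_side = pd.concat([dlung, dcolorectal], axis=1)

print(stacked.shape)
print(side_by_side.shape)
```

The row-wise result keeps a single Year, Country and Gender column, which is why it is the more convenient starting point for the set_index and unstack reshape.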

Sorting values in a dataframe

I've been trying to sort the values in a dataframe that I've been given to work on.
The following is my dataframe.
1981 1.78
1982 1.74
1983 1.61
1984 1.62
1985 1.61
1986 1.43
1987 1.62
1988 1.96
1989 1.75
1990 1.83
1991 1.73
1992 1.72
1993 1.74
1994 1.71
1995 1.67
1996 1.66
1997 1.61
1998 1.48
1999 1.47
2000 1.6
2001 1.41
2002 1.37
2003 1.27
2004 1.26
2005 1.26
2006 1.28
2007 1.29
2008 1.28
2009 1.22
2010 1.15
2011 1.2
2012 1.29
2013 1.19
2014 1.25
2015 1.24
2016 1.2
2017 1.16
2018 1.14
I've been trying to sort my dataframe in descending order such that the highest values on the right would appear first. However whenever I try to sort it, it would only sort based on the year which are the values on the left.
dataframe.sort_values('1')
I've tried using sort_values and indicating '1' as the column that I want sorted. This however returns ValueError: No axis named 1 for object type <class 'pandas.core.series.Series'>
From the error that the OP mentioned, the data structure is a Series, so sort_values should be called on it directly, without a column name:
s = s.sort_values(ascending=False)
The error was raised because in pandas.Series.sort_values the first argument is axis.
For a DataFrame, the argument of sort_values() should be a column name:
df=df.sort_values("col2")
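Both cases can be sketched side by side. The snippet below uses four of the values from the question; the column names "year" and "rate" are hypothetical, chosen only to turn the Series into a DataFrame for comparison.

```python
import pandas as pd

# A few values from the question, indexed by year.
s = pd.Series([1.78, 1.74, 1.61, 1.96], index=[1981, 1982, 1983, 1988])

# Series: no column name is needed, only the sort direction.
by_value = s.sort_values(ascending=False)

# DataFrame: name the column to sort by ("rate" is a made-up name).
frame = s.rename("rate").rename_axis("year").reset_index()
frame_sorted = frame.sort_values("rate", ascending=False)

print(by_value)
print(frame_sorted)
```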

Loop only takes last value

I have a dataFrame with country-specific population for each year and a pandas Series with the world population for each year.
This is the Series I am using:
pop_tot = df3.groupby('Year')['population'].sum()
Year
1990 4.575442e+09
1991 4.659075e+09
1992 4.699921e+09
1993 4.795129e+09
1994 4.862547e+09
1995 4.949902e+09
... ...
2017 6.837429e+09
and this is the DataFrame I am using
Country Year HDI population
0 Afghanistan 1990 NaN 1.22491e+07
1 Albania 1990 0.645 3.28654e+06
2 Algeria 1990 0.577 2.59124e+07
3 Andorra 1990 NaN 54509
4 Angola 1990 NaN 1.21714e+07
... ... ... ... ...
4096 Uzbekistan 2017 0.71 3.23872e+07
4097 Vanuatu 2017 0.603 276244
4098 Zambia 2017 0.588 1.70941e+07
4099 Zimbabwe 2017 0.535 1.65299e+07
I want to calculate the proportion of the world's population that the population of that country represents for each year, so I loop over the Series and the DataFrame as follows:
j = 0
for i in range(len(df3)):
    if df3.iloc[i,1]==pop_tot.index[j]:
        df3['pop_tot']=pop_tot[j] #Sanity check
        df3['weighted']=df3['population']/pop_tot[j]*df3.iloc[i,2]
    else:
        j=j+1
However, the DataFrame that I get in return is not the expected one. I end up dividing all the values by the total population of 2017, thus giving me proportions which are not the correct ones for that year (i.e. for these first rows, pop_tot should be 4.575442e+09 as it corresponds to 1990 according to the Series above and not 6.837429e+09 which corresponds to 2017).
Country Year HDI population pop_tot weighted
0 Albania 1990 0.645 3.28654e+06 6.837429e+09 0.000257158
1 Algeria 1990 0.577 2.59124e+07 6.837429e+09 0.00202753
2 Argentina 1990 0.704 3.27297e+07 6.837429e+09 0.00256096
I can't see however what's the mistake in the loop.
Thanks in advance.
You don't need a loop: you can use groupby.transform to create the column pop_tot in df3 directly. Then, for the column weighted, just do a column operation, such as:
df3['pop_tot'] = df3.groupby('Year')['population'].transform('sum')
df3['weighted'] = df3['population']/df3['pop_tot']
As @roganjosh pointed out, the problem with your method is that you replace the whole columns pop_tot and weighted every time the if condition is met. At the last iteration where the condition is met, the year probably being 2017, the column pop_tot is set to the 2017 value and weighted is calculated with that value as well.
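Since the question already has the yearly totals in a Series, another option is to look each row's year up in that Series. The sketch below uses made-up countries and populations (an assumption for illustration) and checks that mapping against the Series gives the same column as groupby.transform.

```python
import pandas as pd

# Made-up numbers: two countries over two years.
df3 = pd.DataFrame({
    "Country": ["A", "B", "A", "B"],
    "Year": [1990, 1990, 1991, 1991],
    "population": [1.0, 3.0, 2.0, 6.0],
})

# Yearly totals as a Series indexed by Year, as in the question.
pop_tot = df3.groupby("Year")["population"].sum()

# Look up each row's year in the Series instead of looping.
df3["pop_tot"] = df3["Year"].map(pop_tot)
df3["weighted"] = df3["population"] / df3["pop_tot"]

# Same totals via transform, for comparison.
via_transform = df3.groupby("Year")["population"].transform("sum")

print(df3)
```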
You don't have to loop; it's slower and can make things really complex quite fast. Use the vectorized solutions from pandas and NumPy, like this for example:
df['pop_tot'] = df.population.sum()
df['weighted'] = df.population / df.population.sum()
print(df)
Country Year HDI population pop_tot weighted
0 Afghanistan 1990 NaN 12249100.0 53673949.0 0.228213
1 Albania 1990 0.645 3286540.0 53673949.0 0.061232
2 Algeria 1990 0.577 25912400.0 53673949.0 0.482774
3 Andorra 1990 NaN 54509.0 53673949.0 0.001016
4 Angola 1990 NaN 12171400.0 53673949.0 0.226766
Edit after OP's comment
df['pop_tot'] = df.groupby('Year').population.transform('sum')
df['weighted'] = df.population / df['pop_tot']
print(df)
Country Year HDI population pop_tot weighted
0 Afghanistan 1990 NaN 12249100.0 53673949.0 0.228213
1 Albania 1990 0.645 3286540.0 53673949.0 0.061232
2 Algeria 1990 0.577 25912400.0 53673949.0 0.482774
3 Andorra 1990 NaN 54509.0 53673949.0 0.001016
4 Angola 1990 NaN 12171400.0 53673949.0 0.226766
Note: I used the small dataset you gave as an example:
Country Year HDI population
0 Afghanistan 1990 NaN 12249100.0
1 Albania 1990 0.645 3286540.0
2 Algeria 1990 0.577 25912400.0
3 Andorra 1990 NaN 54509.0
4 Angola 1990 NaN 12171400.0

Pandas DataFrame from WB WDI data: combine year columns into "year" variable and merge rows

I have a dataset (.tsv file) with the following columns. (It's the World Bank's new WDI all-in all-time single-download dataset. Nice!)
country countrycode varname 1960 1961 1962
afghanistan AFG GDP 5.6 5.7 5.8
afghanistan AFG Gini .77 .78 .75
afghanistan AFG educ 8.1 8.2 8.3
afghanistan AFG pop 888 889 890
albania ALB GDP 6.6 6.7 6.8
albania ALB Gini .45 .46 .47
albania ALB educ 6.2 6.3 6.4
albania ALB pop 777 778 779
I need a pandas DataFrame with ['GDP','Gini','educ','pop'] as columns, along with ['country', 'countrycode', 'year']. So the values for "year" are currently columns!
And I'd like there to be only one row for each country-year combination.
For instance, the columns and first row would be
country countrycode year GDP Gini educ pop
afghanistan AFG 1960 5.6 .77 8.1 888
This seems like some complex pivot or opposite-of-"melt", but I cannot figure it out.
In [59]: df
Out[59]:
country countrycode varname 1960 1961 1962
0 afghanistan AFG GDP 5.60 5.70 5.80
1 afghanistan AFG Gini 0.77 0.78 0.75
2 afghanistan AFG educ 8.10 8.20 8.30
3 afghanistan AFG pop 888.00 889.00 890.00
4 albania ALB GDP 6.60 6.70 6.80
5 albania ALB Gini 0.45 0.46 0.47
6 albania ALB educ 6.20 6.30 6.40
7 albania ALB pop 777.00 778.00 779.00
In [60]: df = df.set_index(['country', 'countrycode', 'varname'])
In [61]: df.columns.name = 'year'
In [62]: df.stack().unstack('varname')
Out[62]:
varname GDP Gini educ pop
country countrycode year
afghanistan AFG 1960 5.6 0.77 8.1 888
1961 5.7 0.78 8.2 889
1962 5.8 0.75 8.3 890
albania ALB 1960 6.6 0.45 6.2 777
1961 6.7 0.46 6.3 778
1962 6.8 0.47 6.4 779
The latter is a frame with a MultiIndex; you can call reset_index to move the MultiIndex levels into regular columns.
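Put together as a plain script, the reshape plus reset_index might look like the sketch below. It uses only the Afghanistan rows and two of the years from the question; the names `wide` and `tidy` are introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "country": ["afghanistan"] * 4,
    "countrycode": ["AFG"] * 4,
    "varname": ["GDP", "Gini", "educ", "pop"],
    "1960": [5.6, 0.77, 8.1, 888],
    "1961": [5.7, 0.78, 8.2, 889],
})

wide = df.set_index(["country", "countrycode", "varname"])
wide.columns.name = "year"

# stack() moves the year columns into the index; unstack('varname')
# spreads the variables back out as columns.
tidy = wide.stack().unstack("varname").reset_index()
tidy.columns.name = None  # drop the leftover 'varname' label on the columns

print(tidy)
```

Note that the year values stay strings here because they started out as column labels; convert them with astype(int) if numeric years are needed.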
Group your DataFrame by country and countrycode and then apply your own function:
In [13]: def f(df):
   ....:     del df['country']
   ....:     del df['countrycode']
   ....:     df = df.set_index('varname')
   ....:     df.index.name = None
   ....:     df = df.T
   ....:     df.index.name = 'year'
   ....:     return df
   ....:
In [14]: df.groupby(['country', 'countrycode']).apply(f).reset_index()
Out[14]:
country countrycode year GDP Gini educ pop
0 afghanistan AFG 1960 5.6 0.77 8.1 888
1 afghanistan AFG 1961 5.7 0.78 8.2 889
2 afghanistan AFG 1962 5.8 0.75 8.3 890
3 albania ALB 1960 6.6 0.45 6.2 777
4 albania ALB 1961 6.7 0.46 6.3 778
5 albania ALB 1962 6.8 0.47 6.4 779
I'm suggesting that @Wouter may put this in his (accepted) answer, as it uses the actual names from the WDI data, and makes it easier to cut and paste for someone else using them. Sorry -- I'm sure this isn't the right way to communicate this...
For any variables that you want to keep/use, just give them a name in this dict:
WDIconversions={"Year":'year',
"YearCode":'',
"Country Name":'country_name_wb',
"Country Code":'countryCode_ISO3_WB',
"Inflation, consumer prices (annual %)":'',
"Inflation, GDP deflator (annual %)":'',
"GDP per capita, PPP (constant 2005 international $)":'GDPpc',
"Firms with female participation in ownership (% of firms)":'',
"Investment in energy with private participation (current US$)":'',
"Investment in telecoms with private participation (current US$)":'',
"Investment in transport with private participation (current US$)":'',
"Investment in water and sanitation with private participation (current US$)":'',
"Labor participation rate, female (% of female population ages 15+)":'',
"Labor participation rate, male (% of male population ages 15+)":'',
"Labor participation rate, total (% of total population ages 15+)":'',
"Ratio of female to male labor participation rate (%)":'',
"Life expectancy at birth, female (years)":'',
"Life expectancy at birth, male (years)":'',
"Life expectancy at birth, total (years)":'lifeExpectancy',
"Population, total":'nat_pop',
"GINI index":'GiniWB',
} # etc etc etc
dfW=pd.read_table(WBDrawfile)
df = dfW.set_index(['Country Name','Country Code','Indicator Name'])
del df['Indicator Code']
df.columns.name = 'year'
df=df.stack().unstack('Indicator Name')
df=df[[kk for kk,ii in WDIconversions.items() if ii and kk in df]].reset_index().rename(columns=WDIconversions)
That results in:
df
<class 'pandas.core.frame.DataFrame'>
Int64Index: 12983 entries, 0 to 12982
Data columns:
country_name_wb 12983 non-null values
countryCode_ISO3_WB 12983 non-null values
year 12983 non-null values
GiniWB 845 non-null values
nat_pop 12601 non-null values
GDPpc 6292 non-null values
educPrimary 4949 non-null values
lifeExpectancy 11077 non-null values
dtypes: float64(5), object(3)
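The last line of that snippet combines a filter and a rename. A small sketch of the same idiom on a toy frame follows; the values and the short `conversions` dict are assumptions for illustration, not WDI data.

```python
import pandas as pd

# Empty string means "do not keep this column".
conversions = {
    "Country Name": "country_name_wb",
    "GINI index": "GiniWB",
    "Inflation, consumer prices (annual %)": "",
    "Population, total": "nat_pop",
}

# Toy frame with made-up values; "Population, total" is deliberately absent.
df = pd.DataFrame({
    "Country Name": ["Albania"],
    "GINI index": [0.45],
    "Inflation, consumer prices (annual %)": [2.0],
})

# Keep only columns that have a new name and actually exist, then rename.
keep = [old for old, new in conversions.items() if new and old in df.columns]
renamed = df[keep].rename(columns=conversions)

print(renamed.columns.tolist())
```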
