Loop only takes last value - python

I have a DataFrame with country-specific population for each year and a pandas Series with the world population for each year.
This is the Series I am using:
pop_tot = df3.groupby('Year')['population'].sum()
Year
1990 4.575442e+09
1991 4.659075e+09
1992 4.699921e+09
1993 4.795129e+09
1994 4.862547e+09
1995 4.949902e+09
... ...
2017 6.837429e+09
and this is the DataFrame I am using
Country Year HDI population
0 Afghanistan 1990 NaN 1.22491e+07
1 Albania 1990 0.645 3.28654e+06
2 Algeria 1990 0.577 2.59124e+07
3 Andorra 1990 NaN 54509
4 Angola 1990 NaN 1.21714e+07
... ... ... ... ...
4096 Uzbekistan 2017 0.71 3.23872e+07
4097 Vanuatu 2017 0.603 276244
4098 Zambia 2017 0.588 1.70941e+07
4099 Zimbabwe 2017 0.535 1.65299e+07
I want to calculate the proportion of the world's population that the population of that country represents for each year, so I loop over the Series and the DataFrame as follows:
j = 0
for i in range(len(df3)):
    if df3.iloc[i,1]==pop_tot.index[j]:
        df3['pop_tot']=pop_tot[j] #Sanity check
        df3['weighted']=df3['population']/pop_tot[j]*df3.iloc[i,2]
    else:
        j=j+1
However, the DataFrame I get back is not the expected one. Every value ends up divided by the total population of 2017, so the proportions are wrong for every other year (e.g. for these first rows, pop_tot should be 4.575442e+09, which corresponds to 1990 according to the Series above, and not 6.837429e+09, which corresponds to 2017).
Country Year HDI population pop_tot weighted
0 Albania 1990 0.645 3.28654e+06 6.837429e+09 0.000257158
1 Algeria 1990 0.577 2.59124e+07 6.837429e+09 0.00202753
2 Argentina 1990 0.704 3.27297e+07 6.837429e+09 0.00256096
However, I can't see what the mistake in the loop is.
Thanks in advance.

You don't need a loop. You can use groupby.transform to create the column pop_tot in df3 directly, then compute the column weighted with a plain column operation, such as:
df3['pop_tot'] = df3.groupby('Year')['population'].transform('sum')
df3['weighted'] = df3['population']/df3['pop_tot']
As @roganjosh pointed out, the problem with your method is that you replace the whole columns pop_tot and weighted every time the if condition is met. At the last iteration where the condition is met, the year probably being 2017, the entire pop_tot column is set to the 2017 value and weighted is calculated with that value as well.
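If you would rather reuse the pop_tot Series you already built, one possible alternative is to look up each row's year in it with Series.map. This is only a sketch on a made-up four-row frame (the countries "A" and "B" and their populations are invented for illustration), and it leaves out the HDI factor from the original loop:

```python
import pandas as pd

# Made-up data: two countries over two years.
df3 = pd.DataFrame({
    "Country": ["A", "B", "A", "B"],
    "Year": [1990, 1990, 1991, 1991],
    "population": [10.0, 30.0, 20.0, 60.0],
})

# World total per year, indexed by Year (same construction as in the question).
pop_tot = df3.groupby("Year")["population"].sum()

# Look up each row's own year in the Series instead of overwriting whole columns.
df3["pop_tot"] = df3["Year"].map(pop_tot)
df3["weighted"] = df3["population"] / df3["pop_tot"]
print(df3)
```

Each row now gets the total of its own year, which is what the loop was trying to do one row at a time.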

You don't have to loop; it's slower and can make things complex quite fast. Use the vectorized solutions from pandas and NumPy, like this for example:
df['pop_tot'] = df.population.sum()
df['weighted'] = df.population / df.population.sum()
print(df)
Country Year HDI population pop_tot weighted
0 Afghanistan 1990 NaN 12249100.0 53673949.0 0.228213
1 Albania 1990 0.645 3286540.0 53673949.0 0.061232
2 Algeria 1990 0.577 25912400.0 53673949.0 0.482774
3 Andorra 1990 NaN 54509.0 53673949.0 0.001016
4 Angola 1990 NaN 12171400.0 53673949.0 0.226766
Edit after OP's comment
df['pop_tot'] = df.groupby('Year').population.transform('sum')
df['weighted'] = df.population / df['pop_tot']
print(df)
Country Year HDI population pop_tot weighted
0 Afghanistan 1990 NaN 12249100.0 53673949.0 0.228213
1 Albania 1990 0.645 3286540.0 53673949.0 0.061232
2 Algeria 1990 0.577 25912400.0 53673949.0 0.482774
3 Andorra 1990 NaN 54509.0 53673949.0 0.001016
4 Angola 1990 NaN 12171400.0 53673949.0 0.226766
Note: I used the small dataset you gave as an example:
Country Year HDI population
0 Afghanistan 1990 NaN 12249100.0
1 Albania 1990 0.645 3286540.0
2 Algeria 1990 0.577 25912400.0
3 Andorra 1990 NaN 54509.0
4 Angola 1990 NaN 12171400.0

Related

Python pandas Dataframe column to rows manipulation [duplicate]

I'm trying to transpose a few columns while keeping the other columns. I'm having a hard time with pivot codes or transpose codes as it doesn't really give me the output I need.
Can anyone help?
I have this data frame:
EmpID  Goal  week 1  week 2  week 3  week 4
1      556   54      33      24      54
2      342   32      32      56      43
3      534   43      65      64      21
4      244   45      87      5       22
My expected dataframe output is:
EmpID  Goal  Weeks   Actual
1      556   week 1  54
1      556   week 2  33
1      556   week 3  24
1      556   week 4  54
and so on until all employee IDs are listed.
Something like this.
# Python - melt DF
import pandas as pd
d = {'Country Code': [1960, 1961, 1962, 1963, 1964, 1965, 1966],
     'ABW': [2.615300, 2.734390, 2.678430, 2.929920, 2.963250, 3.060540, 4.349760],
     'AFG': [0.249760, 0.218480, 0.210840, 0.217240, 0.211410, 0.209910, 0.671330],
     'ALB': ['NaN', 'NaN', 'NaN', 'NaN', 'NaN', 'NaN', 1.12214]}
df = pd.DataFrame(data=d)
print(df)
df1 = (df.melt(['Country Code'], var_name='Year', value_name='Econometric_Metric')
         .sort_values(['Country Code','Year'])
         .reset_index(drop=True))
print(df1)
df2 = (df.set_index(['Country Code'])
         .stack(dropna=False)
         .reset_index(name='Econometric_Metric')
         .rename(columns={'level_1':'Year'}))
print(df2)
# BEFORE
ABW AFG ALB Country Code
0 2.61530 0.24976 NaN 1960
1 2.73439 0.21848 NaN 1961
2 2.67843 0.21084 NaN 1962
3 2.92992 0.21724 NaN 1963
4 2.96325 0.21141 NaN 1964
5 3.06054 0.20991 NaN 1965
6 4.34976 0.67133 1.12214 1966
# AFTER
Country Code Year Econometric_Metric
0 1960 ABW 2.6153
1 1960 AFG 0.24976
2 1960 ALB NaN
3 1961 ABW 2.73439
4 1961 AFG 0.21848
5 1961 ALB NaN
6 1962 ABW 2.67843
7 1962 AFG 0.21084
8 1962 ALB NaN
9 1963 ABW 2.92992
10 1963 AFG 0.21724
11 1963 ALB NaN
12 1964 ABW 2.96325
13 1964 AFG 0.21141
14 1964 ALB NaN
15 1965 ABW 3.06054
16 1965 AFG 0.20991
17 1965 ALB NaN
18 1966 ABW 4.34976
19 1966 AFG 0.67133
20 1966 ALB 1.12214
Country Code Year Econometric_Metric
0 1960 ABW 2.6153
1 1960 AFG 0.24976
2 1960 ALB NaN
3 1961 ABW 2.73439
4 1961 AFG 0.21848
5 1961 ALB NaN
6 1962 ABW 2.67843
7 1962 AFG 0.21084
8 1962 ALB NaN
9 1963 ABW 2.92992
10 1963 AFG 0.21724
11 1963 ALB NaN
12 1964 ABW 2.96325
13 1964 AFG 0.21141
14 1964 ALB NaN
15 1965 ABW 3.06054
16 1965 AFG 0.20991
17 1965 ALB NaN
18 1966 ABW 4.34976
19 1966 AFG 0.67133
20 1966 ALB 1.12214
Also, take a look at the link below, for more info.
https://www.dataindependent.com/pandas/pandas-melt/
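Applied to the EmpID frame from the question, the same melt idea might look like the sketch below. The column names Weeks and Actual are taken from the expected output; the variable name long_df is just an illustrative choice.

```python
import pandas as pd

# The wide frame from the question.
df = pd.DataFrame({
    "EmpID": [1, 2, 3, 4],
    "Goal": [556, 342, 534, 244],
    "week 1": [54, 32, 43, 45],
    "week 2": [33, 32, 65, 87],
    "week 3": [24, 56, 64, 5],
    "week 4": [54, 43, 21, 22],
})

# Keep EmpID and Goal as identifiers; turn the week columns into rows.
long_df = (df.melt(id_vars=["EmpID", "Goal"], var_name="Weeks", value_name="Actual")
             .sort_values(["EmpID", "Weeks"])
             .reset_index(drop=True))
print(long_df)
```

The sort_values step is only there to group each employee's weeks together, as in the expected output; melt on its own returns all "week 1" rows first, then all "week 2" rows, and so on.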

Applying rolling median across row for pandas dataframe

I would like to apply a rolling median to replace NaN values in the following dataframe, with a window size of 3:
1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 ... 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020 2021
17 366000.0 278000.0 330000.0 NaN 434000.0 470600.0 433000.0 456000.0 556300.0 580200.0 635300.0 690600.0 800000.0 NaN 821500.0 ... 850800.0 905000.0 947500.0 1016500.0 1043900.0 1112800.0 1281900.0 1312700.0 1422000.0 1526900.0 1580000.0 1599000.0 1580000.0 NaN NaN
However, the pandas rolling function seems to work along columns and not along a row. How can I fix this? Also, the solution should NOT change any of the non-NaN values in that row.
First compute the rolling medians by using rolling() with axis=1 (row-wise), min_periods=0 (to handle NaN), and closed='both' (otherwise left edge gets excluded).
Then replace only the NaN entries with these medians by using fillna().
medians = df.rolling(3, min_periods=0, closed='both', axis=1).median()
df = df.fillna(medians)
# 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 ... 2012 2013 2014 2015 2016 2017 2018 2019 2020 2021
# 17 366000.0 278000.0 330000.0 330000.0 434000.0 470600.0 433000.0 456000.0 556300.0 580200.0 ... 1112800.0 1281900.0 1312700.0 1422000.0 1526900.0 1580000.0 1599000.0 1580000.0 1580000.0 1589500.0
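If your pandas version warns about or rejects the axis argument of rolling(), a possible workaround is to transpose, roll down the rows, and transpose back. The sketch below uses a shortened, made-up version of the row from the question and otherwise keeps the same window settings as above:

```python
import pandas as pd

# Shortened single-row frame shaped like the one in the question.
df = pd.DataFrame(
    {1990: [366000.0], 1991: [278000.0], 1992: [330000.0],
     1993: [float("nan")], 1994: [434000.0]},
    index=[17],
)

# Years become rows after .T, so the default row-wise rolling applies;
# the second .T restores the original orientation.
medians = df.T.rolling(3, min_periods=0, closed="both").median().T

# fillna only writes into cells that are NaN, so existing values stay as they are.
filled = df.fillna(medians)
print(filled)
```

Because the medians are computed from the original frame, a filled cell never feeds into the median used for a later gap.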

How to replace NA's with the mean of variables when the mean is calculated per each Country's data

I need help writing Python code to replace NAs with the mean of the variable (suicide rate is the variable, and there are 18 years of data for each country; country is another variable). So I want the mean of the 17 years' worth of suicide rates for the specific country to replace the NA for the 18th year. Example: Saudi Arabia has one year of data missing out of the 18 years. I want to find the mean of the 17 years of suicide rates and replace the NA with that mean. I need the code to loop through and replace the NAs for every variable. All variables are rates of suicides or deaths. The picture shows a highlighted cell which is an example of one that is missing data. Each country has the data for the 18 years from 1990 to 2018.
Suppose you had this dataframe:
ID Year Entity Variable_1 Variable_2
0 0 2000 Canada 120.0 600.0
1 1 2001 Canada 100.0 700.0
2 2 2002 Canada NaN 800.0
3 3 2000 Switzerland 300.0 200.0
4 4 2001 Switzerland 400.0 NaN
5 5 2002 Switzerland 500.0 400.0
You could create another dataframe with the means for each country and variable:
means = df.groupby('Entity').mean()
Then you could loop through each country and each variable, and set the missing values to the appropriate mean for that country and variable:
for country in df.Entity.unique():
    for col in df.drop(columns = ['ID','Year','Entity']).columns:
        df.loc[(df.Entity == country) & (df[col].isnull()),col] = means.loc[country,col]
Result:
ID Year Entity Variable_1 Variable_2
0 0 2000 Canada 120.0 600.0
1 1 2001 Canada 100.0 700.0
2 2 2002 Canada 110.0 800.0
3 3 2000 Switzerland 300.0 200.0
4 4 2001 Switzerland 400.0 300.0
5 5 2002 Switzerland 500.0 400.0
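The same result can probably be reached without explicit loops by combining groupby.transform with fillna. The sketch below rebuilds the example frame above; value_cols and group_means are illustrative names, not part of the original answer.

```python
import pandas as pd

# The example frame from above.
df = pd.DataFrame({
    "ID": range(6),
    "Year": [2000, 2001, 2002] * 2,
    "Entity": ["Canada"] * 3 + ["Switzerland"] * 3,
    "Variable_1": [120.0, 100.0, float("nan"), 300.0, 400.0, 500.0],
    "Variable_2": [600.0, 700.0, 800.0, 200.0, float("nan"), 400.0],
})

value_cols = ["Variable_1", "Variable_2"]

# transform('mean') returns a frame aligned row by row with df, where every
# cell holds the mean of that column within the row's Entity group.
group_means = df.groupby("Entity")[value_cols].transform("mean")

# Only the NaN cells are replaced by their group's mean.
df[value_cols] = df[value_cols].fillna(group_means)
print(df)
```

The mean skips NaN by default, so each gap is filled with the average of the years that do have data for that country.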

Pandas merge 2 databases based on 2 keys

I am trying to merge 2 pandas df's:
ISO3 Year Calories
AFG 1960 2300
AFG 1961 2323
...
USA 2005 2800
USA 2006 2828
and
ISO3 Year GDP
AFG 1980 3600
AFG 1981 3636
...
USA 2049 10000
USA 2050 10100
I have tried pd.merge(df1,df2,on=['ISO3','Year'],how=outer) and many others, but for some reason it does not work, any help?
pd.merge(df1, df2, on=['ISO3', 'Year'], how='outer') should work just fine. Can you modify the example, or post a concrete example of df1, df2 which is not yielding the correct result?
In [58]: df1 = pd.read_table('data', sep=r'\s+')
In [59]: df1
Out[59]:
ISO3 Year Calories
0 AFG 1960 2300
1 AFG 1961 2323
2 USA 2005 2800
3 USA 2006 2828
In [60]: df2 = pd.read_table('data2', sep=r'\s+')
In [61]: df2
Out[61]:
ISO3 Year GDP
0 AFG 1980 3600
1 AFG 1981 3636
2 USA 2049 10000
3 USA 2050 10100
In [62]: pd.merge(df1, df2, on=['ISO3', 'Year'], how='outer')
Out[62]:
ISO3 Year Calories GDP
0 AFG 1960 2300 NaN
1 AFG 1961 2323 NaN
2 USA 2005 2800 NaN
3 USA 2006 2828 NaN
4 AFG 1980 NaN 3600
5 AFG 1981 NaN 3636
6 USA 2049 NaN 10000
7 USA 2050 NaN 10100
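When an outer merge does not look right, one way to investigate is the indicator=True option of pd.merge, which adds a _merge column saying where each row came from. The two small frames below are made up for illustration (they are not the asker's data) and share two key pairs:

```python
import pandas as pd

# Made-up frames with partly overlapping (ISO3, Year) keys.
df1 = pd.DataFrame({
    "ISO3": ["AFG", "AFG", "USA"],
    "Year": [1980, 1981, 2005],
    "Calories": [2300, 2323, 2800],
})
df2 = pd.DataFrame({
    "ISO3": ["AFG", "USA", "USA"],
    "Year": [1980, 2005, 2050],
    "GDP": [3600, 9000, 10100],
})

# indicator=True tags every row as 'left_only', 'right_only' or 'both'.
merged = pd.merge(df1, df2, on=["ISO3", "Year"], how="outer", indicator=True)
counts = merged["_merge"].value_counts()
print(merged)
print(counts)
```

If fewer rows than expected are tagged both, it may be worth comparing df1.dtypes and df2.dtypes for the key columns, since keys stored with different types (for example Year as text in one frame and as integer in the other) are one possible cause.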

Pandas DataFrame from WB WDI data: combine year columns into "year" variable and merge rows

I have a dataset (.tsv file) with the following columns. (It's the World Bank's new WDI all-in all-time single-download dataset. Nice!)
country countrycode varname 1960 1961 1962
afghanistan AFG GDP 5.6 5.7 5.8
afghanistan AFG Gini .77 .78 .75
afghanistan AFG educ 8.1 8.2 8.3
afghanistan AFG pop 888 889 890
albania ALB GDP 6.6 6.7 6.8
albania ALB Gini .45 .46 .47
albania ALB educ 6.2 6.3 6.4
albania ALB pop 777 778 779
I need a pandas DataFrame with ['GDP','Gini','educ','pop'] as columns, along with ['country', 'countrycode', 'year']. So the values for "year" are currently columns!
And I'd like there to be only one row for each country-year combination.
For instance, the columns and first row would be
country countrycode year GDP Gini educ pop
afghanistan AFG 1960 5.6 .77 8.1 888
This seems like some complex pivot or opposite-of-"melt", but I cannot figure it out.
In [59]: df
Out[59]:
country countrycode varname 1960 1961 1962
0 afghanistan AFG GDP 5.60 5.70 5.80
1 afghanistan AFG Gini 0.77 0.78 0.75
2 afghanistan AFG educ 8.10 8.20 8.30
3 afghanistan AFG pop 888.00 889.00 890.00
4 albania ALB GDP 6.60 6.70 6.80
5 albania ALB Gini 0.45 0.46 0.47
6 albania ALB educ 6.20 6.30 6.40
7 albania ALB pop 777.00 778.00 779.00
In [60]: df = df.set_index(['country', 'countrycode', 'varname'])
In [61]: df.columns.name = 'year'
In [62]: df.stack().unstack('varname')
Out[62]:
varname GDP Gini educ pop
country countrycode year
afghanistan AFG 1960 5.6 0.77 8.1 888
1961 5.7 0.78 8.2 889
1962 5.8 0.75 8.3 890
albania ALB 1960 6.6 0.45 6.2 777
1961 6.7 0.46 6.3 778
1962 6.8 0.47 6.4 779
The latter is a frame with a MultiIndex; you can call reset_index to move the MultiIndex levels to regular columns.
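As a plain script rather than an interactive session, the steps above, including the final reset_index, might be sketched as follows. The frame is rebuilt from the sample in the question; wide and tidy are illustrative names.

```python
import pandas as pd

# Sample data from the question, with the year columns kept as strings.
df = pd.DataFrame({
    "country": ["afghanistan"] * 4 + ["albania"] * 4,
    "countrycode": ["AFG"] * 4 + ["ALB"] * 4,
    "varname": ["GDP", "Gini", "educ", "pop"] * 2,
    "1960": [5.6, 0.77, 8.1, 888, 6.6, 0.45, 6.2, 777],
    "1961": [5.7, 0.78, 8.2, 889, 6.7, 0.46, 6.3, 778],
    "1962": [5.8, 0.75, 8.3, 890, 6.8, 0.47, 6.4, 779],
})

# Everything except the year columns goes into the index.
wide = df.set_index(["country", "countrycode", "varname"])
wide.columns.name = "year"

# stack() moves the years into the index; unstack('varname') spreads the
# variables back out as columns; reset_index() flattens the MultiIndex.
tidy = wide.stack().unstack("varname").reset_index()
tidy.columns.name = None  # drop the leftover 'varname' label on the columns
print(tidy)
```

This gives one row per country-year combination with country, countrycode and year as ordinary columns.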
Group your DataFrame by country and countrycode and then apply your own function:
In [13]: def f(df):
   ....:     del df['country']
   ....:     del df['countrycode']
   ....:     df = df.set_index('varname')
   ....:     df.index.name = None
   ....:     df = df.T
   ....:     df.index.name = 'year'
   ....:     return df
   ....:
In [14]: df.groupby(['country', 'countrycode']).apply(f).reset_index()
Out[14]:
country countrycode year GDP Gini educ pop
0 afghanistan AFG 1960 5.6 0.77 8.1 888
1 afghanistan AFG 1961 5.7 0.78 8.2 889
2 afghanistan AFG 1962 5.8 0.75 8.3 890
3 albania ALB 1960 6.6 0.45 6.2 777
4 albania ALB 1961 6.7 0.46 6.3 778
5 albania ALB 1962 6.8 0.47 6.4 779
I'm suggesting that @Wouter may put this in his (accepted) answer, as it uses the actual names from the WDI data and makes it easier to cut and paste for someone else using them. Sorry, I'm sure this isn't the right way to communicate this...
For any variables that you want to keep/use, just give them a name in this dict:
WDIconversions={"Year":'year',
"YearCode":'',
"Country Name":'country_name_wb',
"Country Code":'countryCode_ISO3_WB',
"Inflation, consumer prices (annual %)":'',
"Inflation, GDP deflator (annual %)":'',
"GDP per capita, PPP (constant 2005 international $)":'GDPpc',
"Firms with female participation in ownership (% of firms)":'',
"Investment in energy with private participation (current US$)":'',
"Investment in telecoms with private participation (current US$)":'',
"Investment in transport with private participation (current US$)":'',
"Investment in water and sanitation with private participation (current US$)":'',
"Labor participation rate, female (% of female population ages 15+)":'',
"Labor participation rate, male (% of male population ages 15+)":'',
"Labor participation rate, total (% of total population ages 15+)":'',
"Ratio of female to male labor participation rate (%)":'',
"Life expectancy at birth, female (years)":'',
"Life expectancy at birth, male (years)":'',
"Life expectancy at birth, total (years)":'lifeExpectancy',
"Population, total":'nat_pop',
"GINI index":'GiniWB',
} # etc etc etc
dfW=pd.read_table(WBDrawfile)
df = dfW.set_index(['Country Name','Country Code','Indicator Name'])
del df['Indicator Code']
df.columns.name = 'year'
df=df.stack().unstack('Indicator Name')
df=df[[kk for kk,ii in WDIconversions.items() if ii and kk in df]].reset_index().rename(columns=WDIconversions)
That results in:
df
<class 'pandas.core.frame.DataFrame'>
Int64Index: 12983 entries, 0 to 12982
Data columns:
country_name_wb 12983 non-null values
countryCode_ISO3_WB 12983 non-null values
year 12983 non-null values
GiniWB 845 non-null values
nat_pop 12601 non-null values
GDPpc 6292 non-null values
educPrimary 4949 non-null values
lifeExpectancy 11077 non-null values
dtypes: float64(5), object(3)
