Pandas merge 2 databases based on 2 keys - python

I am trying to merge 2 pandas df's:
ISO3 Year Calories
AFG 1960 2300
AFG 1961 2323
...
USA 2005 2800
USA 2006 2828
and
ISO3 Year GDP
AFG 1980 3600
AFG 1981 3636
...
USA 2049 10000
USA 2050 10100
I have tried pd.merge(df1,df2,on=['ISO3','Year'],how=outer) and many other variants, but for some reason it does not work. Any help?

pd.merge(df1, df2, on=['ISO3', 'Year'], how='outer') should work just fine. Note that the call you posted has how=outer without quotes, which raises a NameError unless outer happens to be a variable. Can you modify the example, or post concrete df1 and df2 that do not yield the correct result?
In [58]: df1 = pd.read_table('data', sep='\s+')
In [59]: df1
Out[59]:
ISO3 Year Calories
0 AFG 1960 2300
1 AFG 1961 2323
2 USA 2005 2800
3 USA 2006 2828
In [60]: df2 = pd.read_table('data2', sep='\s+')
In [61]: df2
Out[61]:
ISO3 Year GDP
0 AFG 1980 3600
1 AFG 1981 3636
2 USA 2049 10000
3 USA 2050 10100
In [62]: pd.merge(df1, df2, on=['ISO3', 'Year'], how='outer')
Out[62]:
ISO3 Year Calories GDP
0 AFG 1960 2300 NaN
1 AFG 1961 2323 NaN
2 USA 2005 2800 NaN
3 USA 2006 2828 NaN
4 AFG 1980 NaN 3600
5 AFG 1981 NaN 3636
6 USA 2049 NaN 10000
7 USA 2050 NaN 10100
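Since none of the (ISO3, Year) pairs in the two sample frames overlap, an outer join is what keeps all the rows; an inner join on the same keys would come back empty. A minimal sketch built from the rows shown in the question:
import pandas as pd

df1 = pd.DataFrame({'ISO3': ['AFG', 'AFG', 'USA', 'USA'],
                    'Year': [1960, 1961, 2005, 2006],
                    'Calories': [2300, 2323, 2800, 2828]})
df2 = pd.DataFrame({'ISO3': ['AFG', 'AFG', 'USA', 'USA'],
                    'Year': [1980, 1981, 2049, 2050],
                    'GDP': [3600, 3636, 10000, 10100]})

# how='outer' keeps every (ISO3, Year) pair from both frames and
# fills NaN where the other frame has no matching row.
merged = pd.merge(df1, df2, on=['ISO3', 'Year'], how='outer')

# These particular keys never overlap, so an inner join is empty.
assert pd.merge(df1, df2, on=['ISO3', 'Year'], how='inner').empty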


Python pandas Dataframe column to rows manipulation [duplicate]

This question already has answers here: Pandas Melt Function (2 answers). Closed 1 year ago.
I'm trying to transpose a few columns while keeping the other columns in place. I'm having a hard time with pivot or transpose code, as it doesn't give me the output I need.
Can anyone help?
I have this data frame:
EmpID  Goal  week 1  week 2  week 3  week 4
1      556   54      33      24      54
2      342   32      32      56      43
3      534   43      65      64      21
4      244   45      87      5       22
My expected dataframe output is:
EmpID  Goal  Weeks   Actual
1      556   week 1  54
1      556   week 2  33
1      556   week 3  24
1      556   week 4  54
and so on until the full employee IDs are listed.
Something like this.
# Python - melt DF
import numpy as np
import pandas as pd

# Use np.nan (not the string 'NaN') so the missing values behave as real NaNs
d = {'Country Code': [1960, 1961, 1962, 1963, 1964, 1965, 1966],
     'ABW': [2.615300, 2.734390, 2.678430, 2.929920, 2.963250, 3.060540, 4.349760],
     'AFG': [0.249760, 0.218480, 0.210840, 0.217240, 0.211410, 0.209910, 0.671330],
     'ALB': [np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, 1.12214]}
df = pd.DataFrame(data=d)
print(df)

# Option 1: melt, keeping 'Country Code' fixed
df1 = (df.melt(['Country Code'], var_name='Year', value_name='Econometric_Metric')
         .sort_values(['Country Code', 'Year'])
         .reset_index(drop=True))
print(df1)

# Option 2: set_index + stack (dropna=False keeps the NaN rows)
df2 = (df.set_index(['Country Code'])
         .stack(dropna=False)
         .reset_index(name='Econometric_Metric')
         .rename(columns={'level_1': 'Year'}))
print(df2)
# BEFORE
ABW AFG ALB Country Code
0 2.61530 0.24976 NaN 1960
1 2.73439 0.21848 NaN 1961
2 2.67843 0.21084 NaN 1962
3 2.92992 0.21724 NaN 1963
4 2.96325 0.21141 NaN 1964
5 3.06054 0.20991 NaN 1965
6 4.34976 0.67133 1.12214 1966
# AFTER
Country Code Year Econometric_Metric
0 1960 ABW 2.6153
1 1960 AFG 0.24976
2 1960 ALB NaN
3 1961 ABW 2.73439
4 1961 AFG 0.21848
5 1961 ALB NaN
6 1962 ABW 2.67843
7 1962 AFG 0.21084
8 1962 ALB NaN
9 1963 ABW 2.92992
10 1963 AFG 0.21724
11 1963 ALB NaN
12 1964 ABW 2.96325
13 1964 AFG 0.21141
14 1964 ALB NaN
15 1965 ABW 3.06054
16 1965 AFG 0.20991
17 1965 ALB NaN
18 1966 ABW 4.34976
19 1966 AFG 0.67133
20 1966 ALB 1.12214
print(df2) produces the identical table.
Also, take a look at the link below for more info.
https://www.dataindependent.com/pandas/pandas-melt/
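For completeness, the same melt pattern applied directly to the EmpID frame from the question (a sketch; the names Weeks and Actual are taken from the expected output above):
import pandas as pd

df = pd.DataFrame({'EmpID': [1, 2, 3, 4],
                   'Goal': [556, 342, 534, 244],
                   'week 1': [54, 32, 43, 45],
                   'week 2': [33, 32, 65, 87],
                   'week 3': [24, 56, 64, 5],
                   'week 4': [54, 43, 21, 22]})

# Keep EmpID and Goal fixed; unpivot the week columns into rows
out = (df.melt(id_vars=['EmpID', 'Goal'], var_name='Weeks', value_name='Actual')
         .sort_values(['EmpID', 'Weeks'])
         .reset_index(drop=True))
print(out.head(4))
#    EmpID  Goal   Weeks  Actual
# 0      1   556  week 1      54
# 1      1   556  week 2      33
# 2      1   556  week 3      24
# 3      1   556  week 4      54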

How to delete rows for column having Non-NaN values

Input Dataframe(df)
Country Region Date Value.....
ABW NaN 01-01-2020 123
ABW NaN 02-01-2020 1234
ABW NaN 03-01-2020 3242
USA NaN 04-01-2020 4354
USA NaN 05-01-2020 43543
USA NaN 06-01-2020 34534
USA NaN 07-01-2020 435
USA WA 08-01-2020 43345
USA WA 09-01-2020 345
USA WV 10-01-2020 345
.
.
.
.
Expected Output(df1)
Country Region Date Value.....
ABW NaN 01-01-2020 123
ABW NaN 02-01-2020 1234
ABW NaN 03-01-2020 3242
USA NaN 04-01-2020 4354
USA NaN 05-01-2020 43543
USA NaN 06-01-2020 34534
USA NaN 07-01-2020 435
.
.
.
.
So from the above dataframe you can see that the column 'Region' has NaN as well as non-NaN values; I'd like to remove every row where 'Region' has a non-NaN value.
Also, after performing the above operation, if I wanted to remove the Region column entirely, how would I do that in the fastest possible way (the frame has 10k+ columns)? Experts, please help!
FINAL Expected Output
Country Date Value.....
ABW 01-01-2020 123
ABW 02-01-2020 1234
ABW 03-01-2020 3242
USA 04-01-2020 4354
USA 05-01-2020 43543
USA 06-01-2020 34534
USA 07-01-2020 435
Here's the code I tried
df1=df1.isnull(df1['Region'])
Error
df1=df.isnull(df['Region'])
TypeError: isnull() takes 1 positional argument but 2 were given
Using @BEN_YO's suggestion, this is what I did, and it works fine:
filtered_df = df1[df1['Region'].isnull()]
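For the second part of the question (removing the Region column entirely after the filter), chaining drop onto the boolean filter should work; a minimal sketch on a few of the rows above:
import numpy as np
import pandas as pd

df = pd.DataFrame({'Country': ['ABW', 'ABW', 'USA', 'USA', 'USA'],
                   'Region': [np.nan, np.nan, np.nan, 'WA', 'WV'],
                   'Date': ['01-01-2020', '02-01-2020', '04-01-2020',
                            '08-01-2020', '10-01-2020'],
                   'Value': [123, 1234, 4354, 43345, 345]})

# Keep only the rows where Region is NaN, then drop the column itself
df1 = df[df['Region'].isnull()].drop(columns='Region')
print(df1)
#   Country        Date  Value
# 0     ABW  01-01-2020    123
# 1     ABW  02-01-2020   1234
# 2     USA  04-01-2020   4354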

Loop only takes last value

I have a dataFrame with country-specific population for each year and a pandas Series with the world population for each year.
This is the Series I am using:
pop_tot = df3.groupby('Year')['population'].sum()
Year
1990 4.575442e+09
1991 4.659075e+09
1992 4.699921e+09
1993 4.795129e+09
1994 4.862547e+09
1995 4.949902e+09
... ...
2017 6.837429e+09
and this is the DataFrame I am using
Country Year HDI population
0 Afghanistan 1990 NaN 1.22491e+07
1 Albania 1990 0.645 3.28654e+06
2 Algeria 1990 0.577 2.59124e+07
3 Andorra 1990 NaN 54509
4 Angola 1990 NaN 1.21714e+07
... ... ... ... ...
4096 Uzbekistan 2017 0.71 3.23872e+07
4097 Vanuatu 2017 0.603 276244
4098 Zambia 2017 0.588 1.70941e+07
4099 Zimbabwe 2017 0.535 1.65299e+07
I want to calculate the proportion of the world's population that the population of that country represents for each year, so I loop over the Series and the DataFrame as follows:
j = 0
for i in range(len(df3)):
    if df3.iloc[i, 1] == pop_tot.index[j]:
        df3['pop_tot'] = pop_tot[j]  # Sanity check
        df3['weighted'] = df3['population'] / pop_tot[j] * df3.iloc[i, 2]
    else:
        j = j + 1
However, the DataFrame that I get back is not the expected one. I end up dividing all the values by the total population of 2017, which gives proportions that are not correct for the other years (i.e. for the first rows, pop_tot should be 4.575442e+09, which corresponds to 1990 according to the Series above, not 6.837429e+09, which corresponds to 2017).
Country Year HDI population pop_tot weighted
0 Albania 1990 0.645 3.28654e+06 6.837429e+09 0.000257158
1 Algeria 1990 0.577 2.59124e+07 6.837429e+09 0.00202753
2 Argentina 1990 0.704 3.27297e+07 6.837429e+09 0.00256096
I can't see what the mistake in the loop is, though.
Thanks in advance.
You don't need a loop; you can use groupby.transform to create the column pop_tot in df3 directly. Then the column weighted is a plain column operation:
df3['pop_tot'] = df3.groupby('Year')['population'].transform('sum')
df3['weighted'] = df3['population'] / df3['pop_tot']
As @roganjosh pointed out, the problem with your method is that you overwrite the whole pop_tot and weighted columns every time your if condition is met, so the last matching iteration wins: the year is probably 2017 by then, which is why pop_tot holds the 2017 total and weighted is computed from that value as well.
You don't have to loop; it's slower and can make things really complex quite fast. Use pandas' and numpy's vectorized solutions, like this for example:
df['pop_tot'] = df.population.sum()
df['weighted'] = df.population / df.population.sum()
print(df)
Country Year HDI population pop_tot weighted
0 Afghanistan 1990 NaN 12249100.0 53673949.0 0.228213
1 Albania 1990 0.645 3286540.0 53673949.0 0.061232
2 Algeria 1990 0.577 25912400.0 53673949.0 0.482774
3 Andorra 1990 NaN 54509.0 53673949.0 0.001016
4 Angola 1990 NaN 12171400.0 53673949.0 0.226766
Edit after OP's comment
df['pop_tot'] = df.groupby('Year').population.transform('sum')
df['weighted'] = df.population / df['pop_tot']
print(df)
Country Year HDI population pop_tot weighted
0 Afghanistan 1990 NaN 12249100.0 53673949.0 0.228213
1 Albania 1990 0.645 3286540.0 53673949.0 0.061232
2 Algeria 1990 0.577 25912400.0 53673949.0 0.482774
3 Andorra 1990 NaN 54509.0 53673949.0 0.001016
4 Angola 1990 NaN 12171400.0 53673949.0 0.226766
Note: I used the small dataset you gave as an example:
Country Year HDI population
0 Afghanistan 1990 NaN 12249100.0
1 Albania 1990 0.645 3286540.0
2 Algeria 1990 0.577 25912400.0
3 Andorra 1990 NaN 54509.0
4 Angola 1990 NaN 12171400.0
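As a quick sanity check on the vectorized version (a sketch using the five 1990 rows above), the weighted shares should sum to 1 within each year:
import numpy as np
import pandas as pd

df = pd.DataFrame({'Country': ['Afghanistan', 'Albania', 'Algeria',
                               'Andorra', 'Angola'],
                   'Year': [1990] * 5,
                   'HDI': [np.nan, 0.645, 0.577, np.nan, np.nan],
                   'population': [12249100.0, 3286540.0, 25912400.0,
                                  54509.0, 12171400.0]})

# Per-year total, broadcast back onto every row of that year
df['pop_tot'] = df.groupby('Year')['population'].transform('sum')
df['weighted'] = df['population'] / df['pop_tot']

# Each year's shares add up to 1
print(df.groupby('Year')['weighted'].sum())  # 1990 -> 1.0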

Transpose and widen Data

My pandas DataFrame looks as follows:
Country Code      1960      1961      1962      1963      1964      1965  ...      2015
ABW           2.615300  2.734390  2.678430  2.929920  2.963250  3.060540  ...  4.349760
AFG           0.249760  0.218480  0.210840  0.217240  0.211410  0.209910  ...  0.671330
ALB                NaN       NaN       NaN       NaN       NaN       NaN  ...   1.12214
...
How can I reshape it so that it looks as follows?
Country_Code Year Econometric_Metric
ABW 1960 2.615300
ABW 1961 2.734390
ABW 1962 2.678430
...
ABW 2015 4.349760
AFG 1960 0.249760
AFG 1961 0.218480
AFG 1962 0.210840
...
AFG 2015 0.671330
ALB 1960 NaN
ALB 1961 NaN
ALB 1962 NaN
ALB 2015 1.12214
...
Thanks.
I think you need melt with sort_values:
df = (df.melt(['Country Code'], var_name='Year', value_name='Econometric_Metric')
.sort_values(['Country Code','Year'])
.reset_index(drop=True))
Or set_index with stack:
df = (df.set_index(['Country Code'])
.stack(dropna=False)
.reset_index(name='Econometric_Metric')
.rename(columns={'level_1':'Year'}))
print(df.head(10))
Country Code Year Econometric_Metric
0 ABW 1960 2.61530
1 ABW 1961 2.73439
2 ABW 1962 2.67843
3 ABW 1963 2.92992
4 ABW 1964 2.96325
5 ABW 1965 3.06054
6 ABW 1966 NaN
7 ABW 1967 NaN
8 ABW 1968 NaN
9 ABW 2015 4.34976
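One detail worth noting with the melt approach: the former column labels arrive in Year as strings (assuming the wide frame was read with string column headers), so a cast is needed if you want numeric years. A small sketch:
import numpy as np
import pandas as pd

df = pd.DataFrame({'Country Code': ['ABW', 'AFG', 'ALB'],
                   '1960': [2.615300, 0.249760, np.nan],
                   '1961': [2.734390, 0.218480, np.nan]})

df_long = df.melt('Country Code', var_name='Year',
                  value_name='Econometric_Metric')
# 'Year' holds the old column labels, i.e. strings such as '1960';
# cast back to int for sorting or plotting on a numeric axis
df_long['Year'] = df_long['Year'].astype(int)
print(df_long.dtypes)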

Pandas DataFrame from WB WDI data: combine year columns into "year" variable and merge rows

I have a dataset (.tsv file) with the following columns. (It's the World Bank's new WDI all-in all-time single-download dataset. Nice!)
country countrycode varname 1960 1961 1962
afghanistan AFG GDP 5.6 5.7 5.8
afghanistan AFG Gini .77 .78 .75
afghanistan AFG educ 8.1 8.2 8.3
afghanistan AFG pop 888 889 890
albania ALB GDP 6.6 6.7 6.8
albania ALB Gini .45 .46 .47
albania ALB educ 6.2 6.3 6.4
albania ALB pop 777 778 779
I need a pandas DataFrame with ['GDP','Gini','edu','pop'] as columns, along with ['country', 'countrycode', 'year']. So the values for "year" are currently columns!
And I'd like there to be only one row for each country-year combination.
For instance, the columns and first row would be
country countrycode year GDP Gini educ pop
afghanistan AFG 1960 5.6 .77 8.1 888
This seems like some complex pivot or opposite-of-"melt", but I cannot figure it out.
In [59]: df
Out[59]:
country countrycode varname 1960 1961 1962
0 afghanistan AFG GDP 5.60 5.70 5.80
1 afghanistan AFG Gini 0.77 0.78 0.75
2 afghanistan AFG educ 8.10 8.20 8.30
3 afghanistan AFG pop 888.00 889.00 890.00
4 albania ALB GDP 6.60 6.70 6.80
5 albania ALB Gini 0.45 0.46 0.47
6 albania ALB educ 6.20 6.30 6.40
7 albania ALB pop 777.00 778.00 779.00
In [60]: df = df.set_index(['country', 'countrycode', 'varname'])
In [61]: df.columns.name = 'year'
In [62]: df.stack().unstack('varname')
Out[62]:
varname GDP Gini educ pop
country countrycode year
afghanistan AFG 1960 5.6 0.77 8.1 888
1961 5.7 0.78 8.2 889
1962 5.8 0.75 8.3 890
albania ALB 1960 6.6 0.45 6.2 777
1961 6.7 0.46 6.3 778
1962 6.8 0.47 6.4 779
The latter is a frame with a MultiIndex; you can call reset_index to move the MultiIndex levels into regular columns.
Group your DataFrame by country and countrycode and then apply your own function:
In [13]: def f(df):
....: del df['country']
....: del df['countrycode']
....: df = df.set_index('varname')
....: df.index.name = None
....: df = df.T
....: df.index.name = 'year'
....: return df
....:
In [14]: df.groupby(['country', 'countrycode']).apply(f).reset_index()
Out[14]:
country countrycode year GDP Gini educ pop
0 afghanistan AFG 1960 5.6 0.77 8.1 888
1 afghanistan AFG 1961 5.7 0.78 8.2 889
2 afghanistan AFG 1962 5.8 0.75 8.3 890
3 albania ALB 1960 6.6 0.45 6.2 777
4 albania ALB 1961 6.7 0.46 6.3 778
5 albania ALB 1962 6.8 0.47 6.4 779
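An alternative, assuming pandas >= 1.1 (where pivot accepts a list-valued index): melt the year columns into rows, then pivot varname back out into columns. A sketch on the sample data:
import pandas as pd

df = pd.DataFrame({'country': ['afghanistan'] * 4 + ['albania'] * 4,
                   'countrycode': ['AFG'] * 4 + ['ALB'] * 4,
                   'varname': ['GDP', 'Gini', 'educ', 'pop'] * 2,
                   '1960': [5.6, .77, 8.1, 888, 6.6, .45, 6.2, 777],
                   '1961': [5.7, .78, 8.2, 889, 6.7, .46, 6.3, 778]})

# melt the year columns into rows, then spread varname back out
out = (df.melt(id_vars=['country', 'countrycode', 'varname'],
               var_name='year')
         .pivot(index=['country', 'countrycode', 'year'],
                columns='varname', values='value')
         .reset_index())
print(out)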
I'm suggesting that @Wouter may want to put this in his (accepted) answer, as it uses the actual names from the WDI data and makes it easier to cut and paste for someone else using them. Sorry -- I'm sure this isn't the right way to communicate this...
For any variables that you want to keep/use, just give them a name in this dict:
WDIconversions={"Year":'year',
"YearCode":'',
"Country Name":'country_name_wb',
"Country Code":'countryCode_ISO3_WB',
"Inflation, consumer prices (annual %)":'',
"Inflation, GDP deflator (annual %)":'',
"GDP per capita, PPP (constant 2005 international $)":'GDPpc',
"Firms with female participation in ownership (% of firms)":'',
"Investment in energy with private participation (current US$)":'',
"Investment in telecoms with private participation (current US$)":'',
"Investment in transport with private participation (current US$)":'',
"Investment in water and sanitation with private participation (current US$)":'',
"Labor participation rate, female (% of female population ages 15+)":'',
"Labor participation rate, male (% of male population ages 15+)":'',
"Labor participation rate, total (% of total population ages 15+)":'',
"Ratio of female to male labor participation rate (%)":'',
"Life expectancy at birth, female (years)":'',
"Life expectancy at birth, male (years)":'',
"Life expectancy at birth, total (years)":'lifeExpectancy',
"Population, total":'nat_pop',
"GINI index":'GiniWB',
} # etc etc etc
dfW = pd.read_table(WBDrawfile)
df = dfW.set_index(['Country Name', 'Country Code', 'Indicator Name'])
del df['Indicator Code']
df.columns.name = 'year'
# Stack the year columns into rows, then spread the indicators back out
df = df.stack().unstack('Indicator Name')
# Keep only the indicators given a name in WDIconversions, then rename them
df = df[[kk for kk, ii in WDIconversions.items() if ii and kk in df]].reset_index().rename(columns=WDIconversions)
That results in:
df
<class 'pandas.core.frame.DataFrame'>
Int64Index: 12983 entries, 0 to 12982
Data columns:
country_name_wb 12983 non-null values
countryCode_ISO3_WB 12983 non-null values
year 12983 non-null values
GiniWB 845 non-null values
nat_pop 12601 non-null values
GDPpc 6292 non-null values
educPrimary 4949 non-null values
lifeExpectancy 11077 non-null values
dtypes: float64(5), object(3)
