Merging csv files with some columns same and others different in python - python

I am new to coding and I'm having an issue merging csv files. I have searched similar questions and haven't found a fix. Just to include some relevant details:
CSV files are cancer types over the period of 1950 - 2017 for different countries (lung cancer, colorectal cancer, stomach cancer, liver cancer and breast cancer)
Below is an example of the layout of lung cancer.
dlung.describe(include='all')
dlung
Year Cancer Country Gender ASR SE
0 1950 Lung Australia Male 13.89 0.56
1 1951 Lung Australia Male 14.84 0.57
2 1952 Lung Australia Male 17.19 0.61
3 1953 Lung Australia Male 18.21 0.62
4 1954 Lung Australia Male 19.05 0.63
5 1955 Lung Australia Male 20.65 0.65
6 1956 Lung Australia Male 22.05 0.67
7 1957 Lung Australia Male 23.93 0.69
8 1958 Lung Australia Male 23.77 0.68
9 1959 Lung Australia Male 26.12 0.71
10 1960 Lung Australia Male 27.08 0.72
I am interested in joining all cancer types into one dataframe based on shared columns (year, country).
I have tried different methods, but they all seem to duplicate Year and Country (as below)
This one wasn't bad, but I have two columns for year and country
df_lung_colorectal = pd.concat([dlung, dcolorectal], axis = 1)
df_lung_colorectal
Year Cancer Country Gender ASR SE Year Cancer Country Gender ASR SE
If I continue like this, I will end up with 5 identical columns for YEAR and 5 for COUNTRY.
Any ideas on how merge all values that are independent (Cancer type and associated ASR (standardized risk), and SE values) with only one column for YEAR, COUNTRY (and GENDER) if possible?

Yes, it is possible if use DataFrame.set_index, but then are duplicated another columns names:
print (dlung)
Year Cancer Country Gender ASR SE
0 1950 Lung Australia Male 13.89 0.56
1 1951 Lung Australia Male 14.84 0.57
2 1952 Lung Australia Male 17.19 0.61
3 1953 Lung Australia Male 18.21 0.62
4 1954 Lung Australia Male 19.05 0.63
print (dcolorectal)
Year Cancer Country Gender ASR SE
6 1950 colorectal Australia Male 22.05 0.67
7 1951 colorectal Australia Male 23.93 0.69
8 1952 colorectal Australia Male 23.77 0.68
9 1953 colorectal Australia Male 26.12 0.71
10 1954 colorectal Australia Male 27.08 0.72
df_lung_colorectal = pd.concat([dlung.set_index(['Year','Country','Gender']),
dcolorectal.set_index(['Year','Country','Gender'])], axis = 1)
print (df_lung_colorectal)
Cancer ASR SE Cancer ASR SE
Year Country Gender
1950 Australia Male Lung 13.89 0.56 colorectal 22.05 0.67
1951 Australia Male Lung 14.84 0.57 colorectal 23.93 0.69
1952 Australia Male Lung 17.19 0.61 colorectal 23.77 0.68
1953 Australia Male Lung 18.21 0.62 colorectal 26.12 0.71
1954 Australia Male Lung 19.05 0.63 colorectal 27.08 0.72
But I think better is first concat all DataFrame together with axis=0, what is default value, so should be removed and last reshape by DataFrame.set_index and DataFrame.unstack:
df = pd.concat([dlung, dcolorectal]).set_index(['Year','Country','Gender','Cancer']).unstack()
df.columns = df.columns.map('_'.join)
df = df.reset_index()
print (df)
Year Country Gender ASR_Lung ASR_colorectal SE_Lung SE_colorectal
0 1950 Australia Male 13.89 22.05 0.56 0.67
1 1951 Australia Male 14.84 23.93 0.57 0.69
2 1952 Australia Male 17.19 23.77 0.61 0.68
3 1953 Australia Male 18.21 26.12 0.62 0.71
4 1954 Australia Male 19.05 27.08 0.63 0.72

Concat with axis=0 to merge them row-wise.
with axis=1 you are asking it to Concat side-to-side.

Related

Is there a way to group plots based on matching row values?

I have a data frame like shown below.
Country Type 2011 2012 2013
Afghanistan Estimate -1.63 -1.57 -1.41
Afghanistan Sources 5 8 7
Afghanistan Percentile 0.95 0.94 2.36
.
.
.
Zambia Estimate 1.63 1.57 1.41
Zambia Sources 7 10 8
Zambia Percentile 0.88 0.77 1.54
I am hoping to generate plots (preferably line graphs) for each country (Type will be used as legend). Is there a way to group plots for each country? I am relatively new and don't know where to begin.
I'm afraid you can't get away with at least some transformations.
If it's OK to use Seaborn for plotting, it could look something like this:
import pandas as pd
import seaborn as sns
from io import StringIO
df = pd.read_csv(StringIO('''
Country,Type,2011,2012,2013
Afghanistan,Estimate,-1.63,-1.57,-1.41
Afghanistan,Sources,5,8,7
Afghanistan,Percentile,0.95,0.94,2.36
Zambia,Estimate,1.63,1.57,1.41
Zambia,Sources,7,10,8
Zambia,Percentile,0.88,0.77,1.54
'''), dtype={'Country' : 'string',
'Type' : 'string',
'2011' : 'float',
'2012' : 'float',
'2013' : 'float'})
# Country Type 2011 2012 2013
# 0 Afghanistan Estimate -1.63 -1.57 -1.41
# 1 Afghanistan Sources 5.00 8.00 7.00
# 2 Afghanistan Percentile 0.95 0.94 2.36
# 3 Zambia Estimate 1.63 1.57 1.41
# 4 Zambia Sources 7.00 10.00 8.00
# 5 Zambia Percentile 0.88 0.77 1.54
# transform to long format
df = df.melt(id_vars=['Country', 'Type'],
value_vars=['2011','2012','2013'],
var_name='Year')
# df after melt:
# Country Type Year value
# 0 Afghanistan Estimate 2011 -1.63
# 1 Afghanistan Sources 2011 5.00
# 2 Afghanistan Percentile 2011 0.95
# 3 Zambia Estimate 2011 1.63
# 4 Zambia Sources 2011 7.00
# 5 Zambia Percentile 2011 0.88
# 6 Afghanistan Estimate 2012 -1.57
# 7 Afghanistan Sources 2012 8.00
# 8 Afghanistan Percentile 2012 0.94
# 9 Zambia Estimate 2012 1.57
# 10 Zambia Sources 2012 10.00
# 11 Zambia Percentile 2012 0.77
# 12 Afghanistan Estimate 2013 -1.41
# 13 Afghanistan Sources 2013 7.00
# 14 Afghanistan Percentile 2013 2.36
# 15 Zambia Estimate 2013 1.41
# 16 Zambia Sources 2013 8.00
# 17 Zambia Percentile 2013 1.54
sns.relplot(data=df, kind='line', x='Year',
y='value', hue='Type', col="Country")

Merge 2 data frames based on 2 condition in pandas

How to replace the NaN values with the values from the first df:
country sex year cancer
0 Albania female 2000 32
1 Albania male 2000 58
2 Antigua female 2000 2
3 Antigua male 2000 5
4 Argen female 2000 591
5 Argen male 2000 2061
in the second df:
country year sex cancer
0 Albania 1985 female NaN
1 Albania 1985 male NaN
2 Albania 1986 female NaN
3 Albania 1986 male NaN
4 Albania 1987 female 25.0
5 Antigua 1992 male NaN
6 Antigua 1985 female NaN
the final should look like:
country year sex cancer
0 Albania 1985 female 32
1 Albania 1985 male 58
2 Albania 1986 female 32
3 Albania 1986 male 58
4 Albania 1987 female 25
5 Antigua 1992 male 5
6 Antigua 1985 female 2
Important are 2 conditions Country and Sex
I am end up using fillna
df2.set_index(['country','sex'],inplace=True)
df2['cancer']=df2['cancer'].fillna(df1.set_index(['country','sex']).cancer)
df2.reset_index(inplace=True)
df2
Out[745]:
country sex year cancer
0 Albania female 1985 32.0
1 Albania male 1985 58.0
2 Albania female 1986 32.0
3 Albania male 1986 58.0
4 Albania female 1987 25.0
5 Antigua male 1992 5.0
6 Antigua female 1985 2.0

Loop only takes last value

I have a dataFrame with country-specific population for each year and a pandas Series with the world population for each year.
This is the Series I am using:
pop_tot = df3.groupby('Year')['population'].sum()
Year
1990 4.575442e+09
1991 4.659075e+09
1992 4.699921e+09
1993 4.795129e+09
1994 4.862547e+09
1995 4.949902e+09
... ...
2017 6.837429e+09
and this is the DataFrame I am using
Country Year HDI population
0 Afghanistan 1990 NaN 1.22491e+07
1 Albania 1990 0.645 3.28654e+06
2 Algeria 1990 0.577 2.59124e+07
3 Andorra 1990 NaN 54509
4 Angola 1990 NaN 1.21714e+07
... ... ... ... ...
4096 Uzbekistan 2017 0.71 3.23872e+07
4097 Vanuatu 2017 0.603 276244
4098 Zambia 2017 0.588 1.70941e+07
4099 Zimbabwe 2017 0.535 1.65299e+07
I want to calculate the proportion of the world's population that the population of that country represents for each year, so I loop over the Series and the DataFrame as follows:
j = 0
for i in range(len(df3)):
if df3.iloc[i,1]==pop_tot.index[j]:
df3['pop_tot']=pop_tot[j] #Sanity check
df3['weighted']=df3['population']/pop_tot[j]
*df3.iloc[i,2]
else:
j=j+1
However, the DataFrame that I get in return is not the expected one. I end up dividing all the values by the total population of 2017, thus giving me proportions which are not the correct ones for that year (i.e. for this first rows, pop_tot should be 4.575442e+09 as it corresponds to 1990 according to the Series above and not 6.837429e+09 which corresponds to 2017).
Country Year HDI population pop_tot weighted
0 Albania 1990 0.645 3.28654e+06 6.837429e+09 0.000257158
1 Algeria 1990 0.577 2.59124e+07 6.837429e+09 0.00202753
2 Argentina 1990 0.704 3.27297e+07 6.837429e+09 0.00256096
I can't see however what's the mistake in the loop.
Thanks in advance.
You don't need loop, you can use groupby.transform to create the column pop_tot in df3 directly. then for the column weighted just do column operation, such as:
df3['pop_tot'] = df3.groupby('Year')['population'].transform(sum)
df3['weighted'] = df3['population']/df3['pop_tot']
As #roganjosh pointed out, the problem with your method is that you replace the whole columns pop_tot and weighted everytime your condition if is met, so at the last iteration where this condition is met, the year being probably 2017, you define the value of the column pop_tot being the one of 2017 and calculate the weithed with this value as well.
You dont have to loop, its slower and can make things really complex quite fast. Use pandas and numpys vectorized solutions like this for example:
df['pop_tot'] = df.population.sum()
df['weighted'] = df.population / df.population.sum()
print(df)
Country Year HDI population pop_tot weighted
0 Afghanistan 1990 NaN 12249100.0 53673949.0 0.228213
1 Albania 1990 0.645 3286540.0 53673949.0 0.061232
2 Algeria 1990 0.577 25912400.0 53673949.0 0.482774
3 Andorra 1990 NaN 54509.0 53673949.0 0.001016
4 Angola 1990 NaN 12171400.0 53673949.0 0.226766
Edit after OP's comment
df['pop_tot'] = df.groupby('Year').population.transform('sum')
df['weighted'] = df.population / df['pop_tot']
print(df)
Country Year HDI population pop_tot weighted
0 Afghanistan 1990 NaN 12249100.0 53673949.0 0.228213
1 Albania 1990 0.645 3286540.0 53673949.0 0.061232
2 Algeria 1990 0.577 25912400.0 53673949.0 0.482774
3 Andorra 1990 NaN 54509.0 53673949.0 0.001016
4 Angola 1990 NaN 12171400.0 53673949.0 0.226766
note
I used the small dataset you gave as example:
Country Year HDI population
0 Afghanistan 1990 NaN 12249100.0
1 Albania 1990 0.645 3286540.0
2 Algeria 1990 0.577 25912400.0
3 Andorra 1990 NaN 54509.0
4 Angola 1990 NaN 12171400.0

How to convert multiple rows into stacked columns in excel /python?

Is there an easy way to convert from Type A to Type B.
Note : Kutools (Plugin in Excel) provides a solution for it but that is not robust and does not seem scalable.
Any workaround for this ?
Considering you can make the df look like below : (just remove the top row which says Type A)
GDP per capita 1950 1951 1952 1953
0 Antigua and Barbuda 3544 3633 3723 3817
1 Argentina 7540 7612 7019 7198
2 Armenia 1862 1834 1914 1958
3 Aruba 3897 3994 4094 4196
4 Australia 12073 12229 12084 12228
5 Austria 6919 7382 7386 7692
Using pd.melt()
>>pd.melt(df,id_vars='GDP per capita',var_name='Year',value_name='GDP Value')
GDP per capita Year GDP Value
0 Antigua and Barbuda 1950 3544
1 Argentina 1950 7540
2 Armenia 1950 1862
3 Aruba 1950 3897
4 Australia 1950 12073
5 Austria 1950 6919
6 Antigua and Barbuda 1951 3633
7 Argentina 1951 7612
8 Armenia 1951 1834
9 Aruba 1951 3994
10 Australia 1951 12229
11 Austria 1951 7382
12 Antigua and Barbuda 1952 3723
13 Argentina 1952 7019
14 Armenia 1952 1914
15 Aruba 1952 4094
16 Australia 1952 12084
17 Austria 1952 7386
18 Antigua and Barbuda 1953 3817
19 Argentina 1953 7198
20 Armenia 1953 1958
21 Aruba 1953 4196
22 Australia 1953 12228
23 Austria 1953 7692
To get the exact look like the image you have posted use:
df1=pd.melt(df,id_vars='GDP per capita',var_name='Year',value_name='GDP Value')
df1.rename(columns={'GDP per capita':'Country'},inplace=True)
df1['GDP'] = 'GDP per capita'
df1 = df1[['GDP','Country','Year','GDP Value']]
df1.to_csv('filepath+filename.csv,index=False)

Pandas DataFrame from WB WDI data: combine year columns into "year" variable and merge rows

I have a dataset (.tsv file) with the following columns. (It's the World Bank's new WDI all-in all-time single-download dataset. Nice!)
country countrycode varname 1960 1961 1962
afghanistan AFG GDP 5.6 5.7 5.8
afghanistan AFG Gini .77 .78 .75
afghanistan AFG educ 8.1 8.2 8.3
afghanistan AFG pop 888 889 890
albania ALB GDP 6.6 6.7 6.8
albania ALB Gini .45 .46 .47
albania ALB educ 6.2 6.3 6.4
albania ALB pop 777 778 779
I need a pandas DataFrame with ['GDP','Gini','edu','pop'] as columns, along with ['country', 'countrycode', 'year']. So the values for "year" are currently columns!
And I'd like there to be only one row for each country-year combination.
For instance, the columns and first row would be
country countrycode year GDP Gini educ pop
afghanistan AFG 1960 5.6 .77 8.1 888
This seems like some complex pivot or opposite-of-"melt", but I cannot figure it out.
In [59]: df
Out[59]:
country countrycode varname 1960 1961 1962
0 afghanistan AFG GDP 5.60 5.70 5.80
1 afghanistan AFG Gini 0.77 0.78 0.75
2 afghanistan AFG educ 8.10 8.20 8.30
3 afghanistan AFG pop 888.00 889.00 890.00
4 albania ALB GDP 6.60 6.70 6.80
5 albania ALB Gini 0.45 0.46 0.47
6 albania ALB educ 6.20 6.30 6.40
7 albania ALB pop 777.00 778.00 779.00
In [60]: df = df.set_index(['country', 'countrycode', 'varname'])
In [61]: df.columns.name = 'year'
In [62]: df.stack().unstack('varname')
Out[62]:
varname GDP Gini educ pop
country countrycode year
afghanistan AFG 1960 5.6 0.77 8.1 888
1961 5.7 0.78 8.2 889
1962 5.8 0.75 8.3 890
albania ALB 1960 6.6 0.45 6.2 777
1961 6.7 0.46 6.3 778
1962 6.8 0.47 6.4 779
The latter is a frame with a MutliIndex, you can do reset_index to move the MultiIndex to regular columns.
Group your DataFrame by country and countrycode and then apply your own function:
In [13]: def f(df):
....: del df['country']
....: del df['countrycode']
....: df = df.set_index('varname')
....: df.index.name = None
....: df = df.T
....: df.index.name = 'year'
....: return df
....:
In [14]: df.groupby(['country', 'countrycode']).apply(f).reset_index()
Out[14]:
country countrycode year GDP Gini educ pop
0 afghanistan AFG 1960 5.6 0.77 8.1 888
1 afghanistan AFG 1961 5.7 0.78 8.2 889
2 afghanistan AFG 1962 5.8 0.75 8.3 890
3 albania ALB 1960 6.6 0.45 6.2 777
4 albania ALB 1961 6.7 0.46 6.3 778
5 albania ALB 1962 6.8 0.47 6.4 779
I'm suggesting that #Wouter may put this in his (accepted) answer, as it uses the actual names from the WDI data, and makes it more cut and paste for someone else using them. Sorry -- I'm sure this isn't the right way to communicate this...
For any variables that you want to keep/use, just give them a name in this dict:
WDIconversions={"Year":'year',
"YearCode":'',
"Country Name":'country_name_wb',
"Country Code":'countryCode_ISO3_WB',
"Inflation, consumer prices (annual %)":'',
"Inflation, GDP deflator (annual %)":'',
"GDP per capita, PPP (constant 2005 international $)":'GDPpc',
"Firms with female participation in ownership (% of firms)":'',
"Investment in energy with private participation (current US$)":'',
"Investment in telecoms with private participation (current US$)":'',
"Investment in transport with private participation (current US$)":'',
"Investment in water and sanitation with private participation (current US$)":'',
"Labor participation rate, female (% of female population ages 15+)":'',
"Labor participation rate, male (% of male population ages 15+)":'',
"Labor participation rate, total (% of total population ages 15+)":'',
"Ratio of female to male labor participation rate (%)":'',
"Life expectancy at birth, female (years)":'',
"Life expectancy at birth, male (years)":'',
"Life expectancy at birth, total (years)":'lifeExpectancy',
"Population, total":'nat_pop',
"GINI index":'GiniWB',
} # etc etc etc
dfW=pd.read_table(WBDrawfile)
df = dfW.set_index(['Country Name','Country Code','Indicator Name'])
del df['Indicator Code']
df.columns.name = 'year'
df=df.stack().unstack('Indicator Name')
df=df[[kk for kk,ii in WDIconversions.items() if ii and kk in df]].reset_index().rename(columns=WDIconversions)
That results in:
df
<class 'pandas.core.frame.DataFrame'>
Int64Index: 12983 entries, 0 to 12982
Data columns:
country_name_wb 12983 non-null values
countryCode_ISO3_WB 12983 non-null values
year 12983 non-null values
GiniWB 845 non-null values
nat_pop 12601 non-null values
GDPpc 6292 non-null values
educPrimary 4949 non-null values
lifeExpectancy 11077 non-null values
dtypes: float64(5), object(3)

Categories