How to replace the NaN values with the values from the first df:
country sex year cancer
0 Albania female 2000 32
1 Albania male 2000 58
2 Antigua female 2000 2
3 Antigua male 2000 5
4 Argen female 2000 591
5 Argen male 2000 2061
in the second df:
country year sex cancer
0 Albania 1985 female NaN
1 Albania 1985 male NaN
2 Albania 1986 female NaN
3 Albania 1986 male NaN
4 Albania 1987 female 25.0
5 Antigua 1992 male NaN
6 Antigua 1985 female NaN
the final should look like:
country year sex cancer
0 Albania 1985 female 32
1 Albania 1985 male 58
2 Albania 1986 female 32
3 Albania 1986 male 58
4 Albania 1987 female 25
5 Antigua 1992 male 5
6 Antigua 1985 female 2
The two key conditions are country and sex.
I ended up using fillna:
df2.set_index(['country','sex'],inplace=True)
df2['cancer']=df2['cancer'].fillna(df1.set_index(['country','sex']).cancer)
df2.reset_index(inplace=True)
df2
Out[745]:
country sex year cancer
0 Albania female 1985 32.0
1 Albania male 1985 58.0
2 Albania female 1986 32.0
3 Albania male 1986 58.0
4 Albania female 1987 25.0
5 Antigua male 1992 5.0
6 Antigua female 1985 2.0
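On the sample frames above, the whole approach runs end-to-end like this (frames rebuilt from the question's tables):

```python
import pandas as pd
import numpy as np

# Baseline values, keyed by country and sex.
df1 = pd.DataFrame({
    'country': ['Albania', 'Albania', 'Antigua', 'Antigua', 'Argen', 'Argen'],
    'sex': ['female', 'male', 'female', 'male', 'female', 'male'],
    'year': [2000] * 6,
    'cancer': [32, 58, 2, 5, 591, 2061],
})
# Frame with gaps to fill.
df2 = pd.DataFrame({
    'country': ['Albania'] * 5 + ['Antigua'] * 2,
    'year': [1985, 1985, 1986, 1986, 1987, 1992, 1985],
    'sex': ['female', 'male', 'female', 'male', 'female', 'male', 'female'],
    'cancer': [np.nan, np.nan, np.nan, np.nan, 25.0, np.nan, np.nan],
})

# Align both frames on (country, sex), then fill only the NaN slots from df1.
df2 = df2.set_index(['country', 'sex'])
df2['cancer'] = df2['cancer'].fillna(df1.set_index(['country', 'sex'])['cancer'])
df2 = df2.reset_index()
print(df2)
```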
I'm trying to transpose a few columns while keeping the other columns. I'm having a hard time with pivot codes or transpose codes as it doesn't really give me the output I need.
Can anyone help?
I have this data frame:
EmpID  Goal  week 1  week 2  week 3  week 4
1      556   54      33      24      54
2      342   32      32      56      43
3      534   43      65      64      21
4      244   45      87      5       22
My expected dataframe output is:
EmpID  Goal  Weeks   Actual
1      556   week 1  54
1      556   week 2  33
1      556   week 3  24
1      556   week 4  54
and so on, until all employee IDs are listed.
Something like this.
# Python - melt DF
import pandas as pd
import numpy as np

d = {'Country Code': [1960, 1961, 1962, 1963, 1964, 1965, 1966],
     'ABW': [2.615300, 2.734390, 2.678430, 2.929920, 2.963250, 3.060540, 4.349760],
     'AFG': [0.249760, 0.218480, 0.210840, 0.217240, 0.211410, 0.209910, 0.671330],
     'ALB': [np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, 1.12214]}
df = pd.DataFrame(data=d)
print(df)
df1 = (df.melt(['Country Code'], var_name='Year', value_name='Econometric_Metric')
         .sort_values(['Country Code', 'Year'])
         .reset_index(drop=True))
print(df1)
df2 = (df.set_index(['Country Code'])
         .stack(dropna=False)
         .reset_index(name='Econometric_Metric')
         .rename(columns={'level_1': 'Year'}))
print(df2)
# BEFORE
ABW AFG ALB Country Code
0 2.61530 0.24976 NaN 1960
1 2.73439 0.21848 NaN 1961
2 2.67843 0.21084 NaN 1962
3 2.92992 0.21724 NaN 1963
4 2.96325 0.21141 NaN 1964
5 3.06054 0.20991 NaN 1965
6 4.34976 0.67133 1.12214 1966
# AFTER
Country Code Year Econometric_Metric
0 1960 ABW 2.6153
1 1960 AFG 0.24976
2 1960 ALB NaN
3 1961 ABW 2.73439
4 1961 AFG 0.21848
5 1961 ALB NaN
6 1962 ABW 2.67843
7 1962 AFG 0.21084
8 1962 ALB NaN
9 1963 ABW 2.92992
10 1963 AFG 0.21724
11 1963 ALB NaN
12 1964 ABW 2.96325
13 1964 AFG 0.21141
14 1964 ALB NaN
15 1965 ABW 3.06054
16 1965 AFG 0.20991
17 1965 ALB NaN
18 1966 ABW 4.34976
19 1966 AFG 0.67133
20 1966 ALB 1.12214
(df2 produces the same 21 rows as df1 above.)
Also, take a look at the link below for more info.
https://www.dataindependent.com/pandas/pandas-melt/
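Applied directly to the EmpID frame from the question (rebuilt below), the same melt pattern gives the requested Weeks/Actual layout:

```python
import pandas as pd

df = pd.DataFrame({
    'EmpID': [1, 2, 3, 4],
    'Goal': [556, 342, 534, 244],
    'week 1': [54, 32, 43, 45],
    'week 2': [33, 32, 65, 87],
    'week 3': [24, 56, 64, 5],
    'week 4': [54, 43, 21, 22],
})

# Keep EmpID and Goal fixed; turn the week columns into rows.
out = (df.melt(id_vars=['EmpID', 'Goal'], var_name='Weeks', value_name='Actual')
         .sort_values(['EmpID', 'Weeks'])
         .reset_index(drop=True))
print(out.head(4))
```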
Suppose I have the following pandas dataframe:
Date Region Country Cases Deaths Lat Long
2020-03-08 Northern Territory Australia 27 49 -12.4634 130.8456
2020-03-09 Northern Territory Australia 80 85 -12.4634 130.8456
2020-03-12 Northern Territory Australia 35 73 -12.4634 130.8456
2020-03-08 Western Australia Australia 48 20 -31.9505 115.8605
2020-03-09 Western Australia Australia 70 12 -31.9505 115.8605
2020-03-10 Western Australia Australia 66 95 -31.9505 115.8605
2020-03-11 Western Australia Australia 31 38 -31.9505 115.8605
2020-03-12 Western Australia Australia 40 83 -31.9505 115.8605
I need to update the dataframe with the missing dates for the Northern Territory, 2020-03-10 and 2020-03-11. However, I want to use all the information except for cases and deaths. Like this:
Date Region Country Cases Deaths Lat Long
2020-03-08 Northern Territory Australia 27 49 -12.4634 130.8456
2020-03-09 Northern Territory Australia 80 85 -12.4634 130.8456
2020-03-10 Northern Territory Australia 0 0 -12.4634 130.8456
2020-03-11 Northern Territory Australia 0 0 -12.4634 130.8456
2020-03-12 Northern Territory Australia 35 73 -12.4634 130.8456
2020-03-08 Western Australia Australia 48 20 -31.9505 115.8605
2020-03-09 Western Australia Australia 70 12 -31.9505 115.8605
2020-03-10 Western Australia Australia 66 95 -31.9505 115.8605
2020-03-11 Western Australia Australia 31 38 -31.9505 115.8605
2020-03-12 Western Australia Australia 40 83 -31.9505 115.8605
The only way I can think of doing this is to iterate through all combinations of dates and countries.
EDIT
Erfan seems to be on the right track, but I can't get it to work. Here is the actual data I'm working with instead of a toy example.
import pandas as pd

unique_group = ['province', 'country', 'county']
csbs_df = pd.read_csv(
    'https://jordansdatabucket.s3-us-west-2.amazonaws.com/covid19data/csbs_df.csv.gz',
    index_col=0)
csbs_df['Date'] = pd.to_datetime(csbs_df['Date'], infer_datetime_format=True)
new_df = (
    csbs_df.set_index('Date')
           .groupby(unique_group)
           .resample('D').first()
           .fillna(dict.fromkeys(['confirmed', 'deaths'], 0))
           .ffill()
           .reset_index(level=3)
           .reset_index(drop=True))
new_df.head()
Date id lat lon Timestamp province country_code country county confirmed deaths source Date_text
0 2020-03-25 1094.0 32.534893 -86.642709 2020-03-25 00:00:00+00:00 Alabama US US Autauga 1.0 0.0 CSBS 03/25/20
1 2020-03-26 901.0 32.534893 -86.642709 2020-03-26 00:00:00+00:00 Alabama US US Autauga 4.0 0.0 CSBS 03/26/20
2 2020-03-24 991.0 30.735891 -87.723525 2020-03-24 00:00:00+00:00 Alabama US US Baldwin 3.0 0.0 CSBS 03/24/20
3 2020-03-25 1080.0 30.735891 -87.723525 2020-03-25 00:00:00+00:00 Alabama US US Baldwin 4.0 0.0 CSBS 03/25/20
4 2020-03-26 1139.0 30.735891 -87.723525 2020-03-26 16:52:00+00:00 Alabama US US Baldwin 4.0 0.0 CSBS 03/26/20
You can see that it is not inserting the daily resample as specified. I'm not sure what's wrong.
Edit 2
Here is my solution based on Erfan.
import pandas as pd

csbs_df = pd.read_csv(
    'https://jordansdatabucket.s3-us-west-2.amazonaws.com/covid19data/csbs_df.csv.gz',
    index_col=0)
date_range = pd.date_range(csbs_df['Date'].min(), csbs_df['Date'].max(), freq='1D')
unique_group = ['country', 'province', 'county']
gb = csbs_df.groupby(unique_group)
sub_dfs = []
for g in gb.groups:
    sub_df = gb.get_group(g)
    sub_df = (
        sub_df.set_index('Date')
              .reindex(date_range)
              .fillna(dict.fromkeys(['confirmed', 'deaths'], 0))
              .bfill()
              .ffill()
              .reset_index()
              .rename({'index': 'Date'}, axis=1)
              .drop('id', axis=1))
    sub_df['Date_text'] = sub_df['Date'].dt.strftime('%m/%d/%y')
    sub_df['Timestamp'] = pd.to_datetime(sub_df['Date'], utc=True)
    sub_dfs.append(sub_df)
all_concat = pd.concat(sub_dfs)
assert (all_concat.groupby(['province', 'country', 'county']).count() == 3).all().all()
Using GroupBy.resample, ffill and fillna:
The idea here is that we want to "fill" the missing gaps of dates for each group of Region and Country. This is called resampling a time series.
That is why we use GroupBy.resample instead of DataFrame.resample here. Furthermore, fillna and ffill are needed to fill the data according to your logic.
df['Date'] = pd.to_datetime(df['Date'], infer_datetime_format=True)

dfn = (
    df.set_index('Date')
      .groupby(['Region', 'Country'])
      .resample('D').first()
      .fillna(dict.fromkeys(['Cases', 'Deaths'], 0))
      .ffill()
      .reset_index(level=2)
      .reset_index(drop=True)
)
Date Region Country Cases Deaths Lat Long
0 2020-03-08 Northern Territory Australia 27.0 49.0 -12.4634 130.8456
1 2020-03-09 Northern Territory Australia 80.0 85.0 -12.4634 130.8456
2 2020-03-10 Northern Territory Australia 0.0 0.0 -12.4634 130.8456
3 2020-03-11 Northern Territory Australia 0.0 0.0 -12.4634 130.8456
4 2020-03-12 Northern Territory Australia 35.0 73.0 -12.4634 130.8456
5 2020-03-08 Western Australia Australia 48.0 20.0 -31.9505 115.8605
6 2020-03-09 Western Australia Australia 70.0 12.0 -31.9505 115.8605
7 2020-03-10 Western Australia Australia 66.0 95.0 -31.9505 115.8605
8 2020-03-11 Western Australia Australia 31.0 38.0 -31.9505 115.8605
9 2020-03-12 Western Australia Australia 40.0 83.0 -31.9505 115.8605
Edit:
It seems that indeed not all places have the same start and end date, so we have to take that into account. The following works:
csbs_df = pd.read_csv(
    'https://jordansdatabucket.s3-us-west-2.amazonaws.com/covid19data/csbs_df.csv.gz'
).iloc[:, 1:]
csbs_df['Date_text'] = pd.to_datetime(csbs_df['Date_text'])
date_range = pd.date_range(csbs_df['Date_text'].min(), csbs_df['Date_text'].max(), freq='D')

def reindex_dates(data, dates):
    data = data.reindex(dates).fillna(dict.fromkeys(['Cases', 'Deaths'], 0)).ffill().bfill()
    return data

dfn = (
    csbs_df.set_index('Date_text')
           .groupby('id').apply(lambda x: reindex_dates(x, date_range))
           .reset_index(level=0, drop=True)
           .reset_index()
           .rename(columns={'index': 'Date'})
)
print(dfn.head())
Date id lat lon Timestamp \
0 2020-03-24 0.0 40.714550 -74.007140 2020-03-24 00:00:00+00:00
1 2020-03-25 0.0 40.714550 -74.007140 2020-03-25 00:00:00+00:00
2 2020-03-26 0.0 40.714550 -74.007140 2020-03-26 00:00:00+00:00
3 2020-03-24 1.0 41.163198 -73.756063 2020-03-24 00:00:00+00:00
4 2020-03-25 1.0 41.163198 -73.756063 2020-03-25 00:00:00+00:00
Date province country_code country county confirmed deaths \
0 2020-03-24 New York US US New York 13119.0 125.0
1 2020-03-25 New York US US New York 15597.0 192.0
2 2020-03-26 New York US US New York 20011.0 280.0
3 2020-03-24 New York US US Westchester 2894.0 0.0
4 2020-03-25 New York US US Westchester 3891.0 1.0
source
0 CSBS
1 CSBS
2 CSBS
3 CSBS
4 CSBS
I need to reshape a csv pivot table. A small extract looks like:
country location confirmedcases_10-02-2020 deaths_10-02-2020 confirmedcases_11-02-2020 deaths_11-02-2020
0 Australia New South Wales 4.0 0.0 4 0.0
1 Australia Victoria 4.0 0.0 4 0.0
2 Australia Queensland 5.0 0.0 5 0.0
3 Australia South Australia 2.0 0.0 2 0.0
4 Cambodia Sihanoukville 1.0 0.0 1 0.0
5 Canada Ontario 3.0 0.0 3 0.0
6 Canada British Columbia 4.0 0.0 4 0.0
7 China Hubei 31728.0 974.0 33366 1068.0
8 China Zhejiang 1177.0 0.0 1131 0.0
9 China Guangdong 1177.0 1.0 1219 1.0
10 China Henan 1105.0 7.0 1135 8.0
11 China Hunan 912.0 1.0 946 2.0
12 China Anhui 860.0 4.0 889 4.0
13 China Jiangxi 804.0 1.0 844 1.0
14 China Chongqing 486.0 2.0 505 3.0
15 China Sichuan 417.0 1.0 436 1.0
16 China Shandong 486.0 1.0 497 1.0
17 China Jiangsu 515.0 0.0 543 0.0
18 China Shanghai 302.0 1.0 311 1.0
19 China Beijing 342.0 3.0 352 3.0
into something like the following. Is there a ready-to-use pandas tool to achieve this?
country location date confirmedcases deaths
0 Australia New South Wales 2020-02-10 4.0 0.0
1 Australia Victoria 2020-02-10 4.0 0.0
2 Australia Queensland 2020-02-10 5.0 0.0
3 Australia South Australia 2020-02-10 2.0 0.0
4 Cambodia Sihanoukville 2020-02-10 1.0 0.0
5 Canada Ontario 2020-02-10 3.0 0.0
6 Canada British Columbia 2020-02-10 4.0 0.0
7 China Hubei 2020-02-10 31728.0 974.0
8 China Zhejiang 2020-02-10 1177.0 0.0
9 China Guangdong 2020-02-10 1177.0 1.0
10 China Henan 2020-02-10 1105.0 7.0
11 China Hunan 2020-02-10 912.0 1.0
12 China Anhui 2020-02-10 860.0 4.0
13 China Jiangxi 2020-02-10 804.0 1.0
14 China Chongqing 2020-02-10 486.0 2.0
15 China Sichuan 2020-02-10 417.0 1.0
16 China Shandong 2020-02-10 486.0 1.0
17 China Jiangsu 2020-02-10 515.0 0.0
18 China Shanghai 2020-02-10 302.0 1.0
19 China Beijing 2020-02-10 342.0 3.0
20 Australia New South Wales 2020-02-11 4.0 0.0
21 Australia Victoria 2020-02-11 4.0 0.0
22 Australia Queensland 2020-02-11 5.0 0.0
23 Australia South Australia 2020-02-11 2.0 0.0
24 Cambodia Sihanoukville 2020-02-11 1.0 0.0
25 Canada Ontario 2020-02-11 3.0 0.0
26 Canada British Columbia 2020-02-11 4.0 0.0
27 China Hubei 2020-02-11 33366.0 1068.0
28 China Zhejiang 2020-02-11 1131.0 0.0
29 China Guangdong 2020-02-11 1219.0 1.0
30 China Henan 2020-02-11 1135.0 8.0
31 China Hunan 2020-02-11 946.0 2.0
32 China Anhui 2020-02-11 889.0 4.0
33 China Jiangxi 2020-02-11 844.0 1.0
34 China Chongqing 2020-02-11 505.0 3.0
35 China Sichuan 2020-02-11 436.0 1.0
36 China Shandong 2020-02-11 497.0 1.0
37 China Jiangsu 2020-02-11 543.0 0.0
38 China Shanghai 2020-02-11 311.0 1.0
39 China Beijing 2020-02-11 352.0 3.0
Use pd.wide_to_long:
print(pd.wide_to_long(df, stubnames=["confirmedcases", "deaths"],
                      i=["country", "location"], j="date", sep="_",
                      suffix=r'\d{2}-\d{2}-\d{4}').reset_index())
country location date confirmedcases deaths
0 Australia New South Wales 10-02-2020 4.0 0.0
1 Australia New South Wales 11-02-2020 4.0 0.0
2 Australia Victoria 10-02-2020 4.0 0.0
3 Australia Victoria 11-02-2020 4.0 0.0
4 Australia Queensland 10-02-2020 5.0 0.0
5 Australia Queensland 11-02-2020 5.0 0.0
6 Australia South Australia 10-02-2020 2.0 0.0
7 Australia South Australia 11-02-2020 2.0 0.0
8 Cambodia Sihanoukville 10-02-2020 1.0 0.0
9 Cambodia Sihanoukville 11-02-2020 1.0 0.0
10 Canada Ontario 10-02-2020 3.0 0.0
11 Canada Ontario 11-02-2020 3.0 0.0
12 Canada British Columbia 10-02-2020 4.0 0.0
13 Canada British Columbia 11-02-2020 4.0 0.0
14 China Hubei 10-02-2020 31728.0 974.0
15 China Hubei 11-02-2020 33366.0 1068.0
16 China Zhejiang 10-02-2020 1177.0 0.0
17 China Zhejiang 11-02-2020 1131.0 0.0
18 China Guangdong 10-02-2020 1177.0 1.0
19 China Guangdong 11-02-2020 1219.0 1.0
20 China Henan 10-02-2020 1105.0 7.0
21 China Henan 11-02-2020 1135.0 8.0
22 China Hunan 10-02-2020 912.0 1.0
23 China Hunan 11-02-2020 946.0 2.0
24 China Anhui 10-02-2020 860.0 4.0
25 China Anhui 11-02-2020 889.0 4.0
26 China Jiangxi 10-02-2020 804.0 1.0
27 China Jiangxi 11-02-2020 844.0 1.0
28 China Chongqing 10-02-2020 486.0 2.0
29 China Chongqing 11-02-2020 505.0 3.0
30 China Sichuan 10-02-2020 417.0 1.0
31 China Sichuan 11-02-2020 436.0 1.0
32 China Shandong 10-02-2020 486.0 1.0
33 China Shandong 11-02-2020 497.0 1.0
34 China Jiangsu 10-02-2020 515.0 0.0
35 China Jiangsu 11-02-2020 543.0 0.0
36 China Shanghai 10-02-2020 302.0 1.0
37 China Shanghai 11-02-2020 311.0 1.0
38 China Beijing 10-02-2020 342.0 3.0
39 China Beijing 11-02-2020 352.0 3.0
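One follow-up worth noting: the date level comes out as the raw dd-mm-yyyy suffix string, so to match the datetime-style output in the question it can be parsed afterwards. A sketch on a two-row cut of the data:

```python
import pandas as pd

df = pd.DataFrame({
    'country': ['Australia', 'China'],
    'location': ['Victoria', 'Hubei'],
    'confirmedcases_10-02-2020': [4.0, 31728.0],
    'deaths_10-02-2020': [0.0, 974.0],
    'confirmedcases_11-02-2020': [4.0, 33366.0],
    'deaths_11-02-2020': [0.0, 1068.0],
})

out = (pd.wide_to_long(df, stubnames=['confirmedcases', 'deaths'],
                       i=['country', 'location'], j='date', sep='_',
                       suffix=r'\d{2}-\d{2}-\d{4}')
         .reset_index())
# The suffix is day-month-year, so parse it explicitly.
out['date'] = pd.to_datetime(out['date'], format='%d-%m-%Y')
print(out.sort_values(['country', 'date']))
```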
Yes, and you can achieve it by reshaping the dataframe.
First you have to melt the columns to have them as values:
df = df.melt(['country', 'location'],
             [p for p in df.columns if p not in ['country', 'location']],
             'key',
             'value')
#> country location key value
#> 0 Australia New South Wales confirmedcases_10-02-2020 4
#> 1 Australia Victoria confirmedcases_10-02-2020 4
#> 2 Australia Queensland confirmedcases_10-02-2020 5
#> 3 Australia South Australia confirmedcases_10-02-2020 2
#> 4 Cambodia Sihanoukville confirmedcases_10-02-2020 1
#> .. ... ... ... ...
#> 75 China Sichuan deaths_11-02-2020 1
#> 76 China Shandong deaths_11-02-2020 1
#> 77 China Jiangsu deaths_11-02-2020 0
#> 78 China Shanghai deaths_11-02-2020 1
#> 79 China Beijing deaths_11-02-2020 3
After that you need to separate the values in the column key:
key_split_series = df.key.str.split("_", expand=True)
df["key"] = key_split_series[0]
df["date"] = key_split_series[1]
#> country location key value date
#> 0 Australia New South Wales confirmedcases 4 10-02-2020
#> 1 Australia Victoria confirmedcases 4 10-02-2020
#> 2 Australia Queensland confirmedcases 5 10-02-2020
#> 3 Australia South Australia confirmedcases 2 10-02-2020
#> 4 Cambodia Sihanoukville confirmedcases 1 10-02-2020
#> .. ... ... ... ... ...
#> 75 China Sichuan deaths 1 11-02-2020
#> 76 China Shandong deaths 1 11-02-2020
#> 77 China Jiangsu deaths 0 11-02-2020
#> 78 China Shanghai deaths 1 11-02-2020
#> 79 China Beijing deaths 3 11-02-2020
In the end, you just need to pivot the table to have confirmedcases and deaths back as columns:
df = df.set_index(["country", "location", "date", "key"])["value"].unstack().reset_index()
#> key country location date confirmedcases deaths
#> 0 Australia New South Wales 10-02-2020 4 0
#> 1 Australia New South Wales 11-02-2020 4 0
#> 2 Australia Queensland 10-02-2020 5 0
#> 3 Australia Queensland 11-02-2020 5 0
#> 4 Australia South Australia 10-02-2020 2 0
#> .. ... ... ... ... ...
#> 35 China Shanghai 11-02-2020 311 1
#> 36 China Sichuan 10-02-2020 417 1
#> 37 China Sichuan 11-02-2020 436 1
#> 38 China Zhejiang 10-02-2020 1177 0
#> 39 China Zhejiang 11-02-2020 1131 0
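For completeness, the three steps can also be written as one chain; a minimal sketch on a two-row frame with the same column layout:

```python
import pandas as pd

df = pd.DataFrame({
    'country': ['Australia', 'Canada'],
    'location': ['Victoria', 'Ontario'],
    'confirmedcases_10-02-2020': [4, 3],
    'deaths_10-02-2020': [0, 0],
})

# Melt, split the key on "_", then pivot the metric names back into columns.
long = df.melt(['country', 'location'], var_name='key')
long[['key', 'date']] = long['key'].str.split('_', expand=True)
out = (long.set_index(['country', 'location', 'date', 'key'])['value']
           .unstack()
           .reset_index())
print(out)
```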
I am new to coding and I'm having an issue merging csv files. I have searched similar questions and haven't found a fix. Just to include some relevant details:
CSV files are cancer types over the period of 1950 - 2017 for different countries (lung cancer, colorectal cancer, stomach cancer, liver cancer and breast cancer)
Below is an example of the layout of lung cancer.
dlung.describe(include='all')
dlung
Year Cancer Country Gender ASR SE
0 1950 Lung Australia Male 13.89 0.56
1 1951 Lung Australia Male 14.84 0.57
2 1952 Lung Australia Male 17.19 0.61
3 1953 Lung Australia Male 18.21 0.62
4 1954 Lung Australia Male 19.05 0.63
5 1955 Lung Australia Male 20.65 0.65
6 1956 Lung Australia Male 22.05 0.67
7 1957 Lung Australia Male 23.93 0.69
8 1958 Lung Australia Male 23.77 0.68
9 1959 Lung Australia Male 26.12 0.71
10 1960 Lung Australia Male 27.08 0.72
I am interested in joining all cancer types into one dataframe based on shared columns (year, country).
I have tried different methods, but they all seem to duplicate Year and Country (as below)
This one wasn't bad, but I have two columns for year and country
df_lung_colorectal = pd.concat([dlung, dcolorectal], axis = 1)
df_lung_colorectal
Year Cancer Country Gender ASR SE Year Cancer Country Gender ASR SE
If I continue like this, I will end up with 5 identical columns for YEAR and 5 for COUNTRY.
Any ideas on how to merge all values that are independent (cancer type and the associated ASR (standardized risk) and SE values) with only one column for YEAR, COUNTRY (and GENDER) if possible?
Yes, it is possible if you use DataFrame.set_index, but then the other column names are duplicated:
print (dlung)
Year Cancer Country Gender ASR SE
0 1950 Lung Australia Male 13.89 0.56
1 1951 Lung Australia Male 14.84 0.57
2 1952 Lung Australia Male 17.19 0.61
3 1953 Lung Australia Male 18.21 0.62
4 1954 Lung Australia Male 19.05 0.63
print (dcolorectal)
Year Cancer Country Gender ASR SE
6 1950 colorectal Australia Male 22.05 0.67
7 1951 colorectal Australia Male 23.93 0.69
8 1952 colorectal Australia Male 23.77 0.68
9 1953 colorectal Australia Male 26.12 0.71
10 1954 colorectal Australia Male 27.08 0.72
df_lung_colorectal = pd.concat([dlung.set_index(['Year','Country','Gender']),
dcolorectal.set_index(['Year','Country','Gender'])], axis = 1)
print (df_lung_colorectal)
Cancer ASR SE Cancer ASR SE
Year Country Gender
1950 Australia Male Lung 13.89 0.56 colorectal 22.05 0.67
1951 Australia Male Lung 14.84 0.57 colorectal 23.93 0.69
1952 Australia Male Lung 17.19 0.61 colorectal 23.77 0.68
1953 Australia Male Lung 18.21 0.62 colorectal 26.12 0.71
1954 Australia Male Lung 19.05 0.63 colorectal 27.08 0.72
But I think it is better to first concat all the DataFrames together along axis=0 (the default, so it can be omitted) and then reshape with DataFrame.set_index and DataFrame.unstack:
df = pd.concat([dlung, dcolorectal]).set_index(['Year','Country','Gender','Cancer']).unstack()
df.columns = df.columns.map('_'.join)
df = df.reset_index()
print (df)
Year Country Gender ASR_Lung ASR_colorectal SE_Lung SE_colorectal
0 1950 Australia Male 13.89 22.05 0.56 0.67
1 1951 Australia Male 14.84 23.93 0.57 0.69
2 1952 Australia Male 17.19 23.77 0.61 0.68
3 1953 Australia Male 18.21 26.12 0.62 0.71
4 1954 Australia Male 19.05 27.08 0.63 0.72
Concat with axis=0 merges them row-wise.
With axis=1 you are asking it to concat side-by-side.
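A minimal sketch of the difference on two toy frames:

```python
import pandas as pd

a = pd.DataFrame({'Year': [1950, 1951], 'ASR': [13.89, 14.84]})
b = pd.DataFrame({'Year': [1950, 1951], 'ASR': [22.05, 23.93]})

stacked = pd.concat([a, b])        # axis=0 (default): rows appended -> 4 x 2
side = pd.concat([a, b], axis=1)   # axis=1: columns appended -> 2 x 4
print(stacked.shape, side.shape)
```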
I have a dataset that has been merged together to fill missing values from one another.
The problem is that I have some columns with missing data that I want to now fill with the values that aren't missing.
The merged data set looks like this for an input:
Name State ID Number_x Number_y Op_x Op_y
Johnson AL 1 1 nan 1956 nan
Johnson AL 1 nan nan 1956 nan
Johnson AL 2 1 nan 1999 nan
Johnson AL 2 0 nan 1999 nan
Debra AK 1A 0 nan 2000 nan
Debra AK 1B nan 20 nan 1997
Debra AK 2 nan 10 nan 2009
Debra AK 3 nan 1 nan 2008
.
.
What I'd want for an output is this:
Name State ID Number_x Number_y Op_x Op_y
Johnson AL 1 1 1 1956 1956
Johnson AL 2 1 1 1999 1999
Johnson AL 2 0 0 1999 1999
Debra AK 1A 0 0 2000 2000
Debra AK 1B 20 20 1997 1997
Debra AK 2 10 10 2009 2009
Debra AK 3 1 1 2008 2008
.
.
So I want it so that all nan values are replaced by the associated values in their columns - match Number_x to Number_y and Op_x to Op_y.
One thing to note is that when there are two IDs that are the same sometimes their values will be different; like Johnson with ID = 2 which has different numbers but the same op values. I want to keep these because I need to investigate them more.
Also, if the row has two missing values for Number_x and Number_y I want to take that row out - like Johnson with Number_x and Number_y missing as a nan value.
Let us do groupby with axis=1:
df.groupby(df.columns.str.split('_').str[0], axis=1).first().dropna(subset=['Number', 'Op'])
ID Name Number Op State
0 1 Johnson 1.0 1956.0 AL
2 2 Johnson 1.0 1999.0 AL
3 2 Johnson 0.0 1999.0 AL
4 1A Debra 0.0 2000.0 AK
5 1B Debra 20.0 1997.0 AK
6 2 Debra 10.0 2009.0 AK
7 3 Debra 1.0 2008.0 AK
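Note that groupby with axis=1 is deprecated in recent pandas releases; the same coalescing can be done with a plain fillna per _x/_y pair. A sketch on a cut-down version of the question's frame:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'Name': ['Johnson', 'Johnson', 'Debra'],
    'State': ['AL', 'AL', 'AK'],
    'ID': ['1', '1', '1B'],
    'Number_x': [1.0, np.nan, np.nan],
    'Number_y': [np.nan, np.nan, 20.0],
    'Op_x': [1956.0, 1956.0, np.nan],
    'Op_y': [np.nan, np.nan, 1997.0],
})

# Coalesce each _x/_y pair: take _x where present, otherwise fall back to _y.
for col in ['Number', 'Op']:
    df[col] = df[f'{col}_x'].fillna(df[f'{col}_y'])

# Drop the suffixed columns and the rows where both halves were missing.
out = (df.drop(columns=[c for c in df.columns if c.endswith(('_x', '_y'))])
         .dropna(subset=['Number', 'Op'])
         .reset_index(drop=True))
print(out)
```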