Pandas: Fill missing dates with information from other rows - python

Suppose I have the following pandas dataframe:
Date Region Country Cases Deaths Lat Long
2020-03-08 Northern Territory Australia 27 49 -12.4634 130.8456
2020-03-09 Northern Territory Australia 80 85 -12.4634 130.8456
2020-03-12 Northern Territory Australia 35 73 -12.4634 130.8456
2020-03-08 Western Australia Australia 48 20 -31.9505 115.8605
2020-03-09 Western Australia Australia 70 12 -31.9505 115.8605
2020-03-10 Western Australia Australia 66 95 -31.9505 115.8605
2020-03-11 Western Australia Australia 31 38 -31.9505 115.8605
2020-03-12 Western Australia Australia 40 83 -31.9505 115.8605
I need to update the dataframe with the missing dates for the Northern Territory, 2020-03-10 and 2020-03-11. However, I want to use all the information except for cases and deaths. Like this:
Date Region Country Cases Deaths Lat Long
2020-03-08 Northern Territory Australia 27 49 -12.4634 130.8456
2020-03-09 Northern Territory Australia 80 85 -12.4634 130.8456
2020-03-10 Northern Territory Australia 0 0 -12.4634 130.8456
2020-03-11 Northern Territory Australia 0 0 -12.4634 130.8456
2020-03-12 Northern Territory Australia 35 73 -12.4634 130.8456
2020-03-08 Western Australia Australia 48 20 -31.9505 115.8605
2020-03-09 Western Australia Australia 70 12 -31.9505 115.8605
2020-03-10 Western Australia Australia 66 95 -31.9505 115.8605
2020-03-11 Western Australia Australia 31 38 -31.9505 115.8605
2020-03-12 Western Australia Australia 40 83 -31.9505 115.8605
The only way I can think of doing this is to iterate through all combinations of dates and countries.
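For reference, here is a minimal construction of the toy frame above (a sketch; values copied straight from the tables), in case anyone wants to reproduce this:
import pandas as pd

df = pd.DataFrame({
    'Date': ['2020-03-08', '2020-03-09', '2020-03-12',
             '2020-03-08', '2020-03-09', '2020-03-10', '2020-03-11', '2020-03-12'],
    'Region': ['Northern Territory'] * 3 + ['Western Australia'] * 5,
    'Country': ['Australia'] * 8,
    'Cases': [27, 80, 35, 48, 70, 66, 31, 40],
    'Deaths': [49, 85, 73, 20, 12, 95, 38, 83],
    'Lat': [-12.4634] * 3 + [-31.9505] * 5,
    'Long': [130.8456] * 3 + [115.8605] * 5,
})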
EDIT
Erfan seems to be on the right track, but I can't get it to work. Here is the actual data I'm working with instead of a toy example.
import pandas as pd

unique_group = ['province', 'country', 'county']
csbs_df = pd.read_csv(
    'https://jordansdatabucket.s3-us-west-2.amazonaws.com/covid19data/csbs_df.csv.gz',
    index_col=0)
csbs_df['Date'] = pd.to_datetime(csbs_df['Date'], infer_datetime_format=True)
new_df = (
    csbs_df.set_index('Date')
    .groupby(unique_group)
    .resample('D').first()
    .fillna(dict.fromkeys(['confirmed', 'deaths'], 0))
    .ffill()
    .reset_index(level=3)
    .reset_index(drop=True))
new_df.head()
Date id lat lon Timestamp province country_code country county confirmed deaths source Date_text
0 2020-03-25 1094.0 32.534893 -86.642709 2020-03-25 00:00:00+00:00 Alabama US US Autauga 1.0 0.0 CSBS 03/25/20
1 2020-03-26 901.0 32.534893 -86.642709 2020-03-26 00:00:00+00:00 Alabama US US Autauga 4.0 0.0 CSBS 03/26/20
2 2020-03-24 991.0 30.735891 -87.723525 2020-03-24 00:00:00+00:00 Alabama US US Baldwin 3.0 0.0 CSBS 03/24/20
3 2020-03-25 1080.0 30.735891 -87.723525 2020-03-25 00:00:00+00:00 Alabama US US Baldwin 4.0 0.0 CSBS 03/25/20
4 2020-03-26 1139.0 30.735891 -87.723525 2020-03-26 16:52:00+00:00 Alabama US US Baldwin 4.0 0.0 CSBS 03/26/20
You can see that it is not inserting the missing days that the daily resample should produce. I'm not sure what's wrong.
Edit 2
Here is my solution, based on Erfan's answer.
import pandas as pd

csbs_df = pd.read_csv(
    'https://jordansdatabucket.s3-us-west-2.amazonaws.com/covid19data/csbs_df.csv.gz',
    index_col=0)
# Parse dates first so the reindex against the DatetimeIndex below actually matches.
csbs_df['Date'] = pd.to_datetime(csbs_df['Date'])
date_range = pd.date_range(csbs_df['Date'].min(), csbs_df['Date'].max(), freq='1D')
unique_group = ['country', 'province', 'county']
gb = csbs_df.groupby(unique_group)
sub_dfs = []
for g in gb.groups:
    sub_df = gb.get_group(g)
    sub_df = (
        sub_df.set_index('Date')
        .reindex(date_range)                                # insert the missing dates
        .fillna(dict.fromkeys(['confirmed', 'deaths'], 0))  # zero out the count columns
        .bfill()                                            # fill the static columns backwards...
        .ffill()                                            # ...and forwards
        .reset_index()
        .rename({'index': 'Date'}, axis=1)
        .drop(columns=['id']))
    sub_df['Date_text'] = sub_df['Date'].dt.strftime('%m/%d/%y')
    sub_df['Timestamp'] = pd.to_datetime(sub_df['Date'], utc=True)
    sub_dfs.append(sub_df)
all_concat = pd.concat(sub_dfs)
# Every group should now cover all 3 dates in the range.
assert (all_concat.groupby(['province', 'country', 'county']).count() == 3).all().all()

Using GroupBy.resample, ffill and fillna:
The idea here is that we want to fill the missing gaps of dates for each group of Region and Country; this is called resampling of time series.
That's why we use GroupBy.resample instead of DataFrame.resample here. Furthermore, fillna and ffill are needed to fill the data according to your logic: fillna zeroes out Cases and Deaths on the inserted dates, while ffill carries the static columns (Country, Lat, Long) forward.
df['Date'] = pd.to_datetime(df['Date'], infer_datetime_format=True)

dfn = (
    df.set_index('Date')
    .groupby(['Region', 'Country'])
    .resample('D').first()
    .fillna(dict.fromkeys(['Cases', 'Deaths'], 0))
    .ffill()
    .reset_index(level=2)
    .reset_index(drop=True)
)
Date Region Country Cases Deaths Lat Long
0 2020-03-08 Northern Territory Australia 27.0 49.0 -12.4634 130.8456
1 2020-03-09 Northern Territory Australia 80.0 85.0 -12.4634 130.8456
2 2020-03-10 Northern Territory Australia 0.0 0.0 -12.4634 130.8456
3 2020-03-11 Northern Territory Australia 0.0 0.0 -12.4634 130.8456
4 2020-03-12 Northern Territory Australia 35.0 73.0 -12.4634 130.8456
5 2020-03-08 Western Australia Australia 48.0 20.0 -31.9505 115.8605
6 2020-03-09 Western Australia Australia 70.0 12.0 -31.9505 115.8605
7 2020-03-10 Western Australia Australia 66.0 95.0 -31.9505 115.8605
8 2020-03-11 Western Australia Australia 31.0 38.0 -31.9505 115.8605
9 2020-03-12 Western Australia Australia 40.0 83.0 -31.9505 115.8605
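Note that the filling introduces floats (NaN forces the count columns to float). If you want integer counts back, a final cast fixes it, assuming the column names above:
# restore integer dtype on the count columns after the NaN handling
dfn[['Cases', 'Deaths']] = dfn[['Cases', 'Deaths']].astype(int)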
Edit:
It seems indeed that not all places have the same start and end date, so we have to take that into account. The following works:
csbs_df = pd.read_csv(
    'https://jordansdatabucket.s3-us-west-2.amazonaws.com/covid19data/csbs_df.csv.gz'
).iloc[:, 1:]
csbs_df['Date_text'] = pd.to_datetime(csbs_df['Date_text'])
date_range = pd.date_range(csbs_df['Date_text'].min(), csbs_df['Date_text'].max(), freq='D')

def reindex_dates(data, dates):
    # Note: the count columns in this dataset are lowercase 'confirmed' and 'deaths',
    # so those are the keys that need the zero-fill.
    data = data.reindex(dates).fillna(dict.fromkeys(['confirmed', 'deaths'], 0)).ffill().bfill()
    return data

dfn = (
    csbs_df.set_index('Date_text')
    .groupby('id').apply(lambda x: reindex_dates(x, date_range))
    .reset_index(level=0, drop=True)
    .reset_index()
    .rename(columns={'index': 'Date'})
)
print(dfn.head())
Date id lat lon Timestamp \
0 2020-03-24 0.0 40.714550 -74.007140 2020-03-24 00:00:00+00:00
1 2020-03-25 0.0 40.714550 -74.007140 2020-03-25 00:00:00+00:00
2 2020-03-26 0.0 40.714550 -74.007140 2020-03-26 00:00:00+00:00
3 2020-03-24 1.0 41.163198 -73.756063 2020-03-24 00:00:00+00:00
4 2020-03-25 1.0 41.163198 -73.756063 2020-03-25 00:00:00+00:00
Date province country_code country county confirmed deaths \
0 2020-03-24 New York US US New York 13119.0 125.0
1 2020-03-25 New York US US New York 15597.0 192.0
2 2020-03-26 New York US US New York 20011.0 280.0
3 2020-03-24 New York US US Westchester 2894.0 0.0
4 2020-03-25 New York US US Westchester 3891.0 1.0
source
0 CSBS
1 CSBS
2 CSBS
3 CSBS
4 CSBS

Related

How to merge two datasets with different time ranges?

I have two datasets that look like this:
df1:
Date     City     State  Quantity
2019-01  Chicago  IL     35
2019-01  Orlando  FL     322
...      ...      ...    ...
2021-07  Chicago  IL     334
2021-07  Orlando  FL     4332
df2:
Date     City     State  Sales
2020-03  Chicago  IL     30
2020-03  Orlando  FL     319
...      ...      ...    ...
2021-07  Chicago  IL     331
2021-07  Orlando  FL     4000
My date is in period[M] format for both datasets. I have tried df1.join(df2, how='outer') and df2.join(df1, how='outer'), but they don't line up correctly: in 2019-01, for example, I end up with the sales from 2020-03. I have not been able to use merge() because I would have to merge on the combination of City, State and Date. How can I join these two datasets so that my output is as follows:
Date     City     State  Quantity  Sales
2019-01  Chicago  IL     35        NaN
2019-01  Orlando  FL     322       NaN
...      ...      ...    ...       ...
2021-07  Chicago  IL     334       331
2021-07  Orlando  FL     4332      4000
You can outer-merge. By not specifying the columns to merge on, you merge on the intersection of the columns in both DataFrames (in this case, Date, City and State).
out = df1.merge(df2, how='outer').sort_values(by='Date')
Output:
Date City State Quantity Sales
0 2019-01 Chicago IL 35.0 NaN
1 2019-01 Orlando FL 322.0 NaN
4 2020-03 Chicago IL NaN 30.0
5 2020-03 Orlando FL NaN 319.0
2 2021-07 Chicago IL 334.0 331.0
3 2021-07 Orlando FL 4332.0 4000.0
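Equivalently, you can name the join keys explicitly; this sketch behaves the same here, but keeps working if the frames ever gain additional shared columns:
# merge on the three key columns explicitly instead of relying on the intersection
out = df1.merge(df2, on=['Date', 'City', 'State'], how='outer').sort_values(by='Date')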

How to resample the data by month and plot monthly percentages?

I have a dataframe listing the matches played by a team in a year; Match Date is one of its columns.
Team 1 Team 2 Winner Match Date
5 Australia England England 2018-01-14
12 Australia England England 2018-01-19
14 Australia England England 2018-01-21
20 Australia England Australia 2018-01-26
22 Australia England England 2018-01-28
34 New Zealand England New Zealand 2018-02-25
35 New Zealand England England 2018-02-28
36 New Zealand England England 2018-03-03
43 New Zealand England New Zealand 2018-03-07
46 New Zealand England England 2018-03-10
62 Scotland England Scotland 2018-06-10
63 England Australia England 2018-06-13
64 England Australia England 2018-06-16
65 England Australia England 2018-06-19
66 England Australia England 2018-06-21
67 England Australia England 2018-06-24
68 England India India 2018-07-12
70 England India England 2018-07-14
72 England India England 2018-07-17
106 Sri Lanka England no result 2018-10-10
107 Sri Lanka England England 2018-10-13
108 Sri Lanka England England 2018-10-17
109 Sri Lanka England England 2018-10-20
112 Sri Lanka England Sri Lanka 2018-10-23
Match Date is already datetime. I could plot the number of matches played versus matches won; this is the code I used.
England.set_index('Match Date', inplace = True)
England.resample('1M').count()['Winner'].plot()
England_win.resample('1M').count()['Winner'].plot()
But I would like to plot the winning percentage by month. Please help.
Thank you
I am sure there are more efficient ways to do this, but here is one way to plot it, using an approach similar to yours:
import matplotlib.pyplot as plt
import pandas as pd

# read the sample data (columns separated by two or more spaces)
df = pd.read_csv("test.txt", sep=r"\s{2,}", parse_dates=["Match Date"],
                 index_col="ID", engine="python")
df.set_index('Match Date', inplace=True)

# create a df that counts the wins per month
df1 = df[df["Winner"] == "England"].resample("1M").count()

# calculate and plot the percentage - months with no games give NaN, substituted with zero
df1.Winner.div(df.resample('1M').count()['Winner']).mul(100).fillna(0).plot()
plt.tight_layout()
plt.show()
Sample output: a line plot of England's monthly win percentage.
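A slightly shorter variant (a sketch assuming the same df with Match Date as the index): the monthly mean of a boolean win flag is exactly the win fraction, with 'no result' games counting as non-wins, just as in the ratio above.
# True/False wins averaged per month give the win fraction directly
win_pct = (df["Winner"] == "England").resample("1M").mean().mul(100).fillna(0)
win_pct.plot()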

Column names after a groupby in Python

My table is as below:
datetime source Day area Town County Country
0 2019-01-01 16:22:46 1273 Tuesday Brighton Brighton East Sussex England
1 2019-01-02 09:33:29 1823 Wednesday Taunton Taunton Somerset England
2 2019-01-02 09:44:46 1977 Wednesday Pontefract Pontefract West Yorkshire England
3 2019-01-02 10:01:42 1983 Wednesday Isle of Wight NaN NaN NaN
4 2019-01-02 12:03:13 1304 Wednesday Dover Dover Kent England
My codes are
counts_by_counties = call_by_counties.groupby(['County','Town']).count()
counts_by_counties.head()
My grouped result (did the column names disappear?):
datetime source Day area Country
County Town
Aberdeenshire Aberdeen 8 8 8 8 8
Banchory 1 1 1 1 1
Blackburn 18 18 18 18 18
Ellon 6 6 6 6 6
Fraserburgh 2 2 2 2 2
I used this code to rename the column; I am wondering if there is a more efficient way to change the column name.
# slice the table down to the one column
counts_by_counties = counts_by_counties[['datetime']]
# rename datetime to Counts (assign back, since rename returns a copy)
counts_by_counties = counts_by_counties.rename(columns={'datetime': 'Counts'})
Expected result
Counts
County Town
Aberdeenshire Aberdeen 8
Banchory 1
Blackburn 18
Call reset_index as below.
Replace
counts_by_counties = call_by_counties.groupby(['County','Town']).count()
with
counts_by_counties = call_by_counties.groupby(['County','Town']).count().reset_index()
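Alternatively, GroupBy.size produces the single count column in one step, so there is nothing to slice or rename afterwards (a sketch using the same frame):
counts_by_counties = (
    call_by_counties.groupby(['County', 'Town'])
    .size()                      # one count per (County, Town) pair
    .reset_index(name='Counts')  # flatten and name the column in one go
)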

Set "Year" column to individual columns to create a panel

I am trying to reshape the following dataframe such that it is in panel data form by moving the "Year" column such that each year is an individual column.
Out[34]:
Award Year 0
State
Alabama 2003 89
Alabama 2004 92
Alabama 2005 108
Alabama 2006 81
Alabama 2007 71
... ...
Wyoming 2011 4
Wyoming 2012 2
Wyoming 2013 1
Wyoming 2014 4
Wyoming 2015 3
[648 rows x 2 columns]
I want the years to each be individual columns, this is an example,
Out[48]:
State 2003 2004 2005 2006
0 NewYork 10 10 10 10
1 Alabama 15 15 15 15
2 Washington 20 20 20 20
I have read up on stack/unstack, but I don't think I want a multi-level index as a result. I have been looking through the documentation for to_frame etc., but I can't see what I am looking for.
If anyone can help that would be great!
Use set_index with append=True, then select column '0' and use unstack to reshape:
df = df.set_index('Award Year', append=True)['0'].unstack()
Result:
Award Year 2003 2004 2005 2006 2007 2011 2012 2013 2014 2015
State
Alabama 89.0 92.0 108.0 81.0 71.0 NaN NaN NaN NaN NaN
Wyoming NaN NaN NaN NaN NaN 4.0 2.0 1.0 4.0 3.0
A pivot table can also help:
df2 = pd.pivot_table(df, values='0', columns='Award Year', index=['State'])
df2
Result:
Award Year 2003 2004 2005 2006 2007 2011 2012 2013 2014 2015
State
Alabama 89.0 92.0 108.0 81.0 71.0 NaN NaN NaN NaN NaN
Wyoming NaN NaN NaN NaN NaN 4.0 2.0 1.0 4.0 3.0
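With either approach, if you want State back as a regular column (as in the example output), a plain reset_index flattens the result:
df2 = df2.reset_index()  # State becomes an ordinary column again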

Reshape pivot table in pandas

I need to reshape a csv pivot table. A small extract looks like:
country location confirmedcases_10-02-2020 deaths_10-02-2020 confirmedcases_11-02-2020 deaths_11-02-2020
0 Australia New South Wales 4.0 0.0 4 0.0
1 Australia Victoria 4.0 0.0 4 0.0
2 Australia Queensland 5.0 0.0 5 0.0
3 Australia South Australia 2.0 0.0 2 0.0
4 Cambodia Sihanoukville 1.0 0.0 1 0.0
5 Canada Ontario 3.0 0.0 3 0.0
6 Canada British Columbia 4.0 0.0 4 0.0
7 China Hubei 31728.0 974.0 33366 1068.0
8 China Zhejiang 1177.0 0.0 1131 0.0
9 China Guangdong 1177.0 1.0 1219 1.0
10 China Henan 1105.0 7.0 1135 8.0
11 China Hunan 912.0 1.0 946 2.0
12 China Anhui 860.0 4.0 889 4.0
13 China Jiangxi 804.0 1.0 844 1.0
14 China Chongqing 486.0 2.0 505 3.0
15 China Sichuan 417.0 1.0 436 1.0
16 China Shandong 486.0 1.0 497 1.0
17 China Jiangsu 515.0 0.0 543 0.0
18 China Shanghai 302.0 1.0 311 1.0
19 China Beijing 342.0 3.0 352 3.0
Is there any ready-to-use pandas tool to reshape it into something like:
country location date confirmedcases deaths
0 Australia New South Wales 2020-02-10 4.0 0.0
1 Australia Victoria 2020-02-10 4.0 0.0
2 Australia Queensland 2020-02-10 5.0 0.0
3 Australia South Australia 2020-02-10 2.0 0.0
4 Cambodia Sihanoukville 2020-02-10 1.0 0.0
5 Canada Ontario 2020-02-10 3.0 0.0
6 Canada British Columbia 2020-02-10 4.0 0.0
7 China Hubei 2020-02-10 31728.0 974.0
8 China Zhejiang 2020-02-10 1177.0 0.0
9 China Guangdong 2020-02-10 1177.0 1.0
10 China Henan 2020-02-10 1105.0 7.0
11 China Hunan 2020-02-10 912.0 1.0
12 China Anhui 2020-02-10 860.0 4.0
13 China Jiangxi 2020-02-10 804.0 1.0
14 China Chongqing 2020-02-10 486.0 2.0
15 China Sichuan 2020-02-10 417.0 1.0
16 China Shandong 2020-02-10 486.0 1.0
17 China Jiangsu 2020-02-10 515.0 0.0
18 China Shanghai 2020-02-10 302.0 1.0
19 China Beijing 2020-02-10 342.0 3.0
20 Australia New South Wales 2020-02-11 4.0 0.0
21 Australia Victoria 2020-02-11 4.0 0.0
22 Australia Queensland 2020-02-11 5.0 0.0
23 Australia South Australia 2020-02-11 2.0 0.0
24 Cambodia Sihanoukville 2020-02-11 1.0 0.0
25 Canada Ontario 2020-02-11 3.0 0.0
26 Canada British Columbia 2020-02-11 4.0 0.0
27 China Hubei 2020-02-11 33366.0 1068.0
28 China Zhejiang 2020-02-11 1131.0 0.0
29 China Guangdong 2020-02-11 1219.0 1.0
30 China Henan 2020-02-11 1135.0 8.0
31 China Hunan 2020-02-11 946.0 2.0
32 China Anhui 2020-02-11 889.0 4.0
33 China Jiangxi 2020-02-11 844.0 1.0
34 China Chongqing 2020-02-11 505.0 3.0
35 China Sichuan 2020-02-11 436.0 1.0
36 China Shandong 2020-02-11 497.0 1.0
37 China Jiangsu 2020-02-11 543.0 0.0
38 China Shanghai 2020-02-11 311.0 1.0
39 China Beijing 2020-02-11 352.0 3.0
Use pd.wide_to_long:
print(pd.wide_to_long(df, stubnames=["confirmedcases", "deaths"],
                      i=["country", "location"], j="date", sep="_",
                      suffix=r'\d{2}-\d{2}-\d{4}').reset_index())
country location date confirmedcases deaths
0 Australia New South Wales 10-02-2020 4.0 0.0
1 Australia New South Wales 11-02-2020 4.0 0.0
2 Australia Victoria 10-02-2020 4.0 0.0
3 Australia Victoria 11-02-2020 4.0 0.0
4 Australia Queensland 10-02-2020 5.0 0.0
5 Australia Queensland 11-02-2020 5.0 0.0
6 Australia South Australia 10-02-2020 2.0 0.0
7 Australia South Australia 11-02-2020 2.0 0.0
8 Cambodia Sihanoukville 10-02-2020 1.0 0.0
9 Cambodia Sihanoukville 11-02-2020 1.0 0.0
10 Canada Ontario 10-02-2020 3.0 0.0
11 Canada Ontario 11-02-2020 3.0 0.0
12 Canada British Columbia 10-02-2020 4.0 0.0
13 Canada British Columbia 11-02-2020 4.0 0.0
14 China Hubei 10-02-2020 31728.0 974.0
15 China Hubei 11-02-2020 33366.0 1068.0
16 China Zhejiang 10-02-2020 1177.0 0.0
17 China Zhejiang 11-02-2020 1131.0 0.0
18 China Guangdong 10-02-2020 1177.0 1.0
19 China Guangdong 11-02-2020 1219.0 1.0
20 China Henan 10-02-2020 1105.0 7.0
21 China Henan 11-02-2020 1135.0 8.0
22 China Hunan 10-02-2020 912.0 1.0
23 China Hunan 11-02-2020 946.0 2.0
24 China Anhui 10-02-2020 860.0 4.0
25 China Anhui 11-02-2020 889.0 4.0
26 China Jiangxi 10-02-2020 804.0 1.0
27 China Jiangxi 11-02-2020 844.0 1.0
28 China Chongqing 10-02-2020 486.0 2.0
29 China Chongqing 11-02-2020 505.0 3.0
30 China Sichuan 10-02-2020 417.0 1.0
31 China Sichuan 11-02-2020 436.0 1.0
32 China Shandong 10-02-2020 486.0 1.0
33 China Shandong 11-02-2020 497.0 1.0
34 China Jiangsu 10-02-2020 515.0 0.0
35 China Jiangsu 11-02-2020 543.0 0.0
36 China Shanghai 10-02-2020 302.0 1.0
37 China Shanghai 11-02-2020 311.0 1.0
38 China Beijing 10-02-2020 342.0 3.0
39 China Beijing 11-02-2020 352.0 3.0
Yes, and you can achieve it by reshaping the dataframe.
First you have to melt the columns to have them as values:
df = df.melt(['country', 'location'],
             [p for p in df.columns if p not in ['country', 'location']],
             'key',
             'value')
#> country location key value
#> 0 Australia New South Wales confirmedcases_10-02-2020 4
#> 1 Australia Victoria confirmedcases_10-02-2020 4
#> 2 Australia Queensland confirmedcases_10-02-2020 5
#> 3 Australia South Australia confirmedcases_10-02-2020 2
#> 4 Cambodia Sihanoukville confirmedcases_10-02-2020 1
#> .. ... ... ... ...
#> 75 China Sichuan deaths_11-02-2020 1
#> 76 China Shandong deaths_11-02-2020 1
#> 77 China Jiangsu deaths_11-02-2020 0
#> 78 China Shanghai deaths_11-02-2020 1
#> 79 China Beijing deaths_11-02-2020 3
After that you need to split the key column into the measure name and the date:
key_split_series = df.key.str.split("_", expand=True)
df["key"] = key_split_series[0]
df["date"] = key_split_series[1]
#> country location key value date
#> 0 Australia New South Wales confirmedcases 4 10-02-2020
#> 1 Australia Victoria confirmedcases 4 10-02-2020
#> 2 Australia Queensland confirmedcases 5 10-02-2020
#> 3 Australia South Australia confirmedcases 2 10-02-2020
#> 4 Cambodia Sihanoukville confirmedcases 1 10-02-2020
#> .. ... ... ... ... ...
#> 75 China Sichuan deaths 1 11-02-2020
#> 76 China Shandong deaths 1 11-02-2020
#> 77 China Jiangsu deaths 0 11-02-2020
#> 78 China Shanghai deaths 1 11-02-2020
#> 79 China Beijing deaths 3 11-02-2020
In the end, you just need to pivot the table to have confirmedcases and deaths back as columns:
df = df.set_index(["country", "location", "date", "key"])["value"].unstack().reset_index()
#> key country location date confirmedcases deaths
#> 0 Australia New South Wales 10-02-2020 4 0
#> 1 Australia New South Wales 11-02-2020 4 0
#> 2 Australia Queensland 10-02-2020 5 0
#> 3 Australia Queensland 11-02-2020 5 0
#> 4 Australia South Australia 10-02-2020 2 0
#> .. ... ... ... ... ...
#> 35 China Shanghai 11-02-2020 311 1
#> 36 China Sichuan 10-02-2020 417 1
#> 37 China Sichuan 11-02-2020 436 1
#> 38 China Zhejiang 10-02-2020 1177 0
#> 39 China Zhejiang 11-02-2020 1131 0
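In both answers the date column still holds the raw dd-mm-yyyy strings from the original headers. One extra conversion turns it into a real datetime, matching the ISO-ordered dates in the desired output (day-first format inferred from the headers):
# '10-02-2020' -> Timestamp('2020-02-10'); the headers use day-month-year order
df["date"] = pd.to_datetime(df["date"], format="%d-%m-%Y")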
