Intersect Two DataFrames In Pandas Time Series - python

I have two data frames that look similar to the ones represeted below.
df1
id date x w
0 71896517 2020-07-25 1 5
1 71896517 2020-09-14 2 3
2 72837666 2020-09-21 1 9
3 72848188 2020-11-03 1 1
df2
id date x y z
0 71896517 2020-07-25 1 1 6
1 71896589 2020-09-14 2 2 8
2 72837949 2020-09-21 1 1 3
3 72848188 2020-11-03 1 1 2
I want to achieve only one data frame by intersecting the tow data frames above and achieve something similar to:
id date x w y z
0 71896517 2020-07-25 1 5 1 6
1 71896517 2020-09-14 2 3 NaN NaN
2 71896589 2020-09-14 2 NaN 2 8
3 72837666 2020-09-21 1 9 NaN NaN
4 72837949 2020-09-21 1 NaN 1 3
5 72848188 2020-11-03 1 1 1 2
Pretty much I want for every date the information for each id to be on the same row. I left the NaN because I think that is how it is going to be presented, but then I will fill them with zero.
How can I achive this?

Let's try an outer merge:
df3 = df1.merge(df2, how='outer').sort_values('date').reset_index(drop=True)
print(df3)
df3:
id date x w y z
0 71896517 2020-07-25 1 5.0 1.0 6.0
1 71896517 2020-09-14 2 3.0 NaN NaN
2 71896589 2020-09-14 2 NaN 2.0 8.0
3 72837666 2020-09-21 1 9.0 NaN NaN
4 72837949 2020-09-21 1 NaN 1.0 3.0
5 72848188 2020-11-03 1 1.0 1.0 2.0

Related

Forward filling missing dates into Python Panel Pandas Dataframe

Suppose I have the following pandas dataframe:
df = pd.DataFrame({'Date':['2015-01-31','2015-01-31', '2015-02-28', '2015-03-31', '2015-04-30', '2015-04-30'], 'ID':[1,2,2,2,1,2], 'value':[1,2,3,4,5,6]})
print(df)
Date ID value
2015-01-31 1 1
2015-01-31 2 2
2015-02-28 2 3
2015-03-31 2 4
2015-04-30 1 5
2015-04-30 2 6
I want to forward fill the data such that I have the values for each end of month till 2015-05-31 (i.e. for each date - ID combination). That is, I would like the dataframe to look as follows:
Date ID value
2015-01-31 1 1
2015-01-31 2 2
2015-02-28 2 3
2015-02-28 1 1
2015-03-31 2 4
2015-03-31 1 1
2015-04-30 1 5
2015-04-30 2 6
2015-05-31 1 5
2015-05-31 2 6
Is something like this possible? I saw several similar questions on Stackoverflow on forward filling dates, however this was without an index column (where the same date can occur many times).
You can pivot then fill value with reindex + ffill
out = df.pivot(*df.columns).reindex(pd.date_range('2015-01-31',periods = 5,freq='M')).ffill().stack().reset_index()
out.columns = df.columns
out
Out[1077]:
Date ID value
0 2015-01-31 1 1.0
1 2015-01-31 2 2.0
2 2015-02-28 1 1.0
3 2015-02-28 2 3.0
4 2015-03-31 1 1.0
5 2015-03-31 2 4.0
6 2015-04-30 1 5.0
7 2015-04-30 2 6.0
8 2015-05-31 1 5.0
9 2015-05-31 2 6.0
Another solution:
idx = pd.MultiIndex.from_product(
[
pd.date_range(df["Date"].min(), "2015-05-31", freq="M"),
df["ID"].unique(),
],
names=["Date", "ID"],
)
df = df.set_index(["Date", "ID"]).reindex(idx).groupby(level=1).ffill()
print(df.reset_index())
Prints:
Date ID value
0 2015-01-31 1 1.0
1 2015-01-31 2 2.0
2 2015-02-28 1 1.0
3 2015-02-28 2 3.0
4 2015-03-31 1 1.0
5 2015-03-31 2 4.0
6 2015-04-30 1 5.0
7 2015-04-30 2 6.0
8 2015-05-31 1 5.0
9 2015-05-31 2 6.0

Get cumulative mean among groups in Python

I am trying to get a cumulative mean in python among different groups.
I have data as follows:
id date value
1 2019-01-01 2
1 2019-01-02 8
1 2019-01-04 3
1 2019-01-08 4
1 2019-01-10 12
1 2019-01-13 6
2 2019-01-01 4
2 2019-01-03 2
2 2019-01-04 3
2 2019-01-06 6
2 2019-01-11 1
The output I'm trying to get something like this:
id date value cumulative_avg
1 2019-01-01 2 NaN
1 2019-01-02 8 2
1 2019-01-04 3 5
1 2019-01-08 4 4.33
1 2019-01-10 12 4.25
1 2019-01-13 6 5.8
2 2019-01-01 4 NaN
2 2019-01-03 2 4
2 2019-01-04 3 3
2 2019-01-06 6 3
2 2019-01-11 1 3.75
I need the cumulative average to restart with each new id.
I can get a variation of what I'm looking for with a single, for example if the data set only had the data where id = 1 then I could use:
df['cumulative_avg'] = df['value'].expanding.mean().shift(1)
I try to add a group by into it but I get an error:
df['cumulative_avg'] = df.groupby('id')['value'].expanding().mean().shift(1)
TypeError: incompatible index of inserted column with frame index
Also tried:
df.set_index(['account']
ValueError: cannot handle a non-unique multi-index!
The actual data I have has millions of rows, and thousands of unique ids'. Any help with a speedy/efficient way to do this would be appreciated.
For many groups this will perform better because it ditches the apply. Take the cumsum divided by the cumcount, subtracting off the value to get the analog of expanding. Fortunately pandas interprets 0/0 as NaN.
gp = df.groupby('id')['value']
df['cum_avg'] = (gp.cumsum() - df['value'])/gp.cumcount()
id date value cum_avg
0 1 2019-01-01 2 NaN
1 1 2019-01-02 8 2.000000
2 1 2019-01-04 3 5.000000
3 1 2019-01-08 4 4.333333
4 1 2019-01-10 12 4.250000
5 1 2019-01-13 6 5.800000
6 2 2019-01-01 4 NaN
7 2 2019-01-03 2 4.000000
8 2 2019-01-04 3 3.000000
9 2 2019-01-06 6 3.000000
10 2 2019-01-11 1 3.750000
After a groupby, you can't really chain method and in your example, the shift is not made per group anymore so you would not get the expected result. And there is a problem with index alignment after anyway so you can't create a column like this. So you can do:
df['cumulative_avg'] = df.groupby('id')['value'].apply(lambda x: x.expanding().mean().shift(1))
print (df)
id date value cumulative_avg
0 1 2019-01-01 2 NaN
1 1 2019-01-02 8 2.000000
2 1 2019-01-04 3 5.000000
3 1 2019-01-08 4 4.333333
4 1 2019-01-10 12 4.250000
5 1 2019-01-13 6 5.800000
6 2 2019-01-01 4 NaN
7 2 2019-01-03 2 4.000000
8 2 2019-01-04 3 3.000000
9 2 2019-01-06 6 3.000000
10 2 2019-01-11 1 3.750000

Year to date average in dataframe

I have a dataframe that I am trying to calculate the year-to-date average for my value columns. Below is a sample dataframe.
date name values values2
0 2019-01-01 a 1 1
1 2019-02-01 a 3 3
2 2019-03-01 a 2 2
3 2019-04-01 a 6 2
I want to create new columns (values_ytd & values2_ytd) that will average the values from January to the latest period within the same year (April in sample data). I will need to group the data by year & name when calculating the averages. I am looking for an output similar to this.
date name values values2 values2_ytd values_ytd
0 2019-01-01 a 1 1 1 1
1 2019-02-01 a 3 3 2 2
2 2019-03-01 a 2 2 2 2
3 2019-04-01 a 6 2 2 3
I have tried unsuccesfully to using expanding().mean(), but most likely I was doing it wrong. My main dataframe has numerous name categories and many more columns. Here is the code I was attempting to use
df1.groupby([df1['name'], df1['date'].dt.year], as_index=False).expanding().mean().loc[:, 'values':'values2'].add_suffix('_ytd').reset_index(drop=True,level=0)
but am receiving the following error.
NotImplementedError: ops for Expanding for this dtype datetime64[ns] are not implemented
Note: This code below works perfectly when substituting cumsum() for .expanding().mean()to create a year-to-date sum of the values, but I cant figure it out for averages
df1.groupby([df1['name'], df1['date'].dt.year], as_index=False).cumsum().loc[:, 'values':'values2'].add_suffix('_ytd').reset_index(drop=True,level=0)
Any help is greatly appreciated.
Try this:
df['date'] = pd.to_datetime(df['date'])
df.set_index('date', inplace=True)
df[['values2_ytd', 'values_ytd']] = df.groupby([df.index.year, 'name'])['values','values2'].expanding().mean().reset_index(level=[0,1], drop=True)
df
name values values2 values2_ytd values_ytd
date
2019-01-01 a 1 1 1.0 1.0
2019-02-01 a 3 3 2.0 2.0
2019-03-01 a 2 2 2.0 2.0
2019-04-01 a 6 2 3.0 2.0
Example using multiple names and years:
date name values values2
0 2019-01-01 a 1 1
1 2019-02-01 a 3 3
2 2019-03-01 a 2 2
3 2019-04-01 a 6 2
4 2019-01-01 b 1 4
5 2019-02-01 b 3 4
6 2020-01-01 a 1 1
7 2020-02-01 a 3 3
8 2020-03-01 a 2 2
9 2020-04-01 a 6 2
Output:
name values values2 values2_ytd values_ytd
date
2019-01-01 a 1 1 1.0 1.0
2019-02-01 a 3 3 2.0 2.0
2019-03-01 a 2 2 2.0 2.0
2019-04-01 a 6 2 3.0 2.0
2019-01-01 b 1 4 1.0 4.0
2019-02-01 b 3 4 2.0 4.0
2020-01-01 a 1 1 1.0 1.0
2020-02-01 a 3 3 2.0 2.0
2020-03-01 a 2 2 2.0 2.0
2020-04-01 a 6 2 3.0 2.0
You should set date column as index: df.set_index('date', inplace=True) and then use df.resample('AS').groupby('name').mean()

How to add missing dates within date interval?

I have a dataframe like as shown below
df = pd.DataFrame({
'subject_id':[1,1,1,1,1,1,1,2,2,2,2,2],
'time_1' :['2173-04-03 12:35:00','2173-04-03 12:50:00','2173-04-05
12:59:00','2173-05-04 13:14:00','2173-05-05 13:37:00','2173-07-06
13:39:00','2173-07-08 11:30:00','2173-04-08 16:00:00','2173-04-09
22:00:00','2173-04-11 04:00:00','2173- 04-13 04:30:00','2173-04-14 08:00:00'],
'val' :[5,5,5,5,1,6,5,5,8,3,4,6]})
df['time_1'] = pd.to_datetime(df['time_1'])
df['day'] = df['time_1'].dt.day
df['month'] = df['time_1'].dt.month
As you can see from the dataframe above that there are few missing dates in between. I would like to create new records for those dates and fill in values from the immediate previous row
def dt(df):
r = pd.date_range(start=df.date.min(), end=df.date.max())
df.set_index('date').reindex(r)
new_df = df.groupby(['subject_id','month']).apply(dt)
This generates all the dates. I only want to find the missing date within the input date interval for each subject for each month
I did try the code from this related post. Though it helped me but doesn't get me the expected output for this updated/new requirement. As we do left join, it copies all records. I can't do inner join either because it will drop non-match column. I want a mix of left join and inner join
Currently it creates new records for all 365 days in a year which I don't want. something like below. This is not expected
I only wish to add missing dates between input date interval as shown below. For example subject = 1, in the 4th month has records from 3rd and 5th. but 4th is missing. So we add record for 4th day alone. We don't need 6th,7th etc unlike current output. Similarly in 7th month, record for 7th day missing. so we just add a new record for that
I expect my output to be like as shown below
Here is problem you need resample for append new days, so it is necessary.
df['time_1'] = pd.to_datetime(df['time_1'])
df['day'] = df['time_1'].dt.day
df['date'] = df['time_1'].dt.floor('d')
df1 = (df.set_index('date')
.groupby('subject_id')
.resample('d')
.last()
.index
.to_frame(index=False))
print (df1)
subject_id date
0 1 2173-04-03
1 1 2173-04-04
2 1 2173-04-05
3 1 2173-04-06
4 1 2173-04-07
.. ... ...
99 2 2173-04-10
100 2 2173-04-11
101 2 2173-04-12
102 2 2173-04-13
103 2 2173-04-14
[104 rows x 2 columns]
Idea is remove unnecessary missing rows - you can create threshold for minimum consecutive mising values (here 5) and remove rows (created new column fro easy test):
df2 = df1.merge(df, how='left')
thresh = 5
mask = df2['day'].notna()
s = mask.cumsum().mask(mask)
df2['count'] = s.map(s.value_counts())
df2 = df2[(df2['count'] < thresh) | (df2['count'].isna())]
print (df2)
subject_id date time_1 val day count
0 1 2173-04-03 2173-04-03 12:35:00 5.0 3.0 NaN
1 1 2173-04-03 2173-04-03 12:50:00 5.0 3.0 NaN
2 1 2173-04-04 NaT NaN NaN 1.0
3 1 2173-04-05 2173-04-05 12:59:00 5.0 5.0 NaN
32 1 2173-05-04 2173-05-04 13:14:00 5.0 4.0 NaN
33 1 2173-05-05 2173-05-05 13:37:00 1.0 5.0 NaN
95 1 2173-07-06 2173-07-06 13:39:00 6.0 6.0 NaN
96 1 2173-07-07 NaT NaN NaN 1.0
97 1 2173-07-08 2173-07-08 11:30:00 5.0 8.0 NaN
98 2 2173-04-08 2173-04-08 16:00:00 5.0 8.0 NaN
99 2 2173-04-09 2173-04-09 22:00:00 8.0 9.0 NaN
100 2 2173-04-10 NaT NaN NaN 1.0
101 2 2173-04-11 2173-04-11 04:00:00 3.0 11.0 NaN
102 2 2173-04-12 NaT NaN NaN 1.0
103 2 2173-04-13 2173-04-13 04:30:00 4.0 13.0 NaN
104 2 2173-04-14 2173-04-14 08:00:00 6.0 14.0 NaN
Last use previous solution:
df2 = df2.groupby(df['subject_id']).ffill()
dates = df2['time_1'].dt.normalize()
df2['time_1'] += np.where(dates == df2['date'], 0, df2['date'] - dates)
df2['day'] = df2['time_1'].dt.day
df2['val'] = df2['val'].astype(int)
print (df2)
subject_id date time_1 val day count
0 1 2173-04-03 2173-04-03 12:35:00 5 3 NaN
1 1 2173-04-03 2173-04-03 12:50:00 5 3 NaN
2 1 2173-04-04 2173-04-04 12:50:00 5 4 1.0
3 1 2173-04-05 2173-04-05 12:59:00 5 5 1.0
32 1 2173-05-04 2173-05-04 13:14:00 5 4 NaN
33 1 2173-05-05 2173-05-05 13:37:00 1 5 NaN
95 1 2173-07-06 2173-07-06 13:39:00 6 6 NaN
96 1 2173-07-07 2173-07-07 13:39:00 6 7 1.0
97 1 2173-07-08 2173-07-08 11:30:00 5 8 1.0
98 2 2173-04-08 2173-04-08 16:00:00 5 8 1.0
99 2 2173-04-09 2173-04-09 22:00:00 8 9 1.0
100 2 2173-04-10 2173-04-10 22:00:00 8 10 1.0
101 2 2173-04-11 2173-04-11 04:00:00 3 11 1.0
102 2 2173-04-12 2173-04-12 04:00:00 3 12 1.0
103 2 2173-04-13 2173-04-13 04:30:00 4 13 1.0
104 2 2173-04-14 2173-04-14 08:00:00 6 14 1.0
EDIT: Solution with reindex for each month:
df['time_1'] = pd.to_datetime(df['time_1'])
df['day'] = df['time_1'].dt.day
df['date'] = df['time_1'].dt.floor('d')
df['month'] = df['time_1'].dt.month
df1 = (df.drop_duplicates(['date','subject_id'])
.set_index('date')
.groupby(['subject_id', 'month'])
.apply(lambda x: x.reindex(pd.date_range(x.index.min(), x.index.max())))
.rename_axis(('subject_id','month','date'))
.index
.to_frame(index=False)
)
print (df1)
subject_id month date
0 1 4 2173-04-03
1 1 4 2173-04-04
2 1 4 2173-04-05
3 1 5 2173-05-04
4 1 5 2173-05-05
5 1 7 2173-07-06
6 1 7 2173-07-07
7 1 7 2173-07-08
8 2 4 2173-04-08
9 2 4 2173-04-09
10 2 4 2173-04-10
11 2 4 2173-04-11
12 2 4 2173-04-12
13 2 4 2173-04-13
14 2 4 2173-04-14
df2 = df1.merge(df, how='left')
df2 = df2.groupby(df2['subject_id']).ffill()
dates = df2['time_1'].dt.normalize()
df2['time_1'] += np.where(dates == df2['date'], 0, df2['date'] - dates)
df2['day'] = df2['time_1'].dt.day
df2['val'] = df2['val'].astype(int)
print (df2)
subject_id month date time_1 val day
0 1 4 2173-04-03 2173-04-03 12:35:00 5 3
1 1 4 2173-04-03 2173-04-03 12:50:00 5 3
2 1 4 2173-04-04 2173-04-04 12:50:00 5 4
3 1 4 2173-04-05 2173-04-05 12:59:00 5 5
4 1 5 2173-05-04 2173-05-04 13:14:00 5 4
5 1 5 2173-05-05 2173-05-05 13:37:00 1 5
6 1 7 2173-07-06 2173-07-06 13:39:00 6 6
7 1 7 2173-07-07 2173-07-07 13:39:00 6 7
8 1 7 2173-07-08 2173-07-08 11:30:00 5 8
9 2 4 2173-04-08 2173-04-08 16:00:00 5 8
10 2 4 2173-04-09 2173-04-09 22:00:00 8 9
11 2 4 2173-04-10 2173-04-10 22:00:00 8 10
12 2 4 2173-04-11 2173-04-11 04:00:00 3 11
13 2 4 2173-04-12 2173-04-12 04:00:00 3 12
14 2 4 2173-04-13 2173-04-13 04:30:00 4 13
15 2 4 2173-04-14 2173-04-14 08:00:00 6 14
Does this help?
def fill_dates(df):
result = pd.DataFrame()
for i,row in df.iterrows():
if i == 0:
result = result.append(row)
else:
start_date = result.iloc[-1]['time_1']
end_date = row['time_1']
# print(start_date, end_date)
delta = (end_date - start_date).days
# print(delta)
if delta > 0 and start_date.month == end_date.month:
for j in range(delta):
day = start_date + timedelta(days=j+1)
new_row = result.iloc[-1].copy()
new_row['time_1'] = day
new_row['remarks'] = 'added'
if new_row['time_1'].date() != row['time_1'].date():
result = result.append(new_row)
result = result.append(row)
else:
result = result.append(row)
result.reset_index(inplace = True)
return result

How do you convert start and end date records into timestamps?

For example (input pandas dataframe):
start_date end_date value
0 2018-05-17 2018-05-20 4
1 2018-05-22 2018-05-27 12
2 2018-05-14 2018-05-21 8
I want it to divide the value by the # of intervals present in the data (e.g. 2018-05-12 to 2018-05-27 has 6 days, 12 / 6 = 2) and then create a time series data like the following:
date value
0 2018-05-14 1
1 2018-05-15 1
2 2018-05-16 1
3 2018-05-17 2
4 2018-05-18 2
5 2018-05-19 2
6 2018-05-20 2
7 2018-05-21 1
8 2018-05-22 2
9 2018-05-23 2
10 2018-05-24 2
11 2018-05-25 2
12 2018-05-26 2
13 2018-05-27 2
is this possible to do without an inefficient loop through every row using pandas? Is there also a name for this method?
You can use:
#convert to datetimes if necessary
df['start_date'] = pd.to_datetime(df['start_date'])
df['end_date'] = pd.to_datetime(df['end_date'])
For each row generate list of Series by date_range, then divide their length and aggregate by groupby with sum:
dfs = [pd.Series(r.value, pd.date_range(r.start_date, r.end_date)) for r in df.itertuples()]
df = (pd.concat([x / len(x) for x in dfs])
.groupby(level=0)
.sum()
.rename_axis('date')
.reset_index(name='val'))
print (df)
date val
0 2018-05-14 1.0
1 2018-05-15 1.0
2 2018-05-16 1.0
3 2018-05-17 2.0
4 2018-05-18 2.0
5 2018-05-19 2.0
6 2018-05-20 2.0
7 2018-05-21 1.0
8 2018-05-22 2.0
9 2018-05-23 2.0
10 2018-05-24 2.0
11 2018-05-25 2.0
12 2018-05-26 2.0
13 2018-05-27 2.0

Categories