merge and sample two pandas time series - python

I have two time series. I would like to merge them and asfreq(*, method='pad') the result, restricted to the time range they share in common.
To illustrate, suppose I define A and B like this:
import datetime as dt
import numpy as np
import pandas as pd
A = pd.Series(np.arange(4),
              index=pd.date_range(dt.datetime(2017, 1, 4, 10, 0, 0),
                                  periods=4, freq=dt.timedelta(seconds=10)))
B = pd.Series(np.arange(6),
              index=pd.date_range(dt.datetime(2017, 1, 4, 10, 0, 7),
                                  periods=6, freq=dt.timedelta(seconds=3)))
So they look like:
# A
2017-01-04 10:00:00 0
2017-01-04 10:00:10 1
2017-01-04 10:00:20 2
2017-01-04 10:00:30 3
# B
2017-01-04 10:00:07 0
2017-01-04 10:00:10 1
2017-01-04 10:00:13 2
2017-01-04 10:00:16 3
2017-01-04 10:00:19 4
2017-01-04 10:00:22 5
I would like to compute something like:
# combine_and_asfreq(A, B, dt.timedelta(seconds=5))
# timestamp A B
2017-01-04 10:00:07 0 0
2017-01-04 10:00:12 1 1
2017-01-04 10:00:17 1 3
2017-01-04 10:00:22 2 5
How can I do this?

I am not exactly sure what you are asking, but here is a somewhat convoluted method that first finds the overlapping time range and creates a single-column dataframe as the 'base' dataframe with the 5-second timedelta.
Get started by setting up the dataframes properly:
start = max(A.index.min(), B.index.min())
end = min(A.index.max(), B.index.max())
df_time = pd.DataFrame({'time': pd.date_range(start,end,freq='5s')})
df_A = A.reset_index()
df_B = B.reset_index()
df_A.columns = ['time', 'value']
df_B.columns = ['time', 'value']
Now we have the following three dataframes.
df_A
time value
0 2017-01-04 10:00:00 0
1 2017-01-04 10:00:10 1
2 2017-01-04 10:00:20 2
3 2017-01-04 10:00:30 3
df_B
time value
0 2017-01-04 10:00:07 0
1 2017-01-04 10:00:10 1
2 2017-01-04 10:00:13 2
3 2017-01-04 10:00:16 3
4 2017-01-04 10:00:19 4
5 2017-01-04 10:00:22 5
df_time
time
0 2017-01-04 10:00:07
1 2017-01-04 10:00:12
2 2017-01-04 10:00:17
3 2017-01-04 10:00:22
Use merge_asof to join all three
pd.merge_asof(pd.merge_asof(df_time, df_A, on='time'), df_B, on='time', suffixes=('_A', '_B'))
time value_A value_B
0 2017-01-04 10:00:07 0 0
1 2017-01-04 10:00:12 1 1
2 2017-01-04 10:00:17 1 3
3 2017-01-04 10:00:22 2 5
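If you want this packaged behind the combine_and_asfreq name from the question, here is a minimal sketch that wraps the same steps into a function (the function name and signature come from the question, not from pandas; it assumes pandas 0.19+ for merge_asof):
import pandas as pd

def combine_and_asfreq(A, B, delta):
    # restrict to the time range both series share
    start = max(A.index.min(), B.index.min())
    end = min(A.index.max(), B.index.max())
    grid = pd.DataFrame({'time': pd.date_range(start, end, freq=delta)})

    # as-of ("pad") join of each series onto the regular grid
    df_A = A.rename('A').reset_index().rename(columns={'index': 'time'})
    df_B = B.rename('B').reset_index().rename(columns={'index': 'time'})
    merged = pd.merge_asof(pd.merge_asof(grid, df_A, on='time'), df_B, on='time')
    return merged.set_index('time')

# e.g. combine_and_asfreq(A, B, dt.timedelta(seconds=5))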

Related

Return blocks of size X from pandas dataframe

I have a PANDAS dataframe where I want to return a function of every X items of a time series--so for instance, my dataframe might look like
date value
2017-01-01 1
2017-01-02 5
2017-01-03 2
2017-01-04 1
2017-01-05 6
2017-01-06 6
So for example, if I want to be able to pull the rolling average of every X values where X is 3, I would want a dataframe showing
date value
2017-01-03 2.666
2017-01-04 2.666
2017-01-05 3
2017-01-06 4.333
Is there a dataframe operation that lets me pick a group of X values upon which to run a function?
I think you need rolling with mean, and then, if necessary, remove the first NaNs with dropna:
df['value'] = df['value'].rolling(3).mean()
df = df.dropna(subset=['value'])
print (df)
date value
2 2017-01-03 2.666667
3 2017-01-04 2.666667
4 2017-01-05 3.000000
5 2017-01-06 4.333333
It is also possible to use the min_periods parameter to avoid NaNs:
df['value'] = df['value'].rolling(3, min_periods=1).mean()
print (df)
date value
0 2017-01-01 1.000000
1 2017-01-02 3.000000
2 2017-01-03 2.666667
3 2017-01-04 2.666667
4 2017-01-05 3.000000
5 2017-01-06 4.333333
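If you need something other than the mean, rolling also accepts an arbitrary function through apply. A minimal sketch (the max-minus-min window function here is just a placeholder for whatever you want to run on each group of 3 rows):
import pandas as pd

df = pd.DataFrame({'date': pd.date_range('2017-01-01', periods=6, freq='D'),
                   'value': [1, 5, 2, 1, 6, 6]})

# run a custom function over each window of 3 consecutive rows
df['windowed'] = df['value'].rolling(3).apply(lambda w: w.max() - w.min())
print(df)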

Pandas fill forward and sum as you go

I have a sparse dataframe including dates of when inventory is bought or sold like the following:
Date Inventory
2017-01-01 10
2017-01-05 -5
2017-01-07 15
2017-01-09 -20
The first step I would like to solve is to add in the other dates. I know you can use resample, but I am just highlighting this part in case it has an impact on the next, more difficult part. As below:
Date Inventory
2017-01-01 10
2017-01-02 NaN
2017-01-03 NaN
2017-01-04 NaN
2017-01-05 -5
2017-01-06 NaN
2017-01-07 15
2017-01-08 NaN
2017-01-09 -20
The final step is to have it fill forward over the NaNs, except that when it encounters a new value, that value gets added to the running value from the row above, so that the final dataframe looks like the following:
Date Inventory
2017-01-01 10
2017-01-02 10
2017-01-03 10
2017-01-04 10
2017-01-05 5
2017-01-06 5
2017-01-07 20
2017-01-08 20
2017-01-09 0
2017-01-10 0
I am trying to get a pythonic approach to this and not a loop based approach as that will be very slow.
The example should also work for a table with multiple columns as such:
Date InventoryA InventoryB
2017-01-01 10 NaN
2017-01-02 NaN NaN
2017-01-03 NaN 5
2017-01-04 NaN 5
2017-01-05 -5 NaN
2017-01-06 NaN -10
2017-01-07 15 NaN
2017-01-08 NaN NaN
2017-01-09 -20 NaN
would become:
Date InventoryA InventoryB
2017-01-01 10 0
2017-01-02 10 0
2017-01-03 10 5
2017-01-04 10 10
2017-01-05 5 10
2017-01-06 5 0
2017-01-07 20 0
2017-01-08 20 0
2017-01-09 0 0
2017-01-10 0 0
Hope that helps too. I think the current solution will have a problem with the NaNs as such.
Thanks
You can just fill the missing values with 0 after resampling (no inventory change on that day), and then use cumsum
df.fillna(0).cumsum()
You're simply doing the two steps in the wrong order :)
df['Inventory'].cumsum().resample('D').pad()
Edit: you might need to set the Date as index first.
df = df.set_index('Date')
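Putting the two suggestions together, here is a minimal end-to-end sketch for the single-column case (the sample data is rebuilt from the question):
import pandas as pd

df = pd.DataFrame({'Date': pd.to_datetime(['2017-01-01', '2017-01-05',
                                           '2017-01-07', '2017-01-09']),
                   'Inventory': [10, -5, 15, -20]}).set_index('Date')

# running total first, then resample to daily and forward-fill it
out = df['Inventory'].cumsum().resample('D').pad()

# equivalent: resample first, treat missing days as zero change, then cumsum
out2 = df['Inventory'].resample('D').asfreq().fillna(0).cumsum()
print(out)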
Part 1: Assuming df is your dataframe:
Date Inventory
2017-01-01 10
2017-01-05 -5
2017-01-07 15
2017-01-09 -20
Then
import pandas as pd
import datetime
df_new = pd.DataFrame([df.Date.min() + datetime.timedelta(days=day)
                       for day in range((df.Date.max() - df.Date.min()).days + 1)])
df_new = df_new.merge(df, left_on=0, right_on='Date', how="left").drop("Date", axis=1)
df_new.columns = df.columns
Gives you :
Date Inventory
0 2017-01-01 10.0
1 2017-01-02 NaN
2 2017-01-03 NaN
3 2017-01-04 NaN
4 2017-01-05 -5.0
5 2017-01-06 NaN
6 2017-01-07 15.0
7 2017-01-08 NaN
8 2017-01-09 -20.0
Part 2
From the fillna method description:
method : {'backfill', 'bfill', 'pad', 'ffill', None}, default None
    Method to use for filling holes in reindexed Series.
    pad / ffill: propagate last valid observation forward to next valid.
    backfill / bfill: use NEXT valid observation to fill gap.
df_new.Inventory = df_new.Inventory.fillna(method="ffill")
Gives you
Date Inventory
0 2017-01-01 10.0
1 2017-01-02 10.0
2 2017-01-03 10.0
3 2017-01-04 10.0
4 2017-01-05 -5.0
5 2017-01-06 -5.0
6 2017-01-07 15.0
7 2017-01-08 15.0
8 2017-01-09 -20.0
You should be able to generalise it for more than one column once you understood how it can be done with one.
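For the multi-column frame in the question, the zero-change-then-cumsum idea from the first answer generalises directly once the frame has been reindexed to daily frequency. A one-line sketch (it assumes df_daily already has a daily DatetimeIndex and one column per inventory series):
# every missing day is a zero change; cumsum turns changes into running levels
result = df_daily.fillna(0).cumsum()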

Python: creating categorical variable in the existing column

I have a DataFrame in Python that has a column holding the difference of two dates. I would like to create a new column (or overwrite the existing one) that converts the numeric difference into a categorical variable based on the rules below:
difference 0 days     Level 0
difference 2 days     Level 1
difference 2-6 days   Level 2
difference 6-15 days  Level 3
difference 15-69 days Level 4
difference NaT        Level 5
How could this be accomplished?
Say the column name is 'difference'. You can define a method like:
def get_difference_category(difference):
    if difference < 0:
        return 0
    if difference <= 2:
        return 1
    # .. and so on

df['difference'] = df['difference'].apply(get_difference_category)
reference links:
https://github.com/vi3k6i5/pandas_basics/blob/master/2_b_apply_a_function_row_wise.ipynb
https://github.com/vi3k6i5/pandas_basics/blob/master/2_c_apply_a_function_to_a_column.ipynb
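A completed version of that helper, filling in the remaining levels from the rules in the question, might look like the sketch below. It assumes difference is a plain number of days; what happens above 69 days is not specified in the question, so it is mapped to Level 5 here as an assumption, and missing values go to Level 5 as asked.
import pandas as pd

def get_difference_category(difference):
    # difference is the gap in days; NaT/NaN maps to Level 5 per the question
    if pd.isnull(difference):
        return 5
    if difference <= 0:
        return 0
    if difference <= 2:
        return 1
    if difference <= 6:
        return 2
    if difference <= 15:
        return 3
    if difference <= 69:
        return 4
    return 5  # above 69 days: not specified in the question, assumed Level 5

df['difference'] = df['difference'].apply(get_difference_category)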
You can use np.searchsorted to find where each time delta falls into an array of break points. I replace any NaT differences with Level 6
td = pd.to_timedelta(['0 days', '2 days', '6 days', '15 days', '69 days'])
difs = df.End.values - df.Start.values
vals = np.searchsorted(td.values, difs)
vals[pd.isnull(difs)] = 6
df = df.assign(
    Level=np.core.defchararray.add('Level ', vals.astype(str))
)
df
Start End Level
0 2017-01-01 2017-01-11 Level 3
1 2017-01-02 2017-03-09 Level 4
2 2017-01-03 2017-03-16 Level 5
3 2017-01-04 2017-01-10 Level 2
4 2017-01-05 2017-01-05 Level 0
5 2017-01-06 2017-01-08 Level 1
6 2017-01-07 2017-01-26 Level 4
7 2017-01-08 2017-01-15 Level 3
8 2017-01-09 2017-02-16 Level 4
9 2017-01-10 2017-01-24 Level 3
Setup
import numpy as np
import pandas as pd
from io import StringIO
txt = """ Start End
0 2017-01-01 2017-01-11
1 2017-01-02 2017-03-09
2 2017-01-03 2017-03-16
3 2017-01-04 2017-01-10
4 2017-01-05 2017-01-05
5 2017-01-06 2017-01-08
6 2017-01-07 2017-01-26
7 2017-01-08 2017-01-15
8 2017-01-09 2017-02-16
9 2017-01-10 2017-01-24"""
df = pd.read_csv(StringIO(txt), delim_whitespace=True).apply(pd.to_datetime)
Edit: to handle NaT, you can use pd.cut:
data['Severity'] = pd.cut((data['End'] - data['Start']).dt.days,
                          [-pd.np.inf, -1, 0, 2, 6, 15, 69],
                          labels=['Level 5', 'Level 0', 'Level 1', 'Level 2',
                                  'Level 3', 'Level 4']).fillna('Level 5')
Example:
df.head(10)
Start End
0 2017-01-01 2017-01-11
1 2017-01-02 2017-03-09
2 2017-01-03 2017-03-16
3 2017-01-04 2017-01-10
4 2017-01-05 2017-01-25
5 2017-01-06 2017-01-25
6 2017-01-07 2017-01-26
7 2017-01-08 2017-01-15
8 2017-01-09 2017-02-16
9 2017-01-10 2017-01-24
df['Severity'] = pd.cut((df['End'] - df['Start']).dt.days,
                        [-np.inf, 0, 2, 6, 15, 69, np.inf],
                        labels=['Level 0', 'Level 1', 'Level 2',
                                'Level 3', 'Level 4', 'Level 5'])
Output:
End Start Severity
0 2017-01-11 2017-01-01 Level 3
1 2017-03-09 2017-01-02 Level 4
2 2017-03-16 2017-01-03 Level 5
3 2017-01-10 2017-01-04 Level 2
4 2017-01-25 2017-01-05 Level 4
5 2017-01-25 2017-01-06 Level 4
6 2017-01-26 2017-01-07 Level 4
7 2017-01-15 2017-01-08 Level 3
8 2017-02-16 2017-01-09 Level 4
9 2017-01-24 2017-01-10 Level 3
I added a bar plot to analyze the distribution. I also used a dataframe and a lambda function to get my day differences. Visuals help you understand the data: the histogram gives you insight into the classification distribution, and the distribution plot shows you how the day intervals are spread out.
from datetime import datetime

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

column1 = ['2017-01-01','2017-01-01','2017-01-02','2017-01-03','2017-01-04','2017-01-05','2017-01-06','2017-01-07','2017-01-08','2017-01-09','2017-01-10']
column2 = ['2017-01-01','2017-01-11','2017-03-09','2017-03-16','2017-01-10','2017-01-25','2017-01-25','2017-01-26','2017-01-15','2017-02-16','2017-01-24']
index = range(0, len(column1))
data = {'column1': column1, 'column2': column2}
df = pd.DataFrame(data, columns=['column1', 'column2'], index=index)
print(df.head())

# day differences between the two date columns
differences = df.apply(lambda x: datetime.strptime(x['column2'], '%Y-%m-%d')
                                 - datetime.strptime(x['column1'], '%Y-%m-%d'), axis=1)
differences = differences.dt.days.astype('int')

# bin the differences into levels
years_bins = [-1, 0, 2, 6, 15, 69, np.inf]
output_labels = ['level 0', 'level 1', 'level 2', 'level 3', 'level 4', 'level 5']
out = pd.cut(differences, bins=years_bins, labels=output_labels)
df['differences'] = differences
df['classification'] = out
print(df.head())

# bar plot of the level counts
fig, ax = plt.subplots()
ax = out.value_counts(sort=False).plot.barh(rot=0, color="b", figsize=(6, 4))
ax.set_yticklabels(output_labels)
plt.show()

# histogram of the classifications and distribution of the day differences
plt.hist(df['classification'], bins=6)
plt.show()
sns.distplot(df['differences'])
plt.show()

calculate datetime difference over multiple rows in dataframe

I have got a Python-related question about datetimes in a dataframe. I imported the following df via pd.read_csv():
datetime label d_time
0 2017-01-03 23:52:00
1 2017-01-03 23:53:00 A
2 2017-01-03 23:54:00 A
3 2017-01-03 23:55:00 A
4 2017-01-04 00:01:00
5 2017-01-04 00:02:00 B
6 2017-01-04 00:06:00 B
7 2017-01-04 00:09:00 B
8 2017-01-04 00:11:00 B
9 2017-01-04 00:12:00
10 2017-01-04 00:14:00
11 2017-01-04 00:16:00
12 2017-01-04 00:18:00 C
13 2017-01-04 00:20:00 C
14 2017-01-04 00:22:00
I would like to know the time difference over the rows that are labeled with A, B, C as in the following:
datetime label d_time
0 2017-01-03 23:52:00
1 2017-01-03 23:53:00 A 0:02
2 2017-01-03 23:54:00 A
3 2017-01-03 23:55:00 A
4 2017-01-04 00:01:00
5 2017-01-04 00:02:00 B 0:09
6 2017-01-04 00:06:00 B
7 2017-01-04 00:09:00 B
8 2017-01-04 00:11:00 B
9 2017-01-04 00:12:00
10 2017-01-04 00:14:00
11 2017-01-04 00:16:00
12 2017-01-04 00:18:00 C 0:02
13 2017-01-04 00:20:00 C
14 2017-01-04 00:22:00
So the d_time should be the total time difference over labeled rows. There are approx. 100 different labels, and they can vary from 1 to x in a row. This calculation has to be done for +1 million rows, so a loop will probably not work. Does anybody know how to do this? Thanks in advance.
Assuming the consecutive labels are all the same and separated by one NaN,
you can do something like this:
idx = pd.Series(df[pd.isnull(df['label'])].index)
idx_begin = idx.iloc[:-1] + 1
idx_end = idx.iloc[1:] - 1
d_time = df.loc[idx_end, 'datetime'].reset_index(drop=True) - df.loc[idx_begin, 'datetime'].reset_index(drop=True)
d_time.index = idx_begin
df.loc[idx_begin, 'd_time'] = d_time
If your dataset looks different, you might look into different ways to get to idx_begin and idx_end, but this works for the dataset you posted
Multiple consecutive nans
If there are multiple consecutive nan-values, you can solve this by adding this to the end
df.loc[df[pd.isnull(df['label'])].index, 'd_time'] = None
Consecutive different labels
idx = df[(df['label'] != df['label'].shift(1)) & (pd.notnull(df['label']) | (pd.notnull(df['label'].shift(1))))].index
idx_begin = idx[:-1]
idx_end = idx[1:] -1
This marks different labels as separate starts and ends. To make this work, you will need the df.loc[df[pd.isnull(df['label'])].index, 'd_time'] = None line added at the end.
The & (pd.notnull(df['label']) | pd.notnull(df['label'].shift(1))) part is needed because None != None.
Result
datetime label d_time
0 2017-01-03 23:52:00 NaN NaN
1 2017-01-03 23:53:00 A NaN
2 2017-01-03 23:54:00 A NaN
3 2017-01-03 23:52:00 NaN NaN
4 2017-01-03 23:53:00 B NaN
5 2017-01-03 23:54:00 B NaN
6 2017-01-03 23:55:00 NaN NaN
7 2017-01-03 23:56:00 NaN NaN
8 2017-01-03 23:57:00 NaN NaN
9 2017-01-04 00:02:00 A NaN
10 2017-01-04 00:06:00 A NaN
11 2017-01-04 00:09:00 A NaN
12 2017-01-04 00:02:00 B NaN
13 2017-01-04 00:06:00 B NaN
14 2017-01-04 00:09:00 B NaN
15 2017-01-04 00:11:00 NaN NaN
yields
datetime label d_time
0 2017-01-03 23:52:00 NaN NaT
1 2017-01-03 23:53:00 A 00:01:00
2 2017-01-03 23:54:00 A NaT
3 2017-01-03 23:52:00 NaN NaT
4 2017-01-03 23:53:00 B 00:01:00
5 2017-01-03 23:54:00 B NaT
6 2017-01-03 23:55:00 NaN NaT
7 2017-01-03 23:56:00 NaN NaT
8 2017-01-03 23:57:00 NaN NaT
9 2017-01-04 00:02:00 A 00:07:00
10 2017-01-04 00:06:00 A NaT
11 2017-01-04 00:09:00 A NaT
12 2017-01-04 00:02:00 B 00:07:00
13 2017-01-04 00:06:00 B NaT
14 2017-01-04 00:09:00 B NaT
15 2017-01-04 00:11:00 NaN NaT
Last Series
If the last row doesn't have a changed label compared to the one before it, the last series will not register.
You can prevent this by including this after the first line
if idx[-1] != df.index[-1]:
    idx = idx.append(df.index[[-1]] + 1)
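An alternative vectorised route, offered as a sketch rather than a drop-in replacement for the above: build a group id that increments whenever the label changes, compute each block's time span with groupby, and keep it only on the first row of each labelled block. It assumes the datetime column has already been converted with pd.to_datetime.
import pandas as pd

# group id increments every time the label changes (NaN rows start their own groups)
group_id = (df['label'] != df['label'].shift()).cumsum()

# total time span of each block
span = df.groupby(group_id)['datetime'].transform(lambda s: s.iloc[-1] - s.iloc[0])

# keep the span only on the first row of each labelled block
first_of_block = group_id != group_id.shift()
df['d_time'] = span.where(first_of_block & df['label'].notnull())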
If the datetimes are datetime objects (or pandas.Timestamp), you can use this for-loop:
a_rows = []
for row in df.itertuples():
    if row.label == 'A':
        a_rows.append(row)
    elif a_rows:
        d_time = a_rows[-1].datetime - a_rows[0].datetime
        df.loc[a_rows[0].Index, 'd_time'] = d_time
        a_rows = []
with this result
datetime label d_time
0 2017-01-03 23:52:00
1 2017-01-03 23:53:00 A 0 days 00:02:00
2 2017-01-03 23:54:00 A
3 2017-01-03 23:55:00 A
4 2017-01-04 00:01:00
5 2017-01-04 00:02:00 A 0 days 00:07:00
6 2017-01-04 00:06:00 A
7 2017-01-04 00:09:00 A
8 2017-01-04 00:11:00
You can later format the timedelta object if you want.
If the datetime column contains strings, you can easily convert them with df['datetime'] = pd.to_datetime(df['datetime'])

Efficiently serialising timestamps

I have a dataframe to analyse that has a column of dates as datetimes, and a column of hours as integers.
I would like to combine the two columns into a single timestamp field for some further analysis, but cannot find a way to do so quickly.
I have this code that works, but takes an inordinate amount of time due to the length of the dataframe (~1m entries):
for i in range(len(my_df)):
    my_df['gen_timestamp'][i] = datetime.datetime.combine(my_df['date'][i],
                                                          datetime.time(my_df['hour'][i]))
What I would like to do is to somehow convert the datetime type in my_df['date'] to an integer (say a timestamp in seconds) and the integer type in my_df['hour'], so that they can be quickly summed without the need for a laborious loop.
Worst case I then convert that integer back to a datetime in one go or just use seconds as my data type going forwards.
Thanks for any help.
IIUC you can construct a TimedeltaIndex and add this to your datetimes:
In [112]:
# sample data
df = pd.DataFrame({'date':pd.date_range(dt.datetime(2017,1,1), periods=10), 'hour':np.arange(10)})
df
Out[112]:
date hour
0 2017-01-01 0
1 2017-01-02 1
2 2017-01-03 2
3 2017-01-04 3
4 2017-01-05 4
5 2017-01-06 5
6 2017-01-07 6
7 2017-01-08 7
8 2017-01-09 8
9 2017-01-10 9
In [113]:
df['timestamp'] = df['date'] + pd.TimedeltaIndex(df['hour'], unit='h')
df
Out[113]:
date hour timestamp
0 2017-01-01 0 2017-01-01 00:00:00
1 2017-01-02 1 2017-01-02 01:00:00
2 2017-01-03 2 2017-01-03 02:00:00
3 2017-01-04 3 2017-01-04 03:00:00
4 2017-01-05 4 2017-01-05 04:00:00
5 2017-01-06 5 2017-01-06 05:00:00
6 2017-01-07 6 2017-01-07 06:00:00
7 2017-01-08 7 2017-01-08 07:00:00
8 2017-01-09 8 2017-01-09 08:00:00
9 2017-01-10 9 2017-01-10 09:00:00
So in your case I expect the following to work:
my_df['gen_timestamp'] = my_df['date'] + pd.TimedeltaIndex(my_df['hour'], unit='h')
This assumes that my_df['date'] is already datetime; if not, convert it first using my_df['date'] = pd.to_datetime(my_df['date'])
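Equivalently, as a sketch, pd.to_timedelta builds an index-aligned timedelta Series, which keeps both sides of the addition as Series:
# hour column as hours, added to the (already datetime) date column
my_df['gen_timestamp'] = my_df['date'] + pd.to_timedelta(my_df['hour'], unit='h')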
