Efficiently serialising timestamps - python

I have a dataframe to analyse that has a column of dates as datetimes, and a column of hours as integers.
I would like to combine the two columns into a single timestamp field for some further analysis, but cannot find a way to do so quickly.
I have this code that works, but it takes an inordinate amount of time due to the length of the dataframe (~1m entries):
for i in range(len(my_df)):
    my_df['gen_timestamp'][i] = datetime.datetime.combine(my_df['date'][i],
                                                          datetime.time(my_df['hour'][i]))
What I would like to do is somehow convert the datetime type in my_df['date'] to an integer (say, a timestamp in seconds) and combine it with the integer type in my_df['hour'], so that they can be quickly summed without the need for a laborious loop.
Worst case, I then convert that integer back to a datetime in one go, or just use seconds as my data type going forwards.
Thanks for any help.

IIUC you can construct a TimedeltaIndex and add this to your datetimes:
In [112]:
# sample data
df = pd.DataFrame({'date':pd.date_range(dt.datetime(2017,1,1), periods=10), 'hour':np.arange(10)})
df
Out[112]:
date hour
0 2017-01-01 0
1 2017-01-02 1
2 2017-01-03 2
3 2017-01-04 3
4 2017-01-05 4
5 2017-01-06 5
6 2017-01-07 6
7 2017-01-08 7
8 2017-01-09 8
9 2017-01-10 9
In [113]:
df['timestamp'] = df['date'] + pd.TimedeltaIndex(df['hour'], unit='h')
df
Out[113]:
date hour timestamp
0 2017-01-01 0 2017-01-01 00:00:00
1 2017-01-02 1 2017-01-02 01:00:00
2 2017-01-03 2 2017-01-03 02:00:00
3 2017-01-04 3 2017-01-04 03:00:00
4 2017-01-05 4 2017-01-05 04:00:00
5 2017-01-06 5 2017-01-06 05:00:00
6 2017-01-07 6 2017-01-07 06:00:00
7 2017-01-08 7 2017-01-08 07:00:00
8 2017-01-09 8 2017-01-09 08:00:00
9 2017-01-10 9 2017-01-10 09:00:00
So in your case I expect the following to work:
my_df['gen_timestamp'] = my_df['date'] + pd.TimedeltaIndex(my_df['hour'], unit='h')
This assumes that my_df['date'] is already datetime; if not, convert it first using my_df['date'] = pd.to_datetime(my_df['date'])
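The integer route the question asks about also works and stays vectorized. A minimal sketch (column names follow the question; the epoch-seconds encoding is one possible choice, not the only one):

```python
import datetime as dt
import pandas as pd

# Sample frame shaped like the question's my_df.
my_df = pd.DataFrame({'date': pd.date_range(dt.datetime(2017, 1, 1), periods=3),
                      'hour': [0, 1, 2]})

# datetime64[ns] -> nanoseconds since epoch -> seconds, then add hour * 3600.
secs = my_df['date'].astype('int64') // 10**9 + my_df['hour'] * 3600

# Convert the summed seconds back to timestamps in one go.
my_df['gen_timestamp'] = pd.to_datetime(secs, unit='s')
```

Either way, the whole column is computed at once with no Python-level loop.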


Pandas get minimum total in moving window

I have something like the following dataframe:
df=pd.Series(index=pd.date_range(start='1/1/2017', end='1/10/2017', freq='D'),
data=[5,5,2,1,3,4,5,6,7,8])
df
Out[216]:
2017-01-01 5
2017-01-02 5
2017-01-03 2
2017-01-04 1
2017-01-05 3
2017-01-06 4
2017-01-07 5
2017-01-08 6
2017-01-09 7
2017-01-10 8
Freq: D, dtype: int64
I want to identify the start date of the 3 day period that has the minimum total value. So in this example, 2017-01-03 through 2017-01-05, has the minimum value with a sum of 6 across those 3 days.
Is there a way to do this without looping through each 3 day window?
The result would be:
2017-01-03 6
And if there were multiple windows that have the same minimum sum, the result could have a record for each.
IIUC, you can use rolling:
df=pd.Series(index=pd.date_range(start='1/1/2017', end='1/10/2017', freq='D'),
data=[5,5,2,1,3,4,5,6,7,8])
df=df.to_frame()
df['New']=df.rolling(3).sum().shift(-2).values
df.loc[df.New==df.New.min(),:].drop(columns=0)
Out[685]:
New
2017-01-03 6.0
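If only a single start date is needed (ignoring ties), idxmin on the shifted rolling sum gives it directly. A sketch on the same sample data:

```python
import pandas as pd

s = pd.Series(index=pd.date_range(start='1/1/2017', end='1/10/2017', freq='D'),
              data=[5, 5, 2, 1, 3, 4, 5, 6, 7, 8])

# Sum of each 3-day window, re-indexed to the window's start date.
win = s.rolling(3).sum().shift(-2)

start = win.idxmin()  # start date of the first minimum window
total = win.min()     # the minimum 3-day total
```

idxmin returns only the first match, so the drop-based approach above is the one to use when all tied windows are wanted.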

Pandas fill forward and sum as you go

I have a sparse dataframe including dates of when inventory is bought or sold like the following:
Date Inventory
2017-01-01 10
2017-01-05 -5
2017-01-07 15
2017-01-09 -20
The first step I would like to solve is to add in the other dates. I know you can use resample, but I'm just highlighting this part in case it has an impact on the next, more difficult part. As below:
Date Inventory
2017-01-01 10
2017-01-02 NaN
2017-01-03 NaN
2017-01-04 NaN
2017-01-05 -5
2017-01-06 NaN
2017-01-07 15
2017-01-08 NaN
2017-01-09 -20
The final step is to have it fill forward over the NaNs, except that when it encounters a new value, that value gets added to the running total from the row above, so that the final dataframe looks like the following:
Date Inventory
2017-01-01 10
2017-01-02 10
2017-01-03 10
2017-01-04 10
2017-01-05 5
2017-01-06 5
2017-01-07 20
2017-01-08 20
2017-01-09 0
2017-01-10 0
I am trying to get a pythonic approach to this and not a loop based approach as that will be very slow.
The example should also work for a table with multiple columns as such:
Date InventoryA InventoryB
2017-01-01 10 NaN
2017-01-02 NaN NaN
2017-01-03 NaN 5
2017-01-04 NaN 5
2017-01-05 -5 NaN
2017-01-06 NaN -10
2017-01-07 15 NaN
2017-01-08 NaN NaN
2017-01-09 -20 NaN
would become:
Date InventoryA InventoryB
2017-01-01 10 0
2017-01-02 10 0
2017-01-03 10 5
2017-01-04 10 10
2017-01-05 5 10
2017-01-06 5 0
2017-01-07 20 0
2017-01-08 20 0
2017-01-09 0 0
2017-01-10 0 0
Hope that helps too. I think the current solution will have a problem with the NaNs as such.
Thanks.
You can just fill the missing values with 0 after resampling (no inventory change on that day), and then use cumsum
df.fillna(0).cumsum()
You're simply doing the two steps in the wrong order :)
df['Inventory'].cumsum().resample('D').pad()
Edit: you might need to set the Date as index first.
df = df.set_index('Date')
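Putting both steps together on the question's sample data (a sketch; ffill is the same operation as pad):

```python
import pandas as pd

# Sparse inventory changes from the question.
df = pd.DataFrame({'Date': pd.to_datetime(['2017-01-01', '2017-01-05',
                                           '2017-01-07', '2017-01-09']),
                   'Inventory': [10, -5, 15, -20]}).set_index('Date')

# Running total first, then expand to daily frequency and fill forward.
out = df['Inventory'].cumsum().resample('D').ffill()
```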
Part 1: Assuming df is your dataframe:
Date Inventory
2017-01-01 10
2017-01-05 -5
2017-01-07 15
2017-01-09 -20
Then
import pandas as pd
import datetime
df_new = pd.DataFrame([df.Date.min() + datetime.timedelta(days=day) for day in range((df.Date.max() - df.Date.min()).days+1)])
df_new = df_new.merge(df, left_on=0, right_on='Date',how="left").drop("Date",axis=1)
df_new.columns = df.columns
Gives you:
Date Inventory
0 2017-01-01 10.0
1 2017-01-02 NaN
2 2017-01-03 NaN
3 2017-01-04 NaN
4 2017-01-05 -5.0
5 2017-01-06 NaN
6 2017-01-07 15.0
7 2017-01-08 NaN
8 2017-01-09 -20.0
Part 2
From the fillna method description:
method : {'backfill', 'bfill', 'pad', 'ffill', None}, default None
    Method to use for filling holes in reindexed Series.
    pad / ffill: propagate last valid observation forward to next valid.
    backfill / bfill: use NEXT valid observation to fill gap.
df_new.Inventory = df_new.Inventory.fillna(method="ffill")
Gives you
Date Inventory
0 2017-01-01 10.0
1 2017-01-02 10.0
2 2017-01-03 10.0
3 2017-01-04 10.0
4 2017-01-05 -5.0
5 2017-01-06 -5.0
6 2017-01-07 15.0
7 2017-01-08 15.0
8 2017-01-09 -20.0
You should be able to generalise it for more than one column once you understood how it can be done with one.
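For the multi-column case in the question, the fillna(0) + cumsum approach from the other answer generalises directly once the frame has a daily index. A sketch using the question's two-column example:

```python
import pandas as pd

df = pd.DataFrame(
    {'InventoryA': [10, None, None, None, -5, None, 15, None, -20, None],
     'InventoryB': [None, None, 5, 5, None, -10, None, None, None, None]},
    index=pd.date_range('2017-01-01', periods=10))

# Missing entries mean "no change that day", so treat them as 0 and accumulate.
out = df.fillna(0).cumsum()
```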

Replace values in Pandas DataFrame with a DateTime Index based on a list of dates

I have a dataframe of values and a list of dates.
E.g.,
data = pd.DataFrame([1,3,5,7,2,3,9,1,3,8,4,5],index=pd.date_range(start='2017-01-01',periods=12),columns=['values'])
I want to replace the value of a date in the date list with a zero value. E.g.,
date_list = ['2017-01-04', '2017-01-07', '2017-01-10']
I have tried:
data[date_list] == 0
but this yields an error:
KeyError: "None of [['2017-01-04', '2017-01-07', '2017-01-10']] are in the [index]"
Does anyone have an idea of how to solve this? I have a very large dataframe and date list...
Another way,
In [11]: data[data.index.isin(date_list)] = 0
In [12]: data
Out[12]:
values
2017-01-01 1
2017-01-02 3
2017-01-03 5
2017-01-04 0
2017-01-05 2
2017-01-06 3
2017-01-07 0
2017-01-08 1
2017-01-09 3
2017-01-10 0
2017-01-11 4
2017-01-12 5
You need to convert that list to datetime and use the loc indexer:
data.loc[pd.to_datetime(date_list)] = 0
data
Out[19]:
values
2017-01-01 1
2017-01-02 3
2017-01-03 5
2017-01-04 0
2017-01-05 2
2017-01-06 3
2017-01-07 0
2017-01-08 1
2017-01-09 3
2017-01-10 0
2017-01-11 4
2017-01-12 5
This works here because the DataFrame has only one column, so the statement sets every column to zero. As jezrael pointed out, if you only want to set the values column to zero, you need to specify it:
data.loc[pd.to_datetime(date_list), 'values'] = 0
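If some dates in the list might not be present in the index, intersecting first is a defensive variant (a sketch; depending on the pandas version, assigning at missing labels can either enlarge the frame with new rows or raise):

```python
import pandas as pd

data = pd.DataFrame([1, 3, 5, 7, 2, 3, 9, 1, 3, 8, 4, 5],
                    index=pd.date_range(start='2017-01-01', periods=12),
                    columns=['values'])
# Last date is deliberately outside the index to show it is ignored.
date_list = ['2017-01-04', '2017-01-07', '2017-01-10', '2018-12-31']

# Keep only the dates that actually exist in the index.
hits = data.index.intersection(pd.to_datetime(date_list))
data.loc[hits, 'values'] = 0
```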

Python: creating categorical variable in the existing column

I have a DataFrame in Python that has a column holding the difference of 2 dates. I would like to create a new column (or overwrite the existing one) that converts the numeric difference to a categorical variable based on the rules below:
difference 0 days: Level 0
difference 2 days: Level 1
difference 2-6 days: Level 2
difference 6-15 days: Level 3
difference 15-69 days: Level 4
difference NaT: Level 5
How could this be accomplished?
Say the column name is 'difference'. You can define a function like:
def get_difference_category(difference):
    if difference < 0:
        return 0
    if difference <= 2:
        return 1
    # ... and so on

df['difference'] = df['difference'].apply(get_difference_category)
reference links:
https://github.com/vi3k6i5/pandas_basics/blob/master/2_b_apply_a_function_row_wise.ipynb
https://github.com/vi3k6i5/pandas_basics/blob/master/2_c_apply_a_function_to_a_column.ipynb
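A runnable version of this idea, filled out for all of the question's thresholds including the NaT case (the exact cut-offs are an assumption read off the question's table, and the function name is illustrative):

```python
import pandas as pd

def get_difference_category(difference):
    # NaT differences map to Level 5, per the question's last rule.
    if pd.isnull(difference):
        return 5
    days = difference.days
    if days <= 0:
        return 0
    if days <= 2:
        return 1
    if days <= 6:
        return 2
    if days <= 15:
        return 3
    return 4

diffs = pd.Series(pd.to_timedelta(['0 days', '2 days', '5 days',
                                   '10 days', '30 days', pd.NaT]))
levels = diffs.apply(get_difference_category)
```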
You can use np.searchsorted to find where each time delta falls into an array of break points. I replace any NaT differences with Level 6
td = pd.to_timedelta(['0 days', '2 days', '6 days', '15 days', '69 days'])
difs = df.End.values - df.Start.values
vals = np.searchsorted(td.values, difs)
vals[pd.isnull(difs)] = 6
df = df.assign(
    Level=np.core.defchararray.add('Level ', vals.astype(str))
)
df
Start End Level
0 2017-01-01 2017-01-11 Level 3
1 2017-01-02 2017-03-09 Level 4
2 2017-01-03 2017-03-16 Level 5
3 2017-01-04 2017-01-10 Level 2
4 2017-01-05 2017-01-05 Level 0
5 2017-01-06 2017-01-08 Level 1
6 2017-01-07 2017-01-26 Level 4
7 2017-01-08 2017-01-15 Level 3
8 2017-01-09 2017-02-16 Level 4
9 2017-01-10 2017-01-24 Level 3
Setup
import pandas as pd
from io import StringIO
txt = """ Start End
0 2017-01-01 2017-01-11
1 2017-01-02 2017-03-09
2 2017-01-03 2017-03-16
3 2017-01-04 2017-01-10
4 2017-01-05 2017-01-05
5 2017-01-06 2017-01-08
6 2017-01-07 2017-01-26
7 2017-01-08 2017-01-15
8 2017-01-09 2017-02-16
9 2017-01-10 2017-01-24"""
df = pd.read_csv(StringIO(txt), delim_whitespace=True).apply(pd.to_datetime)
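To see the bucketing in isolation, here is a small sketch of how searchsorted maps a few sample gaps onto levels, masking NaT out first rather than relying on how NaT sorts:

```python
import numpy as np
import pandas as pd

# Break points and a few sample differences, as in the answer above.
td = pd.to_timedelta(['0 days', '2 days', '6 days', '15 days', '69 days'])
difs = pd.to_timedelta(['0 days', '2 days', '10 days', '66 days', pd.NaT])

vals = np.full(len(difs), 6)                # default level for NaT
mask = difs.notna()
vals[mask] = np.searchsorted(td.values, difs[mask].values)
```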
Edit: To handle NaT, you can use pd.cut:
data['Severity'] = pd.cut((data['End'] - data['Start']).dt.days,
                          [-np.inf, -1, 0, 2, 6, 15, 69],
                          labels=['Level 5', 'Level 0', 'Level 1', 'Level 2',
                                  'Level 3', 'Level 4']).fillna('Level 5')
Example:
df.head(10)
Start End
0 2017-01-01 2017-01-11
1 2017-01-02 2017-03-09
2 2017-01-03 2017-03-16
3 2017-01-04 2017-01-10
4 2017-01-05 2017-01-25
5 2017-01-06 2017-01-25
6 2017-01-07 2017-01-26
7 2017-01-08 2017-01-15
8 2017-01-09 2017-02-16
9 2017-01-10 2017-01-24
df['Severity'] = pd.cut((df['End'] - df['Start']).dt.days,
                        [-np.inf, 0, 2, 6, 15, 69, np.inf],
                        labels=['Level 0', 'Level 1', 'Level 2', 'Level 3',
                                'Level 4', 'Level 5'])
Output:
End Start Severity
0 2017-01-11 2017-01-01 Level 3
1 2017-03-09 2017-01-02 Level 4
2 2017-03-16 2017-01-03 Level 5
3 2017-01-10 2017-01-04 Level 2
4 2017-01-25 2017-01-05 Level 4
5 2017-01-25 2017-01-06 Level 4
6 2017-01-26 2017-01-07 Level 4
7 2017-01-15 2017-01-08 Level 3
8 2017-02-16 2017-01-09 Level 4
9 2017-01-24 2017-01-10 Level 3
I added a bar plot to analyze the distribution, and I used a dataframe and a lambda function to get the day differences. Visuals help you understand the data: the histogram gives you insight into the classification distribution, and the distribution plot shows you how the day intervals are spread.
import pandas as pd
import numpy as np
from datetime import datetime
import matplotlib.pyplot as plt
import seaborn as sns

column1=['2017-01-01','2017-01-01','2017-01-02','2017-01-03','2017-01-04','2017-01-05','2017-01-06','2017-01-07','2017-01-08','2017-01-09','2017-01-10']
column2=['2017-01-01','2017-01-11','2017-03-09','2017-03-16','2017-01-10','2017-01-25','2017-01-25','2017-01-26','2017-01-15','2017-02-16','2017-01-24' ]
index=range(0,len(column1))
data={'column1':column1,'column2':column2}
df=pd.DataFrame(data, columns=['column1','column2'],index=index)
print(df.head())
differences=df.apply(lambda x: datetime.strptime(x['column2'],'%Y-%m-%d')- datetime.strptime(x['column1'],'%Y-%m-%d'),axis=1)
differences=differences.dt.days.astype('int')
years_bins=[-1,0,2,6,15,69,np.inf]
output_labels=['level 0','level 1','level 2','level 3','level 4','level 5']
out=pd.cut(differences,bins=years_bins,labels=output_labels)
df['differences']=differences
df['classification']=out
print(df.head())
fig, ax = plt.subplots()
ax = out.value_counts(sort=False).plot.barh(rot=0, color="b", figsize=(6,4))
ax.set_yticklabels(output_labels)
plt.show()
plt.hist(df['classification'], bins=6)
plt.show()
sns.distplot(df['differences'])
plt.show()

calculate datetime differenence over multiple rows in dataframe

I have a Python-related question about datetimes in a dataframe. I imported the following df via pd.read_csv():
datetime label d_time
0 2017-01-03 23:52:00
1 2017-01-03 23:53:00 A
2 2017-01-03 23:54:00 A
3 2017-01-03 23:55:00 A
4 2017-01-04 00:01:00
5 2017-01-04 00:02:00 B
6 2017-01-04 00:06:00 B
7 2017-01-04 00:09:00 B
8 2017-01-04 00:11:00 B
9 2017-01-04 00:12:00
10 2017-01-04 00:14:00
11 2017-01-04 00:16:00
12 2017-01-04 00:18:00 C
13 2017-01-04 00:20:00 C
14 2017-01-04 00:22:00
I would like to know the time difference over the rows that are labeled with A, B, C as in the following:
datetime label d_time
0 2017-01-03 23:52:00
1 2017-01-03 23:53:00 A 0:02
2 2017-01-03 23:54:00 A
3 2017-01-03 23:55:00 A
4 2017-01-04 00:01:00
5 2017-01-04 00:02:00 B 0:09
6 2017-01-04 00:06:00 B
7 2017-01-04 00:09:00 B
8 2017-01-04 00:11:00 B
9 2017-01-04 00:12:00
10 2017-01-04 00:14:00
11 2017-01-04 00:16:00
12 2017-01-04 00:18:00 C 0:02
13 2017-01-04 00:20:00 C
14 2017-01-04 00:22:00
So the d_time should be the total time difference over labeled rows. There are approx. 100 different labels, and they can vary from 1 to x in a row. This calculation has to be done for +1 million rows, so a loop will probably not work. Does anybody know how to do this? Thanks in advance.
Assuming the consecutive labels are all the same and separated by one NaN, you can do something like this:
idx = pd.Series(df[pd.isnull(df['label'])].index)
idx_begin = idx.iloc[:-1] + 1
idx_end = idx.iloc[1:] - 1
d_time = df.loc[idx_end, 'datetime'].reset_index(drop=True) - df.loc[idx_begin, 'datetime'].reset_index(drop=True)
d_time.index = idx_begin
df.loc[idx_begin, 'd_time'] = d_time
If your dataset looks different, you might look into different ways to get to idx_begin and idx_end, but this works for the dataset you posted
Multiple consecutive nans
If there are multiple consecutive nan-values, you can solve this by adding this to the end
df.loc[df[pd.isnull(df['label'])].index, 'd_time'] = None
Consecutive different labels
idx = df[(df['label'] != df['label'].shift(1)) & (pd.notnull(df['label']) | (pd.notnull(df['label'].shift(1))))].index
idx_begin = idx[:-1]
idx_end = idx[1:] -1
This marks different labels as different starts and beginnings. To make this work, you will need the df.loc[df[pd.isnull(df['label'])].index, 'd_time'] = None added to the end
The & (pd.notnull(df['label']) | pd.notnull(df['label'].shift(1))) part is needed because None != None
Result
datetime label d_time
0 2017-01-03 23:52:00 NaN NaN
1 2017-01-03 23:53:00 A NaN
2 2017-01-03 23:54:00 A NaN
3 2017-01-03 23:52:00 NaN NaN
4 2017-01-03 23:53:00 B NaN
5 2017-01-03 23:54:00 B NaN
6 2017-01-03 23:55:00 NaN NaN
7 2017-01-03 23:56:00 NaN NaN
8 2017-01-03 23:57:00 NaN NaN
9 2017-01-04 00:02:00 A NaN
10 2017-01-04 00:06:00 A NaN
11 2017-01-04 00:09:00 A NaN
12 2017-01-04 00:02:00 B NaN
13 2017-01-04 00:06:00 B NaN
14 2017-01-04 00:09:00 B NaN
15 2017-01-04 00:11:00 NaN NaN
yields
datetime label d_time
0 2017-01-03 23:52:00 NaN NaT
1 2017-01-03 23:53:00 A 00:01:00
2 2017-01-03 23:54:00 A NaT
3 2017-01-03 23:52:00 NaN NaT
4 2017-01-03 23:53:00 B 00:01:00
5 2017-01-03 23:54:00 B NaT
6 2017-01-03 23:55:00 NaN NaT
7 2017-01-03 23:56:00 NaN NaT
8 2017-01-03 23:57:00 NaN NaT
9 2017-01-04 00:02:00 A 00:07:00
10 2017-01-04 00:06:00 A NaT
11 2017-01-04 00:09:00 A NaT
12 2017-01-04 00:02:00 B 00:07:00
13 2017-01-04 00:06:00 B NaT
14 2017-01-04 00:09:00 B NaT
15 2017-01-04 00:11:00 NaN NaT
Last Series
If the last row doesn't have a changed label compared to the one before it, the last series will not register.
You can prevent this by including this after the first line
if idx[-1] != df.index[-1]:
    idx = idx.append(df.index[[-1]]+1)
If the datetimes are datetime objects (or pandas.Timestamp) you can use this for-loop:
a_rows = []
for row in df.itertuples():
    if row.label == 'A':
        a_rows.append(row)
    elif a_rows:
        d_time = a_rows[-1].datetime - a_rows[0].datetime
        df.loc[a_rows[0].Index, 'd_time'] = d_time
        a_rows = []
with this result
datetime label d_time
0 2017-01-03 23:52:00
1 2017-01-03 23:53:00 A 0 days 00:02:00
2 2017-01-03 23:54:00 A
3 2017-01-03 23:55:00 A
4 2017-01-04 00:01:00
5 2017-01-04 00:02:00 A 0 days 00:07:00
6 2017-01-04 00:06:00 A
7 2017-01-04 00:09:00 A
8 2017-01-04 00:11:00
You can later format the timedelta object if you want.
If the datetime column contains strings, you can easily convert them with df['datetime'] = pd.to_datetime(df['datetime'])
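A vectorized alternative to the loop, sketched with groupby: build a run id that increments every time the label changes, compute each run's duration with transform, and write it on the run's first labeled row (the run-id trick is a common pattern, not something from the question):

```python
import pandas as pd

df = pd.DataFrame({
    'datetime': pd.to_datetime(['2017-01-03 23:52', '2017-01-03 23:53',
                                '2017-01-03 23:54', '2017-01-03 23:55',
                                '2017-01-04 00:01', '2017-01-04 00:02',
                                '2017-01-04 00:06', '2017-01-04 00:09',
                                '2017-01-04 00:11', '2017-01-04 00:12']),
    'label': [None, 'A', 'A', 'A', None, 'B', 'B', 'B', 'B', None]})

# New run whenever the label differs from the previous row's label.
run = (df['label'] != df['label'].shift()).cumsum()

# Duration of each run: last timestamp minus first, broadcast over the run.
span = df.groupby(run)['datetime'].transform(lambda g: g.iloc[-1] - g.iloc[0])

# Keep the duration only on the first row of each labeled run.
first = run != run.shift()
df.loc[first & df['label'].notna(), 'd_time'] = span
```

This avoids the Python-level loop entirely, which matters at the question's ~1 million rows.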
