I have a pandas dataframe with over 1000 timestamps (below) that I would like to loop through:
2016-02-22 14:59:44.561776
I'm having a hard time splitting this timestamp into two columns, 'date' and 'time'. The date format can stay the same, but the time needs to be converted to CST (including milliseconds).
Thanks for the help
I had the same problem and this worked for me.
Suppose the date column in your dataset is called "date":
import pandas as pd
df = pd.read_csv(file_path)
df['Dates'] = pd.to_datetime(df['date']).dt.date
df['Time'] = pd.to_datetime(df['date']).dt.time
This will give you two new columns, "Dates" and "Time", containing the split dates and times.
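As a side note, the snippet above parses the column twice (once per pd.to_datetime call). If the frame is large you can parse once and reuse the result; a minimal sketch, assuming the raw column is still named "date":
parsed = pd.to_datetime(df['date'])
df['Dates'] = parsed.dt.date
df['Time'] = parsed.dt.time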
I'm not sure why you would want to do this in the first place, but if you really must...
df = pd.DataFrame({'my_timestamp': pd.date_range('2016-1-1 15:00', periods=5)})
>>> df
my_timestamp
0 2016-01-01 15:00:00
1 2016-01-02 15:00:00
2 2016-01-03 15:00:00
3 2016-01-04 15:00:00
4 2016-01-05 15:00:00
df['new_date'] = [d.date() for d in df['my_timestamp']]
df['new_time'] = [d.time() for d in df['my_timestamp']]
>>> df
my_timestamp new_date new_time
0 2016-01-01 15:00:00 2016-01-01 15:00:00
1 2016-01-02 15:00:00 2016-01-02 15:00:00
2 2016-01-03 15:00:00 2016-01-03 15:00:00
3 2016-01-04 15:00:00 2016-01-04 15:00:00
4 2016-01-05 15:00:00 2016-01-05 15:00:00
The conversion to CST is trickier. I assume that the current timestamps are 'unaware', i.e. they do not have a timezone attached? If not, how would you expect to convert them?
For more details:
https://docs.python.org/2/library/datetime.html
How to make an unaware datetime timezone aware in python
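If they are indeed naive, here is a minimal sketch of the conversion (assuming the naive timestamps are in UTC and that by CST you mean US Central, 'America/Chicago' - both assumptions, adjust to your data):
ts = pd.to_datetime(df['my_timestamp'])       # ensure datetime64 dtype
ts = ts.dt.tz_localize('UTC')                 # assumption: source times are UTC
ts_cst = ts.dt.tz_convert('America/Chicago')  # convert to US Central
# microseconds survive the conversion, so the split still works
df['new_date'] = ts_cst.dt.date
df['new_time'] = ts_cst.dt.time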
EDIT
An alternative method that only loops once across the timestamps instead of twice:
new_dates, new_times = zip(*[(d.date(), d.time()) for d in df['my_timestamp']])
df = df.assign(new_date=new_dates, new_time=new_times)
The easiest way is to use the pandas.Series dt accessor, which works on columns with a datetime dtype (see pd.to_datetime). In this example, pd.date_range already creates a column with a datetime dtype, so you can use .dt.date and .dt.time directly:
df = pd.DataFrame({'full_date': pd.date_range('2016-1-1 10:00:00.123', periods=10, freq='5H')})
df['date'] = df['full_date'].dt.date
df['time'] = df['full_date'].dt.time
In [166]: df
Out[166]:
full_date date time
0 2016-01-01 10:00:00.123 2016-01-01 10:00:00.123000
1 2016-01-01 15:00:00.123 2016-01-01 15:00:00.123000
2 2016-01-01 20:00:00.123 2016-01-01 20:00:00.123000
3 2016-01-02 01:00:00.123 2016-01-02 01:00:00.123000
4 2016-01-02 06:00:00.123 2016-01-02 06:00:00.123000
5 2016-01-02 11:00:00.123 2016-01-02 11:00:00.123000
6 2016-01-02 16:00:00.123 2016-01-02 16:00:00.123000
7 2016-01-02 21:00:00.123 2016-01-02 21:00:00.123000
8 2016-01-03 02:00:00.123 2016-01-03 02:00:00.123000
9 2016-01-03 07:00:00.123 2016-01-03 07:00:00.123000
If your timestamps are already a pandas datetime column (not strings), then:
df["date"] = df["timestamp"].dt.date
df["time"] = df["timestamp"].dt.time
If your timestamp is a string, you can parse it using the datetime module:
from datetime import datetime
df["timestamp"] = df["timestamp"].apply(
    lambda x: datetime.strptime(x, "%Y-%m-%d %H:%M:%S.%f"))
Source:
http://pandas.pydata.org/pandas-docs/stable/timeseries.html
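For a whole column, pd.to_datetime is usually simpler and faster than applying strptime row by row. A sketch, under the assumption that every value is a string in that exact format:
import pandas as pd
# vectorized parse; format= is optional but speeds things up on large columns
df["timestamp"] = pd.to_datetime(df["timestamp"], format="%Y-%m-%d %H:%M:%S.%f")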
If your timestamp is a string, you can convert it to a datetime object:
from datetime import datetime
timestamp = '2016-02-22 14:59:44.561776'
dt = datetime.strptime(timestamp, '%Y-%m-%d %H:%M:%S.%f')
From then on you can bring it to whatever format you like.
Try:
s = '2016-02-22 14:59:44.561776'
date, time = s.split()
then convert time as needed.
If you want to further split the time,
hour, minute, second = time.split(':')
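Put together, a small self-contained run of this string-only approach (note that the fractional seconds stay attached to the seconds field):
s = '2016-02-22 14:59:44.561776'
date, time = s.split()
hour, minute, second = time.split(':')
print(date, hour, minute, second)  # 2016-02-22 14 59 44.561776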
Try this:
def time_date(datetime_obj):
    date_time = datetime_obj.split(' ')
    time = date_time[1].split('.')
    return date_time[0], time[0]
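For example, calling it on the timestamp from the question (note that this version drops the fractional seconds):
print(time_date('2016-02-22 14:59:44.561776'))
# ('2016-02-22', '14:59:44')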
In addition to @Alexander's answer, if you want a one-liner:
df['new_date'], df['new_time'] = zip(*[(d.date(), d.time()) for d in df['my_timestamp']])
If your timestamp is a string, you can convert it to a pandas timestamp before splitting it.
#convert to pandas timestamp
data["old_date"] = pd.to_datetime(data.old_date)
#split columns
data["new_date"] = data["old_date"].dt.date
data["new_time"] = data["old_date"].dt.time
Related
I have a dataframe:
data = {'time':['08:45:00', '09:30:00', '18:00:00', '15:00:00']}
df = pd.DataFrame(data)
I would like to convert the time based on conditions: if the hour is less than 9, I want to set it to 9 and if the hour is more than 17, I need to set it to 17.
I tried this approach:
df['time'] = np.where(((df['time'].dt.hour < 9) & (df['time'].dt.hour != 0)), dt.time(9, 00))
I am getting an error: Can only use .dt accessor with datetimelike values.
Can anyone please help me with this? Thanks.
Here's a way to do what your question asks:
df.time = pd.to_datetime(df.time)
df.loc[df.time.dt.hour < 9, 'time'] = (df.time.astype('int64') + (9 - df.time.dt.hour)*3600*1000000000).astype('datetime64[ns]')
df.loc[df.time.dt.hour > 17, 'time'] = (df.time.astype('int64') + (17 - df.time.dt.hour)*3600*1000000000).astype('datetime64[ns]')
Input:
time
0 2022-06-06 08:45:00
1 2022-06-06 09:30:00
2 2022-06-06 18:00:00
3 2022-06-06 15:00:00
Output:
time
0 2022-06-06 09:45:00
1 2022-06-06 09:30:00
2 2022-06-06 17:00:00
3 2022-06-06 15:00:00
UPDATE:
Here's alternative code to try to address OP's error as described in the comments:
import pandas as pd
import datetime
data = {'time':['08:45:00', '09:30:00', '18:00:00', '15:00:00']}
df = pd.DataFrame(data)
print('', 'df loaded as strings:', df, sep='\n')
df.time = pd.to_datetime(df.time, format='%H:%M:%S')
print('', 'df converted to datetime by pd.to_datetime():', df, sep='\n')
df.loc[df.time.dt.hour < 9, 'time'] = (df.time.astype('int64') + (9 - df.time.dt.hour)*3600*1000000000).astype('datetime64[ns]')
df.loc[df.time.dt.hour > 17, 'time'] = (df.time.astype('int64') + (17 - df.time.dt.hour)*3600*1000000000).astype('datetime64[ns]')
df.time = [time.time() for time in pd.to_datetime(df.time)]
print('', 'df with time column adjusted to have hour between 9 and 17, converted to type "time":', df, sep='\n')
Output:
df loaded as strings:
time
0 08:45:00
1 09:30:00
2 18:00:00
3 15:00:00
df converted to datetime by pd.to_datetime():
time
0 1900-01-01 08:45:00
1 1900-01-01 09:30:00
2 1900-01-01 18:00:00
3 1900-01-01 15:00:00
df with time column adjusted to have hour between 9 and 17, converted to type "time":
time
0 09:45:00
1 09:30:00
2 17:00:00
3 15:00:00
UPDATE #2:
To not just change the hour for out-of-window times, but to simply apply 9:00 and 17:00 as min and max times, respectively (see OP's comment on this), you can do this:
df.loc[df['time'].dt.hour < 9, 'time'] = pd.to_datetime(pd.DataFrame({
    'year': df['time'].dt.year, 'month': df['time'].dt.month, 'day': df['time'].dt.day,
    'hour': [9]*len(df.index)}))
df.loc[df['time'].dt.hour > 17, 'time'] = pd.to_datetime(pd.DataFrame({
    'year': df['time'].dt.year, 'month': df['time'].dt.month, 'day': df['time'].dt.day,
    'hour': [17]*len(df.index)}))
df['time'] = [time.time() for time in pd.to_datetime(df['time'])]
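A shorter sketch of the same clamping, assuming the 'time' column already has a datetime64 dtype (an alternative to the code above, not the original method): build per-row 09:00 and 17:00 bounds from each row's date and clip to them.
lower = df['time'].dt.normalize() + pd.Timedelta(hours=9)   # 09:00 on each row's date
upper = df['time'].dt.normalize() + pd.Timedelta(hours=17)  # 17:00 on each row's date
df['time'] = df['time'].clip(lower=lower, upper=upper)
The final conversion back to time objects then works the same way as above.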
Since your 'time' column contains strings, they can be kept as strings and assigned new string values where appropriate. To filter for your criteria it is convenient to: create a datetime Series from the 'time' column, create boolean Series by comparing the datetime Series with your criteria, and use the boolean Series to select the rows which need to be changed.
Your data:
import numpy as np
import pandas as pd
data = {'time':['08:45:00', '09:30:00', '18:00:00', '15:00:00']}
df = pd.DataFrame(data)
print(df.to_string())
>>>
time
0 08:45:00
1 09:30:00
2 18:00:00
3 15:00:00
Convert to datetime, make boolean Series with your criteria
dts = pd.to_datetime(df['time'])
lt_nine = dts.dt.hour < 9
gt_seventeen = (dts.dt.hour >= 17)
print(lt_nine)
print(gt_seventeen)
>>>
0 True
1 False
2 False
3 False
Name: time, dtype: bool
0 False
1 False
2 True
3 False
Name: time, dtype: bool
Use the boolean series to assign a new value:
df.loc[lt_nine,'time'] = '09:00:00'
df.loc[gt_seventeen,'time'] = '17:00:00'
print(df.to_string())
>>>
time
0 09:00:00
1 09:30:00
2 17:00:00
3 15:00:00
Or just stick with strings altogether and create the boolean Series using regex patterns and .str.match.
data = {'time':['08:45:00', '09:30:00', '18:00:00', '15:00:00','07:22:00','22:02:06']}
dg = pd.DataFrame(data)
print(dg.to_string())
>>>
time
0 08:45:00
1 09:30:00
2 18:00:00
3 15:00:00
4 07:22:00
5 22:02:06
# regex patterns
pattern_lt_nine = '^00|01|02|03|04|05|06|07|08'
pattern_gt_seventeen = '^17|18|19|20|21|22|23'
Make boolean Series and assign new values
gt_seventeen = dg['time'].str.match(pattern_gt_seventeen)
lt_nine = dg['time'].str.match(pattern_lt_nine)
dg.loc[lt_nine,'time'] = '09:00:00'
dg.loc[gt_seventeen,'time'] = '17:00:00'
print(dg.to_string())
>>>
time
0 09:00:00
1 09:30:00
2 17:00:00
3 15:00:00
4 09:00:00
5 17:00:00
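Because the strings are zero-padded and fixed-width ('HH:MM:SS'), plain lexicographic comparison would also work here and avoids regular expressions entirely; a small sketch of that idea:
lt_nine = dg['time'] < '09:00:00'
gt_seventeen = dg['time'] >= '17:00:00'
dg.loc[lt_nine, 'time'] = '09:00:00'
dg.loc[gt_seventeen, 'time'] = '17:00:00'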
Time series / date functionality
Working with text data
What methods are available to merge columns which have timestamps that do not exactly match?
DF1:
date start_time employee_id session_id
01/01/2016 01/01/2016 06:03:13 7261824 871631182
DF2:
date start_time employee_id session_id
01/01/2016 01/01/2016 06:03:37 7261824 871631182
I could join on the ['date', 'employee_id', 'session_id'], but sometimes the same employee will have multiple identical sessions on the same date which causes duplicates. I could drop the rows where this takes place, but I would lose valid sessions if I did.
Is there an efficient way to join if the timestamp of DF1 is <5 minutes from the timestamp of DF2, and the session_id and employee_id also match? If there is a matching record, then the timestamp will always be slightly later than DF1 because an event is triggered at some future point.
['employee_id', 'session_id', 'timestamp<5minutes']
Edit - I assumed someone would have run into this issue before.
I was thinking of doing this:
Take my timestamp on each dataframe
Create a column which is the timestamp + 5 minutes (rounded)
Create a column which is the timestamp - 5 minutes (rounded)
Create a 10 minute interval string to join the files on
df1['low_time'] = df1['start_time'] - timedelta(minutes=5)
df1['high_time'] = df1['start_time'] + timedelta(minutes=5)
df1['interval_string'] = df1['low_time'].astype(str) + df1['high_time'].astype(str)
Does someone know how to round those 5 minute intervals to the nearest 5 minute mark?
02:59:37 - 5 min = 02:55:00
02:59:37 + 5 min = 03:05:00
interval_string = '02:55:00-03:05:00'
pd.merge(df1, df2, how='left', on=['employee_id', 'session_id', 'date', 'interval_string'])
Does anyone know how to round the time like that? This seems like it could work. You would still match based on the date, employee, and session, and then look for times which fall within the same 10-minute interval or range.
I would try using this method in pandas:
pandas.merge_asof()
The parameters of interest for you would be direction, tolerance, left_on, and right_on.
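A minimal sketch of what that could look like here (column names taken from the question; the exact call is an untested assumption). merge_asof can also require exact matches on the id columns via by=, so sessions only pair up within the tolerance:
import pandas as pd
# both frames must be sorted on the approximate key
df1 = df1.sort_values('start_time')
df2 = df2.sort_values('start_time')
merged = pd.merge_asof(
    df1, df2,
    on='start_time',                      # approximate key (datetime dtype)
    by=['employee_id', 'session_id'],     # exact-match keys
    tolerance=pd.Timedelta('5 minutes'),
    direction='forward',                  # DF2 fires slightly after DF1, per the question
)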
Building off @Igor's answer:
import pandas as pd
from pandas import read_csv
from io import StringIO
# datetime column (combination of date + start_time)
dtc = [['date', 'start_time']]
# index column (above combination)
ixc = 'date_start_time'
df1 = read_csv(StringIO(u'''
date,start_time,employee_id,session_id
01/01/2016,02:03:00,7261824,871631182
01/01/2016,06:03:00,7261824,871631183
01/01/2016,11:01:00,7261824,871631184
01/01/2016,14:01:00,7261824,871631185
'''), parse_dates=dtc)
df2 = read_csv(StringIO(u'''
date,start_time,employee_id,session_id
01/01/2016,02:03:00,7261824,871631182
01/01/2016,06:05:00,7261824,871631183
01/01/2016,11:04:00,7261824,871631184
01/01/2016,14:10:00,7261824,871631185
'''), parse_dates=dtc)
df1['date_start_time'] = pd.to_datetime(df1['date_start_time'])
df2['date_start_time'] = pd.to_datetime(df2['date_start_time'])
# set the combined timestamp as the index while keeping the date_start_time column, so you can validate the merging logic
df1.index = df1['date_start_time']
df2.index = df2['date_start_time']
# the magic happens below, check the direction and tolerance arguments
tol = pd.Timedelta('5 minute')
pd.merge_asof(left=df1,right=df2,right_index=True,left_index=True,direction='nearest',tolerance=tol)
Output:
date_start_time date_start_time_x employee_id_x session_id_x date_start_time_y employee_id_y session_id_y
2016-01-01 02:03:00 2016-01-01 02:03:00 7261824 871631182 2016-01-01 02:03:00 7261824.0 871631182.0
2016-01-01 06:03:00 2016-01-01 06:03:00 7261824 871631183 2016-01-01 06:05:00 7261824.0 871631183.0
2016-01-01 11:01:00 2016-01-01 11:01:00 7261824 871631184 2016-01-01 11:04:00 7261824.0 871631184.0
2016-01-01 14:01:00 2016-01-01 14:01:00 7261824 871631185 NaT NaN NaN
Consider the following mini-version of your problem:
from io import StringIO
from pandas import read_csv, to_datetime
# how close do sessions have to be to be considered equal? (in minutes)
threshold = 5
# datetime column (combination of date + start_time)
dtc = [['date', 'start_time']]
# index column (above combination)
ixc = 'date_start_time'
df1 = read_csv(StringIO(u'''
date,start_time,employee_id,session_id
01/01/2016,02:03:00,7261824,871631182
01/01/2016,06:03:00,7261824,871631183
01/01/2016,11:01:00,7261824,871631184
01/01/2016,14:01:00,7261824,871631185
'''), parse_dates=dtc)
df2 = read_csv(StringIO(u'''
date,start_time,employee_id,session_id
01/01/2016,02:03:00,7261824,871631182
01/01/2016,06:05:00,7261824,871631183
01/01/2016,11:04:00,7261824,871631184
01/01/2016,14:10:00,7261824,871631185
'''), parse_dates=dtc)
which gives
>>> df1
date_start_time employee_id session_id
0 2016-01-01 02:03:00 7261824 871631182
1 2016-01-01 06:03:00 7261824 871631183
2 2016-01-01 11:01:00 7261824 871631184
3 2016-01-01 14:01:00 7261824 871631185
>>> df2
date_start_time employee_id session_id
0 2016-01-01 02:03:00 7261824 871631182
1 2016-01-01 06:05:00 7261824 871631183
2 2016-01-01 11:04:00 7261824 871631184
3 2016-01-01 14:10:00 7261824 871631185
You would like to treat df2[0:3] as duplicates of df1[0:3] when merging (since they are respectively less than 5 minutes apart), but treat df1[3] and df2[3] as separate sessions.
Solution 1: Interval Matching
This is essentially what you are suggesting in your edit. You want to map timestamps in both tables to a 10-minute interval centered on the timestamp rounded to the nearest 5 minutes.
Each interval can be represented uniquely by its midpoint, so you can merge the data frames on the timestamp rounded to the nearest 5 minutes. For example:
import numpy as np
# half-threshold in nanoseconds
threshold_ns = threshold * 60 * 1e9
# compute "interval" to which each session belongs
df1['interval'] = to_datetime(np.round(df1.date_start_time.astype(np.int64) / threshold_ns) * threshold_ns)
df2['interval'] = to_datetime(np.round(df2.date_start_time.astype(np.int64) / threshold_ns) * threshold_ns)
# join
cols = ['interval', 'employee_id', 'session_id']
print(df1.merge(df2, on=cols, how='outer')[cols])
which prints
interval employee_id session_id
0 2016-01-01 02:05:00 7261824 871631182
1 2016-01-01 06:05:00 7261824 871631183
2 2016-01-01 11:00:00 7261824 871631184
3 2016-01-01 14:00:00 7261824 871631185
4 2016-01-01 11:05:00 7261824 871631184
5 2016-01-01 14:10:00 7261824 871631185
Note that this is not totally correct. The sessions df1[2] and df2[2] are not treated as duplicates although they are only 3 minutes apart. This is because they were on different sides of the interval boundary.
Solution 2: One-to-one matching
Here is another approach which depends on the condition that sessions in df1 have either zero or one duplicates in df2.
We replace timestamps in df1 with the closest timestamp in df2 which matches on employee_id and session_id and is less than 5 minutes away.
from datetime import timedelta
# get closest match from "df2" to row from "df1" (as long as it's below the threshold)
def closest(row):
    matches = df2.loc[(df2.employee_id == row.employee_id) &
                      (df2.session_id == row.session_id)]
    deltas = matches.date_start_time - row.date_start_time
    deltas = deltas.loc[deltas <= timedelta(minutes=threshold)]
    try:
        return matches.loc[deltas.idxmin()]
    except ValueError:  # no items
        return row
# replace timestamps in "df1" with closest timestamps in "df2"
df1 = df1.apply(closest, axis=1)
# join
cols = ['date_start_time', 'employee_id', 'session_id']
print(df1.merge(df2, on=cols, how='outer')[cols])
which prints
date_start_time employee_id session_id
0 2016-01-01 02:03:00 7261824 871631182
1 2016-01-01 06:05:00 7261824 871631183
2 2016-01-01 11:04:00 7261824 871631184
3 2016-01-01 14:01:00 7261824 871631185
4 2016-01-01 14:10:00 7261824 871631185
This approach is significantly slower, since you have to search through the entirety of df2 for each row in df1. What I have written can probably be optimized further, but this will still take a long time on large datasets.
I would suggest using the built-in pandas Series dt.round function to round both dataframes to a common time, for example to every 5 minutes. That way the times will always be in a format like 01:00:00, 01:05:00, and so on, and both dataframes will have a similar time index on which to perform the merge.
Please see documentation and examples here pandas.Series.dt.round
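A sketch of what that could look like with the session example above (dt.round is available in modern pandas; column names follow the earlier answers):
# round both sides to the nearest 5 minutes, then merge on the rounded key
df1['rounded'] = df1['date_start_time'].dt.round('5min')
df2['rounded'] = df2['date_start_time'].dt.round('5min')
merged = df1.merge(df2, on=['rounded', 'employee_id', 'session_id'], how='outer')
Note that this shares the boundary caveat of Solution 1: two times just either side of a rounding boundary will not match.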
I have a large data set like this
user category
time
2014-01-01 00:00:00 21155349 2
2014-01-01 00:00:00 56347479 6
2014-01-01 00:00:00 68429517 13
2014-01-01 00:00:00 39055685 4
2014-01-01 00:00:00 521325 13
I want to make it like this:
user category
time
00:00:00 21155349 2
00:00:00 56347479 6
00:00:00 68429517 13
00:00:00 39055685 4
00:00:00 521325 13
How do you do this using pandas?
If you want to mutate a series (column) in pandas, the pattern is to apply a function to it (one that updates one element in the series at a time), and then assign that series back into the dataframe:
import pandas
from io import StringIO
# load data
data = '''date,user,category
2014-01-01 00:00:00, 21155349, 2
2014-01-01 00:00:00, 56347479, 6
2014-01-01 00:00:00, 68429517, 13
2014-01-01 00:00:00, 39055685, 4
2014-01-01 00:00:00, 521325, 13'''
df = pandas.read_csv(StringIO(data))
df['date'] = pandas.to_datetime(df['date'])
# make the required change
without_date = df['date'].apply(lambda d: d.time())
df['date'] = without_date
# display results
print(df)
If the problem is because the date is the index, you've got a few more hoops to jump through:
df = pandas.read_csv(StringIO(data), index_col='date')
ser = pandas.to_datetime(df.index).to_series()
df = df.set_index(ser.apply(lambda d: d.time()))
As suggested by @DSM, if you have a pandas version later than 0.15.2, you can use the .dt accessor on the series to do fast updates.
df = pandas.read_csv(StringIO(data), index_col='date')
ser = pandas.to_datetime(df.index).to_series()
df = df.set_index(ser.dt.time)
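Equivalently, since a DatetimeIndex itself exposes a .time attribute, the intermediate Series can be skipped; a minimal sketch:
df = pandas.read_csv(StringIO(data), index_col='date')
df.index = pandas.to_datetime(df.index).time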
I have a csv file that I am trying to import into pandas.
There are two columns of interest, date and hour, and they are the first two columns.
E.g.
date,hour,...
10-1-2013,0,
10-1-2013,0,
10-1-2013,0,
10-1-2013,1,
10-1-2013,1,
How do I import using pandas so that the hour and date are combined, or is that best done after the initial import?
df = DataFrame.from_csv('bingads.csv', sep=',')
If I do the initial import, how do I combine the two into a date and then delete the hour?
Thanks
Define your own date_parser:
In [291]: from dateutil.parser import parse
In [292]: import datetime as dt
In [293]: def date_parser(x):
   .....:     date, hour = x.split(' ')
   .....:     return parse(date) + dt.timedelta(0, 3600*int(hour))
In [298]: pd.read_csv('test.csv', parse_dates=[[0,1]], date_parser=date_parser)
Out[298]:
date_hour a b c
0 2013-10-01 00:00:00 1 1 1
1 2013-10-01 00:00:00 2 2 2
2 2013-10-01 00:00:00 3 3 3
3 2013-10-01 01:00:00 4 4 4
4 2013-10-01 01:00:00 5 5 5
For your actual data, use read_csv instead of the read_clipboard shown below:
>>> df = pd.read_clipboard(sep=',')
>>> df['date'] = pd.to_datetime(df.date) + pd.to_timedelta(df.hour, unit='D')/24
>>> del df['hour']
>>> df
date ...
0 2013-10-01 00:00:00 NaN
1 2013-10-01 00:00:00 NaN
2 2013-10-01 00:00:00 NaN
3 2013-10-01 01:00:00 NaN
4 2013-10-01 01:00:00 NaN
[5 rows x 2 columns]
Take a look at the parse_dates argument which pandas.read_csv accepts.
You can do something like:
df = pandas.read_csv('some.csv', parse_dates=True)
# in which case pandas will parse all columns where it finds dates
df = pandas.read_csv('some.csv', parse_dates=[i,j,k])
# in which case pandas will parse the i, j and kth columns for dates
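If the hour lives in its own integer column, as in your file, one option (a sketch, assuming the columns are literally named 'date' and 'hour') is to parse only the date column and then add the hour as a timedelta:
import pandas
df = pandas.read_csv('bingads.csv', parse_dates=['date'])
df['date'] = df['date'] + pandas.to_timedelta(df['hour'], unit='h')
del df['hour']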
Since you are only using the two columns from the csv file and combining those into one, I would squeeze it into a series of datetime objects like so:
import pandas as pd
from io import StringIO
import datetime as dt
txt = '''\
date,hour,A,B
10-1-2013,0,1,6
10-1-2013,0,2,7
10-1-2013,0,3,8
10-1-2013,1,4,9
10-1-2013,1,5,10'''
def date_parser(date, hour):
    dates = []
    for ed, eh in zip(date, hour):
        month, day, year = list(map(int, ed.split('-')))
        dates.append(dt.datetime(year, month, day, int(eh)))
    return dates
p = pd.read_csv(StringIO(txt), usecols=[0, 1],
                parse_dates=[[0, 1]], date_parser=date_parser, squeeze=True)
print(p)
Prints:
0 2013-10-01 00:00:00
1 2013-10-01 00:00:00
2 2013-10-01 00:00:00
3 2013-10-01 01:00:00
4 2013-10-01 01:00:00
Name: date_hour, dtype: datetime64[ns]