Pandas .shift in a MultiIndex frame for temporal dependency - python
This is my first post so please be gentle. I have searched across the world wide web looking for a solution, but I have yet to find one. The problem I'm trying to solve is as follows:
I have a dataset comprised of 500,000+ samples, with 6 features per sample.
I have put this dataset in a multi-indexed pandas DataFrame.
The first level of my DataFrame is the time-series index, the second level is the ID. It looks as follows:
Time id
2017-03-07 10:06:49.963241984 122.0 -7.024347
136.0 -11.664985
243.0 1.716150
2017-03-07 10:06:50.003462400 122.0 -7.025922
136.0 -11.671526
At every timestamp, a number of objects can be seen, each marked by the label 'id'. For my application, I want to add a temporal dependency by including information
from 5 seconds ago, i.e. in this example from timestamp 10:06:45.
But, importantly, I only want to add this information if the object already existed at that timestamp (i.e. if the same id is present).
I wanted to use the function dataframe.shift, as mentioned here, and I want to do it per level, as indicated by user unutbu in How do you shift Pandas DataFrame with a multiindex?
My question is as follows:
How do I append extra columns to the original DataFrame X with information on what those objects were 5 s ago? I would expect something like the following:
X['x_location_shifted'] = X.groupby(level=1)['x_location'].shift(5*rate)
with the rate being 25 Hz, i.e. we shift 125 "DateTimeIndices", but only if an object with that id exists at that timestamp.
EDIT:
The timestamps are not synchronized 100%, so the time gap is not always exactly 0.04 s. Previously, I used np.argmin(np.abs(time - index)) to find the index closest to the timestamp.
For example, in my set, at timestamp 2017-03-07 10:36:03.605008640 there is an object with id == 175 and location_x = 54.323.
id = 175
X.ix['2017-03-07 10:36:03.605008640', id] = 54.323
At timestamp 2017-03-07 10:36:08.604962560, i.e. 5 seconds later, this object with id = 175 has location_x = 67.165955:
id = 175
old_time = pd.to_datetime('2017-03-07 10:36:03.605008640')
new_time = old_time + pd.Timedelta('5 seconds')
# Finding the new value of location
X.ix[np.argmin(np.abs(new_time - X.index.get_level_values(0))), id]
So, finally, at timestep 10:36:08 I want to add the information from timestamp 10:36:03, IF the object already existed at that timestamp.
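For reference, a vectorized way to express this kind of lookup (not something I tried above) is pd.merge_asof, which matches each row to the nearest row with the same id within a tolerance. The following is only a minimal sketch, assuming the index levels are named 'Time' and 'id', a data column 'x_location', and a tolerance of about half a frame (0.02 s at 25 Hz):

flat = X.reset_index().sort_values('Time')
# what each row will look like once it is 5 s old, keyed by that future time
past = flat.assign(Time=flat['Time'] + pd.Timedelta('5 seconds'))
past = past.rename(columns={'x_location': 'x_location_shifted'})[['Time', 'id', 'x_location_shifted']]
merged = pd.merge_asof(flat, past.sort_values('Time'),
                       on='Time', by='id',
                       direction='nearest',
                       tolerance=pd.Timedelta(milliseconds=20))
X = merged.set_index(['Time', 'id'])   # objects that did not exist 5 s earlier get NaN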
EDIT2:
After trying Maarten Fabré's solution, I came up with my own implementation, which you can find below. If anyone can show me a more pythonic way to do this, please let me know.
for current_time in X.index.get_level_values(0)[125:]:
    # only do this if there are objects at the current time
    if len(X.ix[current_time].index):
        # calculate the past time
        past_time = current_time - pd.Timedelta('5 seconds')
        # find the position in X.index that is closest to this past time
        past_time_index = np.argmin(np.abs(past_time - X.index.get_level_values(0)))
        # translate the position back to a label
        past_time = X.index[past_time_index][0]
        # in that timestep, cycle over the objects
        for obj_id in X.ix[current_time].index:
            # try looking up the value box_center.x of object obj_id 5 s ago
            try:
                X.ix[(current_time, obj_id), 'box_center.x.shifted'] = X.ix[(past_time, obj_id), 'box_center.x']
                X.ix[(current_time, obj_id), 'box_center.y.shifted'] = X.ix[(past_time, obj_id), 'box_center.y']
                X.ix[(current_time, obj_id), 'relative_velocity.x.shifted'] = X.ix[(past_time, obj_id), 'relative_velocity.x']
                X.ix[(current_time, obj_id), 'relative_velocity.y.shifted'] = X.ix[(past_time, obj_id), 'relative_velocity.y']
            # if the key doesn't exist, the object didn't exist 5 s ago, ergo the field should stay np.nan
            except KeyError:
                X.ix[(current_time, obj_id), 'box_center.x.shifted'] = np.nan
    print('Timestep {}'.format(current_time))
If this is not enough information, please say so and I can add it :)
Cheers and thanks!
Assuming that you have no gaps in the timestamps, one possible solution might be the following, which creates a new index with shifted timestamps and uses that to get the 5 seconds-ago values for each ID.
offset = 5 * rate
# Create a shallow copy of the multiindex levels for modification
modified_levels = list(X.index.levels)
# Shift them
modified_times = pd.Series(modified_levels[0]).shift(offset)
# Fill NaNs with dummy values to avoid duplicates in the new index
modified_times[modified_times.isnull()] = range(sum(modified_times.isnull()))
modified_levels[0] = modified_times
new_index = X.index.set_levels(modified_levels, inplace=False)
X['x_location_shifted'] = X.loc[new_index, 'x_location'].values
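One caveat to add here (not from the original answer): in newer pandas versions, .loc raises a KeyError when some of the looked-up labels do not exist, so a reindex-based variant of the last line is more forgiving; missing (timestamp, id) combinations simply become NaN:

X['x_location_shifted'] = X['x_location'].reindex(new_index).values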
If the timestamps are not 100% regular, then you'll either have to round them to the nearest 1/x of a second, or use a loop.
You could use a loop like the following.
Data definition
import pandas as pd
import numpy as np
from io import StringIO
df_str = """
timestamp id location
10:00:00.005 1 a
10:00:00.005 2 b
10:00:00.005 3 c
10:00:05.006 2 a
10:00:05.006 3 b
10:00:05.006 4 c"""
df = pd.read_csv(StringIO(df_str), sep='\t', index_col=0, parse_dates=True).reset_index()  # DataFrame.from_csv is deprecated; read_csv with parse_dates gives the same result
delta = pd.to_timedelta(5, unit='s')
margin = pd.to_timedelta(1/50, unit='s')
df['location_shifted'] = np.nan
Loop over the different id's
for label_id in set(df['id']):
    df_id = df[df['id'] == label_id].copy()  # copy to make sure we don't overwrite the original data. Might not be necessary
    df_id['time_shift'] = df['timestamp'] + delta
    for row in df_id.itertuples():
        idx = row.Index
        time_dif = abs(df['timestamp'] - row.time_shift)
        shifted_locs = df_id[time_dif < margin]
        l = len(shifted_locs)
        if l:
            print(shifted_locs)
            if l == 1:
                idx_shift = shifted_locs.index[0]
            else:
                idx_shift = shifted_locs['time_shift'].idxmin()
            df.loc[idx_shift, 'location_shifted'] = df_id.loc[idx, 'location']
Results
timestamp id location location_shifted
0 2017-05-09 10:00:00.005 1 a
1 2017-05-09 10:00:00.005 2 b
2 2017-05-09 10:00:00.005 3 c
3 2017-05-09 10:00:05.006 2 a b
4 2017-05-09 10:00:05.006 3 b c
5 2017-05-09 10:00:05.006 4 c
For any of you arriving here with the same question: I managed to solve it in a (minimally) vectorized way, but it required me to go back to a 3D panel.
3 steps:
- Make the frame into a 3D panel
- Add the new columns
- Fill those columns
From a multi-index 2D frame it is possible to convert to a pandas.Panel, where the second index level becomes one of the panel's axes.
After this I have a 3D panel with axes [time, objects, parameters]. Then, transpose the panel so that the PARAMETERS are the items, in order to add columns to the data panel. So: transpose the panel, add the columns, and transpose back.
dp_new = dp.transpose(2,0,1)
dp_new['shifted_box_center_x']=np.nan
dp_new['shifted_box_center_y']=np.nan
dp_new['shifted_relative_velocity_x']=np.nan
dp_new['shifted_relative_velocity_y']=np.nan
# transpose them back to their original form
dp_new = dp_new.transpose(1,2,0)
Now that we have added the new fields, we can get their names by
new_fields = dp_new.minor_axis[-4:]
The objective is to add information from 5 s ago, if that object existed. Therefore, we cycle through the time series starting from the moment that is 5 s in. In my case, at a rate of 25 Hz, this is element 5*rate = 125.
Let's first set the time axis to start 5 s into the data panel:
time = dp_new.items[125:]
Then, we iterate over an enumerated version of this time axis. The enumeration starts at 0, which is the position in the data panel at timestep 0, while the first timestep we actually visit is the one at time 0 + 5 seconds.
time = dp_new.items[125:]
# note: `fields` should hold the names of the four original (unshifted) columns that correspond to new_fields
for iloc, ts in enumerate(time):
    # print progress
    print('{} out of {}'.format(ts, dp.items[-1]), end="\r", flush=True)
    # generate the new INDEX field by taking the field id and dropping the NaN values
    ids = dp_new.loc[ts].id.dropna().values
    # drop the NaN rows from the frame
    dp_new[ts].dropna(thresh=5, inplace=True)
    # save the original indices
    original_index = {'index': dp_new.loc[ts].index, 'id': dp_new.loc[ts].id.values}
    # set the index to field id
    dp_new[ts].set_index(['id'], inplace=True)
    # check that the vector ids does NOT contain all zeros
    if np.any(ids):
        df_past = dp_new.iloc[iloc].copy()    # snapshot from 5 s earlier (at ts = 5 s this is iloc = 0)
        df_past.dropna(thresh=5, inplace=True)  # drop the NaN rows
        df_past.set_index(['id'], inplace=True)  # set the index to field id
        dp_new[ts].loc[original_index['id'], new_fields] = df_past[fields].values
This only fills in the new fields for rows whose id is in ids.
This code was able to run on a 300,000-element file in about 5 minutes.
Note: I spent quite some time on this, mainly because of how one indexes a panel. At first, I thought addressing the 3 dimensions directly would work, as stated in the pandas help, but it seems that this is not the case.
dp_new[ts, ids, new_fields] = values does NOT work.
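A note for readers on newer pandas versions (added here, not part of the original answer): pandas.Panel has since been deprecated and removed, but the same "shift whole frames by 125 rows" idea can be written with unstack/stack on the MultiIndex frame itself. A rough sketch, assuming regular 25 Hz sampling, that the second index level is named 'id', and the column names from the question:

cols = ['box_center.x', 'box_center.y', 'relative_velocity.x', 'relative_velocity.y']
wide = X[cols].unstack('id')                 # rows: time, columns: (parameter, id)
past = wide.shift(5 * rate)                  # 125 rows back = 5 s at 25 Hz
past.columns = pd.MultiIndex.from_tuples(
    [(name + '.shifted', obj_id) for name, obj_id in past.columns],
    names=wide.columns.names)
X = X.join(past.stack('id'))                 # align on (time, id); objects absent 5 s ago get NaN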
Related
How to use one dataframe's index to reindex another one in pandas
I am so sorry that I truly don't know what title I should use. But here is my question.

Stocks_Open
           d-1          d-2          d-3          d-4
000001.HR  1817.670960  1808.937405  1796.928768  1804.570628
000002.ZH  4867.910878  4652.713598  4652.713598  4634.904168
000004.HD  92.046474    92.209029    89.526880    96.435445
000005.SS  28.822245    28.636893    28.358865    28.729569
000006.SH  192.362963   189.174626   185.986290   187.403328
000007.SH  79.190528    80.515892    81.509916    78.693516

Stocks_Volume
           d-1       d-2      d-3      d-4
000001.HR  324234    345345   657546   234234
000002.ZH  4867343   465234   4652598  4634168
000004.HD  9246474   929029   826880   965445
000005.SS  2822245   2836893  2858865  2829569
000006.SH  19262963  1897466  1886290  183328
000007.SH  7190528   803892   809916   7693516

Above are the data I parsed from a database. What I exactly want to do is to obtain the correlation of open price and volume over 4 days for each stock (the first column consists of the codes of the different stocks). In other words, I am trying to calculate the correlation of corresponding rows of each DataFrame. (This is only a simplified example; the real data extends to more than 1000 different stocks.)
My attempt is to create a dataframe and to run a loop, assigning the results to that dataframe. But here is the problem: the index of the created dataframe is not exactly what I want. When I tried to append the correlation column, the bug occurred. (Please ignore the values of the correlation, which I concocted here just to give an example.)

r = pd.DataFrame(index=range(6), columns=['c'])
for i in range(6):
    r.iloc[i-1, :] = Stocks_Open.iloc[i-1].corr(Stocks_Volume.iloc[i-1])

Correlation_in_4days = pd.concat([Stocks_Open, Stocks_Volume], axis=1)
Correlation_in_4days['corr'] = r['c']
for i in range(6):
    Correlation_in_4days.iloc[i-1, 8] = r.iloc[i-1, :]

r
        c
1   0.654
2  -0.454
3  0.3321
4  0.2166
5 -0.8772
6  0.3256

The bug occurred: "ValueError: Incompatible indexer with Series". I realized that my correlation dataframe's index is the integer position and not the stock code, but I don't know how to fix it. Is there any help? My ideal result is:

             corr
000001.HR   0.654
000002.ZH  -0.454
000004.HD  0.3321
000005.SS  0.2166
000006.SH -0.8772
000007.SH  0.3256
Try assigning the index back: r.index = Stocks_Open.index
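A loop-free variant (a sketch added for illustration, not part of the original answer) uses DataFrame.corrwith with axis=1, which computes the row-wise correlations directly and keeps the stock codes as the index:

Correlation_in_4days = pd.concat([Stocks_Open, Stocks_Volume], axis=1)
Correlation_in_4days['corr'] = Stocks_Open.corrwith(Stocks_Volume, axis=1)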
How to not set value to slice of copy [duplicate]
This question already has answers here: How to deal with SettingWithCopyWarning in Pandas (20 answers). Closed 2 years ago.
I am trying to replace string values in a column without creating a copy. I have looked at the docs provided in the warning and also this question. I have also tried using .replace() with the same results. What am I not understanding?
Code:

import pandas as pd
from datetime import timedelta

# set csv file as constant
TRADER_READER = pd.read_csv('TastyTrades.csv')
TRADER_READER['Strategy'] = ''

def iron_condor():
    TRADER_READER['Date'] = pd.to_datetime(TRADER_READER['Date'], format="%Y-%m-%d %H:%M:%S")
    a = 0
    b = 1
    c = 2
    d = 3
    for row in TRADER_READER.index:
        start_time = TRADER_READER['Date'][a]
        end_time = start_time + timedelta(seconds=5)
        e = TRADER_READER.iloc[a]
        f = TRADER_READER.iloc[b]
        g = TRADER_READER.iloc[c]
        h = TRADER_READER.iloc[d]
        if start_time <= f['Date'] <= end_time and f['Underlying Symbol'] == e['Underlying Symbol']:
            if start_time <= g['Date'] <= end_time and g['Underlying Symbol'] == e['Underlying Symbol']:
                if start_time <= h['Date'] <= end_time and h['Underlying Symbol'] == e['Underlying Symbol']:
                    e.loc[e['Strategy']] = 'Iron Condor'
                    f.loc[f['Strategy']] = 'Iron Condor'
                    g.loc[g['Strategy']] = 'Iron Condor'
                    h.loc[h['Strategy']] = 'Iron Condor'
                    print(e, f, g, h)
        if (d + 1) > int(TRADER_READER.index[-1]):
            break
        else:
            a += 1
            b += 1
            c += 1
            d += 1

iron_condor()

Warning:

SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self._setitem_with_indexer(indexer, value)

Hopefully this satisfies the data needed to replicate:

,Date,Type,Action,Symbol,Instrument Type,Description,Value,Quantity,Average Price,Commissions,Fees,Multiplier,Underlying Symbol,Expiration Date,Strike Price,Call or Put
36,2019-12-31 16:01:44,Trade,BUY_TO_OPEN,QQQ 200103P00206500,Equity Option,Bought 1 QQQ 01/03/20 Put 206.50 # 0.07,-7,1,-7,-1.0,-0.14,100.0,QQQ,1/3/2020,206.5,PUT
37,2019-12-31 16:01:44,Trade,BUY_TO_OPEN,QQQ 200103C00217500,Equity Option,Bought 1 QQQ 01/03/20 Call 217.50 # 0.03,-3,1,-3,-1.0,-0.14,100.0,QQQ,1/3/2020,217.5,CALL
38,2019-12-31 16:01:44,Trade,SELL_TO_OPEN,QQQ 200103P00209000,Equity Option,Sold 1 QQQ 01/03/20 Put 209.00 # 0.14,14,1,14,-1.0,-0.15,100.0,QQQ,1/3/2020,209.0,PUT
39,2019-12-31 16:01:44,Trade,SELL_TO_OPEN,QQQ 200103C00214500,Equity Option,Sold 1 QQQ 01/03/20 Call 214.50 # 0.30,30,1,30,-1.0,-0.15,100.0,QQQ,1/3/2020,214.5,CALL
40,2020-01-03 16:08:13,Trade,BUY_TO_CLOSE,QQQ 200103C00214500,Equity Option,Bought 1 QQQ 01/03/20 Call 214.50 # 0.07,-7,1,-7,0.0,-0.14,100.0,QQQ,1/3/2020,214.5,CALL

Expected result:

,Date,Type,Action,Symbol,Instrument Type,Description,Value,Quantity,Average Price,Commissions,Fees,Multiplier,Underlying Symbol,Expiration Date,Strike Price,Call or Put
36,2019-12-31 16:01:44,Trade,BUY_TO_OPEN,QQQ 200103P00206500,Equity Option,Bought 1 QQQ 01/03/20 Put 206.50 # 0.07,-7,1,-7,-1.0,-0.14,100.0,QQQ,1/3/2020,206.5,PUT,Iron Condor
37,2019-12-31 16:01:44,Trade,BUY_TO_OPEN,QQQ 200103C00217500,Equity Option,Bought 1 QQQ 01/03/20 Call 217.50 # 0.03,-3,1,-3,-1.0,-0.14,100.0,QQQ,1/3/2020,217.5,CALL,Iron Condor
38,2019-12-31 16:01:44,Trade,SELL_TO_OPEN,QQQ 200103P00209000,Equity Option,Sold 1 QQQ 01/03/20 Put 209.00 # 0.14,14,1,14,-1.0,-0.15,100.0,QQQ,1/3/2020,209.0,PUT,Iron Condor
39,2019-12-31 16:01:44,Trade,SELL_TO_OPEN,QQQ 200103C00214500,Equity Option,Sold 1 QQQ 01/03/20 Call 214.50 # 0.30,30,1,30,-1.0,-0.15,100.0,QQQ,1/3/2020,214.5,CALL,Iron Condor
40,2020-01-03 16:08:13,Trade,BUY_TO_CLOSE,QQQ 200103C00214500,Equity Option,Bought 1 QQQ 01/03/20 Call 214.50 # 0.07,-7,1,-7,0.0,-0.14,100.0,QQQ,1/3/2020,214.5,CALL,
Let's start from some improvements in the initial part of your code:
- The leftmost column of your input file is apparently the index column, so it should be read as the index. The consequence is a slightly different approach to the way rows are accessed (details later).
- The Date column can be converted to datetime64 as early as at reading time.
So the initial part of your code can be:

TRADER_READER = pd.read_csv('Input.csv', index_col=0, parse_dates=['Date'])
TRADER_READER['Strategy'] = ''

Then I decided to organize the loop another way:
- indStart is the integer position within the index. As you process your file in "overlapping" groups of 4 consecutive rows, a more natural way to organize the loop is to stop on the 4th row from the end, so the loop runs over range(TRADER_READER.index.size - 3).
- The indices of the 4 rows of interest can be read from the respective slice of the index, i.e. [indStart : indStart + 4].
- The check of a particular row can be performed with a nested function.
- To avoid your warning, setting values in the Strategy column should be performed using loc on the original DataFrame, with the row parameter for the respective row and the column parameter for Strategy.
- The whole update (for the current group of 4 rows) can be performed in a single instruction, specifying the row parameter as a slice from a through d.
So the code can be something like below:

def iron_condor():
    def rowCheck(row):
        return start_time <= row.Date <= end_time and row['Underlying Symbol'] == undSymb
    for indStart in range(TRADER_READER.index.size - 3):
        a, b, c, d = TRADER_READER.index[indStart : indStart + 4]
        e = TRADER_READER.loc[a]
        undSymb = e['Underlying Symbol']
        start_time = e.Date
        end_time = start_time + pd.Timedelta('5S')
        if rowCheck(TRADER_READER.loc[b]) and rowCheck(TRADER_READER.loc[c]) and rowCheck(TRADER_READER.loc[d]):
            TRADER_READER.loc[a:d, 'Strategy'] = 'Iron Condor'
            print('New values:')
            print(TRADER_READER.loc[a:d])

No need to increment a, b, c and d. Neither is break needed.

Edit
If for some reason you have to do other updates on the rows in question, you can change my code accordingly. But I don't understand "this csv file will make a new column" in your comment. For now, anything you do is performed on the DataFrame in memory. Only after that can you save the DataFrame back to the original file. But note that even your code changes the type of the Date column, so I assume you do it once and that afterwards the type of this column is just datetime64. So you should probably change the type of the Date column as a separate operation and then (possibly many times) update the DataFrame and save the updated content back to the source file.

Edit following the comment as of 21:22:46Z
re.search('.*TO_OPEN$', row['Action']) returns a re.Match object if a match has been found, otherwise None. So you cannot compare this result with the string searched. If you wanted to get the string matched, you should run e.g.:

mtch = re.search('.*TO_OPEN$', row['Action'])
textFound = None
if mtch:
    textFound = mtch.group(0)

But you actually don't need to do it. It is enough to check whether a match has been found, so the condition can be:

found = bool(re.search('.*TO_OPEN$', row['Action']))

(note that None cast to bool returns False and any non-null object returns True).
Yet another (probably simpler and quicker) solution is to run just:

row.Action.endswith('TO_OPEN')

without invoking any regex function.
Here is a quite elaborate post that can not only answer your question but also explain in detail why things are the case: Deal with SettingWithCopyWarning. In short, if you want to set a value on the original df, either use .replace(inplace=True) or df.loc[condition, theColtoBeSet] = new_val.
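For illustration only (a sketch, not code from either answer; the endswith condition is just an example of a possible condition applied to the question's own frame), the .loc pattern looks like this:

mask = TRADER_READER['Action'].str.endswith('TO_OPEN')
TRADER_READER.loc[mask, 'Strategy'] = 'Iron Condor'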
Pandas - DateTime within X amount of minutes from row
I am not entirely positive of the best way to ask or phrase this question, so I will highlight my problem, dataset, my thoughts on the method, and the end goal, and hopefully it will be clear by the end.

My problem:
My company dispatches workers and will load up dispatches to a single employee even if they are on their current dispatch. This is due to a limitation in the software we use. If an employee receives two dispatches within 30 minutes, we call this a double dispatch. We are analyzing our dispatching efficiency and I am running into a bit of a head scratcher. I need to run through our 100k row database and add an additional column that will read as a dummy variable: 1 for double, 0 for normal. BUT as (a) we have multiple people we dispatch and (b) our records do not start ordered by dispatch, I need to determine how often a dispatch occurs to the same person within 30 minutes.

Dataset:
The dataset is incredibly massive due to poor organization in our data warehouse, but these are the columns I will need for my calc.

Tech Name  | Dispatch Time (PST)
John Smith | 1/1/2017 12:34
Jane Smith | 1/1/2017 12:46
John Smith | 1/1/2017 18:32
John Smith | 1/1/2017 18:50

My thoughts:
How I would do it is clunky and it could work one way but not backwards. I would more or less write my code as:

import pandas as pd

df = pd.read_excel('data.xlsx')
df.sort('Dispatch Time (PST)', inplace=True)
tech_name = None
dispatch_time = pd.to_datetime('1/1/1900 00:00:00')
for index, row in df.iterrows():
    if tech_name is None:
        tech_name = row['Tech Name']
    else:
        if dispatch_time.pd.time_delta('0 Days 00:30:00') > row['Tech Dispatch Time (PST)'] AND row['Tech Name'] = tech_name:
            row['Double Dispatch'] = 1
            dispatch_time = row['Tech Dispatch Time (PST)']
        else:
            dispatch_time = row['Tech Dispatch Time (PST)']
            tech_name = row['Tech Name']

This has many problems, from being slow to only tracking dates going backwards and not forwards, so I will be missing many dispatches.

End goal:
My goal is to have a dataset I can then plug back into Tableau for my report by adding one column that reads as that dummy variable, so I can filter and calculate on that. I appreciate your time and help; let me know if any more details are necessary. Thank you!

EDIT
Added an edit to make the question clear, as I failed to do so earlier.
Question: Is pandas the best tool to use to iterate over my dataframe and see, for each datetime dispatch, whether there is a record that matches the tech's name AND is less than 30 minutes away from this record? If so, how could I improve my algorithm or theory; if not, what would the best tool be?
Desired output: An additional column that records whether a dispatch happened within a 30 minute window as a dummy variable, 1 for True, 0 for False. I need to see when double dispatches are occurring and how many records are true double dispatches, not just a count that says there were 100 instances of double dispatch but that involved over 200 records. I need to be able to sort and see each record.
Hello, I think I found a solution. It is slow and only compares one index before or after, but in terms of cases that have 3 dispatches within thirty minutes, this represents less than 0.5% for us.

import pandas as pd
import numpy as np
import datetime as dt

dispatch = 'Tech Dispatched Date-Time (PST)'
tech = 'CombinedTech'

df = pd.read_excel('combined_data.xlsx')
df.sort_values(dispatch, inplace=True)
df.reset_index(inplace=True)
df['Double Dispatch'] = np.NaN
writer = pd.ExcelWriter('final_output.xlsx', engine='xlsxwriter')

dispatch_count = 0
time = dt.timedelta(minutes=30)

for index, row in df.iterrows():
    try:
        tech_one = df[tech].loc[(index - 1)]
        dispatch_one = df[dispatch].loc[(index - 1)]
    except KeyError:
        tech_one = None
        dispatch_one = pd.to_datetime('1/1/1990 00:00:00')
    try:
        tech_two = df[tech].loc[(index + 1)]
        dispatch_two = df[dispatch].loc[(index + 1)]
    except KeyError:
        tech_two = None
        dispatch_two = pd.to_datetime('1/1/2020 00:00:00')
    first_time = dispatch_one + time
    second_time = pd.to_datetime(row[dispatch]) + time
    dispatch_pd = pd.to_datetime(row[dispatch])
    if tech_one == row[tech] or tech_two == row[tech]:
        if first_time > row[dispatch] or second_time > dispatch_two:
            df.set_value(index, 'Double Dispatch', 1)
            dispatch_count += 1
        else:
            df.set_value(index, 'Double Dispatch', 0)
            dispatch_count += 1

print(dispatch_count)  # This was to monitor total # of records being pushed through
df.to_excel(writer, sheet_name='Sheet1')
writer.save()
writer.close()
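A vectorized alternative worth noting (a sketch added here, not from the answer above; it assumes the dispatch column is already datetime64): sort by tech and dispatch time, then compare each row with its neighbours within the same tech using groupby().diff(); a gap of 30 minutes or less in either direction marks a double dispatch.

df = df.sort_values([tech, dispatch])
window = pd.Timedelta(minutes=30)
gap_prev = df.groupby(tech)[dispatch].diff()           # time since the previous dispatch of the same tech
gap_next = df.groupby(tech)[dispatch].diff(-1).abs()   # time until the next dispatch of the same tech
df['Double Dispatch'] = ((gap_prev <= window) | (gap_next <= window)).astype(int)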
Split rows of a dataframe according to time
I have a pandas dataframe df that looks like the following:

df
Out[16]:
   Start     End  Value  Start Realtime  End Realtime  Duration
0      0    2999      1          736051        736051     59.98
1   3000  104999      0          736051        736051   5639.98

For each row, I would need to check the Start Realtime and End Realtime columns and, if they span more than one day (e.g. Start Realtime[0] = 29-05-2016 22:30:00 and End Realtime[0] = 30-05-2016 01:00:00), I should split the row in two: one from Start Realtime = 29-05-2016 22:30:00 until End Realtime = 29-05-2016 23:59:59, and one from Start Realtime = 30-05-2016 00:00:00 until End Realtime = 30-05-2016 01:00:00, keeping the same value in the Value column and recalculating the duration (in seconds) and the start and end columns (in samples).
It would be nice if I could keep the cut-off time (in this example midnight) flexible.
Just take it row by row for starters. The idea is: if you have a row you need to split, then return a dataframe with two rows; otherwise return a dataframe with one. And then append it onto the new dataframe you are creating.

expanded_df = pd.DataFrame()
for i, row in df.iterrows():
    expanded_df = expanded_df.append(applyFunc(row), ignore_index=True)

For each row, create a cutoff time datetime object that is the closest to the start_time but after it. Then just see whether it falls between the start_time and end_time. Finally, if it requires a split, create two new pandas Series to return with the changed values.

def applyFunc(row):
    start_time = datetime.datetime.fromtimestamp(row["Start Realtime"])
    end_time = # Similar to above
    custom_hour = 11
    # custom_minute = ...
    cutoff_time = # Start with datetime.datetime(start_time.year, start_time.month, start_time.day, custom_hour, 0, 0) and see how you need to adjust with datetime.timedelta(...)
    if start_time < cutoff_time < end_time:
        before_cutoff = # Logic for before_cutoff; you will probably find row.set_value("key", value) useful
        after_cutoff = # Logic for after_cutoff series
        return pd.DataFrame([before_cutoff, after_cutoff])
    else:
        return row
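To make the cutoff_time comment above concrete, here is one possible way to compute it (a sketch with hypothetical parameter names, keeping the cut-off clock time flexible): take the cut-off on the same calendar day as start_time and roll it to the next day if it is not after start_time.

import datetime

def next_cutoff(start_time, cutoff_hour=0, cutoff_minute=0):
    # the cut-off clock time on the same calendar day as start_time
    cutoff = start_time.replace(hour=cutoff_hour, minute=cutoff_minute,
                                second=0, microsecond=0)
    # if that moment is not after start_time, use the next day's cut-off
    if cutoff <= start_time:
        cutoff += datetime.timedelta(days=1)
    return cutoff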
Pandas joining based on date
I'm trying to join two dataframes with dates that don't perfectly match up. For a given group/date in the left dataframe, I want to join the corresponding record from the right dataframe with a date just before that of the left dataframe. Probably easiest to show with an example.

df1:
group  date     teacher
a      1/10/00  1
a      2/27/00  1
b      1/7/00   1
b      4/5/00   1
c      2/9/00   2
c      9/12/00  2

df2:
teacher  date     hair length
1        1/1/00   4
1        1/5/00   8
1        1/30/00  20
1        3/20/00  100
2        1/1/00   0
2        8/10/00  50

Gives us:
group  date     teacher  hair length
a      1/10/00  1        8
a      2/27/00  1        20
b      1/7/00   1        8
b      4/5/00   1        100
c      2/9/00   2        0
c      9/12/00  2        50

Edit 1: Hacked together a way to do this. Basically I iterate through every row in df1 and pick out the most recent corresponding entry in df2. It is insanely slow; surely there must be a better way.
One way to do this is to create a new column in the left data frame, which will (for a given row's date) determine the value that is closest and earlier:

df1['join_date'] = df1.date.map(lambda x: df2.date[df2.date <= x].max())

Then a regular join or merge between 'join_date' on the left and 'date' on the right will work. You may need to tweak the function to handle Null values or other corner cases.
This is not very efficient (you are searching the right-hand dates over and over). A more efficient approach is to sort both data frames by the dates, iterate through the left-hand data frame, and consume entries from the right-hand data frame just until the date is larger:

# Assuming df1 and df2 are sorted by the dates
df1['hair length'] = 0  # initialize
r_generator = df2.iterrows()
_, cur_r_row = next(r_generator)
for i, l_row in df1.iterrows():
    cur_hair_length = 0  # Assume 0 works when df1 has a date earlier than df2
    while cur_r_row['date'] <= l_row['date']:
        cur_hair_length = cur_r_row['hair length']
        try:
            _, cur_r_row = next(r_generator)
        except StopIteration:
            break
    df1.loc[i, 'hair length'] = cur_hair_length
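Another option worth mentioning (a sketch added here, not from either answer) is pd.merge_asof, which implements exactly this "most recent earlier record per group" join. It assumes the date columns are parsed as datetimes and both frames are sorted by date:

df1['date'] = pd.to_datetime(df1['date'])
df2['date'] = pd.to_datetime(df2['date'])
merged = pd.merge_asof(df1.sort_values('date'), df2.sort_values('date'),
                       on='date', by='teacher', direction='backward')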
Seems like the quickest way to do this is using sqlite via pysqldf:

def partial_versioned_join(tablea, tableb, tablea_keys, tableb_keys):
    try:
        tablea_group, tablea_date = tablea_keys
        tableb_group, tableb_date = tableb_keys
    except ValueError as e:
        raise ValueError('Need to pass in both a group and date key for both tables') from e

    # Note: can't actually use group here as a field name due to sqlite
    statement = """SELECT a.group, a.{date_a} AS {temp_date}, b.*
                   FROM
                       (SELECT tablea.group, tablea.{date_a}, tablea.{group_a},
                               MAX(tableb.{date_b}) AS tdate
                        FROM tablea
                        JOIN tableb
                            ON tablea.{group_a}=tableb.{group_b}
                            AND tablea.{date_a}>=tableb.{date_b}
                        GROUP BY tablea.{base_id}, tablea.{date_a}, tablea.{group_a}
                       ) AS a
                   JOIN tableb b
                       ON a.{group_a}=b.{group_b}
                       AND a.tdate=b.{date_b};
                """.format(group_a=tablea_group, date_a=tablea_date,
                           group_b=tableb_group, date_b=tableb_date,
                           temp_date='join_date', base_id=base_id)

    # Note: you lose types here for tableb so you may want to save them
    pre_join_tableb = sqldf(statement, locals())

    return pd.merge(tablea, pre_join_tableb, how='inner',
                    left_on=['group'] + tablea_keys,
                    right_on=['group', tableb_group, 'join_date'])