How to avoid setting a value on a slice of a copy [duplicate] - python
This question already has answers here:
How to deal with SettingWithCopyWarning in Pandas
(20 answers)
Closed 2 years ago.
I am trying to replace string values in a column without creating a copy. I have looked at the docs provided in the warning and also this question. I have also tried using .replace() with the same results. What am I not understanding?
Code:
import pandas as pd
from datetime import timedelta
# set csv file as constant
TRADER_READER = pd.read_csv('TastyTrades.csv')
TRADER_READER['Strategy'] = ''
def iron_condor():
    TRADER_READER['Date'] = pd.to_datetime(TRADER_READER['Date'], format="%Y-%m-%d %H:%M:%S")
    a = 0
    b = 1
    c = 2
    d = 3
    for row in TRADER_READER.index:
        start_time = TRADER_READER['Date'][a]
        end_time = start_time + timedelta(seconds=5)
        e = TRADER_READER.iloc[a]
        f = TRADER_READER.iloc[b]
        g = TRADER_READER.iloc[c]
        h = TRADER_READER.iloc[d]
        if start_time <= f['Date'] <= end_time and f['Underlying Symbol'] == e['Underlying Symbol']:
            if start_time <= g['Date'] <= end_time and g['Underlying Symbol'] == e['Underlying Symbol']:
                if start_time <= h['Date'] <= end_time and h['Underlying Symbol'] == e['Underlying Symbol']:
                    e.loc[e['Strategy']] = 'Iron Condor'
                    f.loc[f['Strategy']] = 'Iron Condor'
                    g.loc[g['Strategy']] = 'Iron Condor'
                    h.loc[h['Strategy']] = 'Iron Condor'
                    print(e, f, g, h)
        if (d + 1) > int(TRADER_READER.index[-1]):
            break
        else:
            a += 1
            b += 1
            c += 1
            d += 1

iron_condor()
Warning:
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
self._setitem_with_indexer(indexer, value)
Hopefully this satisfies the data needed to replicate:
,Date,Type,Action,Symbol,Instrument Type,Description,Value,Quantity,Average Price,Commissions,Fees,Multiplier,Underlying Symbol,Expiration Date,Strike Price,Call or Put
36,2019-12-31 16:01:44,Trade,BUY_TO_OPEN,QQQ 200103P00206500,Equity Option,Bought 1 QQQ 01/03/20 Put 206.50 # 0.07,-7,1,-7,-1.0,-0.14,100.0,QQQ,1/3/2020,206.5,PUT
37,2019-12-31 16:01:44,Trade,BUY_TO_OPEN,QQQ 200103C00217500,Equity Option,Bought 1 QQQ 01/03/20 Call 217.50 # 0.03,-3,1,-3,-1.0,-0.14,100.0,QQQ,1/3/2020,217.5,CALL
38,2019-12-31 16:01:44,Trade,SELL_TO_OPEN,QQQ 200103P00209000,Equity Option,Sold 1 QQQ 01/03/20 Put 209.00 # 0.14,14,1,14,-1.0,-0.15,100.0,QQQ,1/3/2020,209.0,PUT
39,2019-12-31 16:01:44,Trade,SELL_TO_OPEN,QQQ 200103C00214500,Equity Option,Sold 1 QQQ 01/03/20 Call 214.50 # 0.30,30,1,30,-1.0,-0.15,100.0,QQQ,1/3/2020,214.5,CALL
40,2020-01-03 16:08:13,Trade,BUY_TO_CLOSE,QQQ 200103C00214500,Equity Option,Bought 1 QQQ 01/03/20 Call 214.50 # 0.07,-7,1,-7,0.0,-0.14,100.0,QQQ,1/3/2020,214.5,CALL
Expected result:
,Date,Type,Action,Symbol,Instrument Type,Description,Value,Quantity,Average Price,Commissions,Fees,Multiplier,Underlying Symbol,Expiration Date,Strike Price,Call or Put
36,2019-12-31 16:01:44,Trade,BUY_TO_OPEN,QQQ 200103P00206500,Equity Option,Bought 1 QQQ 01/03/20 Put 206.50 # 0.07,-7,1,-7,-1.0,-0.14,100.0,QQQ,1/3/2020,206.5,PUT,Iron Condor
37,2019-12-31 16:01:44,Trade,BUY_TO_OPEN,QQQ 200103C00217500,Equity Option,Bought 1 QQQ 01/03/20 Call 217.50 # 0.03,-3,1,-3,-1.0,-0.14,100.0,QQQ,1/3/2020,217.5,CALL,Iron Condor
38,2019-12-31 16:01:44,Trade,SELL_TO_OPEN,QQQ 200103P00209000,Equity Option,Sold 1 QQQ 01/03/20 Put 209.00 # 0.14,14,1,14,-1.0,-0.15,100.0,QQQ,1/3/2020,209.0,PUT,Iron Condor
39,2019-12-31 16:01:44,Trade,SELL_TO_OPEN,QQQ 200103C00214500,Equity Option,Sold 1 QQQ 01/03/20 Call 214.50 # 0.30,30,1,30,-1.0,-0.15,100.0,QQQ,1/3/2020,214.5,CALL,Iron Condor
40,2020-01-03 16:08:13,Trade,BUY_TO_CLOSE,QQQ 200103C00214500,Equity Option,Bought 1 QQQ 01/03/20 Call 214.50 # 0.07,-7,1,-7,0.0,-0.14,100.0,QQQ,1/3/2020,214.5,CALL,
Let's start with some improvements to the initial part of your code:
The leftmost column of your input file is apparently the index column, so it should be read as the index. The consequence is a slightly different approach to accessing rows (details later).
The Date column can be converted to datetime64 as early as read time.
So the initial part of your code can be:
TRADER_READER = pd.read_csv('Input.csv', index_col=0, parse_dates=['Date'])
TRADER_READER['Strategy'] = ''
Then I decided to organize the loop differently:
indStart is the integer position within the index.
As you process your file in "overlapping" groups of 4 consecutive rows, a more natural way to organize the loop is to stop at the 4th row from the end. So the loop runs over range(TRADER_READER.index.size - 3).
The indices of the 4 rows of interest can be read from the respective slice of the index, i.e. [indStart : indStart + 4].
The check of a particular row can be performed with a nested function.
To avoid your warning, setting values in the Strategy column should be performed using loc on the original DataFrame, with the row parameter for the respective row and the column parameter for Strategy.
The whole update (for the current group of 4 rows) can be performed in a single instruction, specifying the row parameter as a slice from a through d.
So the code can be something like below:
def iron_condor():
    def rowCheck(row):
        return start_time <= row.Date <= end_time and row['Underlying Symbol'] == undSymb

    for indStart in range(TRADER_READER.index.size - 3):
        a, b, c, d = TRADER_READER.index[indStart : indStart + 4]
        e = TRADER_READER.loc[a]
        undSymb = e['Underlying Symbol']
        start_time = e.Date
        end_time = start_time + pd.Timedelta('5S')
        if rowCheck(TRADER_READER.loc[b]) and rowCheck(TRADER_READER.loc[c]) and rowCheck(TRADER_READER.loc[d]):
            TRADER_READER.loc[a:d, 'Strategy'] = 'Iron Condor'
            print('New values:')
            print(TRADER_READER.loc[a:d])
There is no need to increment a, b, c and d, nor is a break needed.
Edit
If for some reason you have to perform other updates on the rows in question, you can change my code accordingly.
But I don't understand "this csv file will make a new column" in your comment. For now, anything you do is performed on the DataFrame in memory. Only after that can you save the DataFrame back to the original file. Note, however, that even your code changes the type of the Date column, so I assume you do this once and from then on the type of this column is just datetime64.
So you should probably change the type of the Date column as a separate operation and then (possibly many times) update the DataFrame and save the updated content back to the source file.
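As a sketch of that workflow (assuming the same TastyTrades.csv file and column names as above; the round-trip back to the same file is only one possible choice):
import pandas as pd

# Read the file once, parsing Date immediately so its type is datetime64 from the start
TRADER_READER = pd.read_csv('TastyTrades.csv', index_col=0, parse_dates=['Date'])
TRADER_READER['Strategy'] = ''

# ... run iron_condor() or any other updates on the in-memory DataFrame ...

# Persist the updated content back to the source file
TRADER_READER.to_csv('TastyTrades.csv')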
Edit following the comment as of 21:22:46Z
re.search('.*TO_OPEN$', row['Action']) returns a re.Match object if
a match has been found, otherwise None.
So you cannot compare this result with the searched string. If you wanted to get
the matched string, you would have to run e.g.:
mtch = re.search('.*TO_OPEN$', row['Action'])
textFound = None
if mtch:
    textFound = mtch.group(0)
But you actually don't need to do that. It is enough to check whether
a match has been found, so the condition can be:
found = bool(re.search('.*TO_OPEN$', row['Action']))
(note that None cast to bool gives False, while a Match object
gives True).
Yet another (probably simpler and quicker) solution is to run just:
row.Action.endswith('TO_OPEN')
without invoking any regex function.
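A tiny illustration of both checks on a sample value (hypothetical variable, just to show the truthiness):
import re

action = 'BUY_TO_OPEN'
# re.search returns a Match object (truthy) or None (falsy), so wrap it in bool()
found_regex = bool(re.search('.*TO_OPEN$', action))   # True
# The plain string method gives the same result without any regex
found_str = action.endswith('TO_OPEN')                # True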
Here is a quite detailed post that not only answers your question but also explains in detail why things behave the way they do.
Deal with SettingWithCopyWarning
In short, if you want to set a value on the original df, either use .replace(inplace=True) or df.loc[condition, theColtoBeSet] = new_val
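A minimal sketch of the second pattern, with made-up column names and condition (the point is that the assignment goes through .loc on the original DataFrame, never on an intermediate copy):
import pandas as pd

df = pd.DataFrame({'Action': ['BUY_TO_OPEN', 'SELL_TO_CLOSE'], 'Strategy': ['', '']})
# Assign directly on the original frame via .loc: no SettingWithCopyWarning
df.loc[df['Action'] == 'BUY_TO_OPEN', 'Strategy'] = 'Iron Condor'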
Related
Adding Leading Zeros to a field with MM:SS time data
I have the following data, which shows a race finish time and pace. As you can see, the data doesn't show the hour part for people who finish before the hour mark, and in order to do some analysis I need to convert it into a time format, but pandas doesn't recognize just the MM:SS format. How can I pad '0:' in front of the rows where the hour is missing? I'm sorry, this is my first time posting.
Considering your data is in csv format:
# reading in the data file
df = pd.read_csv('data_file.csv')

# replacing spaces with '_' in column names
df.columns = [c.replace(' ', '_') for c in df.columns]

for i, row in df.iterrows():
    val_inital = str(row.Gun_time)
    val_final = val_inital.replace(':', '')
    if len(val_final) < 5:
        val_final = "0:" + val_inital
    df.at[i, 'Gun_time'] = val_final

# saving newly edited csv file
df.to_csv('new_data_file.csv')
Before:
  Gun time
0    28:48
1    29:11
2  1:01:51
3    55:01
4  2:08:11
After:
  Gun_time
0  0:28:48
1  0:29:11
2  1:01:51
3  0:55:01
4  2:08:11
You can try to apply the following function to the column you want to change, then convert it to timedelta:
df['Gun time'] = df['Gun time'].apply(lambda x: '0:' + x if len(x) == 5
                                      else ('0:0' + x if len(x) == 4 else x))
df['Gun time'] = pd.to_timedelta(df['Gun time'])
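A vectorized variant of the same idea (a sketch, assuming the column holds strings such as '28:48' and '1:01:51'): rows with only one ':' are missing the hour, so they get an '0:' prefix before the timedelta conversion.
import pandas as pd

df = pd.DataFrame({'Gun time': ['28:48', '29:11', '1:01:51', '55:01', '2:08:11']})
# Rows with a single ':' lack the hour part
needs_hour = df['Gun time'].str.count(':') == 1
df.loc[needs_hour, 'Gun time'] = '0:' + df.loc[needs_hour, 'Gun time']
df['Gun time'] = pd.to_timedelta(df['Gun time'])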
Pandas to modify values in csv file based on function
I have a CSV file that looks like below; this is the same as my last question, but this time using Pandas.
Group       Sam      Dan     Bori      Son     John     Mave
A     0.00258844  0.983322  1.61479  1.27850  1.96963  10.6945
B     0.00260340  0.983305  1.61198  1.26239  1.97420  10.6838
C     0.00261740  0.983294  1.60913  1.24543  1.97877  10.6729
D     0.00263062  0.983289  1.60624  1.22758  1.98334  10.6618
E     0.00264304  0.983290  1.60332  1.20885  1.98791  10.6505
I have a function like below
def getnewno(value):
    value = value + 30
    if value > 40:
        value = value - 20
    else:
        value = value
    return value
I want to send all these values to the getnewno function, get a new value, and update the CSV file. How can this be accomplished in Pandas?
Expected output:
Group        Sam       Dan      Bori       Son      John     Mave
A     30.00258844 30.983322  31.61479  31.27850  31.96963  20.6945
B     30.00260340 30.983305  31.61198  31.26239  31.97420  20.6838
C     30.00261740 30.983294  31.60913  31.24543  31.97877  20.6729
D     30.00263062 30.983289  31.60624  31.22758  31.98334  20.6618
E     30.00264304 30.983290  31.60332  31.20885  31.98791  20.6505
The following should give you what you desire.
Applying a function
Your function can be simplified and here expressed as a lambda function. It's then a matter of applying your function to all of the columns. There are a number of ways to do so. The first idea that comes to mind is to loop over df.columns. However, we can do better than this by using the applymap or transform methods:
import pandas as pd

# Read in the data from file
df = pd.read_csv('data.csv', sep='\s+', index_col=0)

# Simplified function with which to transform data
getnewno = lambda value: value + 10 if value > 10 else value + 30

# Looping over columns
#for col in df.columns:
#    df[col] = df[col].apply(getnewno)

# Apply to all columns without loop
df = df.applymap(getnewno)

# Write out updated data
df.to_csv('data_updated.csv')
Using broadcasting
You can achieve your result using broadcasting and a little boolean logic. This avoids looping over any columns, and should ultimately prove faster and less memory intensive (although if your dataset is small any speed-up would be negligible):
import pandas as pd

df = pd.read_csv('data.csv', sep='\s+', index_col=0)

df += 30
make_smaller = df > 40
df[make_smaller] -= 20
First of all, your getnewno function looks too complicated... it can be simplified to e.g.:
def getnewno(value):
    if value + 30 > 40:
        return value + 10
    else:
        return value + 30
You can even change value + 30 > 40 to value > 10. Or even a one-liner if you want:
getnewno = lambda value: value + 10 if value > 10 else value + 30
Having the function, you can apply it to specific values/columns. For example, if you want to create a column Mark_updated based on a Mark column, it should look like this (I assume your pandas DataFrame is called df):
df['Mark_updated'] = df['Mark'].apply(getnewno)
Use the mask function to do an if-else solution, before writing the data to csv:
res = (df
       .select_dtypes('number')
       .add(30)
       # the if-else comes in here
       # if any entry in the dataframe is greater than 40, subtract 20 from it
       # else leave as is
       .mask(lambda x: x > 40, lambda x: x.sub(20))
       )

# insert the group column back
res.insert(0, 'Group', df.Group.array)
Write to csv:
res.to_csv(filename)

  Group        Sam        Dan      Bori       Son      John     Mave
0     A  30.002588  30.983322  31.61479  31.27850  31.96963  20.6945
1     B  30.002603  30.983305  31.61198  31.26239  31.97420  20.6838
2     C  30.002617  30.983294  31.60913  31.24543  31.97877  20.6729
3     D  30.002631  30.983289  31.60624  31.22758  31.98334  20.6618
4     E  30.002643  30.983290  31.60332  31.20885  31.98791  20.6505
New column based off certain input parameter to select what columns to use - Python
I have a pandas dataframe that includes multiple columns of monthly finance data. I have an input of period that is specified by the person running the program. It's currently just saved as period, like shown below within the code.
# coded into python
period = ??  (user adds this in from input screen)
I need to create another column of data that uses the input period number to perform a calculation on other columns. So, in the above table I'd like to create a new column 'calculation' that depends on the period input. For example, if a period of 1 was used, the following calc1 would be completed (with the math actually done). Period = 2 - then calc2. Period = 3 - then calc3. I only need one column calculated depending on the period number, but added three examples in the below picture to show how it'd work. I can do this in SQL using case when, using the input period to decide which columns to sum:
select Account #, '&Period' AS Period, '&Year' AS YR,
case
  when '&Period' = '1' then sum(d_cf+d_1)
  when '&Period' = '2' then sum(d_cf+d_1+d_2)
  when '&Period' = '3' then sum(d_cf+d_1+d_2+d_3)
I am unsure how to do this easily in python (newer learner). Yes, I could create a column that does each calculation via a new column for every possible period (1-12) and then only select that column, but I'd like to learn to do it in a more efficient way. Can you help more or point me in a better direction?
You could certainly do something like df[['d_cf'] + [f'd_{i}' for i in range(1, period+1)]].sum(axis=1)
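For example, with a small made-up frame (hypothetical column names matching the question's d_cf, d_1, d_2, ... layout), this builds the column list from the period and sums across each row:
import pandas as pd

df = pd.DataFrame({'d_cf': [1, 2], 'd_1': [10, 20], 'd_2': [100, 200], 'd_3': [1000, 2000]})
period = 2
# Sum d_cf plus d_1 .. d_{period} for every row
df['calculation'] = df[['d_cf'] + [f'd_{i}' for i in range(1, period + 1)]].sum(axis=1)
# calculation is now d_cf + d_1 + d_2, i.e. [111, 222]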
You can do this using a simple function in python:
def get_calculation(df, period=None):
    '''
    df = pandas data frame
    period = integer type
    '''
    if period == 1:
        return df.apply(lambda x: x['d_0'] + x['d_1'], axis=1)
    if period == 2:
        return df.apply(lambda x: x['d_0'] + x['d_1'] + x['d_2'], axis=1)
    if period == 3:
        return df.apply(lambda x: x['d_0'] + x['d_1'] + x['d_2'] + x['d_3'], axis=1)

new_df = get_calculation(df, period=1)
Setup:
df = pd.DataFrame({'d_0': list(range(1, 7)),
                   'd_1': list(range(10, 70, 10)),
                   'd_2': list(range(100, 700, 100)),
                   'd_3': list(range(1000, 7000, 1000))})
Setup:
import pandas as pd

ddict = {
    'Year': ['2018', '2018', '2018', '2018', '2018'],
    'Account_Num': ['1111', '1122', '1133', '1144', '1155'],
    'd_cf': ['1', '2', '3', '4', '5'],
}

data = pd.DataFrame(ddict)
Create value calculator:
def get_calcs(period):
    # Convert period to a string
    s = str(period)
    # The repeat count is the period value plus one
    n = int(period) + 1
    # This repeats the period digit (period + 1) times
    return ''.join([i * n for i in s])
Main function copies the data frame, iterates through the period values, and sets calculated values at the correct spot index-wise for each relevant column:
def process_data(data_frame=data, period_column='d_cf'):
    # Copy data_frame argument
    df = data_frame.copy(deep=True)
    # Run through each value in our period column
    for i in df[period_column].values.tolist():
        # Create a temporary column
        new_column = 'd_{}'.format(i)
        # Pass the period into our calculator; capture the result
        calculated_value = get_calcs(i)
        # Create a new column based on our period number
        df[new_column] = ''
        # Use indexing to place the calculated value into our desired location
        df.loc[df[period_column] == i, new_column] = calculated_value
    # Return the result
    return df
Start:
   Year Account_Num d_cf
0  2018        1111    1
1  2018        1122    2
2  2018        1133    3
3  2018        1144    4
4  2018        1155    5
Result:
process_data(data)
   Year Account_Num d_cf d_1  d_2   d_3    d_4     d_5
0  2018        1111    1  11
1  2018        1122    2       222
2  2018        1133    3            3333
3  2018        1144    4                  44444
4  2018        1155    5                         555555
Pandas - DateTime within X amount of minutes from row
I am not entirely positive of the best way to ask or phrase this question, so I will highlight my problem, dataset, my thoughts on the method, and end goal, and hopefully it will be clear by the end.
My problem: My company dispatches workers and will load up dispatches to a single employee even if they are on their current dispatch. This is due to a limitation in the software we use. If an employee receives two dispatches within 30 minutes, we call this a double dispatch. We are analyzing our dispatching efficiency and I am running into a bit of a head scratcher. I need to run through our 100k row database and add an additional column that will read as a dummy variable, 1 for double, 0 for normal. BUT as (a) we have multiple people we dispatch and (b) our records do not start ordered by dispatch, I need to determine how often a dispatch occurs to the same person within 30 minutes.
Dataset: The dataset is incredibly massive due to poor organization in our data warehouse, but these are the columns I will need for my calc.
Tech Name  | Dispatch Time (PST)
John Smith | 1/1/2017 12:34
Jane Smith | 1/1/2017 12:46
John Smith | 1/1/2017 18:32
John Smith | 1/1/2017 18:50
My Thoughts: How I would do it is clunky, and it could work one way but not backwards. I would more or less write my code as:
import pandas as pd

df = pd.read_excel('data.xlsx')
df.sort('Dispatch Time (PST)', inplace = True)

tech_name = None
dispatch_time = pd.to_datetime('1/1/1900 00:00:00')

for index, row in df.iterrows():
    if tech_name is None:
        tech_name = row['Tech Name']
    else:
        if dispatch_time.pd.time_delta('0 Days 00:30:00') > row['Tech Dispatch Time (PST)'] AND row['Tech Name'] = tech_name:
            row['Double Dispatch'] = 1
            dispatch_time = row['Tech Dispatch Time (PST)']
        else:
            dispatch_time = row['Tech Dispatch Time (PST)']
            tech_name = row['Tech Name']
This has many problems, from being slow to only tracking dates going backwards and not forwards, so I will be missing many dispatches.
End Goal: My goal is to have a dataset I can then plug back into Tableau for my report, by adding one column that reads as that dummy variable so I can filter and calculate on it. I appreciate your time and help, and let me know if any more details are necessary. Thank you!
------------------ EDIT -------------
Added an edit to make the question clear, as I failed to do so earlier.
Question: Is Pandas the best tool to use to iterate over my dataframe to see, for each datetime dispatch, whether there is a record that matches the Tech's Name AND is less than 30 minutes away from this record? If so, how could I improve my algorithm or theory; if not, what would the best tool be?
Desired Output - An additional column that records whether a dispatch happened within a 30 minute window, as a dummy variable, 1 for True, 0 for False. I need to see when double dispatches are occurring and how many records are true double dispatches, and not just a count that says there were 100 instances of double dispatch but that involved over 200 records. I need to be able to sort and see each record.
Hello, I think I found a solution. It's slow and only compares one index before or after, but in terms of cases that have 3 dispatches within thirty minutes, this represents less than .5% for us.
import pandas as pd
import numpy as np
import datetime as dt

dispatch = 'Tech Dispatched Date-Time (PST)'
tech = 'CombinedTech'

df = pd.read_excel('combined_data.xlsx')
df.sort_values(dispatch, inplace=True)
df.reset_index(inplace=True)
df['Double Dispatch'] = np.NaN

writer = pd.ExcelWriter('final_output.xlsx', engine='xlsxwriter')
dispatch_count = 0
time = dt.timedelta(minutes=30)

for index, row in df.iterrows():
    try:
        tech_one = df[tech].loc[(index - 1)]
        dispatch_one = df[dispatch].loc[(index - 1)]
    except KeyError:
        tech_one = None
        dispatch_one = pd.to_datetime('1/1/1990 00:00:00')
    try:
        tech_two = df[tech].loc[(index + 1)]
        dispatch_two = df[dispatch].loc[(index + 1)]
    except KeyError:
        tech_two = None
        dispatch_two = pd.to_datetime('1/1/2020 00:00:00')

    first_time = dispatch_one + time
    second_time = pd.to_datetime(row[dispatch]) + time
    dispatch_pd = pd.to_datetime(row[dispatch])

    if tech_one == row[tech] or tech_two == row[tech]:
        if first_time > row[dispatch] or second_time > dispatch_two:
            df.set_value(index, 'Double Dispatch', 1)
            dispatch_count += 1
    else:
        df.set_value(index, 'Double Dispatch', 0)
        dispatch_count += 1

print(dispatch_count)  # This was to monitor total # of records being pushed through
df.to_excel(writer, sheet_name='Sheet1')
writer.save()
writer.close()
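A vectorized alternative that may be worth trying (a sketch only, assuming a flat frame with columns 'Tech Name' and 'Dispatch Time (PST)' as in the question; data.xlsx is the file name from the question's own code): sort by tech and time, take the time gap to the previous and next dispatch of the same tech, and flag rows where either gap is 30 minutes or less.
import pandas as pd

df = pd.read_excel('data.xlsx')
df['Dispatch Time (PST)'] = pd.to_datetime(df['Dispatch Time (PST)'])
df = df.sort_values(['Tech Name', 'Dispatch Time (PST)'])

# Gap to the previous and the next dispatch of the same tech (NaT at group edges)
gap_prev = df.groupby('Tech Name')['Dispatch Time (PST)'].diff()
gap_next = df.groupby('Tech Name')['Dispatch Time (PST)'].diff(-1).abs()

window = pd.Timedelta(minutes=30)
# 1 if either neighbour of the same tech falls within the 30 minute window, else 0
df['Double Dispatch'] = ((gap_prev <= window) | (gap_next <= window)).astype(int)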
Pandas.SHIFT in Multi index frame for temporal dependency
This is my first post so please be gentle. I have searched across the world wide web looking for a solution but I am yet to find one. The problem I'm trying to solve is as follows:
I have a dataset comprised of 500,000+ samples, with 6 features per sample. I have put this dataset in a multiindexed Pandas DataFrame. The first level of my DataFrame is the timeseries index, the second level is the ID. It looks as follows:
Time                           id
2017-03-07 10:06:49.963241984  122.0    -7.024347
                               136.0   -11.664985
                               243.0     1.716150
2017-03-07 10:06:50.003462400  122.0    -7.025922
                               136.0   -11.671526
At every timestamp, a number of objects can be seen and are marked by label 'id'. For my application, I want to add a temporal dependency by including information on what happened 5 seconds ago, i.e. in this example at timestamp 10:06:45. But, importantly, I only want to add this information if at that timestamp the object already existed (so if the id is equal).
I wanted to use the function dataframe.shift, as mentioned here, and I want to do it per level, as indicated by user Unutbu in How do you shift Pandas DataFrame with a multiindex?
My question is as follows: How do I append extra columns to the original dataframe X with information on what those objects were 5s ago? I would expect something like the following
X['x_location_shifted'] = X.groupby(level=1)['x_location'].shift(5*rate)
with the rate being 25Hz, i.e. we shift 125 "DateTimeIndices", but only if an object with id='...' exists at that timestamp.
EDIT: The timestamps are not synchronized 100%, so the timegap is not always exactly equal to 0.04. Previously, I used np.argmin(np.abs(time-index)) to find the closest index to the stamp.
For example, in my set, at timestamp 2017-03-07 10:36:03.605008640 there is an object with id == 175 and location_x = 54.323.
id = 175
X.ix['2017-03-07 10:36:03.605008640', id] = 54.323
At timestamp 2017-03-07 10:36:08.604962560 ..... this object with id=175 has a location_x = 67.165955
id = 175
old_time = pd.to_datetime('2017-03-07 10:36:03.605008640')
new_time = old_time + pd.Timedelta('5 seconds')
# Finding the new value of location
X.ix[np.argmin(np.abs(new_time - X.index.get_level_values(0))), id]
So, finally, at timestep 10:36:08 I want to add the information of timestamp 10:36:03 IF the object already existed at that timestamp.
EDIT2: After trying Maarten Fabré's solution, I came up with my own implementation, which you can find below. If anyone can show me a more pythonic way to do this, please let me know.
for current_time in X.index.get_level_values(0)[125:]:
    # only do if there are objects at current time
    if len(X.ix[current_time].index):
        # Calculate past time
        past_time = current_time - pd.Timedelta('5 seconds')
        # Find index in X.index that is closest to this past time
        past_time_index = np.argmin(np.abs(past_time - X.index.get_level_values(0)))
        # translate the index back to a label
        past_time = X.index[past_time_index][0]
        # in that timestep, cycle the objects
        for obj_id in X.ix[current_time].index:
            # Try looking for the value box_center.x of obj obj_id 5s ago
            try:
                X.ix[(current_time, obj_id), 'box_center.x.shifted'] = X.ix[(past_time, obj_id), 'box_center.x']
                X.ix[(current_time, obj_id), 'box_center.y.shifted'] = X.ix[(past_time, obj_id), 'box_center.y']
                X.ix[(current_time, obj_id), 'relative_velocity.x.shifted'] = X.ix[(past_time, obj_id), 'relative_velocity.x']
                X.ix[(current_time, obj_id), 'relative_velocity.y.shifted'] = X.ix[(past_time, obj_id), 'relative_velocity.y']
            # If the key doesn't exist, the object doesn't exist, ergo the field should be np.nan
            except KeyError:
                X.ix[(current_time, obj_id), 'box_center.x.shift'] = np.nan
    print('Timestep {}'.format(current_time))
If this is not enough information, please say so and I can add it :) Cheers and thanks!
Assuming that you have no gaps in the timestamps, one possible solution might be the following, which creates a new index with shifted timestamps and uses that to get the 5-seconds-ago values for each ID.
offset = 5 * rate

# Create a shallow copy of the multiindex levels for modification
modified_levels = list(X.index.levels)
# Shift them
modified_times = pd.Series(modified_levels[0]).shift(offset)
# Fill NaNs with dummy values to avoid duplicates in the new index
modified_times[modified_times.isnull()] = range(sum(modified_times.isnull()))
modified_levels[0] = modified_times
new_index = X.index.set_levels(modified_levels, inplace=False)

X['x_location_shifted'] = X.loc[new_index, 'x_location'].values
If the timestamps are not 100% regular, then you'll either have to round them to the nearest 1/x second, or use a loop. You could use this as a loop.
Data definition:
import pandas as pd
import numpy as np
from io import StringIO

df_str = """
timestamp	id	location
10:00:00.005	1	a
10:00:00.005	2	b
10:00:00.005	3	c
10:00:05.006	2	a
10:00:05.006	3	b
10:00:05.006	4	c"""

df = pd.DataFrame.from_csv(StringIO(df_str), sep='\t').reset_index()

delta = pd.to_timedelta(5, unit='s')
margin = pd.to_timedelta(1/50, unit='s')

df['location_shifted'] = np.nan
Loop over the different id's:
for label_id in set(df['id']):
    df_id = df[df['id'] == label_id].copy()  # copy to make sure we don't overwrite the original data. Might not be necessary
    df_id['time_shift'] = df['timestamp'] + delta
    for row in df_id.itertuples():
        idx = row.Index
        time_dif = abs(df['timestamp'] - row.time_shift)
        shifted_locs = df_id[time_dif < margin]
        l = len(shifted_locs)
        if l:
            print(shifted_locs)
            if l == 1:
                idx_shift = shifted_locs.index[0]
            else:
                idx_shift = shifted_locs['time_shift'].idxmin()
            df.loc[idx_shift, 'location_shifted'] = df_id.loc[idx, 'location']
Results:
                timestamp  id location location_shifted
0 2017-05-09 10:00:00.005   1        a
1 2017-05-09 10:00:00.005   2        b
2 2017-05-09 10:00:00.005   3        c
3 2017-05-09 10:00:05.006   2        a                b
4 2017-05-09 10:00:05.006   3        b                c
5 2017-05-09 10:00:05.006   4        c
For any of you arriving here with the same question: I managed to solve it in a (minimally) vectorized way, but it required me to return to a 3D panel.
3 steps:
- make into a 3D panel
- add new columns
- fill those columns
From a multi-index 2D frame it's possible to change it to a pandas.Panel, where you convert the 2nd index to one of the axes in the panel. After this I have a 3D panel with axes [time, objects, parameters]. Then, transpose the panel to have the PARAMETERS as items, in order to add columns to the datapanel. So: transpose the panel, add the columns, transpose back.
dp_new = dp.transpose(2,0,1)
dp_new['shifted_box_center_x'] = np.nan
dp_new['shifted_box_center_y'] = np.nan
dp_new['shifted_relative_velocity_x'] = np.nan
dp_new['shifted_relative_velocity_y'] = np.nan

# transpose them back to their original form
dp_new = dp_new.transpose(1,2,0)
Now that we have added the new fields, we can get their names by
new_fields = dp_new.minor_axis[-4:]
The objective is to add information from 5s ago, if that object existed. Therefore, we cycle the time series starting from a moment in time which is 5s in. In my case, at a rate of 25Hz, this is element 5*rate = 125.
Let's first set the time to start from 5s into the datapanel
time = dp_new.items[125:]
Then, we iterate an enumerated version of the time. The enumeration will start at 0, which is the index of the datapanel at timestep = 0. The first timestep, however, is the timestep at time 0+5 seconds.
time = dp_new.items[125:]
for iloc, ts in enumerate(time):
    # Print progress
    print('{} out of {}'.format(ts, dp.items[-1]), end="\r", flush=True)
    # Generate new INDEX field, by taking the field ID and dropping the NaN values
    ids = dp_new.loc[ts].id.dropna().values
    # Drop the nan field from the frame
    dp_new[ts].dropna(thresh=5, inplace=True)
    # save the original indices
    original_index = {'index': dp_new.loc[ts].index, 'id': dp_new.loc[ts].id.values}
    # set the index to field id
    dp_new[ts].set_index(['id'], inplace=True)
    # Check if the vector ids does NOT contain ALL ZEROS
    if np.any(ids):  # Check for all zeros
        df_past = dp_new.iloc[iloc].copy()  # SCREENSHOT AT TS=5s --> ILOC = 0
        df_past.dropna(thresh=5, inplace=True)  # drop the nan rows
        df_past.set_index(['id'], inplace=True)  # set the index to field ID
        dp_new[ts].loc[original_index['id'], new_fields] = df_past[fields].values
This will only fill in fields that have id's == ids. This code was able to run on a 300,000 element file in about 5 minutes.
Note: I spent quite some time on this, mainly because of how one indexes a panel. At first, I thought calling the 3 dimensions would work, as stated in the pandas help, but it seems that this is not the case.
dp_new[ts, ids, new_fields] = values does NOT work.
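For what it's worth, a possibly simpler route on recent pandas versions is pd.merge_asof, which joins each row to the nearest row of the same id within a tolerance. This is a sketch only, assuming X has a MultiIndex (Time, id) and a 'box_center.x' column as in the loop above; the 20 ms tolerance is an assumed jitter margin around the 25 Hz frame spacing.
import pandas as pd

# Flatten the MultiIndex and make sure rows are ordered by time
X_flat = X.reset_index().sort_values('Time')

# Build a "5 seconds ago" lookup table by shifting every observation forward in time
past = X_flat[['Time', 'id', 'box_center.x']].copy()
past['Time'] = past['Time'] + pd.Timedelta(seconds=5)
past = past.rename(columns={'box_center.x': 'box_center.x.shifted'})

# For each row, pick the past observation of the same id closest to 5 s earlier,
# tolerating timestamp jitter; rows without such an observation get NaN
merged = pd.merge_asof(
    X_flat, past,
    on='Time', by='id',
    tolerance=pd.Timedelta('20ms'),
    direction='nearest',
)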