Detection of variable length pattern in pandas dataframe column

The last two columns of a time-series-indexed DataFrame mark the start ('A', 'AA' or 'AAA'), the end ('F', 'FF' or 'FFF') and the duration (number of rows between start and end) of a physical process. Both the A-F sequences and the runs of 'n' between them are of variable length.
How can I identify these patterns and, for each of them, calculate averages of other columns over the corresponding rows?
This is my (poor) attempt:
import pandas as pd
import xlrd
##### EXCEL LOAD
filepath= 'H:\\CCGT GE startup.xlsx'
df = pd.read_excel(filepath,sheet_name='Sheet1',header=0,skiprows=0,parse_cols='A:CO',index_col=0)
df = df.sort_index() # set increasing time index, source data is time decreasing
gas=[]
for i,row in df.iterrows():
    if df['FLAG STARTUP TG1'] is not 'n':
        while 'F' not in df['FLAG STARTUP TG1']:
            gas.append(df['PORTATA GREZZA TG1 - m3/h'])
            gas.append(i)
But the script gets stuck on the first if: it doesn't match the 'n' condition and keeps appending the same (row, i) pair.
p.s. The first 1000 rows of the df are here: http://www.filedropper.com/ccgtgestartup1000
p.p.s. Besides not working, my method is also wrong in that it excludes the last 'F' row, which still belongs to the same process and should be counted as part of it.
p.p.p.s. The two columns refer to two different processes/machines and are (almost) unrelated; I want to run the same analysis on both, averaging different columns for each. The first 'A' string marks the beginning of the process and is repeated until the last timestamp, which is marked with an 'F' string. In the original file the timestamps are descending, which is why I used sort_index(). The string length depends on the values of other columns, but the only obvious correlation between the FLAG columns is in the three-character strings 'AAA' and 'FFF', which should occur only if the two processes start within ±1 timestamp of each other.

This is how I managed to get the desired results (N.B. I later decided that only the single character 'A'-->'F' sequences are of interest)
import pandas as pd
import numpy as np
##### EXCEL LOAD
filepath= 'H:\\CCGT GE startup.xlsx'
df = pd.read_excel(filepath,sheet_name='Sheet1',header=0,skiprows=0,parse_cols='A:CO',index_col=0)
df = df.sort_index() # set increasing time index, source data is time decreasing
tg1 = pd.DataFrame(index=df.index.copy(),columns=['counter','flag','gas','p','raw_p','tv_p','lhv','fs'])
k = 0
for i,row in df.iterrows():
    if 'A' == str(row['FLAG STARTUP TG1']):
        tg1.ix[i,'flag']=row['FLAG STARTUP TG1']
        tg1.ix[i,'gas']=row['Portata gas naturale']
        tg1.ix[i,'counter']=k
        tg1.ix[i,'fs']=row['1FIRED START COUNT - N°']
        tg1.ix[i,'p']=row['POTENZA ATTIVA MONTANTE 1 SU 400 KV - MW']
        tg1.ix[i,'raw_p']=row['POTENZA ATTIVA MONTANTE 1 SU 15 KV - MW']
        tg1.ix[i,'tv_p']=row['POTENZA ATTIVA MONTANTE TV - MW']
        tg1.ix[i,'lhv']=row['LHV - MJ/Sm3']
    elif 'F' == str(row['FLAG STARTUP TG1']):
        tg1.ix[i,'flag']=row['FLAG STARTUP TG1']
        tg1.ix[i,'gas']=row['Portata gas naturale']
        tg1.ix[i,'counter']=k
        tg1.ix[i,'fs']=row['1FIRED START COUNT - N°']
        tg1.ix[i,'p']=row['POTENZA ATTIVA MONTANTE 1 SU 400 KV - MW']
        tg1.ix[i,'raw_p']=row['POTENZA ATTIVA MONTANTE 1 SU 15 KV - MW']
        tg1.ix[i,'tv_p']=row['POTENZA ATTIVA MONTANTE TV - MW']
        tg1.ix[i,'lhv']=row['LHV - MJ/Sm3']
        k+=1
tg1 = tg1.dropna(axis=0)
tg1 = tg1[tg1['gas'] != 0] #data where gas flow measurement is missing is dropped
tg1 = tg1.convert_objects(convert_numeric=True)
#timestamp count for each startup for duration calculation
counts = pd.DataFrame(tg1['counter'].value_counts(),columns=['duration'])
counts['start']=counts.index
counts = counts.set_index(np.arange(len(tg1['counter'].value_counts())))
tg1 = tg1.merge(counts,how='inner',left_on='counter',right_on='start')
# filter out non pertinent startups (too long or too short)
tg1 = tg1[tg1['duration'].isin([6,7])]
#calculate thermal input per start (process)
table = tg1.groupby(['counter']).mean()
table['t_in']=table.apply((lambda row: row['gas']*row['duration']*0.25*row['lhv']/3600),axis=1)
Any improvements, and any suggestions for doing the calculations inside the iteration so as to avoid all the "prep work" after it, are welcome.
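One possible way to avoid the row-by-row loop is to label each A-to-F run with a group number and let groupby do the averaging. The following is only a minimal sketch on made-up data: the column names `flag` and `gas`, the values, and the 15-minute index are assumptions for illustration, not the real spreadsheet. It assumes each run is a block of 'A' rows closed by a single 'F' row.

```python
import pandas as pd

# Toy data (invented): two startups of different length separated by 'n' rows.
df = pd.DataFrame(
    {
        "flag": ["n", "A", "A", "F", "n", "n", "A", "A", "A", "F", "n"],
        "gas": [0.0, 10.0, 20.0, 30.0, 0.0, 0.0, 5.0, 5.0, 5.0, 5.0, 0.0],
    },
    index=pd.date_range("2017-01-01", periods=11, freq="15min"),
)

# Rows that belong to a process, including the closing 'F' row.
in_run = df["flag"].isin(["A", "F"])

# A run ends on its 'F' row, so the run number increases on the row
# that follows each 'F'.
run_id = (df["flag"].shift() == "F").cumsum().rename("run")

# Average and row count (duration) per run, computed in one pass.
summary = (
    df[in_run]
    .groupby(run_id[in_run])["gas"]
    .agg(mean_gas="mean", duration="size")
)
print(summary)
```

Filtering on `summary["duration"]` afterwards would play the role of the `isin([6,7])` step above, and other columns can be averaged by selecting them in place of `"gas"`.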

Related

Matlab - Assigning matching variables between two data sets and creating a new table

I'm currently working with two data sets, as shown below:
Data set 1:
Point_ID      | Record                                              | Difference (m)
'2804AJGCA57' | 'Record003 - 220428_103738_Scanner_1 - 2804AJGCA57' | '0.035240'
'2804AJGCA28' | 'Record003 - 220428_103738_Scanner_1 - 2804AJGCA28' | '0.030961'
'2804AJGCA29' | 'Record003 - 220428_103738_Scanner_1 - 2804AJGCA29' | '0.030219'
Data set 2:
Point_ID     | Easting       | Northing      | Elevation_OD
'2804AJGCA1' | '200305.3884' | '80809.76627' | '7.25913'
'2804AJGCA2' | '200304.9855' | '80809.20396' | '7.23274'
'2804AJGCA3' | '200304.3783' | '80808.51888' | '7.20207'
Essentially, I need to compare the 1st column of both tables and if the 'Point_ID' from the 2nd data set is found within the 1st, I need to add the 'Difference (m)' column onto the corresponding 'Point_ID' row, in the 2nd data set, or a new table.
I hope that this makes sense. Currently I have the following code:
%% Import CSVs
GCA_data = readtable('Input/Test/GCA&GCP_Results_Flight1.csv', 'Delimiter',';', 'Format','%s %s %s'); %Insert pathway to GCA_Results csv.
XYZ_data= readtable('Input/Test/GCA&GCP_Flight1.csv','Delimiter',',','Format','%s %s %s %s'); %Insert pathway to the GCA XYZ file inputted into RiProcess.
%% Pre - Settings
ids = GCA_data.Object1; %Identifies all points that were used within the GCA calculations
nids = numel(ids); % Identifies the number of unique point ids.
gca_table = [];
for ii = 1:nids;
    ID = ids{ii}; %Specifies the point ID.
    idx = ismember()
end

This is a type of join operation. MATLAB can perform several different kinds of join on tables; see the MATLAB documentation on joining tables for more. I'm not sure, but I think you want a "left" join, which is what you need if some entries aren't present, as in this example:
t1 = table(["aaa"; "bbb"; "ccc"], [1; 2; 3], ...
'VariableNames', {'Point_ID', 'Difference (m)'});
t2 = table(["bbb"; "ccc"; "ddd"], [200; 300; 400], ...
'VariableNames', {'Point_ID', 'OtherValue'});
% Match up rows by "Point_ID", get "Difference (m)" if
% available and add to "OtherValue".
% MergeKeys=true means keep only one copy of the key "Point_ID"
outerjoin(t2, t1, type="left", MergeKeys=true)
Which gets:
3×3 table
    Point_ID    OtherValue    Difference (m)
    ________    __________    ______________
    "bbb"       200           2
    "ccc"       300           3
    "ddd"       400           NaN

This is tagged as Python, so here is a Python solution:
import pandas as pd
# header=0 takes the column names from the first row (read_csv rejects header=True)
GCA_data = pd.read_csv('GCA_Results csv', sep=';', header=0)
XYZ_data = pd.read_csv('GCA&GCP_Flight1.csv', sep=',', header=0)
# index both tables by Point_ID so that concat aligns the rows on it
GCA_data = GCA_data.set_index(GCA_data['Point_ID'])
XYZ_data = XYZ_data.set_index(XYZ_data['Point_ID'])
# 'anyname' is a placeholder: use any variable name and output file name you like
anyname = pd.concat([XYZ_data, GCA_data['Difference (m)']], axis=1).reset_index(drop=True)
anyname.to_csv('Anyname', index=False)

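I hope I understood your question correctly. Since no Point_ID in the second table exactly matches one in the first, I compare them by partial match: a row matches when the df2 ID is contained in the df1 ID. To do that, I first convert df1's Point_ID and Difference (m) columns to a dict.
For example, 2804AJGCA28 is treated as a match for 2804AJGCA2.
Code: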
dic = pd.Series(df1['Difference (m)'].values,index=df1['Point_ID']).to_dict()
df2['Difference (m)'] = df2['Point_ID'].apply(lambda x: [ v for k,v in dic.items() if x in k])
df2
Output:
Point_ID Easting Northing Elevation_OD Difference (m)
0 2804AJGCA1 200305.3884 80809.76627 7.25913 []
1 2804AJGCA2 200304.9855 80809.2039 7.23274 [0.030961, 0.030219]
2 2804AJGCA3 200304.3783 80808.51888 7.20207 []
Otherwise, if the Point_ID values match exactly, you can drop the loop and the partial-match logic and just use:
df2['Point_ID'].apply(lambda x: pd.Series(df1['Difference (m)'].values,index=df1['Point_ID']).to_dict()[x] )
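For the exact-match case, a left merge is another option that keeps every row of the second table and leaves NaN where no difference was recorded. This is only a sketch: the frames below are invented stand-ins (the IDs `P1` to `P4` and all values are made up), and the column names follow the question.

```python
import pandas as pd

# Invented stand-in for data set 1 (results with differences).
df1 = pd.DataFrame(
    {
        "Point_ID": ["P2", "P4"],
        "Difference (m)": [0.03, 0.05],
    }
)

# Invented stand-in for data set 2 (coordinates).
df2 = pd.DataFrame(
    {
        "Point_ID": ["P1", "P2", "P3"],
        "Easting": [200305.4, 200305.0, 200304.4],
    }
)

# how="left" keeps all rows of df2; unmatched IDs get NaN in the new column.
merged = df2.merge(df1[["Point_ID", "Difference (m)"]], on="Point_ID", how="left")
print(merged)
```

Unlike the dict lookup, this does not raise a KeyError for IDs that are missing from the first table.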

first attempt at python, error ("IndexError: index 8 is out of bounds for axis 0 with size 8") and efficiency question

I'm learning Python. I just began last week, haven't otherwise coded for about 20 years, and was never that advanced to begin with. I got the hello-world thing down. Now I'm trying to backtest FX pairs. Any help up the learning curve is appreciated, and of course I'm scouring this site while working through my Lynda videos.
I'm getting a funky error, and I'm also wondering whether there are blatantly more efficient ways to loop through columns of Excel data than the way I am.
The spreadsheet being read is simple: 56 FX pairs down column A, and 8 columns across, where the column headers are dates and the cells in each column are the respective FX pair's closing price on that date. The strategy starts at the top of the 2nd column (so that there is a return % that can be calculated vs. the prior period), calculates period-over-period % returns for each pair, identifies which is the maximum, and then "goes long" that highest performer, whose performance in the subsequent period is recorded as PnL to the portfolio ("p" in the code). It loops through that until the most recent column is read.
The error relates to using 8 columns instead of 7: it works when I limit the loop to 7 columns but not 8. When I use 8 I get a wall of text concluding with "IndexError: index 8 is out of bounds for axis 0 with size 8". I get a similar error when I use too many rows, 56 instead of 55; I think I'm missing the bottom row.
Here's my code:
#set up imports
import pandas as pd
#import spreadsheet
x1 = pd.ExcelFile(r"C:\Users\Gamblor\Desktop\Python\test2020.xlsx")
df = pd.read_excel(x1, "Sheet1", header=1)
#define counters for loops
o = 1 # observation counter
c = 3 # column counter
r = 0 # active row counter for sorting through for max
#define identifiers for the portfolio
rpos = 0 # static row, for identifying which currency pair is in column 0 of that row
p = 100 # portfolio size starts at $100
#define the stuff we are evaluating for
pair = df.iat[r,0] # starting pair at 0,0 where each loop will begin
pair_pct_rtn = 0 # starts out at zero, becomes something at first evaluation, then gets compared to each subsequent eval
pair_pct_rtn_calc = 0 # a second version of above, for comparison to prior return
#runs a loop starting at the top to find the max period/period % return in a specific column
while (c < 8): # manually limiting this to 5 columns left to right
    while (r < 55): # i am manually limiting this to 55 data rows per the spreadsheet ... would be better if automatic
        pair_pct_rtn_calc = ((df.iat[r,c])/(df.iat[r,c-1]) - 1)
        if pair_pct_rtn_calc > pair_pct_rtn: # if its a higher return, it must be the "max" to that point
            pair = df.iat[r,0] # identifies the max pair for this column observation, so far
            pair_pct_rtn = pair_pct_rtn_calc # sets pair_pct_rtn as the new max
            rpos = r # identifies the max pair's ROW for this column observation, so far
        r = r + 1 # adds to r in order to jump down and calc the next row
    print('in obs #', o ,', ', pair ,'did best at' ,pair_pct_rtn ,'.')
    o = o + 1
    # now adjust the portfolio by however well USDMXN did in the subsequent week
    p = p * ( 1 + ((df.iat[rpos,c+1])/(df.iat[rpos,c]) - 1))
    print('then the subsequent period it did: ',(df.iat[rpos,c+1])/(df.iat[rpos,c]) - 1)
    print('resulting in portfolio value of', p)
    rpos = 0
    r = 0
    pair_pct_rtn = 0
    c = c + 1 # adds to c in order to move to the next period to the right
print(p)

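Since indices are labelled from 0 onwards, the 8th element you are looking for will have index 7. Note that the loop also reads df.iat[rpos,c+1], so when c is 7 it asks for index 8, which does not exist on an axis of size 8. Likewise, row index 55 (the 56th row) will be your last row.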
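To avoid hard-coding the limits, the loop bounds can be taken from `df.shape`. The sketch below uses a small invented price table (pair names and prices are made up) and stops one column early because the strategy looks one period ahead. It also differs from the question's code in one respect: it always picks the best performer via `idxmax`, even when every return is negative.

```python
import pandas as pd

# Toy price table (invented): pair name in column 0, then four period closes.
df = pd.DataFrame(
    {
        "pair": ["AAA", "BBB", "CCC"],
        "t0": [1.00, 2.00, 4.00],
        "t1": [1.10, 2.00, 3.00],
        "t2": [1.21, 2.40, 3.00],
        "t3": [1.21, 2.64, 3.30],
    }
)

n_rows, n_cols = df.shape  # valid positions are 0 .. n_cols - 1
p = 100.0                  # portfolio starts at $100

# c is the column being ranked and c + 1 is read for the follow-on return,
# so c may go no further than n_cols - 2.
for c in range(2, n_cols - 1):
    returns = df.iloc[:, c] / df.iloc[:, c - 1] - 1
    # With the default RangeIndex the label returned by idxmax is also the row position.
    best = returns.idxmax()
    next_return = df.iat[best, c + 1] / df.iat[best, c] - 1
    p *= 1 + next_return
    print(df.iat[best, 0], "did best; portfolio is now", p)
```

Because the column operations work on whole columns at once, the inner row loop and the manual row limit disappear as well.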

Pandas merge - combination of and and or conditions [duplicate]

Firstly, sorry if this is a bit lengthy, but I wanted to fully describe what I have having problems with and what I have tried already.
I am trying to join (merge) together two dataframe objects on multiple conditions. I know how to do this if the conditions to be met are all 'equals' operators, however, I need to make use of LESS THAN and MORE THAN.
The dataframes represent genetic information: one is a list of mutations in the genome (referred to as SNPs) and the other provides information on the locations of the genes on the human genome. Performing df.head() on these returns the following:
SNP DataFrame (snp_df):
chromosome SNP BP
0 1 rs3094315 752566
1 1 rs3131972 752721
2 1 rs2073814 753474
3 1 rs3115859 754503
4 1 rs3131956 758144
This shows the SNP reference ID and their locations. 'BP' stands for the 'Base-Pair' position.
Gene DataFrame (gene_df):
chromosome chr_start chr_stop feature_id
0 1 10954 11507 GeneID:100506145
1 1 12190 13639 GeneID:100652771
2 1 14362 29370 GeneID:653635
3 1 30366 30503 GeneID:100302278
4 1 34611 36081 GeneID:645520
This dataframe shows the locations of all the genes of interest.
What I want to find out is all of the SNPs which fall within the gene regions in the genome, and discard those that are outside of these regions.
If I wanted to merge together two dataframes based on multiple (equals) conditions, I would do something like the following:
merged_df = pd.merge(snp_df, gene_df, on=['chromosome', 'other_columns'])
However, in this instance - I need to find the SNPs where the chromosome values match those in the Gene dataframe, and the BP value falls between 'chr_start' and 'chr_stop'. What makes this challenging is that these dataframes are quite large. In this current dataset the snp_df has 6795021 rows, and the gene_df has 34362.
I have tried to tackle this by looking at either chromosomes or genes separately. There are 22 different chromosome values (ints 1-22), as the sex chromosomes are not used. Both methods take an extremely long time. One uses the pandasql module, while the other loops through the individual genes.
SQL method
import pandas as pd
import pandasql as psql
pysqldf = lambda q: psql.sqldf(q, globals())
q = """
SELECT s.SNP, g.feature_id
FROM this_snp s INNER JOIN this_genes g
WHERE s.BP >= g.chr_start
AND s.BP <= g.chr_stop;
"""
all_dfs = []
for chromosome in snp_df['chromosome'].unique():
    this_snp = snp_df.loc[snp_df['chromosome'] == chromosome]
    this_genes = gene_df.loc[gene_df['chromosome'] == chromosome]
    genic_snps = pysqldf(q)
    all_dfs.append(genic_snps)
all_genic_snps = pd.concat(all_dfs)
Gene iteration method
all_dfs = []
for line in gene_df.iterrows():
    info = line[1] # Getting the Series object
    this_snp = snp_df.loc[(snp_df['chromosome'] == info['chromosome']) &
                          (snp_df['BP'] >= info['chr_start']) & (snp_df['BP'] <= info['chr_stop'])]
    if this_snp.shape[0] != 0:
        this_snp = this_snp[['SNP']]
        this_snp.insert(len(this_snp.columns), 'feature_id', info['feature_id'])
        all_dfs.append(this_snp)
all_genic_snps = pd.concat(all_dfs)
Can anyone give any suggestions of a more effective way of doing this?

I've just thought of a way to solve this - by combining my two methods:
First, focus on the individual chromosomes, and then loop through the genes in these smaller dataframes. This also doesn't have to make use of any SQL queries either. I've also included a section to immediately identify any redundant genes that don't have any SNPs that fall within their range. This makes use of a double for-loop which I normally try to avoid - but in this case it works quite well.
all_dfs = []
for chromosome in snp_df['chromosome'].unique():
    this_chr_snp = snp_df.loc[snp_df['chromosome'] == chromosome]
    this_genes = gene_df.loc[gene_df['chromosome'] == chromosome]
    # Getting rid of redundant genes
    min_bp = this_chr_snp['BP'].min()
    max_bp = this_chr_snp['BP'].max()
    this_genes = this_genes.loc[~(this_genes['chr_start'] >= max_bp) &
                                ~(this_genes['chr_stop'] <= min_bp)]
    for line in this_genes.iterrows():
        info = line[1]
        this_snp = this_chr_snp.loc[(this_chr_snp['BP'] >= info['chr_start']) &
                                    (this_chr_snp['BP'] <= info['chr_stop'])]
        if this_snp.shape[0] != 0:
            this_snp = this_snp[['SNP']]
            this_snp.insert(1, 'feature_id', info['feature_id'])
            all_dfs.append(this_snp)
all_genic_snps = pd.concat(all_dfs)
While this doesn't run spectacularly quickly - it does run so that I can actually get some answers. I'd still like to know if anyone has any tips to make it run more efficiently though.

You can use the following to accomplish what you're looking for:
merged_df=snp_df.merge(gene_df,on=['chromosome'],how='inner')
merged_df=merged_df[(merged_df.BP>=merged_df.chr_start) & (merged_df.BP<=merged_df.chr_stop)][['SNP','feature_id']]
Note: your example dataframes do not meet your join criteria. Here is an example using modified dataframes:
snp_df
Out[193]:
chromosome SNP BP
0 1 rs3094315 752566
1 1 rs3131972 30400
2 1 rs2073814 753474
3 1 rs3115859 754503
4 1 rs3131956 758144
gene_df
Out[194]:
chromosome chr_start chr_stop feature_id
0 1 10954 11507 GeneID:100506145
1 1 12190 13639 GeneID:100652771
2 1 14362 29370 GeneID:653635
3 1 30366 30503 GeneID:100302278
4 1 34611 36081 GeneID:645520
merged_df
Out[195]:
SNP feature_id
8 rs3131972 GeneID:100302278
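With millions of SNPs, merging on chromosome alone first builds every SNP-gene pair per chromosome, which may not fit in memory. A different technique (not the merge above) is a binary search: sort the SNP positions once per chromosome and use `np.searchsorted` to find, for each gene, the slice of SNPs inside it. The sketch below is an illustration under that assumption; `genic_snps` is a hypothetical helper name, and the data are the modified example frames.

```python
import numpy as np
import pandas as pd


def genic_snps(snp_df, gene_df):
    """Return (SNP, feature_id) pairs where BP lies in [chr_start, chr_stop]."""
    out = []
    for chrom, snps in snp_df.groupby("chromosome"):
        genes = gene_df[gene_df["chromosome"] == chrom]
        if genes.empty:
            continue
        snps = snps.sort_values("BP")
        bp = snps["BP"].to_numpy()
        # For each gene: first SNP position >= chr_start, and one past the
        # last SNP position <= chr_stop.
        lo = np.searchsorted(bp, genes["chr_start"].to_numpy(), side="left")
        hi = np.searchsorted(bp, genes["chr_stop"].to_numpy(), side="right")
        counts = hi - lo
        # Row positions of the matching SNPs, gene by gene.
        snp_pos = np.concatenate([np.arange(a, b) for a, b in zip(lo, hi)])
        out.append(
            pd.DataFrame(
                {
                    "SNP": snps["SNP"].to_numpy()[snp_pos],
                    "feature_id": np.repeat(genes["feature_id"].to_numpy(), counts),
                }
            )
        )
    if not out:
        return pd.DataFrame(columns=["SNP", "feature_id"])
    return pd.concat(out, ignore_index=True)


snp_df = pd.DataFrame(
    {
        "chromosome": [1, 1, 1, 1, 1],
        "SNP": ["rs3094315", "rs3131972", "rs2073814", "rs3115859", "rs3131956"],
        "BP": [752566, 30400, 753474, 754503, 758144],
    }
)
gene_df = pd.DataFrame(
    {
        "chromosome": [1, 1, 1, 1, 1],
        "chr_start": [10954, 12190, 14362, 30366, 34611],
        "chr_stop": [11507, 13639, 29370, 30503, 36081],
        "feature_id": [
            "GeneID:100506145",
            "GeneID:100652771",
            "GeneID:653635",
            "GeneID:100302278",
            "GeneID:645520",
        ],
    }
)

result = genic_snps(snp_df, gene_df)
print(result)
```

Overlapping genes are handled naturally, since a SNP that falls in two genes simply appears in two slices. The output only ever holds the matching pairs, never the full cross product.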

Pandas - DateTime within X amount of minutes from row

I am not entirely positive the best way to ask or phrase this question so I will highlight my problem, dataset, my thoughts on the method and end goal and hopefully it will be clear by the end.
My problem:
My company dispatches workers and will load dispatches onto a single employee even if they are still on their current dispatch. This is due to a limitation in the software we use. If an employee receives two dispatches within 30 minutes, we call this a double dispatch.
We are analyzing our dispatching efficiency and I am running into a bit of a head-scratcher. I need to run through our 100k-row database and add a column that serves as a dummy variable: 1 for a double dispatch, 0 for a normal one. But because (a) we dispatch multiple people and (b) our records are not ordered by dispatch time, I need to determine how often a dispatch goes to the same person within 30 minutes.
Dataset:
The dataset is incredibly massive due to poor organization in our data warehouse but for terms of what items I need these are the columns I will need for my calc.
Tech Name | Dispatch Time (PST)
John Smith | 1/1/2017 12:34
Jane Smith | 1/1/2017 12:46
John Smith | 1/1/2017 18:32
John Smith | 1/1/2017 18:50
My Thoughts:
How I would do it is clunky and it could work one way but not backwards. I would more or less write my code as:
import pandas as pd
df = pd.read_excel('data.xlsx')
df.sort('Dispatch Time (PST)', inplace = True)
tech_name = None
dispatch_time = pd.to_datetime('1/1/1900 00:00:00')
for index, row in df.iterrows():
    if tech_name is None:
        tech_name = row['Tech Name']
    else:
        if dispatch_time.pd.time_delta('0 Days 00:30:00') > row['Tech Dispatch Time (PST)'] AND row['Tech Name'] = tech_name:
            row['Double Dispatch'] = 1
            dispatch_time = row['Tech Dispatch Time (PST)']
        else:
            dispatch_time = row['Tech Dispatch Time (PST)']
            tech_name = row['Tech Name']
This has many problems: it is slow, and it only looks backwards in time, not forwards, so I will miss many dispatches.
End Goal:
My goal is to have a dataset I can then plug back into Tableau for my report by adding on one column that reads as that dummy variable so I can filter and calculate on that.
I appreciate your time and help and let me know if any more details are necessary.
Thank you!
------------------ EDIT -------------
Added an edit to make the question clear, as I failed to do so earlier.
Question: Is pandas the best tool for iterating over my dataframe to check, for each dispatch datetime, whether there is another record that matches the tech's name AND is less than 30 minutes away from it?
If so, how could I improve my algorithm or approach? If not, what would the best tool be?
Desired output: an additional column that records whether a dispatch happened within a 30-minute window, as a dummy variable (1 for True, 0 for False). I need to see when double dispatches are occurring and how many records are true double dispatches: not just a count saying there were 100 instances of double dispatch when those involved over 200 records. I need to be able to sort and see each record.

Hello, I think I found a solution. It is slow and only compares each row with the one immediately before or after it, but cases with 3 dispatches within thirty minutes represent less than 0.5% for us.
import pandas as pd
import numpy as np
import datetime as dt
dispatch = 'Tech Dispatched Date-Time (PST)'
tech = 'CombinedTech'
df = pd.read_excel('combined_data.xlsx')
df.sort_values(dispatch, inplace=True)
df.reset_index(inplace = True)
df['Double Dispatch'] = np.NaN
writer = pd.ExcelWriter('final_output.xlsx', engine='xlsxwriter')
dispatch_count = 0
time = dt.timedelta(minutes = 30)
for index, row in df.iterrows():
try:
tech_one = df[tech].loc[(index - 1)]
dispatch_one = df[dispatch].loc[(index - 1)]
except KeyError:
tech_one = None
dispatch_one = pd.to_datetime('1/1/1990 00:00:00')
try:
tech_two = df[tech].loc[(index + 1)]
dispatch_two = df[dispatch].loc[(index + 1)]
except KeyError:
tech_two = None
dispatch_two = pd.to_datetime('1/1/2020 00:00:00')
first_time = dispatch_one + time
second_time = pd.to_datetime(row[dispatch]) + time
dispatch_pd = pd.to_datetime(row[dispatch])
if tech_one == row[tech] or tech_two == row[tech]:
if first_time > row[dispatch] or second_time > dispatch_two:
df.set_value(index, 'Double Dispatch', 1)
dispatch_count += 1
else:
df.set_value(index, 'Double Dispatch', 0)
dispatch_count += 1
print(dispatch_count) # This was to monitor total # of records being pushed through
df.to_excel(writer,sheet_name='Sheet1')
writer.save()
writer.close()
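A loop-free alternative is to sort by technician and time, then look at the gap to that technician's previous and next dispatch with a grouped `diff`. This is a sketch under assumptions: it uses the column names and the four sample rows from the question, and treats "within 30 minutes" as strictly less than 30 minutes.

```python
import pandas as pd

# The four sample rows from the question.
df = pd.DataFrame(
    {
        "Tech Name": ["John Smith", "Jane Smith", "John Smith", "John Smith"],
        "Dispatch Time (PST)": pd.to_datetime(
            ["1/1/2017 12:34", "1/1/2017 12:46", "1/1/2017 18:32", "1/1/2017 18:50"]
        ),
    }
)

window = pd.Timedelta(minutes=30)

# Sort so that each technician's dispatches are adjacent and in time order.
df = df.sort_values(["Tech Name", "Dispatch Time (PST)"])
times = df.groupby("Tech Name")["Dispatch Time (PST)"]

gap_prev = times.diff()          # time since the same tech's previous dispatch
gap_next = times.diff(-1).abs()  # time until the same tech's next dispatch

# NaT gaps (first or last dispatch of a tech) compare as False.
df["Double Dispatch"] = ((gap_prev < window) | (gap_next < window)).astype(int)

# Restore the original row order.
df = df.sort_index()
print(df)
```

Because the rows are sorted per technician, checking only the nearest neighbours is enough: if any dispatch to the same person lies within the window, the closest one does too, so runs of three or more dispatches are all flagged.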

Pandas.SHIFT in Multi index frame for temporal dependency

This is my first post so please be gentle. I have searched across the world wide web looking for a solution but I am yet to find one. The problem i'm trying to solve is as follows:
I have a dataset, comprised of 500.000+ samples, with 6 features per sample.
I have put this dataset in a multiindexed Pandas DataFrame
The first level of my dataFrame is the timeseries index, the second level is the ID. It looks as follows
Time id
2017-03-07 10:06:49.963241984 122.0 -7.024347
136.0 -11.664985
243.0 1.716150
2017-03-07 10:06:50.003462400 122.0 -7.025922
136.0 -11.671526
At every timestamp a number of objects can be seen, each marked by the label 'id'. For my application, I want to add a temporal dependency by including information from 5 seconds earlier, i.e. in this example from timestamp 10:06:45.
Importantly, I only want to add this information if the object already existed at that timestamp (that is, if the id is equal).
I want to use DataFrame.shift, and I want to do it per level, as indicated by user Unutbu in "How do you shift Pandas DataFrame with a multiindex?"
My question is as follows:
How do I append extra columns to the original dataframe X with information on what those objects were 5s ago. I would expect something like the following
X['x_location_shifted'] = X.groupby(level=1)['x_location'].shift(5*rate)
with the rate being 25Hz, ie. we shift 125 "DateTimeIndices", but, only if an object with id='...' exists at that timestamp.
EDIT:
The timestamps are not synchronized 100%, so the timegap is not always exactly equal to 0.04. Previously, i used np.argmin(np.abs(time-index)) to find the closest index to the stamp.
For example, in my set, at timestamp 2017-03-07 10:36:03.605008640 there is an object with id == 175 and location_x = 54.323.
id = 175
X.ix['2017-03-07 10:36:03.605008640', id] = 54.323
At timestamp 2017-03-07 10:36:08.604962560 ..... this object with id=175 has a location_x = 67.165955
id = 175
old_time = pd.to_datetime('2017-03-07 10:36:03.605008640')
new_time = old_time + pd.Timedelta('5 seconds')
# Finding the new value of location
X.ix[np.argmin(np.abs(new_time - X.index.get_level_values(0))), id]
So, finally, at timestep 10:36:08 i want to add the information of timestamp 10:36:03 IF the object already existed at that timestamp.
EDIT2:
After trying Maarten Fabré's solution, I came up with my own implementation, which you can find below. If anyone can show me a more pythonic way to do this, please let me know.
for current_time in X.index.get_level_values(0)[125:]:
    #only do if there are objects at current time
    if len(X.ix[current_time].index):
        # Calculate past time
        past_time = current_time - pd.Timedelta('5 seconds')
        # Find index in X.index that is closest to this past time
        past_time_index = np.argmin(np.abs(past_time-X.index.get_level_values(0)))
        # translate the index back to a label
        past_time = X.index[past_time_index][0]
        # in that timestep, cycle the objects
        for obj_id in X.ix[current_time].index:
            # Try looking for the value box_center.x of obj obj_id 5s ago
            try:
                X.ix[(current_time, obj_id), 'box_center.x.shifted'] = X.ix[(past_time, obj_id), 'box_center.x']
                X.ix[(current_time, obj_id), 'box_center.y.shifted'] = X.ix[(past_time, obj_id), 'box_center.y']
                X.ix[(current_time, obj_id), 'relative_velocity.x.shifted'] = X.ix[(past_time, obj_id), 'relative_velocity.x']
                X.ix[(current_time, obj_id), 'relative_velocity.y.shifted'] = X.ix[(past_time, obj_id), 'relative_velocity.y']
            # If the key doesnt exist, the object doesn't exist, ergo the field should be np.nan
            except KeyError:
                X.ix[(current_time, obj_id), 'box_center.x.shift'] = np.nan
    print('Timestep {}'.format(current_time))
If this is not enough information, please say so and I can add it :)
Cheers and thanks!

Assuming that you have no gaps in the timestamps, one possible solution might be the following, which creates a new index with shifted timestamps and uses that to get the 5 seconds-ago values for each ID.
offset = 5 * rate
# Create a shallow copy of the multiindex levels for modification
modified_levels = list(X.index.levels)
# Shift them
modified_times = pd.Series(modified_levels[0]).shift(offset)
# Fill NaNs with dummy values to avoid duplicates in the new index
modified_times[modified_times.isnull()] = range(sum(modified_times.isnull()))
modified_levels[0] = modified_times
new_index = X.index.set_levels(modified_levels, inplace=False)
X['x_location_shifted'] = X.loc[new_index, 'x_location'].values

If the timestamps are not 100% regular, then you'll either have to round them to the nearest 1/x second or use a loop.
You could use something like this as a loop:
Data definition
import pandas as pd
import numpy as np
from io import StringIO
df_str = """
timestamp id location
10:00:00.005 1 a
10:00:00.005 2 b
10:00:00.005 3 c
10:00:05.006 2 a
10:00:05.006 3 b
10:00:05.006 4 c"""
df = pd.DataFrame.from_csv(StringIO(df_str), sep='\t').reset_index()
delta = pd.to_timedelta(5, unit='s')
margin = pd.to_timedelta(1/50, unit='s')
df['location_shifted'] = np.nan
Loop over the different id's
for label_id in set(df['id']):
    df_id = df[df['id'] == label_id].copy() # copy to make sure we don't overwrite the original data. Might not be necessary
    df_id['time_shift'] = df['timestamp'] + delta
    for row in df_id.itertuples():
        idx = row.Index
        time_dif = abs(df['timestamp'] - row.time_shift)
        shifted_locs = df_id[time_dif < margin ]
        l = len(shifted_locs)
        if l:
            print(shifted_locs)
            if l == 1:
                idx_shift = shifted_locs.index[0]
            else:
                idx_shift = shifted_locs['time_shift'].idxmin()
            df.loc[idx_shift, 'location_shifted'] = df_id.loc[idx, 'location']
Results
timestamp id location location_shifted
0 2017-05-09 10:00:00.005 1 a
1 2017-05-09 10:00:00.005 2 b
2 2017-05-09 10:00:00.005 3 c
3 2017-05-09 10:00:05.006 2 a b
4 2017-05-09 10:00:05.006 3 b c
5 2017-05-09 10:00:05.006 4 c
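The same nearest-timestamp lookup can also be written without an explicit loop using `pd.merge_asof`, which matches each row to the closest key within a tolerance, separately per `id`. This is a sketch, not a drop-in for the multi-index frame in the question: it rebuilds the flat example table above, with the 5-second delta and the 1/50-second margin from the loop version.

```python
import pandas as pd

# Same example data as in the loop version, as a flat frame.
df = pd.DataFrame(
    {
        "timestamp": pd.to_datetime(
            [
                "2017-05-09 10:00:00.005",
                "2017-05-09 10:00:00.005",
                "2017-05-09 10:00:00.005",
                "2017-05-09 10:00:05.006",
                "2017-05-09 10:00:05.006",
                "2017-05-09 10:00:05.006",
            ]
        ),
        "id": [1, 2, 3, 2, 3, 4],
        "location": ["a", "b", "c", "a", "b", "c"],
    }
)

delta = pd.Timedelta(seconds=5)
margin = pd.Timedelta(milliseconds=20)  # 1/50 s, as in the loop version

# Right-hand table: every observation, re-stamped with the moment
# at which it will be exactly 5 s old.
past = df.rename(columns={"location": "location_shifted"})
past["timestamp"] = past["timestamp"] + delta

# merge_asof needs both sides sorted by the "on" key; mergesort is stable,
# so rows with equal timestamps keep their original order.
result = pd.merge_asof(
    df.sort_values("timestamp", kind="mergesort"),
    past.sort_values("timestamp", kind="mergesort"),
    on="timestamp",
    by="id",              # only match rows of the same object
    direction="nearest",  # closest timestamp, earlier or later
    tolerance=margin,     # but no further away than the margin
)
print(result)
```

Objects that did not exist 5 seconds earlier (ids 1 and 4 here) simply get NaN, which corresponds to the empty cells in the results table.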

For any of you arriving here with the same question: I managed to solve it in a (minimally) vectorized way, but it required me to return to a 3D panel.
3 steps:
- Convert to a 3D panel
- Add new columns
- Fill those columns
From a multi-indexed 2D frame it's possible to switch to a pandas.Panel, converting the 2nd index level into one of the axes of the panel.
After this I have a 3D panel with axes [time, objects, parameters]. Then transpose the panel so that the PARAMETERS become the items, which makes it possible to add columns to the panel. So: transpose the panel, add the columns, transpose back.
dp_new = dp.transpose(2,0,1)
dp_new['shifted_box_center_x']=np.nan
dp_new['shifted_box_center_y']=np.nan
dp_new['shifted_relative_velocity_x']=np.nan
dp_new['shifted_relative_velocity_y']=np.nan
# tranpose them back to their original form
dp_new = dp_new.transpose(1,2,0)
Now that we have added the new fields, we can get their names by
new_fields = dp_new.minor_axis[-4:]
The objective is to add information from 5s ago, if that object existed. Therefore, we cycle the time series from a moment in time which is 5s. In my case, at a rate of 25Hz, this is element 5*rate = 125.
Lets first set the time to start from 5s in the datapanel
time = dp_new.items[125:]
Then, we iterate an enumerated version of the time. The enumeration will start at 0, which is the index of the datapanel at timestep = 0. The first timestep however is the timestep at time 0+5seconds.
time = dp_new.items[125:]
for iloc, ts in enumerate(time):
    # Print progress
    print('{} out of {}'.format(ts, dp.items[-1]) , end="\r", flush=True)
    # Generate new INDEX field, by taking the field ID and dropping the NaN values
    ids = dp_new.loc[ts].id.dropna().values
    # Drop the nan field from the frame
    dp_new[ts].dropna(thresh=5, inplace=True)
    # save the original indices
    original_index = {'index': dp_new.loc[ts].index, 'id': dp_new.loc[ts].id.values}
    # set the index to field id
    dp_new[ts].set_index(['id'], inplace=True)
    # Check if the vector ids does NOT contain ALL ZEROS
    if np.any(ids): # Check for all zeros
        df_past = dp_new.iloc[iloc].copy() # SCREENSHOT AT TS=5s --> ILOC = 0
        df_past.dropna(thresh=5, inplace=True) # drop the nan rows
        df_past.set_index(['id'], inplace=True) # set the index to field ID
        dp_new[ts].loc[original_index['id'], new_fields] = df_past[fields].values
This will only fill in fields that have id's ==ids.
This code was able to run on a 300 000 element file in about 5 minutes.
Note: i spent quite some time on this, mainly because of how one indexes a panel. At first , i thought calling the 3 dimensions would work, as stated in pandas help, but it seems that this is not the case.
dp_new[ts, ids, new_fields] = values does NOT work.
