I have a data frame with a column of start dates and a column of end dates. I want to check the integrity of the dates by ensuring that the start date is before the end date (i.e. start_date < end_date). I have over 14,000 observations to run through.
I have data in the form of:
Start End
0 2008-10-01 2008-10-31
1 2006-07-01 2006-12-31
2 2000-05-01 2002-12-31
3 1971-08-01 1973-12-31
4 1969-01-01 1969-12-31
I have added a column to write the result to, even though I really just want to flag the incorrect rows so I can delete them:
dates['Correct'] = " "
I have begun checking each date pair using the following, where my dataframe is called dates:
for index, row in dates.iterrows():
    if dates.Start[index] < dates.End[index]:
        dates.Correct[index] = "correct"
    elif dates.Start[index] == dates.End[index]:
        dates.Correct[index] = "same"
    elif dates.Start[index] > dates.End[index]:
        dates.Correct[index] = "incorrect"
This works, but it takes a really long time (over 15 minutes). I need more efficient code - is there something I am doing wrong or could improve?
Why not just do it in a vectorized way:
is_correct = dates['Start'] < dates['End']
is_incorrect = dates['Start'] > dates['End']
is_same = ~is_correct & ~is_incorrect
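If you also want the string labels written back into a single column, a small sketch using numpy.select could look like this (assuming Start and End are already datetime columns):
import numpy as np

dates['Correct'] = np.select(
    [dates['Start'] < dates['End'], dates['Start'] > dates['End']],
    ['correct', 'incorrect'],
    default='same',
)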
Since the rows don't need to be compared sequentially, you can gain performance by splitting your dataset and then using multiple processes to perform the comparisons simultaneously. Take a look at the multiprocessing module for help.
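A rough sketch of that split-and-compare idea (illustrative only; dates is assumed to be the dataframe above, and for a comparison this cheap the vectorized approach is usually faster than paying the process start-up cost):
import numpy as np
import pandas as pd
from multiprocessing import Pool

def check_chunk(chunk):
    # Each chunk of rows can be validated independently of the others.
    return chunk['Start'] < chunk['End']

if __name__ == '__main__':
    chunks = np.array_split(dates, 4)          # split the dataframe into 4 row-wise pieces
    with Pool(processes=4) as pool:
        is_correct = pd.concat(pool.map(check_chunk, chunks))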
Something like the following may be quicker:
import pandas as pd
import datetime

df = pd.DataFrame({
    'start': ["2008-10-01", "2006-07-01", "2000-05-01"],
    'end': ["2008-10-31", "2006-12-31", "2002-12-31"],
})

def comparison_check(df):
    start = datetime.datetime.strptime(df['start'], "%Y-%m-%d").date()
    end = datetime.datetime.strptime(df['end'], "%Y-%m-%d").date()
    if start < end:
        return "correct"
    elif start == end:
        return "same"
    return "incorrect"
In [23]: df.apply(comparison_check, axis=1)
Out[23]:
0 correct
1 correct
2 correct
dtype: object
Timings
In [26]: %timeit df.apply(comparison_check, axis=1)
1000 loops, best of 3: 447 µs per loop
So by my calculations, 14,000 rows should take (447 µs / 3 rows) × 14,000 = 149 µs × 14,000 ≈ 2.086 s, which is quite a bit shorter than 15 minutes :)
Related
I want to resample() my daily data into six-month chunks. However, I want the ends of the six-month chunks to be the ends of April and October. If I use df.resample('6M').sum() (or df.groupby(pd.Grouper(freq='6M')).sum()), the end of the first six-month chunk is the end of the first month in the data. I know about anchored offsets, but I do not know how to create a custom anchored offset (e.g., '6M-APR' does not work).
Here is some example code:
import pandas as pd
import numpy as np

np.random.seed(42)
df = pd.DataFrame(
    data={'logret': np.random.randn(1000)},
    index=pd.date_range(start='2001-05-25', periods=1000, freq='B')
)
df.resample('6M').sum()
Which yields the following output:
logret
2001-05-31 2.2950148716254297
2001-11-30 -12.536360930670858
2002-05-31 5.468848462868161
2002-11-30 13.027927629740189
2003-05-31 -10.37282118563155
2003-11-30 -0.156275418330286
2004-05-31 -3.0768727498370905
2004-11-30 28.328856464071546
2005-05-31 -3.6462613215100546
I have not achieved my goal (six-month resampling that ends in April and October) with the start, offset, and loffset arguments to .resample().
I have achieved my goal with the hack below. However, it loses the date index, and I would like a more robust/repeatable approach.
def sixmonth(d, b=4):
    y, m, h = d.year, d.month, 1
    if m > (b + 6):
        y += 1
    elif m > b:
        h += 1
    return y + h/10

df.groupby(sixmonth).sum()
Which yields the following output without a date:
logret
2001.2 -10.300839024148
2002.1 9.321994034984547
2002.2 8.855517878860585
2003.1 -2.4576797445001493
2003.2 -7.002919570231796
2004.1 -9.36895555474087
2004.2 27.13038641177464
2005.1 3.154551390326532
Of course, I could improve this hack. But is there a better/robust/repeatable solution for n-period resampling that ends in arbitrary months?
Another workaround, keeping the datetime index:
def custom_6M(df, month=4):
    df = df.resample("M").sum()
    df = df.rolling(6).sum()
    return df[df.index.month.isin([month, month + 6])]
>>> custom_6M(df)
logret
2001-10-31 -10.300839
2002-04-30 9.321994
2002-10-31 8.855518
2003-04-30 -2.457680
2003-10-31 -7.002920
2004-04-30 -9.368956
2004-10-31 27.130386
It's a pain. When I needed something similar, I ended up with the following approach:
anchor_month = 4
non_months = (anchor_month + 3) % 12, (anchor_month + 9) % 12

df = df.resample('Q-APR').sum()
df = (df.reset_index()
        .groupby(df.index.month.isin(non_months).cumsum())
        .agg({'index': 'last', 'logret': 'sum'})
        .set_index('index'))
Result here:
logret
index
2001-10-31 -10.300839
2002-04-30 9.321994
2002-10-31 8.855518
2003-04-30 -2.457680
2003-10-31 -7.002920
2004-04-30 -9.368956
2004-10-31 27.130386
2005-04-30 3.154551
But the problem is that sometimes the last index doesn't fit (it is okay here). That can be fixed by another '6M' resample. Overall: not pretty.
Thanks for the answers.
I have two more options.
Append a time-stamped series to df to anchor the six-month resampling periods
I hoped that .resample()'s origin argument would let me manually anchor my six-month resampling periods. It doesn't, but the following code does.
df.append(pd.Series(name=pd.to_datetime('2001-04-30'), dtype='float')).resample('6M').sum()
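Note that DataFrame.append was removed in pandas 2.0; on newer versions the same anchoring trick could be written with pd.concat, roughly like this:
anchor = pd.DataFrame(index=[pd.to_datetime('2001-04-30')], columns=df.columns, dtype='float')
pd.concat([anchor, df]).resample('6M').sum()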
Improve my sixmonth() function to use timestamps
def sixmonth(d, m=6, n=4):
    o = (m - (d.month - n)) % m
    return d + pd.offsets.MonthEnd(o)
I first .resample('M') to make sure that I have end-of-month dates.
I could modify sixmonth() to check for end-of-month dates, but I'm more afraid of finding some new edge case than a little inefficiency.
df.resample('M').sum().groupby(sixmonth).sum()
I am reading in a .csv file and creating a pandas dataframe. The file is a file of stocks. I am only interested in the date, the company, and the closing cost. I want my program to find the max profit with the starting date, the ending date and the company. It needs to use the divide and conquer algorithm. I only know how to use for loops but it takes forever to run. The .csv file is 200,000 rows. How can I get this to run fast?
import pandas as pd
import numpy as np
import math

def cleanData(file):
    df = pd.read_csv(file)
    del df['open']
    del df['low']
    del df['high']
    del df['volume']
    return np.array(df)

df = cleanData('prices-split-adjusted.csv')

bestStock = [None, None, None, float(-math.inf)]

def DAC(data):
    global bestStock
    if len(data) > 1:
        mid = len(data)//2
        left = data[:mid]
        right = data[mid:]
        DAC(left)
        DAC(right)
        for i in range(len(data)):
            for j in range(i+1, len(data)):
                if data[i,1] == data[j,1]:
                    profit = data[j,2] - data[i,2]
                    if profit > bestStock[3]:
                        bestStock[0] = data[i,0]
                        bestStock[1] = data[j,0]
                        bestStock[2] = data[i,1]
                        bestStock[3] = profit
                        print(bestStock)
                        print('\n')
    return bestStock

print(DAC(df))
I've got two things for your consideration (my answer tries not to change your algorithmic approach, i.e. the nested loops and recursive functions, and tackles the low-hanging fruit first):
Unless you are debugging, try to avoid print() inside a loop (in your case, print(bestStock)). The I/O overhead can add up, especially if you are looping across large datasets and printing to screen often. Once you are OK with your code, comment it out for runs on your full dataset and uncomment it only during debugging sessions. You can expect some improvement in speed just by not printing to screen inside the loop.
If you are after even more ways to speed it up, I found in my case (similar to yours, which I often encounter especially in search/sort problems) that simply switching the expensive part (the Python for loops) to Cython, and statically defining variable types (this is KEY to speed!), gives me several orders of magnitude speed-up even before optimizing the implementation. Check out Cython: https://cython.readthedocs.io/en/latest/index.html. If that's not enough, then parallelism is your next best friend, which would require rethinking your code implementation.
The main problems causing the slow performance are:
You manually iterate over 2 columns in nested loops without using pandas operations, which make use of fast ndarray functions;
you use recursive calls, which look nice and simple but are slow.
Setting the sample data as follows:
Date Company Close
0 2019-12-31 AAPL 73.412498
1 2019-12-31 FB 205.250000
2 2019-12-31 NFLX 323.570007
3 2020-01-02 AAPL 75.087502
4 2020-01-02 FB 209.779999
... ... ... ...
184 2020-03-30 FB 165.949997
185 2020-03-30 NFLX 370.959991
186 2020-03-31 AAPL 63.572498
187 2020-03-31 FB 166.800003
188 2020-03-31 NFLX 375.500000
189 rows × 3 columns
Then use the following code (modify the column labels if yours are different):
df_result = df.groupby('Company').agg(
    Start_Date=pd.NamedAgg(column='Date', aggfunc="first"),
    End_Date=pd.NamedAgg(column='Date', aggfunc="last"),
    bestGain=pd.NamedAgg(column='Close', aggfunc=lambda x: x.max() - x.iloc[0]),
)
Resulting output:
Start_Date End_Date bestGain
Company
AAPL 2019-12-31 2020-03-31 8.387505
FB 2019-12-31 2020-03-31 17.979996
NFLX 2019-12-31 2020-03-31 64.209991
To get the entry with greatest gain:
df_result.loc[df_result['bestGain'].idxmax()]
Resulting output:
Start_Date 2019-12-31 00:00:00
End_Date 2020-03-31 00:00:00
bestGain 64.209991
Name: NFLX, dtype: object
Execution time comparison
With my scaled-down data of 3 stocks over 3 months, the code using pandas functions takes 8.9 ms, about half the execution time of the original code that manually iterates over the numpy array with nested loops and recursive calls (16.9 ms), even after most of the print() calls were removed.
Your code with the print() calls inside the DAC() function removed:
%%timeit
"""
def cleanData(df):
    # df = pd.read_csv(file)
    del df['Open']
    del df['Low']
    del df['High']
    del df['Volume']
    return np.array(df)
"""
# df = cleanData('prices-split-adjusted.csv')
# df = cleanData(df0)
df = np.array(df0)

bestStock = [None, None, None, float(-math.inf)]

def DAC(data):
    global bestStock
    if len(data) > 1:
        mid = len(data)//2
        left = data[:mid]
        right = data[mid:]
        DAC(left)
        DAC(right)
        for i in range(len(data)):
            for j in range(i+1, len(data)):
                if data[i,1] == data[j,1]:
                    profit = data[j,2] - data[i,2]
                    if profit > bestStock[3]:
                        bestStock[0] = data[i,0]
                        bestStock[1] = data[j,0]
                        bestStock[2] = data[i,1]
                        bestStock[3] = profit
                        # print(bestStock)
                        # print('\n')
    return bestStock

print(DAC(df))
[Timestamp('2020-03-16 00:00:00'), Timestamp('2020-03-31 00:00:00'), 'NFLX', 76.66000366210938]
16.9 ms ± 303 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
New simplified code in pandas' way of coding:
%%timeit
df_result = df.groupby('Company').agg(
    Start_Date=pd.NamedAgg(column='Date', aggfunc="first"),
    End_Date=pd.NamedAgg(column='Date', aggfunc="last"),
    bestGain=pd.NamedAgg(column='Close', aggfunc=lambda x: x.max() - x.iloc[0]),
)
df_result.loc[df_result['bestGain'].idxmax()]
8.9 ms ± 195 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Solution using a recursive function:
The main problem with your recursive function is that you did not make use of the results of the recursive calls on the reduced-size data.
To properly use recursive function as a divide-and-conquer approach, you should take 3 major steps:
Divide the whole set of data into smaller pieces and handle the smaller pieces by recursive calls each taking one of the smaller pieces
Handle the end-point case (the easiest case most of the time) in each recursive call
Consolidate the results of all recursive calls of smaller pieces
The beauty of recursive calls is that you can solve a complicated problem by replacing the processing with 2 much easier steps: the 1st step is to handle the end-point case, where most of the time you only have to handle ONE data item (which is usually easy). The 2nd step is just another easy step: consolidating the results of the reduced-size calls.
You managed to take the first step but not the other 2 steps. In particular, you did not take advantage of simplifying the processing by making use of the results of the smaller pieces. Instead, you handle the whole set of data in each call by looping over all rows of the 2-dimensional numpy array. The nested-loop logic is just like a bubble sort [with complexity O(n²) instead of O(n)]. Hence, your recursive calls just waste time without adding value!
I suggest modifying your recursive function as follows:
def DAC(data):
    # global bestStock                  # define bestStock as a local variable instead
    bestStock = [None, None, None, float(-math.inf)]    # init bestStock
    if len(data) == 1:                  # End-point case: data = 1 row
        bestStock[0] = data[0,0]
        bestStock[1] = data[0,0]
        bestStock[2] = data[0,1]
        bestStock[3] = 0.0
    elif len(data) == 2:                # End-point case: data = 2 rows
        bestStock[0] = data[0,0]
        bestStock[1] = data[1,0]
        bestStock[2] = data[0,1]        # Enhance here to allow stock break
        bestStock[3] = data[1,2] - data[0,2]
    elif len(data) >= 3:                # Recursive calls and consolidation of results
        mid = len(data)//2
        left = data[:mid]
        right = data[mid:]
        bestStock_left = DAC(left)
        bestStock_right = DAC(right)
        # Now make use of the results of divide-and-conquer and consolidate the results
        bestStock[0] = bestStock_left[0]
        bestStock[1] = bestStock_right[1]
        bestStock[2] = bestStock_left[2]        # Enhance here to allow stock break
        bestStock[3] = bestStock_left[3] if bestStock_left[3] >= bestStock_right[3] else bestStock_right[3]
        # print(bestStock)
        # print('\n')
    return bestStock
Here we need to handle 2 kinds of end-point cases: 1 row and 2 rows. The reason is that for the case with only 1 row, we cannot calculate the gain and can only set the gain to zero. Gain can start to be calculated with 2 rows. If we did not split into these 2 end-point cases, we could end up only propagating zero gain all the way up.
Here is a demo of how you should code the recursive calls to take advantage of them. There are limitations in the code that you still need to fine-tune: you have to enhance it further to handle the stock-break case, since the code for the 2-row and >= 3-row cases currently assumes no stock break.
I am not entirely sure of the best way to ask or phrase this question, so I will describe my problem, dataset, my thoughts on the method, and the end goal, and hopefully it will be clear by the end.
My problem:
My company dispatches workers and will load up dispatches to a single employee even if they are on their current dispatch. This is due to a limitation in the software we use. If an employee receives two dispatches within 30 minutes, we call this a double dispatch.
We are analyzing our dispatching efficiency and I am running into a bit of a head-scratcher. I need to run through our 100k-row database and add a column that will read as a dummy variable: 1 for double, 0 for normal. But since (a) we dispatch multiple people and (b) our records are not ordered by dispatch time, I need to determine how often a dispatch occurs to the same person within 30 minutes.
Dataset:
The dataset is incredibly massive due to poor organization in our data warehouse but for terms of what items I need these are the columns I will need for my calc.
Tech Name | Dispatch Time (PST)
John Smith | 1/1/2017 12:34
Jane Smith | 1/1/2017 12:46
John Smith | 1/1/2017 18:32
John Smith | 1/1/2017 18:50
My Thoughts:
How I would do it is clunky and it could work one way but not backwards. I would more or less write my code as:
import pandas as pd

df = pd.read_excel('data.xlsx')
df.sort_values('Dispatch Time (PST)', inplace=True)

tech_name = None
dispatch_time = pd.to_datetime('1/1/1900 00:00:00')

for index, row in df.iterrows():
    if tech_name is None:
        tech_name = row['Tech Name']
    else:
        if dispatch_time + pd.Timedelta('0 days 00:30:00') > row['Tech Dispatch Time (PST)'] and row['Tech Name'] == tech_name:
            df.loc[index, 'Double Dispatch'] = 1
            dispatch_time = row['Tech Dispatch Time (PST)']
        else:
            dispatch_time = row['Tech Dispatch Time (PST)']
            tech_name = row['Tech Name']
This has many problems: it is slow, and it only tracks dates going backwards, not forwards, so I will miss many dispatches.
End Goal:
My goal is to have a dataset I can then plug back into Tableau for my report by adding on one column that reads as that dummy variable so I can filter and calculate on that.
I appreciate your time and help and let me know if any more details are necessary.
Thank you!
------------------ EDIT -------------
Added an edit to make the question clearer, as I failed to do so earlier.
Question: Is pandas the best tool to use to iterate over my dataframe and check, for each datetime dispatch, whether there is a record that matches the tech's name AND is less than 30 minutes away from this record?
If so, how could I improve my algorithm or approach? If not, what would the best tool be?
Desired output - An additional column that records, as a dummy variable, whether a dispatch happened within a 30-minute window: 1 for True, 0 for False. I need to see when double dispatches are occurring and how many records are true double dispatches, not just a count that says there were 100 instances of double dispatch when those involved over 200 records. I need to be able to sort and see each record.
Hello, I think I found a solution. It is slow and only compares one index before or after, but cases with 3 dispatches within thirty minutes represent less than 0.5% of our data.
import pandas as pd
import numpy as np
import datetime as dt

dispatch = 'Tech Dispatched Date-Time (PST)'
tech = 'CombinedTech'

df = pd.read_excel('combined_data.xlsx')
df.sort_values(dispatch, inplace=True)
df.reset_index(inplace=True)
df['Double Dispatch'] = np.NaN

writer = pd.ExcelWriter('final_output.xlsx', engine='xlsxwriter')

dispatch_count = 0
time = dt.timedelta(minutes=30)

for index, row in df.iterrows():
    try:
        tech_one = df[tech].loc[(index - 1)]
        dispatch_one = df[dispatch].loc[(index - 1)]
    except KeyError:
        tech_one = None
        dispatch_one = pd.to_datetime('1/1/1990 00:00:00')
    try:
        tech_two = df[tech].loc[(index + 1)]
        dispatch_two = df[dispatch].loc[(index + 1)]
    except KeyError:
        tech_two = None
        dispatch_two = pd.to_datetime('1/1/2020 00:00:00')

    first_time = dispatch_one + time
    second_time = pd.to_datetime(row[dispatch]) + time
    dispatch_pd = pd.to_datetime(row[dispatch])

    if tech_one == row[tech] or tech_two == row[tech]:
        if first_time > row[dispatch] or second_time > dispatch_two:
            df.set_value(index, 'Double Dispatch', 1)
            dispatch_count += 1
        else:
            df.set_value(index, 'Double Dispatch', 0)
            dispatch_count += 1

print(dispatch_count)  # This was to monitor total # of records being pushed through

df.to_excel(writer, sheet_name='Sheet1')
writer.save()
writer.close()
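For comparison, here is a fully vectorized sketch of the same 30-minute check (reusing the tech and dispatch column names defined above, and assuming the dispatch column is already datetime). It flags a row when the previous or next dispatch for the same tech falls within 30 minutes:
df = df.sort_values([tech, dispatch])
gap = df.groupby(tech)[dispatch].diff()        # time since the previous dispatch by the same tech
within_30 = gap <= pd.Timedelta(minutes=30)
# a row is a double dispatch if it follows, or is followed by, another dispatch
# for the same tech within 30 minutes
df['Double Dispatch'] = (within_30 | within_30.shift(-1, fill_value=False)).astype(int)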
I have a dataframe of jobs for different people, with start and end times for each job. I'd like to count, every four months, how many jobs each person is responsible for. I figured out a way to do it but I'm sure it's tremendously inefficient (I'm new to pandas). It takes quite a while to compute when I run the code on my complete dataset (hundreds of persons and jobs).
Here is what I have so far.
# create a data frame
import pandas as pd
import numpy as np

df = pd.DataFrame({'job': pd.Categorical(['job1', 'job2', 'job3', 'job4']),
                   'person': pd.Categorical(['p1', 'p1', 'p2', 'p2']),
                   'start': ['2015-01-01', '2015-06-01', '2015-01-01', '2016-01-01'],
                   'end': ['2015-07-01', '2015-12-31', '2016-03-01', '2016-12-31']})
df['start'] = pd.to_datetime(df['start'])
df['end'] = pd.to_datetime(df['end'])
Which gives me
I then create a new dataset with
bdate = min(df['start'])
edate = max(df['end'])
dates = pd.date_range(bdate, edate, freq='4MS')
people = sorted(set(list(df['person'])))
df2 = pd.DataFrame(np.zeros((len(dates), len(people))), index=dates, columns=people)

for d in pd.date_range(bdate, edate, freq='MS'):
    for p in people:
        contagem = df[(df['person'] == p) &
                      (df['start'] <= d) &
                      (df['end'] >= d)]
        pos = np.argmin(np.abs(dates - d))
        df2.iloc[pos][p] = len(contagem.index)

df2
And I get
I'm sure there must be a better way of doing this without having to loop through all dates and persons. But how?
This answer assumes that each job-person combination is unique. It creates a series for every row with the value equal to the job and an index that expands the dates. Then it resamples every 4th month (which is not quarterly, but is what your solution describes) and counts the unique non-NA occurrences.
def make_date_range(x):
    return pd.Series(index=pd.date_range(x.start.values[0], x.end.values[0], freq='M'),
                     data=x.job.values[0])

# Iterate through each job-person combo and make an entry for each month with the job as the value
df1 = df.groupby(['job', 'person']).apply(make_date_range).unstack('person')

# remove outer level from index
df1.index = df1.index.droplevel('job')

# resample every 4 months, counting only unique values
df1.resample('4MS').agg(lambda x: len(x[x.notnull()].unique()))
Output
person p1 p2
2015-01-01 1 1
2015-05-01 2 1
2015-09-01 1 1
2016-01-01 0 2
2016-05-01 0 1
2016-09-01 0 1
And here is a long one-line solution that iterates over every row, creates a new dataframe for each, stacks them all together via pd.concat, and then resamples.
pd.concat([pd.DataFrame(index=pd.date_range(tup.start, tup.end, freq='4MS'),
                        data=[[tup.job]],
                        columns=[tup.person]) for tup in df.itertuples()])\
  .resample('4MS').count()
And another one that is faster
df1 = pd.melt(df, id_vars=['job', 'person'], value_name='date').set_index('date')
g = df1.groupby([pd.TimeGrouper('4MS'), 'person'])['job']
g.agg('nunique').unstack('person', fill_value=0)
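On newer pandas versions pd.TimeGrouper has been removed; the same grouping can be expressed with pd.Grouper, for example:
g = df1.groupby([pd.Grouper(freq='4MS'), 'person'])['job']
g.agg('nunique').unstack('person', fill_value=0)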
I have a function that outputs a dataframe generated from a RINEX (GPS) file. At present, I get the dataframe to be output into separate satellite (1-32) files. I'd like to access the first column (either while it's still a dataframe or in these new files) in order to format the date to a timestamp in seconds, like below:
Epochs Epochs
2014-04-27 00:00:00 -> 00000
2014-04-27 00:00:30 -> 00030
2014-04-27 00:01:00 -> 00060
This requires stripping the date away, then converting hh:mm:ss to seconds. I've hit a wall trying to figure out how best to access this first column (Epochs) and then make the conversion on the entire column. The code I have been working on is:
def read_data(self, RINEXfile):
    obs_data_chunks = []
    while True:
        obss, _, _, epochs, _ = self.read_data_chunk(RINEXfile)
        if obss.shape[0] == 0:
            break
        obs_data_chunks.append(pd.Panel(
            np.rollaxis(obss, 1, 0),
            items=['G%02d' % d for d in range(1, 33)],
            major_axis=epochs,
            minor_axis=self.obs_types
        ).dropna(axis=0, how='all').dropna(axis=2, how='all'))
    obs_data_chunks_dataframe = obs_data_chunks[0]
    for sv in range(32):
        sat = obs_data_chunks_dataframe[sv, :]
        print "sat_columns: {0}".format(sat.columns[0])  # list header of first column: L1
        sat.to_csv(('SV_{0}').format(sv+1), index_label="Epochs", sep='\t')
Do I perform this conversion within the dataframe, i.e. on "sat", or on the files after using "to_csv"? I'm a bit lost here. Same question for formatting the columns. See the not-so-nicely formatted columns below:
Epochs L1 L2 P1 P2 C1 S1 S2
2014-04-27 00:00:00 669486.833 530073.33 24568752.516 24568762.572 24568751.442 43.0 38.0
2014-04-27 00:00:30 786184.519 621006.551 24590960.634 24590970.218 24590958.374 43.0 38.0
2014-04-27 00:01:00 902916.181 711966.252 24613174.234 24613180.219 24613173.065 42.0 38.0
2014-04-27 00:01:30 1019689.006 802958.016 24635396.428 24635402.41 24635395.627 42.0 37.0
2014-04-27 00:02:00 1136478.43 893962.705 24657620.079 24657627.11 24657621.828 42.0 37.0
UPDATE:
By saying that I've hit a wall trying to figure out how best to access this first column (Epochs), I mean that the "sat" dataframe originally had no "Epochs" in its header. It simply had the signals:
L1 L2 P1 P2 C1 S1 S2
The index, (date&time), was missing from the header. In order to overcome this in my csv output files, I "forced" the name with:
sat.to_csv(('SV_{0}').format(sv+1), index_label="Epochs", sep='\t')
I would expect that, before generating the csv files, I should (but don't know how) be able to access this index (date & time) column and simply convert all dates/times in one swoop, so that the timestamps are output.
UPDATE:
The epochs are generated in the dataframe in another function as so:
epochs = np.zeros(CHUNK_SIZE, dtype='datetime64[us]')
UPDATE:
def read_data_chunk(self, RINEXfile, CHUNK_SIZE=10000):
    obss = np.empty((CHUNK_SIZE, TOTAL_SATS, len(self.obs_types)), dtype=np.float64) * np.NaN
    llis = np.zeros((CHUNK_SIZE, TOTAL_SATS, len(self.obs_types)), dtype=np.uint8)
    signal_strengths = np.zeros((CHUNK_SIZE, TOTAL_SATS, len(self.obs_types)), dtype=np.uint8)
    epochs = np.zeros(CHUNK_SIZE, dtype='datetime64[us]')
    flags = np.zeros(CHUNK_SIZE, dtype=np.uint8)
    i = 0
    while True:
        hdr = self.read_epoch_header(RINEXfile)
        #print hdr
        if hdr is None:
            break
        epoch, flags[i], sats = hdr
        epochs[i] = np.datetime64(epoch)
        sat_map = np.ones(len(sats)) * -1
        for n, sat in enumerate(sats):
            if sat[0] == 'G':
                sat_map[n] = int(sat[1:]) - 1
        obss[i], llis[i], signal_strengths[i] = self.read_obs(RINEXfile, len(sats), sat_map)
        i += 1
        if i >= CHUNK_SIZE:
            break
    return obss[:i], llis[:i], signal_strengths[:i], epochs[:i], flags[:i]
UPDATE:
My apologies if my description was somewhat vague. Actually I'm modifying code already developed, and I'm not a software developer, so it's a steep learning curve for me too. Let me explain further: the "Epochs" are read from another function:
def read_epoch_header(self, RINEXfile):
    epoch_hdr = RINEXfile.readline()
    if epoch_hdr == '':
        return None
    year = int(epoch_hdr[1:3])
    if year >= 80:
        year += 1900
    else:
        year += 2000
    month = int(epoch_hdr[4:6])
    day = int(epoch_hdr[7:9])
    hour = int(epoch_hdr[10:12])
    minute = int(epoch_hdr[13:15])
    second = int(epoch_hdr[15:18])
    microsecond = int(epoch_hdr[19:25])  # Discard the least significant digits (use microseconds only).
    epoch = datetime.datetime(year, month, day, hour, minute, second, microsecond)
    flag = int(epoch_hdr[28])
    if flag != 0:
        raise ValueError("Don't know how to handle epoch flag %d in epoch header:\n%s", (flag, epoch_hdr))
    n_sats = int(epoch_hdr[29:32])
    sats = []
    for i in range(0, n_sats):
        if ((i % 12) == 0) and (i > 0):
            epoch_hdr = RINEXfile.readline()
        sats.append(epoch_hdr[(32+(i%12)*3):(35+(i%12)*3)])
    return epoch, flag, sats
In the above read_data function, these are appended into a dataframe. I basically want to have this dataframe separated by its satellite axis, so that each satellite file has the epochs in the first column, followed by the 7 signals. The last bit of code in the read_data function (below) shows this:
for sv in range(32):
    sat = obs_data_chunks_dataframe[sv, :]
    print "sat_columns: {0}".format(sat.columns[0])  # list header of first column: L1
    sat.to_csv(('SV_{0}').format(sv+1), index_label="Epochs", sep='\t')
The problem here is (1) I want to have the first column as timestamps (so, strip the date, convert so midnight = 00000s and 23:59:59 = 86399s) not as they are now, and (2) ensure the columns are aligned, so I can eventually manipulate these further using a different class to perform other calculations i.e. L1 minus L2 plotted against time, etc.
It will be much quicker to do this while it's still a dataframe. If the dtype is datetime64, just convert to int64 and then floor-divide by 10**9 (nanoseconds per second):
In [241]:
df['Epochs'].astype(np.int64) // 10**9
Out[241]:
0 1398556800
1 1398556830
2 1398556860
3 1398556890
4 1398556920
Name: Epochs, dtype: int64
If it's a string then convert using to_datetime and then perform the above:
df['Epochs'] = pd.to_datetime(df['Epochs']).astype(np.int64) // 10**9
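If what you actually want are seconds since midnight (00000 ... 86399) rather than seconds since the Unix epoch, you can subtract the normalized date first, something like:
epochs = pd.to_datetime(df['Epochs'])
df['Epochs'] = (epochs - epochs.dt.normalize()).dt.total_seconds().astype(int)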
see related
I resolved part of this myself in the end: in the read_epoch_header function, I simply manipulated a variable that converted just hh:mm:ss to seconds, and used this as the epoch. Doesn't look that elegant but it works. Just need to format the header so that it aligns with the columns (and they are aligned too). Cheers, pymat