Optimize file reading with numpy - python

I have a .dat file produced by an FPGA. The file contains 3 columns: the first is the input channel (it can be 1 or 2), the second is the timestamp at which an event occurred, and the third is the local time at which the same event occurred. The third column is necessary because sometimes the FPGA has to reset its clock counter, so it doesn't count continuously. An example of what I mean is represented in the next figure.
An example of some lines from the .dat file is the following:
1 80.80051152 2022-02-24T18:28:49.602000
2 80.91821978 2022-02-24T18:28:49.716000
1 80.94284154 2022-02-24T18:28:49.732000
2 0.01856876 2022-02-24T18:29:15.068000
2 0.04225772 2022-02-24T18:29:15.100000
2 0.11766780 2022-02-24T18:29:15.178000
The time column is given by the FPGA (in tens of nanoseconds); the date column is written by the Python script that listens to the data from the FPGA: when it has to write a timestamp, it also saves the local time as a date.
I am interested in getting two arrays (one for each channel) where, for each event, I have the time at which that event occurred relative to the starting time of the acquisition. An example of how the data given above should look at the end is the following:
8.091821978000000115e+01
1.062702197800000050e+02
1.062939087400000062e+02
1.063693188200000179e+02
These data refer to the second channel only; a double check can be made by looking at the third column of the data above.
I tried to achieve this with a function (too messy for me) where I check each time whether the difference in time between two consecutive events differs by more than 1 second from the difference in local time; if that's the case, I evaluate the time interval through the local-time column and correct the timestamp by the right amount of time:
import numpy as np

ch, time, date = np.genfromtxt("events220302_1d.dat", unpack=True,
                               dtype=(int, float, 'datetime64[ms]'))

mask1 = ch == 1
mask2 = ch == 2
time1 = time[mask1]
time2 = time[mask2]
date1 = date[mask1]
date2 = date[mask2]

corr1 = np.zeros(len(time1))
for idx, val in enumerate(time1):
    if idx < len(time1) - 1:
        if check_dif(time1[idx], time1[idx+1], date1[idx], date1[idx+1]) == 0:
            corr1[idx+1] = val + (date1[idx+1]-date1[idx])/np.timedelta64(1,'s') - time1[idx+1]
time1 = time1 + corr1.cumsum()
Here check_dif is a function that returns 0 if the difference in time between consecutive events is inconsistent with the difference in date between the same two events, as said before.
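(check_dif itself is not shown in the question; a minimal sketch consistent with its description, with the 1-second tolerance as an assumption, could look like this:)

def check_dif(t0, t1, d0, d1, tol=1.0):
    # Wall-clock step between the two events, in seconds.
    dt_local = (d1 - d0) / np.timedelta64(1, 's')
    # Return 0 when the FPGA step and the local-time step disagree by more than tol.
    return 0 if abs(dt_local - (t1 - t0)) > tol else 1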
Is there any more elegant or even faster way to get what I want with maybe some fancy NumPy coding?

A simple initial way to optimize your code is to make the code if-less, thus getting rid of both the if statements. To do so, instead of returning 0 in check_dif, you can return 1 when "the difference in time between consecutive events is inconsistent with the difference in date between the two same events as I said before", otherwise 0.
Your for loop will be something like this:
for idx in range(len(time1) - 1):
    is_dif = check_dif(time1[idx], time1[idx+1], date1[idx], date1[idx+1])
    # Correction value: if is_dif == 0, no correction; otherwise the full correction is applied
    correction = (date1[idx+1]-date1[idx])/np.timedelta64(1,'s') - (time1[idx+1] - time1[idx])
    corr1[idx+1] = is_dif * correction
A more NumPy-like way to do things is vectorization, as sketched below. I don't know if you have any benchmark on the speed or how big the file is, but I think in your case the previous change should already be good enough.
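For reference, a fully vectorized sketch of the same idea (under the same assumption as above: a reset is wherever the FPGA step disagrees with the wall-clock step by more than a tolerance):

import numpy as np

def correct(time, date, tol=1.0):
    # FPGA time steps and wall-clock steps between consecutive events.
    dt_fpga = np.diff(time)
    dt_local = np.diff(date) / np.timedelta64(1, 's')
    # A reset is assumed wherever the two disagree by more than tol seconds.
    reset = np.abs(dt_local - dt_fpga) > tol
    # From each reset onwards, replace the (wrong) FPGA step with the wall-clock step.
    corr = np.zeros(len(time))
    corr[1:] = np.where(reset, dt_local - dt_fpga, 0.0)
    return time + corr.cumsum()

time1 = correct(time[ch == 1], date[ch == 1])
time2 = correct(time[ch == 2], date[ch == 2])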

Related

A dataframe splitting problem in Pandas, any thoughts?

The probe of an instrument is cycling back and forth along an x direction while it is recording its position and acquiring the measurements. The probe makes 10 cycles, let's say from 0 to 10 um (go and back), and records the measurements. This gives 2 columns of data: position and measurement, where the position numbers cycle 0um->10um->0->10->0..., but these numbers have an experimental error so they are all different.
I need to split the dataframe at the beginning of each cycle. Any interesting strategy to tackle this problem? Please let me know if you need more info. Thanks in advance.
Below is link to an example of the dataframe that I have.
https://www.dropbox.com/s/af4r8lw5lfhwexr/Example.xlsx?dl=0
In this example the instrument made 3 cycles and generated the data (measurement). Cycle 1 = Index 0-20; Cycle 2 = Index 20-40; and Cycle 3 = Index 40-60. I need to divide this dataframe into 3 dataframes, one for each cycle (Index 0-20; Index 20-40; Index 40-60).
The tricky part is that the method needs to be "general" because each cycle can have a different number of datapoints (in this example that is fixed to 20), and different experiments can be performed with a different number of cycles.
My objective is to keep track of when the numbers start increasing again after decreasing, to determine the cycle number. Not very elegant, sorry.
import pandas as pd
df = pd.read_excel('Example.xlsx')
def cycle(array):
    increasing = 1
    cycle_num = 0
    answer = []
    for ind, val in enumerate(array):
        try:
            if array[ind+1] - array[ind] >= 0:
                if increasing == 0:
                    cycle_num += 1
                increasing = 1
                answer.append(cycle_num)
            else:
                answer.append(cycle_num)
                increasing = 0
        except IndexError:
            answer.append(cycle_num)
    return answer
df['Cycle'] = cycle(df['Distance'].to_list())
grouped = df.groupby(['Cycle'])
print(grouped.get_group(0))
print(grouped.get_group(1))
print(grouped.get_group(2))
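For reference, the same cycle labelling can be done without an explicit Python loop. This is a sketch assuming the 'Distance' column from the example file: a new cycle starts wherever the position switches from decreasing back to increasing, and cumsum() turns those switch points into cycle numbers.

import pandas as pd

df = pd.read_excel('Example.xlsx')
increasing = df['Distance'].diff().fillna(0) >= 0
# A cycle boundary is a switch from decreasing back to increasing.
new_cycle = increasing & ~increasing.shift(fill_value=True)
df['Cycle'] = new_cycle.cumsum()
cycles = [group for _, group in df.groupby('Cycle')]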

What is the most efficient way to count the number of instances occurring within a time frame in python?

I am trying to run a simple count function which runs a dataframe of event times (specifically surgeries) against another dataframe of shift time frames, and returns a list of how many events occur during each shift. These csvs are thousands of rows, though, so while the way I have it set up currently works, it takes forever. This is what I have:
numSurgeries = [0 for shift in range(len(df.Date))]
for i in range(len(OR['PATIENT_IN_ROOM_DTTM'])):
    for shift in range(len(df.DateTime)):
        if OR['PATIENT_IN_ROOM_DTTM'][i] >= df.DateTime[shift] and OR['PATIENT_IN_ROOM_DTTM'][i] < df.DateTime[shift+1]:
            numSurgeries[shift] += 1
So it loops through each event and checks to see which shift time frame it is in, then increments the count for that time frame. Logical, works, but definitely not efficient.
EDIT:
Example of OR data file
Example of df data file
Without example data, it's not absolutely clear what you want. But this should help you vectorise:
numSurgeries = {shift: np.sum((OR['PATIENT_IN_ROOM_DTTM'] >= df.DateTime[shift]) &
                              (OR['PATIENT_IN_ROOM_DTTM'] < df.DateTime[shift+1]))
                for shift in range(len(df.Date))}
The output is a dictionary mapping integer shift to numSurgeries.
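If the shift boundaries in df.DateTime are sorted, another option is np.searchsorted, which places every surgery into its shift bin in one pass instead of doing one boolean comparison per shift. A sketch, assuming the column names from the question:

import numpy as np

boundaries = df.DateTime.to_numpy()
events = OR['PATIENT_IN_ROOM_DTTM'].to_numpy()
# Index of the shift each event falls into: boundaries[i] <= event < boundaries[i+1].
bins = np.searchsorted(boundaries, events, side='right') - 1
valid = (bins >= 0) & (bins < len(boundaries) - 1)
numSurgeries = np.bincount(bins[valid], minlength=len(boundaries) - 1)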
As mentioned above, it is hard to answer without example data.
However, a boolean mask sounds fitting. See Select dataframe rows between two dates.
Create a date mask from the shift; we'll call the start and end dates start_shift and end_shift respectively. These should be in datetime format.
date_mask = (df['datetime'] >= start_shift) & (df['datetime'] <= end_shift)
Locate all values in df that fit this range.
df_shift = df.loc[date_mask]
Count the number of instances in the new df_shift.
num_surgeries = len(df_shift.index)
Cycle through all shifts.
def count_shifts(row, df):
    # row is one shift from results_df (assumed to carry start_shift and end_shift values)
    date_mask = (df['datetime'] >= row['start_shift']) & (df['datetime'] <= row['end_shift'])
    df_shift = df.loc[date_mask]
    num_surgeries = len(df_shift.index)
    return num_surgeries

# iterates through results_df and applies the above function to every row
results_df['num_surgeries'] = results_df.apply(count_shifts, axis=1, args=(df,))
Also remember to name variables according to PEP8 Style Guide! Camelcase is not recommended in Python.

numpy.where does not work properly with pandas dataframe

I am trying to divide a huge log data set containing entries with StartTime, EndTime and other fields.
I am using np.where to compare a pandas DataFrame column and then divide it into hourly (maybe half-hour or quarter-hour) chunks, depending on the hr and timeWindow values.
Below, I am trying to divide an entire day of logs into 1-hour chunks, but it does not give me the expected output.
I am out of ideas about where exactly my mistake is.
# Holding the very first time in the log data and stripping off
# seconds, minutes and microseconds.
today = datetime.strptime(log_start_time, "%Y-%m-%d %H:%M:%S.%f").replace(second=0, minute=0, microsecond=0)
today_ts = int(time.mktime(today.timetuple())*1e9)

hr = 1
timeWindow = int(hr*60*60*1e9)  # hour*minute*second*rest digits

parts = [df.loc[np.where((df["StartTime"] >= (today_ts + i*timeWindow)) &
                         (df["StartTime"] < (today_ts + (i+1)*timeWindow)))].dropna(axis=0, how='any')
         for i in range(0, rngCounter)]
If I check the first log entry inside each of my parts, it is something like below:
00:00:00.
00:43:23.
01:12:59.
01:53:55.
02:23:52.
....
Whereas I expect the output to be like below:
00:00:00
01:00:01
02:00:00
03:00:00
04:00:01
....
I have implemented it in an alternative way, but that's a workaround and I lost a few features by not having it like this.
Can someone please figure out what exactly is wrong with this approach?
Note: I am using python notebook with pandas, numpy.
Thanks to #raganjosh.
I got the solution to my problem by using pandas Grouper.
Below is my implementation.
I have used a dynamic value for 'hr'.
timeWindow = str(hr)+'H'
# Dividing the log into "n" parts. Depends on timewindow initialisation.
df["ST"] = df['StartTime']
df = df.set_index(['ST'])
# Using the copied column as an index.
df.index = pd.to_datetime(df.index)
# Here the parts contain grouped chunks of data as per timewindow, list[0] = key of the group, list[1] = values.
parts = list(df.groupby(pd.TimeGrouper(freq=timeWindow))['StartTime', "ProcessTime", "EndTime"])
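Note that pd.TimeGrouper has since been deprecated and removed in favour of pd.Grouper; a minimal sketch of the same grouping with the current API (column names as in the question) would be:

df["ST"] = pd.to_datetime(df['StartTime'])
df = df.set_index('ST')
# One group per timeWindow-sized chunk, e.g. '1H'.
parts = list(df.groupby(pd.Grouper(freq=timeWindow))[['StartTime', 'ProcessTime', 'EndTime']])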

Agglomerate time series data

I have AWS EC2 instance CPU utilization and other metric data given to me in CSV format like this:
Date,Time,CPU_Utilization,Unit
2016-10-17,09:25:00,22.5,Percent
2016-10-17,09:30:00,6.534,Percent
2016-10-17,09:35:00,19.256,Percent
2016-10-17,09:40:00,43.032,Percent
2016-10-17,09:45:00,58.954,Percent
2016-10-17,09:50:00,56.628,Percent
2016-10-17,09:55:00,25.866,Percent
2016-10-17,10:00:00,17.742,Percent
2016-10-17,10:05:00,34.22,Percent
2016-10-17,10:10:00,26.07,Percent
2016-10-17,10:15:00,20.066,Percent
2016-10-17,10:20:00,15.466,Percent
2016-10-17,10:25:00,16.2,Percent
2016-10-17,10:30:00,14.27,Percent
2016-10-17,10:35:00,5.666,Percent
2016-10-17,10:40:00,4.534,Percent
2016-10-17,10:45:00,4.6,Percent
2016-10-17,10:50:00,4.266,Percent
2016-10-17,10:55:00,4.2,Percent
2016-10-17,11:00:00,4.334,Percent
2016-10-17,11:05:00,4.334,Percent
2016-10-17,11:10:00,4.532,Percent
2016-10-17,11:15:00,4.266,Percent
2016-10-17,11:20:00,4.266,Percent
2016-10-17,11:25:00,4.334,Percent
As is evident, this is reported every 5 minutes. I do not have access to the aws-cli. I need to process this and report the average utilization every 15 minutes for visualization. That is, for every hour, I need to find the average of the values in the first 15 minutes, the next fifteen minutes and so on. So, I will be reporting 4 values every hour.
A sample output would be:
Date,Time,CPU_Utilization,Unit
2016-10-17,09:30:00,14.517,Percent
2016-10-17,09:45:00,40.414,Percent
2016-10-17,10:00:00,33.412,Percent
2016-10-17,10:15:00,26.785,Percent
...
One way to do it would be to read the entire file (which has 10,000+ lines), then for each date find the values which belong to one 15-minute window, compute their average, and repeat for all the values. This does not seem to be the most efficient approach. Is there a better way to do it? Thank you.
As your input data is actually pretty small, I'd suggest reading it in all at once with np.genfromtxt. Then you can find the appropriate range by checking where the first full quarter of an hour starts and counting how many full quarters are left. Then you can use np.reshape to get an array with one row per quarter hour and average over those rows:
import numpy as np

# Read in the data:
data = np.genfromtxt("data.dat", skip_header=1,
                     dtype=[("date", "|S10"),
                            ("time", "|S8"),
                            ("cpu_usage", "f8")],
                     delimiter=',', usecols=(0, 1, 2))

# Find the first full quarter hour:
firstQuarterHour = 0
while not (int(data[firstQuarterHour]["time"][3:5]) % 15 == 0):
    firstQuarterHour += 1

# Three 5-minute samples per quarter hour; integer division keeps only full quarters.
noOfQuarterHours = data[firstQuarterHour:].shape[0] // 3

# Create a reshaped array with one row per quarter hour:
reshaped = data[firstQuarterHour:firstQuarterHour + 3*noOfQuarterHours].reshape(
    (noOfQuarterHours, 3))

# Average over cpu_usage and take the appropriate dates and times:
cpu_usage = reshaped["cpu_usage"].mean(axis=1)
dates = reshaped["date"][:, 0]
times = reshaped["time"][:, 0]
Now you can use those arrays, for example, to save the result into another text file with np.savetxt.
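If pandas is an option, the same 15-minute averaging can also be done with resample. This is a sketch assuming the CSV layout shown in the question and a hypothetical file name data.csv; closed='right' and label='right' reproduce the labelling in the sample output, where each average is stamped with the end of its interval.

import pandas as pd

df = pd.read_csv("data.csv")
# Combine the separate Date and Time columns into one timestamp.
df["DateTime"] = pd.to_datetime(df["Date"] + " " + df["Time"])
out = (df.set_index("DateTime")["CPU_Utilization"]
         .resample("15min", closed="right", label="right")
         .mean())
print(out.head())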

Python Pandas optimization algorithm: compare datetime and retrieve datas

This post is quite long and I will be very grateful to everybody who reads it until the end. :)
I am experiencing performance issues with my Python code and would like to know if you have a better way of doing what I want.
Let me explain my problem briefly. I have plenty of solar panel measurements. Each one of them is taken every 3 minutes. Unfortunately, some measurements can fail. The goal is to compare the times in order to keep only the values that have been measured in the same minutes and then retrieve them. A GUI is also included in my software, so each time the user changes the panels to compare, the calculation has to be done again. To do so, I have implemented 2 parts: the first one creates a vector of True or False for each panel for each minute, and the second compares the previous vectors and keeps only the common measures.
All the data are contained in the pandas DataFrame energiesDatas. The relevant columns are:
name: contains the name of the panel (length 1)
date: contains the day of the measurement (length 1)
list_time: contains a list of all time of measurement of a day (length N)
list_energy_prod : contains the corresponding measures (length N)
The first part loops over all possible minutes from the beginning to the end of the measurements. If a measure has been done, it adds True, otherwise False.
self.ListCompare2 = pd.DataFrame()
for n in self.NameList:  # loop over all my solar panels
    m = self.energiesDatas[self.energiesDatas['Name'] == n]  # all datas
    # table_date contains all the possible dates from the 1st measure, with an interval of 1 min.
    table_list = [1 for i in range(len(table_date))]
    pointerDate = 0  # pointer to the current value of time
    # all the measures of a given day are transformed into a str of hour-minutes
    DateString = [b.strftime('%H-%M') for b in m['list_time'].iloc[pointerDate]]
    # some test
    changeDate = 0
    count = 0
    # store the current pointed date
    m_date = m['Date'].iloc[pointerDate]
    # for all possible times
    for curr_date in table_date:
        # if the considered date is bigger, move the pointer to the next day
        while curr_date.date() > m_date:
            pointerDate += 1
            changeDate = 1
            m_date = m['Date'].iloc[pointerDate]
        # if the day has changed, recalculate the measures of this new day
        if changeDate:
            DateString = [b.strftime('%H-%M') for b in m['list_time'].iloc[pointerDate]]
            changeDate = 0
        # check if a measure has been done at the considered time
        table_list[count] = curr_date.strftime('%H-%M') in DateString
        count += 1
    # add to a dataframe
    self.ListCompare2[n] = table_list
l2 = self.ListCompare2
The second part is the following: given a "ListOfName" of modules to compare, check if they have been measured at the same time and only keep the values measured in the same minute.
ListToKeep = self.ListCompare2[ListOfName[0]]  # take the list of True or False built before
for i in ListOfName[1:]:  # for each other panel, check if True too
    ListToKeep = ListToKeep & self.ListCompare2[i]

for i in ListOfName:  # for each module, recover the values
    tmp = self.energiesDatas[self.energiesDatas['Name'] == i]
    count = 0
    # loop over the values we want to keep (also energy produced and the interval of time)
    for j, k, l, m, n in zip(tmp['list_time'], tmp['Date'], tmp['list_energy_prod'],
                             tmp['list_energy_rec'], tmp['list_interval']):
        # calculation of the index
        delta_day = (k - self.dt.date()).days * (18 * 60)
        # if the value of ListToKeep corresponding to the index is True, we keep the value
        tmp['list_energy_prod'].iloc[count] = [l[index] for index, a in enumerate(j)
                                               if ListToKeep.iloc[delta_day + (a.hour - 4) * 60 + a.minute]]
        tmp['list_energy_rec'].iloc[count] = [m[index] for index, a in enumerate(j)
                                              if ListToKeep.iloc[delta_day + (a.hour - 4) * 60 + a.minute]]
        tmp['list_interval'].iloc[count] = [n[index] for index, a in enumerate(j)
                                            if ListToKeep.iloc[delta_day + (a.hour - 4) * 60 + a.minute]]
        count += 1
    self.store_compare = self.store_compare.append(tmp)
Actually, this second part is the one that takes a very long time.
My question is: is there a way to save time, using built-in functions or anything else?
Thank you very much
Kilian
The answer of chris-sc solved my problem:
I believe your data structure isn't appropriate for your problem, especially the lists in fields of a DataFrame; they make loops or apply almost unavoidable. Could you in principle re-structure the data? (For example, one DataFrame per solar panel with columns date, time, energy.)
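For reference, a minimal sketch of that restructuring (column names are taken from the question; the reshaping itself and the types of the list_time elements are assumptions): one row per measurement with a per-minute timestamp, so the minutes common to all selected panels fall out of a single groupby.

import pandas as pd

rows = []
for _, r in energiesDatas.iterrows():
    # Flatten the per-day lists into one row per measurement.
    for t, e in zip(r['list_time'], r['list_energy_prod']):
        rows.append({'name': r['Name'],
                     'minute': pd.Timestamp.combine(r['Date'], t.time()).floor('min'),
                     'energy_prod': e})
long_df = pd.DataFrame(rows)

# Keep only the minutes in which every panel in ListOfName reported a value.
sel = long_df[long_df['name'].isin(ListOfName)]
common = sel.groupby('minute')['name'].nunique() == len(ListOfName)
result = sel[sel['minute'].isin(common.index[common])]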
