Python Pandas optimization algorithm: compare datetimes and retrieve data

This post is quite long and I will be very grateful to everybody who reads it until the end. :)
I am running into Python execution speed issues and would like to know if you have a better way of doing what I want.
Let me explain my problem briefly. I have a lot of solar panel measurements, each taken every 3 minutes. Unfortunately, some measurements can fail. The goal is to compare the timestamps in order to keep only the values that have been measured in the same minute, and then retrieve them. A GUI is also included in my software, so each time the user changes the panels to compare, the calculation has to be done again. To do so, I have implemented two parts: the first creates a vector of True or False for each panel for each minute, and the second compares those vectors and keeps only the common measurements.
All the data are contained in the pandas DataFrame energiesDatas. The relevant columns are:
name: contains the name of the panel (length 1)
date: contains the day of the measurement (length 1)
list_time: contains a list of all measurement times of a day (length N)
list_energy_prod: contains the corresponding measurements (length N)
The first part loops over all possible minutes from the beginning to the end of the measurements. If a measurement was taken at a given minute, it appends True, otherwise False.
self.ListCompare2 = pd.DataFrame()
for n in self.NameList:  # loop over all my solar panels
    m = self.energiesDatas[self.energiesDatas['Name'] == n]  # all data for this panel
    # table_date contains all the possible dates from the 1st measure, with an interval of 1 min
    table_list = [1 for i in range(len(table_date))]
    pointerDate = 0  # pointer to the current value of time
    # all the measures of a given day are transformed into a str of hour-minutes
    DateString = [b.strftime('%H-%M') for b in m['list_time'].iloc[pointerDate]]
    # some flags
    changeDate = 0
    count = 0
    # store the currently pointed date
    m_date = m['Date'].iloc[pointerDate]
    # for all possible times
    for curr_date in table_date:
        # if the considered date is bigger, move the pointer to the next day
        while curr_date.date() > m_date:
            pointerDate += 1
            changeDate = 1
            m_date = m['Date'].iloc[pointerDate]
        # if the day changed, recalculate the measures of this new day
        if changeDate:
            DateString = [b.strftime('%H-%M') for b in m['list_time'].iloc[pointerDate]]
            changeDate = 0
        # check whether a measure has been taken at the considered time
        table_list[count] = curr_date.strftime('%H-%M') in DateString
        count += 1
    # add to a dataframe
    self.ListCompare2[n] = table_list
l2 = self.ListCompare2
The second part is the following: given a list ListOfName of modules to compare, check whether they were measured at the same times and keep only the values measured in the same minute.
ListToKeep = self.ListCompare2[ListOfName[0]]  # take the list of True/False built above
for i in ListOfName[1:]:  # for each other panel, check if True too
    ListToKeep = ListToKeep & self.ListCompare2[i]
for i in ListOfName:  # for each module, recover the values
    tmp = self.energiesDatas[self.energiesDatas['Name'] == i]
    count = 0
    # loop over the values we want to keep (also the energy produced and the interval of time)
    for j, k, l, m, n in zip(tmp['list_time'], tmp['Date'], tmp['list_energy_prod'],
                             tmp['list_energy_rec'], tmp['list_interval']):
        # calculation of the index
        delta_day = (k - self.dt.date()).days * (18 * 60)
        # if the value of ListToKeep corresponding to the index is True, we keep the value
        tmp['list_energy_prod'].iloc[count] = [l[index] for index, a in enumerate(j)
                                               if ListToKeep.iloc[delta_day + (a.hour - 4) * 60 + a.minute]]
        tmp['list_energy_rec'].iloc[count] = [m[index] for index, a in enumerate(j)
                                              if ListToKeep.iloc[delta_day + (a.hour - 4) * 60 + a.minute]]
        tmp['list_interval'].iloc[count] = [n[index] for index, a in enumerate(j)
                                            if ListToKeep.iloc[delta_day + (a.hour - 4) * 60 + a.minute]]
        count += 1
    self.store_compare = self.store_compare.append(tmp)
This second part is actually the one that takes a very long time.
My question is: is there a way to save time, using built-in functions or anything else?
Thank you very much
Kilian

The answer from chris-sc solved my problem:
I believe your data structure isn't appropriate for your problem, especially the lists in the fields of a DataFrame; they make loops or apply almost unavoidable. Could you in principle restructure the data? (For example, one df per solar panel with columns date, time, energy.)
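One way to act on that suggestion is sketched below, using a single flat frame with a name column rather than one frame per panel (all column names and values here are illustrative, not from the post). With one row per measurement, keeping only the minutes shared by all selected panels becomes a groupby instead of hand-built boolean vectors:
import pandas as pd

# hypothetical flat layout: one row per measurement, timestamps rounded to the minute
flat = pd.DataFrame({
    'name': ['A', 'A', 'B', 'B', 'B'],
    'time': pd.to_datetime(['2016-01-01 00:00', '2016-01-01 00:03',
                            '2016-01-01 00:00', '2016-01-01 00:03',
                            '2016-01-01 00:06']),
    'energy': [1.0, 1.1, 0.9, 1.0, 1.2],
})

panels = ['A', 'B']  # the equivalent of ListOfName
sub = flat[flat['name'].isin(panels)]
# keep only the minutes at which every selected panel has a measurement
counts = sub.groupby('time')['name'].nunique()
common = counts.index[counts == len(panels)]
picked = sub[sub['time'].isin(common)]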

Related

Optimize file reading with numpy

I have a .dat file produced by an FPGA. The file contains 3 columns: the first is the input channel (it can be 1 or 2), the second is the timestamp at which an event occurred, and the third is the local time at which the same event occurred. The third column is necessary because sometimes the FPGA has to reset its clock counter, so it doesn't count in a continuous way.
An example of a few lines from the .dat file is the following:
1 80.80051152 2022-02-24T18:28:49.602000
2 80.91821978 2022-02-24T18:28:49.716000
1 80.94284154 2022-02-24T18:28:49.732000
2 0.01856876 2022-02-24T18:29:15.068000
2 0.04225772 2022-02-24T18:29:15.100000
2 0.11766780 2022-02-24T18:29:15.178000
The time column is given by the FPGA (in tens of nanoseconds); the date column is written by the Python script that listens to the FPGA: whenever it writes a timestamp, it also saves the local time as a date.
I am interested in getting two arrays (one per channel) holding, for each event, the time at which that event occurred relative to the start of the acquisition. An example of how the data given above should look at the end is the following:
8.091821978000000115e+01
1.062702197800000050e+02
1.062939087400000062e+02
1.063693188200000179e+02
These data refer to the second channel only; they can be double-checked against the third column of the data above.
I tried to achieve this with a function (too messy for my taste) where I check, for every pair of consecutive events, whether their difference in time deviates by more than 1 second from their difference in local time; if that is the case, I evaluate the time interval through the local time column and correct the timestamp by the right amount:
import numpy as np

ch, time, date = np.genfromtxt("events220302_1d.dat", unpack=True,
                               dtype=(int, float, 'datetime64[ms]'))
mask1 = ch == 1
mask2 = ch == 2
time1 = time[mask1]
time2 = time[mask2]
date1 = date[mask1]
date2 = date[mask2]

corr1 = np.zeros(len(time1))
for idx, val in enumerate(time1):
    if idx < len(time1) - 1:
        if check_dif(time1[idx], time1[idx+1], date1[idx], date1[idx+1]) == 0:
            corr1[idx+1] = val + (date1[idx+1] - date1[idx]) / np.timedelta64(1, 's') - time1[idx+1]
time1 = time1 + corr1.cumsum()
Here check_dif is a function that returns 0 if the difference in time between consecutive events is inconsistent with the difference in date between the same two events, as described above.
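For reference, a minimal sketch of what check_dif might look like under that description (the 1-second tolerance comes from the question; the exact implementation is an assumption):
import numpy as np

def check_dif(t0, t1, d0, d1):
    # returns 0 when the FPGA time difference disagrees with the wall-clock
    # difference by more than 1 second (i.e. a clock reset happened), 1 otherwise
    date_diff = (d1 - d0) / np.timedelta64(1, 's')
    return 0 if abs(date_diff - (t1 - t0)) > 1.0 else 1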
Is there any more elegant or even faster way to get what I want with maybe some fancy NumPy coding?
A simple initial way to optimize your code is to make it if-less, getting rid of both if statements. To do so, instead of returning 0 in check_dif, return 1 when the difference in time between consecutive events is inconsistent with the difference in date between the same two events, and 0 otherwise.
Your for loop will then become something like this (note that the whole correction term has to be multiplied by is_dif, so that no correction is applied when is_dif == 0):
for idx in range(len(time1) - 1):
    is_dif = check_dif(time1[idx], time1[idx+1], date1[idx], date1[idx+1])
    # correction value: if is_dif == 0, no correction; otherwise a correction takes place
    correction = time1[idx] + (date1[idx+1] - date1[idx]) / np.timedelta64(1, 's') - time1[idx+1]
    corr1[idx+1] = is_dif * correction
A more NumPy-like way to do things would be vectorization. I don't know whether you have benchmarks on the speed or how big the file is, but I think in your case the previous change should be good enough.
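For completeness, a fully vectorized version of the correction could look like this (a sketch, not benchmarked, assuming the same 1-second tolerance and the time1/date1 arrays defined above):
import numpy as np

dt_time = np.diff(time1)                           # FPGA clock deltas, in seconds
dt_date = np.diff(date1) / np.timedelta64(1, 's')  # wall-clock deltas, in seconds
reset = np.abs(dt_date - dt_time) > 1.0            # True where a clock reset happened
corr = np.zeros_like(time1)
corr[1:] = np.where(reset, time1[:-1] + dt_date - time1[1:], 0.0)
time1 = time1 + corr.cumsum()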

Python - Selecting data at one specific time per day

I'm quite new to Python and have a simple question (I think). I have an array with hourly data for an entire month. I want to make a new array with only data at a specific time every day (at 00:00 UTC). How do I do this?
This is what the array looks like: [screenshot of the array omitted]
Thank you for your help!
I would go about it by first making a boolean mask representing which times satisfy that condition, and then selecting values based on that mask. If your data is truly hourly, this could be done using the following:
mask = q_era5_feb.time.dt.hour == 0
result = q_era5_feb.sel(time=mask)
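The same masking idea carries over if the data lives in a pandas object with a DatetimeIndex rather than an xarray one (a sketch with a made-up hourly DataFrame):
import pandas as pd

# hypothetical hourly data for one month
idx = pd.date_range('2023-02-01', '2023-02-28 23:00', freq='h')
df = pd.DataFrame({'q': range(len(idx))}, index=idx)

daily = df[df.index.hour == 0]  # keep only the 00:00 UTC rows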

Windows of difference between 2 time series

I am trying to find 3 areas of difference between 2 time series. I can see the difference between the two curves, but I eventually want to automatically detect the biggest and the smallest differences between them. Using the following code I can view the difference between the 2 curves, but I want to find the 3 areas (chronologically) by defining a number of points or a time period, as in the image. So, for example, find 3 windows of a week each where the difference is small, then big, then small again. Any idea if there is a built-in function for this?
Thank you
ax.fill_between(
    x=feature.reset_index().index,
    y1=feature[1],
    y2=feature[2],
    alpha=0.3
)
The 2 time series and the 3 areas I would like to find
As a concept:
Define a large time window as t_0 to T, find the initial minimum in the difference of the two series (i.e. find the minimum of the spread) and record the location of this time. If you have an aligned DataFrame of the time series, this should be rudimentary: find the minimum of the difference and look up the loc of that item to identify the time within the window.
Then restrict your search to t_min_1 to T and search for the maximum, again obtaining the loc of this maximum value in the spread. Lastly, search over t_max to T for a local minimum within the spread and find the loc of that value.
This will give you, within your given window, the times of the first minimum (t_min_1), the following maximum (t_max) and the second minimum (t_min_2).
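A minimal sketch of that min-then-max-then-min search, with illustrative series in place of your data:
import numpy as np
import pandas as pd

# illustrative aligned series; replace with your own data
idx = pd.date_range('2023-01-01', periods=100, freq='D')
rng = np.random.default_rng(0)
s1 = pd.Series(rng.standard_normal(100).cumsum(), index=idx)
s2 = pd.Series(rng.standard_normal(100).cumsum(), index=idx)

spread = (s1 - s2).abs()
t_min_1 = spread.idxmin()               # first minimum of the spread
t_max = spread.loc[t_min_1:].idxmax()   # maximum after that minimum
t_min_2 = spread.loc[t_max:].idxmin()   # minimum after that maximum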

Number of quarters gap to int

I calculate the number of quarters between two dates. Now, I want to test whether that number of quarters is bigger than 2.
Thank you for your comments!
I'm actually running code from WRDS (Wharton Research Data Services). Below, fst_vint is a DataFrame with two date variables, rdate and lag_rdate. The first line converts them to quarterly periods (e.g., 9/8/2019 to 2019Q3) and then takes the difference between them, storing it in a new column qtr.
fst_vint.qtr >= 2 creates a problem, because the former is a QuarterEnd offset object while the latter is an integer. How do I deal with this problem?
fst_vint['qtr'] = (fst_vint['rdate'].dt.to_period('Q')
                   - fst_vint['lag_rdate'].dt.to_period('Q'))
# label first_report flag
fst_vint['first_report'] = (fst_vint.qtr.isnull()) | (fst_vint.qtr >= 2)
Using .diff() after converting the column to an integer with .astype(int) gives the desired answer. So the code in your case would be:
fst_vint['qtr'] = fst_vint['rdate'].astype(int).diff()
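If you would rather keep the original period arithmetic, another possibility (an untested sketch) is to extract the integer from the offset objects that the period subtraction produces; depending on the pandas version these are QuarterEnd offsets whose .n attribute holds the number of quarters:
import pandas as pd

qtr = (fst_vint['rdate'].dt.to_period('Q')
       - fst_vint['lag_rdate'].dt.to_period('Q'))
# off.n is the integer number of quarters; keep missing values as None
fst_vint['qtr'] = qtr.apply(lambda off: off.n if pd.notnull(off) else None)
fst_vint['first_report'] = fst_vint['qtr'].isnull() | (fst_vint['qtr'] >= 2)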

Pandas convert timeframes to indexes

Can you please suggest an easy way to convert time periods to the corresponding indexes?
I have a function that picks entries from DataFrames based on numerical indexes (from the 10th to the 20th row) that I cannot change. At the same time, my DataFrame has time indexes, and I have picked parts of it based on timestamps. How do I convert those timestamps to the corresponding numerical indexes?
Thanks a lot
Alex
Adding some examples:
small_df.index[1]
Out[894]: Timestamp('2019-02-08 07:53:33.360000')
small_df.index[10]
Out[895]: Timestamp('2019-02-08 07:54:00.149000')
This is the time period I want to pick from a second data frame that also has time indexing, but using numerical indexes instead of timestamps. That means I first need to find which numerical indexes correspond to the time period above.
Based on the comment above, this might be quite close to what I need:
start = second_dataframe.index.get_loc(pd.Timestamp(small_df.index[1]))
end = second_dataframe.index.get_loc(pd.Timestamp(small_df.index[10]))
picked_rows = second_dataframe[start:end]
Is there a better way to do that?
I believe you need Index.get_loc if you need the position:
small_df.index.get_loc(pd.Timestamp('2019-02-08 07:53:33.360000'))
1
EDIT: If the values always match, it is possible to take the timestamps from the first DataFrame and extract the rows of the second one with DataFrame.loc:
start = small_df.index[1]
end = small_df.index[10]
picked_rows = second_dataframe.loc[start:end]
Or:
start = pd.Timestamp(small_df.index[1])
end = pd.Timestamp(small_df.index[10])
picked_rows = second_dataframe.loc[start:end]
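If many timestamps have to be converted at once and they all occur in the second index, Index.get_indexer returns all the positions in a single call (a sketch; unmatched timestamps would come back as -1):
positions = second_dataframe.index.get_indexer(small_df.index[1:11])
picked_rows = second_dataframe.iloc[positions]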
