I'm working with time-series data in this format:
[timestamp][rain value]
I want to count rainfall events in the time series, where a rainfall event is defined as a sub-dataframe of the main dataframe containing nonzero values between zero rainfall values.
I managed to get the start of the event by taking the index of the rainfall value just before the first nonzero value:
start = df.rain.values.nonzero()[0][0] - 1
cur = df[start:]
What I can't figure out is how to find the end. I was looking for some function zero():
end = cur.rain.values.zero()[0][0]
to find the next zero value in the rain column and mark it as the end of my sub-dataframe.
Additionally, because my data is sampled at 15-minute intervals, a temporary lull of 15 minutes would give me two rainfall events instead of one, which realistically isn't true. So I would like to define some time period, 6 hours for example, that separates rainfall events.
What I was thinking of (but could not execute, because I can't find the end of the sub-dataframe yet), in pseudocode:
start = df.rain.values.nonzero()[0][0] - 1
cur = df[start:]
end = cur.rain.values.zero()[0][0]
temp = df[end:]
z = temp.rain.values.nonzero()[0][0] - 1
if timedelta(z - end) >= 6 hours:
    end stays as endpoint of cur
else:
    z is new endpoint, find next nonzero to check again
So I guess my question is: how do I find the end of my sub-dataframe without iterating over all rows?
And am I on the right track with my pseudocode in defining the end of a rainfall event as, say, 6 hours of 0 rain?
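For what it's worth, there is no .zero() method, but (cur.rain.values == 0).nonzero()[0][0] gives the same index. A gap-based grouping sidesteps the forward scan entirely and bakes in the 6-hour rule. A minimal sketch, assuming df has a DatetimeIndex at 15-minute intervals and a 'rain' column:

import pandas as pd

# keep only the wet samples; a new event starts whenever the gap since
# the previous wet sample is at least 6 hours
wet = df[df['rain'] > 0]
gaps = wet.index.to_series().diff()
event_id = (gaps >= pd.Timedelta('6h')).cumsum()

# each group is one rainfall event; no row-by-row iteration needed
events = [event for _, event in wet.groupby(event_id)]

Each event here contains only the wet rows; if you also want the bounding zero-rain samples, as in the start - 1 trick above, you can widen each group by one sample on either side.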
I have the following problem set, which I have spent days trying to solve optimally:
Given a country evaluation process with 3 parameters (V1, V2, V3), which may be recorded on separate dates (Date1, Date2, Date3) respectively, each record of a single parameter is stored in one row, with date and value, as illustrated in the picture below.
I need to fill the empty cells in a row with the two parameters recorded in other rows, based on two rules:
For the country (ISO) and the parameter recorded in the current row: if recordings of the other two parameters for the same country exist in other rows of the table, choose the value/date from the row whose date is closest to the recording date in the current row.
Otherwise, if the same country doesn't have the other one or two parameters recorded in any other row yet, use the value and date recorded in the current row to fill the empty cells of the other two parameters in the current row.
For example:
Record NO.1: country "AND" only has one parameter, V1, recorded in the table (C2, D2); thus (E2, F2) and (G2, H2) should be filled with the value from (C2, D2): 123 and 2022/4/12.
Record NO.2: country "COR" has only one V2 recorded (E3, F3) and two V1 records, in Record NO.4 and Record NO.5. In Record NO.2, V1 (C3, D3) should therefore be filled with the value from Record NO.5 (C6, D6), since Record NO.5's date (D6, 2022.07.12) is closest to Record NO.2's date (D3, 2022.07.13).
The process has to loop through the dataframe to fill all the empty cells. Please help!
There's no way I can think of without looping over the whole dataset each time you encounter an empty value. With that in mind, this could take a while to run if the file is large. You may have to switch date formats if Excel auto-formats the dates differently.
import pandas
from datetime import datetime

df = pandas.read_csv(filepath_or_buffer=r"path/to/csv")
dt_format = '%Y/%m/%d'
# dt_format = '%m/%d/%Y'

for index, row in df.iterrows():
    checks = [pandas.isnull(row['V1']), pandas.isnull(row['V2']), pandas.isnull(row['V3'])]
    empties = [i + 1 for i in range(0, 3) if checks[i]]
    reference = checks.index(False) + 1  # the parameter that is recorded in this row
    for empty in empties:
        closest = None  # reset for every empty parameter
        iso = row['ISO']
        # timestamp of the date in the filled-in Date column
        ts = datetime.strptime(row[f'Date{reference}'], dt_format).timestamp()
        for index2, row2 in df.iterrows():
            if row2['ISO'] == iso and not pandas.isnull(row2[f'V{empty}']):
                if closest is None:
                    closest = (row2[f'V{empty}'], row2[f'Date{empty}'])
                else:
                    delta1 = abs(datetime.strptime(closest[1], dt_format).timestamp() - ts)
                    delta2 = abs(datetime.strptime(row2[f'Date{empty}'], dt_format).timestamp() - ts)
                    if delta2 < delta1:
                        closest = (row2[f'V{empty}'], row2[f'Date{empty}'])
        if closest is not None:
            df.at[index, f'V{empty}'] = closest[0]
            df.at[index, f'Date{empty}'] = closest[1]
        else:
            # rule 2: no other recording for this country, reuse the current row's value/date
            df.at[index, f'V{empty}'] = row[f'V{reference}']
            df.at[index, f'Date{empty}'] = row[f'Date{reference}']

df.to_csv(r"path/to/csv", index=False)
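If the dataset is large, pandas.merge_asof can do the nearest-date matching without the nested Python loop. A sketch under the assumption that the CSV has the columns from the question (ISO, V1, Date1, V2, Date2, V3, Date3); the ref_date/ref_val helper columns are my own additions:

import pandas as pd

df = pd.read_csv(r"path/to/csv", parse_dates=["Date1", "Date2", "Date3"])

# per-row reference value/date: the first parameter recorded in that row
df["ref_date"] = df[["Date1", "Date2", "Date3"]].bfill(axis=1).iloc[:, 0]
df["ref_val"] = df[["V1", "V2", "V3"]].bfill(axis=1).iloc[:, 0]

for p in (1, 2, 3):
    # lookup table of every row where parameter p was actually recorded
    lookup = (df.loc[df[f"V{p}"].notna(), ["ISO", f"Date{p}", f"V{p}"]]
                .rename(columns={f"Date{p}": "match_date", f"V{p}": "match_val"})
                .sort_values("match_date"))
    missing = df[f"V{p}"].isna()
    target = (df.loc[missing, ["ISO", "ref_date"]]
                .reset_index()
                .sort_values("ref_date"))
    # rule 1: nearest recording date for the same country
    filled = pd.merge_asof(target, lookup, left_on="ref_date",
                           right_on="match_date", by="ISO",
                           direction="nearest").set_index("index")
    df.loc[missing, f"V{p}"] = filled["match_val"]
    df.loc[missing, f"Date{p}"] = filled["match_date"]
    # rule 2: country has no other recording, so reuse the current row
    df[f"V{p}"] = df[f"V{p}"].fillna(df["ref_val"])
    df[f"Date{p}"] = df[f"Date{p}"].fillna(df["ref_date"])

df = df.drop(columns=["ref_date", "ref_val"])

merge_asof sorts once and matches in O(n log n), instead of re-scanning the whole frame for every empty cell.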
I have defined this function:
def RCP(row):
    ### This function is what we use to predict the total number of purchases our customers will make over the
    ### remainder of their lifetime as a customer. For each row in the dataframe, we iterate on the library's
    ### built-in `conditional_expected_number_of_purchases_up_to_time`, increasing t until the incremental RCP
    ### is below a certain threshold.
    init_pur = 0     # value from the previous loop iteration
    current_pur = 0  # updated after each loop iteration
    t = 1            # time
    eps_tol = 1e-6   # threshold for ending the loop
    while True:
        ## incremental number of purchases between t and t-1, added to the running total
        current_pur += (mbgf.conditional_expected_number_of_purchases_up_to_time(t, row['frequency'], row['recency'], row['T']) -
                        mbgf.conditional_expected_number_of_purchases_up_to_time(t - 1, row['frequency'], row['recency'], row['T']))
        # if the difference between this loop and the prior loop is below the threshold, stop
        if current_pur - init_pur < eps_tol:
            break
        init_pur = current_pur  # reset the starting loop value
        t += 1                  # increment the time period by 1
    return current_pur
What I am trying to do is run this function on each row in my dataframe, increasing t until the difference between the current value and the previous value is less than my threshold (defined here by eps_tol), then move on to the next row.
It is working as expected, but the problem is that it takes forever to run on dataframes of any meaningful size. I am currently working with a dataframe of 40k rows, and in some cases I will have dataframes with more than 100k rows.
Can anyone recommend how I might tweak this function, or rewrite it, so that it runs faster?
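Two observations can speed this up. First, the increments telescope: the sum of f(t) - f(t-1) up to t is just f(t), so each iteration needs one model call, not two. Second, the lifetimes fitters accept array-valued frequency/recency/T, so all rows can be advanced in lockstep instead of calling the model once per row. A sketch, assuming mbgf is the fitted lifetimes model from the question (max_t is an illustrative safety cap of my own):

import numpy as np

def rcp_vectorized(df, mbgf, eps_tol=1e-6, max_t=10_000):
    freq, rec, T = df['frequency'].values, df['recency'].values, df['T'].values
    prev = np.zeros(len(df))               # cumulative purchases up to t - 1
    result = np.empty(len(df))
    active = np.ones(len(df), dtype=bool)  # rows that have not converged yet
    for t in range(1, max_t + 1):
        cur = mbgf.conditional_expected_number_of_purchases_up_to_time(t, freq, rec, T)
        done = active & (cur - prev < eps_tol)  # increment fell below the threshold
        result[done] = cur[done]
        active &= ~done
        if not active.any():
            break
        prev = cur
    result[active] = prev[active]  # rows still running when the cap is reached
    return result

# usage: df['RCP'] = rcp_vectorized(df, mbgf)

Each pass evaluates the model once for the whole frame, so the total number of model calls drops from roughly 2 * rows * t to t.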
I have a dataframe with the following structure:
Timestamp, Value, Start, End
I would like to know, for each row, the maximum Value over the rows where Timestamp >= Start and Timestamp <= End.
I cannot use rolling.max() because the windows defined by Start and End are not of equal length.
This is some data I have. So basically, for each row, I would like to find the highest High[0] among the rows whose Time[0] falls between the current row's Time[0] (the start) and Time[1] (the end).
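If the Timestamp column is sorted, np.searchsorted can locate each window's slice boundaries without a Python-level scan. A minimal sketch, assuming the column names from the question (window_max is my own name for the result):

import numpy as np

# assumes df is sorted by Timestamp, and Start/End share its dtype
ts = df['Timestamp'].values
vals = df['Value'].values
lo = np.searchsorted(ts, df['Start'].values, side='left')   # first ts >= Start
hi = np.searchsorted(ts, df['End'].values, side='right')    # first ts > End
df['window_max'] = [vals[l:h].max() if h > l else np.nan
                    for l, h in zip(lo, hi)]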
I have a dataframe where rows of data are at one-second intervals, so 08:00:00, 08:00:01, etc. I want to take a rolling average over a period of 10 minutes, but I only want the rolling average to update on a minute-by-minute basis. So the rolling average values for 08:10:00 - 08:10:59 would all be the same, and then at 08:11:00 it would update to a new value for the next minute.
Currently I'm using the following line to calculate a rolling average which updates every second:
df['counts-avg'] = df['counts'].rolling(window=600).mean()
I have another column for the seconds value, df['sec']. I got the indices of rows where seconds = 0 (the zeroth second of each minute), replaced every other row with np.nan, and then used fillna(method='ffill') to copy values downward.
import numpy as np

df['counts-avg'] = df['counts'].rolling(window=600).mean()
erase_idx = df[df['sec'] > 0].index
# blank out every row that is not the zeroth second of a minute...
df.loc[erase_idx, 'counts-avg'] = np.nan
# ...then copy each minute-start value downward
df['counts-avg'] = df['counts-avg'].fillna(method='ffill')
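If the timestamps are the index, the same thing works without the helper 'sec' column. A minimal sketch, assuming df has a one-second DatetimeIndex:

avg = df['counts'].rolling(window=600).mean()  # 10-min mean, updated every second
# keep the value at the start of each minute, then hold it for the whole minute
df['counts-avg'] = avg[avg.index.second == 0].reindex(df.index).ffill()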
This post is quite long, and I will be very grateful to everybody who reads it to the end. :)
I am running into Python execution-time issues and would like to know if you have a better way of doing what I want.
Let me explain my problem briefly. I have plenty of solar panel measurements, one every 3 minutes. Unfortunately, some measurements can fail. The goal is to compare the timestamps in order to keep only the values that were measured in the same minute, and then retrieve them. A GUI is also included in my software, so each time the user changes the panels to compare, the calculation has to be done again. To do this, I implemented two parts: the first creates a vector of True or False for each panel for each minute, and the second compares those vectors and keeps only the common measurements.
All the data is contained in the pandas DataFrame energiesDatas. The relevant columns are:
name: contains the name of the panel (length 1)
date: contains the day of the measurement (length 1)
list_time: contains a list of all measurement times of a day (length N)
list_energy_prod: contains the corresponding measurements (length N)
The first part loops over all possible minutes from the beginning to the end of the measurements. If a measurement was taken at that minute, it stores True, otherwise False.
self.ListCompare2 = pd.DataFrame()
for n in self.NameList:  # loop over all my solar panels
    m = self.energiesDatas[self.energiesDatas['Name'] == n]  # all data for this panel
    # table_date contains all the possible dates from the 1st measure, at 1-min intervals
    table_list = [1 for i in range(len(table_date))]
    pointerDate = 0  # pointer to the current value of time
    # all the measures of a given day are transformed into hour-minute strings
    DateString = [b.strftime('%H-%M') for b in m['list_time'].iloc[pointerDate]]
    # some state for the tests below
    changeDate = 0
    count = 0
    # store the currently pointed date
    m_date = m['Date'].iloc[pointerDate]
    # for all possible times
    for curr_date in table_date:
        # if the considered date is bigger, move the pointer to the next day
        while curr_date.date() > m_date:
            pointerDate += 1
            changeDate = 1
            m_date = m['Date'].iloc[pointerDate]
        # if the day changed, recompute the measures of this new day
        if changeDate:
            DateString = [b.strftime('%H-%M') for b in m['list_time'].iloc[pointerDate]]
            changeDate = 0
        # check whether a measure was taken at the considered time
        table_list[count] = curr_date.strftime('%H-%M') in DateString
        count += 1
    # add to a dataframe
    self.ListCompare2[n] = table_list
l2 = self.ListCompare2
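For reference, the whole availability table can be built without the day-pointer bookkeeping by flattening the nested time lists first. A sketch, assuming the entries of list_time are full datetimes, and start and end (my own names) are the first and last measurement times:

import pandas as pd

table_date = pd.date_range(start, end, freq='min')  # every possible minute
presence = pd.DataFrame(index=table_date)
for n in self.NameList:
    m = self.energiesDatas[self.energiesDatas['Name'] == n]
    # flatten the per-day lists into one Series, rounded down to the minute
    measured = pd.Series([t for day in m['list_time'] for t in day]).dt.floor('min')
    presence[n] = presence.index.isin(measured)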
The second part is the following: given a list ListOfName of modules to compare, check whether they were measured at the same times, and keep only the values measured in the same minute.
ListToKeep = self.ListCompare2[ListOfName[0]]  # list of True or False computed before
for i in ListOfName[1:]:  # for each other panel, check whether it is True too
    ListToKeep = ListToKeep & self.ListCompare2[i]

for i in ListOfName:  # for each module, recover the values
    tmp = self.energiesDatas[self.energiesDatas['Name'] == i]
    count = 0
    # loop over the values we want to keep (also energy produced and the time intervals)
    for j, k, l, m, n in zip(tmp['list_time'], tmp['Date'], tmp['list_energy_prod'],
                             tmp['list_energy_rec'], tmp['list_interval']):
        # calculation of the index
        delta_day = (k - self.dt.date()).days * (18 * 60)
        # if the value of ListToKeep at the corresponding index is True, keep the value
        tmp['list_energy_prod'].iloc[count] = [l[index] for index, a in enumerate(j)
                                               if ListToKeep.iloc[delta_day + (a.hour - 4) * 60 + a.minute]]
        tmp['list_energy_rec'].iloc[count] = [m[index] for index, a in enumerate(j)
                                              if ListToKeep.iloc[delta_day + (a.hour - 4) * 60 + a.minute]]
        tmp['list_interval'].iloc[count] = [n[index] for index, a in enumerate(j)
                                            if ListToKeep.iloc[delta_day + (a.hour - 4) * 60 + a.minute]]
        count += 1
    self.store_compare = self.store_compare.append(tmp)
Actually, this part is the one that takes a very long time.
My question is: is there a way to save time, using built-in functions or anything else?
Thank you very much
Kilian
The answer from chris-sc solved my problem:
I believe your data structure isn't appropriate for your problem, especially the lists in the fields of a DataFrame; they make loops or apply almost unavoidable. Could you in principle restructure the data? (For example, one df per solar panel, with columns date, time, energy.)
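In such a long format the whole comparison collapses to a groupby. A minimal sketch, assuming a single long-format frame long_df with columns name, timestamp and energy (illustrative names):

import pandas as pd

long_df['minute'] = long_df['timestamp'].dt.floor('min')
sub = long_df[long_df['name'].isin(ListOfName)]

# keep only the minutes in which every selected panel has a measurement
per_minute = sub.groupby('minute')['name'].nunique()
common = per_minute[per_minute == len(ListOfName)].index
result = sub[sub['minute'].isin(common)]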