I have defined this function:
def RCP(row):
    ### This function is what we use to predict the total number of purchases our customers will make over the
    ### remainder of their lifetime as a customer. For each row in the dataframe, we iterate on the library's
    ### built-in `conditional_expected_number_of_purchases_up_to_time`, increasing t until the incremental RCP is below a
    ### certain threshold.
    init_pur = 0  # start the loop at this value
    current_pur = 0  # the value of this variable updates after each loop
    t = 1  # time
    eps_tol = 1e-6  # threshold for ending the loop
    while True:
        ## here we calculate the incremental number of purchases between t and t-1, which gets added to the previous value of the variable
        current_pur += (mbgf.conditional_expected_number_of_purchases_up_to_time(t, row['frequency'], row['recency'], row['T']) -
                        mbgf.conditional_expected_number_of_purchases_up_to_time((t-1), row['frequency'], row['recency'], row['T']))
        # if the difference between the most recent loop and the prior loop is less than the threshold, stop the loop
        if (current_pur - init_pur < eps_tol):
            break
        init_pur = current_pur  # reset the starting loop value
        t += 1  # increment the time period by 1
    return current_pur
What I am trying to do is run this function on each row in my dataframe until the difference between the current value and the previous value is less than my threshold (defined here by eps_tol), then move on to the next row.
It is working as expected, but the problem is that it takes forever to run on dataframes of any meaningful size. I am currently working with a dataframe of 40k rows, and in some cases I will have dataframes with more than 100k rows.
Can anyone recommend to me how I might be able to tweak this function - or re-write it - so that it runs faster?
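One observation that may help: the running sum in the loop telescopes, so after step t `current_pur` equals the model value at t minus the model value at 0, and `current_pur - init_pur` is simply the last increment. That means each step needs only one model call instead of two. Below is a minimal sketch of that idea. `expected_up_to` is a hypothetical stand-in for the library call (an assumption, chosen only so the example runs on its own); in real use you would pass something like `lambda t: mbgf.conditional_expected_number_of_purchases_up_to_time(t, row['frequency'], row['recency'], row['T'])`.

```python
import math

def expected_up_to(t, rate=0.5, ceiling=10.0):
    # Hypothetical stand-in for the model call; it saturates at `ceiling`.
    return ceiling * (1.0 - math.exp(-rate * t))

def rcp_cached(expected, eps_tol=1e-6):
    # Reuse the previous value so each step makes a single model call.
    start = expected(0)
    prev = start
    t = 1
    while True:
        cur = expected(t)
        # cur - prev is the incremental number of purchases for step t
        if cur - prev < eps_tol:
            # Telescoped sum of all increments up to and including step t
            return cur - start
        prev = cur
        t += 1

total = rcp_cached(expected_up_to)
```

This halves the number of calls per row but still loops row by row, so it is a first step rather than a full vectorization.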
I have a pandas DataFrame with more than 100 thousand rows. The index represents the time, and two columns hold the sensor data and the condition.
When the condition becomes 1, I want to start calculating a scorecard (average and standard deviation) until the next 1 comes. This needs to be calculated for the whole dataset.
Here is a picture of the DataFrame for a specific time span:
My idea is to iterate through the index and rows of the df and, when the condition is met, start calculating the descriptive statistics.
cycle = 0
for i, row in df_b.iterrows():
    if row['condition'] == 1:
        print('Condition is changed')
        cycle += 1
        print('cycle: ', cycle)
        #start = ?
        #end = ?
        #df_b.loc[start:end]
I am not sure how to calculate start and end for this DataFrame. The end will be the start for the next cycle. Additionally, I think this iteration is not optimal because it takes quite a long time. I appreciate any idea or solution for this problem.
Maybe start out with getting the rows where condition == 1:
cond_1_df = df.loc[df['condition'] == 1]
This dataframe will only contain the rows that meet your condition (being 1).
From here on, you can access the timestamps pairwise, meaning that the first element is beginning and second element is end, sketched below:
former = 0
stamp_pairs = []
# Use a separate name here so the original df stays available for the later filtering
cond_rows = cond_1_df.reset_index()  # make sure indexes pair with number of rows
for index, row in cond_rows.iterrows():
    if former != 0:
        beginning = former
        end = row["timestamp"]
        former = row["timestamp"]
    else:
        beginning = 0
        end = row["timestamp"]
        former = row["timestamp"]
    stamp_pairs.append([beginning, end])
This should give you something like this:
[[stamp0, stamp1], [stamp1,stamp2], [stamp2, stamp3]...]
For each [beginning, end] pair, you can again create a df containing only the subset of rows where beginning < timestamp < end:
time_cond_df = df.loc[(df['timestamp'] > beginning) & (df['timestamp'] < end)]
Finally, you get one time_cond_df per timestamp tuple, on which you can perform your score calculations.
Just make sure that your timestamps are comparable with the operators ">" and "<"! We can't tell, since you did not explain how you produced the timestamps.
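As an alternative to building timestamp pairs, the cycles can be labelled without an explicit loop: a cumulative sum over the condition column increases by one at every row where the condition is 1, so it works as a cycle id for `groupby`. The sketch below uses made-up sample data and a column name `sensor` (both assumptions), and it assumes each cycle runs from a condition-1 row up to, but not including, the next one.

```python
import pandas as pd

# Made-up sample data; the column name "sensor" is an assumption.
df_b = pd.DataFrame(
    {"sensor": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
     "condition": [1, 0, 0, 1, 0, 1]},
    index=pd.date_range("2022-01-01", periods=6, freq="min"),
)

# The cycle id goes up by one at every row where condition == 1.
cycle = df_b["condition"].eq(1).cumsum()

# Mean and standard deviation of the sensor values per cycle.
stats = df_b.groupby(cycle)["sensor"].agg(["mean", "std"])
```

Rows that come before the first condition-1 row end up in cycle 0 and can be dropped if they are not wanted. A cycle with a single row has an undefined (NaN) standard deviation.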
I have a .dat file made by an FPGA. The file contains 3 columns: the first is the input channel (it can be 1 or 2), the second column is the timestamp at which an event occurred, the third is the local time at which the same event occurred. The third column is necessary because sometimes the FPGA has to reset the clock counter in such a way that it doesn't count in a continuous way. An example of what I am saying is represented in the next figure.
An example of some lines from the .dat file is the following:
1 80.80051152 2022-02-24T18:28:49.602000
2 80.91821978 2022-02-24T18:28:49.716000
1 80.94284154 2022-02-24T18:28:49.732000
2 0.01856876 2022-02-24T18:29:15.068000
2 0.04225772 2022-02-24T18:29:15.100000
2 0.11766780 2022-02-24T18:29:15.178000
The time column is given by the FPGA (in tens of nanoseconds); the date column is written by the Python script that listens for data from the FPGA: whenever it writes a timestamp, it also saves the local time as a date.
I am interested in getting two arrays (one for each channel) that hold, for each event, the time at which that event occurred relative to the starting time of the acquisition. An example of how the data given above should look at the end is the following:
8.091821978000000115e+01
1.062702197800000050e+02
1.062939087400000062e+02
1.063693188200000179e+02
These data refer to the second channel only. They can be double-checked against the third column of the previous data.
I tried to achieve this with a function (too messy for my taste) that checks, for each pair of consecutive events, whether the difference in FPGA time differs by more than 1 second from the difference in local time. If so, I evaluate the time interval from the local time column and correct the timestamp by that amount:
ch, time, date = np.genfromtxt("events220302_1d.dat", unpack=True,
                               dtype=(int, float, 'datetime64[ms]'))
mask1 = ch==1
mask2 = ch==2
time1 = time[mask1]
time2 = time[mask2]
date1 = date[mask1]
date2 = date[mask2]
corr1 = np.zeros(len(time1))
for idx, val in enumerate(time1):
    if idx < len(time1) - 1:
        if check_dif(time1[idx], time1[idx+1], date1[idx], date1[idx+1]) == 0:
            corr1[idx+1] = val + (date1[idx+1]-date1[idx])/np.timedelta64(1,'s') - time1[idx+1]
time1 = time1 + corr1.cumsum()
Where check_dif is a function that returns 0 if the difference in time between consecutive events is inconsistent with the difference in date between the two same events as I said before.
Is there any more elegant or even faster way to get what I want with maybe some fancy NumPy coding?
A simple initial way to optimize your code is to make the code if-less, thus getting rid of both the if statements. To do so, instead of returning 0 in check_dif, you can return 1 when "the difference in time between consecutive events is inconsistent with the difference in date between the two same events as I said before", otherwise 0.
Your for loop will be something like this:
for idx in range(len(time1) - 1):
    is_dif = check_dif(time1[idx], time1[idx+1], date1[idx], date1[idx+1])
    # Elapsed local time between the two events, in seconds
    dt_local = (date1[idx+1] - date1[idx]) / np.timedelta64(1, 's')
    # Correction value: if is_dif == 0, no correction; otherwise a correction takes place
    corr1[idx+1] = is_dif * (time1[idx] + dt_local - time1[idx+1])
A more NumPy-like way to do this would be vectorization. I don't know whether you have benchmarked the speed or how big the file is, but I think in your case the previous change should be good enough.
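For reference, here is a minimal sketch of what a vectorized version could look like. Since check_dif was not shown, its logic is an assumption here: a clock reset is flagged whenever the FPGA time difference and the local time difference between consecutive events disagree by more than `tol` seconds. The function name `unwrap_time` and the parameter `tol` are hypothetical. The sample values are the channel 2 rows from the question.

```python
import numpy as np

def unwrap_time(time, date, tol=1.0):
    # Differences between consecutive events, both in seconds
    dt_fpga = np.diff(time)
    dt_local = np.diff(date) / np.timedelta64(1, "s")
    # Assumed check_dif logic: the two clocks disagree by more than tol seconds
    reset = np.abs(dt_fpga - dt_local) > tol
    # At a reset, shift by the gap that the FPGA counter lost
    corr = np.zeros(len(time))
    corr[1:] = np.where(reset, dt_local - dt_fpga, 0.0)
    # Every later event inherits all earlier corrections
    return time + np.cumsum(corr)

# Channel 2 rows from the example file
time2 = np.array([80.91821978, 0.01856876, 0.04225772, 0.11766780])
date2 = np.array(
    ["2022-02-24T18:28:49.716000", "2022-02-24T18:29:15.068000",
     "2022-02-24T18:29:15.100000", "2022-02-24T18:29:15.178000"],
    dtype="datetime64[ms]",
)
time2_fixed = unwrap_time(time2, date2)
```

On these four rows the result agrees with the expected channel 2 output listed in the question.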
When I run the code below one iteration at a time, applying the functions to the subset of the main dataframe that corresponds to a specific date, it runs quickly. When I try to loop it, however, it seemingly runs forever. There are 60 iterations to go through in total.
My goal is to create a new column (col3) for each date subset and combine everything into a single dataframe again.
data = pd.read_csv("df.csv")
data.YYMM = data.YYMM.apply(pd.to_datetime)
dates = data.groupby(data.YYMM).sum().index.values
data1 = pd.DataFrame()
for i in dates:
    df1 = data[data.YYMM == i]
    df1 = df1.sort_values(by='col1', ascending=False)
    df1['col2'] = df1.col1 / sum(df1.col1)
    df1['col3'] = reweight(df1.col2, cap)
    # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
    data1 = pd.concat([data1, df1], ignore_index=True)
Would appreciate any help!
The reweight function:
def reweight(weights, cap):
    # Obtain constrained weights
    constrained_wts = np.minimum(cap, weights)
    # Locate all stocks with less than max weight
    nonmax = constrained_wts.ne(cap)
    # Calculate adjustment factor - this is proportional to original weights
    adj = ((1 - constrained_wts.sum()) *
           weights.loc[nonmax] / weights.loc[nonmax].sum())
    # Apply adjustment to obtain final weights
    constrained_wts = constrained_wts.mask(nonmax, weights + adj)
    # Repeat process in loop till conditions are satisfied
    while ((constrained_wts.sum() < 1) or
           (len(constrained_wts[constrained_wts > cap]) >=1 )):
        # Obtain constrained weights
        constrained_wts = np.minimum(cap, constrained_wts)
        # Locate all stocks with less than max weight
        nonmax = constrained_wts.ne(cap)
        # Calculate adjustment factor - this is proportional to original weights
        adj = ((1 - constrained_wts.sum()) *
               constrained_wts.loc[nonmax] / weights.loc[nonmax].sum())
        # Apply adjustment to obtain final weights
        constrained_wts = constrained_wts.mask(nonmax, constrained_wts + adj)
    return constrained_wts
A vital part of debugging code is adding logging so that you get an idea of what the state of the script is. I would start by adding some log calls in reweight() so that you can get an idea of where your time is getting spent.
For more information on logging, take a look at Logging HOWTO in the Python Documentation. (Yes, you can just use print() if you want, but with logging it's easy to make it output the current time for each line by default). Make sure you add some idea of what is in your variables when you log so you can see just how much work is being performed in each iteration, and whether you might have a condition where you'll never satisfy the while condition.
Key places to log out would be at the start and end of the while loop, so you have an idea of how your exit conditions are moving with each iteration, and whether you just have a lot of iterations or if your iterations are slowing down.
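To make the suggestion concrete, here is a minimal sketch of a capped-weight loop with a log line per iteration. It is modelled only loosely on the reweight function above and is not a drop-in replacement: the function name `reweight_logged`, the iteration limit `max_iter` and the tolerance `tol` are assumptions added for illustration, and the adjustment is normalised by the current weights.

```python
import logging

import numpy as np
import pandas as pd

# Timestamps in the format make it easy to see where the time goes.
logging.basicConfig(format="%(asctime)s %(levelname)s %(message)s",
                    level=logging.DEBUG)
log = logging.getLogger("reweight")

def reweight_logged(weights, cap, max_iter=100, tol=1e-12):
    constrained = np.minimum(cap, weights)
    for i in range(max_iter):
        total = constrained.sum()
        n_over = int((constrained > cap).sum())
        # Log the quantities that drive the exit condition
        log.debug("iter=%d sum=%.12f over_cap=%d", i, total, n_over)
        if total >= 1 - tol and n_over == 0:
            return constrained
        constrained = np.minimum(cap, constrained)
        nonmax = constrained.ne(cap)
        # Spread the missing weight over the uncapped entries
        adj = ((1 - constrained.sum()) *
               constrained[nonmax] / constrained[nonmax].sum())
        constrained = constrained.mask(nonmax, constrained + adj)
    raise RuntimeError("no convergence after %d iterations" % max_iter)

result = reweight_logged(pd.Series([0.5, 0.3, 0.2]), cap=0.4)
```

If the logged sum hovers just below 1 without ever reaching it, the exit condition is probably being missed by floating-point rounding rather than by slow convergence, and the iteration limit turns a silent hang into a visible error.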
I'm working with time series data in this format:
[timestamp][rain value]
I want to count rainfall events in the time series data, where a rainfall event is defined as a sub-dataframe of the main dataframe that contains nonzero values between zero rainfall values.
I managed to get the start of the sub-dataframe by taking the index of the rainfall value before the first nonzero value:
start = df.rain.values.nonzero()[0][0] - 1
cur = df[start:]
What I can't figure out is how to find the end. I was looking for some function zero():
end=cur.rain.values.zero()[0][0]
to find the next zero value in the rain column and mark that as the end of my sub-dataframe.
Additionally, because my data is sampled at 15-minute intervals, a temporary lull of 15 minutes would give me two rainfall events instead of one, which isn't realistic. So I would like to define some time period (6 hours, for example) that separates rainfall events.
What I was thinking of (but could not execute because I couldn't find the end of the sub-dataframe yet), in pseudocode:
start = df.rain.values.nonzero()[0][0] - 1
cur = df[start:]
end=cur.rain.values.zero()[0][0]
temp = df[end:]
z = temp.rain.values.nonzero()[0][0] - 1
if timedelta (z-end) >=6hrs:
    end stays as endpoint of cur
else:
    z is new endpoint, find next nonzero to again check
So I guess my question is: how do I find the end of my sub-dataframe if I don't want to iterate over all rows?
And am I on the right track with my pseudocode in defining the end of a rainfall event as, say, 6 hours of 0 rain?
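One way to avoid iterating over rows is to look only at the wet samples and start a new event whenever the gap since the previous wet sample exceeds the chosen dry period. The sketch below assumes a DatetimeIndex and a column named `rain`; the function name `rain_events` and the parameter `min_dry` are hypothetical. Unlike the pseudocode, it reports the first and last wet sample of each event rather than the zero samples around it, and it measures the gap between wet samples, which at 15-minute sampling is slightly longer than the run of zeros.

```python
import pandas as pd

def rain_events(df, min_dry="6h"):
    # Keep only the samples with rain
    wet = df.loc[df["rain"] > 0, "rain"]
    times = wet.index.to_series()
    # A new event starts when the gap to the previous wet sample exceeds min_dry
    event_id = (times.diff() > pd.Timedelta(min_dry)).cumsum()
    return pd.DataFrame({
        "start": times.groupby(event_id).min(),
        "end": times.groupby(event_id).max(),
        "total_rain": wet.groupby(event_id).sum(),
    })

# Made-up sample: a short lull inside one event, then a long dry spell
idx = pd.date_range("2022-01-01", periods=48, freq="15min")
rain = [0.0] * 48
rain[1], rain[2], rain[4], rain[40] = 1.0, 2.0, 0.5, 3.0
events = rain_events(pd.DataFrame({"rain": rain}, index=idx))
```

With this sample the 15-minute lull stays inside the first event, and the sample 9 hours later becomes a second event.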
This post is quite long, and I will be very grateful to everybody who reads it to the end. :)
I am experiencing performance issues with my Python code and would like to know if there is a better way of doing what I want.
I will explain my problem briefly. I have plenty of solar panel measurements. Each panel is measured every 3 minutes. Unfortunately, some measurements can fail. The goal is to compare the measurement times in order to keep only the values that were measured in the same minute and then retrieve them. A GUI is also included in my software, so each time the user changes the panels to compare, the calculation has to be done again. To do so, I have implemented 2 parts: the first creates a vector of True or False for each panel for each minute, and the second compares these vectors and keeps only the common measurements.
All the data are contained in the pandas df energiesDatas. The relevant columns are:
Name: contains the name of the panel (length 1)
Date: contains the day of the measurement (length 1)
list_time: contains a list of all measurement times of a day (length N)
list_energy_prod: contains the corresponding measurements (length N)
The first part loops over all possible minutes from the beginning to the end of the measurements. If a measurement was taken, it adds True, otherwise False.
self.ListCompare2=pd.DataFrame()
for n in self.NameList:#loop over all my solar panels
    m=self.energiesDatas[self.energiesDatas['Name']==n]#all datas
    #table_date contains all the possible date from the 1st measure, with interval of 1 min.
    table_list=[1 for i in range(len(table_date))]
    pointerDate=0 #pointer to the current value of time
    #all the measures of a given day are transform into a str of hour-minutes
    DateString=[b.strftime('%H-%M') for b in m['list_time'].iloc[pointerDate] ]
    #some test
    changeDate=0
    count=0
    #store the current pointed date
    m_date=m['Date'].iloc[pointerDate]
    #for all possible time
    for curr_date in table_date:
        #if considered date is bigger, move pointer to next day
        while curr_date.date()>m_date:
            pointerDate+=1
            changeDate=1
            m_date=m['Date'].iloc[pointerDate]
        #if the day is changed, recalculate the measures of this new day
        if changeDate:
            DateString=[b.strftime('%H-%M') for b in m['list_time'].iloc[pointerDate] ]
            changeDate=0
        #check if a measure has been done at the considered time
        table_list[count]=curr_date.strftime('%H-%M') in DateString
        count+=1
    #add to a dataframe
    self.ListCompare2[n]=table_list
l2=self.ListCompare2
The second part is the following: given a "ListOfName" of modules to compare, check whether they were measured at the same time and keep only the values measured in the same minute.
ListToKeep=self.ListCompare2[ListOfName[0]]#take list of True or False done before
for i in ListOfName[1:]:#for each other panels, check if True too
    ListToKeep=ListToKeep&self.ListCompare2[i]
for i in ListOfName:#for each module, recover values
    tmp=self.energiesDatas[self.energiesDatas['Name']==i]
    count=0
    #loop over value we want to keep (also energy produced and the interval of time)
    for j,k,l,m,n in zip(tmp['list_time'],tmp['Date'],tmp['list_energy_prod'],tmp['list_energy_rec'],tmp['list_interval']):
        #calculation of the index
        delta_day=(k-self.dt.date()).days*(18*60)
        #if the value of ListToKeep corresponding to the index is True, we keep the value
        tmp['list_energy_prod'].iloc[count]=[ l[index] for index,a in enumerate(j) if ListToKeep.iloc[delta_day+(a.hour-4)*60+a.minute]==True]
        tmp['list_energy_rec'].iloc[count]=[ m[index] for index,a in enumerate(j) if ListToKeep.iloc[delta_day+(a.hour-4)*60+a.minute]==True]
        tmp['list_interval'].iloc[count]=[ n[index] for index,a in enumerate(j) if ListToKeep.iloc[delta_day+(a.hour-4)*60+a.minute]==True]
        count+=1
    #DataFrame.append was removed in pandas 2.0; pd.concat does the same job
    self.store_compare=pd.concat([self.store_compare,tmp])
Actually, this part is the one that takes a very long time.
My question is: is there a way to save time, using built-in functions or anything else?
Thank you very much
Kilian
The answer from chris-sc solved my problem:
I believe your data structure isn't appropriate for your problem. In particular, the lists stored in fields of a DataFrame make loops or apply almost unavoidable. Could you in principle restructure the data? (For example, one df per solar panel with columns date, time, energy.)
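To illustrate what a restructured layout could make possible, here is a minimal sketch that uses one long table with one row per measurement instead of lists in cells. This is a variation on the suggestion above, not the layout from the answer itself: the column names `name`, `time` and `energy`, the function name `common_measurements` and the sample values are all assumptions.

```python
import pandas as pd

# One row per measurement (made-up values)
long_df = pd.DataFrame({
    "name": ["A", "A", "A", "B", "B"],
    "time": pd.to_datetime([
        "2022-06-01 08:00:10", "2022-06-01 08:03:05", "2022-06-01 08:06:00",
        "2022-06-01 08:00:40", "2022-06-01 08:06:30",
    ]),
    "energy": [1.0, 1.1, 1.2, 2.0, 2.2],
})

def common_measurements(df, names):
    # Keep only the selected panels
    sel = df[df["name"].isin(names)].copy()
    # Round each measurement time down to its minute
    sel["minute"] = sel["time"].dt.floor("min")
    # Count how many distinct panels were measured in each minute
    n_panels = sel.groupby("minute")["name"].transform("nunique")
    # Keep the minutes in which every selected panel has a measurement
    return sel[n_panels == len(set(names))]

common = common_measurements(long_df, ["A", "B"])
```

Because the selection is a single filter plus a groupby, it can be rerun cheaply each time the user changes the panels to compare.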