Easy way to cluster entries in a numpy array with a condition - python

Background: I have a numpy array of float entries. It is a set of observations, say temperature measured over 24 hours. Imagine that the person recording the temperature is not available for the entire day; instead he/she takes a few (say 5) readings within an hour and then, a few hours later, takes more readings (say 8). He/she puts all the measurements in a single np.array and hands it over to me!
Problem: I have no idea when the readings were taken. So I decided to cluster the observations in the following way: first recognize local peaks in the array, then group together all entries that are close enough to each other (within a chosen tolerance, say 1 deg). In other words, I want to split the array into a list of sub-arrays. Note that every entry should belong to exactly one group.
One possible approach: first sort the array, then split it into sub-arrays with two conditions: (1) the difference between the first and last entries of a sub-array is not more than 1 deg, (2) the difference between the last entry of a sub-array and the first entry of the next sub-array is greater than 1 deg. How can I achieve this fast (the numpy way)?
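A minimal sketch of the sort-and-split idea, assuming that a gap larger than the tolerance between neighbouring sorted values marks a group boundary. The function name `cluster_by_gap` and the sample readings are made up for illustration. Note that this enforces only condition (2); a long chain of closely spaced values can still produce a group whose first and last entries differ by more than the tolerance.

```python
import numpy as np

def cluster_by_gap(values, tol=1.0):
    """Split sorted values wherever two neighbours differ by more than tol."""
    sorted_vals = np.sort(np.asarray(values, dtype=float))
    if sorted_vals.size == 0:
        return []
    # np.diff gives the gap between consecutive sorted entries;
    # a new group starts right after every gap larger than tol.
    breaks = np.flatnonzero(np.diff(sorted_vals) > tol) + 1
    return np.split(sorted_vals, breaks)

# Hypothetical temperature readings
readings = [20.1, 25.3, 20.4, 25.0, 30.2, 20.0]
groups = cluster_by_gap(readings, tol=1.0)
for g in groups:
    print(g)
```

Every entry ends up in exactly one group because `np.split` partitions the sorted array without overlap.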

Related

Start calculation if BIN is complete (Python)

I have a df with 250,000 rows, sorted by date. Another column called 'wind speed' consists of numbers from 1 to 25 in increments of 0.1, but is unsorted. Now I want to start a calculation with the 'Power' column as soon as I have seen every integer value of 'wind speed' from 1 to 25 at least once. If I have multiple values for one number, I want to build the average.
When I try an if clause, it checks the entire df and doesn't start the calculation when the BIN is complete. Once a BIN is complete, the search for a new BIN should start.
Does someone have an idea how to do it?
This is a part of my df:

How to identify levels of "plateaus" in a pandas series of float values?

I have a pandas dataframe df1 that contains, amongst others, a series of time measurements (duration n of experiment x on sample y; in seconds).
In theory, every duration n is an integer multiple of the shortest duration within the series. Note that the shortest possible duration varies across different samples.
In reality, the time measurements are an approximation only. When sorting the duration according to length in seconds and plotting the result, I get something like this:
I want to open a new column and assign an integer to every measurement. How to determine plateaus 1-3 in the figure above?
I am interested in a scalable solution and hence can't divide by the smallest number in the series, since I will be facing thousands of samples in the future.
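One possible way to label the plateaus without dividing by the shortest duration is to sort the values and start a new level at every large jump. The sketch below assumes that the jump between two plateaus is clearly larger than the scatter within a plateau; the function name `label_plateaus`, the threshold `min_jump`, and the sample durations are hypothetical. If a multiple is missing from a sample (for example levels 1, 2 and 4 occur but 3 does not), the labels are consecutive counters rather than the true multiples.

```python
import pandas as pd

def label_plateaus(durations, min_jump):
    """Assign 1, 2, 3, ... to values, starting a new level at each large jump."""
    s = pd.Series(durations, dtype=float)
    ordered = s.sort_values()
    # A difference above min_jump between sorted neighbours opens a new plateau.
    # The first difference is NaN, which compares as False.
    new_level = ordered.diff() > min_jump
    labels = new_level.cumsum() + 1
    # Put the labels back into the original row order.
    return labels.reindex(s.index).astype(int)

# Hypothetical durations in seconds
durations = [10.1, 20.2, 9.9, 30.0, 19.8, 10.0]
levels = label_plateaus(durations, min_jump=5)
print(levels.tolist())
```

For thousands of samples, the same function could be applied per sample via `groupby`, with `min_jump` chosen per sample.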

Correlation table

Suppose that you have hundreds of numpy arrays and you want to calculate the correlation between each pair of them. I calculated it with nested for loops, but execution took a huge amount of time (20 minutes!). One way to make this calculation more efficient is to calculate only one half of the correlation table, copy it to the other half, and set the diagonal to 1, since correlation(x,y)=correlation(y,x) and correlation(x,x) is always equal to 1. However, even with these corrections the code will still take a long time (approx. 7-8 minutes). Any other suggestions?
My code
import numpy as np
for x in data_set:
    for y in data_set:
        correlation = np.corrcoef(x,y)[1][0]
I am quite sure you can achieve much faster results by creating a 2-D array and calculating its correlation matrix (as opposed to calculating pairwise correlations one by one).
From numpy's corrcoef documentation the input can be:
" 1-D or 2-D array containing multiple variables and observations. Each row of m represents a variable, and each column a single observation of all those variables."
https://docs.scipy.org/doc/numpy/reference/generated/numpy.corrcoef.html
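A short sketch of that suggestion, assuming all arrays in `data_set` have the same length so they can be stacked as rows. The random sample data is made up for illustration.

```python
import numpy as np

# Hypothetical data: 5 arrays of 50 observations each
rng = np.random.default_rng(0)
data_set = [rng.normal(size=50) for _ in range(5)]

# Stack the arrays so that each row is one variable,
# then compute the whole correlation table in a single call.
stacked = np.vstack(data_set)
corr_matrix = np.corrcoef(stacked)

# corr_matrix[i, j] is the correlation between data_set[i] and data_set[j]
print(corr_matrix.shape)
```

The result is already symmetric with ones on the diagonal, so no manual copying of one half is needed.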

(Randomly?) find an amount by summing a 2D array

I have a 2D array with:
an index,
a numerical value
When I sum this 2D array I get an amount (let's say "a").
I am provided with another amount (let's say "b", with a != b, and b is the target), and the granularity isn't fine enough to distinguish one row from another.
The idea here is to try to find all the rows that make up b and discard the others.
What I am trying to do is build a script that (randomly?) selects rows and sums them until the sum approaches the target (that is, until the distance is reduced).
I was first thinking about starting at a random point and from there trying to sum each combination of rows, adding them up until
-> I have something close enough
or
-> the set number of iterations is over (1 million?)
... but with the number of rows involved this won't fit in memory.
Any ideas?
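A rough sketch of the random search described above, written so that only the best candidate seen so far is kept in memory instead of every combination. It is a heuristic, not an exact subset-sum solver, and it assumes non-negative values; the function name `random_subset_search`, its parameters, and the sample values are hypothetical.

```python
import random

def random_subset_search(values, target, tolerance=0.01, max_iter=10000, seed=None):
    """Randomly ordered greedy passes; returns (row indices, distance to target)."""
    rng = random.Random(seed)
    indices = list(range(len(values)))
    best_rows, best_dist = [], abs(target)
    for _ in range(max_iter):
        rng.shuffle(indices)
        total, chosen = 0.0, []
        for i in indices:
            # Keep a row only if it moves the running sum closer to the target.
            if abs(target - (total + values[i])) < abs(target - total):
                total += values[i]
                chosen.append(i)
        dist = abs(target - total)
        if dist < best_dist:
            best_rows, best_dist = sorted(chosen), dist
        if best_dist <= tolerance:
            break  # close enough, stop early
    return best_rows, best_dist

# Hypothetical values; the target b is smaller than the full sum a
values = [5.0, 3.0, 8.0, 2.0, 7.0]
rows, dist = random_subset_search(values, target=10.0, seed=0)
print(rows, dist)
```

Memory use stays proportional to the number of rows because each pass discards its candidate unless it beats the current best.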

Python Pandas optimization algorithm: compare datetime and retrieve datas

This post is quite long and I will be very grateful to everybody who reads it until the end. :)
I am experiencing execution issues with my Python code and would like to know if you have a better way of doing what I want.
I will explain my problem briefly. I have plenty of solar panel measurements. Each panel is measured every 3 minutes. Unfortunately, some measurements can fail. The goal is to compare the times in order to keep only the values that have been measured in the same minute, and then retrieve them. A GUI is also included in my software, so each time the user changes the panels to compare, the calculation has to be done again. To do so, I have implemented 2 parts: the first one creates a vector of True or False for each panel for each minute, and the second compares these vectors and keeps only the common measurements.
All the data are contained in the pandas df energiesDatas. The relevant columns are:
name: contains the name of the panel (length 1)
date: contains the day of the measurement (length 1)
list_time: contains a list of all time of measurement of a day (length N)
list_energy_prod : contains the corresponding measures (length N)
The first part loops over all possible minutes from the beginning to the end of the measurements. If a measurement has been done at that minute, it adds True, otherwise False.
self.ListCompare2=pd.DataFrame()
for n in self.NameList:#loop over all my solar panels
    m=self.energiesDatas[self.energiesDatas['Name']==n]#all datas
    #table_date contains all the possible date from the 1st measure, with interval of 1 min.
    table_list=[1 for i in range(len(table_date))]
    pointerDate=0 #pointer to the current value of time
    #all the measures of a given day are transform into a str of hour-minutes
    DateString=[b.strftime('%H-%M') for b in m['list_time'].iloc[pointerDate] ]
    #some test
    changeDate=0
    count=0
    #store the current pointed date
    m_date=m['Date'].iloc[pointerDate]
    #for all possible time
    for curr_date in table_date:
        #if considered date is bigger, move pointer to next day
        while curr_date.date()>m_date:
            pointerDate+=1
            changeDate=1
            m_date=m['Date'].iloc[pointerDate]
        #if the day is changed, recalculate the measures of this new day
        if changeDate:
            DateString=[b.strftime('%H-%M') for b in m['list_time'].iloc[pointerDate] ]
            changeDate=0
        #check if a measure has been done at the considered time
        table_list[count]=curr_date.strftime('%H-%M') in DateString
        count+=1
    #add to a dataframe
    self.ListCompare2[n]=table_list
l2=self.ListCompare2
The second part is the following: given a "ListOfName" of modules to compare, check whether they have been measured at the same time and keep only the values measured in the same minute.
ListToKeep=self.ListCompare2[ListOfName[0]]#take list of True or False done before
for i in ListOfName[1:]:#for each other panels, check if True too
    ListToKeep=ListToKeep&self.ListCompare2[i]
for i in ListOfName:#for each module, recover values
    tmp=self.energiesDatas[self.energiesDatas['Name']==i]
    count=0
    #loop over value we want to keep (also energy produced and the interval of time)
    for j,k,l,m,n in zip(tmp['list_time'],tmp['Date'],tmp['list_energy_prod'],tmp['list_energy_rec'],tmp['list_interval']):
        #calculation of the index
        delta_day=(k-self.dt.date()).days*(18*60)
        #if the value of ListToKeep corresponding to the index is True, we keep the value
        tmp['list_energy_prod'].iloc[count]=[ l[index] for index,a in enumerate(j) if ListToKeep.iloc[delta_day+(a.hour-4)*60+a.minute]==True]
        tmp['list_energy_rec'].iloc[count]=[ m[index] for index,a in enumerate(j) if ListToKeep.iloc[delta_day+(a.hour-4)*60+a.minute]==True]
        tmp['list_interval'].iloc[count]=[ n[index] for index,a in enumerate(j) if ListToKeep.iloc[delta_day+(a.hour-4)*60+a.minute]==True]
        count+=1
    self.store_compare=self.store_compare.append(tmp)
Actually, this part is the one that takes a very long time.
My question is: is there a way to save time, using built-in functions or anything else?
Thank you very much
Kilian
The answer of chris-sc solved my problem:
I believe your data structure isn't appropriate for your problem. Especially the list in fields of a DataFrame, they make loops or apply almost unavoidable. Could you in principle re-structure the data? (For example one df per solar panel with columns date, time, energy)
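To illustrate what such a restructuring could look like, here is a minimal sketch that assumes a long-format table with one row per measurement. The column names `name`, `timestamp` and `energy_prod`, the helper `common_measurements`, and the sample data are all hypothetical, not taken from the original software.

```python
import pandas as pd

# Hypothetical long-format data: one row per panel and measurement
long_df = pd.DataFrame({
    "name": ["A", "A", "A", "B", "B"],
    "timestamp": pd.to_datetime([
        "2020-01-01 08:00", "2020-01-01 08:03", "2020-01-01 08:06",
        "2020-01-01 08:00", "2020-01-01 08:06",
    ]),
    "energy_prod": [1.0, 1.2, 1.1, 0.9, 1.0],
})

def common_measurements(df, names):
    """Keep only rows whose minute was measured by every panel in names."""
    subset = df[df["name"].isin(names)].copy()
    subset["minute"] = subset["timestamp"].dt.floor("min")
    # Count how many distinct panels reported in each minute.
    panels_per_minute = subset.groupby("minute")["name"].nunique()
    shared = panels_per_minute.index[panels_per_minute == len(set(names))]
    return subset[subset["minute"].isin(shared)]

result = common_measurements(long_df, ["A", "B"])
print(result)
```

With this layout the comparison is a vectorized group-and-filter, so no Python loop over minutes is needed when the user changes the selected panels.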
