A dataframe splitting problem in Pandas, any thoughts? - python

The probe of an instrument cycles back and forth along the x direction while recording its position and acquiring measurements. The probe makes 10 cycles, say from 0 to 10 µm (out and back), and records the measurements. This gives two columns of data: position and measurement, where the position numbers cycle 0 µm -> 10 µm -> 0 -> 10 -> 0 ..., but these numbers carry experimental error, so they are all different.
I need to split the dataframe at the beginning of each cycle. Any interesting strategy to tackle this problem? Please let me know if you need more info. Thanks in advance.
Below is a link to an example of the dataframe that I have.
https://www.dropbox.com/s/af4r8lw5lfhwexr/Example.xlsx?dl=0
In this example the instrument made 3 cycles and generated the data (measurement). Cycle 1 = index 0-20; Cycle 2 = index 20-40; and Cycle 3 = index 40-60. I need to divide this dataframe into 3 dataframes, one for each cycle (index 0-20; index 20-40; index 40-60).
The tricky part is that the method needs to be "general", because each cycle can have a different number of datapoints (in this example it is fixed at 20), and different experiments can be performed with a different number of cycles.

My approach is to keep track of when the numbers start increasing again after decreasing, to determine the cycle number. Not very elegant, sorry.
import pandas as pd

df = pd.read_excel('Example.xlsx')

def cycle(array):
    increasing = 1
    cycle_num = 0
    answer = []
    for ind, val in enumerate(array):
        try:
            if array[ind + 1] - array[ind] >= 0:
                if increasing == 0:
                    cycle_num += 1
                increasing = 1
                answer.append(cycle_num)
            else:
                answer.append(cycle_num)
                increasing = 0
        except IndexError:  # last point has no successor; keep the current cycle
            answer.append(cycle_num)
    return answer

df['Cycle'] = cycle(df['Distance'].to_list())
grouped = df.groupby(['Cycle'])
print(grouped.get_group(0))
print(grouped.get_group(1))
print(grouped.get_group(2))
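For comparison, a vectorized sketch of the same idea, assuming the same 'Distance' column and fairly low noise (a single jittery point can still start a spurious cycle, so some smoothing may be needed first): a cycle boundary is a point where the position stops decreasing and starts increasing again.

import pandas as pd

df = pd.read_excel('Example.xlsx')
s = df['Distance']
# True while the position is non-decreasing (probe moving forward)
forward = s.diff().fillna(0).ge(0)
# a new cycle starts at each False -> True transition of `forward`
df['Cycle'] = (forward & ~forward.shift(fill_value=True)).cumsum()
# one sub-dataframe per cycle, whatever the number of cycles or points
sub_dfs = [g for _, g in df.groupby('Cycle')]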

Related

how to automatically classify a list of numbers

Well, the context is: I have a list of wind speeds, say 100 measurements from 0 to 50 km/h, loaded from a CSV. I want to automatically split them into bins of 5 km/h each, i.e. those that go from 0 to 5, those that go from 5 to 10, etc.
Let's go to the code:
wind = pd.read_csv("wind.csv")
df = pd.DataFrame(wind)
x = df["Value"]
d = sorted(pd.Series(x))
lst = [[] for i in range(0,(int(x.max())+1),5)]
this gives me a list of empty lists, i.e. if the winds go from 0 to 54 km/h it will create 11 empty lists.
Now, to classify I did this:
for i in range(0, len(lst), 1):
    for e in range(0, 55, 5):
        for n in d:
            if n > e and n < (e + 5):
                lst[i].append(n)
            else:
                continue
My objective is that when it reaches a number greater than 5, it jumps to the next level, i.e. adds 5 to the limits of the interval (e) and moves to the next i to fill the second empty list in lst. I tried several orderings of the loops, because I imagine they must go in a specific order to give a good result. This code is just one example of several that I tried, but they all gave me similar results: either all the lists were filled with all the numbers, or only the first list was filled with all the numbers.
Your title mentions classifying the numbers -- are you looking for a categorical output like calm | gentle breeze | strong breeze | moderate gale | etc.? If so, take a look at the second example on the pd.qcut docs.
Since you're already using pandas, use pd.cut with an IntervalIndex (constructed with the pd.interval_range function) to get a Series of bins, and then groupby on that.
import pandas as pd
from math import ceil

BIN_WIDTH = 5
wind_velocity = pd.read_csv("wind.csv")["Value"].sort_values()
upper_bin_lim = BIN_WIDTH * ceil(wind_velocity.max() / BIN_WIDTH)
bins = pd.interval_range(
    start=0,
    end=upper_bin_lim,
    periods=upper_bin_lim // BIN_WIDTH,
    closed='left')
velocity_bins = pd.cut(wind_velocity, bins)
groups = wind_velocity.groupby(velocity_bins)
for name, group in groups:
    pass  # TODO: use `name` and `group` to do stuff
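If the goal is just to count or summarize the measurements per interval, the groupby can also be aggregated directly instead of looping (a small sketch using the variables defined above):

bin_counts = groups.count()  # number of measurements per 5 km/h bin
bin_stats = groups.agg(['count', 'mean', 'max'])  # several statistics at once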

calculate descriptive statistics in pandas dataframe based on a condition cyclically

I have a pandas DataFrame with more than 100 thousand rows. The index represents the time and two columns represent the sensor data and the condition.
When the condition becomes 1, I want to start calculating a score card (average and standard deviation) until the next 1 comes. This needs to be calculated for the whole dataset.
Here is a picture of the DataFrame for a specific time span:
What I thought of is to iterate through the index and items of the df and, when the condition is met, start calculating the descriptive statistics.
cycle = 0
for i, row in df_b.iterrows():
    if row['condition'] == 1:
        print('Condition is changed')
        cycle += 1
        print('cycle: ', cycle)
        # start = ?
        # end = ?
        # df_b.loc[start:end]
I am not sure how to calculate start and end for this DataFrame. The end will be the start of the next cycle. Additionally, I think this iteration is not optimal, because it takes quite a long time. I appreciate any idea or solution for this problem.
Maybe start out with getting the rows where condition == 1:
cond_1_df = df.loc[df['condition'] == 1]
This dataframe will only contain the rows that meet your condition (being 1).
From here on, you can access the timestamps pairwise, meaning that the first element is the beginning and the second element is the end, sketched below:
former = 0
stamp_pairs = []
df = cond_1_df.reset_index()  # make sure indexes pair with number of rows
for index, row in df.iterrows():
    if former != 0:
        beginning = former
        end = row["timestamp"]
        former = row["timestamp"]
    else:
        beginning = 0
        end = row["timestamp"]
        former = row["timestamp"]
    stamp_pairs.append([beginning, end])
This should give you something like this:
[[stamp0, stamp1], [stamp1,stamp2], [stamp2, stamp3]...]
for each of these pairs, you can again create a df containing only the subset of rows whose timestamp lies between the pair's start and end (stamp_x and stamp_next here stand for one such pair):
time_cond_df = df.loc[(df['timestamp'] > stamp_x) & (df['timestamp'] < stamp_next)]
Finally, you get one time_cond_df per timestamp tuple, on which you can perform your score calculations.
Just make sure that your timestamps are comparable with the operators ">" and "<"! We can't tell, since you did not explain how you produced the timestamps.
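As a vectorized alternative, sketched under the assumption that the sensor column is named 'sensor' (the question does not name it): every condition == 1 row starts a new segment, a cumulative sum labels the segments, and a single groupby computes the statistics per segment without iterating.

import pandas as pd

# rows before the first condition == 1 fall into segment 0
segment = (df_b['condition'] == 1).cumsum()
# mean and standard deviation of the sensor data per segment
stats = df_b.groupby(segment)['sensor'].agg(['mean', 'std'])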

Take intersection of data-frames in a for loop

I have a data-frame of 4 columns
DateTime | WindSpeed1 | WindSpeed2 | Direction
I have created a for loop that iteratively takes the DateTime, WindSpeed1, WindSpeed2 columns and drops the rows with wind speed less than 3 m/s. My question is:
How do I access the iterations of the for loop in order to merge them on the common DateTime index?
I am using Python 2.7 by the way.
Thank you very much in advance.
I am sorry for not being clear enough. What I have mentioned above is actually for a GUI that I am programming.
As you can see below, I take the data frame and delete wind speeds less than a user-given value (result[4]). But since "v_range2" differs from iteration to iteration, I would like to find a way to merge every iteration on the common DateTime.
Do you have an idea how I can do that?
for i in range(0, len(dirbins)):  # iterates over sectors
    meansd = []  # mean ws by sector
    heightsd = []
    for j in result[0]:  # iterates over ws
        if j != "":
            v = m.datafr(df).df[j]
            v_range1 = v >= result[4]
            v_range2 = v[v_range1]
            vidx = m.chnames().index(j)
            heightsd.append(float(str(m.channels[vidx].height)))
            meansd.append(v_range2[digitized_d == i].mean())  # mean ws by sector
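One possible approach for the merging step, sketched under the assumption that each iteration yields a Series indexed by DateTime (column names taken from the question, threshold of 3 m/s as in the original description):

import pandas as pd

filtered = []
for col in ['WindSpeed1', 'WindSpeed2']:
    s = df.set_index('DateTime')[col]
    filtered.append(s[s >= 3])  # keep wind speeds of at least 3 m/s
# an inner join keeps only the DateTime values common to every iteration
merged = pd.concat(filtered, axis=1, join='inner')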

Finding a range in the series with the most frequent occurrence of the entries over a defined time (in Pandas)

I have a large dataset in Pandas in which the entries are marked with a time stamp. I'm looking for a way to get a range of a defined length (like 1 minute) with the highest occurrence of entries.
One solution could be to resample the data to a higher timeframe (such as a minute) and compare the sections with the highest number of values. However, it would only find ranges that correspond to the start and end times of the given timeframe.
I'd rather have a solution that finds any 1-minute range, no matter where it actually starts.
In the following example I would be looking for the 1-minute "window" with the highest occurrence of entries, starting with the first signal in the range and ending with the last signal in the range:
8:50:00
8:50:01
8:50:03
8:55:00
8:59:10
9:00:01
9:00:02
9:00:03
9:00:04
9:05:00
Thus I would like to get the range 8:59:10 - 9:00:04.
Any hint on how to accomplish this?
You need to create 1-minute windows with a sliding start time of 1 second, then compute the maximum occurrence over all of the windows. In pandas 0.19.0 or greater, you can resample a time series using base as an argument to start the resampled windows at different times. (In pandas 1.1+, base is deprecated in favor of the offset argument.)
I used tempfile to copy your data as a toy data set below.
import tempfile
import pandas as pd

tf = tempfile.TemporaryFile()
tf.write(b'''8:50:00
8:50:01
8:50:03
8:55:00
8:59:10
9:00:01
9:00:02
9:00:03
9:00:04
9:05:00''')
tf.seek(0)
df = pd.read_table(tf, header=None)
df.columns = ['time']
df.time = pd.to_datetime(df.time)

max_vals = []
for t in range(60):
    # .max().max() is not a mistake; it is used to return just the value
    max_vals.append(
        (t, df.resample('60s', on='time', base=t).count().max().max())
    )
max(max_vals, key=lambda x: x[-1])
# returns:
# (5, 5)
For this toy dataset, an offset of 5 seconds for the window (i.e. 8:49:05, 8:50:05, ...) is the first to reach the maximum count for a 1-minute window, with 5 entries.
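An alternative sketch that avoids resampling 60 times: a time-based rolling window over the same df as above anchors a 60-second window at every entry rather than at whole-second offsets, which matches the goal of windows starting anywhere.

df = df.sort_values('time')  # a time-based rolling window needs a sorted column
df['n'] = 1
# for each entry, count the entries in the 60 s window ending at that entry
counts = df.rolling('60s', on='time')['n'].sum()
end = df.loc[counts.idxmax(), 'time']  # 9:00:04 for the toy data above

The window with the most entries then ends at end; its start is the earliest entry within the preceding 60 seconds (8:59:10 here).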

Python Pandas optimization algorithm: compare datetime and retrieve datas

This post is quite long and I will be very grateful to everybody who reads it until the end. :)
I am experiencing Python code execution speed issues and would like to know if you have a better way of doing what I want.
I will explain my problem briefly. I have plenty of solar panel measurements. Each one is taken every 3 minutes. Unfortunately, some measurements can fail. The goal is to compare the times in order to keep only the values that were measured in the same minute, and then retrieve them. A GUI is also included in my software, so each time the user changes the panels to compare, the calculation has to be done again. To do so, I have implemented 2 parts: the first creates a vector of True or False for each panel for each minute, and the second compares the previous vectors and keeps only the common measures.
All the data are contained in the pandas DataFrame energiesDatas. The relevant columns are:
name: contains the name of the panel (length 1)
date: contains the day of the measurement (length 1)
list_time: contains a list of all time of measurement of a day (length N)
list_energy_prod: contains the corresponding measures (length N)
The first part loops over all possible minutes from the beginning to the end of the measurements. If a measure has been taken, it adds True; otherwise it adds False.
self.ListCompare2 = pd.DataFrame()
for n in self.NameList:  # loop over all my solar panels
    m = self.energiesDatas[self.energiesDatas['Name'] == n]  # all data
    # table_date contains all the possible dates from the 1st measure, with an interval of 1 min.
    table_list = [1 for i in range(len(table_date))]
    pointerDate = 0  # pointer to the current value of time
    # all the measures of a given day are transformed into a str of hour-minutes
    DateString = [b.strftime('%H-%M') for b in m['list_time'].iloc[pointerDate]]
    # some state
    changeDate = 0
    count = 0
    # store the currently pointed date
    m_date = m['Date'].iloc[pointerDate]
    # for all possible times
    for curr_date in table_date:
        # if the considered date is bigger, move the pointer to the next day
        while curr_date.date() > m_date:
            pointerDate += 1
            changeDate = 1
            m_date = m['Date'].iloc[pointerDate]
        # if the day has changed, recalculate the measures of this new day
        if changeDate:
            DateString = [b.strftime('%H-%M') for b in m['list_time'].iloc[pointerDate]]
            changeDate = 0
        # check if a measure has been taken at the considered time
        table_list[count] = curr_date.strftime('%H-%M') in DateString
        count += 1
    # add to a dataframe
    self.ListCompare2[n] = table_list
l2 = self.ListCompare2
The second part is the following: given a "ListOfName" of modules to compare, check whether they were measured at the same time and keep only the values measured in the same minute.
ListToKeep = self.ListCompare2[ListOfName[0]]  # list of True/False computed above
for i in ListOfName[1:]:  # for each other panel, check if True too
    ListToKeep = ListToKeep & self.ListCompare2[i]

for i in ListOfName:  # for each module, recover the values
    tmp = self.energiesDatas[self.energiesDatas['Name'] == i]
    count = 0
    # loop over the values we want to keep (also energy produced and the interval of time)
    for j, k, l, m, n in zip(tmp['list_time'], tmp['Date'], tmp['list_energy_prod'],
                             tmp['list_energy_rec'], tmp['list_interval']):
        # calculation of the index
        delta_day = (k - self.dt.date()).days * (18 * 60)
        # if the value of ListToKeep corresponding to the index is True, we keep the value
        tmp['list_energy_prod'].iloc[count] = [l[index] for index, a in enumerate(j)
                                               if ListToKeep.iloc[delta_day + (a.hour - 4) * 60 + a.minute]]
        tmp['list_energy_rec'].iloc[count] = [m[index] for index, a in enumerate(j)
                                              if ListToKeep.iloc[delta_day + (a.hour - 4) * 60 + a.minute]]
        tmp['list_interval'].iloc[count] = [n[index] for index, a in enumerate(j)
                                            if ListToKeep.iloc[delta_day + (a.hour - 4) * 60 + a.minute]]
        count += 1
    self.store_compare = self.store_compare.append(tmp)
Actually, this part is the one that takes a very long time.
My question is: is there a way to save time, using built-in functions or anything else?
Thank you very much
Kilian
The answer of chris-sc solved my problem:
I believe your data structure isn't appropriate for your problem, especially the lists in fields of a DataFrame: they make loops or apply almost unavoidable. Could you in principle restructure the data? (For example, one df per solar panel with columns date, time, energy.)
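A hedged sketch of that restructuring, assuming the column names from the question and pandas 1.3+ (where explode accepts several columns): with one row per measurement, aligning panels measured in the same minute becomes a plain merge instead of nested loops. The panel names 'panel_A' and 'panel_B' are hypothetical.

import pandas as pd

# one row per measurement instead of lists inside cells (column names assumed)
flat = energiesDatas.explode(['list_time', 'list_energy_prod'])
flat = flat.rename(columns={'list_time': 'time', 'list_energy_prod': 'energy'})
# measurements taken at the same date and time now align with a merge
a = flat[flat['Name'] == 'panel_A']  # hypothetical panel names
b = flat[flat['Name'] == 'panel_B']
common = a.merge(b, on=['Date', 'time'], suffixes=('_A', '_B'))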
