Calculate descriptive statistics in a pandas DataFrame based on a condition, cyclically - python

I have a pandas DataFrame with more than 100,000 rows. The index represents time, and two columns hold the sensor data and a condition flag.
When the condition becomes 1, I want to start calculating a score card (mean and standard deviation) until the next 1 appears. This needs to be calculated over the whole dataset.
Here is a picture of the DataFrame for a specific time span.
My idea was to iterate through the index and rows of the df, and start calculating the descriptive statistics whenever the condition is met.
cycle = 0
for i, row in df_b.iterrows():
    if row['condition'] == 1:
        print('Condition is changed')
        cycle += 1
        print('cycle: ', cycle)
        # start = ?
        # end = ?
        # df_b.loc[start:end]
I am not sure how to calculate start and end for this DataFrame. The end of one cycle will be the start of the next. Additionally, I don't think this iteration is optimal, because it takes quite a long time. I would appreciate any idea or solution for this problem.

Maybe start out with getting the rows where condition == 1:
cond_1_df = df.loc[df['condition'] == 1]
This dataframe will only contain the rows that meet your condition (being 1).
From here on, you can access the timestamps pairwise, meaning that the first element is the beginning and the second element is the end, as sketched below:
former = 0
stamp_pairs = []
df = cond_1_df.reset_index()  # make sure indexes pair with number of rows
for index, row in df.iterrows():
    if former != 0:
        beginning = former
        end = row["timestamp"]
        former = row["timestamp"]
    else:
        beginning = 0
        end = row["timestamp"]
        former = row["timestamp"]
    stamp_pairs.append([beginning, end])
This should give you something like this:
[[stamp0, stamp1], [stamp1, stamp2], [stamp2, stamp3], ...]
for each of these pairs, you can again create a df containing only the subset of rows where stamp_x < timestamp < stamp_x_plus_1 (i.e., between the two stamps of a pair):
time_cond_df = df.loc[(df['timestamp'] > stamp_x) & (df['timestamp'] < stamp_x_plus_1)]
Finally, you get one time_cond_df per timestamp tuple, on which you can perform your score calculations.
Just make sure that your timestamps are comparable with the operators ">" and "<"! We can't tell whether they are, since you did not explain how you produced the timestamps.
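If your condition column really is just 0/1, you can also skip the pairwise bookkeeping entirely: a cumulative sum turns the condition flags into cycle labels you can group on. A minimal sketch, assuming the sensor column is called 'sensor' (swap in your real column name):

# each 1 in 'condition' starts a new cycle, so a running sum labels every
# row with the cycle it belongs to (rows before the first 1 get label 0)
df_b['cycle'] = df_b['condition'].cumsum()

# mean and standard deviation of the sensor data, one row per cycle
stats = df_b.groupby('cycle')['sensor'].agg(['mean', 'std'])
print(stats)

This avoids iterrows completely, which should matter at 100,000+ rows.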

Related

A dataframe splitting problem in Pandas, any thoughts?

The probe of an instrument cycles back and forth along the x direction while recording its position and acquiring measurements. The probe makes 10 cycles, let's say from 0 to 10 um (there and back), and records the measurements. This gives 2 columns of data: position and measurement, where the position numbers cycle 0 um -> 10 um -> 0 -> 10 -> 0 ..., but these numbers carry experimental error, so they are all different.
I need to split the dataframe at the beginning of each cycle. Any interesting strategy to tackle this problem? Please let me know if you need more info. Thanks in advance.
Below is link to an example of the dataframe that I have.
https://www.dropbox.com/s/af4r8lw5lfhwexr/Example.xlsx?dl=0
In this example the instrument made 3 cycles and generated the data (measurement). Cycle 1 = Index 0-20; Cycle 2 = Index 20-40; and Cycle 3 = Index 40-60. I need to divide this dataframe into 3 dataframes, one for each cycle (Index 0-20; Index 20-40; Index 40-60).
The tricky part is that the method needs to be "general", because each cycle can have a different number of data points (in this example it is fixed at 20), and different experiments can be performed with a different number of cycles.
My approach is to keep track of when the numbers start increasing again after decreasing, to determine the cycle number. Not very elegant, sorry.
import pandas as pd

df = pd.read_excel('Example.xlsx')

def cycle(array):
    increasing = 1
    cycle_num = 0
    answer = []
    for ind, val in enumerate(array):
        try:
            if array[ind + 1] - array[ind] >= 0:
                # direction flipped from decreasing to increasing: a new cycle starts
                if increasing == 0:
                    cycle_num += 1
                increasing = 1
                answer.append(cycle_num)
            else:
                answer.append(cycle_num)
                increasing = 0
        except IndexError:  # the last element has no successor
            answer.append(cycle_num)
    return answer

df['Cycle'] = cycle(df['Distance'].to_list())
grouped = df.groupby(['Cycle'])
print(grouped.get_group(0))
print(grouped.get_group(1))
print(grouped.get_group(2))
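A vectorized sketch of the same turning-point idea, swapping the manual loop for numpy's diff and sign; it assumes the position column is named Distance as above, and boundary rows may be labeled slightly differently than in the loop version:

import numpy as np
import pandas as pd

df = pd.read_excel('Example.xlsx')
pos = df['Distance'].to_numpy()

# +1 where the position is (non-strictly) increasing, -1 where decreasing
direction = np.sign(np.diff(pos))
direction[direction == 0] = 1  # treat flat steps as increasing

# a new cycle starts wherever the direction flips from -1 to +1
turns = (direction[1:] == 1) & (direction[:-1] == -1)
cycle_id = np.concatenate(([0, 0], np.cumsum(turns)))

df['Cycle'] = cycle_id
for _, group in df.groupby('Cycle'):
    print(group.head())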

Pandas: Comparing all rows within group and check if condition is fulfilled

I am trying to compare all rows within a group to check if a condition is fulfilled. If the condition is not fulfilled, I set the new column to True, else False. The issue I am having is finding a neat way to compare all rows within each group. I have something that works, but it will not scale to groups with many rows.
for i in range(8):
    n = -i - 1
    cond = (((df['age'] - df['age'].shift(n)) * (df['weight'] - df['weight'].shift(n))) < 0) \
        & (df['ref'] == df['ref'].shift(n)) & (df['age'] < 7) & (df['age'].shift(n) < 7)
    df['x' + str(i)] = cond.groupby(df['ref']).transform('any')

df.loc[:, 'WFA'] = 0
df.loc[(df['x0'] == False) & (df['x1'] == False) & (df['x2'] == False) & (df['x3'] == False)
       & (df['x4'] == False) & (df['x5'] == False) & (df['x6'] == False) & (df['x7'] == False), 'WFA'] = 1
To iterate through each row, I have created a loop that compares adjacent rows (using shift). Each pass of the loop compares rows one step further apart. In effect, I am able to compare all rows within a group as long as the group has 8 or fewer rows. As you can imagine, this becomes pretty cumbersome as the number of rows grows.
Instead of creating a column for each shift period, I want to check whether any row matches the condition with any other row, and then set the new column 'WFA' to True or False.
If anyone is interested, I am posting the answer to my own question here (although it is very slow):
df.loc[:, 'WFA'] = 0
for ref, gref in df.groupby('ref'):
    count = 0
    for r_idx, row in gref.iterrows():
        cond = ((((row['age'] - gref.loc[gref['age'] < 7, 'age'])
                  * (row['weight'] - gref.loc[gref['age'] < 7, 'weight'])) < 0).any()) \
               & (row['age'] < 7)
        if cond == False:
            count += 1
    if count == len(gref):
        df.loc[df['ref'] == ref, 'WFA'] = 1
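For anyone hitting the same speed problem: a hedged sketch of the all-pairs check using numpy broadcasting inside each group instead of iterrows. Column names follow the question; it is still O(n^2) per group, but the inner work is vectorized:

import numpy as np

def wfa_flag(group):
    # only rows under age 7 take part in the pairwise comparison
    sub = group[group['age'] < 7]
    age = sub['age'].to_numpy()
    weight = sub['weight'].to_numpy()
    # pairwise products of age and weight differences; a negative value
    # means age and weight moved in opposite directions for that pair
    opposite = (age[:, None] - age[None, :]) * (weight[:, None] - weight[None, :]) < 0
    # WFA is 1 only when no pair moves in opposite directions
    return 0 if opposite.any() else 1

flags = df.groupby('ref').apply(wfa_flag)
df['WFA'] = df['ref'].map(flags)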

Take intersection of data-frames in a for loop

I have a data-frame with 4 columns:
DateTime | WindSpeed1 | WindSpeed2 | Direction
I have created a for loop in which I take the DateTime, WindSpeed1, and WindSpeed2 columns iteratively and drop the rows with wind speed below 3 m/s. My question is:
How can I access the results of each iteration of the for loop so that I can merge them on their common DateTime index?
I am using Python 2.7 by the way.
Thank you very much in advance.
I am sorry for not being clear enough. What I described above is actually for a GUI that I am programming.
As you can see below, I take the data frame and delete wind speeds below a user-given value (result[4]). But since v_range2 differs from iteration to iteration, I would like to find a way to merge every iteration on the common DateTime.
Do you have an idea how I can do that?
for i in range(0, len(dirbins)):  # iterates over sectors
    meansd = []  # mean wind speed by sector
    heightsd = []
    for j in result[0]:  # iterates over wind-speed channels
        if j != "":
            v = m.datafr(df).df[j]
            v_range1 = v >= result[4]
            v_range2 = v[v_range1]
            vidx = m.chnames().index(j)
            heightsd.append(float(str(m.channels[vidx].height)))
            meansd.append(v_range2[digitized_d == i].mean())  # mean ws by sector
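One hedged way to get that merge, building on the loop above: collect each filtered Series into a dict and let pandas align them on their DateTime index. result, m, and df are taken from the question's code:

import pandas as pd

filtered = {}
for j in result[0]:  # same channel loop as above
    if j != "":
        v = m.datafr(df).df[j]
        filtered[j] = v[v >= result[4]]  # keep speeds above the user-given value

# join='inner' keeps only the DateTime stamps present in every filtered column
merged = pd.concat(filtered, axis=1, join='inner')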

Place Variable into a Specific Location [row,column] Pandas Python

I've been working on placing a string variable "revenue" into a pandas dataframe df1. As you can see, I used df.iat.
More details about the code: it finds the row for a specific date by counting rows with the loop variable m.
My issue occurs at the .iat call.
if info[1] == "1":  # Get Kikko1
    listofdates = df.Date.tolist()
    m = 0
    for i in listofdates:
        if i != date:  # counting the rows
            m = m + 1
        elif i == date:  # select the row with the matching date
            df.iat[m, 9] = "revenue"
            break
The error says:
IndexError: index 36 is out of bounds for axis 0 with size 31
One of the main benefits of using a package like pandas is to avoid this kind of manual looping, which is very difficult to follow and modify.
I think you can do what you need to in one line. Something like:
df1.loc[date, 9] = 'revenue'
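Or, if Date is an ordinary column rather than the index, a boolean mask does the same job; here 'Revenue' is a placeholder for whichever column position 9 was meant to be:

# set the Revenue column on the row(s) whose Date matches
df.loc[df['Date'] == date, 'Revenue'] = 'revenue'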
If that doesn't work, could you edit into your question some example data and your desired output?

What is the most efficient way to count the number of instances occurring within a time frame in python?

I am trying to run a simple count function which runs a dataframe of event times (specifically surgeries) against another dataframe of shift time frames, and returns a list of how many events occur during each shift. These CSVs are thousands of rows long, though, so while my current setup works, it takes forever. This is what I have:
numSurgeries = [0 for shift in range(len(df.Date))]
for i in range(len(OR['PATIENT_IN_ROOM_DTTM'])):
    for shift in range(len(df.DateTime)):
        if OR['PATIENT_IN_ROOM_DTTM'][i] >= df.DateTime[shift] and \
           OR['PATIENT_IN_ROOM_DTTM'][i] < df.DateTime[shift + 1]:
            numSurgeries[shift] += 1
So it loops through each event and checks to see which shift time frame it is in, then increments the count for that time frame. Logical, works, but definitely not efficient.
EDIT:
Example of OR data file
Example of df data file
Without example data, it's not absolutely clear what you want. But this should help you vectorise:
numSurgeries = {shift: np.sum((OR['PATIENT_IN_ROOM_DTTM'] >= df.DateTime[shift]) &
                              (OR['PATIENT_IN_ROOM_DTTM'] < df.DateTime[shift + 1]))
                for shift in range(len(df.Date))}
The output is a dictionary mapping integer shift to numSurgeries.
As mentioned above, it is hard to answer without example data.
However, a boolean mask sounds fitting. See Select dataframe rows between two dates.
Create a date mask from the shift; we'll call the start and end dates start_shift and end_shift, respectively. These should be in datetime format.
date_mask = (df['datetime'] >= start_shift) & (df['datetime'] <= end_shift)
Locate all values in df that fit this range.
df_shift = df.loc[date_mask]
Count the number of instances in the new df_shift.
num_surgeries = len(df_shift.index)
Cycle through all shifts.
def count_shifts(row):
    # each row of results_df carries this shift's start and end times
    date_mask = (df['datetime'] >= row['start_shift']) & (df['datetime'] <= row['end_shift'])
    df_shift = df.loc[date_mask]
    return len(df_shift.index)

# applies the above function to every row of results_df
results_df['num_surgeries'] = results_df.apply(count_shifts, axis=1)
Also remember to name variables according to the PEP 8 style guide! camelCase is not recommended in Python.
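If the per-shift masks are still too slow, a further hedged sketch with np.searchsorted bins every event in one pass. It assumes the column names from the question and that the shift start times are sorted ascending; each event is assigned to the most recent shift start:

import numpy as np

starts = df['DateTime'].to_numpy()
events = OR['PATIENT_IN_ROOM_DTTM'].to_numpy()

# for each event, the index of the last shift that starts at or before it
idx = np.searchsorted(starts, events, side='right') - 1

# count events per shift, ignoring events that fall before the first shift
valid = idx >= 0
num_surgeries = np.bincount(idx[valid], minlength=len(starts))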
