The probe of an instrument is cycling back and forward along an x direction while is recording its position and acquiring the measurements. The probe makes 10 cycles, let's say from 0 to 10 um (go and back) and records the measurements. This gives 2 columns of data: position and measurement, where the position number cycle 0um->10um->0->10->0..., but these numbers have an experimental error so they are all different.
I need to split the dataframe at the beginning of each cycle. Any interesting strategy to tackle this problem? Please, let me know if you need more info. Thank in advance.
Below is link to an example of the dataframe that I have.
https://www.dropbox.com/s/af4r8lw5lfhwexr/Example.xlsx?dl=0
In this example the instrument made 3 cycles and generated the data (measurement). Cycle 1 = Index 0-20; Cycle 1 = Index 20-40; and Cycle 1 = Index 40-60. I need to divide this dataframe into 3 dataframes, one for each cycle (Index 0-20; Index 20-40; Index 40-60).
The tricky part is that the method needs to be "general" because each cycle can have a different number of datapoints (in this example that is fixed to 20), and different experiments can be performed with a different number of cycles.
My objective is to keep tract when the numbers start increasing again after decreasing to determine the cycle number. Not very elegant sorry.
import pandas as pd
df = pd.read_excel('Example.xlsx')
def cycle(array):
increasing = 1
cycle_num = 0
answer = []
for ind,val in enumerate(array):
try:
if array[ind+1]-array[ind]>=0:
if increasing==0:
cycle_num+=1
increasing=1
answer.append(cycle_num)
else:
answer.append(cycle_num)
increasing=0
except:
answer.append(cycle_num)
return answer
df['Cycle'] = cycle(df['Distance'].to_list())
grouped = df.groupby(['Cycle'])
print(grouped.get_group(0))
print(grouped.get_group(1))
print(grouped.get_group(2))
I am trying to run a simple count function which runs a dataframe of event times (specifically surgeries) against another dataframe of shift time frames, and returns a list of how many events occur during each shift. These csvs are thousands of rows, though, so while the way I have it set up currently works, it takes forever. This is what I have:
numSurgeries = [0 for shift in range(len(df.Date))]
for i in range(len(OR['PATIENT_IN_ROOM_DTTM'])):
for shift in range(len(df.DateTime)):
if OR['PATIENT_IN_ROOM_DTTM'][i] >= df.DateTime[shift] and OR['PATIENT_IN_ROOM_DTTM'][i] < df.DateTime[shift+1]:
numSurgeries[shift] += 1
So it loops through each event and checks to see which shift time frame it is in, then increments the count for that time frame. Logical, works, but definitely not efficient.
EDIT:
Example of OR data file
Example of df data file
Without example data, it's not absolutely clear what you want. But this should help you vectorise:
numSurgeries = {shift: np.sum((OR['PATIENT_IN_ROOM_DTTM'] >= df.DateTime[shift]) & \
(OR['PATIENT_IN_ROOM_DTTM'] < df.DateTime[shift+1])) \
for shift in range(len(df.Date))}
The output is a dictionary mapping integer shift to numSurgeries.
As mentioned above, it is hard to answer without example data.
However, a boolean mask sounds fitting. See Select dataframe rows between two dates.
Create a date mask from shift, we'll call the start and end dates start_shift and end_shift respectively. These should be in datetime format.
date_mask = (df['datetime'] >= start_shift) & (df['datetime'] <= end_shift)
Locate all values in df that fit this range.
df_shift = df.loc[date_mask]
Count the number of instances in the new df_shift.
num_surgeries = len(df_shift.index())
Cycle through all shifts.
def count_shifts(df, shift, results_df, start_shift, end_shift):
date_mask = (df['datetime'] >= start_shift) & (df['datetime'] <= end_shift)
df_shift = df.loc[date_mask]
num_surgeries = len(df_shift.index())
return(num_surgeries)
# iterates through df and applies the above function to every row
results_df['num_surgeries'] = results_df.apply(calculate_num_surgeries,axis=1)
Also remember to name variables according to PEP8 Style Guide! Camelcase is not recommended in Python.
I am trying to divide a huge log data sets containing log data with StartTime and EndTime and other stuff.
I am using np.where to compare pandas dataframe object and then to divide it to hourly (may be half hour or quarterly) chunks, depends on hr and timeWindow value.
Below, Here, I am trying to divide the entire day logs to 1 hour chunks, but It does not gives me expected output.
I am out of ideas like where exactly my fault is!
# Holding very first time in the log data and stripping off
# second, minutes and microseconds.
today = datetime.strptime(log_start_time, "%Y-%m-%d %H:%M:%S.%f").replace(second = 0, minute = 0, microsecond = 0)
today_ts = int(time.mktime(today.timetuple())*1e9)
hr = 1
timeWindow = int(hr*60*60*1e9) #hour*minute*second*restdigits
parts = [df.loc[np.where((df["StartTime"] >= (today_ts + (i)*timeWindow)) & \
(df["StartTime"] < (today_ts + (i+1)*timeWindow)))].dropna(axis= 0, \
how='any') for i in range(0, rngCounter)]
If I check for first log entry inside my parts data, it is something like below:
00:00:00.
00:43:23.
01:12:59.
01:53:55.
02:23:52.
....
Where as I expect the output to be like below:
00:00:00
01:00:01
02:00:00
03:00:00
04:00:01
....
Though I have implemented it using an alternative way, but that's a work around and I lost few features by not having it like this.
Can Someone please figure out what exactly wrong with this approach?
Note: I am using python notebook with pandas, numpy.
Thanks to #raganjosh.
I got my solution to the problem by using pandas Grouper.
Below is my implementation.
I have used dynamic value for 'hr'.
timeWindow = str(hr)+'H'
# Dividing the log into "n" parts. Depends on timewindow initialisation.
df["ST"] = df['StartTime']
df = df.set_index(['ST'])
# Using the copied column as an index.
df.index = pd.to_datetime(df.index)
# Here the parts contain grouped chunks of data as per timewindow, list[0] = key of the group, list[1] = values.
parts = list(df.groupby(pd.TimeGrouper(freq=timeWindow))['StartTime', "ProcessTime", "EndTime"])
I have AWS EC2 instance CPU utilization and other metric data given to me in CSV format like this:
Date,Time,CPU_Utilization,Unit
2016-10-17,09:25:00,22.5,Percent
2016-10-17,09:30:00,6.534,Percent
2016-10-17,09:35:00,19.256,Percent
2016-10-17,09:40:00,43.032,Percent
2016-10-17,09:45:00,58.954,Percent
2016-10-17,09:50:00,56.628,Percent
2016-10-17,09:55:00,25.866,Percent
2016-10-17,10:00:00,17.742,Percent
2016-10-17,10:05:00,34.22,Percent
2016-10-17,10:10:00,26.07,Percent
2016-10-17,10:15:00,20.066,Percent
2016-10-17,10:20:00,15.466,Percent
2016-10-17,10:25:00,16.2,Percent
2016-10-17,10:30:00,14.27,Percent
2016-10-17,10:35:00,5.666,Percent
2016-10-17,10:40:00,4.534,Percent
2016-10-17,10:45:00,4.6,Percent
2016-10-17,10:50:00,4.266,Percent
2016-10-17,10:55:00,4.2,Percent
2016-10-17,11:00:00,4.334,Percent
2016-10-17,11:05:00,4.334,Percent
2016-10-17,11:10:00,4.532,Percent
2016-10-17,11:15:00,4.266,Percent
2016-10-17,11:20:00,4.266,Percent
2016-10-17,11:25:00,4.334,Percent
As in evident, this is reported every 5 minutes. I do not have access to the aws-cli. I need to process this and report average utilization every 15 minutes for visualization. That is, for every hour, I need to find the average of the values in the first 15 minutes, the next fifteen minutes and so on. So, I will be reporting 4 values every hour.
A sample output would be:
Date,Time,CPU_Utilization,Unit
2016-10-17,09:30:00,14.517,Percent
2016-10-17,09:45:00,40.414,Percent
2016-10-17,10:00:00,33.412,Percent
2016-10-17,10:15:00,26.785,Percent
...
One way to do it would be to read the entire file (which has 10000+) lines, then for each date, find the values which belong to one window of 15 minutes, compute their average and repeat for all the values. This does not seem to be the best and the most efficient approach. Is there a better way to do it? Thank you.
As your input data is actually pretty small, I'd suggest to read it in at once by use of np.genfromtxt. Then you can find the appropriate range by checking when a full quarter of an hour is reached and end by counting how many full quarters are left. Then you can use np.reshape to get the array to a form with rows of quarters of hours and then average over those rows:
import numpy as np
# Read in the data:
data = np.genfromtxt("data.dat", skip_header=1,
dtype=[("date", "|S10"),
("time", "|S8"),
("cpu_usage", "f8")],
delimiter=',', usecols=(0, 1, 2))
# Find the first full quarter:
firstQuarterHour = 0
while not (int(data[firstQuarterHour]["time"][3:5]) % 15 == 0):
firstQuarterHour += 1
noOfQuarterHours = data[firstQuarterHour:].shape[0]/3
# Create a reshaped array
reshaped = data[firstQuarterHour:firstQuarterHour+3*noOfQuarterHours+1].reshape(
(noOfQuarterHours, 3))
# Average over cpu_usage and take the appropriate dates and times:
cpu_usage = reshaped["cpu_usage"].mean(axis=1)
dates = reshaped["date"][:, 0]
times = reshaped["time"][:, 0]
Now you can use those arrays to for example save into another text file by use of np.savetxt.
This post is quiet long and I will be very grateful to everybody who reads it until the end. :)
I am experimenting execution python code issues and would like to know if you have a better way of doing what I want to.
I explain my problem brifely. I have plenty solar panels measurements. Each one of them is done each 3 minutes. Unfortunately, some measurements can fail. The goal is to compare the time in order to keep only the values that have been measured in the same minutes and then retrieve them. A GUI is also included in my software, so each time the user changes the panels to compare, the calculation has to be done again. To do so, I have implemented 2 parts, the first one creates a vector of true or false for each panel for each minute, and the second compare the previous vector and keep only the common measures.
All the datas are contained in the pandas df energiesDatas. The relevant columns are:
name: contains the name of the panel (length 1)
date: contains the day of the measurement (length 1)
list_time: contains a list of all time of measurement of a day (length N)
list_energy_prod : contains the corresponding measures (length N)
The first part loop over all possible minutes from beginning to end of measurements. If a measure has been done, add True, otherwise add False.
self.ListCompare2=pd.DataFrame()
for n in self.NameList:#loop over all my solar panels
m=self.energiesDatas[self.energiesDatas['Name']==n]#all datas
#table_date contains all the possible date from the 1st measure, with interval of 1 min.
table_list=[1 for i in range(len(table_date))]
pointerDate=0 #pointer to the current value of time
#all the measures of a given day are transform into a str of hour-minutes
DateString=[b.strftime('%H-%M') for b in m['list_time'].iloc[pointerDate] ]
#some test
changeDate=0
count=0
#store the current pointed date
m_date=m['Date'].iloc[pointerDate]
#for all possible time
for curr_date in table_date:
#if considered date is bigger, move pointer to next day
while curr_date.date()>m_date:
pointerDate+=1
changeDate=1
m_date=m['Date'].iloc[pointerDate]
#if the day is changed, recalculate the measures of this new day
if changeDate:
DateString=[b.strftime('%H-%M') for b in m['list_time'].iloc[pointerDate] ]
changeDate=0
#check if a measure has been done at the considered time
table_list[count]=curr_date.strftime('%H-%M') in DateString
count+=1
#add to a dataframe
self.ListCompare2[n]=table_list
l2=self.ListCompare2
The second part is the following: given a "ListOfName" of modules to compare, check if they have been measured in the same time and only keep the values measure in the same minute.
ListToKeep=self.ListCompare2[ListOfName[0]]#take list of True or False done before
for i in ListOfName[1:]#for each other panels, check if True too
ListToKeep=ListToKeep&self.ListCompare2[i]
for i in ListOfName:#for each module, recover values
tmp=self.energiesDatas[self.energiesDatas['Name']==i]
count=0
#loop over value we want to keep (also energy produced and the interval of time)
for j,k,l,m,n in zip(tmp['list_time'],tmp['Date'],tmp['list_energy_prod'],tmp['list_energy_rec'],tmp['list_interval']):
#calculation of the index
delta_day=(k-self.dt.date()).days*(18*60)
#if the value of ListToKeep corresponding to the index is True, we keep the value
tmp['list_energy_prod'].iloc[count]=[ l[index] for index,a in enumerate(j) if ListToKeep.iloc[delta_day+(a.hour-4)*60+a.minute]==True]
tmp['list_energy_rec'].iloc[count]=[ m[index] for index,a in enumerate(j) if ListToKeep.iloc[delta_day+(a.hour-4)*60+a.minute]==True]
tmp['list_interval'].iloc[count]=[ n[index] for index,a in enumerate(j) if ListToKeep.iloc[delta_day+(a.hour-4)*60+a.minute]==True]
count+=1
self.store_compare=self.store_compare.append(tmp)
Actually, this part is the one that takes a very long time.
My question is: Is there a way to save time, using build-in function or anything.
Thank you very much
Kilian
The answer of chris-sc sloved my problem:
I believe your data structure isn't appropriate for your problem. Especially the list in fields of a DataFrame, they make loops or apply almost unavoidable. Could you in principle re-structure the data? (For example one df per solar panel with columns date, time, energy)