I have a dataframe with the following structure:
Timestamp, Value, Start, End
I would like to know, for each row, the maximum Value based on Timestamp >= Start & Timestamp <= End.
I cannot use rolling.max() because the Start and End columns do not define equally sized windows.
This is some data I have.. So basically, for each row I would like to find the highest High[0] value among the rows whose Time[0] lies between that row's Time[0] and Time[1].
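To make the goal concrete, here is a small, slow sketch of the computation I am after (the column names Timestamp, Value, Start, End are the ones from the structure above; the apply-based loop is only for illustration):

import pandas as pd

# Toy data with the structure described above.
df = pd.DataFrame({
    'Timestamp': [1, 2, 3, 4],
    'Value':     [10, 40, 20, 30],
    'Start':     [1, 1, 2, 3],
    'End':       [2, 3, 4, 4],
})

# For each row, take the max of Value over all rows whose Timestamp
# falls inside that row's [Start, End] window.
df['WindowMax'] = df.apply(
    lambda r: df.loc[df['Timestamp'].between(r['Start'], r['End']), 'Value'].max(),
    axis=1,
)
print(df)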
I have the following problem, which I have spent days trying to find the optimal solution for:
Given a country evaluation process with 3 parameters (V1, V2, V3), which may be recorded on separate dates (Date1, Date2, Date3) respectively, each record of a single parameter is stored in one row, with its date and value, as illustrated in the picture below.
I need to fill the empty cells in a row with the two parameters recorded in other rows, based on two rules:
For the country (ISO) and the parameter recorded in the current row: if recordings of the other two parameters exist for the same country in other rows, choose the recorded value/date of each missing parameter from the row whose date is closest to the recording date in the current row.
Otherwise, if the same country does not yet have one or both of the other parameters recorded in other rows, use the value and date recorded in the current row to fill the empty cells of the missing parameters in that row.
For example:
Record NO.1: country "AND" only has one parameter, V1, recorded in the table (C2, D2); thus (E2, F2) and (G2, H2) should be filled with the value from (C2, D2), i.e. 123 and 2022/4/12.
Record NO.2: country "COR" has only V2 recorded (E3, F3), while V1 is recorded twice, in Records NO.4 and NO.5. In Record NO.2, V1 (C3, D3) should then be filled with the value from Record NO.5 (C6, D6), since Record NO.5's date (D6, 2022/7/12) is closest to Record NO.2's date (D3, 2022/7/13).
The process has to loop through the dataframe to fill all the empty cells. Please help!
There's no way I can think of without looping over the whole dataset each time you encounter an empty value. With that in mind, this could take a while to run if the file is large. You may have to switch the date format if Excel auto-formats the dates differently.
import pandas
from datetime import datetime

df = pandas.read_csv(filepath_or_buffer=r"path/to/csv")
dt_format = '%Y/%m/%d'
# dt_format = '%m/%d/%Y'
for index, row in df.iterrows():
    checks = [pandas.isnull(row['V1']), pandas.isnull(row['V2']), pandas.isnull(row['V3'])]
    empties = [i + 1 for i in range(0, 3) if checks[i]]  # numbers of the parameters missing in this row
    reference = checks.index(False) + 1                  # number of the parameter recorded in this row
    for empty in empties:
        iso = row['ISO']
        closest = None  # reset for every missing parameter
        # Timestamp of the date in the filled-in Date column
        ts = datetime.strptime(row[f'Date{reference}'], dt_format).timestamp()
        for index2, row2 in df.iterrows():
            if row2['ISO'] == iso and not pandas.isnull(row2[f'V{empty}']):
                if closest is None:
                    closest = (row2[f'V{empty}'], row2[f'Date{empty}'])
                else:
                    delta1 = abs(datetime.strptime(closest[1], dt_format).timestamp() - ts)
                    delta2 = abs(datetime.strptime(row2[f'Date{empty}'], dt_format).timestamp() - ts)
                    if delta2 < delta1:
                        closest = (row2[f'V{empty}'], row2[f'Date{empty}'])
        if closest is not None:
            df.at[index, f'V{empty}'] = closest[0]
            df.at[index, f'Date{empty}'] = closest[1]
        else:
            # Rule 2: no other row has this parameter for the country, so reuse the current row's value/date
            df.at[index, f'V{empty}'] = row[f'V{reference}']
            df.at[index, f'Date{empty}'] = row[f'Date{reference}']
df.to_csv(r"path/to/csv", index=False)
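If the nested loop turns out to be too slow, one possible vectorized direction (only a sketch, with assumed column names and the table reshaped to one row per recording) is pandas.merge_asof with direction='nearest', which attaches the nearest-dated recording of another parameter for the same country; rule 2 would still have to be applied to the rows that come back empty:

import pandas as pd

# Hypothetical long-format view of the table: one row per recorded (ISO, parameter, date, value).
records = pd.DataFrame({
    'ISO':   ['AND', 'COR', 'COR', 'COR'],
    'param': ['V1',  'V2',  'V1',  'V1'],
    'date':  pd.to_datetime(['2022-04-12', '2022-07-13', '2022-05-01', '2022-07-12']),
    'value': [123, 456, 111, 222],
})

# Nearest-dated V1 recording per country, matched against every record's own date.
v1 = (records[records['param'] == 'V1']
      .sort_values('date')
      .rename(columns={'date': 'date_V1', 'value': 'value_V1'})
      [['ISO', 'date_V1', 'value_V1']])

nearest_v1 = pd.merge_asof(
    records.sort_values('date'),
    v1,
    left_on='date', right_on='date_V1',
    by='ISO', direction='nearest',
)
print(nearest_v1)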
I have a pandas DataFrame with more than 100 thousand rows. The index represents the time, and two columns represent the sensor data and the condition.
When the condition becomes 1, I want to start calculating a score card (average and standard deviation) until the next 1 comes. This needs to be calculated for the whole dataset.
Here is a picture of the DataFrame for a specific time span:
What I thought of is to iterate through the index and items of the df and, when the condition is met, start calculating the descriptive statistics.
cycle = 0
for i, row in df_b.iterrows():
    if row['condition'] == 1:
        print('Condition is changed')
        cycle += 1
        print('cycle: ', cycle)
        # start = ?
        # end = ?
        # df_b.loc[start:end]
I am not sure how to calculate start and end for this DataFrame. The end will be the start of the next cycle. Additionally, I don't think this iteration is optimal, because it takes quite a long time. I appreciate any idea or solution for this problem.
Maybe start out with getting the rows where condition == 1:
cond_1_df = df.loc[df['condition'] == 1]
This dataframe will only contain the rows that meet your condition (being 1).
From here on, you can access the timestamps pairwise, meaning that the first element is beginning and second element is end, sketched below:
former = 0
stamp_pairs = []
df = cond_1_df.reset_index()  # make sure indexes pair with number of rows
for index, row in df.iterrows():
    if former != 0:
        beginning = former
        end = row["timestamp"]
        former = row["timestamp"]
    else:
        beginning = 0
        end = row["timestamp"]
        former = row["timestamp"]
    stamp_pairs.append([beginning, end])
This should give you something like this:
[[stamp0, stamp1], [stamp1,stamp2], [stamp2, stamp3]...]
For each of these pairs, you can again create a df containing only the subset of rows where stamp_x < timestamp < stamp_x+1:
time_cond_df = df.loc[(df['timestamp'] > stamp_x) & (df['timestamp'] < stamp_x+1)]
Finally, you get one time_cond_df per timestamp tuple, on which you can perform your score calculations.
Just make sure that your timestamps are comparable with the ">" and "<" operators! We can't tell, since you did not explain how you produced the timestamps.
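If the loop over 100k+ rows is still too slow, here is a vectorized sketch of the same per-cycle statistics (assuming the columns are literally named 'sensor' and 'condition', which is a guess from the question):

import pandas as pd

# Toy frame standing in for the real data; the column names are assumptions.
df = pd.DataFrame({
    'sensor':    [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    'condition': [0,   1,   0,   0,   1,   0],
})

# Each occurrence of condition == 1 starts a new cycle, so a cumulative sum
# of the condition flags labels every row with its cycle number.
df['cycle'] = (df['condition'] == 1).cumsum()

# Mean and standard deviation of the sensor readings per cycle.
score_card = df.groupby('cycle')['sensor'].agg(['mean', 'std'])
print(score_card)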
I have a dataframe where rows of data are in one second intervals, so 08:00:00, 08:00:01, etc. I want to take a rolling average over a period of 10 minutes, but I only want the rolling average to update on a minute by minute basis. So the rolling average values for 08:10:00 - 08:10:59 would all be the same value, and then at 8:11:00, it would update to a new value for the next minute.
Currently I'm using the following line to calculate a rolling average which updates every second:
df['counts-avg'] = df['counts'].rolling(window=600).mean()
I have another column for the seconds value called df['sec']. I got the indices of rows where seconds = 0 (the zeroth second of each minute) and replaced every other row with np.nan. Then I used fillna(method='ffill') to copy values downward.
df['counts-avg'] = df['counts'].rolling(window=600).mean()
erase_idx = df[df['sec'] > 0].index.values    # rows that are not at the zeroth second of a minute
ma = df['counts-avg'].copy()
ma.loc[erase_idx] = np.nan                    # keep only the value at second 0 of each minute
df['counts-avg'] = ma.fillna(method='ffill')  # carry that value forward through the rest of the minute
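The same idea can be written a bit more compactly (a sketch, assuming the one-second index and the 'sec' column described above):

import numpy as np
import pandas as pd

# Toy one-second series; 'sec' is the second within each minute, mirroring df['sec'] above.
idx = pd.date_range('2023-01-01 08:00:00', periods=1800, freq='s')
df = pd.DataFrame({'counts': np.random.default_rng(0).poisson(5, len(idx))}, index=idx)
df['sec'] = df.index.second

# 10-minute rolling mean, kept only at the zeroth second of each minute
# and carried forward so the value is constant within the minute.
df['counts-avg'] = (
    df['counts'].rolling(window=600).mean()
      .where(df['sec'] == 0)
      .ffill()
)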
For example in the data frame:
Gyroscope 448083 -0.05965481 -0.26418558 -0.017044231
Gyroscope 448338 -0.033023197 -0.18002969 -0.011717909
Gyroscope 448340 -0.052197956 -0.17470336 -0.049002163
Gyroscope 448341 -0.03834952 -0.16937704 -0.034088463
I know the values in the second column, but I want to pull out the last three elements of each row.
Interval is a tuple where I know the timestamp to start and end at, but those values may not be in the data frame.
If my interval[0] variable was 448075 and my interval[1] variable was 448345, how do I access all the elements from the 3rd, 4th, and 5th columns of the rows with column 1 values between interval[0] and interval[1]?
I try to find the closest values to interval[0] and interval[1] in the actual DataFrame.
start = np.abs(merge.iloc[:, 1] - interval[0]).argmin()  # row position closest to interval[0]
end = np.abs(merge.iloc[:, 1] - interval[1]).argmin()    # row position closest to interval[1]
for j in range(start, end):
    row = merge.iloc[j]                                  # the j-th row, not the j-th column
    if row.iloc[0] == 'Gyroscope':
        gyro.append((row.iloc[2], row.iloc[3], row.iloc[4]))
    else:
        acc.append((row.iloc[2], row.iloc[3], row.iloc[4]))
Then, using the range of numbers in the interval, I attempt to pull out the last three entries in each row.
gyro and acc are the lists I'm appending the last three entries in the relevant rows to.
merge.head() returns
Gyroscope 448083 -0.05965481 -0.26418558 -0.017044231
0 Gyroscope 448338.0 -0.033023 -0.180030 -0.011718
1 LinearAcceleration 448339.0 -0.048183 -0.427365 -0.019752
2 Gyroscope 448340.0 -0.052198 -0.174703 -0.049002
3 LinearAcceleration 448340.0 0.168791 -0.166547 -0.136918
4 Gyroscope 448341.0 -0.038350 -0.169377 -0.034088
Basically, I want to index based on the range of second column values and iterate through that.
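Is a boolean mask over the second column the right direction? Roughly like this (sketched on a toy frame, since the real file has no header row and I access columns by position):

import pandas as pd

# Toy frame mimicking merge: sensor name, timestamp, then three value columns.
merge = pd.DataFrame([
    ['Gyroscope',          448338.0, -0.033023, -0.180030, -0.011718],
    ['LinearAcceleration', 448339.0, -0.048183, -0.427365, -0.019752],
    ['Gyroscope',          448340.0, -0.052198, -0.174703, -0.049002],
    ['LinearAcceleration', 448340.0,  0.168791, -0.166547, -0.136918],
    ['Gyroscope',          448341.0, -0.038350, -0.169377, -0.034088],
])

interval = (448075, 448345)

# Keep rows whose second column (the timestamp) falls inside the interval;
# the interval endpoints do not have to exist in the data.
mask = merge.iloc[:, 1].between(interval[0], interval[1])

# Last three columns of the matching Gyroscope and LinearAcceleration rows.
gyro = merge.loc[mask & (merge.iloc[:, 0] == 'Gyroscope')].iloc[:, 2:5]
acc = merge.loc[mask & (merge.iloc[:, 0] == 'LinearAcceleration')].iloc[:, 2:5]
print(gyro)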
I'm working with time series data in this format:
[timestamp][rain value]
I want to count rainfall events in the time series data, where we define a rainfall event as a sub-dataframe of the main dataframe which contains the nonzero values between zero rainfall values.
I managed to get the start of the sub-dataframe by getting the index of the rainfall value before the first nonzero value:
start = df.rain.values.nonzero()[0][0] - 1
cur = df[start:]
What I can't figure out is how to find the end. I was looking for some function zero():
end=cur.rain.values.zero()[0][0]
to find the next zero value in the rain column and mark that as the end of my sub-dataframe.
Additionally, because my data is sampled at 15-minute intervals, a temporary lull of 15 minutes would give me two rainfall events instead of one, which realistically isn't true. So I would like to define some time period, 6 hours for example, that has to pass before rainfall events count as separate.
What I was thinking of (but could not execute, because I couldn't find the end of the sub-dataframe yet), in pseudocode:
start = df.rain.values.nonzero()[0][0] - 1
cur = df[start:]
end = cur.rain.values.zero()[0][0]
temp = df[end:]
z = temp.rain.values.nonzero()[0][0] - 1
if timedelta(z - end) >= 6hrs:
    end stays as endpoint of cur
else:
    z is new endpoint, find next nonzero to check again
So I guess my question is: how do I find the end of my sub-dataframe if I don't want to iterate over all rows?
And am I on the right track with my pseudocode, in defining the end of a rainfall event as, say, 6 hours of 0 rain?
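A possible loop-free sketch of the idea above (assuming a DatetimeIndex and a 'rain' column, with the 6-hour threshold from the question): treat a wet sample as starting a new event only when at least 6 hours have passed since the previous wet sample.

import pandas as pd

# Toy 15-minute series: one burst, a single 15-min lull, a second burst
# (same event), then a dry spell of more than 6 hours before a third burst.
idx = pd.date_range('2023-01-01', periods=60, freq='15min')
df = pd.DataFrame({'rain': 0.0}, index=idx)
df.iloc[4:8, 0] = 1.2
df.iloc[9:12, 0] = 0.8
df.iloc[40:44, 0] = 2.0

# Timestamps with nonzero rain, and the dry gap since the previous wet sample.
wet_times = df.index[df['rain'] > 0].to_series()
gaps = wet_times.diff()

# A wet sample starts a new event if it is the first one, or if at least
# 6 hours of zero rain have passed since the previous wet sample.
new_event = gaps.isna() | (gaps >= pd.Timedelta('6h'))
event_id = new_event.cumsum()

print(int(event_id.max()))   # number of rainfall events (2 for the toy data)

# Rows belonging to the first event, from its first to its last wet sample:
first = event_id[event_id == 1].index
first_event = df.loc[first.min():first.max()]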