Obtaining data from a dataframe at a desired time step - Python

I have a long time series with a 15-minute time step, and I want to derive a series at a 3-hour time step from it. I have tried different methods, including the resample method, but resample does not work for me: I cannot use resample().mean(), since I don't want to miss any actual peak values, e.g. that of a flood wave; I want to keep the original data as it is. So I decided to run a loop to pick out the values, using the following piece of code, but I am not sure why it is not working as I expect it to.
station_number = []
timestamp = []
water_level = []
discharge = []
for i in df3.index:
    station_number.append(df3['Station ID'][i])
    timestamp.append(df3['Timestamp'][i])
    water_level.append(df3['Water Level (m)'][i])
    discharge.append(df3['Discharge (m^3/s)'][i])
    i = i + 12
    pass
df5 = pd.DataFrame(station_number, columns=['Station ID'])
df5['Timestamp'] = timestamp
df5['Water Level (m)'] = water_level
df5['Discharge (m^3/s)'] = discharge
df5
Running this code returns the same dataframe. My logic is that i advances by 12 steps each time and picks up the corresponding values from the dataset. Please advise if I am doing something wrong.
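As a side note: inside for i in df3.index: the assignment i = i + 12 has no effect, because i is rebound to the next index value on every iteration. Below is a minimal sketch (with a hypothetical df3 standing in for the real data, and assuming complete, evenly spaced 15-minute rows) of taking every 12th row, or the 3-hourly peaks, without a loop:
import pandas as pd
import numpy as np

# Hypothetical stand-in for df3, using the column names from the question.
idx = pd.date_range('2021-01-01', periods=96, freq='15T')
df3 = pd.DataFrame({'Station ID': 1,
                    'Timestamp': idx,
                    'Water Level (m)': np.random.rand(96),
                    'Discharge (m^3/s)': np.random.rand(96) * 100})

# Every 12th 15-minute row is one row every 3 hours, with the original values kept as-is.
df5 = df3.iloc[::12].reset_index(drop=True)

# Alternatively, to keep the peak of each 3-hour window rather than a sampled value:
df5_peaks = df3.set_index('Timestamp').resample('3H').max().reset_index()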


Append std,mean columns to a DataFrame with a for-loop

I want to put the std and mean of a specific column of a dataframe, for different days, into a new dataframe. (The data comes from analyses conducted on big data in multiple Excel files.)
I use a for-loop and append(), but the result only contains the last row, not all of them.
Here is my code:
hh = ['01:00', '02:00', '03:00', '04:00', '05:00']
for j in hh:
    month = 1
    hour = j
    data = get_data(month, hour)  ## it works correctly, reads an individual Excel spreadsheet
    data = pd.DataFrame(data, columns=['Flowday', 'Interval', 'Demand', 'Losses (MWh)', 'Total Load (MWh)'])
    s_td = data.iloc[:, 4].std()
    meean = data.iloc[:, 4].mean()
    final = pd.DataFrame(columns=['Month', 'Hour', 'standard deviation', 'average'])
    final.append({'Month': j, 'Hour': j, 'standard deviation': s_td, 'average': meean}, ignore_index=True)
I am not sure, but I believe you should assign the result of final.append(...) back to the variable (append does not modify the dataframe in place), and create final once before the loop rather than inside it:
final = final.append({'Month': j, 'Hour': j, 'standard deviation': s_td, 'average': meean}, ignore_index=True)
Update
If time efficiency is of interest to you, it is suggested to collect your desired values ({'Month': j, 'Hour': j, 'standard deviation': s_td, 'average': meean}) in a list of dicts and build the dataframe from that list at the end; this is said to have better performance (thanks to @stefan_aus_hannover).
This is what I am referring to in the comments on Amirhossein's answer:
hh = ['01:00', '02:00', '03:00', '04:00', '05:00']
lister = []
final = pd.DataFrame(columns=['Month', 'Hour', 'standard deviation', 'average'])
for j in hh:
    month = 1
    hour = j
    data = get_data(month, hour)  ## it works correctly
    data = pd.DataFrame(data, columns=['Flowday', 'Interval', 'Demand', 'Losses (MWh)', 'Total Load (MWh)'])
    s_td = data.iloc[:, 4].std()
    meean = data.iloc[:, 4].mean()
    lister.append({'Month': j, 'Hour': j, 'standard deviation': s_td, 'average': meean})
final = final.append(pd.DataFrame(lister), ignore_index=True)
Conceptually you're just doing an aggregation by hour with the two functions std and mean, then appending that row to your result dataframe; something like the following (I'll revise it if you give us reproducible input data). Note that .agg()/.aggregate() accepts several aggregating functions at once (e.g. ['std', 'mean']), so there is no need to declare temporaries for their results. And if you only care about aggregating column 4 ('Total Load (MWh)'), there is no need to read in columns 0-3 at all.
final = pd.DataFrame(columns=['Month', 'Hour', 'standard deviation', 'average'])
for hour in hh:
    # Read in the columns of interest from the individual Excel sheet for this month and hour...
    data = get_data(1, hour)
    data = pd.DataFrame(data, columns=['Flowday', 'Interval', 'Demand', 'Losses (MWh)', 'Total Load (MWh)'])
    # Compute the corresponding row of the aggregate...
    stats = data['Total Load (MWh)'].agg(['std', 'mean'])
    row = {'Month': 1, 'Hour': hour, 'standard deviation': stats['std'], 'average': stats['mean']}
    final = final.append(row, ignore_index=True)
Notes:
pd.read_excel's usecols=['Flowday','Interval',...] argument lets you avoid reading in columns you don't need in the first place. You haven't supplied reproducible code for get_data(), but you should parameterize it so you can pass in the list of columns of interest; you seem to only want to aggregate column 4 ('Total Load (MWh)') anyway.
There's no need to store separate local variables s_td and meean; just use .aggregate() directly.
There's no need to have both lister and final. Just keep one results dataframe, final, and append to it, ignoring the index. (If you run into issues with that, post updated code here and make sure it's reproducible.)
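As a minimal illustration of the usecols point, a sketch assuming the data lives in one Excel file per month with one sheet per hour (the file name and sheet name are made up, since get_data() wasn't shown):
import pandas as pd

# Hypothetical file/sheet names; only the column of interest is read.
data = pd.read_excel('demand_month_01.xlsx',
                     sheet_name='01:00',
                     usecols=['Total Load (MWh)'])
print(data['Total Load (MWh)'].agg(['std', 'mean']))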

Python Pandas: How to get the maximum value per peak in multiple cycles

I am importing data from a machine that has thousands of cycles on it. Each cycle lasts a few minutes and has two peaks in pressure that I need to record. One example can be seen in the graph below.
In this cycle you can see there are two peaks, one at 807 psi and one at 936 psi, and I need to record these values. I have already sorted the data so I can determine when a cycle is on or off, but now I need to figure out how to record these two maxima. I previously tried this:
df2 = df.groupby('group')['Pressure'].nlargest(2).rename_axis(index=['group', 'row_index'])
to get the maxima, but realized this will only give me the two largest values, which in some cycles occur right before the peak.
In this example dataframe I have provided one cycle:
import pandas as pd
data = {'Pressure' : [100,112,114,120,123,420,123,1230,1320,1,23,13,13,13,123,13,123,3,222,2303,1233,1233,1,1,30,20,40,401,10,40,12,122,1,12,333]}
df = pd.DataFrame(data)
The peak values for this should be 1320 and 2303, while ignoring the slow rise up to these peaks.
Thanks for any help!
(This is also for a ton of cycles, so I need it to be able to go through and record the peaks for each cycle.)
Alright, I had a go, using the simple heuristic I suggested in my comment.
def filter_peaks(df):
    df["before"] = df["Pressure"].shift(1)
    df["after"] = df["Pressure"].shift(-1)
    df["max"] = df.max(axis=1)
    df = df.fillna(0)
    return df[df["Pressure"] == df["max"]]["Pressure"].to_frame()

filter_peaks(df)  # test one application
If you apply this once to your test dataframe, you are left with only the local maxima (rows 5, 8, 10, 12, 14, 16, 19, 21, 24, 27, 29, 31 and 34 of the example data). You can see that it only just works: the value at row 21 (1233) only needed to be a little higher for it to exceed the true second peak at row 8 (1320).
You can get round this by iterating, i.e. with filter_peaks(filter_peaks(df)). You then end up with a clean dataframe that you can apply your .nlargest strategy to.
EDIT
Complete code example:
import pandas as pd

data = {'Pressure': [100,112,114,120,123,420,123,1230,1320,1,23,13,13,13,123,13,123,3,222,2303,1233,1233,1,1,30,20,40,401,10,40,12,122,1,12,333]}
df = pd.DataFrame(data)

def filter_peaks(df):
    df["before"] = df["Pressure"].shift(1)
    df["after"] = df["Pressure"].shift(-1)
    df["max"] = df.max(axis=1)
    df = df.fillna(0)
    return df[df["Pressure"] == df["max"]]["Pressure"].to_frame()

df2 = filter_peaks(df)  # or do it twice if you want to be sure: filter_peaks(filter_peaks(df))
df2["Pressure"].nlargest(2)
Output:
19 2303
8 1320
Name: Pressure, dtype: int64
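Since this has to run over many cycles, here is a sketch of applying the same filter per cycle; it builds on the example above, and the 'group' column is assumed to exist, as in the question's own groupby attempt (two copies of the example cycle are used here just to have something to group on):
# Two labelled copies of the example cycle, standing in for real multi-cycle data.
df_many = pd.concat([df.assign(group=0), df.assign(group=1)], ignore_index=True)

def two_peaks(cycle):
    # Apply the local-maximum filter twice, then keep the two largest survivors.
    filtered = filter_peaks(filter_peaks(cycle[['Pressure']].copy()))
    return filtered['Pressure'].nlargest(2)

print(df_many.groupby('group').apply(two_peaks))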

Is there a way to loop through a pandas dataframe and drop windows of rows dependent on a condition?

Problem Summary - I have a dataframe of ~10,000 rows. Some rows contain data aberrations that I would like to get rid of, and those aberrations are tied to observations made at certain temperatures (one of the data columns).
What I've tried - My thought is that the easiest way to remove the rows of bad data is to loop through the temperature intervals, find the maximum index that is less than each of the temperature interval observations, and use the df.drop function to get rid of a window of rows around that index. Between every temperature interval at which bad data is observed, I reset the index of the dataframe. However, it seems to be completely unstable!! Sometimes it nearly works, other times it throws key errors. I think my problem may be in working with the data frame "in place," but I don't see another way to do it.
Example Code:
Here is an example with a synthetic dataframe and a function that uses the same principles that I've tried. Note that I've tried different renditions with .loc and .iloc (commented out below).
#Create synthetic dataframe
import pandas as pd
import numpy as np
temp_series = pd.Series(range(25, 126, 1))
temp_noise = np.random.rand(len(temp_series))*3
df = pd.DataFrame({'temp':(temp_series+temp_noise), 'data':(np.random.rand(len(temp_series)))*400})
#calculate length of original and copy original because function works in place.
before_length = len(df)
df_dup = df
temp_intervals = [50, 70, 92.7]
window = 5
From here, run a function based on the dataframe (df), the temperature observations (temp_intervals) and the window size (window):
def remove_window(df, intervals, window):
    '''Loop through the temperature intervals to define a window of indices around
    given temperatures in the dataframe to drop. Drop the window of indices in place
    and reset the index prior to moving to the next interval.
    '''
    for temp in intervals[0:len(intervals)]:
        # Find index where temperature first crosses the interval input
        cent_index = max(df.index[df['temp'] <= temp].tolist())
        # Define window of indices to remove from the df
        drop_indices = list(range(cent_index - window, cent_index + window))
        # Use df.drop
        df.drop(drop_indices, inplace=True)
        df.reset_index(drop=True)
    return df
So, is this a problem with the function I've defined, or is there a problem with df.drop?
Thank you,
Brad
It can be tricky to repeatedly delete parts of the dataframe and keep track of what you're doing. A cleaner approach is to keep track of which rows you want to delete within the loop, but only delete them outside of the loop, all at once. This should also be faster.
def remove_window(df, intervals, window):
    # Create a Boolean array indicating which rows to keep
    keep_row = np.repeat(True, len(df))
    for temp in intervals[0:len(intervals)]:
        # Find index where temperature first crosses the interval input
        cent_index = max(df.index[df['temp'] <= temp].tolist())
        # Define window of indices to remove from the df
        keep_row[range(cent_index - window, cent_index + window)] = False
    # Delete all unwanted rows at once, outside the loop
    df = df[keep_row]
    df.reset_index(drop=True, inplace=True)
    return df
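A quick usage sketch with the synthetic df, temp_intervals and window defined in the question (the exact centre rows vary with the random noise, but the three windows don't overlap here):
# Drop a window of 2*window rows around each temperature interval.
df_clean = remove_window(df.copy(), temp_intervals, window)
print(len(df), '->', len(df_clean))   # e.g. 101 -> 71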

Python: Shift time series so they all match at a given y value

I'm writing my own code to analyse/visualise COVID-19 data from the European CDC.
https://opendata.ecdc.europa.eu/covid19/casedistribution/csv
I've got a simple code to extract the data and make plots with cumulative deaths against time, and am trying to add functionality.
My aim is something like the attached graph, with all countries time-shifted to match at (in this case) the 5th death. I want to make a general bit of code to shift countries to match at the nth death.
https://ourworldindata.org/grapher/covid-confirmed-deaths-since-5th-death
The current way I'm trying to do this is to have a maze of "if group is 'country' shift by ..." terms.
Where ... is a lookup to find the date for the particular 'country' when there were 'n' deaths, and to interpolate fractional dates where appropriate.
i.e. currently deaths are assigned at 00:00 on each day/month, but the data can be shifted by 2/3 of a day, as below.
datetime cumulative deaths
00:00 15/02 80
00:00 16/02 110
my '...' should give 16:00 15/02
I'm working on this right now but it doesn't feel very efficient and I'm sure there must be a much simpler way that I'm not seeing.
Essentially despite copious googling I can't seem to find a simple way of automatically shifting a bunch of timeseries to match at a particular y value, which feels like it should have some built-in functionality, i.e. a Lookup with interpolation.
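For the fractional-date step specifically, here is a tiny sketch of the interpolation using the two example rows above; n = 100 is just the threshold that reproduces the 16:00 result:
import pandas as pd

n = 100                                    # cumulative-death level to match at
d0, d1 = pd.Timestamp('2020-02-15'), pd.Timestamp('2020-02-16')
c0, c1 = 80, 110                           # cumulative deaths on those dates

frac = (n - c0) / (c1 - c0)                # 2/3 of the way from d0 to d1
crossing = d0 + frac * (d1 - d0)
print(crossing)                            # 2020-02-15 16:00:00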
####Live url (I've downloaded my own csv and been calling that for code development)
url = 'https://opendata.ecdc.europa.eu/covid19/casedistribution/csv'
dataraw = pd.read_csv(url)
# extract relevant columns
data = dataraw.loc[:,["dateRep","countriesAndTerritories","deaths"]]
####convert date format
data['dateRep'] = pd.to_datetime(data['dateRep'],dayfirst=True)
####sort by date
data = data.sort_values(["dateRep"],ascending=True)
data['cumdeaths'] = data.groupby(['countriesAndTerritories']).cumsum()
##### limit to countries with cumulative deaths > 500
data = data.groupby('countriesAndTerritories').filter(lambda x:x['cumdeaths'].max() >500)
###### remove China from data for now as it doesn't match so well with dates
data = data.groupby('countriesAndTerritories').filter(lambda x:(x['countriesAndTerritories'] != "China").any())
##### only recent dates
data = data[data['dateRep'] > '2020-03-01']
print(data)
You can use groupby('countriesAndTerritories') together with transform to add a column which gives, for every row, the date on which its country hit the nth death.
Then you can do a vectorized subtraction of the date column and the new column to get the number of days since that date.
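A minimal sketch of that idea, using the column names from the question; it matches countries at the first day on or after the nth death, leaving out the fractional-day interpolation discussed above:
n = 5  # shift countries to match at the nth death

# Date of each row, kept only where the country has already reached n cumulative deaths.
reached = data['dateRep'].where(data['cumdeaths'] >= n)

# For every row, the earliest such date for its country (NaT if it never reaches n).
data['nth_death_date'] = reached.groupby(data['countriesAndTerritories']).transform('min')

# Vectorised subtraction: whole days since the nth death (negative before it).
data['days_since_nth_death'] = (data['dateRep'] - data['nth_death_date']).dt.days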

numpy.where does not work properly with pandas dataframe

I am trying to divide huge log data sets containing StartTime, EndTime and other fields.
I am using np.where to compare a pandas dataframe object and then divide it into hourly (or maybe half-hour or quarter-hour) chunks, depending on the hr and timeWindow values.
Below I am trying to divide an entire day's logs into 1-hour chunks, but it does not give me the expected output.
I am out of ideas as to where exactly my mistake is!
# Holding the very first time in the log data and stripping off
# seconds, minutes and microseconds.
today = datetime.strptime(log_start_time, "%Y-%m-%d %H:%M:%S.%f").replace(second=0, minute=0, microsecond=0)
today_ts = int(time.mktime(today.timetuple()) * 1e9)

hr = 1
timeWindow = int(hr * 60 * 60 * 1e9)  # hour*minute*second*rest digits

parts = [df.loc[np.where((df["StartTime"] >= (today_ts + i * timeWindow)) &
                         (df["StartTime"] < (today_ts + (i + 1) * timeWindow)))].dropna(axis=0, how='any')
         for i in range(0, rngCounter)]
If I check the first log entry inside each of my parts, it is something like below:
00:00:00.
00:43:23.
01:12:59.
01:53:55.
02:23:52.
....
Whereas I expect the output to be like below:
00:00:00
01:00:01
02:00:00
03:00:00
04:00:01
....
Though I have implemented it in an alternative way, that is a workaround and I lost a few features by not having it like this.
Can someone please figure out what exactly is wrong with this approach?
Note: I am using a Python notebook with pandas and numpy.
Thanks to @raganjosh.
I got my solution to the problem by using pandas Grouper.
Below is my implementation.
I have used a dynamic value for 'hr'.
timeWindow = str(hr) + 'H'

# Dividing the log into "n" parts, depending on the timeWindow initialisation.
# Copy StartTime and use the copied column as a datetime index.
df["ST"] = df['StartTime']
df = df.set_index(['ST'])
df.index = pd.to_datetime(df.index)

# Each element of parts is a (key, values) pair: the key (start) of the time window
# and the rows of the grouped chunk that fall inside it.
# (pd.TimeGrouper in older pandas; pd.Grouper is the current name.)
parts = list(df.groupby(pd.Grouper(freq=timeWindow))[['StartTime', 'ProcessTime', 'EndTime']])
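A small usage sketch of how the resulting chunks might be inspected; it only assumes the parts list built above (the ProcessTime and EndTime columns come from that snippet):
# Print the window start and the first StartTime in each non-empty chunk.
for window_start, chunk in parts:
    if not chunk.empty:
        print(window_start, chunk['StartTime'].iloc[0])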
