Dataframe - Create a column based on another with IF formulas - python

I'm struggling with this rather complex calculated column.
The cumulative sum is watts of light.
It resets to 0 when the system starts a new day, so the 24-hour day runs sunrise to sunrise.
I want to use this fact to calculate a 'Date 2' that I can then summarize later to report average 24-hour temperature, light, etc.
For the first 0 in Cumulative Sum of every Date, Date 2 = Date + 1 day; otherwise Date 2 = the value of Date 2 from the previous row.
I have been playing around with the following, assuming Advanced Date is a copy of Cumulative Sum:
for i in range(1, len(ClimateDF)):
    j = ClimateDF.columns.get_loc('AdvancedDate')
    if ClimateDF.iat[i, j] == 0 and ClimateDF.iat[i - 1, j] != 0:
        print(ClimateDF.iat[i, j])
        # ClimateDF.iat[i, 'AdvancedDate'] = 'New Day'  # this doesn't work
        ClimateDF['AdvancedDate'].values[i] = 1
    else:
        print(ClimateDF.iat[i, j])
        # ClimateDF.iat[i, 'AdvancedDate'] = 'Not New Day'  # this doesn't work
        ClimateDF['AdvancedDate'].values[i] = 2
This doesn't quite do what I want, but I thought I was close. However when I change:
ClimateDF['AdvancedDate'].values[i] = 1
to
ClimateDF['AdvancedDate'].values[i] = ClimateDF['Date'].values[i]
I get a:
TypeError: float() argument must be a string or a number, not 'datetime.date'
Am I on the right track? How do I get past this error? Is there a more efficient way I could be doing this?

IIUC, you can first create a cumsum reflecting day change, and then calculate Date_2 by adding it to the first date:
s = (df["sum"].eq(0) & df["sum"].shift().ne(0)).cumsum()
df["Date_2"] = df["Datetime"][0] + pd.to_timedelta(s, unit="D")  # base on first day to calculate offset for all days
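To see the idea in action, here is a minimal sketch with a made-up frame; the column names "sum" and "Datetime" follow the answer's assumptions, and the values are invented:

```python
import pandas as pd

# Toy frame: the light cumsum resets to 0 at each sunrise
df = pd.DataFrame({
    "Datetime": pd.to_datetime(["2021-01-01"] * 6),
    "sum": [0, 5, 12, 0, 3, 9],
})

# A new day starts wherever the cumsum returns to 0; cumsum() of that
# boolean turns the reset points into a running day counter
s = (df["sum"].eq(0) & df["sum"].shift().ne(0)).cumsum()

# Offset the first date by the day counter to get Date_2
df["Date_2"] = df["Datetime"][0] + pd.to_timedelta(s, unit="D")
```

Note that the first row also counts as a reset (its shifted neighbour is NaN), which matches the question's "Date + 1 Day" rule for the first 0 of each date.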

Related

How to sample a python df on daily rate when it is greater than 500 yrs

I need to sample a dataframe that spans a date range of 100 years at a daily rate, because I want yearly totals (so I thought: resample at a daily rate, then sum to yearly totals).
I tried
d0 = start_date  # set date to model start date
d = d0
ind = Time_data2['datetime']
df_out = pd.DataFrame(index=range((max(ind) - d0).days),
                      columns=['datetime', 'year', 'value'])
for i in range((max(ind) - d0).days):  # for every day in the simulation
    d = d0 + datetime.timedelta(days=i)  # a particular day (= start_date + timedelta)
    df_out.loc[i, 'datetime'] = d  # assign datetime for each day
    df_out.loc[i, 'year'] = d.year  # assign year for each day
    # Assign the first value in the raw timeseries at or after the day being
    # filled; this is equivalent to a backfill with the pandas resample
    for t in model_flow_ts.index:
        dt = t - d  # timedelta between each index value in model_flow_ts and the current day
        if dt.days < 0:
            continue
        else:
            v = model_flow_ts.loc[t]  # get the value
            break
    df_out.loc[i, 'value'] = v
    if i % 50000 == 0:
        print(i)
But it takes a really long time because there are so many days to sample...
Does anyone have any suggestions on how to speed it up?
cheers
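No answer was posted here, but one possible speedup (a sketch, not a verified solution for the asker's data): the inner loop takes the first observed value at or after each day, which is exactly what `resample("D").bfill()` does in one vectorized step. This assumes `model_flow_ts` is a Series with a datetime index; the data below is made up:

```python
import pandas as pd

# Toy stand-in for model_flow_ts: a sparse, irregular timeseries
model_flow_ts = pd.Series(
    [1.0, 2.0, 3.0],
    index=pd.to_datetime(["2000-01-01", "2000-01-10", "2000-01-20"]),
)

# Upsample to daily and back-fill: each day takes the next observed value,
# matching the inner loop's "first value at or after the day" logic
daily = model_flow_ts.resample("D").bfill()

# Yearly totals from the daily series
yearly = daily.groupby(daily.index.year).sum()
```

This avoids the double loop entirely, so it should scale to a century of days without trouble.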

Trying to find the difference in days between 2 dates

I have a date column in my dataframe and I am trying to create a new column ('delta_days') that has the difference (in days) between the current row and the previous row.
# Find amount of days difference between dates
for i in df:
    new_date = date(df.iloc[i, 'date'])
    old_date = date(df.iloc[i - 1, 'date']) if i > 0 else date(df.iloc[0, 'date'])
    df.iloc[i, 'delta_days'] = new_date - old_date
I am using iloc because I want to directly reference the 'date' column while i represents the current row.
I am getting this error:
ValueError: Location based indexing can only have [integer, integer
slice (START point is INCLUDED, END point is EXCLUDED), listlike of
integers, boolean array] types
can someone please help
You can use pandas.DataFrame.shift method to achieve what you need.
Something more or less like this:
df['prev_date'] = df['date'].shift(1)
df['delta_days'] = df['date'] - df['prev_date']
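A quick demonstration of the shift-based approach on toy data; `.dt.days` converts the resulting timedeltas to plain numbers, and `diff()` is shown as an equivalent one-step shortcut:

```python
import pandas as pd

df = pd.DataFrame({"date": pd.to_datetime(["2021-01-01", "2021-01-03", "2021-01-10"])})

# shift(1) lines each row up with its predecessor; subtracting gives timedeltas
df["prev_date"] = df["date"].shift(1)
df["delta_days"] = (df["date"] - df["prev_date"]).dt.days

# Equivalently, diff() does the shift-and-subtract in one step
df["delta_days_alt"] = df["date"].diff().dt.days
```

The first row has no predecessor, so its delta is NaN rather than 0; fill it explicitly if a 0 is wanted there.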

Resampling a time series

I have a 40-year time series in the format stn;yyyymmddhh;rainfall, where yyyy = year, mm = month, dd = day, hh = hour. The series is at an hourly resolution. I extracted the maximum value for each year with the following groupby method:
import pandas as pd
df = pd.read_csv('data.txt', delimiter=";")
df['yyyy'] = df['yyyymmddhh'].astype(str).str[:4]
df.groupby(['yyyy'])['rainfall'].max().reset_index()
Now I am trying to extract the maximum value for a 3-hour duration in each year. I tried the sliding-maxima approach below, but it is not working; k is the duration I am interested in. In simple words, I need the maximum precipitation sum for multiple durations in every year (e.g. 3 h, 6 h, etc.).
class AMS:
    def sliding_max(self, k, data):
        tp = data.values
        period = 24 * 365
        agg_values = []
        start_j = 1
        end_j = k * int(np.floor(period / k))
        for j in range(start_j, end_j + 1):
            start_i = j - 1
            end_i = j + k + 1
            agg_values.append(np.nansum(tp[start_i:end_i]))
        self.sliding_max = max(agg_values)
        return self.sliding_max
Any suggestions or improvements to my code, or is there a way I can implement it with groupby? I am a bit new to the Python environment, so please excuse me if the question isn't put properly.
Stn;yyyymmddhh;rainfall
xyz;1981010100;0.0
xyz;1981010101;0.0
xyz;1981010102;0.0
xyz;1981010103;0.0
xyz;1981010104;0.0
xyz;1981010105;0.0
xyz;1981010106;0.0
xyz;1981010107;0.0
xyz;1981010108;0.0
xyz;1981010109;0.4
xyz;1981010110;0.6
xyz;1981010111;0.1
xyz;1981010112;0.1
xyz;1981010113;0.0
xyz;1981010114;0.1
xyz;1981010115;0.6
xyz;1981010116;0.0
xyz;1981010117;0.0
xyz;1981010118;0.2
xyz;1981010119;0.0
xyz;1981010120;0.0
xyz;1981010121;0.0
xyz;1981010122;0.0
xyz;1981010123;0.0
xyz;1981010200;0.0
You first have to convert the column containing the datetimes to a Series of datetime dtype, by parsing it with the format of your datetimes (note lowercase %m for month; uppercase %M would mean minutes):
df["yyyymmddhh"] = pd.to_datetime(df["yyyymmddhh"], format="%Y%m%d%H")
After having the correct data type you have to set that column as your index and can now use pandas functionality for time series data (resampling in your case).
First you resample the data to 3 hour windows and sum the values. From that you resample to yearly data and take the maximum value of all the 3 hour windows for each year.
df.set_index("yyyymmddhh").resample("3H").sum().resample("Y").max()
# Output
yyyymmddhh rainfall
1981-12-31 1.1
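Note that resample uses fixed, non-overlapping 3-hour bins starting on the hour, while the asker's sliding-max code wanted every overlapping 3-hour window. A rolling sum gives the overlapping version; this sketch assumes the datetime column is already parsed and set as the index, and uses one made-up day shaped like the sample data:

```python
import pandas as pd

idx = pd.date_range("1981-01-01", periods=24, freq="h")
df = pd.DataFrame(
    {"rainfall": [0.0] * 9 + [0.4, 0.6, 0.1, 0.1, 0.0, 0.1, 0.6, 0.0, 0.0, 0.2] + [0.0] * 5},
    index=idx,
)

# Sum over every 3-consecutive-hour window, then take the yearly maximum
rolling_3h = df["rainfall"].rolling(3).sum()
annual_max = rolling_3h.groupby(rolling_3h.index.year).max()
```

Swapping `rolling(3)` for `rolling(6)` etc. covers the other durations the question mentions.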

How to do a rolling Groupby using a Multiindex

I have a multi index series. One of the indices is day and I try to go through and get the data in a day range. I have looked into using just rolling with a time given as a string, but it returns a list of the same length whereas I only need 1 response per unique date index.
This is my current code, is there a simpler way to do this:
idx = pd.IndexSlice  # needed for the multi-level .loc slice below
result = {}
for date in df.index.levels[2]:  # go through all of the days
    pre_date = date - np.timedelta64(window, 'D')  # the date `window` days ago
    cur_df = df.loc[idx[:, :, pre_date:date], :]  # all data in that day range
    result[date] = f(cur_df)
result = pd.Series(result)
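No answer was posted here, but one possible simplification (a sketch on a toy single-level series, not verified against the asker's multi-index data): aggregate to one row per unique date first, then apply a time-based rolling window, which yields exactly one result per date rather than one per row:

```python
import pandas as pd

# Toy series: a datetime index with duplicate dates, mimicking one index level
idx = pd.to_datetime(["2021-01-01", "2021-01-01", "2021-01-02", "2021-01-04"])
s = pd.Series([1.0, 2.0, 3.0, 4.0], index=idx)

# Collapse to one value per unique date, then roll a 2-day time window;
# offset-based rolling handles gaps in the dates automatically
per_day = s.groupby(level=0).sum()
rolled = per_day.rolling("2D").sum()
```

For a real multi-index, the same idea extends by grouping over the other index levels before the rolling step.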

Convert dataframe date row to a weekend / not weekend value

I have a dataframe with a DATE column and I need to derive from it a column with value 1 if the date falls on a weekend day and 0 if it does not.
So far, I converted the dates to weekdays:
df['WEEKDAY'] = pandas.to_datetime(df['DATE']).dt.dayofweek
Is there a way to create this "WEEKEND" column without functions?
Thanks!
Here's the solution I've come up with:
df['WEEKDAY'] = ((pd.DatetimeIndex(df.index).dayofweek) // 5 == 1).astype(float)
Essentially all it does is use integer division (//) by 5 to test whether the dayofweek attribute of the DatetimeIndex is 5 or 6 (Saturday or Sunday). Normally this would return just True or False, but tacking on astype(float) at the end returns 1.0 or 0.0 rather than a boolean.
One more way of getting weekend indicator is by where function:
df['WEEKDAY'] = np.where(df['DATE'].dt.dayofweek < 5, 0, 1)
One more way of getting weekend indicator is by first converting the date column to day of the week and then using those values to get weekend/not weekend. This can be implemented as follows:
df['WEEKDAY'] = pandas.to_datetime(df['DATE']).dt.dayofweek # monday = 0, sunday = 6
df['weekend_indi'] = 0 # Initialize the column with default value of 0
df.loc[df['WEEKDAY'].isin([5, 6]), 'weekend_indi'] = 1 # 5 and 6 correspond to Sat and Sun
Found the top-voted answer in some codebase. Please don't do it that way; it's completely unreadable. Instead do:
df['weekend'] = df['date'].dt.day_name().isin(['Saturday', 'Sunday'])
Note: df['date'] needs to be in datetime64 or similar format.
The simplest solution I found is:
df['WEEKEND'] = df['DATE'].dt.weekday > 4
The question asks for 0 or 1 instead of True and False; just multiply by 1:
df['WEEKEND'] = (df['DATE'].dt.weekday > 4) * 1
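A quick check on toy data that the integer-comparison and day-name approaches agree (2021-01-01 was a Friday):

```python
import pandas as pd

# Fri, Sat, Sun, Mon
df = pd.DataFrame({"DATE": pd.to_datetime(
    ["2021-01-01", "2021-01-02", "2021-01-03", "2021-01-04"])})

# Integer comparison: Saturday is weekday 5 and Sunday is 6
df["WEEKEND"] = (df["DATE"].dt.weekday > 4) * 1

# Readable day-name version; astype(int) turns the booleans into 0/1
df["WEEKEND2"] = df["DATE"].dt.day_name().isin(["Saturday", "Sunday"]).astype(int)
```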
