I am trying to write a highly efficient function that takes an average-sized dataframe (~5000 rows) and returns a column of the latest year (with the same index) such that, for each date in the original index, the month containing that date falls between a pre-specified start date (st_d) and end date (end_d). I wrote code in which the year is decremented until the month for a particular date index is within the desired range. However, it is really slow: for a dataframe with only 366 entries it takes ~0.2s. I need to make it at least an order of magnitude faster so that I can repeatedly apply it to tens of thousands of dataframes. I would very much appreciate any suggestions.
import pandas as pd
import numpy as np
import time
from pandas.tseries.offsets import MonthEnd
def year_replace(st_d, end_d, x):
    tmp = time.perf_counter()
    def prior_year(d):
        # 100 is the number of years back, more than enough.
        for i_t in range(100):
            # The month should have been fully seen in one of the data years.
            t_start = pd.to_datetime(str(d.month) + '/' + str(end_d.year - i_t), format="%m/%Y")
            t_end = t_start + MonthEnd(1)
            if t_start <= end_d and t_start >= st_d and t_end <= end_d and t_end >= st_d:
                break
        if i_t < 99:
            return t_start.year
        else:
            raise BadDataException("Not enough data for Gradient Boosted tree.")
    output = pd.Series(index=x.index, data=x.index.map(lambda tt: prior_year(tt)), name='year')
    print("time for single dataframe replacement = ", time.perf_counter() - tmp)
    return output
i = pd.date_range('01-01-2019', '01-01-2020')
x = pd.DataFrame(index = i, data=np.full(len(i), 0))
st_d = pd.to_datetime('01/2016', format="%m/%Y")
end_d = pd.to_datetime('01/2018', format="%m/%Y")
year_replace(st_d, end_d, x)
My advice is: avoid loops whenever you can, and check whether a simpler approach is available.
If I understand correctly, what you aim to do is:
For a given start and stop timestamp, find the latest (highest) year such that the timestamp t built from that year and the index's month satisfies start <= t <= stop.
I believe this can be formalized as follows (I kept your function signature for convenience):
def f(start, stop, x):
    assert start < stop
    tmp = time.perf_counter()
    def y(d):
        # Check current year:
        if start <= d.replace(day=1, year=stop.year) <= stop:
            return stop.year
        # Check previous year:
        if start <= d.replace(day=1, year=stop.year - 1) <= stop:
            return stop.year - 1
        # Otherwise fail:
        raise TypeError("Ooops")
    # Apply to index:
    df = pd.Series(index=x.index, data=x.index.map(lambda t: y(t)), name='year')
    print("Tick: ", time.perf_counter() - tmp)
    return df
It executes faster, as requested: almost two orders of magnitude here (we should benchmark to be sure, e.g. with timeit):
Tick: 0.004744200000004639
There is no need to iterate: you can just check the current and previous year. If both checks fail, no timestamp fulfilling your requirements can exist.
If the day must be kept, just remove day=1 from the replace call. If you require strict (exclusive) cut criteria, modify the inequalities accordingly. The following function:
def y(d):
    if start < d.replace(year=stop.year) < stop:
        return stop.year
    if start < d.replace(year=stop.year - 1) < stop:
        return stop.year - 1
    raise TypeError("Ooops")
Returns the same result as yours.
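If even the element-wise map shows up in profiling, the same check-two-years idea can be done fully vectorized. A sketch under the same semantics as f above (f_vectorized is my name, not part of the original):

import numpy as np

def f_vectorized(start, stop, x):
    # Build month-start timestamps for every index entry, in the stop year
    # and the year before, entirely with array operations.
    parts = pd.DataFrame({'year': stop.year, 'month': x.index.month, 'day': 1})
    cur = pd.to_datetime(parts)                               # month starts in the stop year
    prev = pd.to_datetime(parts.assign(year=stop.year - 1))   # and in the year before
    year = np.where((start <= cur) & (cur <= stop), stop.year,
                    np.where((start <= prev) & (prev <= stop), stop.year - 1, -1))
    if (year == -1).any():
        raise ValueError("no year in range for some dates")
    return pd.Series(year, index=x.index, name='year')

Calling f_vectorized(st_d, end_d, x) should match the day=1 variant of f, with no per-element Python calls.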
I'm currently developing something and was wondering whether the new match statement in Python 3.10 would be suited for a use case like this, where I have conditional statements.
As input I have a timestamp and a dataframe with dates and values. The goal is to loop over all rows and add each value to the corresponding bin based on its date. Which bin a value is placed in depends on the date in relation to the timestamp: a date within 1 month of the timestamp is placed in bin 1, within 2 months in bin 2, etc.
The code that I have now is as follows:
bins = [0] * 7
# note: the 'm' unit in pd.Timedelta means minutes, not months; calendar
# months would need a pd.DateOffset instead
for date, value in zip(df.iloc[:,0], df.iloc[:,1]):
    match [date, value]:
        case [date, value] if date < timestamp + pd.Timedelta(1,'m'):
            bins[0] += value
        case [date, value] if date > timestamp + pd.Timedelta(1,'m') and date < timestamp + pd.Timedelta(2,'m'):
            bins[1] += value
        case [date, value] if date > timestamp + pd.Timedelta(2,'m') and date < timestamp + pd.Timedelta(3,'m'):
            bins[2] += value
        case [date, value] if date > timestamp + pd.Timedelta(3,'m') and date < timestamp + pd.Timedelta(4,'m'):
            bins[3] += value
        case [date, value] if date > timestamp + pd.Timedelta(4,'m') and date < timestamp + pd.Timedelta(5,'m'):
            bins[4] += value
        case [date, value] if date > timestamp + pd.Timedelta(5,'m') and date < timestamp + pd.Timedelta(6,'m'):
            bins[5] += value
Correction: originally I stated that this code does not work. It turns out that it actually does. However, I am still wondering if this would be an appropriate use of the match statement.
I'd say it's not a good use of structural pattern matching because there is no actual structure: you are checking values of a single object, so an if/elif chain is a much better, more readable and natural choice.
I've got two more issues with the way you wrote it:
you do not consider values that fall exactly on the edges of the bins;
you are checking the same condition twice. If you have reached some check in match/case, you are guaranteed that the previous ones did not match, so you do not need date > timestamp + pd.Timedelta(1,'m') and ...: if the previous check date < timestamp + pd.Timedelta(1,'m') failed, you already know the date is not smaller. (There is an edge case of equality, but that has to be handled somehow anyway.)
All in all I think this would be the cleaner solution:
for date, value in zip(df.iloc[:,0], df.iloc[:,1]):
    if date < timestamp + pd.Timedelta(1,'m'):
        bins[0] += value
    elif date < timestamp + pd.Timedelta(2,'m'):
        bins[1] += value
    elif date < timestamp + pd.Timedelta(3,'m'):
        bins[2] += value
    elif date < timestamp + pd.Timedelta(4,'m'):
        bins[3] += value
    elif date < timestamp + pd.Timedelta(5,'m'):
        bins[4] += value
    elif date < timestamp + pd.Timedelta(6,'m'):
        bins[5] += value
    else:
        pass
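One more variant worth sketching (not from the original posts): precompute the six upper edges once and let bisect pick the bin. This avoids rebuilding a Timedelta per comparison and makes the boundary behavior explicit, since bisect_right sends a date equal to an edge into the next bin. It reuses timestamp and df from the question:

from bisect import bisect_right

# The six upper edges, computed once (same 'm' = minutes caveat as above).
edges = [timestamp + pd.Timedelta(i, 'm') for i in range(1, 7)]

bins = [0] * 7
for date, value in zip(df.iloc[:,0], df.iloc[:,1]):
    idx = bisect_right(edges, date)  # a date equal to an edge goes into the next bin
    if idx < 6:                      # dates past the last edge are ignored, as in the original
        bins[idx] += value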
This should really be done directly with Pandas functions:
import pandas as pd
from datetime import datetime

timestamp = datetime.now()
# closed bins at the six offsets, plus catch-all bins on the far left and right
bins = ([pd.Timestamp(year=1970, month=1, day=1)]
        + [pd.Timestamp(timestamp) + pd.Timedelta(i, 'm') for i in range(6)]
        + [pd.Timestamp(year=2100, month=1, day=1)])
n_samples = 1000
data = {
    'date': [pd.to_datetime(timestamp) + pd.Timedelta(i, 's') for i in range(n_samples)],
    'value': list(range(n_samples))
}
df = pd.DataFrame(data)
df['bin'] = pd.cut(df.date, bins, right=False)
df.groupby('bin').value.sum()
I have a large dataframe that is imported from an Excel sheet. I have already filtered it to exclude weekends, but I also need to keep only daytime hours, e.g. 7:00 - 18:00. Here is what the dataframe looks like after successfully taking out weekends.
picture of data
from pandas.tseries.offsets import BDay

isBusinessDay = BDay().is_on_offset
match_series = pd.to_datetime(df['timestamp(America/New_York)']).map(isBusinessDay)
df_new = df[match_series]
df_new
A simple approach is to use filters on your datetime field using the Series dt accessor.
In this case...
# .dt.hour <= 18 keeps everything up to 18:59; use < 18 to stop at 18:00
filt = (df['timestamp(America/New_York)'].dt.hour >= 7) & (df['timestamp(America/New_York)'].dt.hour <= 18)
df_filtered = df.loc[filt, :]
More reading: https://pandas.pydata.org/docs/reference/api/pandas.Series.dt.html
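If the timestamps can serve as the index, pandas' between_time does the same windowing directly (note it cuts off at exactly 18:00, where the hour filter above keeps everything up to 18:59):

df_filtered = (df.set_index('timestamp(America/New_York)')
                 .between_time('07:00', '18:00')
                 .reset_index())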
For more and a sample of this in action, see the below code block. The random date generator was taken from here and modified slightly.
import random
import time
import pandas as pd
def str_time_prop(start, end, time_format, prop):
    """Get a time at a proportion of a range of two formatted times.

    start and end should be strings specifying times formatted in the
    given format (strftime-style), giving an interval [start, end].
    prop specifies what proportion of the interval should be taken after
    start. The returned time will be in the specified format.
    """
    stime = time.mktime(time.strptime(start, time_format))
    etime = time.mktime(time.strptime(end, time_format))
    ptime = stime + prop * (etime - stime)
    return time.strftime(time_format, time.localtime(ptime))

def random_date(start, end, prop):
    return str_time_prop(start, end, '%Y-%m-%d %I:%M %p', prop)
dates = {'dtfield':[random_date("2007-1-1 1:30 PM", "2009-1-1 4:50 AM", random.random()) for n in range(1000)]}
df = pd.DataFrame(data=dates)
df['dtfield'] = pd.to_datetime(df['dtfield'])
filt = (df['dtfield'].dt.hour >= 7) & (df['dtfield'].dt.hour <= 18)
df_filtered = df.loc[filt, :]
df_filtered
I have a huge directory with a thousand-plus files in it.
I want to get the files created from 7:30 PM to 7:30 AM and vice versa.
I have been using the code below, but it seems to get slower as the number of files increases. I am running it on Linux.
First I defined the get_time function:
from datetime import datetime, timedelta

def get_time():
    tmp_date = datetime.now()
    year = tmp_date.year
    month = tmp_date.month
    day = tmp_date.day
    date_start = datetime(year, month, day, 7, 30)
    date_end = datetime(year, month, day, 19, 30)
    shift = "Day Shift"
    if (date_start < tmp_date) and (tmp_date > date_end):
        # after 19:30: night shift runs until 7:30 tomorrow
        date_start = datetime(year, month, day, 19, 30)
        date_end = datetime(year, month, day, 7, 30) + timedelta(1)
        shift = "Night Shift"
    elif (date_start > tmp_date) and (tmp_date < date_end):
        # before 7:30: night shift started at 19:30 yesterday
        date_start = datetime(year, month, day, 19, 30) - timedelta(1)
        date_end = datetime(year, month, day, 7, 30)
        shift = "Night Shift"
    return date_start, date_end, shift
and then
import os

def get_qc_success(ROOT_FOLDER):
    date_start, date_end, shift = get_time()
    files = []
    ARCHIVE_FOLDER = os.path.join(ROOT_FOLDER, "LOMS", "ARCHIVE")
    for csv in os.listdir(ARCHIVE_FOLDER):
        path = os.path.join(ARCHIVE_FOLDER, csv)
        filetime = datetime.fromtimestamp(os.path.getctime(path))
        if date_start < filetime < date_end:
            files.append(csv)
    len_success = len(files)
    return files, len_success, shift
Is there any other methods to make it even faster ?
What you can do instead of returning is yielding.
def get_qc_success(ROOT_FOLDER, date_start, date_end):
    # date_start and date_end are now passed in, so get_time() is called
    # once by the caller instead of inside this function
    ARCHIVE_FOLDER = os.path.join(ROOT_FOLDER, "LOMS", "ARCHIVE")
    for csv in os.listdir(ARCHIVE_FOLDER):
        path = os.path.join(ARCHIVE_FOLDER, csv)
        filetime = datetime.fromtimestamp(os.path.getctime(path))
        if date_start < filetime < date_end:
            yield csv
You still need len_success, right? That is just the length of files, and you can compute it from the generator as well.
For me this is the best way. Why? See the links below on return vs. yield.
date_start, date_end, shift = get_time()
# the generator lazily yields each matching csv name; iterate over it as needed
generator = get_qc_success("filepathsample/", date_start, date_end)
# note: summing consumes the generator; re-create it if you also need the names
len_files = sum(1 for x in generator)
As for your get_time() function, I think you would be better off calling it once at the top and just passing in the date_start and date_end variables. In general you only want to get the list of files; you are appending them to a list and returning that list, and there is a better way to do that, which is the yield keyword.
You can also check another question about yield:
Return vs Yield
Source here
If you are still curious about yield and return, take this example, run it, and observe the output and flow of the program. You'll see the benefits you get if you consider yield.
import time

def foo():
    data = []
    for i in range(10):
        data.append(i)
        print("Sleeping")
        time.sleep(2)
    return data

def foofoo():
    for i in range(10):
        yield i
        print("Sleeping")
        time.sleep(2)

# Run this first and observe the program output & flow
for f in foo():
    print(f)

'''
# Run this second (comment out the first loop above) and compare
for f in foofoo():
    print(f)
'''
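On the raw speed question itself: most of the cost is one stat call per file. A sketch using os.scandir (the function name is mine), which exposes cached stat information on each directory entry, typically does less work per file than os.listdir plus os.path.getctime on a joined path:

import os
from datetime import datetime

def get_qc_success_scandir(root_folder, date_start, date_end):
    # DirEntry.is_file() can usually answer from the directory entry itself,
    # and stat() results are cached on the entry after the first call.
    archive = os.path.join(root_folder, "LOMS", "ARCHIVE")
    with os.scandir(archive) as entries:
        for entry in entries:
            if entry.is_file():
                filetime = datetime.fromtimestamp(entry.stat().st_ctime)
                if date_start < filetime < date_end:
                    yield entry.name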
I have a dict of data entries with a UNIX epoch timestamp as the key and some value (this could be a Boolean, int, float, or enumerated string). I'm trying to set up a method that takes a start time, end time, and bin size (x minutes, x hours or x days), and places each value in the dict into one of the bins between these times.
Essentially, I'm trying to convert data from the real world measured at a certain time to data occurring on a time-step, starting at time=0 and going until time=T, where the length of the time step can be set when calling the method.
I'm trying to make something along the lines of:
def binTimeSeries(dict, startTime, endTime, timeStep):
    bins = []
    # floor begin time to a timeStep increment
    # ceil end time to a timeStep increment
    for key in dict.keys():
        if key > floorStartTime and key < ceilEndTime:
            timeDiff = (key - floorStartTime)
            binIndex = floor(timeDiff/timeStep)
            bins[binIndex].append(dict[key])
I'm having trouble working out which time format is suitable for the conversion from a UNIX epoch timestamp, one that can handle the floor, ceil and modulo operations given a variable timeStep interval, and then how to actually perform those operations. I've searched for this, but I'm getting confused by the formalisms for datetime and pandas, and which might be more suitable for this.
Maybe something like this? Instead of asking for a bin size (the interval of each bin), I think it makes more sense to ask how many bins you'd like instead. That way you're guaranteed each bin will be the same size (cover the same interval).
In my example below, I generated some fake data, which I called data, and picked the start and end timestamps and the number of bins arbitrarily. I calculate the difference between the end and start timestamps, which I'm calling the duration; this yields the total time between the two timestamps. (I realize it's a bit silly to recalculate this value, seeing as how I hardcoded it earlier in the end_time_stamp definition, but it's there for completeness.) The bin_interval, in seconds, is the duration divided by the number of bins.
I ended up doing everything just using plain old UNIX / POSIX timestamps, without any conversion. However, I will mention that datetime.datetime has a method called fromtimestamp, which accepts a POSIX timestamp and returns a datetime object populated with the year, month, seconds, etc.
In addition, in my example, all I end up adding to the bins are the keys - just for demonstration - you'll have to modify it to suit your needs.
def main():
    import time

    values = ["A", "B", "C", "D", "E", "F", "G"]
    data = {time.time() + (offset * 32): value for offset, value in enumerate(values)}

    start_time_stamp = time.time() + 60
    end_time_stamp = start_time_stamp + 75
    number_of_bins = 12

    assert end_time_stamp > start_time_stamp
    duration = end_time_stamp - start_time_stamp
    bin_interval = duration / number_of_bins

    bins = [[] for _ in range(number_of_bins)]

    for key, value in data.items():
        if not (start_time_stamp <= key <= end_time_stamp):
            continue
        for bin_index, current_bin in enumerate(bins):
            if start_time_stamp + (bin_index * bin_interval) <= key < start_time_stamp + ((bin_index + 1) * bin_interval):
                current_bin.append(key)
                break

    print("Original data:")
    for key, value in data.items():
        print(key, value)
    print(f"\nStart time stamp: {start_time_stamp}")
    print(f"End time stamp: {end_time_stamp}\n")
    print(f"Bin interval: {bin_interval}")
    print("Bins:")
    for current_bin in bins:
        print(current_bin)
    return 0

if __name__ == "__main__":
    import sys
    sys.exit(main())
Output:
Original data:
1573170895.1871762 A
1573170927.1871762 B
1573170959.1871762 C
1573170991.1871762 D
1573171023.1871762 E
1573171055.1871762 F
1573171087.1871762 G
Start time stamp: 1573170955.1871762
End time stamp: 1573171030.1871762
Bin interval: 6.25
Bins:
[1573170959.1871762]
[]
[]
[]
[]
[1573170991.1871762]
[]
[]
[]
[]
[1573171023.1871762]
[]
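If you do want the original bin-size formulation rather than a bin count, plain POSIX timestamps (floats in seconds) are enough for the floor/ceil arithmetic; no datetime or pandas machinery is needed. A sketch, with hypothetical names:

import math

def bin_time_series(data, start_time, end_time, time_step):
    # Floor/ceil the window to multiples of time_step, then place each key
    # by integer division: no per-bin scanning required.
    floor_start = math.floor(start_time / time_step) * time_step
    ceil_end = math.ceil(end_time / time_step) * time_step
    bins = [[] for _ in range(round((ceil_end - floor_start) / time_step))]
    for key, value in data.items():
        if floor_start <= key < ceil_end:
            bins[int((key - floor_start) // time_step)].append(value)
    return bins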
Looking for a clean function for this, ideally in Pandas/Numpy. I'm currently building something messy out of the CustomBusinessHour() and Timedelta() functions from Pandas, but I think there must be a better way. If Pandas had a CustomBusinessMinute() feature, this would be as easy as len(pd.date_range(timestamp1, timestamp2, freq=CustomBusinessMinute())).
By "Business Minute," I mean a minute that meets certain criteria. For example, in my case this means 1) does not fall on weekend, 2) falls between 9am and 5pm, and 3) does not fall on Federal Holiday.
Thanks
Consider the following:
You only have to closely examine the start and end dates, i.e. carefully calculate the business minutes for those two days.
For every other date in between, you only need to know one or two things: (1) whether it's a weekday, and if it is, (2) whether it's a Federal Holiday.
Every qualifying date in between contributes exactly the same number of business minutes: 480.
Pandas offers a way to get business days based on US Federal Holidays. That takes care of the hardest part. The rest should be easy to implement.
There's probably a more elegant way, but here's something to start with. Most of the code is for dealing with the start and end dates. Getting all the minutes in between is about 4 lines.
from datetime import datetime
from dateutil.relativedelta import relativedelta
import pandas as pd
from pandas.tseries.offsets import CDay
from pandas.tseries.holiday import USFederalHolidayCalendar

business_day = CDay(calendar=USFederalHolidayCalendar())

def is_weekday(dt):
    return dt.weekday() < 5

def is_holiday(dt):
    return not len(pd.date_range(dt, dt, freq=business_day))

def weekend_or_holiday(dt):
    '''helper function'''
    if not is_weekday(dt):
        return True
    if is_holiday(dt):
        return True
    return False

def start_day_minutes(dt, end_of_day=None):
    '''returns number of business minutes left in the day given a start datetime'''
    if not end_of_day:
        end_of_day = dt.replace(hour=17, minute=0)
    if dt > end_of_day or weekend_or_holiday(dt):
        return 0
    num_of_minutes = (end_of_day - dt).seconds / 60
    return num_of_minutes

def end_day_minutes(dt):
    '''like start_day_minutes, but for the ending day.'''
    start_of_day = dt.replace(hour=9, minute=0)
    if dt < start_of_day or weekend_or_holiday(dt):
        return 0
    num_of_minutes = (dt - start_of_day).seconds / 60
    return num_of_minutes

def business_minutes(t1, t2):
    '''returns num of business minutes between t1 and t2'''
    start = t1.replace(hour=0, minute=0) + relativedelta(days=1)
    end = t2.replace(hour=0, minute=0) + relativedelta(days=-1)
    days_between = pd.date_range(start, end, freq=business_day)
    minutes_between = len(days_between) * 480
    # compare full dates; comparing only (year, day) would wrongly match
    # e.g. Jan 5 and Feb 5 of the same year
    if t1.date() == t2.date():
        start_end_minutes = start_day_minutes(t1, t2)
    else:
        start_end_minutes = start_day_minutes(t1) + end_day_minutes(t2)
    minutes = minutes_between + start_end_minutes
    return minutes
Example:
start = datetime(2016, 1, 1)
end = datetime(2017, 1, 1)
print(business_minutes(start, end))
# 120480
I ended up manually coding my holidays and writing a simple function based on pd.date_range:

def isDuringBiz(t):
    # business time: 9:00-16:59, Monday-Friday, excluding my hardcoded holiday (Sept 5)
    if (t.hour <= 8 or t.hour >= 17) or t.dayofweek in (5, 6) or (t.day == 5 and t.month == 9):
        return False
    else:
        return True

def getBizTimedelta(start, end):
    bizMinutes = 0
    minRange = pd.date_range(start, end, freq='1min')
    for minute in minRange:  # avoid shadowing the builtin min
        if isDuringBiz(minute):
            bizMinutes += 1
    return pd.Timedelta(minutes=bizMinutes)
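The minute-by-minute loop above is easy to vectorize: pd.date_range already returns a DatetimeIndex, and its .hour, .dayofweek, .day and .month accessors operate on the whole range at once. A sketch with the same hardcoded rules (the function name is mine):

import pandas as pd

def getBizTimedeltaVectorized(start, end):
    # same rules as isDuringBiz, applied with array operations
    minutes = pd.date_range(start, end, freq='1min')
    mask = ((minutes.hour >= 9) & (minutes.hour < 17)
            & (minutes.dayofweek < 5)
            & ~((minutes.day == 5) & (minutes.month == 9)))
    return pd.Timedelta(minutes=int(mask.sum()))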