I'm currently developing something and was wondering whether the new match statement in Python 3.10 is suited to a use case like this, where I have a chain of conditional statements.
As input I have a timestamp and a dataframe with dates and values. The goal is to loop over all rows and add each value to the corresponding bin based on the date. Which bin the value goes into depends on the date in relation to the timestamp: a date within 1 month of the timestamp is placed in bin 1, within 2 months in bin 2, etc.
The code that I have now is as follows:
bins = [0] * 7
for date, value in zip(df.iloc[:,0],df.iloc[:,1]):
    match [date,value]:
        case [date,value] if date < timestamp + pd.Timedelta(1,'m'):
            bins[0] += value
        case [date,value] if date > timestamp + pd.Timedelta(1,'m') and date < timestamp + pd.Timedelta(2,'m'):
            bins[1] += value
        case [date,value] if date > timestamp + pd.Timedelta(2,'m') and date < timestamp + pd.Timedelta(3,'m'):
            bins[2] += value
        case [date,value] if date > timestamp + pd.Timedelta(3,'m') and date < timestamp + pd.Timedelta(4,'m'):
            bins[3] += value
        case [date,value] if date > timestamp + pd.Timedelta(4,'m') and date < timestamp + pd.Timedelta(5,'m'):
            bins[4] += value
        case [date,value] if date > timestamp + pd.Timedelta(5,'m') and date < timestamp + pd.Timedelta(6,'m'):
            bins[5] += value
Correction: originally I stated that this code does not work. It turns out that it actually does. However, I am still wondering if this would be an appropriate use of the match statement.
I'd say it's not a good use of structural pattern matching because there is no actual structure to match. You are only checking values of a single object, so an if/elif chain is a much better, more readable and more natural choice.
I've got two more issues with the way you wrote it:
You do not consider values that fall exactly on the edges of the bins.
You are checking the same condition twice. When a case is reached, all previous cases are guaranteed not to have matched, so you do not need if date > timestamp + pd.Timedelta(1,'m') and ...: if the previous check date < timestamp + pd.Timedelta(1,'m') failed, you already know that the date is not smaller. (There is the edge case of equality, but that should be handled somehow anyway.)
All in all I think this would be the cleaner solution:
for date, value in zip(df.iloc[:,0],df.iloc[:,1]):
    if date < timestamp + pd.Timedelta(1,'m'):
        bins[0] += value
    elif date < timestamp + pd.Timedelta(2,'m'):
        bins[1] += value
    elif date < timestamp + pd.Timedelta(3,'m'):
        bins[2] += value
    elif date < timestamp + pd.Timedelta(4,'m'):
        bins[3] += value
    elif date < timestamp + pd.Timedelta(5,'m'):
        bins[4] += value
    elif date < timestamp + pd.Timedelta(6,'m'):
        bins[5] += value
    else:
        pass
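A side note with a sketch, not part of the code above: as far as I know, pd.Timedelta(1, 'm') is one minute rather than one month, so for calendar months something like pd.DateOffset(months=k) is needed. Assuming half-open bins (a date exactly on an edge goes to the later bin) and made-up sample data, the edges can be computed once and looked up with bisect instead of a chain of comparisons:

```python
from bisect import bisect_right

import pandas as pd

timestamp = pd.Timestamp("2022-01-15")  # made-up reference timestamp

# Upper edges of bins 0..5: timestamp + 1 month, ..., timestamp + 6 months.
edges = [timestamp + pd.DateOffset(months=k) for k in range(1, 7)]


def bin_index(date):
    """Index of the bin for `date`; 6 means 'six months or more after timestamp'."""
    # bisect_right sends a date that equals an edge to the later bin.
    return bisect_right(edges, date)


# Made-up sample data standing in for the real dataframe.
df = pd.DataFrame({
    "date": pd.to_datetime(["2022-01-20", "2022-02-15", "2022-03-01", "2022-09-01"]),
    "value": [1, 2, 3, 4],
})

bins = [0] * 7
for date, value in zip(df["date"], df["value"]):
    bins[bin_index(date)] += value

print(bins)
```

Like the original loop, this puts every date earlier than the first edge (including dates before the timestamp) into bin 0.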
This should really be done directly with Pandas functions:
import pandas as pd
from datetime import datetime
timestamp = datetime.now()
bins = [pd.Timestamp(year=1970, month=1, day=1)]+[pd.Timestamp(timestamp)+pd.Timedelta(i, 'm') for i in range(6)]+[pd.Timestamp(year=2100, month=1, day=1)] # plus open bin on the right
n_samples = 1000
data = {
'date': [pd.to_datetime(timestamp)+pd.Timedelta(i,'s') for i in range(n_samples)],
'value': list(range(n_samples))
}
df = pd.DataFrame(data)
df['bin'] = pd.cut(df.date, bins, right=False)
df.groupby('bin').value.sum()
Related
I am trying to write a highly efficient function that would take an average size dataframe (~5000 rows) and return a dataframe with column of the latest year (and same index) such that for each date index of the original dataframe the month containing that date is between some pre-specified start date (st_d) and end date (end_d). I wrote a code where the year is decremented till the month for a particular dateindex is within the desired range. However, it is really slow. For the dataframe with only 366 entries it takes ~0.2s. I need to make it at least an order of magnitude faster so that I can repeatedly apply it to tens of thousands of dataframes. I would very much appreciate any suggestions for this.
import pandas as pd
import numpy as np
import time
from pandas.tseries.offsets import MonthEnd
def year_replace(st_d, end_d, x):
    tmp = time.perf_counter()
    def prior_year(d):
        # 100 is number of the years back, more than enough.
        for i_t in range(100):
            #The month should have been fully seen in one of the data years.
            t_start = pd.to_datetime(str(d.month) + '/' + str(end_d.year - i_t), format="%m/%Y")
            t_end = t_start + MonthEnd(1)
            if t_start <= end_d and t_start >= st_d and t_end <= end_d and t_end >= st_d:
                break
        if i_t < 99:
            return t_start.year
        else:
            raise BadDataException("Not enough data for Gradient Boosted tree.")
    output = pd.Series(index = x.index, data = x.index.map(lambda tt: prior_year(tt)), name = 'year')
    print("time for single dataframe replacement = ", time.perf_counter() - tmp)
    return output
i = pd.date_range('01-01-2019', '01-01-2020')
x = pd.DataFrame(index = i, data=np.full(len(i), 0))
st_d = pd.to_datetime('01/2016', format="%m/%Y")
end_d = pd.to_datetime('01/2018', format="%m/%Y")
year_replace(st_d, end_d, x)
My advice is: avoid loops whenever you can, and check whether an easier way is available.
If I understand correctly, what you aim to do is:
For given start and stop timestamps, find the latest (highest) timestamp t whose month is taken from the index and for which start <= t <= stop.
I believe this can be formalized as follows (I kept your function signature for convenience):
def f(start, stop, x):
    assert start < stop
    tmp = time.perf_counter()
    def y(d):
        # Check current year:
        if start <= d.replace(day=1, year=stop.year) <= stop:
            return stop.year
        # Check previous year:
        if start <= d.replace(day=1, year=stop.year-1) <= stop:
            return stop.year-1
        # Otherwise fail:
        raise TypeError("Ooops")
    # Apply to index:
    df = pd.Series(index=x.index, data=x.index.map(lambda t: y(t)), name='year')
    print("Tick: ", time.perf_counter() - tmp)
    return df
It seems to execute faster as requested (almost two orders of magnitude; we should benchmark to be sure, e.g. with timeit):
Tick: 0.004744200000004639
There is no need to iterate; you can just check the current and the previous year. If both checks fail, no timestamp fulfilling your requirements can exist.
If the day must be kept, just remove day=1 from the replace call. If the cut criteria must be strict, change the inequalities accordingly. The following function:
def y(d):
    if start < d.replace(year=stop.year) < stop:
        return stop.year
    if start < d.replace(year=stop.year-1) < stop:
        return stop.year-1
    raise TypeError("Ooops")
Returns the same dataframe as yours.
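Another way to cut the cost, sketched here under the assumption that the answer depends only on the calendar month of each index entry (as in the question's prior_year): compute the year once per distinct month, at most 12 times, and then look it up for every row. The names latest_full_month_year and year_replace_fast are made up for this example, and ValueError stands in for the question's own exception.

```python
import numpy as np
import pandas as pd
from pandas.tseries.offsets import MonthEnd


def latest_full_month_year(st_d, end_d, month, max_years_back=100):
    """Latest year whose `month` lies entirely inside [st_d, end_d]."""
    for back in range(max_years_back):
        t_start = pd.Timestamp(year=end_d.year - back, month=month, day=1)
        t_end = t_start + MonthEnd(1)  # last day of that month
        if st_d <= t_start and t_end <= end_d:
            return t_start.year
    raise ValueError("no year fully contains month %d" % month)


def year_replace_fast(st_d, end_d, x):
    # One lookup entry per distinct month in the index instead of one search per row.
    lookup = {int(m): latest_full_month_year(st_d, end_d, int(m))
              for m in x.index.month.unique()}
    return pd.Series([lookup[int(m)] for m in x.index.month],
                     index=x.index, name="year")


# Same inputs as in the question.
i = pd.date_range("01-01-2019", "01-01-2020")
x = pd.DataFrame(index=i, data=np.full(len(i), 0))
st_d = pd.to_datetime("01/2016", format="%m/%Y")
end_d = pd.to_datetime("01/2018", format="%m/%Y")
years = year_replace_fast(st_d, end_d, x)
```

This keeps the "whole month must be inside the range" rule from the question; timing it against the original with timeit would show whether it is fast enough.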
I need to check that two dates do not fall within any date range in a list.
I want to find out whether a user can check in on the given dates (check_range_true: can, check_range_false: can't) or whether these dates are already booked (in date_ranges).
I have ranges that look like this:
date_ranges = [
['2020-1-12', '2020-1-13'],
['2020-1-14', '2020-1-15'],
['2020-1-15', '2020-1-16'],
['2020-1-16', '2020-1-18'],
['2020-1-18', '2020-1-19'],
['2020-1-21', '2020-1-23'],
['2020-1-23', '2020-1-27'],
['2020-1-30', '2020-2-1'],
['2020-2-5', '2020-2-7'],
['2020-2-7', '2020-2-9'],
['2020-2-9', '2020-2-11'],
['2020-2-14', '2020-2-18'],
['2020-2-20', '2020-2-26'],
['2020-3-26', '2020-3-30'],
['2020-5-29', '2020-5-30'],
['2020-10-10', '2021-1-15']
]
And two dates (for example)
check_range_true = ['2020-02-02', '2020-02-04']
check_range_false = ['2020-02-02', '2020-02-05']
I know how to check one date against a range, but I don't understand how to solve it with two dates.
What is the best way to check these dates against the ranges and get True for the first variable (because 2020-02-02 to 2020-02-04 does not "touch" any range) and False for the second variable (because 2020-02-05 is in the range ['2020-2-5', '2020-2-7'])?
What you want to do is check the dates against the free slots between bookings with (start < first_date < end) and (start < end_date < end) logic:
date_ranges = [
['2020-1-12', '2020-1-13'],
['2020-1-14', '2020-1-15'],
['2020-1-15', '2020-1-16'],
['2020-1-16', '2020-1-18'],
['2020-1-18', '2020-1-19'],
['2020-1-21', '2020-1-23'],
['2020-1-23', '2020-1-27'],
['2020-1-30', '2020-2-1'],
['2020-2-5', '2020-2-7'],
['2020-2-7', '2020-2-9'],
['2020-2-9', '2020-2-11'],
['2020-2-14', '2020-2-18'],
['2020-2-20', '2020-2-26'],
['2020-3-26', '2020-3-30'],
['2020-5-29', '2020-5-30'],
['2020-10-10', '2021-1-15']
]
#convert to a flat list
date_ranges = [k for i in date_ranges for k in i]
#truncate the start and the end value
date_ranges = date_ranges[1:-1]
#convert values to datetime
import datetime
date_ranges = [datetime.datetime.strptime(i, '%Y-%m-%d') for i in date_ranges]
#create available time slots
date_ranges = [[date_ranges[i],date_ranges[i+1]] for i in range(0,len(date_ranges),2)]
#convert the check date to date time
check_range = ['2020-02-02', '2020-02-04']
check_range = [datetime.datetime.strptime(i, '%Y-%m-%d') for i in check_range]
# apply the logic of start < date < end twice
any([(i[0] < check_range[0] < i[1]) and (i[0] < check_range[1] < i[1]) for i in date_ranges])
True
check_range = ['2020-02-02', '2020-02-05']
check_range = [datetime.datetime.strptime(i, '%Y-%m-%d') for i in check_range]
any([(i[0] < check_range[0] < i[1]) and (i[0] < check_range[1] < i[1]) for i in date_ranges])
False
If I understand this correctly you want to check if given date range (e.g. check_range_true) overlaps (or not) with any other date range specified in the list. To achieve this, I would first transform string values to proper datetime objects for easier dates comparison. This could be achieved with list comprehension and strptime:
from datetime import datetime
booked_date_ranges = [
[datetime.strptime(start_date, '%Y-%m-%d'), datetime.strptime(end_date, '%Y-%m-%d')]
for start_date, end_date in date_ranges
]
Then I would create a function that checks whether the provided date range (a start date and an end date) overlaps with any date range from the previously specified list. You need to check whether the start date or the end date lies within a booked range. It would be something along these lines:
def dates_overlap(date_range_to_check, booked_date_ranges):
    dates = [datetime.strptime(date, '%Y-%m-%d') for date in date_range_to_check]
    return any(
        (start_date <= dates[0] and dates[0] <= end_date) or (start_date <= dates[1] and dates[1] <= end_date)
        for start_date, end_date in booked_date_ranges
    )
Then if you want to check if given date range DOES NOT overlap, you can just use the dates_overlap function and negate the result:
>>> not dates_overlap(check_range_false, booked_date_ranges)
False
>>> not dates_overlap(check_range_true, booked_date_ranges)
True
I hope this answers your question. Of course this is just a draft and there's definitely some room for improvement, but should be a working solution to given problem.
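One case the endpoint checks above can miss is a requested range that completely contains a booking. A common way to cover it is the interval test "A starts no later than B ends, and B starts no later than A ends". Here is a minimal sketch, assuming that touching an endpoint counts as a clash (as in the question) and using a shortened copy of date_ranges; the names parse and is_free are made up for this example.

```python
from datetime import datetime

# Shortened copy of the booked ranges from the question.
date_ranges = [
    ['2020-1-30', '2020-2-1'],
    ['2020-2-5', '2020-2-7'],
    ['2020-2-7', '2020-2-9'],
    ['2020-2-9', '2020-2-11'],
    ['2020-2-14', '2020-2-18'],
]


def parse(day):
    return datetime.strptime(day, '%Y-%m-%d')


def is_free(check_range, booked_ranges):
    """True if check_range shares no day with any booked range."""
    start, end = parse(check_range[0]), parse(check_range[1])
    # Two closed intervals overlap exactly when each starts before the other ends.
    return not any(
        parse(b_start) <= end and start <= parse(b_end)
        for b_start, b_end in booked_ranges
    )


print(is_free(['2020-02-02', '2020-02-04'], date_ranges))
print(is_free(['2020-02-02', '2020-02-05'], date_ranges))
print(is_free(['2020-02-03', '2020-02-12'], date_ranges))
```

The third call is the containment case: the requested range swallows several bookings without either of its endpoints lying inside one.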
I'm trying to create a list of the hours contained within each specified interval, which would be quite complicated with loops. Therefore, I wanted to ask for datetime recommendations.
# input in format DDHH/ddhh:
validity = ['2712/2812','2723/2805','2800/2812']
# demanded output:
val_hours = ['2712', '2713', '2714'..., '2717', '2723', '2800',...'2804',]
It would be great if the last hour of validity were treated as non-valid, because the interval ends at that hour, or more precisely at the 59th minute of the previous one.
I've tried a quite complicated way with if conditions and loops, but I am convinced that there is a better one, as always.
It is something like:
#input in format DDHH/ddhh:
validity = ['2712/2812','2723/2805','2800/2812']
output = []
#upbound = result of a previously defined function giving the length of each group
upbound = [24, 6, 12]
#For only first 24-hour group:
for hour in range(0,upbound[0]):
item = int(validity[0][-7:-5]) + hour
if (hour >= 24):
hour = hour - 24
output = output + hour
Further, I would have to zero-prefix values whose day is smaller than 10, like 112 (the 1st, 12:00 Zulu), and ensure the correct day.
Loops and ifs seem just too complicated to me. Even without error handling, it takes two or three conditions.
Thank you for your help!
For each validity string, I use datetime.strptime to parse it; then, depending on whether the start date is less than or equal to the end date or greater than it, I calculate the hours.
If the start date is less than or equal to the end date, I use the original validity string; otherwise I split it into two strings, start_date/3023 and 0100/end_date.
import datetime
validity = ['2712/2812','2723/2805','2800/2812','3012/0112','3023/0105','0110/0112']
def get_valid_hours(valid):
    hours_li = []
    #Parse the start date and end date as datetime
    start_date_str, end_date_str = valid.split('/')
    start_date = datetime.datetime.strptime(start_date_str,'%d%H')
    end_date = datetime.datetime.strptime(end_date_str, '%d%H')
    #If start date less than equal to end date
    if start_date <= end_date:
        dt = start_date
        i=0
        #Keep creating new dates until we hit end date
        while dt < end_date:
            #Append the dates to a list
            dt = start_date+datetime.timedelta(hours=i)
            hours_li.append(dt.strftime('%d%H'))
            i+=1
    #Else split the validity into two and calculate them separately
    else:
        start_date_str, end_date_str = valid.split('/')
        return get_valid_hours('{}/3023'.format(start_date_str)) + get_valid_hours('0100/{}'.format(end_date_str))
    #Append sublist to a bigger list
    return hours_li
for valid in validity:
    print(get_valid_hours(valid))
The output then looks like this (not sure if this was the format needed):
['2712', '2713', '2714', '2715', '2716', '2717', '2718', '2719', '2720', '2721', '2722', '2723', '2800', '2801', '2802', '2803', '2804', '2805', '2806', '2807', '2808', '2809', '2810', '2811', '2812']
['2723', '2800', '2801', '2802', '2803', '2804', '2805']
['2800', '2801', '2802', '2803', '2804', '2805', '2806', '2807', '2808', '2809', '2810', '2811', '2812']
['3012', '3013', '3014', '3015', '3016', '3017', '3018', '3019', '3020', '3021', '3022', '3023', '0100', '0101', '0102', '0103', '0104', '0105', '0106', '0107', '0108', '0109', '0110', '0111', '0112']
['0100', '0101', '0102', '0103', '0104', '0105']
['0110', '0111', '0112']
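The lists above include the end hour, while the question asked for it to be excluded, and splitting at 3023 assumes a 30-day month. A possible variant is sketched below; it assumes the caller knows the year and month in which the validity starts (the DDHH strings do not carry them), and the name valid_hours is made up for this example.

```python
from datetime import datetime, timedelta


def valid_hours(validity, year, month):
    """Hours covered by a 'DDHH/ddhh' string, with the end hour excluded.

    `year` and `month` describe the start of the validity; they are assumed
    inputs because the string itself does not contain them.
    """
    start_s, end_s = validity.split('/')
    start = datetime(year, month, int(start_s[:2]), int(start_s[2:]))
    end_day, end_hour = int(end_s[:2]), int(end_s[2:])
    end_year, end_month = year, month
    if (end_day, end_hour) < (start.day, start.hour):
        # An end that sorts before the start belongs to the following month.
        end_year, end_month = (year + 1, 1) if month == 12 else (year, month + 1)
    end = datetime(end_year, end_month, end_day, end_hour)
    n_hours = int((end - start) / timedelta(hours=1))
    # range(n_hours) stops one hour short of `end`, so the end hour is left out.
    return [(start + timedelta(hours=k)).strftime('%d%H') for k in range(n_hours)]


print(valid_hours('2723/2805', 2024, 1))
print(valid_hours('3123/0102', 2024, 1))
```

Because real dates are used, the month rollover follows the actual month length instead of a hard-coded day 30.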
Finally, I created something easy like this:
validity = ['3012/0112','3023/0105','0110/0112']
upbound = [24, 6, 12]
hours_list = []
for idx, val in enumerate(validity):
    hours_li = []
    DD = val[:2]
    HH = val[2:4]
    dd = val[5:7]
    hh = val[7:9]
    if DD == dd:
        for i in range(int(HH),upbound[idx]):
            hours_li.append(DD + str(i).zfill(2))
    # != instead of <>, which only exists in Python 2
    if DD != dd:
        for i in range(int(HH),24):
            hours_li.append(DD + str(i).zfill(2))
        for j in range(0,int(hh)):
            hours_li.append(dd + str(j).zfill(2))
    hours_list.append(hours_li)
This works for 24h validity (it could be solved with one if condition and a similar concatenation block) and does not use datetime, just numbers and str. It is neither Pythonic nor fast, but it works.
I have a pandas dataframe in which each cell of a column contains a timestamp, saved as a string:
>>>dataset['DateTime'][1]
'2018-03-14 00:34:46'
I would like to create a new column in which those dates are manipulated in the following way:
year += 1,
month += 2,
day += 3,
hour += 4,
minute += 5,
second += 6
(Important to this manipulation is that the initial date and the new date have a one-to-one relation, so that I can transform the date back later onwards)
In my case, the output I am looking for is as follows:
>>> dataset['newTimestamp'][1]
'2019-05-17 04:39:52'
To do so I am using the datetime library to create datetime objects (as a test, I have started with one variable first):
timestamp = dataset['DateTime'][1]
p = datetime.datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S")
Currently I am doing arithmetics on the single variables:
year = p.year + 1
if p.month < 12:
    month = p.month + 1
else:
    month = 1
    year += 1
However, as with the months, there are exceptions: you cannot always simply add values and still get a real timestamp (12 + 1 = 13, which is not an actual month).
I could program every rule explicitly, but that seems too much work and I expect there are better ways. How could I do this faster?
Use DateOffset:
dataset['newTimestamp'] = pd.to_datetime(dataset['DateTime']) + pd.DateOffset(years=1, months=2, days=3, hours=4, minutes=5, seconds=6)
Also, have a look at the relativedelta module (from dateutil) for this kind of manipulation.
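A small self-contained sketch of the DateOffset approach, with a made-up one-row dataset, that also formats the result back to a string as in the question. One caveat regarding the one-to-one requirement: as far as I can tell, month arithmetic clips to the last valid day, so two different dates can end up on the same shifted date.

```python
import pandas as pd

# Made-up one-row stand-in for the real dataset.
dataset = pd.DataFrame({'DateTime': ['2018-03-14 00:34:46']})

shift = pd.DateOffset(years=1, months=2, days=3, hours=4, minutes=5, seconds=6)

shifted = pd.to_datetime(dataset['DateTime']) + shift
# Back to the string format used in the original column.
dataset['newTimestamp'] = shifted.dt.strftime('%Y-%m-%d %H:%M:%S')
print(dataset['newTimestamp'][0])

# Caveat: adding months is not always reversible, because days are clipped
# to the end of the target month.
jan30 = pd.Timestamp('2019-01-30') + pd.DateOffset(months=1)
jan31 = pd.Timestamp('2019-01-31') + pd.DateOffset(months=1)
print(jan30, jan31)
```

If the transformation really has to be invertible for every date, a fixed pd.Timedelta (days, hours, minutes, seconds only) would be the safer choice, since it has no calendar-dependent clipping.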
You should try out beautiful-date library:
pip install beautiful-date
And use it like so:
from beautiful_date import *
...
dataset['DateTime'].apply(lambda d: d + 1 * years + 2 * months + ... + 6 * seconds)
should do the trick.
strptime() and strftime() are the functions you are looking for.
Just go ahead and look up the two functions; you will surely be able to solve the stated problem.
They convert between strings and datetime objects, which you can then manipulate directly.
I want to get all months between now and August 2010, as a list formatted like this:
['2010-08-01', '2010-09-01', .... , '2016-02-01']
Right now this is what I have:
months = []
# range() excludes its upper bound, so 2017 is needed to include 2016
for y in range(2010, 2017):
    for m in range(1, 13):
        if (y == 2010) and m < 8:
            continue
        if (y == 2016) and m > 2:
            continue
        month = '%s-%s-01' % (y, ('0%s' % (m)) if m < 10 else m)
        months.append(month)
What would be a better way to do this?
dateutil.relativedelta is handy here.
I've left the formatting out as an exercise.
from dateutil.relativedelta import relativedelta
import datetime
result = []
today = datetime.date.today()
current = datetime.date(2010, 8, 1)
while current <= today:
    result.append(current)
    current += relativedelta(months=1)
I had a look at the dateutil documentation. Turns out it provides an even more convenient way than using dateutil.relativedelta: recurrence rules (examples)
For the task at hand, it's as easy as
from dateutil.rrule import *
from datetime import date
months = map(
date.isoformat,
rrule(MONTHLY, dtstart=date(2010, 8, 1), until=date.today())
)
The fine print
Note that we're cheating a little bit, here. The elements dateutil.rrule.rrule produces are of type datetime.datetime, even if we pass dtstart and until of type datetime.date, as we do above. I let map feed them to date's isoformat function, which just turns out to convert them to strings as if it were just dates without any time-of-day information.
Therefore, the seemingly equivalent list comprehension
[day.isoformat()
for day in rrule(MONTHLY, dtstart=date(2010, 8, 1), until=date.today())]
would return a list like
['2010-08-01T00:00:00',
'2010-09-01T00:00:00',
'2010-10-01T00:00:00',
'2010-11-01T00:00:00',
⋮
'2015-12-01T00:00:00',
'2016-01-01T00:00:00',
'2016-02-01T00:00:00']
Thus, if we want to use a list comprehension instead of map, we have to do something like
[dt.date().isoformat()
for dt in rrule(MONTHLY, dtstart=date(2010, 8, 1), until=date.today())]
Use the standard datetime module (datetime and timedelta), without installing any new libraries:
from datetime import datetime, timedelta
now = datetime(datetime.now().year, datetime.now().month, 1)
ctr = datetime(2010, 8, 1)
months = [ctr.strftime('%Y-%m-%d')]
while ctr < now:
    # 32 days past the 1st always lands in the next month
    ctr += timedelta(days=32)
    # snap back to the 1st so the extra days do not accumulate and skip a month
    ctr = ctr.replace(day=1)
    months.append(ctr.strftime('%Y-%m-%d'))
I'm adding 32 days to move into the next month every time (the longest months have 31 days) and then snapping back to the 1st.
It seems like there's a very simple and clean way to do this: generate a list of all dates and keep only the first day of each month, as shown in the example below.
import datetime
import pandas as pd
start_date = datetime.date(2010,8,1)
end_date = datetime.date(2016,2,1)
date_range = pd.date_range(start_date, end_date)
date_range = date_range[date_range.day==1]
print(date_range)
I've got another way using datetime, timedelta and calendar:
from calendar import monthrange
from datetime import datetime, timedelta
def monthdelta(d1, d2):
    delta = 0
    while True:
        mdays = monthrange(d1.year, d1.month)[1]
        d1 += timedelta(days=mdays)
        if d1 <= d2:
            delta += 1
        else:
            break
    return delta
start_date = datetime(2016, 1, 1)
end_date = datetime(2016, 12, 1)
num_months = [i-12 if i>12 else i for i in range(start_date.month, monthdelta(start_date, end_date)+start_date.month+1)]
monthly_daterange = [datetime(start_date.year,i, start_date.day, start_date.hour) for i in num_months]
You could reduce the number of if statements to two lines instead of four lines because having a second if statement that does the same thing with the previous if statement is a bit redundant.
if (y == 2010 and m < 8) or (y == 2016 and m > 2):
    continue
I don't know whether it's better, but an approach like the following might be considered more 'pythonic':
months = [
    '{}-{:0>2}-01'.format(year, month)
    for year in xrange(2010, 2016 + 1)
    for month in xrange(1, 12 + 1)
    if not (year <= 2010 and month < 8 or year >= 2016 and month > 2)
]
The main differences here are:
As we want the iteration(s) to produce a list, use a list comprehension instead of aggregating list elements in a for loop.
Instead of explicitly making a distinction between numbers below 10 and numbers 10 and above, use the capabilities of the format specification mini-language for the .format() method of str to specify
a field width (the 2 in the {:0>2} place holder)
right-alignment within the field (the > in the {:0>2} place holder)
zero-padding (the 0 in the {:0>2} place holder)
xrange instead of range returns a lazy sequence object instead of a list, so that the iteration values are produced as they are consumed and don't have to be held in memory. (Doesn't matter for ranges this small, but it's a good idea to get used to this in Python 2.) Note: In Python 3, there is no xrange; range itself is already lazy and no longer builds a list.
Make the + 1 for the upper bounds explicit. This makes it easier for human readers of the code to recognize that we want to specify an inclusive bound to a method (range or xrange) that treats the upper bound as exclusive. Otherwise, they might wonder what's the deal with the number 13.
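For Python 3, the same idea might look like the sketch below. It is a variation rather than the code above: range replaces xrange, an f-string with the 02d format replaces .format(), and the two boundary conditions are folded into one tuple comparison, which is my own simplification.

```python
# (year, month) tuples compare element by element, so one chained comparison
# covers both the lower bound (August 2010) and the upper bound (February 2016).
months = [
    f'{year}-{month:02d}-01'
    for year in range(2010, 2016 + 1)
    for month in range(1, 12 + 1)
    if (2010, 8) <= (year, month) <= (2016, 2)
]

print(months[0], months[-1], len(months))
```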
A different approach that doesn't require any additional libraries, nor nested or while loops. Simply convert your dates into an absolute number of months from some reference point (it can be any date really, but for simplicity we can use 1st January 0001). For example
a=datetime.date(2010,2,5)
abs_months = a.year * 12 + a.month
Once you have a number representing the month you are in you can simply use range to loop over the months, and then convert back:
Solution to the generalized problem:
import datetime
def range_of_months(start_date, end_date):
    months = []
    for i in range(start_date.year * 12 + start_date.month, end_date.year*12+end_date.month + 1):
        months.append(datetime.date((i-13) // 12 + 1, (i-1) % 12 + 1, 1))
    return months
Additional Notes/explanation:
Here // divides rounding down to the nearest whole number, and % 12 gives the remainder when divided by 12, e.g. 13 % 12 is 1.
(Note also that in the above date.year *12 + date.month does not give the number of months since the 1st of January 0001. For example if date = datetime.datetime(1,1,1), then date.year * 12 + date.month gives 13. If I wanted to do the actual number of months I would need to subtract 1 from the year and month, but that would just make the calculations more complicated. All that matters is that we have a consistent way to convert to and from some integer representation of what month we are in.)
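The same month-counting idea can be written with a zero-based month, which lets divmod do the conversion back. This is a sketch of a variation, not the function above; month_starts is a made-up name and the ISO string output is assumed from the original question.

```python
import datetime


def month_starts(start_date, end_date):
    """ISO strings for the first day of every month from start_date to end_date."""
    # Zero-based month index: January of year 1 maps to 12, and so on.
    first = start_date.year * 12 + (start_date.month - 1)
    last = end_date.year * 12 + (end_date.month - 1)
    result = []
    for i in range(first, last + 1):
        year, month0 = divmod(i, 12)  # month0 runs from 0 to 11
        result.append(datetime.date(year, month0 + 1, 1).isoformat())
    return result


months = month_starts(datetime.date(2010, 8, 1), datetime.date(2016, 2, 15))
print(months[0], months[-1], len(months))
```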
fresh pythonic one-liner from me
from dateutil.relativedelta import relativedelta
import datetime
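# relativedelta(end, start) is end - start; its .months field alone drops whole years
delta = relativedelta(end_date, start_date)
[(start_date + relativedelta(months=+m)).isoformat()
 for m in range(0, delta.years * 12 + delta.months + 1)]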
In case you don't have any months duplicates and they are in correct order you can get what you want with this.
from datetime import date, timedelta
first = date.today()
last = first + timedelta(weeks=20)
date_format = "%Y-%m"
results = []
while last >= first:
    results.append(last.strftime(date_format))
    last -= timedelta(days=last.day)
Similar to @Mattaf's answer, but simpler.
pandas.date_range() has a frequency option; freq='M' yields the last day of each month.
Here I add a day (pd.Timedelta('1d')) in order to reach the beginning of each following month:
import pandas as pd
date_range = pd.date_range('2010-07-01','2016-02-01',freq='M')+pd.Timedelta('1d')
print(list(date_range))
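As far as I know, date_range also accepts freq='MS' (month start), which avoids the one-day shift. A minimal sketch with the dates from the question, formatted as the requested strings:

```python
import pandas as pd

# 'MS' anchors every generated date on the first day of a month.
month_starts = pd.date_range('2010-08-01', '2016-02-01', freq='MS')
months = month_starts.strftime('%Y-%m-%d').tolist()

print(months[0], months[-1], len(months))
```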