I am wondering how I can compare dates in a list so that I can extract the "earliest" one.
(I used a for loop because I had to replace some characters with '-'.)
comment_list = comment_container.findAll("div", {"class": "comment-date"})
D = []
for commentDate in comment_list:
    # commentDate is a bs4 Tag, so parse its text; strptime validates the format
    date_object = datetime.strptime(commentDate.text, '%Y-%m-%d').strftime('%Y-%m-%d')
    D.append(date_object)
print(D)
Output:
['2018-06-26', '2018-04-01', '2018-07-19', '2018-04-23', '2018-08-25', '2018-06-08', '2018-06-14', '2018-07-08', '2019-03-15', '2019-03-15', '2019-03-15', '2019-03-15', '2019-03-15']
I want to extract the earliest date:
Eg.
'2018-04-01'
Just use the min function:
A = ['2018-06-26', '2018-04-01', '2018-07-19', '2018-04-23', '2018-08-25', '2018-06-08', '2018-06-14', '2018-07-08', '2019-03-15', '2019-03-15', '2019-03-15', '2019-03-15', '2019-03-15']
print(min(A))
produces
2018-04-01
from datetime import datetime

comment_list = comment_container.findAll("div", {"class": "comment-date"})
D = []
for commentDate in comment_list:
    # use the tag's text, then build a datetime for a proper comparison
    year, month, day = map(int, commentDate.text.split('-'))
    D.append(datetime(year, month, day))
print(min(D))
You should keep the dates as datetime objects and then use the builtin min() function to determine the earliest date.
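For instance, parsing first and taking min over real date objects (a minimal sketch with made-up sample dates):

```python
from datetime import datetime

# hypothetical sample data in the question's format
dates = ['2018-06-26', '2018-04-01', '2019-03-15']
parsed = [datetime.strptime(d, '%Y-%m-%d') for d in dates]
earliest = min(parsed)  # comparison uses real dates, not strings
print(earliest.strftime('%Y-%m-%d'))  # 2018-04-01
```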
from datetime import datetime
D = ['2018-06-26', '2018-04-01', '2018-07-19', '2018-04-23', '2018-08-25', '2018-06-08',
'2018-06-14', '2018-07-08', '2019-03-15', '2019-03-15', '2019-03-15', '2019-03-15', '2019-03-15']
D.sort()
print(D[0])
or this if you don't want to change D:
T = D[:]
T.sort()
print(T[0])
As suggested by Siong, you can use min(D). You can achieve the same like this:
from datetime import datetime

comment_list = comment_container.findAll("div", {"class": "comment-date"})
D = [datetime.strptime(commentDate.text, '%Y-%m-%d') for commentDate in comment_list]
print(min(D))
Working with datetime.datetime objects is usually preferable since the comparisons you make are not based on the formatting of the string. You can always convert to string later on:
min_date_str = min(D).strftime('%Y-%m-%d')
If you are sure that all dates are zero-padded (i.e. 01 for January rather than 1, and so on), then a simple min or max on the strings is enough. However, tuples of ints can also be compared, which is useful if you encounter a mix of padded and unpadded dates. Consider for example:
d = ['2018-7-1','2018-08-01']
print(min(d)) #prints 2018-08-01 i.e. later date
print(min(d,key=lambda x:tuple(int(i) for i in x.split('-')))) #prints 2018-7-1
This solution assumes the data are not broken, i.e. every element produced by .split('-') can be turned into an int. Alternatively, dateutil's parser handles mixed padding as well:
from dateutil.parser import parse

d = ['2018-7-1', '2018-08-01']
date_mapping = dict((parse(x), x) for x in d)
earliest_date = date_mapping[min(date_mapping)]
print(earliest_date)
# prints 2018-7-1
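If you'd rather avoid the third-party dependency, the same mapping idea works with the standard library, since strptime's %m and %d accept unpadded fields (a sketch with the same sample data):

```python
from datetime import datetime

d = ['2018-7-1', '2018-08-01']
# map each parsed datetime back to its original string form
date_mapping = {datetime.strptime(x, '%Y-%m-%d'): x for x in d}
earliest_date = date_mapping[min(date_mapping)]
print(earliest_date)  # 2018-7-1
```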
Related
I have a list of dates.
['1/12/2022', '1/13/2022','1/17/2022']
How do I reformat them to look like this:
['2022-1-12', '2022-1-13','2022-1-17']
EDIT: My original post asked about the wrong format. I've corrected it because I meant for the format to be "Year-Month-Day".
I am assuming you are using Python; please correct me if I am wrong. You can loop through the list of dates with an enumerated for loop (enumerate(list) gives you the index of each value during the loop) and, for each date, use the str method .replace() to replace '/' with '-', like this:
list_of_dates = ['1/12/2022', '1/13/2022','1/17/2022']
for i, date in enumerate(list_of_dates):
    list_of_dates[i] = date.replace('/', '-')
or use a list comprehension like this (thank you @Eli Harold):
list_of_dates = [date.replace('/', '-') for date in list_of_dates]
If you want to change the order of the numbers in the date string, you can split them by '/' or '-' into a list and reassemble them in the order you want, like this:
for i, date in enumerate(list_of_dates):
    month, day, year = date.split('-')  # assuming you already changed it to dashes
    list_of_dates[i] = f'{year}-{month}-{day}'
You can use strptime:
from datetime import datetime

dates = []
for date_str in ['1/12/2022', '1/13/2022', '1/17/2022']:
    date = datetime.strptime(date_str, '%m/%d/%Y')
    # note: strftime zero-pads, so this gives '2022-01-12' rather than '2022-1-12'
    dates.append(date.strftime('%Y-%m-%d'))
I opted to split the individual dates and then add the "-" delimiter back afterwards, but you could also do the replacement while iterating. Once the data has been transformed, I push it into a new list of reformatted dates.
This may not give the best performance for long lists, though.
dates = ['1/12/2022', '1/13/2022', '1/17/2022']
newdates = []
for x in range(len(dates)):
    split_date = dates[x].split('/')
    month = split_date[0]
    day = split_date[1]
    year = split_date[2]
    your_date = year + "-" + month + "-" + day
    newdates.append(your_date)
    print(your_date)
And the output:
2022-1-12
2022-1-13
2022-1-17
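The same reordering can also be written as a one-line list comprehension, a sketch of the same split-and-rejoin idea:

```python
dates = ['1/12/2022', '1/13/2022', '1/17/2022']
# unpack month/day/year from each string and reassemble as year-month-day
newdates = ['{2}-{0}-{1}'.format(*d.split('/')) for d in dates]
print(newdates)  # ['2022-1-12', '2022-1-13', '2022-1-17']
```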
from datetime import datetime

# note: %m and %d accept unpadded values when parsing, so plain %m/%d/%Y works;
# the %-m / %-d strftime modifiers are platform-specific (glibc) and not valid for strptime
dates = [datetime.strptime(x, "%m/%d/%Y") for x in list_of_dates]
new_dates = [x.strftime("%Y-%-m-%-d") for x in dates]
dates = ['1/12/2022', '1/13/2022', '1/17/2022']
# note: this only swaps '/' for '-'; it does not reorder to year-month-day
dates = [x.replace('/', '-') for x in dates]
I am trying to get the previous 2 and 3 month end date. In the below code, I am able to get the last month end date which is 2021-01-31, but I also need to get 2020-12-31 and 2020-11-30.
Any advices are greatly appreciated.
import datetime

today = datetime.date.today()
first = today.replace(day=1)
lastMonth = first - datetime.timedelta(days=1)
date1 = lastMonth.strftime("%Y-%m-%d")
date1
Out[90]: '2021-01-31'
Try:
prev2month = lastMonth - pd.offsets.MonthEnd(n=1)
prev3month = lastMonth - pd.offsets.MonthEnd(n=2)
More usage information of offset (e.g. MonthEnd, MonthBegin) can be found in the documentation.
A quick and dirty way of doing this, without dealing with the varying number of days in each month, is simply to repeat the process N times, where N is the number of months back you want:
import datetime
today = datetime.date.today()
temp_date = today.replace(day=1)
for _ in range(3):
    previous_month = temp_date - datetime.timedelta(days=1)
    print(previous_month.strftime("%Y-%m-%d"))
    temp_date = previous_month.replace(day=1)
outputs
2021-01-31
2020-12-31
2020-11-30
You can use the calendar package:
import calendar
calendar.monthrange(2020, 2)[1] # gives you the last day of Feb 2020
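For example, a small helper (month_end is a made-up name for this sketch) combines monthrange with datetime.date to get the last day of any month:

```python
import calendar
import datetime

def month_end(year, month):
    # monthrange returns (weekday_of_first_day, number_of_days_in_month)
    last_day = calendar.monthrange(year, month)[1]
    return datetime.date(year, month, last_day)

print(month_end(2020, 2))   # 2020-02-29 (leap year)
print(month_end(2020, 12))  # 2020-12-31
```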
If you're definitely using pandas then you can make use of date_range, eg:
pd.date_range('today', periods=4, freq='-1M', normalize=True)
That'll give you:
DatetimeIndex(['2021-02-28', '2021-01-31', '2020-12-31', '2020-11-30'], dtype='datetime64[ns]', freq='-1M')
Ignore the first element and use as needed...
Alternatively:
dr = pd.date_range(end='today', periods=3, freq='M', normalize=True)[::-1]
Which gives you:
DatetimeIndex(['2021-01-31', '2020-12-31', '2020-11-30'], dtype='datetime64[ns]', freq='-1M')
Then if you want strings you can use dr.strftime('%Y-%m-%d') which'll give you Index(['2021-01-31', '2020-12-31', '2020-11-30'], dtype='object')
input:
data["Date"] = ["2005-01-01", "2005-01-02", "2005-01-03", ..., "2014-12-30", "2014-12-31"]
How can I sort the column so that it gives the 1st date of every year, then the 2nd date of every year, and so on?
i.e.
output:
data["Date"] = ["2005-01-01","2006-01-01","2007-01-01", ... "2013-12-31","2014-12-31"]
NOTE: assuming the date column has no leap days
First:
import datetime

data['D'] = data['Date'].apply(lambda x: datetime.datetime.strptime(x, '%Y-%m-%d'))
data['Day'] = data['D'].apply(lambda x: x.day)
data['Month'] = data['D'].apply(lambda x: x.month)
data['Year'] = data['D'].apply(lambda x: x.year)
data.drop(columns='D', inplace=True)
Then, having a four-column dataframe, we sort by month, then day, then year, which matches the requested ordering:
data.sort_values(by=['Month', 'Day', 'Year'], inplace=True)
Finally, you can drop new columns if you won't need them:
data.drop(columns = ['Day','Month','Year'], inplace=True)
Try using lambda expressions.
from datetime import datetime
data = {"Date": ["2005-01-02", "2005-01-01", "2014-12-30", "2014-12-31"]}
data["Date"].sort(key=lambda date: datetime.strptime(date, "%Y-%m-%d"))
>>> import datetime
>>> dates = [datetime.datetime.strptime(ts, "%Y-%m-%d") for ts in data["Date"]]
>>> dates.sort()
>>> sorteddates = [datetime.datetime.strftime(ts, "%Y-%m-%d") for ts in dates]
>>> sorteddates
['2010-01-12', '2010-01-14', '2010-02-07', '2010-02-11', '2010-11-16', '2010-11-22', '2010-11-23', '2010-11-26', '2010-12-02', '2010-12-13', '2011-02-04', '2011-06-02', '2011-08-05', '2011-11-30']
Why don't you try creating a new column in which you change the format of the date? Like this:
def change_format(row):
    date_parts = row.split('-')
    # reassemble as month-day-year so plain string sorting matches the desired order
    new_date = date_parts[1] + "-" + date_parts[2] + "-" + date_parts[0]
    return new_date

data["Date_new_format"] = data["Date"].apply(change_format)
Now you can sort your dataframe by the column Date_new_format and you will get what you need.
Use:
data["temp"] = pd.to_datetime(data["Date"]).dt.strftime("%m-%d-%Y")
data = data.sort_values(by="temp").drop(columns=["temp"])
I'm looking to compare a list of dates with today's date and return the closest one. I've had various ideas, but they all seem very convoluted: scoring each date by its difference in days and taking the smallest difference. I have no clue how to do this simply; any pointers would be appreciated.
import datetime
import re
date_list = ['2019-02-10', '2018-01-13', '2019-02-8',]
now = datetime.date.today()
for date_ in date_list:
    # \d{1,2} so unpadded days like '8' still match
    match = re.match(r'.*(\d{4})-(\d{1,2})-(\d{1,2}).*', date_)
    if match:
        year = match.group(1)
        month = match.group(2)
        day = match.group(3)
        delta = now - datetime.date(int(year), int(month), int(day))
        print(delta)
EDIT: while I was waiting, I solved this using the below
import datetime
import re

date_list = ['2019-02-10', '2018-01-13', '2019-02-8']
now = datetime.date.today()
dates_range = []
for date_ in date_list:
    match = re.match(r'.*(\d{4})-(\d{1,2})-(\d{1,2}).*', date_)
    if match:
        year = match.group(1)
        month = match.group(2)
        day = match.group(3)
        delta = now - datetime.date(int(year), int(month), int(day))
        dates_range.append(int(delta.days))
days = min(dates_range)
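A shortcut for the approach above: min with a key function returns the closest date string directly, with no separate list of day counts (a sketch reusing the question's sample dates):

```python
import datetime

date_list = ['2019-02-10', '2018-01-13', '2019-02-8']
now = datetime.date.today()

def to_date(s):
    # strptime's %m and %d accept unpadded values like '8'
    return datetime.datetime.strptime(s, '%Y-%m-%d').date()

# the key ranks each string by its absolute distance from today
closest = min(date_list, key=lambda s: abs(now - to_date(s)))
print(closest)  # 2019-02-10 for any current date after 2019
```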
Convert each string into a datetime.date object, then just subtract and take the smallest difference:
import datetime
import re
date_list = ['2019-02-10', '2018-01-13', '2019-02-8',]
now = datetime.date.today()
date_list_converted = [datetime.datetime.strptime(each_date, "%Y-%m-%d").date() for each_date in date_list]
differences = [abs(now - each_date) for each_date in date_list_converted]
minimum = min(differences)
closest_date = date_list[differences.index(minimum)]
This converts the strings to datetime objects, subtracts the current date from each, and returns the date with the lowest absolute difference:
import datetime
import re
date_list = ['2019-02-10', '2018-01-13', '2019-02-8',]
numPattern = re.compile("[0-9]+")

def getclosest(dates):
    now = datetime.date.today()
    diffs = []
    for date_str in dates:
        year, month, day = [int(i) for i in re.findall(numPattern, date_str)]
        currcheck = datetime.date(year, month, day)
        diffs.append(abs(now - currcheck))
    return dates[diffs.index(min(diffs))]
It's by no means the most efficient, but it's semi-elegant and works.
Using inbuilts
Python's inbuilt datetime module has the functionality to do what you desire.
Let's first take your list of dates and convert it into a list of datetime objects:
from datetime import datetime
date_list = ['2019-02-10', '2018-01-13', '2019-02-8']
datetime_list = [datetime.strptime(date, "%Y-%m-%d") for date in date_list]
Once we have this we can find the difference between those dates and today's date.
today = datetime.today()
date_diffs = [abs(date - today) for date in datetime_list]
Excellent, date_diffs is now a list of datetime.timedelta objects. All that is left is to find the minimum and find which date this represents.
To find the minimum difference it is simple enough to use min(date_diffs), however, we then want to use this minimum to extract the corresponding closest date. This can be achieved as:
closest_date = date_list[date_diffs.index(min(date_diffs))]
With pandas
If performance is an issue, it may be worth investigating a pandas implementation. Using pandas we can convert your dates to a pandas dataframe:
from datetime import datetime
import pandas as pd
date_list = ['2019-02-10', '2018-01-13', '2019-02-8']
date_df = pd.to_datetime(date_list)
Finally, as in the method using inbuilts we find the differences in the dates and use it to extract the closest date to today.
today = datetime.today()
date_diffs = abs(today - date_df)
closest_date = date_list[date_diffs.argmin()]
The advantage of this method is that we've removed the for loops, so I'd expect it to be more efficient for large numbers of dates.
One fast and simple way is to use the bisect algorithm, especially if your date_list is significantly big:
import datetime
from bisect import bisect_left
FMT = '%Y-%m-%d'
date_list = ['2019-02-10', '2018-01-13', '2019-02-8', '2019-02-12']
date_list.sort()
def closest_day_to_now(days):
    """
    Return the closest day from an ordered list of day strings.
    Note: relies on zero-padded dates so string order matches date order.
    """
    now = datetime.datetime.now()
    # insertion point i: everything before index i sorts before now
    i = bisect_left(days, now.strftime(FMT))
    if i == 0:
        return days[0]
    if i == len(days):
        return days[-1]
    left_day = datetime.datetime.strptime(days[i - 1], FMT)
    right_day = datetime.datetime.strptime(days[i], FMT)
    return days[i] if abs(right_day - now) < abs(now - left_day) else days[i - 1]
print(closest_day_to_now(date_list))
I have written the following code to preprocess a dataset like this:
StartLocation StartTime EndTime
school Mon Jul 25 19:04:30 GMT+01:00 2016 Mon Jul 25 19:04:33 GMT+01:00 2016
... ... ...
It contains a list of locations attended by a user with the start and end time. Each location may occur several times and there is no comprehensive list of locations. From this, I want to aggregate data for each location (frequency, total time, mean time). To do this I have written the following code:
def toEpoch(x):
    # note: strftime('%s') is a platform-specific (glibc) extension
    try:
        x = datetime.strptime(re.sub(r":(?=[^:]+$)", "", x), '%a %b %d %H:%M:%S %Z%z %Y').strftime('%s')
    except ValueError:
        x = datetime.strptime(x, '%a %b %d %H:%M:%S %Z %Y').strftime('%s')
    return int(x) / 60
#Preprocess data
df = pd.read_csv('...')
for index, row in df.iterrows():
    df['StartTime'][index] = toEpoch(df['StartTime'][index])
    df['EndTime'][index] = toEpoch(df['EndTime'][index])
    df['TimeTaken'][index] = int(df['EndTime'][index]) - int(df['StartTime'][index])
total = df.groupby(df['StartLocation'].str.lower()).sum()
av = df.groupby(df['StartLocation'].str.lower()).mean()
count = df.groupby(df['StartLocation'].str.lower()).count()
output = pd.DataFrame({"location": total.index, 'total': total['TimeTaken'], 'mean': av['TimeTaken'], 'count': count['TimeTaken']})
print(output)
This code functions correctly, however is quite inefficient. How can I optimise the code?
EDIT: Based on @Batman's helpful comments I no longer iterate. However, I still hope to optimise this further if possible. The updated code is:
df = pd.read_csv('...')
df['StartTime'] = df['StartTime'].apply(toEpoch)
df['EndTime'] = df['EndTime'].apply(toEpoch)
df['TimeTaken'] = df['EndTime'] - df['StartTime']
total = df.groupby(df['StartLocation'].str.lower()).sum()
av = df.groupby(df['StartLocation'].str.lower()).mean()
count = df.groupby(df['StartLocation'].str.lower()).count()
output = pd.DataFrame({"location": total.index, 'total': total['TimeTaken'], 'mean': av['TimeTaken'], 'count': count['TimeTaken']})
print(output)
First thing I'd do is stop iterating over the rows.
df['StartTime'] = df['StartTime'].apply(toEpoch)
df['EndTime'] = df['EndTime'].apply(toEpoch)
df['TimeTaken'] = df['EndTime'] - df['StartTime']
Then, do a single groupby operation.
gb = df.groupby('StartLocation')
total = gb.sum()
av = gb.mean()
count = gb.count()
vectorize the date conversion
take the difference of two series of timestamps gives a series of timedeltas
use total_seconds to get the seconds from the timedeltas
groupby with agg
# convert dates
cols = ['StartTime', 'EndTime']
df[cols] = pd.to_datetime(df[cols].stack()).unstack()
# generate timedelta then total_seconds via the `dt` accessor
df['TimeTaken'] = (df.EndTime - df.StartTime).dt.total_seconds()
# define the lower case version for cleanliness
loc_lower = df.StartLocation.str.lower()
# define `agg` functions for cleanliness
# this tells `groupby` to use 3 functions, sum, mean, and count
# it also tells what column names to use
funcs = dict(Total='sum', Mean='mean', Count='count')
df.groupby(loc_lower).TimeTaken.agg(funcs).reset_index()
explanation of date conversion
I define cols for convenience
df[cols] = is an assignment to those two columns
pd.to_datetime() is a vectorized date converter but only takes pd.Series not pd.DataFrame
df[cols].stack() makes the 2-column dataframe into a series, now ready for pd.to_datetime()
use pd.to_datetime(df[cols].stack()) as described, then unstack() to get back my two columns, now ready to be assigned.