I have a list of datetime objects in Python and want to aggregate them by the hour. For example, if I have datetime objects for
[03/01/2012 00:12:12,
03/01/2012 00:55:12,
03/01/2012 01:12:12,
...]
I want a list of datetime objects for every hour, along with a count of the number of datetime objects that fall into each bucket. For my example above, I would want output of
[03/01/2012 00:00:00, 03/01/2012 01:00:00] in one list and a count of the entries in another list: [2,1].
You could store that kind of data efficiently with a dictionary where the keys are the hours and the values are lists of the datetime objects that fall into each hour. Note that keying on dt.hour alone would merge the same hour of different days, so it is safer to truncate each datetime to the top of the hour, e.g. (untested):
import datetime

l = [datetime.datetime.now(), datetime.datetime.now()]  # ...etc.
hour_quantization = {}
for dt in l:
    hour = dt.replace(minute=0, second=0, microsecond=0)  # truncate to the hour
    if hour not in hour_quantization:
        hour_quantization[hour] = [dt]
    else:
        hour_quantization[hour].append(dt)
counts = [len(v) for v in hour_quantization.values()]
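The same bucketing is a little more compact with collections.defaultdict (a minimal sketch, equally untested):
from collections import defaultdict

hour_quantization = defaultdict(list)
for dt in l:
    hour_quantization[dt.replace(minute=0, second=0, microsecond=0)].append(dt)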
See the doc entry on datetime.
Assuming you have a list ts of datetime objects, you can count how many fall in each hour:
from collections import Counter
hours = [t.hour for t in ts]
Counter(hours)
This will give you:
Counter({0: 2, 1: 1})
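Note that keying on t.hour alone merges the same hour of different days. If the list spans more than one day, counting the datetimes truncated to the hour keeps the buckets distinct and yields the two lists the question asks for (a minimal sketch):
from collections import Counter

buckets = Counter(t.replace(minute=0, second=0, microsecond=0) for t in ts)
hours, counts = list(buckets.keys()), list(buckets.values())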
I am currently working with a csv file that contains datetimes and timestamps. The dataframe looks like this:
print(df[:10])
[0 '2019-10-10 21:59:17.074007' '2015-10-13 00:55:55.544607'
'2017-05-24 06:00:15.959202' '2016-12-07 09:01:04.729686'
'2019-05-29 11:16:44.130063' '2017-01-19 16:06:37.625964'
'2018-04-07 19:42:43.708620' '2016-06-28 03:13:58.266977'
'2015-03-21 00:03:07.704446']
Now I want to convert those strings into datetimes and find the earliest date among them. I don't have much experience with datetimes in dataframes, so I am not sure how to do it. Any suggestions?
You can convert the strings with to_datetime, then take the min:
import pandas as pd

dates = ['2019-10-10 21:59:17.074007', '2015-10-13 00:55:55.544607',
'2017-05-24 06:00:15.959202', '2016-12-07 09:01:04.729686',
'2019-05-29 11:16:44.130063', '2017-01-19 16:06:37.625964',
'2018-04-07 19:42:43.708620', '2016-06-28 03:13:58.266977',
'2015-03-21 00:03:07.704446']
pd.to_datetime(dates).min()
Output:
Timestamp('2015-03-21 00:03:07.704446')
Update
If you want to do it across all columns of the dataframe:
df.apply(pd.to_datetime).min().min()
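If the dates all live in a single column instead, converting just that column avoids the double min (a sketch, assuming the dates are in the first column):
pd.to_datetime(df.iloc[:, 0]).min()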
Let's call the list you mentioned l. You can iterate over it, parse the dates using datetime.strptime, collect them in a new list, and take the earliest:
from datetime import datetime
parsed_dates = []
for d in l:
    parsed_dates.append(datetime.strptime(d, "%Y-%m-%d %H:%M:%S.%f"))
print(min(parsed_dates))
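The intermediate list isn't strictly needed; a generator expression gives the same result (a sketch):
earliest = min(datetime.strptime(d, "%Y-%m-%d %H:%M:%S.%f") for d in l)
print(earliest)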
Convert these values to datetime by using the to_datetime() method:
df = pd.to_datetime(df, errors='coerce')
Now find the earliest date by using the min() method:
earliest_date = df.min()
Or you can find the earliest date with the nsmallest() method (this works on a Series):
earliest_date = df.nsmallest(1)
Here's a quick problem that I, at first, dismissed as easy. An hour in, and I'm not so sure!
So, I have a list of Python datetime objects, and I want to graph them. The x-values are the year and month, and the y-values would be the amount of date objects in this list that happened in this month.
Perhaps an example will demonstrate this better (dd/mm/yyyy):
[28/02/2018, 01/03/2018, 16/03/2018, 17/05/2018]
-> ([02/2018, 03/2018, 04/2018, 05/2018], [1, 2, 0, 1])
My first attempt tried to simply group by month and year, along the lines of:
import itertools
group = itertools.groupby(dates, lambda date: date.strftime("%b/%Y"))
graph = zip(*[(k, len(list(v))) for k, v in group])  # format the data for graphing
As you've probably noticed though, this will group only by dates that are already present in the list. In my example above, the fact that none of the dates occurred in April would have been overlooked.
Next, I tried finding the starting and ending dates, and looping over the months between them:
import datetime

data = [[], []]
for year in range(min_date.year, max_date.year):
    for month in range(min_date.month, max_date.month):
        k = datetime.datetime(year=year, month=month, day=1).strftime("%b/%Y")
        v = sum(1 for date in dates if date.strftime("%b/%Y") == k)
        data[0].append(k)
        data[1].append(v)
Of course, this only works if min_date.month is smaller than max_date.month, which is not necessarily the case if they span multiple years. Also, it's pretty ugly.
Is there an elegant way of doing this?
Thanks in advance
EDIT: To be clear, the dates are datetime objects, not strings. They look like strings here for the sake of being readable.
I suggest using pandas:
import pandas as pd
dates = ['28/02/2018', '01/03/2018', '16/03/2018', '17/05/2018']
s = pd.to_datetime(pd.Series(dates), format='%d/%m/%Y')
s.index = s.dt.to_period('m')
s = s.groupby(level=0).size()
s = s.reindex(pd.period_range(s.index.min(), s.index.max(), freq='m'), fill_value=0)
print (s)
2018-02 1
2018-03 2
2018-04 0
2018-05 1
Freq: M, dtype: int64
s.plot.bar()
Explanation:
First create a Series from the list of dates and convert with to_datetime.
Create a PeriodIndex with Series.dt.to_period.
groupby the index (level=0) and get the counts with GroupBy.size.
Add the missing periods with Series.reindex, using a period_range built from the min and max values of the index.
Last, plot, e.g. bars with Series.plot.bar.
Using Counter:
import random
import collections

dates = list()
for y in range(2015, 2019):
    for m in range(1, 13):
        for i in range(random.randint(1, 4)):
            dates.append("{}/{}".format(m, y))
print(dates)
counter = collections.Counter(dates)
print(counter)
For your problem of months with no occurrences, you can use Counter's update and subtract methods: generate a list covering the whole range of months, with each month appearing exactly once, then update the counter with it and subtract it again. Every missing month is added as a key and ends up with an explicit count of 0, like so:
tmp_date_list = ["{}/{}".format(m, y) for y in range(2015, 2019) for m in range(1, 13)]
counter.update(tmp_date_list)    # every month gains a key (+1 each)
counter.subtract(tmp_date_list)  # take the +1 back, leaving zeros for missing months
I am trying to make a hash table to speed up the process of finding the difference between a particular date and a holiday date (I have a list of 10 holiday dates).
holidays =['2014-01-01', '2014-01-20', '2014-02-17', '2014-05-26',
'2014-07-04', '2014-09-01', '2014-10-13', '2013-11-11',
'2013-11-28', '2013-12-25']
from datetime import datetime
holidaydate=[]
for i in range(10):
    holidaydate.append(datetime.strptime(holidays[i], '%Y-%m-%d'))
newdate=pd.to_datetime(df.YEAR*10000+df.MONTH*100+df.DAY_OF_MONTH,format='%Y-%m-%d')
#newdate contains all the 0.5 million of dates!
Now I want to use a hash table to calculate the difference between each of the 0.5 million dates in "newdate" and the closest holiday. I do NOT want to repeat the same calculation millions of times; that's why I want to use a hash table for this.
I tried searching for a solution on google but only found stuff such as:
keys = ['a', 'b', 'c']
values = [1, 2, 3]
hash = {k:v for k, v in zip(keys, values)}
And this does not work in my case.
Thanks for your help!
You need to create the table first, like this:
import datetime

holidays = ['2014-01-01', '2014-01-20', '2014-02-17', '2014-05-26',
            '2014-07-04', '2014-09-01', '2014-10-13', '2013-11-11',
            '2013-11-28', '2013-12-25']
hdates = []

def return_date(txt):
    # convert a "yyyy-mm-dd" string to a datetime.date
    _t = txt.split("-")
    return datetime.date(int(_t[0]), int(_t[1]), int(_t[2]))

def find_closest(d):
    # nearest holiday and its distance in days
    _d = min(hdates, key=lambda x: abs(x - d))
    _diff = abs(_d - d).days
    return _d, _diff

# Convert holidays to datetime.date
for h in holidays:
    hdates.append(return_date(h))

# Build the "hash" table
hash_table = {}
i_date = datetime.date(2013, 1, 1)
while i_date < datetime.date(2016, 1, 1):
    cd, cdiff = find_closest(i_date)
    hash_table[i_date] = {"date": cd, "difference": cdiff}
    i_date = i_date + datetime.timedelta(days=1)

print(hash_table[datetime.date(2014, 10, 15)])
This works on datetime.date objects instead of raw strings; it includes a helper function to convert a "yyyy-mm-dd" string to a datetime.date, though.
This creates a hash table for all dates between 1/1/2013 and 31/12/2015 and then tests it with just one date. You would then loop over your 0.5 million dates and look each one up in this dictionary (the key is a datetime.date object, but you can of course convert it back to a string if you so desire).
Anyway, this should give you the idea how to do this.
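Applying it to the question's data is then a plain dictionary lookup per date, e.g. (a sketch, assuming newdate is the pandas Series from the question; dates outside the table's 2013-2015 range would raise a KeyError):
differences = [hash_table[d.date()]["difference"] for d in newdate]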
I have a list of lists composed of dates in Excel float format (every minute since July 5, 1996) and an integer value associated with each date, like this: [[datetime, integer], ...]. I need to create a new list composed of all of the dates (no hours or minutes) and the sum of the values for all of the datetimes within that date. In other words, for each date, the sum of the values over every entry where listolists[x][0] >= math.floor(listolists[x][0]) and listolists[x][0] < math.floor(listolists[x][0]) + 1. Thanks
Since you didn't provide any actual data (just the data structure you used, nested lists), I created some dummy data below to demonstrate how you might do a SUMIFS-type of problem in Python.
from datetime import datetime
import numpy as np
import pandas as pd
dates_list = []
# just take one month as an example of how to group by day
year = 2015
month = 12
# generate similar data to what you might have
for day in range(1, 32):
    for hour in range(1, 24):
        for minute in range(1, 60):
            dates_list.append([datetime(year, month, day, hour, minute), np.random.randint(20)])
# unpack these nested list pairs so we have all of the dates in
# one list, and all of the values in the other
# this makes it easier for pandas later
dates, values = zip(*dates_list)
# to eventually group by day, we need to forget about all intra-day data, e.g.
# different hours and minutes. we only care about the data for a given day,
# not the by-minute observations. So, let's set all of the intra-day values to
# some constant for easier rolling-up of these dates.
new_dates = []
for d in dates:
    new_d = d.replace(hour=0, minute=0)
    new_dates.append(new_d)
# throw the new dates and values into a pandas.DataFrame object
df = pd.DataFrame({'new_dates': new_dates, 'values': values})
# here's the SUMIFS function you're looking for
grouped = df.groupby('new_dates')['values'].sum()
Let's see the results:
>>> print(grouped.head())
new_dates
2015-12-01 12762
2015-12-02 13292
2015-12-03 12857
2015-12-04 12762
2015-12-05 12561
Name: values, dtype: int64
Edit: If you want these new grouped data back in the nested list format, just do this:
new_list = [[date, value] for date, value in zip(grouped.index, grouped)]
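As an aside, the replace loop can be skipped entirely by resampling the by-minute data down to daily sums (a sketch using the dates and values defined above):
daily = pd.Series(values, index=pd.DatetimeIndex(dates)).resample('D').sum()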
Thanks everyone. This is the simplest code I could come up with that doesn't require pandas:
import math

date = {}
for d, v in listolist:
    day = math.floor(d)  # drop the fractional part, i.e. the time of day
    if day in date:
        date[day].append(v)
    else:
        date[day] = [v]
result = [(d, sum(v)) for d, v in date.items()]
I have a large list of dates that are datetime objects, for example
[datetime.datetime(2016,8,14),datetime.datetime(2016,8,13),datetime.datetime(2016,8,12),....etc.]
Instead of datetime objects, what I want is a list of integer values counting the days since 1/1/1900. I have defined 1/1/1900 as the base date, and in the for loop below I calculate the difference between each date in the list and that base date:
from datetime import datetime

baseDate = datetime(1900, 1, 1)
numericalDates = []
for i in enumerate(dates):
    a = i[1] - baseDate
    numericalDates.append(a)
print(numericalDates)
However, when I print this out, I get datetime.timedelta objects instead:
[datetime.timedelta(42592), datetime.timedelta(42591), datetime.timedelta(42590),...etc.]
Any ideas on how I can convert it the proper way?
timedelta objects have a days attribute, so you can simply append that as an int:
numericalDates.append(a.days)
which will result in numericalDates being [42594, 42593, 42592].
Note that you can also simplify your code a bit by using list comprehension:
numericalDates = [(d - baseDate).days for d in dates]
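Putting it together, a minimal runnable version:
from datetime import datetime

baseDate = datetime(1900, 1, 1)
dates = [datetime(2016, 8, 14), datetime(2016, 8, 13), datetime(2016, 8, 12)]

numericalDates = [(d - baseDate).days for d in dates]
print(numericalDates)  # [42594, 42593, 42592]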