How do I convert a list of dates that are in the form yyyymmdd to a serial number? For example, if I have this list of dates:
t = [1898-10-12 06:00,1898-10-12 12:00,1932-09-30 08:00,1932-09-30 00:00]
How do I convert each date to a serial number? I'm currently using the datetime toordinal() method, but dates that differ only in their time are being rounded to the same serial number. How do I get the same dates with different times to map to different numbers?
The entries in the list are datetime.datetime objects. I then tried:
thurser = []
for i in range(len(t)):
    thurser.append(t[i].toordinal())
But I am not getting the serial numbers as floats.
datetime.toordinal() considers only the date part of the datetime object, not the time (so does date.toordinal(), which only has a date part). The first two and last two elements in your list are datetimes on the same date but at different times, which .toordinal() ignores, so it returns the same value for them.
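A quick demonstration (the values match the ordinals in your output further down):
import datetime

a = datetime.datetime(1898, 10, 12, 6, 0)
b = datetime.datetime(1898, 10, 12, 12, 0)
print(a.toordinal(), b.toordinal())  # 693150 693150 -- the time of day is discarded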
In general, the solution would be to calculate the delta between your dates and a pre-determined/fixed one. I'm using datetime.datetime(1, 1, 1), the earliest possible datetime, so all the deltas are positive:
import datetime

thurser = []
# assuming t is a list of datetime objects
for d in t:
    delta = d - datetime.datetime(1, 1, 1)
    thurser.append(delta.days + delta.seconds / (24 * 3600))
>>> print(thurser)
[693149.25, 693149.5, 705555.3333333334, 705555.0]
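If you are on Python 3.2+, dividing one timedelta by another is supported, which expresses the delta in fractional days in one step and also keeps any microseconds (a small variant of the same idea):
thurser = [(d - datetime.datetime(1, 1, 1)) / datetime.timedelta(days=1) for d in t]
# same values as above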
And if you prefer ints instead of floats, then use seconds instead of days:
thurser.append(int(delta.total_seconds()))  # total_seconds() returns a float that includes microseconds
>>> print(thurser)
[59888095200, 59888116800, 60959980800, 60959952000]
And to get back the original values in the 2nd example:
>>> [datetime.timedelta(seconds=d) + datetime.datetime(1, 1, 1) for d in thurser]
[datetime.datetime(1898, 10, 12, 6, 0), datetime.datetime(1898, 10, 12, 12, 0),
datetime.datetime(1932, 9, 30, 8, 0), datetime.datetime(1932, 9, 30, 0, 0)]
>>> _ == t # compare with original values
True
Let me know if my understanding is wrong; I tried the following, and it gives distinct numbers for each value in the list.
I modified
t = ['1898-10-12 06:00','1898-10-12 12:00','1932-09-30 08:00','1932-09-30 00:00']
with
t = [datetime.datetime(1898, 10, 12, 6, 0), datetime.datetime(1898, 10, 12, 12, 0), datetime.datetime(1932, 9, 30, 8, 0), datetime.datetime(1932, 9, 30, 0, 0)]
As mentioned in the comments, it is a list of datetime.datetime objects.
I am computing the total milliseconds from 1970-01-01 00:00:00 to the given date to generate a number, so dates before that epoch give negative, but still distinct, values.
import datetime

t = [datetime.datetime(1898, 10, 12, 6, 0), datetime.datetime(1898, 10, 12, 12, 0), datetime.datetime(1932, 9, 30, 8, 0), datetime.datetime(1932, 9, 30, 0, 0)]
thurser = []
x = []
for i in range(len(t)):
    thurser.append(t[i].toordinal())
    x.append((t[i] - datetime.datetime.utcfromtimestamp(0)).total_seconds() * 1000.0)
print(thurser)
print(x)
output:
[693150, 693150, 705556, 705556]
[-2247501600000.0, -2247480000000.0, -1175616000000.0, -1175644800000.0]
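To go back from these millisecond values to the original datetimes, the same epoch can be added back (a quick round-trip sketch):
back = [datetime.datetime.utcfromtimestamp(0) + datetime.timedelta(milliseconds=ms) for ms in x]
assert back == t  # the round trip recovers the original datetimes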
Related
I have code that processes data for worked shifts.
In it, I have arrays for the start and end of shifts (e.g. shift_start[0] and shift_end[0] for shift #1), and for the time between them I need to know how many weekdays, holidays, or weekend days it spans.
The holidays are already defined in an array of datetime entries representing the holidays of a specific country (it is not the same country as here, and I am not looking for more dynamic options yet).
So basically I have it like this:
started = [datetime.datetime(2022, 2, 1, 0, 0), datetime.datetime(2022, 2, 5, 8, 0), datetime.datetime(2022, 2, 23, 11, 19, 28)]
ended = [datetime.datetime(2022, 2, 2, 16, 0), datetime.datetime(2022, 2, 5, 17, 19, 28), datetime.datetime(2022, 4, 26, 12, 30)]
holidays = [datetime.datetime(2022, 1, 3), datetime.datetime(2022, 3, 3), datetime.datetime(2022, 4, 22), datetime.datetime(2022, 4, 25)]
I'm looking for a way to go through each of the three ranges and count the number of days of each kind it contains (e.g. the first range contains two weekdays, the second one weekend day).
Based on the suggestion by @gimix, I was able to develop what I needed:
for each_start, each_end in zip(started, ended):  # for each period
    for single_date in self.daterange(each_start, each_end):  # for each day of each period
        # checking if holiday or weekend
        if (single_date.replace(hour=0, minute=0, second=0) in holidays) or (single_date.weekday() > 4):
            set_special_days_worked(1)
        # if not a holiday or weekend, it is a regular working day
        else:
            set_regular_days_worked(1)
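For reference, self.daterange and the set_*_days_worked counters are the poster's own helpers and are not shown above. A minimal generator consistent with how daterange is used there (iterating day by day from the start; a sketch, not the poster's actual code) could look like:
import datetime

def daterange(start_date, end_date):
    # yield one datetime per day, from start_date up to and including end_date's day
    for n in range(int((end_date - start_date).days) + 1):
        yield start_date + datetime.timedelta(days=n)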
I'm trying to figure out the easiest way to automate the conversion of an array of seconds into datetime. I'm very familiar with converting seconds since 1970 into datetime, but the values that I have here are the seconds elapsed in a given day. For example, 14084 is the number of seconds that have passed on 2011-11-11, and I was able to generate the datetime below.
str(dt.timedelta(seconds = 14084))
Out[245]: '3:54:44'
dt.datetime.combine(date(2011,11,11),time(3,54,44))
Out[250]: datetime.datetime(2011, 11, 11, 3, 54, 44)
Is there a faster way to do this conversion for a whole array?
numpy has support for arrays of datetimes with a timedelta type for manipulating them:
https://numpy.org/doc/stable/reference/arrays.datetime.html
e.g. you can do this:
import numpy as np

date_array = np.arange('2005-02', '2005-03', dtype='datetime64[D]')
date_array = date_array.astype('datetime64[s]')  # switch to second resolution so the seconds are not truncated
date_array += np.timedelta64(4, 's')  # add 4 seconds
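The same approach handles the question's array of within-day seconds directly (a sketch, assuming the offsets belong to 2011-11-11 as in the question):
import numpy as np

seconds = np.array([14084, 14085, 15003])
result = np.datetime64('2011-11-11') + seconds.astype('timedelta64[s]')
# array(['2011-11-11T03:54:44', '2011-11-11T03:54:45', '2011-11-11T04:10:03'],
#       dtype='datetime64[s]')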
If you have an array of seconds, you could convert it into an array of timedeltas and add that to a fixed datetime
Say you have
seconds = [14084, 14085, 15003]
You can use pandas
import pandas as pd
series = pd.to_timedelta(seconds, unit='s') + pd.to_datetime('2011-11-11')
series = series.to_series().reset_index(drop=True)
print(series)
0 2011-11-11 03:54:44
1 2011-11-11 03:54:45
2 2011-11-11 04:10:03
dtype: datetime64[ns]
Or a list comprehension
import datetime

list_comp = [datetime.datetime(2011, 11, 11) + datetime.timedelta(seconds=s)
             for s in seconds]
print(list_comp)
[datetime.datetime(2011, 11, 11, 3, 54, 44), datetime.datetime(2011, 11, 11, 3, 54, 45), datetime.datetime(2011, 11, 11, 4, 10, 3)]
As the title says, I'm trying to generate a list of datetimes corresponding to the occurrences of a specific day of the month between two dates.
So given a start date, an end date, and a day of the month, I want to see every occurrence of that day of the month:
from datetime import datetime
end_date = datetime(2012, 9, 15, 0, 0)
start_date = datetime(2012, 6, 1, 0, 0)
day_of_month = 16
dates = "magic code goes here"
dates would then hold an array as such:
dates == [
datetime(2012, 6, 16, 0, 0),
datetime(2012, 7, 16, 0, 0),
datetime(2012, 8, 16, 0, 0)
]
The issue I'm running into is the number of checks I have to perform. First I have to check whether it's the start year; if so, I have to start at the starting month, but if the day of the month falls before the start date, I have to skip that month. The same applies at the end of the period. Not to mention I have to check whether the period starts and ends in the same year. All in all, it's turning into quite a mess of nested if and for statements.
Here is my solution:
import numpy as np

dates = []
for year in np.arange(start_date.year, end_date.year + 1):
    for month in np.arange(1, 13):
        date = datetime(year, month, day_of_month, 0, 0)
        if start_date < date < end_date:
            dates.append(date)
Is there a more Pythonic way to accomplish this?
Here's a quick and dirty (but reasonably efficient) solution:
import datetime
d = start_date
days = []
while d <= end_date:  # change to < if you do not want end_date included
    if d.day == day_of_month:
        days.append(d)
    d += datetime.timedelta(1)
days
# [datetime.datetime(2012, 6, 16, 0, 0),
# datetime.datetime(2012, 7, 16, 0, 0),
# datetime.datetime(2012, 8, 16, 0, 0)]
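If scanning every single day feels wasteful, a month-stepping variant constructs the candidate day directly for each month in the range (a sketch; the try/except guards against days that don't exist in short months):
import datetime

def month_days(start_date, end_date, day_of_month):
    dates = []
    year, month = start_date.year, start_date.month
    while (year, month) <= (end_date.year, end_date.month):
        try:
            d = datetime.datetime(year, month, day_of_month)
            if start_date <= d <= end_date:
                dates.append(d)
        except ValueError:  # e.g. Feb 30
            pass
        month += 1
        if month > 12:
            month, year = 1, year + 1
    return dates

month_days(start_date, end_date, day_of_month)
# [datetime.datetime(2012, 6, 16, 0, 0),
#  datetime.datetime(2012, 7, 16, 0, 0),
#  datetime.datetime(2012, 8, 16, 0, 0)]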
Ideally, you want to use pandas for this.
This is a succinct, but not efficient, way using pandas.date_range.
from datetime import datetime
import pandas as pd
end_date = datetime(2012, 9, 15, 0, 0)
start_date = datetime(2012, 6, 1, 0, 0)
day_of_month = 16
rng = [i.to_pydatetime() for i in pd.date_range(start_date, end_date, freq='1D') if i.day == day_of_month]
# [datetime.datetime(2012, 6, 16, 0, 0),
# datetime.datetime(2012, 7, 16, 0, 0),
# datetime.datetime(2012, 8, 16, 0, 0)]
Here is a more efficient method using a generator for the date range, which does not rely on pandas:
from datetime import timedelta

def daterange(start_date, end_date):
    # note: this excludes end_date itself, matching range() semantics
    for n in range(int((end_date - start_date).days)):
        yield start_date + timedelta(n)

rng = [i for i in daterange(start_date, end_date) if i.day == day_of_month]
# [datetime.datetime(2012, 6, 16, 0, 0),
# datetime.datetime(2012, 7, 16, 0, 0),
# datetime.datetime(2012, 8, 16, 0, 0)]
I have a situation where I need to get the third-latest date, i.e.:
INPUT :
['14-04-2001', '29-12-2061', '21-10-2019',
'07-01-1973', '19-07-2014','11-03-1992','21-10-2019']
Also, the INPUT may be given as:
6
14-04-2001
29-12-2061
21-10-2019
07-01-1973
19-07-2014
11-03-1992
OUTPUT : 19-07-2014
import datetime
datelist = ['14-04-2001', '29-12-2061', '21-10-2019', '07-01-1973', '19-07-2014','11-03-1992','21-10-2019' ]
for d in datelist:
    x = datetime.datetime.strptime(d, '%d-%m-%Y')
    print(x)
How can I achieve this?
You can sort the list and take the 3rd element from it.
my_list = [datetime.datetime.strptime(d, '%d-%m-%Y') for d in datelist]
# [datetime.datetime(2001, 4, 14, 0, 0), datetime.datetime(2061, 12, 29, 0, 0), datetime.datetime(2019, 10, 21, 0, 0), datetime.datetime(1973, 1, 7, 0, 0), datetime.datetime(2014, 7, 19, 0, 0), datetime.datetime(1992, 3, 11, 0, 0), datetime.datetime(2019, 10, 21, 0, 0)]
my_list.sort(reverse=True)
my_list[2]
# datetime.datetime(2019, 10, 21, 0, 0)
Also, as per Kerorin's suggestion, if you don't need to sort in-place and just need the 3rd element always, you can simply do
sorted(my_list, reverse=True)[2]
Update
To remove the duplicates, taking inspiration from this answer, you can do the following:
import datetime
datelist = ['14-04-2001', '29-12-2061', '21-10-2019', '07-01-1973', '19-07-2014', '11-03-1992', '21-10-2019']
seen = set()
my_list = [datetime.datetime.strptime(d, '%d-%m-%Y')
           for d in datelist
           if d not in seen and not seen.add(d)]
my_list.sort(reverse=True)
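Since the parsed datetime objects hash and compare consistently, a set comprehension can do the deduplication in one step (a compact alternative sketch):
unique_sorted = sorted({datetime.datetime.strptime(d, '%d-%m-%Y') for d in datelist}, reverse=True)
unique_sorted[2]
# datetime.datetime(2014, 7, 19, 0, 0)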
You can use heapq.nlargest to do this.
import heapq
from datetime import datetime
datelist = [
'14-04-2001',
'29-12-2061',
'21-10-2019',
'07-01-1973',
'19-07-2014',
'11-03-1992',
'21-10-2019'
]
heapq.nlargest(3, {datetime.strptime(d, "%d-%m-%Y") for d in datelist})[-1]
This returns datetime.datetime(2014, 7, 19, 0, 0).
Technically, I want to convert log data into time-series frequencies in Spark. I've searched a lot but didn't find a good way to deal with big data.
I know a pandas DataFrame can count occurrences of a feature, but my dataset is too big for a DataFrame, which means I need to process each line with MapReduce.
What I've tried is probably naive...
I have an RDD whose lines are lists of tuples, which look like:
[(datetime.datetime(2015, 9, 1, 0, 4, 12), 1), (datetime.datetime(2015, 9, 2, 0, 4, 12), 1), (datetime.datetime(2015, 4, 1, 0, 4, 12), 1), (datetime.datetime(2015, 9, 1, 0, 4, 12), 1)]
[(datetime.datetime(2015, 10, 1, 0, 4, 12), 1), (datetime.datetime(2015, 7, 1, 0, 4, 12), 1)]
In each tuple, the first element is a date. Can I write a map function in Spark with Python that fills a 3-d array with the counts of tuples sharing the same (month, day, hour), using (month, day, hour) as the (x, y, z) coordinates?
Here is what I've done:
def write_array(input_rdd, array):
    for item in input_rdd:
        requestTime = item[0]
        array[requestTime.month - 1, requestTime.day - 1, requestTime.hour] += 1

array_to_fill = np.zeros([12, 31, 24], dtype=int)
filled_array = RDD_to_fill.map(lambda s: write_array(s, array_to_fill)).collect()
with open("output.txt", 'w') as output:
    json.dump(traffic, output)
And the error is:
Traceback (most recent call last):
File "traffic_count.py", line 67, in <module>
main()
File "traffic_count.py", line 58, in main
traffic = organic_userList.Map(lambda s: write_array(s, traffic_array)) \
AttributeError: 'PipelinedRDD' object has no attribute 'Map'
I thought there must be some way to save the elements of each line of the RDD into an existing data structure... Can someone help me?
Many thanks!
If you can have the output data be a list of ((month, day, hour), count) values, the following should work:
from pyspark import SparkConf, SparkContext
import datetime
conf = SparkConf().setMaster("local[*]").setAppName("WriteDates")
sc = SparkContext(conf = conf)
RDD_to_fill = sc.parallelize([(datetime.datetime(2015, 9, 1, 0, 4, 12), 1),(datetime.datetime(2015, 9, 2, 0, 4, 12), 1),(datetime.datetime(2015, 4, 1, 0, 4, 12), 1),(datetime.datetime(2015, 9, 1, 0, 4, 12),1), (datetime.datetime(2015, 10, 1, 0, 4, 12), 1), (datetime.datetime(2015, 7, 1, 0, 4, 12), 1)])
def map_date(tup):
    return ((tup[0].month, tup[0].day, tup[0].hour), tup[1])
date_rdd = RDD_to_fill.map(map_date).reduceByKey(lambda x, y: x + y)
# create a tuple for every (month, day, hour) and set the value to 0
zeros = []
for month in range(1, 13):
    for day in range(1, 32):
        for hour in range(24):
            zeros.append(((month, day, hour), 0))
zeros_rdd = sc.parallelize(zeros)
# union the date_rdd (dates with non-zero counts) with the zeros_rdd (all dates with zero counts)
# and then aggregate them together (via addition) by key (i.e., the date tuple)
filled_tups = date_rdd.union(zeros_rdd).reduceByKey(lambda x, y: x + y).collect()
Then, if you want to access the count for any (month, day, hour) period, you can easily do the following:
filled_dict = dict(filled_tups)
# get count for Sept 1 at 00:00
print(filled_dict[(9,1,0)]) # prints 2
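And if you still want the 3-d array from the original question, the collected tuples can be poured into one (a sketch, reusing the (month-1, day-1, hour) indexing from the question):
import numpy as np

counts = np.zeros([12, 31, 24], dtype=int)
for (month, day, hour), count in filled_tups:
    counts[month - 1, day - 1, hour] = count
print(counts[8, 0, 0])  # Sept 1 at 00:00 -> 2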
Note this code doesn't properly account for non-existent days such as Feb 30, Feb 31, April 31, June 31, and so on; see the sketch below for one way to build only valid dates.
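One way to build only the valid dates is the standard library's calendar.monthrange (a sketch; it needs a concrete year, here assumed to be 2015 as in the sample data, which also settles leap years):
import calendar

zeros = []
year = 2015  # assumed from the sample data; needed for leap-year handling
for month in range(1, 13):
    days_in_month = calendar.monthrange(year, month)[1]
    for day in range(1, days_in_month + 1):
        for hour in range(24):
            zeros.append(((month, day, hour), 0))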