Python Dates Hashtable

I am trying to make a hash table to speed up the process of finding the difference between a particular date and a holiday date (I have a list of 10 holiday dates).
holidays =['2014-01-01', '2014-01-20', '2014-02-17', '2014-05-26',
'2014-07-04', '2014-09-01', '2014-10-13', '2013-11-11',
'2013-11-28', '2013-12-25']
from datetime import datetime
holidaydate = []
for i in range(10):
    holidaydate.append(datetime.strptime(holidays[i], '%Y-%m-%d'))
newdate = pd.to_datetime(df.YEAR*10000 + df.MONTH*100 + df.DAY_OF_MONTH, format='%Y%m%d')
# newdate contains all 0.5 million dates!
Now I want to use a hash table to calculate the difference between each of the 0.5 million dates in "newdate" and the closest holiday. I do NOT want to do the same calculation millions of times; that's why I want to use a hash table for this.
I tried searching for a solution on Google but only found things such as:
keys = ['a', 'b', 'c']
values = [1, 2, 3]
hash = {k:v for k, v in zip(keys, values)}
And this does not work in my case.
Thanks for your help!

You need to create the table first, like this:
import datetime
holidays =['2014-01-01', '2014-01-20', '2014-02-17', '2014-05-26',
'2014-07-04', '2014-09-01', '2014-10-13', '2013-11-11',
'2013-11-28', '2013-12-25']
hdates = []

def return_date(txt):
    _t = txt.split("-")
    return datetime.date(int(_t[0]), int(_t[1]), int(_t[2]))

def find_closest(d):
    _d = min(hdates, key=lambda x: abs(x - d))
    _diff = abs(_d - d).days
    return _d, _diff

# Convert holidays to datetime.date
for h in holidays:
    hdates.append(return_date(h))
# Build the "hash" table
hash_table = {}
i_date = datetime.date(2013, 1, 1)
while i_date < datetime.date(2016, 1, 1):
    cd, cdiff = find_closest(i_date)
    hash_table[i_date] = {"date": cd, "difference": cdiff}
    i_date = i_date + datetime.timedelta(days=1)
print(hash_table[datetime.date(2014, 10, 15)])
This works on datetime.date objects instead of raw strings, and it includes a helper function (return_date) to convert a "yyyy-mm-dd" string to a datetime.date.
This creates a hash table for all dates between 1/1/2013 and 31/12/2015 and then tests it with just one date. You would then loop over your 0.5 million dates and look each one up in this dictionary (the key is a datetime.date object, but you can of course convert it back to a string if you so desire).
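For that lookup step, here is a minimal sketch (assuming newdate is the pandas Series of Timestamps built in the question, and that all of its dates fall inside the table's 2013-2015 range):
# Look each date up in the precomputed table instead of recomputing the distance
nearest_holiday = newdate.dt.date.map(lambda d: hash_table[d]["date"])
days_to_holiday = newdate.dt.date.map(lambda d: hash_table[d]["difference"])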
Anyway, this should give you the idea how to do this.

Related

how to transform for loop to lambda function

I have written this function:
def time_to_unix(df, dateToday):
    '''this function creates the timestamp column for the dataframe. it also gets today's date (ex: 2022-8-8 0:0:0)
    and then it adds the seconds that were originally in the timestamp column.
    input: dataframe, dateToday (type: pandas.core.series.Series)
    output: list of times
    '''
    dateTime = dateToday[0]
    times = []
    for i in range(0, len(df['timestamp'])):
        dateAndTime = dateTime + timedelta(seconds=float(df['timestamp'][i]))
        unix = pd.to_datetime([dateAndTime]).astype(int) / 10**9
        times.append(unix[0])
    return times
So it takes a dataframe, gets today's date, takes the value of the timestamp column in the dataframe (which is in seconds, like 10, 20, ...), applies the function, and returns the times in Unix time.
However, because I have approximately 2 million rows in my dataframe, it takes a long time to run this code.
How can I use a lambda function or something else to speed up my code?
Something along the lines of:
df['unix'] = df.apply(lambda row: something_in_here, axis=1)
What I think you'll find is that most of the time is spent in the creation and manipulation of the datetime / timestamp objects in the dataframe (see here for more info). I also try to avoid using lambdas like this on large dataframes as they go row by row which should be avoided. What I've done when dealing with datetimes / timestamps / timezone changes in the past is to build a dictionary of the possible datetime combinations and then use map to apply them. Something like this:
import datetime as dt
import pandas as pd

# Make a time key column out of your date and timestamp fields
df['time_key'] = df['date'].astype(str) + '#' + df['timestamp'].astype(str)

# Build a dictionary from the unique time keys in the dataframe
time_dict = dict()
for time_key in df['time_key'].unique():
    time_split = time_key.split('#')
    # Create the Unix time stamp based on the values in the key; store it in the dictionary so it can be mapped later
    time_dict[time_key] = (pd.to_datetime(time_split[0]) + dt.timedelta(seconds=float(time_split[1]))).value / 10**9

# Now map the time_key to the unix column in the dataframe from the dictionary
df['unix'] = df['time_key'].map(time_dict)
Note if all the datetime combinations are unique in the dataframe, this likely won't help.
I'm not exactly sure what type dateTime[0] has. But you could try a more vectorized approach:
import pandas as pd
df["unix"] = (
(pd.Timestamp(dateTime[0]) + pd.to_timedelta(df["timestamp"], unit="seconds"))
.astype("int").div(10**9)
)
or
df["unix"] = (
(dateTime[0] + pd.to_timedelta(df["timestamp"], unit="seconds"))
.astype("int").div(10**9)
)
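As a rough usage sketch of that vectorized approach (the tiny frame and the dateToday values are made up for illustration, mirroring the question's variables):
import pandas as pd

# Hypothetical inputs: second offsets plus a reference date, as in the question
df = pd.DataFrame({"timestamp": [10.0, 20.0, 35.5]})
dateToday = pd.Series([pd.Timestamp("2022-08-08")])

df["unix"] = (
    (pd.Timestamp(dateToday[0]) + pd.to_timedelta(df["timestamp"], unit="seconds"))
    .astype("int64")
    .div(10**9)
)
print(df["unix"])  # Unix seconds for each row, computed without a Python-level loop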

Python: Search for DateTime inside a Dictionary that is Less than a certain DateTime

If I have the following data:
>data1 = ({'StartDT':'2017-01-01 04:54:00'},{'EndDT':'2017-01-01 08:56:00'},{'Code':'1234'})
>data2 = ({'StartDT':'2017-01-01 05:54:00'},{'EndDT':'2017-01-01 07:45:00'},{'Code':'1234'})
Question 1 = In Python, what do you think is the best data structure for this?
Question 2 = My goal is to search for data(n) which has a StartDT that is less than a certain DateTime (example: '2017-01-01 06:30:00'), and whose EndDT is greater than that certain DateTime.
Thanks for the help!
Here's another approach to the problem.
# First, let's put data in a more useful format...
data = [{'StartDT': '2017-01-01 04:54:00',
         'EndDT': '2017-01-01 08:56:00',
         'Code': '1234'},
        {'StartDT': '2017-01-01 05:54:00',
         'EndDT': '2017-01-01 07:45:00',
         'Code': '1234'}]
# Convert the date string to datetime (you should really do this as you insert
# the dictionaries into the list)...
from datetime import datetime

def convert_timestamp(s):
    return datetime.strptime(s, '%Y-%m-%d %H:%M:%S')

for d in data:
    d['StartDT'] = convert_timestamp(d['StartDT'])
    d['EndDT'] = convert_timestamp(d['EndDT'])
# Next you can use filter and a function to help pick off the entries that
# meet your needs.
start_time = convert_timestamp('2017-01-01 05:30:00')
end_time = convert_timestamp('2017-01-01 08:30:00')
matching = filter(lambda d: d['StartDT'] < start_time and d['EndDT'] > end_time,
                  data)
print(repr(list(matching)))

# This could be rewritten as...
def is_out_of_range(start, end, d):
    return d['StartDT'] < start and d['EndDT'] > end

# We use partial() to add the start and end parameters, which leaves a function
# with one parameter left. That will be the data passed in by filter.
from functools import partial
matching = filter(partial(is_out_of_range, start_time, end_time), data)

# Alternatively, you could avoid the partial with a lambda:
matching = filter(lambda d: is_out_of_range(start_time, end_time, d), data)
print(repr(list(matching)))
You definitely want to use a different organization for your data (a list of dictionaries), and you'll need to convert your strings to something more useful (datetime instances). That allows you to iterate through the data much better and do the matching you want.
Q1
I think the best structure here is to use a class. For example:
class TimePeriod:
    def __init__(self, start_dt, end_dt, code):
        self.start_dt = start_dt
        self.end_dt = end_dt
        self.code = code
Of course, you can use a plain dict instead of your custom class, but having custom objects here just gives you more clarity.
Q2
Any search in an unsorted array takes O(n) time, so your function might look like this:
start_dt_val = datetime....
end_dt_val = datetime....

def predicate_less_than(element):
    # element is TimePeriod class
    return element.start_dt < start_dt_val and \
           element.end_dt > end_dt_val

def find_element_with_predicate(arr, predicate):
    for el in arr:
        if predicate(el):
            return el

t1 = TimePeriod(....)
...
tn = ...
arr = [t1, ..., tn]
el = find_element_with_predicate(arr, predicate_less_than)
el will be None if no matching element is found.
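To make Q1 and Q2 concrete, here is a small sketch reusing the TimePeriod class and find_element_with_predicate from above; the parse helper and variable names are just for illustration:
from datetime import datetime

def parse(s):
    return datetime.strptime(s, '%Y-%m-%d %H:%M:%S')

periods = [
    TimePeriod(parse('2017-01-01 04:54:00'), parse('2017-01-01 08:56:00'), '1234'),
    TimePeriod(parse('2017-01-01 05:54:00'), parse('2017-01-01 07:45:00'), '1234'),
]
certain_dt = parse('2017-01-01 06:30:00')

# Find the first period whose start is before and whose end is after the given time
match = find_element_with_predicate(
    periods,
    lambda p: p.start_dt < certain_dt and p.end_dt > certain_dt)
print(match.code if match else None)  # both periods contain 06:30; the first one is returned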

Sorting by month-year groups by month instead

I have a curious python problem.
The script takes two CSV files: one with a column of dates and a column of text snippets, and the other with a bunch of names (substrings).
All the code does is step through both lists, building up a name-mentioned-per-month matrix.
File with dates and text (Date, Snippet columns):
ENTRY 1: Sun 21 nov 2014 etc, The release of the iphone 7 was...
Strings file:
iphone 7
apple
apples
innovation etc.
The problem is that when I try to order the columns in ascending order, e.g. Oct-2014, Nov-2014, Dec-2014 and so on, it just groups the months together instead, which isn't what I want.
import csv
from datetime import datetime

file_1 = input('Enter first CSV name (one with the date and snippet): ')
file_2 = input('Enter second CSV name (one with the strings): ')
outp = input('Enter the output CSV name: ')

file_1_list = []
head = True
for row in csv.reader(open(file_1, encoding='utf-8', errors='ignore')):
    if head:
        head = False
        continue
    date = datetime.strptime(row[0].strip(), '%a %b %d %H:%M:%S %Z %Y')
    date_str = date.strftime('%b %Y')
    file_1_list.append([date_str, row[1].strip()])

file_2_dict = {}
for line in csv.reader(open(file_2, encoding='utf-8', errors='ignore')):
    s = line[0].strip()
    for d in file_1_list:
        if s.lower() in d[1].lower():
            if s in file_2_dict.keys():
                if d[0] in file_2_dict[s].keys():
                    file_2_dict[s][d[0]] += 1
                else:
                    file_2_dict[s][d[0]] = 1
            else:
                file_2_dict[s] = {
                    d[0]: 1
                }

months = []
for v in file_2_dict.values():
    for k in v.keys():
        if k not in months:
            months.append(k)
months.sort()

rows = [[''] + months]
for k in file_2_dict.keys():
    tmp = [k]
    for m in months:
        try:
            tmp.append(file_2_dict[k][m])
        except:
            tmp.append(0)
    rows.append(tmp)

print("still working on it be patient")
writer = csv.writer(open(outp, "w", encoding='utf-8', newline=''))
for r in rows:
    writer.writerow(r)
print('Done...')
print('Done...')
From my understanding, months.sort() isn't doing what I expect it to?
I have looked here, where they apply another function to sort the data, using attrgetter:
from operator import attrgetter
>>> l = [date(2014, 4, 11), date(2014, 4, 2), date(2014, 4, 3), date(2014, 4, 8)]
and then
sorted(l, key=attrgetter('month'))
But I am not sure whether that would work for me?
From my understanding I parse the dates on lines 12-13; am I missing a step to order the data first, like
data = sorted(data, key = lambda row: datetime.strptime(row[0], "%b-%y"))
I have only just started learning Python and so many things are new to me; I don't know what is right and what isn't.
What I want (of course, with the correctly sorted data):
This took a while because you had so much unrelated stuff about reading csv files and finding and counting tags. But you already have all that, and it should have been completely excluded from the question to avoid confusing people.
It looks like your actual question is "How do I sort dates?"
Of course "Apr-16" comes before "Oct-14", didn't they teach you the alphabet in school? A is the first letter! I'm just being silly to emphasize a point -- it's because they are simple strings, not dates.
You need to convert the string to a date with the datetime class method strptime, as you already noticed. Because the class has the same name as the module, you need to pay attention to how it is imported. You then go back to a string later with the member method strftime on the actual datetime (or date) instance.
Here's an example:
from datetime import datetime
unsorted_strings = ['Oct-14', 'Dec-15', 'Apr-16']
unsorted_dates = [datetime.strptime(value, '%b-%y') for value in unsorted_strings]
sorted_dates = sorted(unsorted_dates)
sorted_strings = [value.strftime('%b-%y') for value in sorted_dates]
print(sorted_strings)
['Oct-14', 'Dec-15', 'Apr-16']
or skipping to the end
from datetime import datetime
unsorted_strings = ['Oct-14', 'Dec-15', 'Apr-16']
print (sorted(unsorted_strings, key = lambda x: datetime.strptime(x, '%b-%y')))
['Oct-14', 'Dec-15', 'Apr-16']
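Applied to the script above, the same idea works directly on the months list; note that the script builds those strings with strftime('%b %Y'), so the sort key has to use that format rather than '%b-%y' (a small sketch under that assumption):
from datetime import datetime

# Sort the month-year column headers chronologically instead of alphabetically
months.sort(key=lambda m: datetime.strptime(m, '%b %Y'))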

Python SumIfs for list of list dates

I have a list of lists composed of dates in Excel float format (every minute since July 5, 1996) and an integer value associated with each date, like this: [[datetime, integer], ...]. I need to create a new list composed of all of the dates (no hours or minutes) and the sum of the values for all of the datetimes within each date. In other words, for a given whole-day value d, what is the sum of the values over all rows where listolists[x][0] >= d and listolists[x][0] < d + 1. Thanks
Since you didn't provide any actual data (just the data structure you used, nested lists), I created some dummy data below to demonstrate how you might do a SUMIFS-type of problem in Python.
from datetime import datetime
import numpy as np
import pandas as pd
dates_list = []
# just take one month as an example of how to group by day
year = 2015
month = 12
# generate similar data to what you might have
for day in range(1, 32):
    for hour in range(1, 24):
        for minute in range(1, 60):
            dates_list.append([datetime(year, month, day, hour, minute), np.random.randint(20)])
# unpack these nested list pairs so we have all of the dates in
# one list, and all of the values in the other
# this makes it easier for pandas later
dates, values = zip(*dates_list)
# to eventually group by day, we need to forget about all intra-day data, e.g.
# different hours and minutes. we only care about the data for a given day,
# not the by-minute observations. So, let's set all of the intra-day values to
# some constant for easier rolling-up of these dates.
new_dates = []
for d in dates:
    new_d = d.replace(hour=0, minute=0)
    new_dates.append(new_d)
# throw the new dates and values into a pandas.DataFrame object
df = pd.DataFrame({'new_dates': new_dates, 'values': values})
# here's the SUMIFS function you're looking for
grouped = df.groupby('new_dates')['values'].sum()
Let's see the results:
>>> print(grouped.head())
new_dates
2015-12-01 12762
2015-12-02 13292
2015-12-03 12857
2015-12-04 12762
2015-12-05 12561
Name: values, dtype: int64
Edit: If you want these new grouped data back in the nested list format, just do this:
new_list = [[date, value] for date, value in zip(grouped.index, grouped)]
Thanks everyone. This is the simplest code I could come up with that doesn't require pandas:
import math

for row in listolist:
    for k in (0, 1):
        row[k] = math.floor(float(row[k]))

date = {}
for d, v in listolist:
    if d in date:
        date[math.floor(d)].append(v)
    else:
        date[math.floor(d)] = [v]

result = [(d, sum(v)) for d, v in date.items()]
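As a quick hypothetical check (the values below are made-up Excel-style serial dates): with listolist = [[42000.25, 3], [42000.75, 4], [42001.10, 5]], the code above floors the serial dates to whole days and produces result == [(42000, 7), (42001, 5)], i.e. the two entries falling on day 42000 are summed and the single entry on day 42001 stands alone.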

aggregating datetime objects in python to the hour

I have a list of datetime objects in Python and want to aggregate them by the hour. For example, if I have datetime objects for
[03/01/2012 00:12:12,
03/01/2012 00:55:12,
03/01/2012 01:12:12,
...]
I want to have a list of datetime objects for every hour along with a count of the number of datetime objects I have that fall into that bucket. For my example above I would want output of
[03/01/2012 00:00:00, 03/01/2012 01:00:00] in one list and a count of the entries in another list: [2,1].
You could store that kind of data efficiently with a dictionary where the keys are hours and the values are lists of the datetime objects. e.g. (untested):
import datetime

l = [datetime.datetime.now(), datetime.datetime.now()]  # ...etc.
hour_quantization = {}
for dt in l:
    if dt.hour not in hour_quantization:
        hour_quantization[dt.hour] = [dt]
    else:
        hour_quantization[dt.hour].append(dt)
counts = [len(hour_quantization[hour]) for hour in hour_quantization.keys()]
see the doc entry on datetime
Assuming you have a list of datetime objects, you can count how many of each hour there are:
from collections import Counter
hours = [t.hour for t in ts]
Counter(hours)
This will give you:
Counter({0: 2, 1: 1})
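Both snippets above key on the hour of day only. If you need buckets keyed by the full date and hour, as in the example output, one possible sketch (interpreting 03/01/2012 as March 1, 2012; the variable names are illustrative) is to truncate each datetime to the hour before counting:
from collections import Counter
from datetime import datetime

ts = [datetime(2012, 3, 1, 0, 12, 12),
      datetime(2012, 3, 1, 0, 55, 12),
      datetime(2012, 3, 1, 1, 12, 12)]

# Zero out minutes/seconds so each datetime falls into its hourly bucket
buckets = Counter(t.replace(minute=0, second=0, microsecond=0) for t in ts)

hours = sorted(buckets)               # [datetime(2012, 3, 1, 0, 0), datetime(2012, 3, 1, 1, 0)]
counts = [buckets[h] for h in hours]  # [2, 1]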
