Hi, I have to preprocess a column that holds comma-separated values. I can't simply apply .split(',\s*') because some commas and spaces (for example inside "(Mon, Tue, Wed)") must not act as separators, so I'm looking for a regex pattern.
column:
0 12noon to 3:30pm, 6:30pm to 11:30pm (Mon-Sun)
1 11 AM to 11 PM
2 11:30 AM to 4:30 PM, 6:30 PM to 11 PM
3 12 Noon to 2 AM
4 12noon to 11pm (Mon, Tue, Wed, Thu, Sun), 12no...
...
100 11 AM to 11 PM
101 10 AM to 10 PM (Mon-Thu), 8 AM to 10:30 PM (Fr...
102 12 Noon to 11 PM
103 8am to 12:30AM (Mon-Sun)
104 11:30 AM to 3 PM, 7 PM to 12 Midnight
What I've tried is:
import re
pattern = '([\w+\:*\s*\w*(w{2})*]*\s*to\s*[\w+\:*\s*\w*(w{2})*]*\s*[\([a-zA-Z]*\-*\,*\s*
[a-zA-Z]*\s*\)]*)'
timing = data['timings'].str.lower().str.split(pattern).dropna().to_numpy()
output:
array([list(['12noon to 3:30pm,', ' 6:30pm to 11:30pm (mon-sun)', '']),
list(['11 am to 11 pm']),
list(['11:30 am to 4:30 pm, 6:30 pm to 11 pm']),
list(['12 noon to 2 am']),
list(['12noon to 11pm (mon, tue, wed, thu, sun),', ' 12noon to 12midnight (fri-sat)', '']),
list(['12noon to 3:30pm, 4pm to 6:30pm, 7pm to 11:30pm (mon, tue, wed, thu, sun), 12noon to 3:30pm, 4pm to 6:30pm,', ' 7pm to 12midnight (fri-sat)', '']),
list(['7 am to 10 pm']), list(['12 noon to 12 midnight']),
list(['12 noon to 12 midnight']),
list(['', '10 am to 1 am (mon-thu)', ',', ' 10 am to 1:30 am (fri-sun)', '']),
list(['12 noon to 3:30 pm, 7 pm to 10:30 pm']),
list(['12 noon to 3:30 pm, 6:30 pm to 11:30 pm']),
list(['11:30 am to 1 am']),
list(['', '12noon to 12midnight (mon-sun)', '']),
list(['12 noon to 4:30 pm, 6:30 pm to 11:30 pm']),
list(['11 am to 11 pm']), list(['12 noon to 10:30 pm']),
list(['11:30 am to 1 am']), list(['12 noon to 12 midnight']),
list(['12 noon to 11 pm']),
list(['', '12:30 pm to 10 pm (tue-sun)', ', mon closed']),
list(['11:30 am to 3 pm, 7 pm to 11 pm']),
list(['11am to 11:30pm (mon, tue, wed, thu, sun),', ' 11am to 12midnight (fri-sat)', '']),
list(['10 am to 5 am']),
list(['12 noon to 12 midnight (mon-thu, sun),', ' 12 noon to 1 am (fri-sat)', '']),
list(['', '12noon to 11pm (mon-thu)', ',', '12noon to 11:30pm (fri-sun)', '']),
list(['', '12 noon to 11:30 pm (mon-wed)', ',', ' 12 noon to 1 am (fri-sat)', ',', ' 12 noon to 12 midnight (sun)', ', thu closed']),
list(['12 noon to 4 pm, 6:30 pm to 11:30 pm']),
list(['10 am to 1 am']), list(['4:30 pm to 5:30 am']),
list(['11 am to 12 midnight']),
list(['12noon to 4pm,', ' 7pm to 12midnight (mon-sun)', '']),
list(['11 am to 12 midnight']),
list(['', '6am to 12midnight (mon-sun)', '']),
list(['12 noon to 11 pm']),
list(['12:30 pm to 3:30 pm, 7 pm to 10:40 pm']),
list(['12 noon to 4 pm, 7 pm to 11 pm']),
list(['12noon to 11pm (mon, tue, wed, thu, sun),', ' 12noon to 12midnight (fri-sat)', '']),
list(['12 noon to 10:30 pm']),
list(['', '12noon to 11pm (mon-sun)', '']),
list(['10 am to 10 pm']), list(['10 am to 10 pm']),
list(['7 am to 1 am']), list(['12 noon to 11:30 pm']),
list(['', '12noon to 11:30pm (mon-sun)', '']),
list(['12 noon to 11:30 pm']), list(['12 noon to 11 pm']),
list(['6 am to 10:30 pm']),
list(['11:30 am to 3:30 pm, 6:45 pm to 11:30 pm']),
list(['11:55 am to 4 pm, 7 pm to 11:15 pm']),
list(['12 noon to 11 pm']), list(['11 am to 11 pm']),
list(['12noon to 4:30pm, 6:30pm to 11:30pm (mon, tue, wed, fri, sat), closed (thu),', '12noon to 12midnight (sun)', '']),
list(['12noon to 12midnight (mon, tue, wed, thu, sun),', ' 12noon to 1am (fri-sat)', '']),
list(['8 am to 11:30 pm']),
list(['6:30am to 10:30am, 12:30pm to 3pm,', ' 7pm to 11pm (mon)', ',6:30am to 10:30am, 12:30pm to 3pm,', ' 7:30pm to 11pm (tue-sat)', ',6:30am to 10:30am, 12:30pm to 3:30pm,', ' 7pm to 11pm (sun)', '']),
list(['12 noon to 3 pm, 7 pm to 11:30 pm']),
list(['11:30 am to 1 am']), list(['9 am to 10 pm']),
list(['12 noon to 12 midnight (mon-thu, sun),', ' 12 noon to 1 am (fri-sat)', '']),
list(['', '5pm to 12midnight (mon-sun)', '']),
list(['11 am to 11:30 pm']),
list(['', '11:30am to 11pm (mon-sun)', '']),
list(['12 noon to 10:30 pm']), list(['1 pm to 11 pm']),
list(['11:30 am to 12 midnight']),
list(['12 noon to 12 midnight']),
list(['', '12noon to 12midnight (mon-sun)', '']),
list(['', '12noon to 11pm (mon-sun)', '']),
list(['12 noon to 3 pm, 7 pm to 11 pm']),
list(['12 noon to 3 pm, 7 pm to 11 pm']),
list(['', '11 am to 8 pm (mon-sat)', ', sun closed']),
list(['4 am to 12 midnight']), list(['9 am to 1 am']),
list(['10:30 am to 11 pm']), list(['7 am to 11 pm']),
list(['7 am to 10:30 am, 12:30 pm to 3:30 pm, 7 pm to 11 pm']),
list(['12 noon to 3:30 pm, 7 pm to 11:30 pm']),
list(['12 noon to 3:30 pm, 7 pm to 11 pm']),
list(['12noon to 12midnight (mon, tue, wed, thu, sun),', ' 12noon to 1am (fri-sat)', '']),
list(['', '11am to 11pm (mon-sun)', '']),
list(['6 am to 11:30 pm']), list(['11:30 am to 5 am']),
list(['12:30 pm to 3:30 pm, 7 pm to 11 pm']),
list(['', '6pm to 2am (mon-sun)', '']),......)
But what I want is something like this:
[['6pm to 2am (mon-sun)'], ['12 noon to 12 midnight (mon-thu, sun)'], ...]
I think I need a better regex pattern to separate these values. Can anyone suggest one? Thanks in advance :)
Here's my attempt:
import re, pandas
data = pandas.read_excel('C:\\Users\\Administrator\\Desktop\\test.xls')
pattern = r'(\d{1,2}(?:\:\d{1,2})? ?(?:\w{2,8}) to \d{1,2}(?:\:\d{1,2})? ?(?:\w{2,8}) ?(?:\(\w{3}(?:[ ,-]{1,3}\w{3}){0,6}\))?)'
re.findall(pattern, data["myData"].str.cat(sep=", "))
With the call to re.findall() my output was:
['12noon to 3:30pm', '6:30pm to 11:30pm (Mon-Sun)', '11 AM to 11 PM', '11:30 AM to 4:30 PM', '6:30 PM to 11 PM', '12 Noon to 2 AM', '11 AM to 11 PM', '10 AM to 10 PM (Mon-Thu)', '8 AM to 10:30 PM (Fri,Sat)', '12 Noon to 11 PM', '8am to 12:30AM (Mon-Sun)', '11:30 AM to 3 PM', '7 PM to 12 Midnight']
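To get one list per row, as asked in the question, the same idea can be applied row by row instead of to the concatenated column. What follows is only a minimal sketch, not the answerer's code: split_timings and the sample rows are made up for illustration, and the pattern is a slightly adjusted, non-capturing variant of the one above. It assumes every entry has the "<time> to <time>" shape, so fragments such as "mon closed" are silently dropped.

```python
import re

# Variant of the pattern above: raw string, no capture groups,
# optional "(Mon-Sun)"-style suffix kept with the range it follows.
TIMING = re.compile(
    r'\d{1,2}(?::\d{1,2})? ?\w{2,8} to \d{1,2}(?::\d{1,2})? ?\w{2,8}'
    r'(?: ?\(\w{3}(?:[ ,-]{1,3}\w{3}){0,6}\))?'
)


def split_timings(rows):
    """Return one list of time ranges per input string (hypothetical helper)."""
    return [TIMING.findall(row) for row in rows]


# Sample rows copied from the column shown in the question.
rows = [
    "12noon to 3:30pm, 6:30pm to 11:30pm (Mon-Sun)",
    "11 AM to 11 PM",
    "12noon to 11pm (Mon, Tue, Wed, Thu, Sun), 12noon to 12midnight (Fri-Sat)",
]
result = split_timings(rows)
for row in result:
    print(row)
```

With a pandas column, the same function could presumably be applied through data['timings'].apply(TIMING.findall), though that was not tried here.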
I have a list of strings and I want to use regex to filter the list to certain strings.
Ex. Here is the original list:
quoteTitle = ['\r\n ', ' ', '\r\n ', '\r\n ', '\r\n ', '\r\n ', '\r\n ', '30. Loyalty', '29. Speed Scale', '28. Security', '27. Every Position', '26. Superior Brain Power', '25. A Long Line of Fighters', '24. Dwight Surveillance', '23. Friends ', '22. Pull the Plug ', '21. Second Life', '20. Accidentally vs. On Purpose', '19. Menstruation Wishes ', '18. Ideal Choice', '17. Healthcare in the Wild', '16. Superior Cousins', '15. Regular Ideas', '14. Immunity Logic', '13. The Person You Least, Medium and Most Suspect', '12. Real Heroes ', '11. Water Cooler Gossip', '10. Stress', '9. All These People!', '8. The “R” Sound', '7. A Woman’s Defects ', '6.Werewolf Hunting Experience ', '5. An Ideal World ', '4. Attention', '3. The Thing About Bear Attacks ', '2. Resume Critiquing', '1. Yeast Infections ', 'Tags', 'Recently in TV', '5/8/2018 3:45:00 PM', '5/3/2018 2:00:00 PM', '5/3/2018 12:00:00 PM', 'Most Popular', '5/3/2018 11:00:00 AM', '5/3/2018 12:00:00 PM', '5/8/2018 3:45:00 PM', '4/13/2018 5:00:00 PM', '4/18/2018 8:22:31 PM', '5/3/2018 2:00:00 PM', '5/8/2018 6:02:54 PM', '5/7/2018 4:52:04 PM', '5/7/2018 2:57:00 PM', '5/3/2018 5:04:43 PM', '5/3/2018 1:06:18 PM', 'Music', '5/3/2018 12:00:00 AM', 'Music', '5/3/2018 12:00:00 AM', 'Music', '5/2/2018 12:00:00 AM', '11/28/2017 8:00:00 AM', '11/30/2017 11:00:00 AM', '5/3/2018 11:00:00 AM', '5/3/2018 2:00:00 PM', '5/3/2018 12:00:00 PM', '12/11/2017 1:00:00 PM', '12/15/2017 8:00:00 AM', '1/9/2018 3:00:00 PM', '1/4/2017 12:30:00 PM', '4/13/2013 12:13:00 PM', 'TV', '4/3/2018 10:00:00 AM', 'TV', '4/3/2018 9:25:00 AM', 'Comedy', '3/22/2018 1:00:28 PM', 'TV', '3/15/2018 10:00:00 AM', 'Comedy', '3/13/2018 2:00:00 PM', 'TV', '3/10/2018 10:00:00 AM', 'TV', '3/2/2018 11:00:00 AM', 'TV', '2/25/2018 10:30:00 PM', 'TV', '2/23/2018 1:00:00 PM', '5/3/2018 11:00:00 AM', '5/3/2018 12:00:00 PM', '5/8/2018 3:45:00 PM', '4/13/2018 5:00:00 PM', '4/18/2018 8:22:31 PM', '5/3/2018 2:00:00 PM', '5/3/2018 10:00:00 AM', '5/7/2018 10:00:00 AM', '4/26/2018 
2:00:00 PM', '5/6/2018 10:00:00 PM', '5/8/2018 6:02:54 PM', '5/7/2018 4:52:04 PM', '5/7/2018 2:57:00 PM', '5/3/2018 5:04:43 PM', '5/3/2018 1:06:18 PM', '5/3/2018 12:00:00 AM', '5/3/2018 12:00:00 AM', '5/2/2018 12:00:00 AM', '5/2/2018 12:00:00 AM', '5/2/2018 12:00:00 AM', '11/28/2017 8:00:00 AM', '11/30/2017 11:00:00 AM', '5/3/2018 11:00:00 AM', '5/3/2018 2:00:00 PM', '5/3/2018 12:00:00 PM', '12/11/2017 1:00:00 PM', '12/15/2017 8:00:00 AM', '1/9/2018 3:00:00 PM', '1/4/2017 12:30:00 PM', '4/13/2013 12:13:00 PM', '4/3/2018 10:00:00 AM', '4/3/2018 9:25:00 AM', '3/22/2018 1:00:28 PM', '3/15/2018 10:00:00 AM', '3/13/2018 2:00:00 PM']
I want only the numbered items and their text following from 30 to 1. I can successfully filter out anything that doesn't start with a number using
p = re.compile(r'\w')
q = filter(p.match, quoteTitle)
p = re.compile(r'^\d+')
q = filter(p.match, q)
This gets me to
print(list(q)) --> ['30. Loyalty', '29. Speed Scale', '28. Security', '27. Every Position', '26. Superior Brain Power', '25. A Long Line of Fighters', '24. Dwight Surveillance', '23. Friends ', '22. Pull the Plug ', '21. Second Life', '20. Accidentally vs. On Purpose', '19. Menstruation Wishes ', '18. Ideal Choice', '17. Healthcare in the Wild', '16. Superior Cousins', '15. Regular Ideas', '14. Immunity Logic', '13. The Person You Least, Medium and Most Suspect', '12. Real Heroes ', '11. Water Cooler Gossip', '10. Stress', '9. All These People!', '8. The “R” Sound', '7. A Woman’s Defects ', '6.Werewolf Hunting Experience ', '5. An Ideal World ', '4. Attention', '3. The Thing About Bear Attacks ', '2. Resume Critiquing', '1. Yeast Infections ', 'Tags', 'Recently in TV', '5/8/2018 3:45:00 PM', '5/3/2018 2:00:00 PM', '5/3/2018 12:00:00 PM', 'Most Popular', '5/3/2018 11:00:00 AM', '5/3/2018 12:00:00 PM', '5/8/2018 3:45:00 PM', '4/13/2018 5:00:00 PM', '4/18/2018 8:22:31 PM', '5/3/2018 2:00:00 PM', '5/8/2018 6:02:54 PM', '5/7/2018 4:52:04 PM', '5/7/2018 2:57:00 PM', '5/3/2018 5:04:43 PM', '5/3/2018 1:06:18 PM', 'Music', '5/3/2018 12:00:00 AM', 'Music', '5/3/2018 12:00:00 AM', 'Music', '5/2/2018 12:00:00 AM', '11/28/2017 8:00:00 AM', '11/30/2017 11:00:00 AM', '5/3/2018 11:00:00 AM', '5/3/2018 2:00:00 PM', '5/3/2018 12:00:00 PM', '12/11/2017 1:00:00 PM', '12/15/2017 8:00:00 AM', '1/9/2018 3:00:00 PM', '1/4/2017 12:30:00 PM', '4/13/2013 12:13:00 PM', 'TV', '4/3/2018 10:00:00 AM', 'TV', '4/3/2018 9:25:00 AM', 'Comedy', '3/22/2018 1:00:28 PM', 'TV', '3/15/2018 10:00:00 AM', 'Comedy', '3/13/2018 2:00:00 PM', 'TV', '3/10/2018 10:00:00 AM', 'TV', '3/2/2018 11:00:00 AM', 'TV', '2/25/2018 10:30:00 PM', 'TV', '2/23/2018 1:00:00 PM', '5/3/2018 11:00:00 AM', '5/3/2018 12:00:00 PM', '5/8/2018 3:45:00 PM', '4/13/2018 5:00:00 PM', '4/18/2018 8:22:31 PM', '5/3/2018 2:00:00 PM', '5/3/2018 10:00:00 AM', '5/7/2018 10:00:00 AM', '4/26/2018 2:00:00 PM', '5/6/2018 10:00:00 PM', '5/8/2018 
6:02:54 PM', '5/7/2018 4:52:04 PM', '5/7/2018 2:57:00 PM', '5/3/2018 5:04:43 PM', '5/3/2018 1:06:18 PM', '5/3/2018 12:00:00 AM', '5/3/2018 12:00:00 AM', '5/2/2018 12:00:00 AM', '5/2/2018 12:00:00 AM', '5/2/2018 12:00:00 AM', '11/28/2017 8:00:00 AM', '11/30/2017 11:00:00 AM', '5/3/2018 11:00:00 AM', '5/3/2018 2:00:00 PM', '5/3/2018 12:00:00 PM', '12/11/2017 1:00:00 PM', '12/15/2017 8:00:00 AM', '1/9/2018 3:00:00 PM', '1/4/2017 12:30:00 PM', '4/13/2013 12:13:00 PM', '4/3/2018 10:00:00 AM', '4/3/2018 9:25:00 AM', '3/22/2018 1:00:28 PM', '3/15/2018 10:00:00 AM', '3/13/2018 2:00:00 PM']
Now I want to remove the dates from the list.
I've tried many variations of the pattern below, but I think I'm missing or misunderstanding something. My idea is to keep all strings in the list that do not follow the format of the date entries.
p = re.compile(r'[^'\d+/]')
q = filter(p.match, q)
The entries start with an apostrophe because each one is a quoted string, and I think that might be my problem. Other than that, the format is:
apostrophe, number (between 1 and 12, so \d+), /
That should be enough to filter out the date entries once I get it working correctly.
Update: I even tried this, to search for elements of the list that contain AM or PM, and still no luck:
p = re.compile(r'[^(AM|PM)]')
q = filter(p.search, q)
You can search for strings that start with one or more digits followed by a period:
import re
quoteTitle = ['\r\n ', ' ', '\r\n ', '\r\n ', '\r\n ', '\r\n ', '\r\n ', '30. Loyalty', '29. Speed Scale', '28. Security', '27. Every Position', '26. Superior Brain Power', '25. A Long Line of Fighters', '24. Dwight Surveillance', '23. Friends ', '22. Pull the Plug ', '21. Second Life', '20. Accidentally vs. On Purpose', '19. Menstruation Wishes ', '18. Ideal Choice', '17. Healthcare in the Wild', '16. Superior Cousins', '15. Regular Ideas', '14. Immunity Logic', '13. The Person You Least, Medium and Most Suspect', '12. Real Heroes ', '11. Water Cooler Gossip', '10. Stress', '9. All These People!', '8. The “R” Sound', '7. A Woman’s Defects ', '6.Werewolf Hunting Experience ', '5. An Ideal World ', '4. Attention', '3. The Thing About Bear Attacks ', '2. Resume Critiquing', '1. Yeast Infections ', 'Tags', 'Recently in TV', '5/8/2018 3:45:00 PM', '5/3/2018 2:00:00 PM', '5/3/2018 12:00:00 PM', 'Most Popular', '5/3/2018 11:00:00 AM', '5/3/2018 12:00:00 PM', '5/8/2018 3:45:00 PM', '4/13/2018 5:00:00 PM', '4/18/2018 8:22:31 PM', '5/3/2018 2:00:00 PM', '5/8/2018 6:02:54 PM', '5/7/2018 4:52:04 PM', '5/7/2018 2:57:00 PM', '5/3/2018 5:04:43 PM', '5/3/2018 1:06:18 PM', 'Music', '5/3/2018 12:00:00 AM', 'Music', '5/3/2018 12:00:00 AM', 'Music', '5/2/2018 12:00:00 AM', '11/28/2017 8:00:00 AM', '11/30/2017 11:00:00 AM', '5/3/2018 11:00:00 AM', '5/3/2018 2:00:00 PM', '5/3/2018 12:00:00 PM', '12/11/2017 1:00:00 PM', '12/15/2017 8:00:00 AM', '1/9/2018 3:00:00 PM', '1/4/2017 12:30:00 PM', '4/13/2013 12:13:00 PM', 'TV', '4/3/2018 10:00:00 AM', 'TV', '4/3/2018 9:25:00 AM', 'Comedy', '3/22/2018 1:00:28 PM', 'TV', '3/15/2018 10:00:00 AM', 'Comedy', '3/13/2018 2:00:00 PM', 'TV', '3/10/2018 10:00:00 AM', 'TV', '3/2/2018 11:00:00 AM', 'TV', '2/25/2018 10:30:00 PM', 'TV', '2/23/2018 1:00:00 PM', '5/3/2018 11:00:00 AM', '5/3/2018 12:00:00 PM', '5/8/2018 3:45:00 PM', '4/13/2018 5:00:00 PM', '4/18/2018 8:22:31 PM', '5/3/2018 2:00:00 PM', '5/3/2018 10:00:00 AM', '5/7/2018 10:00:00 AM', '4/26/2018 
2:00:00 PM', '5/6/2018 10:00:00 PM', '5/8/2018 6:02:54 PM', '5/7/2018 4:52:04 PM', '5/7/2018 2:57:00 PM', '5/3/2018 5:04:43 PM', '5/3/2018 1:06:18 PM', '5/3/2018 12:00:00 AM', '5/3/2018 12:00:00 AM', '5/2/2018 12:00:00 AM', '5/2/2018 12:00:00 AM', '5/2/2018 12:00:00 AM', '11/28/2017 8:00:00 AM', '11/30/2017 11:00:00 AM', '5/3/2018 11:00:00 AM', '5/3/2018 2:00:00 PM', '5/3/2018 12:00:00 PM', '12/11/2017 1:00:00 PM', '12/15/2017 8:00:00 AM', '1/9/2018 3:00:00 PM', '1/4/2017 12:30:00 PM', '4/13/2013 12:13:00 PM', '4/3/2018 10:00:00 AM', '4/3/2018 9:25:00 AM', '3/22/2018 1:00:28 PM', '3/15/2018 10:00:00 AM', '3/13/2018 2:00:00 PM']
new_result = list(filter(lambda x: re.findall(r'^\d+\.', x), quoteTitle))
Output:
['30. Loyalty', '29. Speed Scale', '28. Security', '27. Every Position', '26. Superior Brain Power', '25. A Long Line of Fighters', '24. Dwight Surveillance', '23. Friends ', '22. Pull the Plug ', '21. Second Life', '20. Accidentally vs. On Purpose', '19. Menstruation Wishes ', '18. Ideal Choice', '17. Healthcare in the Wild', '16. Superior Cousins', '15. Regular Ideas', '14. Immunity Logic', '13. The Person You Least, Medium and Most Suspect', '12. Real Heroes ', '11. Water Cooler Gossip', '10. Stress', '9. All These People!', '8. The \xe2\x80\x9cR\xe2\x80\x9d Sound', '7. A Woman\xe2\x80\x99s Defects ', '6.Werewolf Hunting Experience ', '5. An Ideal World ', '4. Attention', '3. The Thing About Bear Attacks ', '2. Resume Critiquing', '1. Yeast Infections ']
Edit: to find all data between the quotes, you can use .*?:
quote = ['i dont want this', '\r\n ', '\r\n ', ' "this is the quote i want to extract" ', '" and also this one"', '\r\n "and me"']
new_results = list(map(lambda x:x[0], filter(None, [re.findall('"(.*?)"', i) for i in quote])))
Output:
['this is the quote i want to extract', ' and also this one', 'and me']
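The attempts in the question fail for a reason that is easy to miss: square brackets define a character class, so [^(AM|PM)] means "any single character other than (, A, M, |, P or )", which nearly every string contains. The sketch below shows that, plus two small filters built on compiled patterns. The names numbered_titles and without_dates and the short sample list are made up for illustration; the date pattern assumes the month/day/year layout seen in the question's list.

```python
import re

NUMBERED = re.compile(r'\d+\.')                      # digits, then a literal dot
DATE_LIKE = re.compile(r'\d{1,2}/\d{1,2}/\d{4}\b')   # e.g. 5/8/2018 (assumed layout)


def numbered_titles(items):
    """Keep entries such as '30. Loyalty', stripping surrounding whitespace."""
    return [s.strip() for s in items if NUMBERED.match(s)]


def without_dates(items):
    """Drop entries that begin with a month/day/year date."""
    return [s for s in items if not DATE_LIKE.match(s)]


sample = ['\r\n ', '30. Loyalty', '6.Werewolf Hunting Experience ',
          'Tags', '5/8/2018 3:45:00 PM', 'Music']

print(numbered_titles(sample))
print(without_dates(sample))

# The character class from the question matches even a date string,
# because '5' is a character outside the set "(AM|PM)".
class_hit = re.search(r'[^(AM|PM)]', '5/8/2018 3:45:00 PM') is not None
print(class_hit)
```

Pattern.match only matches at the start of the string, so no leading ^ is needed in either pattern.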
Given this base date:
base_date = "10/29 06:58 AM"
I want to find a tuple within the list that contains the closest date to the base_date, but it must not be an earlier date.
list_date = [('10/30 02:18 PM', '-103', '-107'), ('10/30 02:17 PM', '+100', '-110'),
             ('10/29 02:15 AM', '-101', '-109')]
So here the output should be ('10/30 02:17 PM', '+100', '-110'). It can't be the 3rd tuple, because that date is earlier than the base date.
My question is: is there a module for this kind of date comparison? I tried converting everything to a common AM format first and then comparing, but my code gets ugly with lots of slicing.
Edit:
Big list to test:
[('10/30 02:18 PM', '+13 -103', '-13 -107'), ('10/30 02:17 PM', '+13 +100', '-13 -110'), ('10/30 02:15 PM', '+13 -101', '-13 -109'), ('10/30 02:14 PM', '+13 -103', '-13 -107'), ('10/30 01:59 PM', '+13 -105', '-13 -105'), ('10/30 01:46 PM', '+13 -106', '-13 -104'), ('10/30 01:37 PM', '+13 -105', '-13 -105'), ('10/30 01:24 PM', '+13 -107', '-13 -103'), ('10/30 01:23 PM', '+13 -106', '-13 -104'), ('10/30 01:05 PM', '+13 -103', '-13 -107'), ('10/30 01:02 PM', '+13 -104', '-13 -106'), ('10/30 12:55 PM', '+13 -103', '-13 -107'), ('10/30 12:51 PM', '+13.5 -110', '-13.5 +100'), ('10/30 12:44 PM', '+13.5 -108', '-13.5 -102'), ('10/30 12:38 PM', '+13.5 -107', '-13.5 -103'), ('10/30 12:35 PM', '+13 -102', '-13 -108'), ('10/30 12:34 PM', '+13 -103', '-13 -107'), ('10/30 12:06 PM', '+13.5 -110', '-13.5 +100'), ('10/30 11:57 AM', '+13.5 -108', '-13.5 -102'), ('10/30 11:36 AM', '+13.5 -107', '-13.5 -103'), ('10/30 09:01 AM', '+13.5 -110', '-13.5 +100'), ('10/30 08:59 AM', '+13.5 -108', '-13.5 -102'), ('10/30 08:13 AM', '+13.5 -105', '-13.5 -105'), ('10/30 06:11 AM', '+13.5 +100', '-13.5 -110'), ('10/30 06:09 AM', '+13.5 -105', '-13.5 -105'), ('10/30 06:04 AM', '+13.5 -110', '-13.5 +100'), ('10/30 05:32 AM', '+13.5 -105', '-13.5 -105'), ('10/30 04:48 AM', '+13.5 -107', '-13.5 -103'), ('10/30 12:51 AM', '+13.5 -110', '-13.5 +100'), ('10/29 01:31 PM', '+13.5 -105', '-13.5 -105'), ('10/29 01:31 PM', '+13 +103', '-13 -113'), ('10/29 01:28 PM', '+13 -102', '-13 -108'), ('10/29 07:59 AM', '+13 -105', '-13 -105'), ('10/29 07:20 AM', '+13 -103', '-13 -107'), ('10/29 07:14 AM', '+13 -105', '-13 -105'), ('10/29 04:47 AM', '+13 +100', '-13 -110'), ('10/29 04:14 AM', '+13 -105', '-13 -105'), ('10/28 08:17 PM', '+12.5 +100', '-12.5 -110'), ('10/28 12:52 PM', '+12.5 -105', '-12.5 -105')]
Big list to test2:
[('10/30 04:30 PM', '+1.5 -111', '-1.5 +101'), ('10/30 04:24 PM', '+1.5 -110', '-1.5 +100'), ('10/30 04:21 PM', '+1.5 -111', '-1.5 +101'), ('10/30 04:15 PM', '+1.5 -112', '-1.5 +102'), ('10/30 04:14 PM', '+1.5 -110', '-1.5 +100'), ('10/30 03:57 PM', '+1.5 -111', '-1.5 +101'), ('10/30 03:40 PM', '+1.5 -110', '-1.5 +100'), ('10/30 03:31 PM', '+1.5 -111', '-1.5 +101'), ('10/30 03:30 PM', '+1.5 -109', '-1.5 -101'), ('10/30 03:25 PM', '+1.5 -107', '-1.5 -103'), ('10/30 03:24 PM', '+1.5 -110', '-1.5 +100'), ('10/30 03:23 PM', '+1.5 -108', '-1.5 -102'), ('10/30 03:22 PM', '+1.5 -106', '-1.5 -104'), ('10/30 02:14 PM', '+1.5 -104', '-1.5 -106'), ('10/30 01:41 PM', '+1.5 -105', '-1.5 -105'), ('10/30 01:37 PM', '+1.5 -107', '-1.5 -103'), ('10/30 01:36 PM', '+1.5 -105', '-1.5 -105'), ('10/30 01:06 PM', '+1.5 -103', '-1.5 -107'), ('10/30 12:56 PM', '+2 -111', '-2 +101'), ('10/30 12:53 PM', '+2 -110', '-2 +100'), ('10/30 12:50 PM', '+2 -113', '-2 +103'), ('10/30 12:49 PM', '+2 -112', '-2 +102'), ('10/30 12:46 PM', '+2 -113', '-2 +103'), ('10/30 12:45 PM', '+2 -110', '-2 +100'), ('10/30 12:43 PM', '+2 -108', '-2 -102'), ('10/30 12:38 PM', '+2.5 -116', '-2.5 +106'), ('10/30 12:38 PM', '+2.5 -113', '-2.5 +103'), ('10/30 12:37 PM', '+2.5 -110', '-2.5 +100'), ('10/30 10:30 AM', '+2.5 -105', '-2.5 -105'), ('10/30 10:07 AM', '+3 -113', '-3 +103'), ('10/30 09:55 AM', '+3 -112', '-3 +102'), ('10/30 09:51 AM', '+3 -110', '-3 +100'), ('10/30 09:32 AM', '+3 -109', '-3 -101'), ('10/30 06:04 AM', '+3 -110', '-3 +100'), ('10/30 03:16 AM', '+3 -107', '-3 -103'), ('10/30 03:14 AM', '+3.5 -116', '-3.5 +106'), ('10/30 01:03 AM', '+3.5 -115', '-3.5 +105'), ('10/30 12:17 AM', '+3.5 -110', '-3.5 +100'), ('10/29 08:52 PM', '+3.5 -108', '-3.5 -102'), ('10/29 01:31 PM', '+3.5 -105', '-3.5 -105'), ('10/29 06:48 AM', '+3.5 -110', '-3.5 +100'), ('10/29 06:47 AM', '+3.5 -109', '-3.5 -101'), ('10/29 05:39 AM', '+3.5 -113', '-3.5 +103'), ('10/29 03:34 AM', '+3.5 -108', '-3.5 -102'), ('10/29 12:44 AM', '+3.5 
-110', '-3.5 +100'), ('10/29 12:41 AM', '+3.5 -107', '-3.5 -103'), ('10/29 12:40 AM', '+3.5 -105', '-3.5 -105'), ('10/28 12:52 PM', '+4 -105', '-4 -105')]
This can be done with the datetime module, which can parse a date string into a datetime object that supports comparison and arithmetic:
from datetime import datetime
# function for parsing strings using specific format
get_datetime = lambda s: datetime.strptime(s, "%m/%d %I:%M %p")
base = get_datetime(base_date)
later = filter(lambda d: get_datetime(d[0]) > base, list_date)
closest_date = min(later, key = lambda d: get_datetime(d[0]))
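One case the snippet above does not cover is a list in which no date is later than the base date: min() then raises ValueError on the empty sequence. A minimal sketch of one way to handle that, assuming Python 3.4 or newer for the default argument of min(). The helper names parse and closest_later are made up, and ASSUMED_YEAR is an arbitrary placeholder, added only because the strings carry no year.

```python
from datetime import datetime

ASSUMED_YEAR = 2000  # assumption: the input has no year, so one is supplied
FMT = "%Y %m/%d %I:%M %p"


def parse(stamp):
    """Parse a string like '10/29 06:58 AM' using the assumed year."""
    return datetime.strptime("{} {}".format(ASSUMED_YEAR, stamp), FMT)


def closest_later(base_date, rows):
    """Return the row with the earliest date after base_date, or None."""
    base = parse(base_date)
    later = (row for row in rows if parse(row[0]) > base)
    return min(later, key=lambda row: parse(row[0]), default=None)


list_date = [('10/30 02:18 PM', '-103', '-107'),
             ('10/30 02:17 PM', '+100', '-110'),
             ('10/29 02:15 AM', '-101', '-109')]

found = closest_later("10/29 06:58 AM", list_date)
missing = closest_later("10/31 06:58 AM", list_date)
print(found, missing)
```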
>>> from datetime import timedelta, datetime
>>> base_date = "10/29 06:58 AM"
>>> b_d = datetime.strptime(base_date, "%m/%d %I:%M %p")
>>> def func(x):
...     d = datetime.strptime(x[0], "%m/%d %I:%M %p")
...     delta = d - b_d if d > b_d else timedelta.max
...     return delta
...
>>> min(list_date, key = func)
('10/30 02:17 PM', '+100', '-110')
datetime.strptime converts the date string to a datetime object. The string carries no year, so the year defaults to 1900, and b_d now looks like this:
>>> b_d
datetime.datetime(1900, 10, 29, 6, 58)
Now we can write a function that can be passed to key parameter of min:
delta = d - b_d if d > b_d else timedelta.max
If d > b_d, i.e. the date passed to min is later than base_date, their difference is assigned to delta; otherwise delta becomes timedelta.max, the largest possible timedelta, so earlier dates can never win the comparison.
>>> timedelta.max
datetime.timedelta(999999999, 86399, 999999)
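A consequence of the sentinel is that when every date is earlier than the base date, all keys equal timedelta.max and min() simply returns the first row. The following sketch adds a guard for that case. It is an illustration under assumptions, not part of the answer: closest_later_or_none and parse are made-up names, and ASSUMED_YEAR is an arbitrary placeholder for the missing year.

```python
from datetime import datetime, timedelta

ASSUMED_YEAR = 2000  # assumption: the strings carry no year
FMT = "%Y %m/%d %I:%M %p"


def parse(stamp):
    return datetime.strptime("{} {}".format(ASSUMED_YEAR, stamp), FMT)


def closest_later_or_none(base_date, rows):
    """Sentinel-key search that reports None when nothing is later."""
    base = parse(base_date)

    def gap(row):
        d = parse(row[0])
        # Earlier (or equal) dates get the largest possible gap.
        return d - base if d > base else timedelta.max

    if not rows:
        return None
    best = min(rows, key=gap)
    # If even the best row carries the sentinel, no row was later.
    return None if gap(best) == timedelta.max else best


all_earlier = [('10/29 02:15 AM', '-101', '-109'),
               ('10/28 12:52 PM', '+4 -105', '-4 -105')]
mixed = all_earlier + [('10/30 02:17 PM', '+100', '-110')]

print(closest_later_or_none("10/29 06:58 AM", all_earlier))
print(closest_later_or_none("10/29 06:58 AM", mixed))
```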
Update:
>>> from datetime import timedelta, datetime
>>> base_date = '10/29 06:59 AM'
>>> b_d = datetime.strptime(base_date, "%m/%d %I:%M %p")
>>> def func(x):
...     d = datetime.strptime(x[0], "%m/%d %I:%M %p")
...     delta = d - b_d if d > b_d else timedelta.max
...     return delta
...
>>> lis2 = [('10/30 04:30 PM', '+1.5 -111', '-1.5 +101'), ('10/30 04:24 PM', '+1.5 -110', '-1.5 +100'), ('10/30 04:21 PM', '+1.5 -111', '-1.5 +101'), ('10/30 04:15 PM', '+1.5 -112', '-1.5 +102'), ('10/30 04:14 PM', '+1.5 -110', '-1.5 +100'), ('10/30 03:57 PM', '+1.5 -111', '-1.5 +101'), ('10/30 03:40 PM', '+1.5 -110', '-1.5 +100'), ('10/30 03:31 PM', '+1.5 -111', '-1.5 +101'), ('10/30 03:30 PM', '+1.5 -109', '-1.5 -101'), ('10/30 03:25 PM', '+1.5 -107', '-1.5 -103'), ('10/30 03:24 PM', '+1.5 -110', '-1.5 +100'), ('10/30 03:23 PM', '+1.5 -108', '-1.5 -102'), ('10/30 03:22 PM', '+1.5 -106', '-1.5 -104'), ('10/30 02:14 PM', '+1.5 -104', '-1.5 -106'), ('10/30 01:41 PM', '+1.5 -105', '-1.5 -105'), ('10/30 01:37 PM', '+1.5 -107', '-1.5 -103'), ('10/30 01:36 PM', '+1.5 -105', '-1.5 -105'), ('10/30 01:06 PM', '+1.5 -103', '-1.5 -107'), ('10/30 12:56 PM', '+2 -111', '-2 +101'), ('10/30 12:53 PM', '+2 -110', '-2 +100'), ('10/30 12:50 PM', '+2 -113', '-2 +103'), ('10/30 12:49 PM', '+2 -112', '-2 +102'), ('10/30 12:46 PM', '+2 -113', '-2 +103'), ('10/30 12:45 PM', '+2 -110', '-2 +100'), ('10/30 12:43 PM', '+2 -108', '-2 -102'), ('10/30 12:38 PM', '+2.5 -116', '-2.5 +106'), ('10/30 12:38 PM', '+2.5 -113', '-2.5 +103'), ('10/30 12:37 PM', '+2.5 -110', '-2.5 +100'), ('10/30 10:30 AM', '+2.5 -105', '-2.5 -105'), ('10/30 10:07 AM', '+3 -113', '-3 +103'), ('10/30 09:55 AM', '+3 -112', '-3 +102'), ('10/30 09:51 AM', '+3 -110', '-3 +100'), ('10/30 09:32 AM', '+3 -109', '-3 -101'), ('10/30 06:04 AM', '+3 -110', '-3 +100'), ('10/30 03:16 AM', '+3 -107', '-3 -103'), ('10/30 03:14 AM', '+3.5 -116', '-3.5 +106'), ('10/30 01:03 AM', '+3.5 -115', '-3.5 +105'), ('10/30 12:17 AM', '+3.5 -110', '-3.5 +100'), ('10/29 08:52 PM', '+3.5 -108', '-3.5 -102'), ('10/29 01:31 PM', '+3.5 -105', '-3.5 -105'), ('10/29 06:48 AM', '+3.5 -110', '-3.5 +100'), ('10/29 06:47 AM', '+3.5 -109', '-3.5 -101'), ('10/29 05:39 AM', '+3.5 -113', '-3.5 +103'), ('10/29 03:34 AM', '+3.5 -108', '-3.5 -102'), ('10/29 12:44 
AM', '+3.5 -110', '-3.5 +100'), ('10/29 12:41 AM', '+3.5 -107', '-3.5 -103'), ('10/29 12:40 AM', '+3.5 -105', '-3.5 -105'), ('10/28 12:52 PM', '+4 -105', '-4 -105')]
>>> min(lis2, key = func)
('10/29 01:31 PM', '+3.5 -105', '-3.5 -105')
Timing comparisons:
Script:
from datetime import datetime, timedelta
import sys
import time
list_date = [('10/30 04:30 PM', '+1.5 -111', '-1.5 +101'), ('10/30 04:24 PM', '+1.5 -110', '-1.5 +100'), ('10/30 04:21 PM', '+1.5 -111', '-1.5 +101'), ('10/30 04:15 PM', '+1.5 -112', '-1.5 +102'), ('10/30 04:14 PM', '+1.5 -110', '-1.5 +100'), ('10/30 03:57 PM', '+1.5 -111', '-1.5 +101'), ('10/30 03:40 PM', '+1.5 -110', '-1.5 +100'), ('10/30 03:31 PM', '+1.5 -111', '-1.5 +101'), ('10/30 03:30 PM', '+1.5 -109', '-1.5 -101'), ('10/30 03:25 PM', '+1.5 -107', '-1.5 -103'), ('10/30 03:24 PM', '+1.5 -110', '-1.5 +100'), ('10/30 03:23 PM', '+1.5 -108', '-1.5 -102'), ('10/30 03:22 PM', '+1.5 -106', '-1.5 -104'), ('10/30 02:14 PM', '+1.5 -104', '-1.5 -106'), ('10/30 01:41 PM', '+1.5 -105', '-1.5 -105'), ('10/30 01:37 PM', '+1.5 -107', '-1.5 -103'), ('10/30 01:36 PM', '+1.5 -105', '-1.5 -105'), ('10/30 01:06 PM', '+1.5 -103', '-1.5 -107'), ('10/30 12:56 PM', '+2 -111', '-2 +101'), ('10/30 12:53 PM', '+2 -110', '-2 +100'), ('10/30 12:50 PM', '+2 -113', '-2 +103'), ('10/30 12:49 PM', '+2 -112', '-2 +102'), ('10/30 12:46 PM', '+2 -113', '-2 +103'), ('10/30 12:45 PM', '+2 -110', '-2 +100'), ('10/30 12:43 PM', '+2 -108', '-2 -102'), ('10/30 12:38 PM', '+2.5 -116', '-2.5 +106'), ('10/30 12:38 PM', '+2.5 -113', '-2.5 +103'), ('10/30 12:37 PM', '+2.5 -110', '-2.5 +100'), ('10/30 10:30 AM', '+2.5 -105', '-2.5 -105'), ('10/30 10:07 AM', '+3 -113', '-3 +103'), ('10/30 09:55 AM', '+3 -112', '-3 +102'), ('10/30 09:51 AM', '+3 -110', '-3 +100'), ('10/30 09:32 AM', '+3 -109', '-3 -101'), ('10/30 06:04 AM', '+3 -110', '-3 +100'), ('10/30 03:16 AM', '+3 -107', '-3 -103'), ('10/30 03:14 AM', '+3.5 -116', '-3.5 +106'), ('10/30 01:03 AM', '+3.5 -115', '-3.5 +105'), ('10/30 12:17 AM', '+3.5 -110', '-3.5 +100'), ('10/29 08:52 PM', '+3.5 -108', '-3.5 -102'), ('10/29 01:31 PM', '+3.5 -105', '-3.5 -105'), ('10/29 06:48 AM', '+3.5 -110', '-3.5 +100'), ('10/29 06:47 AM', '+3.5 -109', '-3.5 -101'), ('10/29 05:39 AM', '+3.5 -113', '-3.5 +103'), ('10/29 03:34 AM', '+3.5 -108', '-3.5 -102'), ('10/29 12:44 
AM', '+3.5 -110', '-3.5 +100'), ('10/29 12:41 AM', '+3.5 -107', '-3.5 -103'), ('10/29 12:40 AM', '+3.5 -105', '-3.5 -105'), ('10/28 12:52 PM', '+4 -105', '-4 -105')]
base_date = "10/29 06:58 AM"
def func1(list_date):
    #http://stackoverflow.com/a/17249420/846892
    get_datetime = lambda s: datetime.strptime(s, "%m/%d %I:%M %p")
    base = get_datetime(base_date)
    later = filter(lambda d: get_datetime(d[0]) > base, list_date)
    return min(later, key = lambda d: get_datetime(d[0]))

def func2(list_date):
    #http://stackoverflow.com/a/17249470/846892
    b_d = datetime.strptime(base_date, "%m/%d %I:%M %p")
    def func(x):
        d = datetime.strptime(x[0], "%m/%d %I:%M %p")
        delta = d - b_d if d > b_d else timedelta.max
        return delta
    return min(list_date, key = func)

def func3(list_date):
    #http://stackoverflow.com/a/17249529/846892
    fmt = '%m/%d %I:%M %p'
    d = datetime.strptime(base_date, fmt)
    def foo(x):
        return (datetime.strptime(x[0],fmt)-d).total_seconds() > 0
    return sorted(list_date, key=foo)[-1]

def func4(list_date):
    #http://stackoverflow.com/a/17249441/846892
    fmt = '%m/%d %I:%M %p'
    base_d = datetime.strptime(base_date, fmt)
    candidates = ((datetime.strptime(d, fmt), d, x, y) for d, x, y in list_date)
    candidates = min((dt, d, x, y) for dt, d, x, y in candidates if dt > base_d)
    return candidates[1:]
Results:
>>> from so import *
#check output first
>>> func1(list_date)
('10/29 01:31 PM', '+3.5 -105', '-3.5 -105')
>>> func2(list_date)
('10/29 01:31 PM', '+3.5 -105', '-3.5 -105')
>>> func3(list_date)
('10/29 01:31 PM', '+3.5 -105', '-3.5 -105')
>>> func4(list_date)
('10/29 01:31 PM', '+3.5 -105', '-3.5 -105')
>>> %timeit func1(list_date)
100 loops, best of 3: 3.07 ms per loop
>>> %timeit func2(list_date)
100 loops, best of 3: 1.59 ms per loop #winner
>>> %timeit func3(list_date)
100 loops, best of 3: 1.91 ms per loop
>>> %timeit func4(list_date)
1000 loops, best of 3: 2.02 ms per loop
#increase the input size
>>> list_date = list_date *10**3
>>> len(list_date)
48000
>>> %timeit func1(list_date)
1 loops, best of 3: 3.6 s per loop
>>> %timeit func2(list_date) #winner
1 loops, best of 3: 1.99 s per loop
>>> %timeit func3(list_date)
1 loops, best of 3: 2.09 s per loop
>>> %timeit func4(list_date)
1 loops, best of 3: 2.02 s per loop
#increase the input size again
>>> list_date = list_date *10
>>> len(list_date)
480000
>>> %timeit func1(list_date)
1 loops, best of 3: 36.4 s per loop
>>> %timeit func2(list_date) #winner
1 loops, best of 3: 20.2 s per loop
>>> %timeit func3(list_date)
1 loops, best of 3: 22.8 s per loop
>>> %timeit func4(list_date)
1 loops, best of 3: 22.7 s per loop
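The %timeit lines above are IPython magics. Outside IPython, a comparable measurement can be sketched with the standard timeit module. This is only an illustration: closest_later mirrors the timedelta.max approach, parse and ASSUMED_YEAR are made-up helpers for the missing year, and the absolute numbers will differ from machine to machine.

```python
import timeit
from datetime import datetime, timedelta

ASSUMED_YEAR = 2000  # assumption: the strings carry no year
FMT = "%Y %m/%d %I:%M %p"


def parse(stamp):
    return datetime.strptime("{} {}".format(ASSUMED_YEAR, stamp), FMT)


def closest_later(base_date, rows):
    """Sentinel-key search, parsing every row once per call."""
    base = parse(base_date)

    def gap(row):
        d = parse(row[0])
        return d - base if d > base else timedelta.max

    return min(rows, key=gap)


rows = [('10/30 02:18 PM', '-103', '-107'),
        ('10/30 02:17 PM', '+100', '-110'),
        ('10/29 02:15 AM', '-101', '-109')] * 100

CALLS = 5
# repeat=3 mirrors the "best of 3" convention in the output above.
timings = timeit.repeat(lambda: closest_later("10/29 06:58 AM", rows),
                        number=CALLS, repeat=3)
best = min(timings) / CALLS
print("best of 3: %.6f s per call" % best)
```

Taking the minimum of the repeats, rather than the mean, is the usual way to reduce noise from other processes.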
Decorate, filter, find the closest date, undecorate:
>>> base_date = "10/29 06:58 AM"
>>> list_date = [
... ('10/30 02:18 PM', '-103', '-107'),
... ('10/30 02:17 PM', '+100', '-110'),
... ('10/29 02:15 AM', '-101', '-109')
... ]
>>> import datetime
>>> fmt = '%m/%d %I:%M %p'  # %I (12-hour clock) so that %p affects the hour
>>> base_d = datetime.datetime.strptime(base_date, fmt)
>>> candidates = ((datetime.datetime.strptime(d, fmt), d, x, y) for d, x, y in list_date)
>>> candidates = min((dt, d, x, y) for dt, d, x, y in candidates if dt > base_d)
>>> print(candidates[1:])
('10/30 02:17 PM', '+100', '-110')
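One detail worth checking in format strings like these: %p only affects the parsed hour when the hour is read with %I. With %H the AM/PM marker is accepted but ignored, so afternoon times come out twelve hours early. A short check, with a year added to the sample string purely as an assumption so that the date is unambiguous:

```python
from datetime import datetime

stamp = "2000 10/30 02:17 PM"  # year 2000 is an arbitrary placeholder

with_12h = datetime.strptime(stamp, "%Y %m/%d %I:%M %p")
with_24h = datetime.strptime(stamp, "%Y %m/%d %H:%M %p")

print(with_12h.hour)  # hour parsed with %I, PM applied
print(with_24h.hour)  # hour parsed with %H, PM ignored
```

On the three-row example the wrong format happens to give the same answer, which is why the problem is easy to overlook.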
You can also put the dates into a pandas index and then use the truncate (or get_loc) method.
import pandas as pd
## Initial inputs
list_date = [('10/30 02:18 PM', '-103', '-107'), ('10/29 02:15 AM', '-101', '-109'),
             ('10/30 02:17 PM', '+100', '-110')]  # reordered to show the method is insensitive to input order
base_date = "10/29 06:58 AM"
## Make a data frame with the data
df = pd.DataFrame(list_date)
df.columns = ['date', 'val1', 'val2']
dateIndex = pd.to_datetime(df['date'], format='%m/%d %I:%M %p')
df = df.set_index(dateIndex)
df = df.sort_index()  # ascending, so the earliest date comes on top
## Find the result
base_dateObj = pd.to_datetime(base_date, format='%m/%d %I:%M %p')
result = df.truncate(before=base_dateObj).iloc[0]  # drop rows before the base date, then take the first remaining row
(result['date'], result['val1'], result['val2'])  # result is ('10/30 02:17 PM', '+100', '-110')
Reference: this link
Linear search?
import time
base_date = "10/29 06:58 AM"
def str_to_my_time(my_str):
    # no year in the string, so strptime assumes year 1900
    return time.mktime(time.strptime(my_str, "%m/%d %I:%M %p"))
base_dt = str_to_my_time(base_date)
list_date = [('10/30 02:18 PM', '-103', '-107'),
             ('10/30 02:17 PM', '+100', '-110'),
             ('10/29 02:15 AM', '-101', '-109')]
best_delta = float('inf')  # larger than any real difference
best_match = None
for t in list_date:
    the_dt = str_to_my_time(t[0])
    delta_sec = the_dt - base_dt
    if 0 <= delta_sec < best_delta:
        best_delta = delta_sec
        best_match = t
print(best_match, best_delta)
Producing:
('10/30 02:17 PM', '+100', '-110') 112740.0
import time
# The function
def to_sec(date_string):
    return time.mktime(time.strptime(date_string, '%m/%d %I:%M %p'))
# The test
base_date = "10/29 06:58 AM"
base_date_sec = to_sec(base_date)
result = None
difference = float('inf')  # larger than any real difference
list_date = [
    ('10/30 02:18 PM', '-103', '-107'),
    ('10/30 02:17 PM', '+100', '-110'),
    ('10/29 02:15 AM', '-101', '-109')]
for date_str in list_date:
    diff_sec = to_sec(date_str[0]) - base_date_sec
    if 0 <= diff_sec < difference:
        result = date_str
        difference = diff_sec
print(result)
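import datetime
fmt = '%m/%d %I:%M %p'  # %I (12-hour clock) so that %p affects the hour
d = datetime.datetime.strptime(base_date, fmt)
def foo(x):
    # True for entries later than the base date
    return (datetime.datetime.strptime(x[0], fmt) - d).total_seconds() > 0
# Relies on list_date being ordered newest first: the stable sort keeps that
# order, so the last True entry is the closest later date.
sorted(list_date, key=foo)[-1]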
I was looking up this problem and found some answers, most of which check all elements.
I have my dates sorted (and assume most people do), so if you do as well, use numpy:
import numpy as np
# dates is a sorted numpy array of np.datetime64 objects
dates = np.array([date1, date2, date3, ...], dtype=np.datetime64)
timestamp = np.datetime64('Your date')
np.searchsorted(dates, timestamp)
searchsorted uses binary search, which uses the fact the dates are sorted, and is thus very efficient.
If you use pandas, this is possible:
dates = df.index # df is a DatetimeIndex-ed dataframe
timestamp = pd.to_datetime('your date here', format='its format')
np.searchsorted(dates, timestamp)
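The function returns the insertion index, that is, the index of the first date that is not earlier than the searched one. If the searched date is itself in dates, its index is returned; if that isn't wanted, pass side='right'. If the searched date is later than every entry, the result equals len(dates), so check for that before indexing. To get the date, do this: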
dates[np.searchsorted(dates, timestamp)]
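The same binary-search idea is available in the standard library through the bisect module, which may be enough when numpy is not otherwise needed. A minimal sketch, assuming the dates are already parsed into datetime objects and sorted in ascending order. The helper name first_after and the sample values, including the year, are made up for illustration.

```python
from bisect import bisect_right
from datetime import datetime


def first_after(sorted_dates, target):
    """Return the first date strictly later than target, or None."""
    # bisect_right gives the insertion point after any entries equal to
    # target, which is the index of the first strictly later date.
    i = bisect_right(sorted_dates, target)
    return sorted_dates[i] if i < len(sorted_dates) else None


dates = [datetime(2000, 10, 28, 12, 52),
         datetime(2000, 10, 29, 2, 15),
         datetime(2000, 10, 30, 14, 17),
         datetime(2000, 10, 30, 14, 18)]

print(first_after(dates, datetime(2000, 10, 29, 6, 58)))
print(first_after(dates, datetime(2000, 10, 31, 0, 0)))
```

Using bisect_left instead would return a date equal to the target when one exists, which corresponds to the default side='left' behaviour of np.searchsorted.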