Hi, I have to preprocess a column that contains comma-separated values. I can't simply apply .split(',\s*') because some commas and spaces are not separators (for example, inside the parenthesised day lists), so I'm looking for a regex pattern.
Column:
0 12noon to 3:30pm, 6:30pm to 11:30pm (Mon-Sun)
1 11 AM to 11 PM
2 11:30 AM to 4:30 PM, 6:30 PM to 11 PM
3 12 Noon to 2 AM
4 12noon to 11pm (Mon, Tue, Wed, Thu, Sun), 12no...
...
100 11 AM to 11 PM
101 10 AM to 10 PM (Mon-Thu), 8 AM to 10:30 PM (Fr...
102 12 Noon to 11 PM
103 8am to 12:30AM (Mon-Sun)
104 11:30 AM to 3 PM, 7 PM to 12 Midnight
What I've tried:
import re
pattern = '([\w+\:*\s*\w*(w{2})*]*\s*to\s*[\w+\:*\s*\w*(w{2})*]*\s*[\([a-zA-Z]*\-*\,*\s*[a-zA-Z]*\s*\)]*)'
timing = data['timings'].str.lower().str.split(pattern).dropna().to_numpy()
Output:
array([list(['12noon to 3:30pm,', ' 6:30pm to 11:30pm (mon-sun)', '']),
list(['11 am to 11 pm']),
list(['11:30 am to 4:30 pm, 6:30 pm to 11 pm']),
list(['12 noon to 2 am']),
list(['12noon to 11pm (mon, tue, wed, thu, sun),', ' 12noon to 12midnight (fri-sat)', '']),
list(['12noon to 3:30pm, 4pm to 6:30pm, 7pm to 11:30pm (mon, tue, wed, thu, sun), 12noon to 3:30pm, 4pm to 6:30pm,', ' 7pm to 12midnight (fri-sat)', '']),
list(['7 am to 10 pm']), list(['12 noon to 12 midnight']),
list(['12 noon to 12 midnight']),
list(['', '10 am to 1 am (mon-thu)', ',', ' 10 am to 1:30 am (fri-sun)', '']),
list(['12 noon to 3:30 pm, 7 pm to 10:30 pm']),
list(['12 noon to 3:30 pm, 6:30 pm to 11:30 pm']),
list(['11:30 am to 1 am']),
list(['', '12noon to 12midnight (mon-sun)', '']),
list(['12 noon to 4:30 pm, 6:30 pm to 11:30 pm']),
list(['11 am to 11 pm']), list(['12 noon to 10:30 pm']),
list(['11:30 am to 1 am']), list(['12 noon to 12 midnight']),
list(['12 noon to 11 pm']),
list(['', '12:30 pm to 10 pm (tue-sun)', ', mon closed']),
list(['11:30 am to 3 pm, 7 pm to 11 pm']),
list(['11am to 11:30pm (mon, tue, wed, thu, sun),', ' 11am to 12midnight (fri-sat)', '']),
list(['10 am to 5 am']),
list(['12 noon to 12 midnight (mon-thu, sun),', ' 12 noon to 1 am (fri-sat)', '']),
list(['', '12noon to 11pm (mon-thu)', ',', '12noon to 11:30pm (fri-sun)', '']),
list(['', '12 noon to 11:30 pm (mon-wed)', ',', ' 12 noon to 1 am (fri-sat)', ',', ' 12 noon to 12 midnight (sun)', ', thu closed']),
list(['12 noon to 4 pm, 6:30 pm to 11:30 pm']),
list(['10 am to 1 am']), list(['4:30 pm to 5:30 am']),
list(['11 am to 12 midnight']),
list(['12noon to 4pm,', ' 7pm to 12midnight (mon-sun)', '']),
list(['11 am to 12 midnight']),
list(['', '6am to 12midnight (mon-sun)', '']),
list(['12 noon to 11 pm']),
list(['12:30 pm to 3:30 pm, 7 pm to 10:40 pm']),
list(['12 noon to 4 pm, 7 pm to 11 pm']),
list(['12noon to 11pm (mon, tue, wed, thu, sun),', ' 12noon to 12midnight (fri-sat)', '']),
list(['12 noon to 10:30 pm']),
list(['', '12noon to 11pm (mon-sun)', '']),
list(['10 am to 10 pm']), list(['10 am to 10 pm']),
list(['7 am to 1 am']), list(['12 noon to 11:30 pm']),
list(['', '12noon to 11:30pm (mon-sun)', '']),
list(['12 noon to 11:30 pm']), list(['12 noon to 11 pm']),
list(['6 am to 10:30 pm']),
list(['11:30 am to 3:30 pm, 6:45 pm to 11:30 pm']),
list(['11:55 am to 4 pm, 7 pm to 11:15 pm']),
list(['12 noon to 11 pm']), list(['11 am to 11 pm']),
list(['12noon to 4:30pm, 6:30pm to 11:30pm (mon, tue, wed, fri, sat), closed (thu),', '12noon to 12midnight (sun)', '']),
list(['12noon to 12midnight (mon, tue, wed, thu, sun),', ' 12noon to 1am (fri-sat)', '']),
list(['8 am to 11:30 pm']),
list(['6:30am to 10:30am, 12:30pm to 3pm,', ' 7pm to 11pm (mon)', ',6:30am to 10:30am, 12:30pm to 3pm,', ' 7:30pm to 11pm (tue-sat)', ',6:30am to 10:30am, 12:30pm to 3:30pm,', ' 7pm to 11pm (sun)', '']),
list(['12 noon to 3 pm, 7 pm to 11:30 pm']),
list(['11:30 am to 1 am']), list(['9 am to 10 pm']),
list(['12 noon to 12 midnight (mon-thu, sun),', ' 12 noon to 1 am (fri-sat)', '']),
list(['', '5pm to 12midnight (mon-sun)', '']),
list(['11 am to 11:30 pm']),
list(['', '11:30am to 11pm (mon-sun)', '']),
list(['12 noon to 10:30 pm']), list(['1 pm to 11 pm']),
list(['11:30 am to 12 midnight']),
list(['12 noon to 12 midnight']),
list(['', '12noon to 12midnight (mon-sun)', '']),
list(['', '12noon to 11pm (mon-sun)', '']),
list(['12 noon to 3 pm, 7 pm to 11 pm']),
list(['12 noon to 3 pm, 7 pm to 11 pm']),
list(['', '11 am to 8 pm (mon-sat)', ', sun closed']),
list(['4 am to 12 midnight']), list(['9 am to 1 am']),
list(['10:30 am to 11 pm']), list(['7 am to 11 pm']),
list(['7 am to 10:30 am, 12:30 pm to 3:30 pm, 7 pm to 11 pm']),
list(['12 noon to 3:30 pm, 7 pm to 11:30 pm']),
list(['12 noon to 3:30 pm, 7 pm to 11 pm']),
list(['12noon to 12midnight (mon, tue, wed, thu, sun),', ' 12noon to 1am (fri-sat)', '']),
list(['', '11am to 11pm (mon-sun)', '']),
list(['6 am to 11:30 pm']), list(['11:30 am to 5 am']),
list(['12:30 pm to 3:30 pm, 7 pm to 11 pm']),
list(['', '6pm to 2am (mon-sun)', '']),......)
But what I want is something like this:
[['6pm to 2am (mon-sun)'], ['12 noon to 12 midnight (mon-thu, sun)'] .....]
I think I need a better regex pattern to separate these values. Can anyone suggest one? Thanks in advance :).
Here's my attempt:
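import re
import pandas

data = pandas.read_excel('C:\\Users\\Administrator\\Desktop\\test.xls')
# One "<start> to <end>" range with an optional parenthesised day list (raw string, so the backslashes reach the regex engine unchanged).
pattern = r'(\d{1,2}(?:\:\d{1,2})? ?(?:\w{2,8}) to \d{1,2}(?:\:\d{1,2})? ?(?:\w{2,8}) ?(?:\(\w{3}(?:[ ,-]{1,3}\w{3}){0,6}\))?)'
# Join all rows into one string, then pull out every matching range.
re.findall(pattern, data["myData"].str.cat(sep=", "))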
With the call to re.findall() my output was:
['12noon to 3:30pm', '6:30pm to 11:30pm (Mon-Sun)', '11 AM to 11 PM', '11:30 AM to 4:30 PM', '6:30 PM to 11 PM', '12 Noon to 2 AM', '11 AM to 11 PM', '10 AM to 10 PM (Mon-Thu)', '8 AM to 10:30 PM (Fri,Sat)', '12 Noon to 11 PM', '8am to 12:30AM (Mon-Sun)', '11:30 AM to 3 PM', '7 PM to 12 Midnight']
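If one list per row is wanted (as in the question) rather than a single flat list, the same pattern can be applied row by row. This is only a minimal sketch, assuming the rows look like the untruncated samples from the question. The names TIME_RANGE, rows and split_timings are made up for illustration, and the pattern is the one from this answer with the outer capturing group dropped.

```python
import re

# Same pattern as in the answer, as a raw string and without the outer
# capturing group, so findall() returns whole matches.
TIME_RANGE = re.compile(
    r'\d{1,2}(?::\d{1,2})? ?\w{2,8} to \d{1,2}(?::\d{1,2})? ?\w{2,8}'
    r' ?(?:\(\w{3}(?:[ ,-]{1,3}\w{3}){0,6}\))?'
)

# Sample rows copied from the question (only the ones that are not truncated).
rows = [
    '12noon to 3:30pm, 6:30pm to 11:30pm (Mon-Sun)',
    '11 AM to 11 PM',
    '12 Noon to 2 AM',
    '8am to 12:30AM (Mon-Sun)',
    '11:30 AM to 3 PM, 7 PM to 12 Midnight',
]


def split_timings(text):
    """Return every time range found in one cell (hypothetical helper)."""
    return TIME_RANGE.findall(text)


result = [split_timings(row) for row in rows]
for item in result:
    print(item)
```

On a pandas column, data['timings'].str.findall(TIME_RANGE.pattern) should give the same per-row lists, assuming the column is called 'timings' as in the question.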
I have the following list which I need to sort in ascending order:
tlist = ['10:10 AM - 10:20 AM', '10:20 AM - 10:30 AM', '10:30 AM - 10:40 AM', '10:40 AM - 10:50 AM', '10:50 AM - 11:00 AM', '11:00 AM - 11:10 AM', '11:10 AM - 11:20 AM', '11:20 AM - 11:30 AM', '11:30 AM - 11:40 AM', '11:40 AM - 11:50 AM', '11:50 AM - 12:00 PM', '12:00 PM - 12:10 PM', '12:10 PM - 12:20 PM', '12:20 PM - 12:30 PM', '12:30 PM - 12:40 PM', '12:40 PM - 12:50 PM', '12:50 PM - 1:00 PM', '1:00 PM - 1:10 PM', '1:10 PM - 1:20 PM', '1:20 PM - 1:30 PM', '1:30 PM - 1:40 PM', '1:40 PM - 1:50 PM', '1:50 PM - 2:00 PM', '2:00 PM - 2:10 PM', '2:10 PM - 2:20 PM', '2:20 PM - 2:30 PM', '2:30 PM - 2:40 PM', '2:40 PM - 2:50 PM', '2:50 PM - 3:00 PM', '3:00 PM - 3:10 PM', '3:10 PM - 3:20 PM', '3:20 PM - 3:30 PM', '3:30 PM - 3:40 PM', '3:40 PM - 3:50 PM', '3:50 PM - 4:00 PM', '4:00 PM - 4:10 PM', '4:10 PM - 4:20 PM', '4:20 PM - 4:30 PM', '4:30 PM - 4:40 PM', '4:40 PM - 4:50 PM', '4:50 PM - 5:00 PM', '5:00 PM - 5:10 PM', '5:10 PM - 5:20 PM', '5:20 PM - 5:30 PM', '5:30 PM - 5:40 PM', '5:40 PM - 5:50 PM', '5:50 PM - 6:00 PM', '6:00 PM - 6:10 PM', '6:10 PM - 6:20 PM', '6:20 PM - 6:30 PM', '6:30 PM - 6:40 PM', '6:40 PM - 6:50 PM', '6:50 PM - 7:00 PM', '7:00 PM - 7:10 PM', '7:10 AM - 7:20 AM', '7:10 PM - 7:20 PM', '7:20 AM - 7:30 AM', '7:20 PM - 7:30 PM', '7:30 AM - 7:40 AM', '7:30 PM - 7:40 PM', '7:40 AM - 7:50 AM', '7:40 PM - 7:50 PM', '7:50 AM - 8:00 AM', '7:50 PM - 8:00 PM', '8:00 AM - 8:10 AM', '8:00 PM - 8:10 PM', '8:10 AM - 8:20 AM', '8:10 PM - 8:20 PM', '8:20 AM - 8:30 AM', '8:20 PM - 8:30 PM', '8:30 AM - 8:40 AM', '8:30 PM - 8:40 PM', '8:40 AM - 8:50 AM', '8:40 PM - 8:50 PM', '8:50 AM - 9:00 AM', '8:50 PM - 9:00 PM', '9:00 AM - 9:10 AM', '9:00 PM - 9:10 PM', '9:10 AM - 9:20 AM', '9:10 PM - 9:20 PM', '9:20 AM - 9:30 AM', '9:20 PM - 9:30 PM', '9:30 AM - 9:40 AM', '9:40 AM - 9:50 AM', '9:50 AM - 10:00 AM']
While attempting that, I wrote a loop to convert each time string to a time object, but the conversion fails.
import time
tlist = ['10:10 AM - 10:20 AM', '10:20 AM - 10:30 AM', '10:30 AM - 10:40 AM', '10:40 AM - 10:50 AM', '10:50 AM - 11:00 AM', '11:00 AM - 11:10 AM', '11:10 AM - 11:20 AM', '11:20 AM - 11:30 AM', '11:30 AM - 11:40 AM', '11:40 AM - 11:50 AM', '11:50 AM - 12:00 PM', '12:00 PM - 12:10 PM', '12:10 PM - 12:20 PM', '12:20 PM - 12:30 PM', '12:30 PM - 12:40 PM', '12:40 PM - 12:50 PM', '12:50 PM - 1:00 PM', '1:00 PM - 1:10 PM', '1:10 PM - 1:20 PM', '1:20 PM - 1:30 PM', '1:30 PM - 1:40 PM', '1:40 PM - 1:50 PM', '1:50 PM - 2:00 PM', '2:00 PM - 2:10 PM', '2:10 PM - 2:20 PM', '2:20 PM - 2:30 PM', '2:30 PM - 2:40 PM', '2:40 PM - 2:50 PM', '2:50 PM - 3:00 PM', '3:00 PM - 3:10 PM', '3:10 PM - 3:20 PM', '3:20 PM - 3:30 PM', '3:30 PM - 3:40 PM', '3:40 PM - 3:50 PM', '3:50 PM - 4:00 PM', '4:00 PM - 4:10 PM', '4:10 PM - 4:20 PM', '4:20 PM - 4:30 PM', '4:30 PM - 4:40 PM', '4:40 PM - 4:50 PM', '4:50 PM - 5:00 PM', '5:00 PM - 5:10 PM', '5:10 PM - 5:20 PM', '5:20 PM - 5:30 PM', '5:30 PM - 5:40 PM', '5:40 PM - 5:50 PM', '5:50 PM - 6:00 PM', '6:00 PM - 6:10 PM', '6:10 PM - 6:20 PM', '6:20 PM - 6:30 PM', '6:30 PM - 6:40 PM', '6:40 PM - 6:50 PM', '6:50 PM - 7:00 PM', '7:00 PM - 7:10 PM', '7:10 AM - 7:20 AM', '7:10 PM - 7:20 PM', '7:20 AM - 7:30 AM', '7:20 PM - 7:30 PM', '7:30 AM - 7:40 AM', '7:30 PM - 7:40 PM', '7:40 AM - 7:50 AM', '7:40 PM - 7:50 PM', '7:50 AM - 8:00 AM', '7:50 PM - 8:00 PM', '8:00 AM - 8:10 AM', '8:00 PM - 8:10 PM', '8:10 AM - 8:20 AM', '8:10 PM - 8:20 PM', '8:20 AM - 8:30 AM', '8:20 PM - 8:30 PM', '8:30 AM - 8:40 AM', '8:30 PM - 8:40 PM', '8:40 AM - 8:50 AM', '8:40 PM - 8:50 PM', '8:50 AM - 9:00 AM', '8:50 PM - 9:00 PM', '9:00 AM - 9:10 AM', '9:00 PM - 9:10 PM', '9:10 AM - 9:20 AM', '9:10 PM - 9:20 PM', '9:20 AM - 9:30 AM', '9:20 PM - 9:30 PM', '9:30 AM - 9:40 AM', '9:40 AM - 9:50 AM', '9:50 AM - 10:00 AM']
for t in tlist:
    f = t.split('-')[0]
    print(f)
    ft = time.strptime(f, "%I:%M %p")
    print(f, ft)
I'm getting an error:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-19-0a2d8195df22> in <module>()
4 f = t.split('-')[0]
5 print(f)
----> 6 ft = time.strptime(f, "%I:%M %p")
7 print(f, ft)
/usr/lib/python3.6/_strptime.py in _strptime_time(data_string, format)
557 """Return a time struct based on the input string and the
558 format string."""
--> 559 tt = _strptime(data_string, format)[0]
560 return time.struct_time(tt[:time._STRUCT_TM_ITEMS])
561
/usr/lib/python3.6/_strptime.py in _strptime(data_string, format)
363 if len(data_string) != found.end():
364 raise ValueError("unconverted data remains: %s" %
--> 365 data_string[found.end():])
366
367 iso_year = year = None
ValueError: unconverted data remains:
How can I fix this error? Is there an easier technique of sorting these other than the tedious method of looping over the list and transferring to an intermediary list?
By doing
f = t.split('-')[0]
ft = time.strptime(f, "%I:%M %p")
you end up with stray whitespace around each time string: '10:10 AM - 10:20 AM' splits into '10:10 AM ' (trailing space) and ' 10:20 AM' (leading space).
This is also what the error message is saying:
ValueError: unconverted data remains:
strptime applied the format %I:%M %p to f, but a trailing space was left over that the format does not account for.
The solution is to either
split on ' - ': f = t.split(' - ')[0]
or
use strip: f = t.split('-')[0].strip() (probably the better solution, as it is a bit more generic).
You could also include the whitespace in the format (time.strptime(f, "%I:%M %p ")), but that is a brittle fix that will break as soon as the spacing in the input changes.
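A minimal sketch of the failure and the two fixes, using one string from the question. The variable names (raw, start_a, start_b, failed) are illustrative only.

```python
import time

t = '10:10 AM - 10:20 AM'

# Splitting on '-' alone keeps the space that sat next to the hyphen.
raw = t.split('-')[0]
print(repr(raw))

# The trailing space is what triggers "unconverted data remains".
try:
    time.strptime(raw, "%I:%M %p")
    failed = False
except ValueError as exc:
    failed = True
    print("ValueError:", exc)

# Fix 1: split on the full separator, spaces included.
start_a = time.strptime(t.split(' - ')[0], "%I:%M %p")

# Fix 2: split on '-' and strip the surrounding whitespace.
start_b = time.strptime(t.split('-')[0].strip(), "%I:%M %p")

print(start_a.tm_hour, start_a.tm_min)
print(start_a == start_b)
```

Both fixes parse to the same struct_time. The strip() variant also tolerates input where the spacing around the hyphen varies.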
Change
f = t.split('-')[0]
To
f = t.split('-')[0].strip()
After split('-') you get two values, e.g. '10:10 AM ' and ' 10:20 AM', so the surrounding whitespace needs to be removed.
Use sorted with a key function that parses the start time, for example:
import time
tlist = ['10:10 AM - 10:20 AM', '10:20 AM - 10:30 AM', '10:30 AM - 10:40 AM', '10:40 AM - 10:50 AM', '10:50 AM - 11:00 AM', '11:00 AM - 11:10 AM', '11:10 AM - 11:20 AM', '11:20 AM - 11:30 AM', '11:30 AM - 11:40 AM', '11:40 AM - 11:50 AM', '11:50 AM - 12:00 PM', '12:00 PM - 12:10 PM', '12:10 PM - 12:20 PM', '12:20 PM - 12:30 PM', '12:30 PM - 12:40 PM', '12:40 PM - 12:50 PM', '12:50 PM - 1:00 PM', '1:00 PM - 1:10 PM', '1:10 PM - 1:20 PM', '1:20 PM - 1:30 PM', '1:30 PM - 1:40 PM', '1:40 PM - 1:50 PM', '1:50 PM - 2:00 PM', '2:00 PM - 2:10 PM', '2:10 PM - 2:20 PM', '2:20 PM - 2:30 PM', '2:30 PM - 2:40 PM', '2:40 PM - 2:50 PM', '2:50 PM - 3:00 PM', '3:00 PM - 3:10 PM', '3:10 PM - 3:20 PM', '3:20 PM - 3:30 PM', '3:30 PM - 3:40 PM', '3:40 PM - 3:50 PM', '3:50 PM - 4:00 PM', '4:00 PM - 4:10 PM', '4:10 PM - 4:20 PM', '4:20 PM - 4:30 PM', '4:30 PM - 4:40 PM', '4:40 PM - 4:50 PM', '4:50 PM - 5:00 PM', '5:00 PM - 5:10 PM', '5:10 PM - 5:20 PM', '5:20 PM - 5:30 PM', '5:30 PM - 5:40 PM', '5:40 PM - 5:50 PM', '5:50 PM - 6:00 PM', '6:00 PM - 6:10 PM', '6:10 PM - 6:20 PM', '6:20 PM - 6:30 PM', '6:30 PM - 6:40 PM', '6:40 PM - 6:50 PM', '6:50 PM - 7:00 PM', '7:00 PM - 7:10 PM', '7:10 AM - 7:20 AM', '7:10 PM - 7:20 PM', '7:20 AM - 7:30 AM', '7:20 PM - 7:30 PM', '7:30 AM - 7:40 AM', '7:30 PM - 7:40 PM', '7:40 AM - 7:50 AM', '7:40 PM - 7:50 PM', '7:50 AM - 8:00 AM', '7:50 PM - 8:00 PM', '8:00 AM - 8:10 AM', '8:00 PM - 8:10 PM', '8:10 AM - 8:20 AM', '8:10 PM - 8:20 PM', '8:20 AM - 8:30 AM', '8:20 PM - 8:30 PM', '8:30 AM - 8:40 AM', '8:30 PM - 8:40 PM', '8:40 AM - 8:50 AM', '8:40 PM - 8:50 PM', '8:50 AM - 9:00 AM', '8:50 PM - 9:00 PM', '9:00 AM - 9:10 AM', '9:00 PM - 9:10 PM', '9:10 AM - 9:20 AM', '9:10 PM - 9:20 PM', '9:20 AM - 9:30 AM', '9:20 PM - 9:30 PM', '9:30 AM - 9:40 AM', '9:40 AM - 9:50 AM', '9:50 AM - 10:00 AM']
print(sorted(tlist, key=lambda x: time.strptime(x.split("-")[0].strip(), "%I:%M %p")))
Output:
['7:10 AM - 7:20 AM', '7:20 AM - 7:30 AM', '7:30 AM - 7:40 AM', '7:40 AM - 7:50 AM', '7:50 AM - 8:00 AM', '8:00 AM - 8:10 AM', '8:10 AM - 8:20 AM', '8:20 AM - 8:30 AM', '8:30 AM - 8:40 AM', '8:40 AM - 8:50 AM', '8:50 AM - 9:00 AM', '9:00 AM - 9:10 AM', '9:10 AM - 9:20 AM', '9:20 AM - 9:30 AM', '9:30 AM - 9:40 AM', '9:40 AM - 9:50 AM', '9:50 AM - 10:00 AM', '10:10 AM - 10:20 AM', '10:20 AM - 10:30 AM', '10:30 AM - 10:40 AM', '10:40 AM - 10:50 AM', '10:50 AM - 11:00 AM', '11:00 AM - 11:10 AM', '11:10 AM - 11:20 AM', '11:20 AM - 11:30 AM', '11:30 AM - 11:40 AM', '11:40 AM - 11:50 AM', '11:50 AM - 12:00 PM', '12:00 PM - 12:10 PM', '12:10 PM - 12:20 PM', '12:20 PM - 12:30 PM', '12:30 PM - 12:40 PM', '12:40 PM - 12:50 PM', '12:50 PM - 1:00 PM', '1:00 PM - 1:10 PM', '1:10 PM - 1:20 PM', '1:20 PM - 1:30 PM', '1:30 PM - 1:40 PM', '1:40 PM - 1:50 PM', '1:50 PM - 2:00 PM', '2:00 PM - 2:10 PM', '2:10 PM - 2:20 PM', '2:20 PM - 2:30 PM', '2:30 PM - 2:40 PM', '2:40 PM - 2:50 PM', '2:50 PM - 3:00 PM', '3:00 PM - 3:10 PM', '3:10 PM - 3:20 PM', '3:20 PM - 3:30 PM', '3:30 PM - 3:40 PM', '3:40 PM - 3:50 PM', '3:50 PM - 4:00 PM', '4:00 PM - 4:10 PM', '4:10 PM - 4:20 PM', '4:20 PM - 4:30 PM', '4:30 PM - 4:40 PM', '4:40 PM - 4:50 PM', '4:50 PM - 5:00 PM', '5:00 PM - 5:10 PM', '5:10 PM - 5:20 PM', '5:20 PM - 5:30 PM', '5:30 PM - 5:40 PM', '5:40 PM - 5:50 PM', '5:50 PM - 6:00 PM', '6:00 PM - 6:10 PM', '6:10 PM - 6:20 PM', '6:20 PM - 6:30 PM', '6:30 PM - 6:40 PM', '6:40 PM - 6:50 PM', '6:50 PM - 7:00 PM', '7:00 PM - 7:10 PM', '7:10 PM - 7:20 PM', '7:20 PM - 7:30 PM', '7:30 PM - 7:40 PM', '7:40 PM - 7:50 PM', '7:50 PM - 8:00 PM', '8:00 PM - 8:10 PM', '8:10 PM - 8:20 PM', '8:20 PM - 8:30 PM', '8:30 PM - 8:40 PM', '8:40 PM - 8:50 PM', '8:50 PM - 9:00 PM', '9:00 PM - 9:10 PM', '9:10 PM - 9:20 PM', '9:20 PM - 9:30 PM']
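A key function is needed because a plain sorted(tlist) compares the strings character by character, which puts '10:10 AM' before '7:10 AM'. As a variation on the one-liner, here is a sketch with a named key function that parses both ends of each slot using datetime.strptime. FMT, slot_key and the shortened sample list are assumptions for illustration, not part of the original code.

```python
from datetime import datetime

FMT = "%I:%M %p"


def slot_key(slot):
    """Sort key: (start, end) of a 'H:MM AM - H:MM PM' slot as datetimes."""
    start, end = (part.strip() for part in slot.split('-'))
    return (datetime.strptime(start, FMT), datetime.strptime(end, FMT))


# A few entries taken from tlist, deliberately out of order.
sample = [
    '12:50 PM - 1:00 PM',
    '7:10 AM - 7:20 AM',
    '10:10 AM - 10:20 AM',
    '7:10 PM - 7:20 PM',
]

ordered = sorted(sample, key=slot_key)
print(ordered)
```

Sorting on the (start, end) pair makes no difference for this data, since every start time is unique, but it keeps the order well defined if two slots ever share a start.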