parse multiple dates using dateutil - python

I am trying to parse multiple dates from a string in Python with the help of this code,
from dateutil.parser import _timelex, parser
a = "Approve my leave from first half of 12/10/2012 to second half of 20/10/2012 "
p = parser()
info = p.info
def timetoken(token):
    try:
        float(token)
        return True
    except ValueError:
        pass
    return any(f(token) for f in (info.jump,info.weekday,info.month,info.hms,info.ampm,info.pertain,info.utczone,info.tzoffset))
def timesplit(input_string):
    batch = []
    for token in _timelex(input_string):
        if timetoken(token):
            if info.jump(token):
                continue
            batch.append(token)
        else:
            if batch:
                yield " ".join(batch)
                batch = []
    if batch:
        yield " ".join(batch)
for item in timesplit(a):
    print "Found:", item
    print "Parsed:", p.parse(item)
but the code picks up "second" (from "second half") as a second date and gives me this error:
raise ValueError, "unknown string format"
ValueError: unknown string format
When I change 'second half' to 'third half' or 'fourth half', it works fine.
Can anyone help me parse this string?

Your parser can't handle the "second" found by timesplit. If you set the fuzzy parameter to True, it doesn't break, but it doesn't produce anything meaningful either.
from cStringIO import StringIO
for item in timesplit(StringIO(a)):
    print "Found:", item
    print "Parsed:", p.parse(StringIO(item),fuzzy=True)
out:
Found: 12 10 2012
Parsed: 2012-12-10 00:00:00
Found: second
Parsed: 2013-01-11 00:00:00
Found: 20 10 2012
Parsed: 2012-10-20 00:00:00
You have to either fix the time splitting or handle the errors.
Option 1:
Drop info.hms from the checks in timetoken, so that words like "second" are no longer treated as time tokens.
Option 2:
from cStringIO import StringIO
for item in timesplit(StringIO(a)):
    print "Found:", item
    try:
        print "Parsed:", p.parse(StringIO(item))
    except ValueError:
        print 'Not Parsed!'
out:
Found: 12 10 2012
Parsed: 2012-12-10 00:00:00
Found: second
Not Parsed!
Parsed: Found: 20 10 2012
Parsed: 2012-10-20 00:00:00

If you only need the dates, you can extract them with a regex and work with those.
a = "Approve my leave from first half of 12/10/2012 to second half of 20/10/2012 "
import re
pattern = re.compile('\d{2}/\d{2}/\d{4}')
pattern.findall(a)
['12/10/2012', '20/10/2012']
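The regex approach above returns plain strings. A minimal sketch of turning them into datetime objects follows, assuming the dates are written day/month/year (20/10/2012 can only be read that way). The names DATE_RE and extract_dates are made up for this illustration.

```python
import re
from datetime import datetime

text = "Approve my leave from first half of 12/10/2012 to second half of 20/10/2012 "

# Match dd/mm/yyyy-shaped tokens; strptime then validates each candidate.
DATE_RE = re.compile(r"\d{2}/\d{2}/\d{4}")

def extract_dates(s, fmt="%d/%m/%Y"):
    """Return datetime objects for every date-shaped token that is a real date."""
    found = []
    for raw in DATE_RE.findall(s):
        try:
            found.append(datetime.strptime(raw, fmt))
        except ValueError:
            # Shaped like a date but not a real one (e.g. 31/02/2012): skip it.
            continue
    return found

dates = extract_dates(text)
print(dates)
```

If the input uses month/day/year instead, pass fmt="%m/%d/%Y"; the regex alone cannot tell the two orders apart.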

Related

Parsing long form dates from string

I am aware that there are other solutions to similar problems on stack overflow but they don't work in my particular situation.
I have some strings -- here are some examples of them.
string_with_dates = "random non-date text, 22 May 1945 and 11 June 2004"
string2 = "random non-date text, 01/01/1999 & 11 June 2004"
string3 = "random non-date text, 01/01/1990, June 23 2010"
string4 = "01/2/2010 and 25th of July 2020"
string5 = "random non-date text, 01/02/1990"
string6 = "random non-date text, 01/02/2010 June 10 2010"
I need a parser that can determine how many date-like objects are in the string and then parse them into a list of actual dates. I can't find any solutions out there. Here is the desired output:
['05/22/1945','06/11/2004']
Or as actual datetime objects. Any ideas?
I have tried the solutions listed in "How to parse multiple dates from a block of text in Python (or another language)", but they don't work.
Here is what happens when I try the solutions suggested in that link:
import itertools
from dateutil import parser
jumpwords = set(parser.parserinfo.JUMP)
keywords = set(kw.lower() for kw in itertools.chain(
    parser.parserinfo.UTCZONE,
    parser.parserinfo.PERTAIN,
    (x for s in parser.parserinfo.WEEKDAYS for x in s),
    (x for s in parser.parserinfo.MONTHS for x in s),
    (x for s in parser.parserinfo.HMS for x in s),
    (x for s in parser.parserinfo.AMPM for x in s),
))
def parse_multiple(s):
    def is_valid_kw(s):
        try: # is it a number?
            float(s)
            return True
        except ValueError:
            return s.lower() in keywords
    def _split(s):
        kw_found = False
        tokens = parser._timelex.split(s)
        for i in xrange(len(tokens)):
            if tokens[i] in jumpwords:
                continue
            if not kw_found and is_valid_kw(tokens[i]):
                kw_found = True
                start = i
            elif kw_found and not is_valid_kw(tokens[i]):
                kw_found = False
                yield "".join(tokens[start:i])
        # handle date at end of input str
        if kw_found:
            yield "".join(tokens[start:])
    return [parser.parse(x) for x in _split(s)]
parse_multiple(string_with_dates)
Output:
ParserError: Unknown string format: 22 May 1945 and 11 June 2004
Another method:
from dateutil.parser import _timelex, parser
a = "I like peas on 2011-04-23, and I also like them on easter and my birthday, the 29th of July, 1928"
p = parser()
info = p.info
def timetoken(token):
    try:
        float(token)
        return True
    except ValueError:
        pass
    return any(f(token) for f in (info.jump,info.weekday,info.month,info.hms,info.ampm,info.pertain,info.utczone,info.tzoffset))
def timesplit(input_string):
    batch = []
    for token in _timelex(input_string):
        if timetoken(token):
            if info.jump(token):
                continue
            batch.append(token)
        else:
            if batch:
                yield " ".join(batch)
                batch = []
    if batch:
        yield " ".join(batch)
for item in timesplit(string_with_dates):
    print "Found:", (item)
    print "Parsed:", p.parse(item)
Output:
ParserError: Unknown string format: 22 May 1945 11 June 2004
Any ideas?
Okay sorry to anyone who spent time on this -- but I was able to answer my own question. Leaving this up in case anyone else has the same issue.
This package was able to work perfectly: https://pypi.org/project/datefinder/
import datefinder
def DatesToList(x):
    # find_dates returns a generator; collect its results into a list
    dates = datefinder.find_dates(x)
    lists = []
    for date in dates:
        lists.append(date)
    return (lists)
dates = DatesToList(string_with_dates)
Output:
[datetime.datetime(1945, 5, 22, 0, 0), datetime.datetime(2004, 6, 11, 0, 0)]

How can I adjust 'the time' in python with module Re

This is a funny question.
I am trying to extract a valid date from some strings.
I use try/except and the re module,
but something is wrong in my code: it can't deal with some tough inputs.
As shown below, when I input a ridiculous date such as 1997-25-52 or 1996-42-120,
it still outputs an answer.
def regular_time(time):
    """
    Some movie dates include a country, e.g. '1994-09-10(Canada)'.
    Extract the date with a regex.
    """
    import re
    pattern = '^(([1-2]\d{3})-(0[1-9]|1[0-2])-(0[1-9]|[1-2][0-9]|3[0-1]))'
    try:
        matches = re.match(pattern, time, flags=0).group()
        return matches
    except Exception as e:
        try:
            pattern = '^(([1-2]\d{3})-(0[1-9]|1[0-2]))'
            matches = re.match(pattern, time, flags=0).group()+'-01'
            return matches
        except:
            try:
                pattern = '^(([1-2]\d{3}))'
                matches = re.match(pattern, time, flags=0).group() + '-01-01'
                return matches
            except:
                print('errors')
time='1996-12-58'
regular_time(time)
How can I deal with this problem? Many thanks if you could do me a favor
Question: Default date from invalid datestring
Using datetime also handles leap years!
datetime.datetime.strptime
datetime.date.strftime
For example:
import re
from datetime import datetime
def regular_time(time):
    _t = time.split('-')
    # always 3 items
    while len(_t) < 3:
        _t.append('01')
    # year, month and day ranges
    ymd = [(range(1900, 2099), '1900'),
           (range(1, 13), '01'),
           (range(1, 32), '01')
           ]
    # validate ranges
    for n in range(3):
        if not int(_t[n]) in ymd[n][0]:
            _t[n] = ymd[n][1]
    _time = '-'.join(_t)
    try:
        date = datetime.strptime(_time, '%Y-%m-%d')
        print('VALID:{} => {}'
              .format(time, date.strftime('%Y-%m-%d')))
    except ValueError as e:
        # compare against the message text; an exception object is not iterable
        if "day is out of range for month" in str(e):
            print('{} for {}, change to 01'.format(e, time))
            _t[2] = '01'
            regular_time('-'.join(_t))
        else:
            print('INVALID[{}]:{}'.format(_time, e))
for time in ['1996', '1996-18', '2019-09-31', '2019-01-31',
             '1996-12-58', '1997-25-52', '1996-42-120']:
    regular_time(time)
Output:
VALID:1996 => 1996-01-01
VALID:1996-18 => 1996-01-01
day is out of range for month for 2019-09-31, change to 01
VALID:2019-09-01 => 2019-09-01
VALID:2019-01-31 => 2019-01-31
VALID:1996-12-58 => 1996-12-01
VALID:1997-25-52 => 1997-01-01
VALID:1996-42-120 => 1996-01-01
Tested with Python 3.6
Your test case returns "1996-12-01" because it falls through to the second-level try/except: the first pattern fails (the day is unrealistic), but the second pattern matches the valid year and month, so the code just falls back to the first day of that month by appending "-01".
If you want every part of the date to be realistic, don't overwrite the original pattern with a looser one. Let the input fail at the first step instead.
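A minimal sketch of that "fail at the first step" idea, not the asker's original code: capture whatever year, month and day are present, then let datetime.strptime decide whether they form a real date. The names strict_time and DATE_PREFIX are made up for this example, and defaulting a missing month or day to 01 is an assumption carried over from the question.

```python
import re
from datetime import datetime

# Four-digit year, then optional "-month" and "-day" parts of any digit length,
# anchored at the start so trailing text such as "(Canada)" is ignored.
DATE_PREFIX = re.compile(r"^(\d{4})(?!\d)(?:-(\d+))?(?:-(\d+))?")

def strict_time(text):
    """Return 'YYYY-MM-DD' if the leading date is valid, otherwise None."""
    m = DATE_PREFIX.match(text)
    if m is None:
        return None
    year, month, day = m.groups()
    # Only parts that are absent get a default; present parts are never replaced.
    candidate = "{}-{}-{}".format(year, month or "01", day or "01")
    try:
        return datetime.strptime(candidate, "%Y-%m-%d").strftime("%Y-%m-%d")
    except ValueError:
        # Month or day out of range (or not a real calendar date): reject.
        return None

for sample in ["1996", "1996-12", "1994-09-10(Canada)",
               "1996-12-58", "1997-25-52", "1996-42-120", "2019-09-31"]:
    print(sample, "=>", strict_time(sample))
```

Unlike the nested try/except version, an out-of-range month or day makes the whole input invalid instead of silently degrading to a shorter pattern.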

Python: How to add/subtract a number only to numeric characters in a string?

Say for example, I have the following strings and an input 4.0, which represents seconds:
John Time Made 11:05:20 in 2010
5.001 Kelly #1
6.005 Josh #8
And would like the following result:
John Time Made 11:05:24 in 2010 #Input 4.0 is added to the seconds of 11:05:20
1.001 Kelly #1 #4.0 is subtracted from the first number 5.001 = 1.001
2.005 Josh #8 #4.0 is subtracted from the first number 6.005 = 2.005
How can I recognize the hours:minutes:seconds in the first line, and #.### in the rest to add/subtract the input number?
Thank you in advance and will accept/upvote answer
This solution should work if your complete data has the same format as this particular sample you provided. You should have the data in the input.txt file.
val_to_add = 4
with open('input.txt') as fin:
    # processing first line
    first_line = fin.readline().strip()
    splitted = first_line.split(' ')
    # get hour, minute, second corresponding to time (11:05:20)
    time_values = splitted[3].split(':')
    # seconds is the last element
    seconds = int(time_values[-1])
    # add the value
    new_seconds = seconds + val_to_add
    # doing simple math to avoid having values >= 60 for minute and second
    # this part probably can be solved with datetime or some other lib, but it's not that complex, so I did it in couple of lines
    seconds = new_seconds % 60 # if we get > 59 seconds we only put the modulo as second and the other part goes to minute
    new_minutes = int(time_values[1]) + new_seconds // 60 # if we have more than 60 s then here we'll add minutes produced by adding to the seconds
    minutes = new_minutes % 60 # similarly as for seconds
    hours = int(time_values[0]) + new_minutes // 60
    # here I convert again to string so we could easily apply join operation (operates only on strings) and additionally add zero in front for 1 digit numbers
    time_values[0] = str(hours).rjust(2, '0')
    time_values[1] = str(minutes).rjust(2, '0')
    time_values[2] = str(seconds).rjust(2, '0')
    new_time_val = ':'.join(time_values) # join the values to follow the HH:MM:SS format
    splitted[3] = new_time_val # replace the old time with the new one (with the value added)
    first_line_modified = ' '.join(splitted) # just join the modified list
    print(first_line_modified)
    # processing other lines
    for line in fin:
        # here we only get the first (0th) value and subtract the val_to_add and round to 3 digits the response (to avoid too many decimal places)
        stripped = line.strip()
        splitted = stripped.split(' ')
        splitted[0] = str(round(float(splitted[0]) - val_to_add, 3))
        modified_line = ' '.join(splitted)
        print(modified_line)
Although regex was discouraged in the comments, regex can be used to parse the time objects into datetime.time objects, perform the necessary calculations on them, then print them in the required format:
# datetime module for time calculations
import datetime
# regex module
import re
# seconds to add to time
myinp = 4
# List of data strings
# data = 'John Time Made 11:05:20 in 2010', '5.001 Kelly', '6.005 Josh'
with open('data.txt') as f:
    data = f.readlines()
new_data = []
# iterate through the list of data strings
for time in data:
    try:
        # First check for 'HH:MM:SS' time format in data string
        # regex taken from this question: http://stackoverflow.com/questions/8318236/regex-pattern-for-hhmmss-time-string
        match = re.findall("([0-1]?\d|2[0-3]):([0-5]?\d):([0-5]?\d)", time)
        # this regex returns a list of tuples of strings "[('HH', 'MM', 'SS')]",
        # which we join back together with ':' (colon) separators
        t = ':'.join(match[0])
        # create a Datetime object from indexing the first matched time in the list,
        # taken from this answer http://stackoverflow.com/questions/100210/what-is-the-standard-way-to-add-n-seconds-to-datetime-time-in-python
        # May create an IndexError exception, which we catch in the `except` clause below
        orig = datetime.datetime(100,1,1,int(match[0][0]), int(match[0][1]), int(match[0][2]))
        # Add the number of seconds to the Datetime object,
        # taken from this answer: http://stackoverflow.com/questions/656297/python-time-timedelta-equivalent
        newtime = (orig + datetime.timedelta(0, myinp)).time()
        # replace the time in the original data string with the newtime and store it
        new_data.append(time.replace(t, str(newtime)))
    # catch an IndexError exception, in which case we look for float-formatted seconds only
    except IndexError:
        # look for float-formatted seconds (s.xxx)
        # taken from this answer: http://stackoverflow.com/questions/4703390/how-to-extract-a-floating-number-from-a-string
        match = re.findall("\d+\.\d+", time)
        # create a Datetime object from indexing the first matched time in the list,
        # specifying only seconds and microseconds; the three fractional digits are milliseconds, so microseconds = ms*1000
        orig = datetime.datetime(100,1,1,second=int(match[0].split('.')[0]),microsecond=int(match[0].split('.')[1])*1000)
        # Subtract the seconds from the Datetime object, similar to the time addition in the `try` clause above
        newtime = orig - datetime.timedelta(0, myinp)
        # format the newtime as `seconds` plus the microseconds converted to a fraction of a second
        newtime_fmt = newtime.second + newtime.microsecond/1000000.
        # Get the seconds value (first value(index 0)) from splitting the original string at the `space` between the `seconds` and `name` strings
        t = time.split(' ')[0]
        # replace the time in the original data string with the newtime and store it
        new_data.append(time.replace(t , str(newtime_fmt)))
with open('new_data.txt', 'w') as nf:
    for newline in new_data:
        nf.write(newline)
new_data.txt file contents should read as:
John Time Made 11:05:24 in 2010
1.001 Kelly
2.005 Josh

Searching for unique dates in a log file (python)

I'm writing a script that parses log files and matches certain strings like "INFO", "WARN", "SEVERE", etc.
I can do this without much trouble using the code below.
from sys import argv
from collections import OrderedDict
# Find and catalog each log line that matches these strings
match_strings = ["INFO", "WARN", "SEVERE"]
if len(argv) > 1:
    files = argv[1:]
else:
    print "ERROR: You must provide at least one log file to be processed."
    print "Example:"
    print "%s my.log" % argv[0]
    exit(2)
for filename in files:
    with open(filename) as f:
        data = f.read().splitlines()
        # Create data structure to handle results
        matches = OrderedDict()
        for string in match_strings:
            matches[string] = []
        for i, s in enumerate(data, 1):
            for string in match_strings:
                if string in s:
                    matches[string].append('Line %03d: %s' % (i, s,))
        for string in matches:
            print "\"%s\": %d" % (string, len(matches[string]))
Log files look like:
2014-05-26T15:06:14.597+0000 INFO...
2014-05-26T15:06:14.597+0000 WARN...
2014-05-27T15:06:14.597+0000 INFO...
2014-05-28T15:06:14.597+0000 SEVERE...
2014-05-29T15:06:14.597+0000 SEVERE...
Current output looks like:
"INFO": 2
"WARN": 1
"SEVERE": 2
However, what I'd rather do is have the script collate and print formatted output by date. So, rather than print a simple list (above) we could get something like the following using the sample from above:
Category    2014-05-26  2014-05-27  2014-05-28  2014-05-29
"INFO":              1           1           0           0
"WARN":              1           0           0           0
"SEVERE":            0           0           1           1
Are there any thoughts / suggestions how to accomplish this?
One way of doing this would be to make a class that has variables info, warn, and severe in it. Then make a dictionary where each element is this class with the key being the date. Then when you are parsing your log file you can just find the date and use that as the index for your dictionary and increment the info, warn, and severe as needed.
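The idea above can be sketched roughly as follows. Instead of a hand-written class, this version swaps in a collections.Counter per date, which plays the same role of holding the info/warn/severe counts. The function names count_by_date and format_table are invented for the example, and it assumes every log line starts with a YYYY-MM-DD date as in the sample.

```python
from collections import Counter, defaultdict

LEVELS = ["INFO", "WARN", "SEVERE"]

def count_by_date(lines):
    """Map each date string to a Counter of log levels seen on that date."""
    counts = defaultdict(Counter)
    for line in lines:
        date = line[:10]  # the "2014-05-26" part of the timestamp prefix
        for level in LEVELS:
            if level in line:
                counts[date][level] += 1
    return counts

def format_table(counts):
    """Render one row per level and one column per date, sorted by date."""
    dates = sorted(counts)
    rows = ["{:<10}".format("Category") + "".join("  {}".format(d) for d in dates)]
    for level in LEVELS:
        label = '"{}":'.format(level)
        # A Counter returns 0 for levels that never occurred on a date.
        cells = "".join("  {:>10}".format(counts[d][level]) for d in dates)
        rows.append("{:<10}".format(label) + cells)
    return "\n".join(rows)

sample = [
    "2014-05-26T15:06:14.597+0000 INFO...",
    "2014-05-26T15:06:14.597+0000 WARN...",
    "2014-05-27T15:06:14.597+0000 INFO...",
    "2014-05-28T15:06:14.597+0000 SEVERE...",
    "2014-05-29T15:06:14.597+0000 SEVERE...",
]
counts = count_by_date(sample)
print(format_table(counts))
```

For real log files, the same function can be fed an open file object, since it only iterates over lines.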

Removing first character from string

I am working on a CodeEval challenge and have a solution to a problem which takes a list of numbers as an input and then outputs the sum of the digits of each line. Here is my code to make certain you understand what I mean:
import sys
test_cases = open(sys.argv[1], 'r')
for test in test_cases:
    if test:
        num = int(test)
        total =0
        while num != 0:
            total += num % 10
            num /= 10
        print total
test_cases.close()
I am attempting to rewrite this where it takes the number as a string, slices each 0-index, and then adds those together (curious to see what the time and memory differences are - totally new to coding and trying to find multiple ways to do things as well)
However, I am stuck on getting this to execute and have the following:
import sys
test_cases = open(sys.argv[1], 'r')
for test in test_cases:
    sums = 0
    while test:
        sums = sums + int(str(test)[0])
        test = test[1:]
    print sums
test_cases.close()
I am receiving a "ValueError: invalid literal for int() with base 10: ''"
The sample input is a text file which looks like this:
3011
6890
8778
1844
42
8849
3847
8985
5048
7350
8121
5421
7026
4246
4439
6993
4761
3658
6049
1177
Thanks for any help you can offer!
Your issue is the newline characters (e.g. \n or \r\n) at the end of each line.
Change this line:
for test in test_cases:
into this to split out the newlines:
for test in test_cases.read().splitlines():
Try this code:
import sys
tot = 0
with open(sys.argv[1], 'r') as f:
    for line in f:
        try:
            tot += int(line)
        except ValueError:
            print "Not a number"
print tot
Using the context manager (with ...), the file is closed automatically.
The try/except around int() filters out any empty or invalid value.
You can replace print with whatever statement suits you (raise or pass, depending on your goals).
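For the string-based digit sum the question asks about, a minimal sketch could look like this. It is written for Python 3, unlike the Python 2 snippets above, and the helper name digit_sum is made up for the illustration.

```python
def digit_sum(line):
    """Sum the decimal digits in one line of text, ignoring the trailing newline."""
    # strip() drops the "\n" that made int() fail in the string-slicing version.
    return sum(int(ch) for ch in line.strip() if ch in "0123456789")

# Stand-in for the lines of the input file, including a blank trailing line.
lines = ["3011\n", "6890\n", "42\n", "\n"]
for line in lines:
    if line.strip():  # skip blank lines
        print(digit_sum(line))
```

Iterating over the characters avoids re-slicing the string on every step, which the `test = test[1:]` loop does.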
