I have a log with entries in the following format:
1483528632 3 1 Wed Jan 4 11:17:12 2017 501040002 4
1533528768 4 2 Thu Jan 5 19:17:45 2017 534040012 3
...
How do I fetch only the timestamp component (e.g. Wed Jan 4 11:17:12 2017) using regular expressions?
The final product will be implemented in Python, but this part has to run in an automated regression suite written in bash/perl.
If the format is fixed in terms of space delimiters, you can simply split, get a slice of a date string and load it to datetime object via datetime.strptime():
In [1]: from datetime import datetime
In [2]: s = "1483528632 3 1 Wed Jan 4 11:17:12 2017 501040002 4"
In [3]: date_string = ' '.join(s.split()[3:8])
In [4]: datetime.strptime(date_string, "%a %b %d %H:%M:%S %Y")
Out[4]: datetime.datetime(2017, 1, 4, 11, 17, 12)
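The same slicing idea can be wrapped in a small helper and applied to many lines. This is only a sketch, assuming every line has the same whitespace-separated layout as the two samples in the question; the name parse_log_timestamp is made up for illustration.

```python
from datetime import datetime

TIMESTAMP_FORMAT = "%a %b %d %H:%M:%S %Y"

def parse_log_timestamp(line):
    """Return the timestamp held in fields 3..7 of a log line as a datetime."""
    fields = line.split()
    # Fields 3 to 7 are: weekday, month, day, time, year.
    return datetime.strptime(" ".join(fields[3:8]), TIMESTAMP_FORMAT)

log_lines = [
    "1483528632 3 1 Wed Jan 4 11:17:12 2017 501040002 4",
    "1533528768 4 2 Thu Jan 5 19:17:45 2017 534040012 3",
]
timestamps = [parse_log_timestamp(line) for line in log_lines]
print(timestamps)
```

A line with a different number of fields would raise ValueError from strptime(), which may be what you want in a regression suite.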
Grep is most often used in this scenario if you are working with syslog, but as the post is also tagged with Python, this example uses regular expressions with re:
import re
Define the pattern to match:
pat = r"\w{3}\s\w{3}\s+\w{1,2}\s\w{2}:\w{2}:\w{2}\s\w{4}"  # raw string; \w{1,2} also matches two-digit days
Then use re.findall to return all non-overlapping matches of pattern in txt:
re.findall(pat,txt)
Output:
['Wed Jan 4 11:17:12 2017', 'Thu Jan 5 19:17:45 2017']
If you want to then use datetime:
import datetime
dates = re.findall(pat,txt)
datetime.datetime.strptime(dates[0], "%a %b %d %H:%M:%S %Y")
Output:
datetime.datetime(2017, 1, 4, 11, 17, 12)
You can then utilise these datetime objects:
dateObject = datetime.datetime.strptime(dates[0], "%a %b %d %H:%M:%S %Y").date()
timeObject = datetime.datetime.strptime(dates[0], "%a %b %d %H:%M:%S %Y").time()
print('The date is {} and time is {}'.format(dateObject,timeObject))
Output:
The date is 2017-01-04 and time is 11:17:12
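The snippets above never show how txt is built, so here is a self-contained sketch of the same findall approach. The contents of txt are an assumption (the two sample lines from the question), and the pattern uses digit classes instead of \w so that it cannot match arbitrary words.

```python
import re
from datetime import datetime

# Stand-in for the log contents (assumed: the two sample lines from the question).
txt = (
    "1483528632 3 1 Wed Jan 4 11:17:12 2017 501040002 4\n"
    "1533528768 4 2 Thu Jan 5 19:17:45 2017 534040012 3\n"
)

# weekday, month, 1-2 digit day, HH:MM:SS, 4-digit year
pat = re.compile(r"[A-Za-z]{3}\s+[A-Za-z]{3}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2}\s+\d{4}")

dates = pat.findall(txt)
# Collapse any repeated spaces before handing the string to strptime().
parsed = [datetime.strptime(" ".join(d.split()), "%a %b %d %H:%M:%S %Y") for d in dates]
print(dates)
print(parsed)
```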
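The regex to match the timestamp is:
'[a-zA-Z]{3} +[a-zA-Z]{3} +[0-9]{1,2} +[0-9]{2}:[0-9]{2}:[0-9]{2} +[0-9]{4}'
With grep -E, write the digits as [0-9] rather than \d, because POSIX extended regular expressions do not define \d. It can be used like this (if your log file was called log.txt):
$ grep -oE '[a-zA-Z]{3} +[a-zA-Z]{3} +[0-9]{1,2} +[0-9]{2}:[0-9]{2}:[0-9]{2} +[0-9]{4}' log.txt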
# Wed Jan 4 11:17:12 2017
# Thu Jan 5 19:17:45 2017
In python you can use that like so:
import re
log_entry = "1483528632 3 1 Wed Jan 4 11:17:12 2017 501040002 4"
pattern = r'[a-zA-Z]{3} +[a-zA-Z]{3} +\d{1,2} +\d{2}:\d{2}:\d{2} +\d{4}'
compiled = re.compile(pattern)
match = compiled.search(log_entry)
match.group(0)
# 'Wed Jan 4 11:17:12 2017'
You can use this to get an actual datetime object from the string (expanding on above code):
from datetime import datetime
import re
log_entry = "1483528632 3 1 Wed Jan 4 11:17:12 2017 501040002 4"
pattern = r'[a-zA-Z]{3} +[a-zA-Z]{3} +\d{1,2} +\d{2}:\d{2}:\d{2} +\d{4}'
compiled = re.compile(pattern)
match = compiled.search(log_entry)
log_time_str = match.group(0)
datetime.strptime(log_time_str, "%a %b %d %H:%M:%S %Y")
# datetime.datetime(2017, 1, 4, 11, 17, 12)
Two approaches, with and without regular expressions:
1) using the re.findall() function:
import re
with open('test.log', 'r') as fh:
    # \s+ tolerates one or more spaces between the month and the day
    lines = re.findall(r'\b[A-Za-z]{3}\s[A-Za-z]{3}\s+\d{1,2} \d{2}:\d{2}:\d{2} \d{4}\b', fh.read())
print(lines)
2) using the str.split() and str.join() functions:
with open('test.log', 'r') as fh:
    lines = [' '.join(d.split()[3:8]) for d in fh]
print(lines)
The output in both cases will be as below:
['Wed Jan 4 11:17:12 2017', 'Thu Jan 5 19:17:45 2017']
grep -E '\b(Mon|Tue|Wed|Thu|Fri|Sat|Sun) (Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) +[0-9]+ [0-9]{2}:[0-9]{2}:[0-9]{2} [0-9]{4}\b' dates
If you just wanted to list the dates, rather than grep, perhaps:
sed -nre 's/^.*([A-Za-z]{3}\s+[A-Za-z]{3}\s+[0-9]+\s+[0-9]+:[0-9]+:[0-9]+\s+[0-9]{4}).*$/\1/p' filename
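Since the final product is meant to be in Python, the grep pattern with explicit day and month names can be carried over more or less unchanged. The sketch below is one way to do that; the helper name iter_timestamps and the sample list are invented for illustration.

```python
import re

DAYS = "Mon|Tue|Wed|Thu|Fri|Sat|Sun"
MONTHS = "Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec"
TIMESTAMP_RE = re.compile(
    r"\b(?:%s) (?:%s) +[0-9]{1,2} [0-9]{2}:[0-9]{2}:[0-9]{2} [0-9]{4}\b" % (DAYS, MONTHS)
)

def iter_timestamps(lines):
    """Yield the first timestamp found on each line, skipping lines without one."""
    for line in lines:
        match = TIMESTAMP_RE.search(line)
        if match:
            yield match.group(0)

sample = [
    "1483528632 3 1 Wed Jan 4 11:17:12 2017 501040002 4",
    "a line without any timestamp",
    "1533528768 4 2 Thu Jan 5 19:17:45 2017 534040012 3",
]
found = list(iter_timestamps(sample))
print(found)
```

Passing an open file object instead of the list would process a real log line by line.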
Related
I am parsing emails through Gmail API and have got the following date format:
Sat, 21 Jan 2017 05:08:04 -0800
I want to convert it into ISO 2017-01-21 (yyyy-mm-dd) format for MySQL storage. I am not able to do it through strftime()/strptime() and am missing something. Can someone please help?
TIA
Parse the string with dateutil's parser, then call isoformat() or date() on the result:
import dateutil.parser as parser
text = 'Sat, 21 Jan 2017 05:08:04 -0800'
date = (parser.parse(text))
print(date.isoformat())
print (date.date())
Output :
2017-01-21T05:08:04-08:00
2017-01-21
You can do it with strptime():
import datetime
datetime.datetime.strptime('Sat, 21 Jan 2017 05:08:04 -0800', '%a, %d %b %Y %H:%M:%S %z')
That gives you:
datetime.datetime(2017, 1, 21, 5, 8, 4, tzinfo=datetime.timezone(datetime.timedelta(-1, 57600)))
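Because this is the standard email date format, the standard library's email.utils.parsedate_to_datetime() (Python 3.3+) can also parse it without a format string. A short sketch; the variable names are only illustrative, and whether to store the sender's local date or the UTC date is a choice the question leaves open.

```python
from datetime import timezone
from email.utils import parsedate_to_datetime

raw = "Sat, 21 Jan 2017 05:08:04 -0800"
parsed = parsedate_to_datetime(raw)  # timezone-aware datetime

# Date as written in the header (sender's local date).
local_date = parsed.date().isoformat()
# Date after converting to UTC, if the database should hold UTC dates.
utc_date = parsed.astimezone(timezone.utc).date().isoformat()
print(local_date, utc_date)
```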
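You can even do it manually using a simple split and a dictionary. That way, you will have more control over formatting.
def dateconvertor(date):
    # 'Sat, 21 Jan 2017 05:08:04 -0800' -> ['Sat,', '21', 'Jan', '2017', ...]
    parts = date.split(' ')
    month = {'Jan': 1, 'Feb': 2, 'Mar': 3, 'Apr': 4, 'May': 5, 'Jun': 6,
             'Jul': 7, 'Aug': 8, 'Sep': 9, 'Oct': 10, 'Nov': 11, 'Dec': 12}
    # Print as yyyy-mm-dd, zero-padding the month and the day
    print('{}-{:02d}-{:02d}'.format(parts[3], month[parts[2]], int(parts[1])))
def main():
    dt = "Sat, 21 Jan 2017 05:08:04 -0800"
    dateconvertor(dt)
if __name__ == '__main__':
    main()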
Keep it simple.
from datetime import datetime
s="Sat, 21 Jan 2017 05:08:04 -0800"
d = datetime.strptime(s, "%a, %d %b %Y %H:%M:%S %z")  # %z parses the -0800 offset (Python 3)
print(datetime.strftime(d,"%Y-%m-%d"))
Output : 2017-01-21
I have a datetime data in this format,
08:15:54:012 12 03 2016 +0000 GMT+00:00
I need to extract only date,that is 12 03 2016 in python.
I have tried
datetime_object=datetime.strptime('08:15:54:012 12 03 2016 +0000 GMT+00:00','%H:%M:%S:%f %d %m %Y')
I get an
ValueError: unconverted data remains: +0000 GMT+00:00
If you don't mind using an external library, I find the dateparser module much more intuitive than Python's built-in datetime. It can parse pretty much anything if you just do
>>> import dateparser
>>> dateparser.parse('08:15:54:012 12 03 2016 +0000 GMT+00:00')
It claims to handle timezone offsets, though I haven't tested that.
If you need this as string then use slicing
text = '08:15:54:012 12 03 2016 +0000 GMT+00:00'
print(text[13:23])
# 12 03 2016
but you can also convert to datetime
from datetime import datetime
text = '08:15:54:012 12 03 2016 +0000 GMT+00:00'
datetime_object = datetime.strptime(text[13:23],'%d %m %Y')
print(datetime_object)
# 2016-03-12 00:00:00
BTW:
in your original version you have to remove +0000 GMT+00:00 using slicing [:-16]
datetime.strptime('08:15:54:012 12 03 2016 +0000 GMT+00:00'[:-16], '%H:%M:%S:%f %d %m %Y')
You can also use split() and join()
>>> x = '08:15:54:012 12 03 2016 +0000 GMT+00:00'.split()
>>> x
['08:15:54:012', '12', '03', '2016', '+0000', 'GMT+00:00']
>>> x[1:4]
['12', '03', '2016']
>>> ' '.join(x[1:4])
'12 03 2016'
You can do it like this:
d = '08:15:54:012 12 03 2016 +0000 GMT+00:00'
d = d[:23] #Remove the timezone details
from datetime import datetime
d = datetime.strptime(d, "%H:%M:%S:%f %d %m %Y") #parse the string (day before month, as in the question)
d.strftime('%d %m %Y') #format the string
You get:
'12 03 2016'
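If you would rather not count characters for the slice, another option is to keep the first five whitespace-separated tokens and let %z consume the numeric offset. This is a sketch for Python 3, assuming the day comes before the month as in the question's own format string.

```python
from datetime import datetime

text = "08:15:54:012 12 03 2016 +0000 GMT+00:00"

# Keep time, day, month, year and the numeric offset; drop the trailing "GMT+00:00".
head = " ".join(text.split()[:5])
parsed = datetime.strptime(head, "%H:%M:%S:%f %d %m %Y %z")

date_only = parsed.strftime("%d %m %Y")
print(date_only)
```

The result is timezone-aware, so the offset is kept rather than thrown away.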
I have date in string:
Tue Oct 04 2016 12:13:00 GMT+0200 (CEST)
and I use (according to https://docs.python.org/2/library/datetime.html#strftime-strptime-behavior):
datetime.strptime(datetime_string, '%a %b %m %Y %H:%M:%S %z %Z')
but I get error:
ValueError: 'z' is a bad directive in format '%a %b %m %Y %H:%M:%S %z %Z'
How to do it correctly?
%z is the +0200, %Z is CEST. Therefore:
>>> s = "Tue Oct 04 2016 12:13:00 GMT+0200 (CEST)"
>>> datetime.strptime(s, '%a %b %d %Y %H:%M:%S GMT%z (%Z)')
datetime.datetime(2016, 10, 4, 12, 13, tzinfo=datetime.timezone(datetime.timedelta(0, 7200), 'CEST'))
I also replaced your %m with %d; %m is the month, numerically, so in your case 04 would be parsed as April.
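One caveat: as far as I know, strptime() only accepts UTC, GMT, or the names in time.tzname for %Z, so the (%Z) part may fail on a machine whose local zone is not CET/CEST. A sketch that sidesteps this by dropping the parenthesised name and relying on the numeric offset alone (Python 3; variable names are illustrative):

```python
from datetime import datetime

s = "Tue Oct 04 2016 12:13:00 GMT+0200 (CEST)"

# Cut off " (CEST)"; the numeric offset already carries the information we need.
without_name = s.split(" (")[0]
parsed = datetime.strptime(without_name, "%a %b %d %Y %H:%M:%S GMT%z")
print(parsed.isoformat())
```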
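Python's datetime can't parse the GMT part by itself (you might want to specify it as a literal in your format). You can use dateutil instead. Note that in this session the resulting offset is -7200 seconds, the opposite sign of the +0200 in the input, so check the offset before relying on it:
In [16]: s = 'Tue Oct 04 2016 12:13:00 GMT+0200 (CEST)'
In [17]: from dateutil import parser
In [18]: d = parser.parse(s); d
Out[18]: datetime.datetime(2016, 10, 4, 12, 13, tzinfo=tzoffset(u'CEST', -7200))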
In [30]: d.utcoffset()
Out[30]: datetime.timedelta(-1, 79200)
In [31]: d.tzname()
Out[31]: 'CEST'
A simpler way to do this, without having to deal with datetime format directives, is dateutil.parser.parse(). For example:
>>> import dateutil.parser
>>> date_string = 'Tue Oct 04 2016 12:13:00 GMT+0200 (CEST)'
>>> dateutil.parser.parse(date_string)
datetime.datetime(2016, 10, 4, 12, 13, tzinfo=tzoffset(u'CEST', -7200))
If you want to parse all the datetime data in a column of a pandas DataFrame, you can use the apply method together with dateutil.parser.parse to parse the whole column:
from dateutil.parser import parse
df['col_name'] = df['col_name'].apply(parse)
I want to convert below mentioned string to date object:
string_time = "06:13:19 25 March 2016 GMT (Europe/Ireland)"
date_object = datetime.strptime(string_time, "%H:%M:%S %d %B %Y %Z")
The only thing I am not able to convert is (Europe/Ireland)
Any hint would be highly appreciated.
Thanks
Use dateutil.parser.parse:
>>> import dateutil.parser
>>> string_time = "06:13:19 25 March 2016 GMT (Europe/Ireland)"
>>> dateutil.parser.parse(string_time.split('(')[0])
datetime.datetime(2016, 3, 25, 6, 13, 19, tzinfo=tzutc())
UPDATE
to add an hour to the time and display it in the original format: Use datetime.datetime.strftime and add timezone part.
>>> import datetime
>>> import dateutil.parser
>>> string_time = "06:13:19 25 March 2016 GMT (Europe/Ireland)"
>>> tz_part = string_time.split(None, 4)[-1]
>>> d = dateutil.parser.parse(string_time.rsplit(None, 1)[0])
>>> d2 = d + datetime.timedelta(hours=1)
>>> d2.strftime('%H:%M:%S %d %B %Y ') + tz_part
'07:13:19 25 March 2016 GMT (Europe/Ireland)'
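The same thing can be sketched with the standard library only, since strptime() accepts GMT for %Z. The parenthesised region is treated here as an opaque label and simply re-attached; that is an assumption, not a real timezone conversion.

```python
from datetime import datetime, timedelta

string_time = "06:13:19 25 March 2016 GMT (Europe/Ireland)"

# Split off " (Europe/Ireland)" so strptime() only sees what it can parse.
main_part, sep, region = string_time.partition(" (")
zone = main_part.rsplit(None, 1)[1]  # "GMT"

parsed = datetime.strptime(main_part, "%H:%M:%S %d %B %Y %Z")
later = parsed + timedelta(hours=1)

# Rebuild the original layout around the shifted time.
shifted = later.strftime("%H:%M:%S %d %B %Y") + " " + zone + sep + region
print(shifted)
```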
I have a line after split like in here:
lineaftersplit=Jan 31 00:57:07 2012 GMT
How do I get only year 2012 from this and compare if it falls between (2010) and (2013)
If lineaftersplit is a string value, you can use the datetime module to parse out the information, including the year:
import datetime
parsed_date = datetime.datetime.strptime(lineaftersplit, '%b %d %H:%M:%S %Y %Z')
if 2010 <= parsed_date.year <= 2013:
    pass  # year is between 2010 and 2013
This has the advantage that you can do further tests on the datetime object, including sorting and date arithmetic.
Demo:
>>> import datetime
>>> lineaftersplit="Jan 31 00:57:07 2012 GMT"
>>> parsed_date = datetime.datetime.strptime(lineaftersplit, '%b %d %H:%M:%S %Y %Z')
>>> parsed_date
datetime.datetime(2012, 1, 31, 0, 57, 7)
>>> parsed_date.year
2012
You can use str.rsplit:
>>> strs = 'Jan 31 00:57:07 2012 GMT'
str.rsplit will return a list like this:
>>> strs.rsplit(None,2)
['Jan 31 00:57:07', '2012', 'GMT']
Now we need the second item:
>>> year = strs.rsplit(None,2)[1]
>>> year
'2012'
>>> if 2010 <= int(year) <= 2013: #apply int() to get the integer value
...     pass  #do something
...
Try this:
st="Jan 31 00:57:07 2012 GMT".split()
year=int(st[3])
This works as long as the string is always in this format (the variable is named line to avoid shadowing the built-in str):
line = 'Jan 31 00:57:07 2012 GMT'
line.split()[3]
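Putting the parsing and the comparison together, a small helper could look like the sketch below. The function name year_in_range and its default bounds are made up for illustration, and it assumes the line always has the month, day, time, year, zone layout shown in the question.

```python
from datetime import datetime

def year_in_range(line, first=2010, last=2013):
    """Return True if the year in a 'Jan 31 00:57:07 2012 GMT' style line is within [first, last]."""
    # strptime() raises ValueError for lines that do not match the layout.
    year = datetime.strptime(line, "%b %d %H:%M:%S %Y %Z").year
    return first <= year <= last

inside = year_in_range("Jan 31 00:57:07 2012 GMT")
outside = year_in_range("Jan 31 00:57:07 2014 GMT")
print(inside, outside)
```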