Python regex is not extracting a substring from my log file - python

I'm using
date = re.findall(r"^(?:\w{3} ){2}\d{2} (?:[\d]{2}:){2}\d{2} \d{4}$", message)
in Python 2.7 to extract the substrings:
Wed Feb 04 13:29:49 2015
Thu Feb 05 13:45:08 2015
from a log file like this:
1424,Wed Feb 04 13:29:49 2015,51
1424,Thu Feb 05 13:45:08 2015,29
It is not working, and I'm required to use regex for this task, otherwise I would have split() it. What am I doing wrong?

Your substrings don't start at the beginning of the string, so you don't need to assert the start and end positions; you can remove ^ and $:
>>> s ="""
1424,Wed Feb 04 13:29:49 2015,51
1424,Thu Feb 05 13:45:08 2015,29"""
>>> date = re.findall(r"(?:\w{3} ){2}\d{2} (?:[\d]{2}:){2}\d{2} \d{4}", s)
>>> date
['Wed Feb 04 13:29:49 2015', 'Thu Feb 05 13:45:08 2015']
As an alternative, you can use a positive look-behind (note that this also returns the trailing field after the date):
>>> date = re.findall(r"(?<=\d{4},).*", s)
>>> date
['Wed Feb 04 13:29:49 2015,51', 'Thu Feb 05 13:45:08 2015,29']
Or, without using regex, you can use str.split() and str.partition() for such tasks:
>>> s ="""
1424,Wed Feb 04 13:29:49 2015,51
1424,Thu Feb 05 13:45:08 2015,29"""
>>> [i.partition(',')[-1] for i in s.strip().split('\n')]
['Wed Feb 04 13:29:49 2015,51', 'Thu Feb 05 13:45:08 2015,29']
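A minimal sketch (Python 3 syntax rather than the asker's 2.7) comparing the anchored pattern from the question with the unanchored one, plus a lookaround variant that returns only the date field. The variable names and the `[^,\n]+` lookaround pattern are illustrative assumptions, not part of the answer above.

```python
import re

log = "1424,Wed Feb 04 13:29:49 2015,51\n1424,Thu Feb 05 13:45:08 2015,29"

date_pattern = r"(?:\w{3} ){2}\d{2} (?:\d{2}:){2}\d{2} \d{4}"

# Anchored at both ends: the whole string would have to be a single date.
anchored = re.findall("^" + date_pattern + "$", log)

# Unanchored: the date may appear anywhere in the text.
unanchored = re.findall(date_pattern, log)

# Lookarounds: take whatever sits between two commas on one line
# (assumes the date field itself never contains a comma).
between_commas = re.findall(r"(?<=,)[^,\n]+(?=,)", log)

print(anchored)
print(unanchored)
print(between_commas)
```

The anchored call finds nothing because each line starts with the numeric ID, not with the weekday.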

A simple way to do this is to match on the commas:
message = '1424,Wed Feb 04 13:29:49 2015,51 1424,Thu Feb 05 13:45:08 2015,29'
date = re.findall(r",(.*?),", message)
print date
>>> ['Wed Feb 04 13:29:49 2015', 'Thu Feb 05 13:45:08 2015']

You don't need regex; use split.
line = "1424,Wed Feb 04 13:29:49 2015,51"
date = line.split(",")[1]
print date
>>>Wed Feb 04 13:29:49 2015

Check for blanks at specified positions in a string

I have the following problem, which I have been able to solve in a very long way and I would like to know if there is any other way to solve it. I have the following string structure:
text = 01 ARA 22 - 02 GAG 23
But due to processing sometimes the spaces are not added properly and it may look like this:
text = 04 GOR23- 02 OER 23
text = 04 ORO 21-02 RRO 24
text = 04 DRE25- 12 RIS21
When they should look as follows:
text = 04 GOR 23 - 02 OER 23
text = 04 ORO 21 - 02 RRO 24
text = 04 DRE 25 - 12 RIS 21
To add the spaces at those specific positions, I check whether a space exists at each position in the string and add one if it doesn't.
Is there another way in python to do it more efficiently?
I appreciate any advice.
You can use a regex to capture each of the components in the text, and then replace any missing spaces with a space:
import re
regex = re.compile(r'(\d{2})\s*([A-Z]{3})\s*(\d{2})\s*-\s*(\d{2})\s*([A-Z]{3})\s*(\d{2})')
text = ['04 GOR23- 02 OER 23',
'04 ORO 21-02 RRO 24',
'04 DRE25- 12 RIS21']
[regex.sub(r'\1 \2 \3 - \4 \5 \6', t) for t in text]
Output:
['04 GOR 23 - 02 OER 23',
'04 ORO 21 - 02 RRO 24',
'04 DRE 25 - 12 RIS 21']
Here is another way to do so:
data = '04 GOR23- 02 OER 23'
new_data = "".join(char if i not in [2, 5, 7, 8, 10, 13] else f" {char}" for i, char in enumerate(data.replace(" ", "")))
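The regex approach above can also be wrapped in a small helper that rejects input it does not recognise instead of passing it through unchanged. This is only a sketch; `normalize_spacing` and the sample list are hypothetical names, and it assumes every record has exactly the six fields shown in the question.

```python
import re

# Six fields: two digits, three capital letters, two digits, on each side of a dash.
FIELD_RE = re.compile(
    r"(\d{2})\s*([A-Z]{3})\s*(\d{2})\s*-\s*(\d{2})\s*([A-Z]{3})\s*(\d{2})"
)

def normalize_spacing(text):
    """Return text with canonical spacing, or None if the shape is unexpected."""
    match = FIELD_RE.fullmatch(text.strip())
    if match is None:
        return None
    first = " ".join(match.group(1, 2, 3))
    second = " ".join(match.group(4, 5, 6))
    return first + " - " + second

samples = ["04 GOR23- 02 OER 23", "04 ORO 21-02 RRO 24", "04 DRE25- 12 RIS21", "bad input"]
results = [normalize_spacing(s) for s in samples]
print(results)
```

Returning None for unmatched input makes malformed records visible, which a plain `sub` would silently leave as they are.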

Python gzip gives null bytes

I'm trying to parse some log files in Python, but my responses always return only null bytes.
I've confirmed that the file in question does contain data:
$ zcat Events.log.gz | wc -c
188371128
$ zcat Events.log.gz | head
17 Jan 2018 08:10:35,863: {"deviceType":"A16ZV8BU3SN1N3",[REDACTED]}
17 Jan 2018 08:10:35,878: {"deviceType":"A1CTGXB4BA274T",[REDACTED]}
17 Jan 2018 08:10:35,886: {"deviceType":"A1DL2DVDQVK3Q",[REDACTED]}
17 Jan 2018 08:10:35,911: {"deviceType":"A2CZFJ2RKY7SE2",[REDACTED]}
17 Jan 2018 08:10:35,937: {"deviceType":"A2JTEGS8GUPDOF",[REDACTED]}
17 Jan 2018 08:10:35,963: {"appOtaState":"ota",[REDACTED]}
17 Jan 2018 08:10:35,971: {"deviceType":"A1DL2DVDQVK3Q",[REDACTED]}
17 Jan 2018 08:10:36,006: {"deviceType":"A2JTEGS8GUPDOF",[REDACTED]}
17 Jan 2018 08:10:36,013: {"deviceType":"A1CTGXB4BA274T",[REDACTED]}
17 Jan 2018 08:10:36,041: {"deviceType":"A1DL2DVDQVK3Q",[REDACTED]}
But attempting to read it in Python gives only null bytes:
$ python
Python 2.6.9 (unknown, Sep 14 2016, 17:46:59)
[GCC 4.4.6 20110731 (Red Hat 4.4.6-3)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> filename = 'Events.log.gz'
>>> import gzip
>>> content = gzip.open(filename).read()
>>> len(content)
188371128
>>> for i in range(10):
...     content[i*10000:(i*10000)+10]
...
'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
I've tried explicitly setting 'mode' to either 'r' or 'rb', with no difference in result.
I've also tried subprocess.Popen(['zcat', filename], stdout=subprocess.PIPE).stdout.read(), with the same response.
Perhaps relevantly, when I tried to zcat the file to another file, the output was a binary file:
$ zcat Events.log.gz > /tmp/logoutput
$ less /tmp/logoutput
"/tmp/logoutput" may be a binary file. See it anyway?
[y]
^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#...
$ head /tmp/logoutput
17 Jan 2018 08:10:35,863: {"deviceType":"A16ZV8BU3SN1N3",[REDACTED]}
17 Jan 2018 08:10:35,878: {"deviceType":"A1CTGXB4BA274T",[REDACTED]}
17 Jan 2018 08:10:35,886: {"deviceType":"A1DL2DVDQVK3Q",[REDACTED]}
17 Jan 2018 08:10:35,911: {"deviceType":"A2CZFJ2RKY7SE2",[REDACTED]}
17 Jan 2018 08:10:35,937: {"deviceType":"A2JTEGS8GUPDOF",[REDACTED]}
17 Jan 2018 08:10:35,963: {"appOtaState":"ota",[REDACTED]}
17 Jan 2018 08:10:35,971: {"deviceType":"A1DL2DVDQVK3Q",[REDACTED]}
17 Jan 2018 08:10:36,006: {"deviceType":"A2JTEGS8GUPDOF",[REDACTED]}
17 Jan 2018 08:10:36,013: {"deviceType":"A1CTGXB4BA274T",[REDACTED]}
17 Jan 2018 08:10:36,041: {"deviceType":"A1DL2DVDQVK3Q",[REDACTED]}
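One way to narrow this down: the slices above only sample ten bytes every 10,000, so counting NUL bytes over the whole decompressed stream would show whether the data is all zeros or log text mixed with padding. A rough sketch in Python 3 syntax; `count_nul_bytes` and the demo file are hypothetical stand-ins, not the asker's data.

```python
import gzip
import os
import tempfile

def count_nul_bytes(path, chunk_size=1 << 20):
    """Stream a gzip file and return (total_bytes, nul_bytes)."""
    total = nul = 0
    with gzip.open(path, "rb") as fh:
        while True:
            chunk = fh.read(chunk_size)
            if not chunk:
                break
            total += len(chunk)
            nul += chunk.count(b"\x00")
    return total, nul

# Hypothetical demo file: NUL padding followed by one log-like line.
tmp_dir = tempfile.mkdtemp()
demo_path = os.path.join(tmp_dir, "demo.log.gz")
with gzip.open(demo_path, "wb") as fh:
    fh.write(b"\x00" * 1000 + b'17 Jan 2018 08:10:35,863: {"k":"v"}\n')

total, nul = count_nul_bytes(demo_path)
print(total, nul)
```

Reading in chunks avoids holding the full 188 MB in memory at once.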

Using re.search in python filter function

I am not able to use re.search inside a filter expression.
I am trying to use re.search to extract the href values from a list where each element is a html line.
Here is what I am doing:
>>> filter(lambda html_line: re.search('.*a href=\"([^\"]*).*', html_line), data)
[u'Directory Feb 28 23:57 <b>2014.02.28</b>'
u'Directory Mar 01 23:59 <b>2014.03.01</b>'
u'Directory Mar 02 23:50 <b>2014.03.02</b>'
u'Directory Mar 03 23:59 <b>2014.03.03</b>'
u'Directory Mar 04 23:50 <b>2014.03.04</b>'
u'Directory Mar 05 23:50 <b>2014.03.05</b>'
u'Directory Mar 06 23:50 <b>2014.03.06</b>'
u'Directory Mar 07 23:50 <b>2014.03.07</b>'
u'Directory Mar 08 23:50 <b>2014.03.08</b>']
My re.search call seems to be working correctly.
For example, this works:
>>> for html_line in data:
...     print re.search('.*a href=\"([^\"]*).*', html_line).group(1)
/MyApp/LogBrowser?type=crawler/2014.02.28
/MyApp/LogBrowser?type=crawler/2014.03.01
/MyApp/LogBrowser?type=crawler/2014.03.02
/MyApp/LogBrowser?type=crawler/2014.03.03
/MyApp/LogBrowser?type=crawler/2014.03.04
/MyApp/LogBrowser?type=crawler/2014.03.05
/MyApp/LogBrowser?type=crawler/2014.03.06
/MyApp/LogBrowser?type=crawler/2014.03.07
/MyApp/LogBrowser?type=crawler/2014.03.08
filter only selects which items to keep; it won't return the href value. You can use a list comprehension for this:
r = re.compile(r'.*a href=\"([^\"]*).*')
data = [x.group(1) for x in (r.search(html_line) for html_line in data)
        if x is not None]
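To make the difference concrete, here is a small sketch in Python 3 syntax. The two sample lines and the shortened pattern are made up for illustration; the asker's real lines are not shown in full.

```python
import re

HREF_RE = re.compile(r'a href="([^"]*)')

# Hypothetical stand-in lines, one with a link and one without.
lines = [
    'Directory <a href="/MyApp/LogBrowser?type=crawler/2014.02.28"><b>2014.02.28</b></a>',
    'no link on this line',
]

# filter() keeps or drops whole elements based on the truthiness of the result.
kept = list(filter(HREF_RE.search, lines))

# A comprehension can both test the match and extract the captured group.
hrefs = [m.group(1) for m in map(HREF_RE.search, lines) if m is not None]

print(kept)
print(hrefs)
```

`kept` holds the original line unchanged, while `hrefs` holds only the captured URL.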

Order a sequence of dates as they occur in calendar year

I've got a series of pipes to convert dates in a text file into unique, human-readable output and pull out MM DD. Now I would like to re-sort the output so that the dates display in the order in which they occur during the year. Does anybody know a good technique using the standard shell or a readily installable package on *nix?
Feb 4
Feb 5
Feb 6
Feb 7
Feb 8
Jan 1
Jan 10
Jan 11
Jan 12
Jan 13
Jan 2
Jan 25
Jan 26
Jan 27
Jan 28
Jan 29
Jan 3
Jan 30
Jan 31
Jan 4
Jan 5
Jan 6
Jan 7
Jan 8
Jan 9
There is a utility called sort with an option -M for sorting by month. If you have it installed, you could use that. For instance:
sort -k1 -M test.txt
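-k1: sort key starting at the first column
-M: compare as month names
Edited per twalberg's suggestion below, so that the month is the primary key and the day is compared numerically: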
sort -k1,1M -k2,2n test.txt
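For comparison, the same two-key ordering (month first, then numeric day) can be sketched in Python. This is an illustration of the idea rather than a replacement for the shell command; `calendar_key` is a hypothetical helper and it assumes English three-letter month names.

```python
# English month abbreviations in calendar order (assumed input format: "Mon D").
MONTHS = "Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec".split()
MONTH_INDEX = {abbr: i for i, abbr in enumerate(MONTHS, start=1)}

def calendar_key(line):
    """Sort key: (month number, day as an integer)."""
    month, day = line.split()
    return MONTH_INDEX[month], int(day)

dates = ["Feb 4", "Jan 10", "Jan 2", "Jan 1", "Feb 5"]
ordered = sorted(dates, key=calendar_key)
print(ordered)
```

Converting the day to an integer is what keeps "Jan 2" ahead of "Jan 10", the same job `-k2,2n` does for sort.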
In two steps:
$ while read line; do date -d "$line" "+%Y%m%d"; done < file | sort -n > temp
$ while read line; do date -d "$line" "+%b %d"; done < temp > file
Firstly we convert dates to YYYYMMDD and order them:
$ while read line; do date -d "$line" "+%Y%m%d"; done < file | sort -n > temp
$ cat temp
20130101
20130102
20130103
20130104
20130105
20130106
20130107
20130108
20130109
20130110
20130111
20130112
20130113
20130125
20130126
20130127
20130128
20130129
20130130
20130131
20130204
20130205
20130206
20130207
20130208
Then we print them back to previous format %b %d:
$ while read line; do date -d "$line" "+%b %d"; done < temp > file
$ cat file
Jan 01
Jan 02
Jan 03
Jan 04
Jan 05
Jan 06
Jan 07
Jan 08
Jan 09
Jan 10
Jan 11
Jan 12
Jan 13
Jan 25
Jan 26
Jan 27
Jan 28
Jan 29
Jan 30
Jan 31
Feb 04
Feb 05
Feb 06
Feb 07
Feb 08
Using sed and sort:
sed -n "1 {
H
x
s/.(\n)./01 Jan\102 Feb\103 Mar\104 Apr\105 May\106 Jun\107 Jul\108 Aug\109 Sep\110 Oct\111 Nov\112 Dec/
x
}
s/^\(.\{3\}\) \([0-9]\) *$/\1 0\2/
H
$ {
x
t subs
: subs
s/^\([0-9]\{2\}\) \([[:alpha:]]\{3\}\)\(\n\)\(.*\)\n\2/\1 \2\3\4\3\1 \2/
t subs
s/^[0-9]\{2\} [[:alpha:]]\{3\}\n//
t subs
p
}
" | sort | sed "s/^[0-9][0-9] //"
This still needs a sort (or a much more complex sed script to do the sorting itself); it is meant for cases where sort -M doesn't work.

Best way to extract datetime from string in python

I have a script that is parsing out fields within email headers that represent dates and times. Some examples of these strings are as follows:
Fri, 10 Jun 2011 11:04:17 +0200 (CEST)
Tue, 1 Jun 2011 11:04:17 +0200
Wed, 8 Jul 1992 4:23:11 -0200
Wed, 8 Jul 1992 4:23:11 -0200 EST
Before I was confronted with the CEST/EST portions at the ends of some of the strings, I had things working pretty well just using datetime.datetime.strptime like this:
msg['date'] = 'Wed, 8 Jul 1992 4:23:11 -0200'
mail_date = datetime.datetime.strptime(msg['date'][:-6], '%a, %d %b %Y %H:%M:%S')
I tried to put a regex together to match the date portions of the string while excluding the timezone information at the end, but I was having issues with the regex (I couldn't match a colon).
Is using a regex the best way to parse all of the examples above? If so, could someone share a regex that would match these examples? In the end I am looking to have a datetime object.
From python time to age part 2, timezones:
from email import utils
utils.parsedate_tz('Fri, 10 Jun 2011 11:04:17 +0200 (CEST)')
utils.parsedate_tz('Fri, 10 Jun 2011 11:04:17 +0200')
utils.parsedate_tz('Fri, 10 Jun 2011 11:04:17')
The output is:
(2011, 6, 10, 11, 4, 17, 0, 1, -1, 7200)
(2011, 6, 10, 11, 4, 17, 0, 1, -1, 7200)
(2011, 6, 10, 11, 4, 17, 0, 1, -1, None)
Perhaps I misunderstood your question, but won't a simple split suffice?
#!/usr/bin/python
d = ["Fri, 10 Jun 2011 11:04:17 +0200 (CEST)", "Tue, 1 Jun 2011 11:04:17 +0200",
"Wed, 8 Jul 1992 4:23:11 -0200", "Wed, 8 Jul 1992 4:23:11 -0200 EST"]
for i in d:
    print " ".join(i.split()[0:5])
Fri, 10 Jun 2011 11:04:17
Tue, 1 Jun 2011 11:04:17
Wed, 8 Jul 1992 4:23:11
Wed, 8 Jul 1992 4:23:11
