I have a script that is parsing out fields within email headers that represent dates and times. Some examples of these strings are as follows:
Fri, 10 Jun 2011 11:04:17 +0200 (CEST)
Tue, 1 Jun 2011 11:04:17 +0200
Wed, 8 Jul 1992 4:23:11 -0200
Wed, 8 Jul 1992 4:23:11 -0200 EST
Before I was confronted with the CEST/EST portions at the end of some of the strings, I had things working pretty well just using datetime.datetime.strptime, like this:
msg['date'] = 'Wed, 8 Jul 1992 4:23:11 -0200'
mail_date = datetime.datetime.strptime(msg['date'][:-6], '%a, %d %b %Y %H:%M:%S')
I tried to put a regex together to match the date portions of the string while excluding the timezone information at the end, but I was having issues with the regex (I couldn't match a colon).
Is using a regex the best way to parse all of the examples above? If so, could someone share a regex that would match these examples? In the end I am looking to have a datetime object.
From python time to age part 2, timezones:
from email import utils
utils.parsedate_tz('Fri, 10 Jun 2011 11:04:17 +0200 (CEST)')
utils.parsedate_tz('Fri, 10 Jun 2011 11:04:17 +0200')
utils.parsedate_tz('Fri, 10 Jun 2011 11:04:17')
The output is:
(2011, 6, 10, 11, 4, 17, 0, 1, -1, 7200)
(2011, 6, 10, 11, 4, 17, 0, 1, -1, 7200)
(2011, 6, 10, 11, 4, 17, 0, 1, -1, None)
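To go from that tuple to a datetime object, utils.mktime_tz collapses the tuple and its offset into a POSIX timestamp; a sketch, assuming you want an aware UTC datetime at the end:

```python
import datetime
from email import utils

tup = utils.parsedate_tz('Fri, 10 Jun 2011 11:04:17 +0200 (CEST)')
ts = utils.mktime_tz(tup)  # POSIX timestamp; the +0200 offset is applied here
dt = datetime.datetime.fromtimestamp(ts, datetime.timezone.utc)
print(dt)  # 2011-06-10 09:04:17+00:00
```

Note that 11:04:17 at +0200 comes out as 09:04:17 UTC, which is the point of carrying the offset through instead of slicing it off.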
Perhaps I misunderstood your question, but won't a simple split suffice?
#!/usr/bin/python
d = ["Fri, 10 Jun 2011 11:04:17 +0200 (CEST)", "Tue, 1 Jun 2011 11:04:17 +0200",
"Wed, 8 Jul 1992 4:23:11 -0200", "Wed, 8 Jul 1992 4:23:11 -0200 EST"]
for i in d:
    print " ".join(i.split()[0:5])
Fri, 10 Jun 2011 11:04:17
Tue, 1 Jun 2011 11:04:17
Wed, 8 Jul 1992 4:23:11
Wed, 8 Jul 1992 4:23:11
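The trimmed string then feeds straight into strptime; a sketch, reusing the format string from the question:

```python
import datetime

s = "Fri, 10 Jun 2011 11:04:17 +0200 (CEST)"
trimmed = " ".join(s.split()[0:5])  # 'Fri, 10 Jun 2011 11:04:17'
dt = datetime.datetime.strptime(trimmed, "%a, %d %b %Y %H:%M:%S")
print(dt)  # 2011-06-10 11:04:17
```

Bear in mind this discards the timezone offset entirely, so the resulting datetime is naive.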
I'm trying to parse some log files in Python, but my reads always return only null bytes.
I've confirmed that the file in question does contain data:
$ zcat Events.log.gz | wc -c
188371128
$ zcat Events.log.gz | head
17 Jan 2018 08:10:35,863: {"deviceType":"A16ZV8BU3SN1N3",[REDACTED]}
17 Jan 2018 08:10:35,878: {"deviceType":"A1CTGXB4BA274T",[REDACTED]}
17 Jan 2018 08:10:35,886: {"deviceType":"A1DL2DVDQVK3Q",[REDACTED]}
17 Jan 2018 08:10:35,911: {"deviceType":"A2CZFJ2RKY7SE2",[REDACTED]}
17 Jan 2018 08:10:35,937: {"deviceType":"A2JTEGS8GUPDOF",[REDACTED]}
17 Jan 2018 08:10:35,963: {"appOtaState":"ota",[REDACTED]}
17 Jan 2018 08:10:35,971: {"deviceType":"A1DL2DVDQVK3Q",[REDACTED]}
17 Jan 2018 08:10:36,006: {"deviceType":"A2JTEGS8GUPDOF",[REDACTED]}
17 Jan 2018 08:10:36,013: {"deviceType":"A1CTGXB4BA274T",[REDACTED]}
17 Jan 2018 08:10:36,041: {"deviceType":"A1DL2DVDQVK3Q",[REDACTED]}
But attempting to read it in Python gives only null bytes:
$ python
Python 2.6.9 (unknown, Sep 14 2016, 17:46:59)
[GCC 4.4.6 20110731 (Red Hat 4.4.6-3)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> filename = 'Events.log.gz'
>>> import gzip
>>> content = gzip.open(filename).read()
>>> len(content)
188371128
>>> for i in range(10):
... content[i*10000:(i*10000)+10]
...
'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
I've tried explicitly setting 'mode' to either 'r' or 'rb', with no difference in result.
I've also tried subprocess.Popen(['zcat', filename], stdout=subprocess.PIPE).stdout.read(), with the same response.
Perhaps relevantly, when I tried to zcat the file to another file, the output was a binary file:
$ zcat Events.log.gz > /tmp/logoutput
$ less /tmp/logoutput
"/tmp/logoutput" may be a binary file. See it anyway?
[y]
^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#...
$ head /tmp/logoutput
17 Jan 2018 08:10:35,863: {"deviceType":"A16ZV8BU3SN1N3",[REDACTED]}
17 Jan 2018 08:10:35,878: {"deviceType":"A1CTGXB4BA274T",[REDACTED]}
17 Jan 2018 08:10:35,886: {"deviceType":"A1DL2DVDQVK3Q",[REDACTED]}
17 Jan 2018 08:10:35,911: {"deviceType":"A2CZFJ2RKY7SE2",[REDACTED]}
17 Jan 2018 08:10:35,937: {"deviceType":"A2JTEGS8GUPDOF",[REDACTED]}
17 Jan 2018 08:10:35,963: {"appOtaState":"ota",[REDACTED]}
17 Jan 2018 08:10:35,971: {"deviceType":"A1DL2DVDQVK3Q",[REDACTED]}
17 Jan 2018 08:10:36,006: {"deviceType":"A2JTEGS8GUPDOF",[REDACTED]}
17 Jan 2018 08:10:36,013: {"deviceType":"A1CTGXB4BA274T",[REDACTED]}
17 Jan 2018 08:10:36,041: {"deviceType":"A1DL2DVDQVK3Q",[REDACTED]}
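For what it's worth, gzip hands back the decompressed payload byte-for-byte, so NULs in the output mean NULs in the file itself (e.g. a log null-padded in place before compression). A minimal sketch with a synthetic in-memory file (Python 3 syntax; the behavior is the same in 2.6):

```python
import gzip
import io

# build a gzip member whose *payload* begins with NUL padding
payload = b"\x00" * 16 + b"17 Jan 2018 08:10:35,863: example line\n"
buf = io.BytesIO()
with gzip.GzipFile(fileobj=buf, mode="wb") as gz:
    gz.write(payload)

# reading it back returns the payload exactly as stored: the NULs come
# from the compressed data, not from the reader
with gzip.GzipFile(fileobj=io.BytesIO(buf.getvalue()), mode="rb") as gz:
    content = gz.read()
print(content[:16])  # sixteen NUL bytes, then the real data
```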
I have the following dataframe:
name Jan Feb Mar Apr May Jun Jul Aug \
0 IBM 156.08 160.01 159.81 165.22 172.25 167.15 164.75 152.77
1 MSFT 45.51 43.08 42.13 43.47 47.53 45.96 45.61 45.51
2 GOOGLE 512.42 537.99 559.72 540.50 535.24 532.92 590.09 636.84
3 APPLE 110.64 125.43 125.97 127.29 128.76 127.81 125.34 113.39
Sep Oct Nov Dec
0 145.36 146.11 137.21 137.96
1 43.56 48.70 53.88 55.40
2 617.93 663.59 735.39 755.35
3 112.80 113.36 118.16 111.73
Which I want to transform into the following:
Month AAPL GOOG IBM
0 Jan 117.160004 534.522445 153.309998
1 Feb 128.460007 558.402511 161.940002
2 Mar 124.430000 548.002468 160.500000
3 Apr 125.150002 537.340027 171.289993
4 May 130.279999 532.109985 169.649994
I've been fiddling around with melt and pivot but have no idea how to get this to work.
Any advice would be appreciated.
Thanks
Let's use set_index, rename_axis, and T for transpose:
df.set_index('name')\
.rename_axis(None).T\
.rename_axis('Month')\
.reset_index()
Output:
Month IBM MSFT GOOGLE APPLE
0 Jan 156.08 45.51 512.42 110.64
1 Feb 160.01 43.08 537.99 125.43
2 Mar 159.81 42.13 559.72 125.97
3 Apr 165.22 43.47 540.50 127.29
4 May 172.25 47.53 535.24 128.76
5 Jun 167.15 45.96 532.92 127.81
6 Jul 164.75 45.61 590.09 125.34
7 Aug 152.77 45.51 636.84 113.39
Creative Way
pd.DataFrame({'Month': df.columns[1:]}).assign(**{c: v for c, *v in df.values})
Month APPLE GOOGLE IBM MSFT
0 Jan 110.64 512.42 156.08 45.51
1 Feb 125.43 537.99 160.01 43.08
2 Mar 125.97 559.72 159.81 42.13
3 Apr 127.29 540.50 165.22 43.47
4 May 128.76 535.24 172.25 47.53
5 Jun 127.81 532.92 167.15 45.96
6 Jul 125.34 590.09 164.75 45.61
7 Aug 113.39 636.84 152.77 45.51
8 Sep 112.80 617.93 145.36 43.56
9 Oct 113.36 663.59 146.11 48.70
10 Nov 118.16 735.39 137.21 53.88
11 Dec 111.73 755.35 137.96 55.40
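For reference, here is the first chain run end-to-end on a small frame shaped like the question's (a sketch with only two names and two months):

```python
import pandas as pd

# a minimal frame shaped like the question's data
df = pd.DataFrame({
    "name": ["IBM", "MSFT"],
    "Jan": [156.08, 45.51],
    "Feb": [160.01, 43.08],
})

out = (df.set_index("name")   # make 'name' the row labels
         .rename_axis(None)   # drop the index name so it doesn't survive the transpose
         .T                   # swap rows and columns
         .rename_axis("Month")
         .reset_index())      # turn the month index back into a column
print(list(out.columns))  # ['Month', 'IBM', 'MSFT']
```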
set_index changes which column serves as the row labels: the data are grouped according to the index you specify, so setting the index to name is the key step that solves your issue.
I'm using
date = re.findall(r"^(?:\w{3} ){2}\d{2} (?:[\d]{2}:){2}\d{2} \d{4}$", message)
in Python 2.7 to extract the substrings:
Wed Feb 04 13:29:49 2015
Thu Feb 05 13:45:08 2015
from a log file like this:
1424,Wed Feb 04 13:29:49 2015,51
1424,Thu Feb 05 13:45:08 2015,29
It is not working, and I'm required to use regex for this task, otherwise I would have split() it. What am I doing wrong?
Since your substrings don't start at the beginning of the string, you don't need to assert position at the start and end of the string, so you can remove the ^ and $ anchors:
>>> s ="""
1424,Wed Feb 04 13:29:49 2015,51
1424,Thu Feb 05 13:45:08 2015,29"""
>>> date = re.findall(r"(?:\w{3} ){2}\d{2} (?:[\d]{2}:){2}\d{2} \d{4}", s)
>>> date
['Wed Feb 04 13:29:49 2015', 'Thu Feb 05 13:45:08 2015']
Alternatively, you can use a positive look-behind:
>>> date = re.findall(r"(?<=\d{4},).*", s)
>>> date
['Wed Feb 04 13:29:49 2015,51', 'Thu Feb 05 13:45:08 2015,29']
Or, without using regex, you can use str.split() and str.partition() for such tasks:
>>> s ="""
1424,Wed Feb 04 13:29:49 2015,51
1424,Thu Feb 05 13:45:08 2015,29"""
>>> [i.partition(',')[-1] for i in s.split('\n')]
['Wed Feb 04 13:29:49 2015,51', 'Thu Feb 05 13:45:08 2015,29']
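If the end goal is datetime objects rather than strings, the matches from the first findall parse cleanly with strptime (a sketch; the format string is inferred from the samples above):

```python
import datetime
import re

s = "1424,Wed Feb 04 13:29:49 2015,51\n1424,Thu Feb 05 13:45:08 2015,29"
matches = re.findall(r"(?:\w{3} ){2}\d{2} (?:\d{2}:){2}\d{2} \d{4}", s)
dates = [datetime.datetime.strptime(m, "%a %b %d %H:%M:%S %Y") for m in matches]
print(dates[0])  # 2015-02-04 13:29:49
```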
A simple way to do this is to just match between the commas:
message = '1424,Wed Feb 04 13:29:49 2015,51 1424,Thu Feb 05 13:45:08 2015,29'
date = re.findall(r",(.*?),", message)
print date
>>> ['Wed Feb 04 13:29:49 2015', 'Thu Feb 05 13:45:08 2015']
You don't need regex; use split:
line = "1424,Wed Feb 04 13:29:49 2015,51"
date = line.split(",")[1]
print date
>>>Wed Feb 04 13:29:49 2015
I have a list which goes as follows:
-------------------------------------------------------------------------------------------
www.mydomain.de UP Thu May 8 09:10:57 2014
HTTPS OK Thu May 8 09:10:08 2014
HTTPS-Cert OK Thu May 8 09:10:55 2014
-------------------------------------------------------------------------------------------
www.someotherdomain.de UP Thu May 8 09:09:17 2014
HTTPS OK Thu May 8 09:09:30 2014
HTTPS-Cert OK Thu May 8 09:11:10 2014
-------------------------------------------------------------------------------------------
www.somedifferentdomain.at UP Thu May 8 09:08:47 2014
HTTPS OK Thu May 8 09:10:26 2014
HTTPS-Cert OK Thu May 8 09:11:13 2014
-------------------------------------------------------------------------------------------
www.foobladomain.de UP Thu May 8 09:09:17 2014
HTTPS OK Thu May 8 09:09:30 2014
HTTPS-Cert OK Thu May 8 09:11:08 2014
-------------------------------------------------------------------------------------------
www.snafudomain.at UP Thu May 8 09:09:17 2014
HTTP OK Thu May 8 09:09:42 2014
HTTPS OK Thu May 8 09:10:10 2014
HTTPS-Cert OK Thu May 8 09:10:09 2014
-------------------------------------------------------------------------------------------
www.lolnotanotherdomain.de UP Thu May 8 09:06:57 2014
HTTP OK Thu May 8 09:11:10 2014
HTTPS OK Thu May 8 09:11:16 2014
HTTPS-Cert OK Thu May 8 09:11:10 2014
and I have a function which takes a hostname as parameter and prints the matching line:
please enter hostname to search for: www.snafudomain.at
www.snafudomain.at UP Thu May 8 09:09:17 2014
but what I want to achieve is that the lines following the hostname are also printed, up to the delimiter line of minus signs. The function currently looks like this:
def getChecks(self, hostname):
    re0 = "%s" % hostname
    mylist = open('myhostlist', 'r')
    for i in mylist:
        if re.findall("^%s" % re0, str(i)):
            print i
        else:
            continue
Is there some easy way to do this? If something is unclear, please comment. Thanks in advance.
Edit: to clarify, the output should look like this:
www.mydomain.de UP Thu May 8 09:10:57 2014
HTTPS OK Thu May 8 09:10:08 2014
HTTPS-Cert OK Thu May 8 09:10:55 2014
-------------------------------------------------------------------------------------
I just want to print the lines from the searched domain name up to the line containing only minus signs.
How about not using regex at all?
def get_checks(self, hostname):
    record = False
    with open('myhostlist', 'r') as file_h:
        for line in file_h:
            if line.startswith(hostname):
                record = True
                print(line)
            elif line.startswith("---"):
                if record:
                    print(line)  # print the closing delimiter, then stop
                record = False
            elif record:
                print(line)
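The flag-based approach can be sanity-checked against an in-memory sample (a sketch; the delimiter line closes the block and stops recording):

```python
sample = [
    "www.snafudomain.at UP Thu May 8 09:09:17 2014",
    "HTTP OK Thu May 8 09:09:42 2014",
    "HTTPS OK Thu May 8 09:10:10 2014",
    "---------------------------------------------",
    "www.lolnotanotherdomain.de UP Thu May 8 09:06:57 2014",
    "HTTP OK Thu May 8 09:11:10 2014",
]

record = False
block = []
for line in sample:
    if line.startswith("www.snafudomain.at"):
        record = True
        block.append(line)
    elif line.startswith("---"):
        if record:
            block.append(line)  # keep the closing delimiter
        record = False
    elif record:
        block.append(line)

print(len(block))  # 4: hostname line, two status lines, delimiter
```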
import re

def get_checks(hostname):
    pattern = re.compile(r"{}.*?(?=---)".format(re.escape(hostname)), re.S)
    with open("Input.txt") as in_file:
        return re.search(pattern, in_file.read())

print get_checks("www.snafudomain.at").group()
This returns all the lines starting with www.snafudomain.at until it finds ---. The generated pattern looks like this:
www\.snafudomain\.at.*?(?=---)
We use re.escape because the hostname contains dots. Since . has special meaning in regular expressions, we want the regex engine to treat it as a literal dot.
I've got a series of pipes to convert dates in a text file into unique, human readable output and pull out MM DD. Now I would like to resort the output so that the dates display in the order in which they occur during the year. Anybody know a good technique using the standard shell or with a readily installable package on *nix?
Feb 4
Feb 5
Feb 6
Feb 7
Feb 8
Jan 1
Jan 10
Jan 11
Jan 12
Jan 13
Jan 2
Jan 25
Jan 26
Jan 27
Jan 28
Jan 29
Jan 3
Jan 30
Jan 31
Jan 4
Jan 5
Jan 6
Jan 7
Jan 8
Jan 9
There is a utility called sort with an option -M for sorting by month. If you have it installed, you could use that. For instance:
sort -k1 -M test.txt
-k1: First column
-M: Sort by month
Edited per twalberg's suggestion below:
sort -k1,1M -k2,2n test.txt
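A quick check of the combined key spec (assuming GNU coreutils sort, with month names compared in the C locale):

```shell
printf 'Feb 4\nJan 10\nJan 2\n' | sort -k1,1M -k2,2n
# Jan 2
# Jan 10
# Feb 4
```

The second key is needed because a plain textual sort would put "Jan 10" before "Jan 2", as in the question's input.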
In two steps:
$ while read line; do date -d "$line" "+%Y%m%d"; done < file | sort -n > temp
$ while read line; do date -d "$line" "+%b %d"; done < temp > file
First, we convert the dates to YYYYMMDD and sort them numerically:
$ while read line; do date -d "$line" "+%Y%m%d"; done < file | sort -n > temp
$ cat temp
20130101
20130102
20130103
20130104
20130105
20130106
20130107
20130108
20130109
20130110
20130111
20130112
20130113
20130125
20130126
20130127
20130128
20130129
20130130
20130131
20130204
20130205
20130206
20130207
20130208
Then we convert them back to the previous %b %d format:
$ while read line; do date -d "$line" "+%b %d"; done < temp > file
$ cat file
Jan 01
Jan 02
Jan 03
Jan 04
Jan 05
Jan 06
Jan 07
Jan 08
Jan 09
Jan 10
Jan 11
Jan 12
Jan 13
Jan 25
Jan 26
Jan 27
Jan 28
Jan 29
Jan 30
Jan 31
Feb 04
Feb 05
Feb 06
Feb 07
Feb 08
And with sed (building a month lookup table in the hold space): sed -n "1 {
H
x
s/.*\(\n\).*/01 Jan\102 Feb\103 Mar\104 Apr\105 May\106 Jun\107 Jul\108 Aug\109 Sep\110 Oct\111 Nov\112 Dec/
x
}
s/^\(.\{3\}\) \([0-9]\) *$/\1 0\2/
H
$ {
x
t subs
: subs
s/^\([0-9]\{2\}\) \([[:alpha:]]\{3\}\)\(\n\)\(.*\)\n\2/\1 \2\3\4\3\1 \2/
t subs
s/^[0-9]\{2\} [[:alpha:]]\{3\}\n//
t subs
p
}
" | sort | sed "s/^[0-9][0-9] //"
This still needs the sort (or a far more complex sed to do the sorting in place), but it is an option when sort -M isn't available.