I'd like to use Python to analyse /var/log/monthly.out on OS X to export user accounting totals. The log file looks like this:
Mon Feb 1 09:12:41 GMT 2016
Rotating fax log files:
Doing login accounting:
total 688.31
example 401.12
_mbsetupuser 287.10
root 0.05
admin 0.04
-- End of monthly output --
Tue Feb 16 14:27:21 GMT 2016
Rotating fax log files:
Doing login accounting:
total 0.00
-- End of monthly output --
Thu Mar 3 09:37:31 GMT 2016
Rotating fax log files:
Doing login accounting:
total 377.92
example 377.92
-- End of monthly output --
I was able to extract the username / totals pairs with this regex:
\t(\w*)\W*(\d*\.\d{2})
In Python:
>>> import re
>>> re.findall(r'\t(\w*)\W*(\d*\.\d{2})', open('/var/log/monthly.out', 'r').read())
[('total', '688.31'), ('example', '401.12'), ('_mbsetupuser', '287.10'), ('root', '0.05'), ('admin', '0.04'), ('total', '0.00'), ('total', '377.92'), ('example', '377.92')]
But I can't figure out how to extract the date line so that it stays attached to the username / totals pairs for that month.
Use str.split().
import re
re_user_amount = r'\s+(\w+)\s+(\d*\.\d{2})'
re_date = r'\w{3}\s+\w{3}\s+\d+\s+\d\d:\d\d:\d\d \w+ \d{4}'
with open('/var/log/monthly.out', 'r') as f:
    content = f.read()
sections = content.split('-- End of monthly output --')
for section in sections:
    date = re.findall(re_date, section)
    matches = re.findall(re_user_amount, section)
    print(date, matches)
If you want to turn the date string into an actual datetime, check out Converting string into datetime.
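As a minimal sketch of that conversion, assuming the date line keeps the exact layout shown in the question and that the process runs in an English or C locale (so that "Mon" and "Feb" are recognised), the standard library can parse it directly:

```python
from datetime import datetime

# Sample date line copied from the log in the question (day padded with a space).
date_line = "Mon Feb  1 09:12:41 GMT 2016"

# %Z accepts the literal zone name "GMT"; the result is a naive datetime.
parsed = datetime.strptime(date_line, "%a %b %d %H:%M:%S %Z %Y")
print(parsed.isoformat())
```

Whitespace in the format string matches any run of whitespace in the input, so the space-padded day should not need special handling here.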
Well, there's rarely a magical cure for everything based on regex. Regular expressions are a great tool
for simple string parsing, but they should not replace good old programming!
So if you look at your data, you'll notice that each report always starts with a date and ends with the
-- End of monthly output -- line. So a nice way to handle that is to split your data
by each monthly output.
Let's start with your data:
>>> s = """\
... Mon Feb 1 09:12:41 GMT 2016
...
... Rotating fax log files:
...
... Doing login accounting:
... total 688.31
... example 401.12
... _mbsetupuser 287.10
... root 0.05
... admin 0.04
...
... -- End of monthly output --
...
... Tue Feb 16 14:27:21 GMT 2016
...
... Rotating fax log files:
...
... Doing login accounting:
... total 0.00
...
... -- End of monthly output --
...
... Thu Mar 3 09:37:31 GMT 2016
...
... Rotating fax log files:
...
... Doing login accounting:
... total 377.92
... example 377.92
...
... -- End of monthly output --"""
And let's split it on that end-of-month line:
>>> reports = s.split('-- End of monthly output --')
>>> reports
['Mon Feb 1 09:12:41 GMT 2016\n\nRotating fax log files:\n\nDoing login accounting:\n total 688.31\n example 401.12\n _mbsetupuser 287.10\n root 0.05\n admin 0.04\n\n', '\n\nTue Feb 16 14:27:21 GMT 2016\n\nRotating fax log files:\n\nDoing login accounting:\n total 0.00\n\n', '\n\nThu Mar 3 09:37:31 GMT 2016\n\nRotating fax log files:\n\nDoing login accounting:\n total 377.92\n example 377.92\n\n', '']
Then you can separate the accounting data from the rest of the log:
>>> report = reports[0]
>>> head, tail = report.split('Doing login accounting:')
Now let's extract the date line:
>>> date_line = head.strip().split('\n')[0]
And fill up a dict with those username/totals pairs:
>>> accounting = dict(zip(tail.split()[::2], tail.split()[1::2]))
The trick here is to use zip() to create pairs out of two slices of tail.split(). The "left"
side of each pair comes from the slice starting at index 0 and taking every second item; the "right"
side comes from the slice starting at index 1 and taking every second item. Which makes:
{'admin': '0.04', 'root': '0.05', 'total': '688.31', '_mbsetupuser': '287.10', 'example': '401.12'}
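The same pairing can be sketched on its own, here with the amounts converted to float. The sample text is the first report's accounting block from the question; the names tokens, names and amounts are only illustrative:

```python
# Text that follows "Doing login accounting:" in one report (sample from the question).
tail = """
    total 688.31
    example 401.12
    _mbsetupuser 287.10
    root 0.05
    admin 0.04
"""

tokens = tail.split()      # ['total', '688.31', 'example', '401.12', ...]
names = tokens[::2]        # every second item, starting at index 0
amounts = tokens[1::2]     # every second item, starting at index 1

# Pair each name with its amount and convert the amount to a number.
accounting = {name: float(amount) for name, amount in zip(names, amounts)}
print(accounting["example"])
```

This relies on user names containing no whitespace, which holds for the sample but is an assumption about the log in general.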
So now that's done, you can do that in a for loop:
import datetime

def parse_monthly_log(log_path='/var/log/monthly.out'):
    with open(log_path, 'r') as log:
        reports = log.read().strip('\n ').split('-- End of monthly output --')
    # Skip empty chunks left over from the split.
    for report in filter(lambda it: it, reports):
        head, tail = report.split('Doing login accounting:')
        date_line = head.strip().split('\n')[0]
        accounting = dict(zip(tail.split()[::2], tail.split()[1::2]))
        yield {
            # Turn the space-padded day ("Feb  1") into a zero-padded one ("Feb 01").
            'date': datetime.datetime.strptime(date_line.replace('  ', ' 0'), '%a %b %d %H:%M:%S %Z %Y'),
            'accounting': accounting
        }
>>> import pprint
>>> pprint.pprint(list(parse_monthly_log()), indent=2)
[ { 'accounting': { '_mbsetupuser': '287.10',
'admin': '0.04',
'example': '401.12',
'root': '0.05',
'total': '688.31'},
'date': datetime.datetime(2016, 2, 1, 9, 12, 41)},
{ 'accounting': { 'total': '0.00'},
'date': datetime.datetime(2016, 2, 16, 14, 27, 21)},
{ 'accounting': { 'example': '377.92', 'total': '377.92'},
'date': datetime.datetime(2016, 3, 3, 9, 37, 31)}]
And there you go with a pythonic solution without a single regex.
N.B.: I had to do a little trick with the datetime: because the log pads the day number with a space rather than a zero, I used str.replace() to change the double space into a space followed by a 0 within the date string.
N.B.: the filter() and the strip() used around the for report… loop are there to remove leading and trailing empty reports, depending on how the log file starts or ends.
Here's something shorter:
import re

with open("/var/log/monthly.out") as f:
    months = map(str.strip, f.read().split("-- End of monthly output --"))
# filter(None, ...) drops the empty chunk after the last end marker.
for sec in filter(None, months):
    date = sec.splitlines()[0]
    accs = re.findall(r"\n\s+(\w+)\s+([\d\.]+)", sec)
    print(date, accs)
This divides the file content into months, extracts the date of each month and searches for all accounts in each month.
You may want to try the following regex, which is not so elegant though:
import re
string = """
Mon Feb 1 09:12:41 GMT 2016
Rotating fax log files:
Doing login accounting:
total 688.31
example 401.12
_mbsetupuser 287.10
root 0.05
admin 0.04
-- End of monthly output --
Tue Feb 16 14:27:21 GMT 2016
Rotating fax log files:
Doing login accounting:
total 0.00
-- End of monthly output --
Thu Mar 3 09:37:31 GMT 2016
Rotating fax log files:
Doing login accounting:
total 377.92
example 377.92
-- End of monthly output --
"""
pattern = r'(\w+\s+\w+\s+[\d:\s]+[A-Z]{3}\s+\d{4})[\s\S]+?((?:\w+)\s+(?:[0-9.]+))\s+(?:((?:\w+)\s*(?:[0-9.]+)))?\s+(?:((?:\w+)\s*(?:[0-9.]+)))?\s*(?:((?:\w+)\s+(?:[0-9.]+)))?\s*(?:((?:\w+)\s*(?:[0-9.]+)))?'
print(re.findall(pattern, string))
Output:
[('Mon Feb 1 09:12:41 GMT 2016', 'total 688.31', 'example 401.12', '_mbsetupuser 287.10', 'root 0.05', 'admin 0.04'),
('Tue Feb 16 14:27:21 GMT 2016', 'total 0.00', '', '', '', ''),
('Thu Mar 3 09:37:31 GMT 2016', 'total 377.92', 'example 377.92', '', '', '')]
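The single pattern above hard-codes at most five accounts per report. One possible variation, not part of the answer, is to use two smaller patterns: one to cut out each report together with its date, and one to collect the pairs inside it. The sketch below assumes account lines are indented with spaces or tabs and that each report ends with the marker line; the names report_re, pair_re and parse_reports are made up for illustration.

```python
import re

# Sample in the shape of /var/log/monthly.out; account lines are assumed to be indented.
log = """Tue Feb 16 14:27:21 GMT 2016

Rotating fax log files:

Doing login accounting:
    total 0.00

-- End of monthly output --

Thu Mar  3 09:37:31 GMT 2016

Rotating fax log files:

Doing login accounting:
    total 377.92
    example 377.92

-- End of monthly output --
"""

# One match per report: the date line, then everything up to the end marker.
report_re = re.compile(
    r"^(\w{3} \w{3} +\d+ \d\d:\d\d:\d\d \w+ \d{4})$(.*?)^-- End of monthly output --",
    re.MULTILINE | re.DOTALL,
)
# One match per indented "name amount" line inside a report body.
pair_re = re.compile(r"^[ \t]+(\w+)[ \t]+(\d+\.\d+)[ \t]*$", re.MULTILINE)

def parse_reports(text):
    """Return a list of (date_line, [(user, total), ...]) tuples."""
    return [(date, pair_re.findall(body)) for date, body in report_re.findall(text)]

reports = parse_reports(log)
for date, pairs in reports:
    print(date, pairs)
```

Because the pairs are gathered with a second findall, the number of accounts per report is not limited.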
I have a large block of text, as shown below:
how are you
On Tue, Dec 21, 2021 at 1:51 PM <abc<http://localhost>> wrote:
-------------------------------------------------------------
NOTE: Please do not remove email address from the \"To\" line of this email when replying. This address is used to capture the email and report it. Please do not remove or change the subject line of this email. The subject line of this email contains information to refer this correspondence back to the originating discrepancy.
I want the date and time specified in the sentence (Tue, Dec 21, 2021 at 1:51 PM).
How to extract that from the sentence?
Use a regular expression to extract the date and time.
import re
text = '''how are you
On Tue, Dec 21, 2021 at 1:51 PM <abc<http://localhost>> wrote:
...
'''
match = re.search('(Mon|Tue|Wed|Thu|Fri|Sat|Sun).*?(AM|PM)', text)
match_date_and_time = match.group() # Tue, Dec 21, 2021 at 1:51 PM
Use datetime.strptime to parse the date and time.
import datetime
datetime.datetime.strptime(match_date_and_time, '%a, %b %d, %Y at %I:%M %p')
The way to go here is to use regular expressions, but for simplicity, and if the format of the text is always the same, you can get the date string by looking for the line that looks like this: On SOME DATE <Someone<someone's email address>> wrote:. Here is an example implementation:
email = """how are you
On Tue, Dec 21, 2021 at 1:51 PM <abc<http://localhost>> wrote:
-------------------------------------------------------------
NOTE: Please do not remove email address from the \"To\" line of this email when replying. This address is used to capture the email and report it. Please do not remove or change the subject line of this email. The subject line of this email contains information to refer this correspondence back to the originating discrepancy."""
for line in email.splitlines():
    if line.startswith("On ") and line.endswith(" wrote:"):
        date_string = line[3 : line.index(" <")]
        print(f"Found the date: {date_string!r}")
        break
else:
    print("Could not find the date.")
Very dirty:
string = """how are you \r\n\r\nOn Tue, Dec 21, 2021 at 1:51 PM
<abchttp://localhost> wrote:\r\n\r\n\r\n---------------------------------
----------------------------\r\nNOTE: Please do not remove email address
from the"To" line of this email when replying.This address is used to
capture the email and report it.Please do not remove or change the
subject line of this email.The subject line of this email contains
information to refer this correspondence back to the originating
discrepancy.\r\n"""
string = string.split("\r\n\r\n")
date = ' '.join(string[1].split(' ')[:8])
print(date)
I'm trying to parse a date string using the following code:
from dateutil.parser import parse
datestring = 'Thu Jul 25 15:13:16 GMT+06:00 2019'
d = parse(datestring)
print (d)
The parsed date is:
datetime.datetime(2019, 7, 25, 15, 13, 16, tzinfo=tzoffset(None, -21600))
As you can see, instead of adding 6 hours to GMT, it actually subtracted 6 hours.
What am I doing wrong here? How can I parse a date string in this format?
There's a comment in the source: https://github.com/dateutil/dateutil/blob/cbcc0871792e7eed4a42cc62630a08ec7a78be30/dateutil/parser/_parser.py#L803.
# Check for something like GMT+3, or BRST+3. Notice
# that it doesn't mean "I am 3 hours after GMT", but
# "my time +3 is GMT". If found, we reverse the
# logic so that timezone parsing code will get it
# right.
Important parts
Notice that it doesn't mean "I am 3 hours after GMT", but "my time +3 is GMT"
If found, we reverse the logic so that timezone parsing code will get it right
The last sentence in that comment (and the 2nd bullet point above) explains why 6 hours are subtracted. dateutil therefore reads Thu Jul 25 15:13:16 GMT+06:00 2019 as 15:13:16 at UTC-06:00, which is Thu Jul 25 21:13:16 2019 GMT.
Take a look at http://www.timebie.com/tz/timediff.php?q1=Universal%20Time&q2=GMT%20+6%20Time for more context.
dateutil's parse does not treat GMT+06:00 as "six hours ahead of GMT". It reads the input as a local time of 15:13:16 to which 6 hours must be added to reach GMT, so the result is 15:13:16-06:00.
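If the string is meant in the everyday sense, six hours ahead of GMT, one option is to bypass dateutil and parse it with the standard library. This is a sketch under two assumptions: the layout is always exactly as in the question, and Python is 3.7 or later (needed for %z to accept an offset written with a colon).

```python
from datetime import datetime, timedelta, timezone

datestring = "Thu Jul 25 15:13:16 GMT+06:00 2019"

# "GMT" is matched literally; %z then reads "+06:00" as an offset east of UTC.
d = datetime.strptime(datestring, "%a %b %d %H:%M:%S GMT%z %Y")
print(d.isoformat())

# The same instant expressed in UTC.
print(d.astimezone(timezone.utc).isoformat())
```

The parsed value keeps the wall-clock time 15:13:16 with a +06:00 offset, and converting to UTC subtracts the six hours.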
I have this data extracted from Email body
Data=("""-------- Forwarded Message --------
Subject: Sample Report
Date: Thu, 6 Apr 2017 16:39:19 +0000
From: test1#abc.com
To: test2#xyz.com""")
I want to extract this particular date and month , and copy it in the variables
Need output as
Date = 6
Month = "Apr"
Can anyone please help with this using regular expressions?
You can use this regex with multiline mode m:
^Date:[^,]+,\ (\d+) (\w+)
This will capture the date and the month in groups 1 and 2 respectively, so the match can easily be unpacked into two variables like so:
date, month = re.search(r"^Date:[^,]+,\ (\d+) (\w+)", Data, re.MULTILINE).groups()
date = int(date)
print(date, month)
# output: 6 Apr
Adding to the solution of @Rakesh,
import re
from datetime import datetime
# Data is the string from the question; strip the spaces so one compact format matches.
data1 = re.sub(' ', '', Data)
res = re.search(r'Date(.*)$', data1, re.MULTILINE).group()
res2 = datetime.strptime(res, 'Date:%a,%d%b%Y%X%z')
print(res2.day, res2.strftime('%b'))
You can use regex to extract the date
Ex:
import re
from dateutil.parser import parse
s = """-------- Forwarded Message --------
Subject: Sample Report
Date: Thu, 6 Apr 2017 16:39:19 +0000
From: test1#abc.com
To: test2#xyz.com"""
date = re.search("Date(.*)$", s, re.MULTILINE)
if date:
    date = date.group().replace("Date:", "").strip()
    d = parse(date)
    Date = d.day
    Month = d.strftime("%b")
    print(Date, Month)
Output:
6 Apr
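Since the line is a standard email Date header, another option is the standard library's email.utils.parsedate_to_datetime, which avoids the third-party dependency. A minimal sketch, assuming the header line always starts with "Date:" and using the question's sample text:

```python
import re
from email.utils import parsedate_to_datetime

data = """-------- Forwarded Message --------
Subject: Sample Report
Date: Thu, 6 Apr 2017 16:39:19 +0000
From: test1#abc.com
To: test2#xyz.com"""

# Grab everything after "Date:" on the header line.
match = re.search(r"^Date:\s*(.+)$", data, re.MULTILINE)
if match:
    d = parsedate_to_datetime(match.group(1))
    day, month = d.day, d.strftime("%b")  # %b follows the current locale
    print(day, month)
```

The result is a timezone-aware datetime, so the offset from the header is kept as well.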
I have a large txt file (log file), where each entry starts with timestamp such as Sun, 17 Mar 2013 18:58:06
I want to split the file into multiple txt files by mm/yy, sorted by date.
The general code I planned is below, but I do not know how to implement such. I know how to split a file by number of lines etc, but not by specified timestamp
import re
f = open("log.txt", "r")
my_regex = re.compile('regex goes here')
body = []
for line in f:
    if my_regex.match(line):
        if body:
            write_one(body)
        body = []
    body.append(line)
f.close()
example of lines from txt
2Sun, 17 Mar 2013 18:58:06 Pro IDS2.0 10E22E37-B2A1-4D55-BE20-84661D420196 nCWgKUtjalmYx053ykGeobwgWW V3
3Sun, 17 Mar 2013 19:17:33 <AwaitingDHKey c i FPdk 1:0 pt 0 Mrse> 0000000000000000000000000000000000000000 wo>
HomeKit keychain state:HomeKit: mdat=2017-01-01 01:41:47 +0000,cdat=2017-01-01 01:41:47 +0000,acct=HEDF3,class=genp,svce=AirPort,labl=HEDF3
4Sun, 13 Apr 2014 19:10:26 values in decoded form...
oak: <C: gen:'[ 21:10 5]' ak>
<PI#0x7fc01dc05d90: [name: Bourbon] [--SrbK-] [spid: zP8H/Rpy] [os: 15G31] [devid: 49645DA6] [serial: C17J9LGKDTY3] -
5Sun, 16 Feb 2014 18:59:41 tLastKVSKeyCleanup:
ak|nCWgKUtjalmYx053ykGeobwgWW:sk1Kv+37Clci7VwR2IGa+DNVEA: DHMessage (0x02): 112
You could use a regex (such as [0-9]{4} ([01]\d|2[0123]):([012345]\d):([012345]\d) ), but in the example posted the date is always at the beginning of the line. If that is the case, you could just slice the line at fixed positions to get the date.
import datetime
lines =[]
lines.append("2Sun, 17 Mar 2013 18:58:06 Pro IDS2.0 10E22E37-B2A1-4D55-BE20-84661D420196 nCWgKUtjalmYx053ykGeobwgWW V3")
lines.append("3Sun, 17 Mar 2013 19:17:33 <AwaitingDHKey c i FPdk 1:0 pt 0 Mrse> 0000000000000000000000000000000000000000 wo> HomeKit keychain state:HomeKit: mdat=2017-01-01 01:41:47 +0000,cdat=2017-01-01 01:41:47 +0000,acct=HEDF3,class=genp,svce=AirPort,labl=HEDF3")
lines.append("4Sun, 13 Apr 2014 19:10:26 values in decoded form... oak: <C: gen:'[ 21:10 5]' ak> <PI#0x7fc01dc05d90: [name: Bourbon] [--SrbK-] [spid: zP8H/Rpy] [os: 15G31] [devid: 49645DA6] [serial: C17J9LGKDTY3] -")
for l in lines:
    datetime_object = datetime.datetime.strptime(l[6:26], '%d %b %Y %H:%M:%S')
    print(datetime_object)
Which gives the correct output for the three examples you provided
2013-03-17 18:58:06
2013-03-17 19:17:33
2014-04-13 19:10:26
The datetime object has attributes such as month and year (plain attributes, not methods), so you can use a simple equality check to test whether two dates fall in the same month and/or year.
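Putting this together, grouping the entries by month might look like the sketch below. It assumes every entry starts with optional digits followed by a timestamp such as "Sun, 17 Mar 2013 18:58:06", and that lines without a timestamp belong to the preceding entry. The sample lines are shortened from the question; STAMP, group_by_month and the output file naming are hypothetical.

```python
import re
from collections import defaultdict
from datetime import datetime

# Optional leading digits, then a timestamp like "Sun, 17 Mar 2013 18:58:06".
STAMP = re.compile(r"^\d*\w{3}, (\d{1,2} \w{3} \d{4} \d\d:\d\d:\d\d)")

def group_by_month(lines):
    """Map 'YYYY-MM' to the lines of every entry stamped in that month."""
    groups = defaultdict(list)
    key = None
    for line in lines:
        m = STAMP.match(line)
        if m:
            ts = datetime.strptime(m.group(1), "%d %b %Y %H:%M:%S")
            key = ts.strftime("%Y-%m")
        if key is not None:  # lines before the first timestamp are skipped
            groups[key].append(line)
    return dict(groups)

sample = [
    "2Sun, 17 Mar 2013 18:58:06 Pro IDS2.0\n",
    "3Sun, 17 Mar 2013 19:17:33 <AwaitingDHKey>\n",
    "HomeKit keychain state:HomeKit: mdat=2017-01-01\n",
    "4Sun, 13 Apr 2014 19:10:26 values in decoded form...\n",
]
groups = group_by_month(sample)
for month in sorted(groups):
    # To write one file per month (hypothetical naming):
    # with open(f"log-{month}.txt", "w") as out: out.writelines(groups[month])
    print(month, len(groups[month]))
```

Iterating over sorted(groups) yields the months in chronological order because the keys are formatted as year then month. Entries within a month stay in file order; sorting them by timestamp would need the parsed datetime stored alongside each entry.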
I have gone through multiple links before posting this question so please read through and below are the two answers which have solved 90% of my problem:
parse multiple dates using dateutil
How to parse multiple dates from a block of text in Python (or another language)
Problem: I need to parse multiple dates in multiple formats in Python
Solution by Above Links: I am able to do so but there are still certain formats which I am not able to do so.
Formats which still can't be parsed are:
text ='I want to visit from May 16-May 18'
text ='I want to visit from May 16-18'
text ='I want to visit from May 6 May 18'
I have also tried regex, but since dates can come in any format I ruled out that option because the code was getting very complex. Hence, please suggest modifications to the code presented in the links so that the above 3 formats can also be handled.
This kind of problem is always going to need tweaking for new edge cases, but the following approach is fairly robust:
from itertools import groupby, zip_longest
from datetime import datetime
import calendar
import string
import re

def get_date_part(x):
    # Month names pass through unchanged; day and year tokens are reduced to their digits.
    if x.lower() in month_list:
        return x
    day = re.match(r'(\d+)(\b|st|nd|rd|th)', x, re.I)
    if day:
        return day.group(1)
    return False

def month_full(month):
    # Normalise a full or abbreviated month name to its abbreviated form.
    try:
        return datetime.strptime(month, '%B').strftime('%b')
    except ValueError:
        return datetime.strptime(month, '%b').strftime('%b')

tests = [
    'I want to visit from May 16-May 18',
    'I want to visit from May 16-18',
    'I want to visit from May 6 May 18',
    'May 6,7,8,9,10',
    '8 May to 10 June',
    'July 10/20/30',
    'from June 1, july 5 to aug 5 please',
    '2nd March to the 3rd January',
    '15 march, 10 feb, 5 jan',
    '1 nov 2017',
    '27th Oct 2010 until 1st jan',
    '27th Oct 2010 until 1st jan 2012'
]

cur_year = 2017
month_list = [m.lower() for m in list(calendar.month_name) + list(calendar.month_abbr) if len(m)]
# Translation table mapping every punctuation character to a space.
remove_punc = str.maketrans(string.punctuation, ' ' * len(string.punctuation))

for date in tests:
    date_parts = [get_date_part(part) for part in date.translate(remove_punc).split() if get_date_part(part)]
    days = []
    months = []
    years = []
    # Sorting on isdigit() puts month names first and numbers last.
    for k, g in groupby(sorted(date_parts, key=lambda x: x.isdigit()), lambda y: not y.isdigit()):
        values = list(g)
        if k:
            months = list(map(month_full, values))
        else:
            for v in values:
                if 1900 <= int(v) <= 2100:
                    years.append(int(v))
                else:
                    days.append(v)
    if days and months:
        if years:
            dates_raw = [datetime.strptime('{} {} {}'.format(m, d, y), '%b %d %Y') for m, d, y in zip_longest(months, days, years, fillvalue=years[0])]
        else:
            dates_raw = [datetime.strptime('{} {}'.format(m, d), '%b %d').replace(year=cur_year) for m, d in zip_longest(months, days, fillvalue=months[0])]
            years = [cur_year]
        # Fix for jumps in year
        dates = []
        start_date = datetime(years[0], 1, 1)
        next_year = years[0] + 1
        for d in dates_raw:
            if d < start_date:
                d = d.replace(year=next_year)
                next_year += 1
            start_date = d
            dates.append(d)
        print("{} -> {}".format(date, ', '.join(d.strftime("%d/%m/%Y") for d in dates)))
This converts the test strings as follows:
I want to visit from May 16-May 18 -> 16/05/2017, 18/05/2017
I want to visit from May 16-18 -> 16/05/2017, 18/05/2017
I want to visit from May 6 May 18 -> 06/05/2017, 18/05/2017
May 6,7,8,9,10 -> 06/05/2017, 07/05/2017, 08/05/2017, 09/05/2017, 10/05/2017
8 May to 10 June -> 08/05/2017, 10/06/2017
July 10/20/30 -> 10/07/2017, 20/07/2017, 30/07/2017
from June 1, july 5 to aug 5 please -> 01/06/2017, 05/07/2017, 05/08/2017
2nd March to the 3rd January -> 02/03/2017, 03/01/2018
15 march, 10 feb, 5 jan -> 15/03/2017, 10/02/2018, 05/01/2019
1 nov 2017 -> 01/11/2017
27th Oct 2010 until 1st jan -> 27/10/2010, 01/01/2011
27th Oct 2010 until 1st jan 2012 -> 27/10/2010, 01/01/2012
This works as follows:
First create a list of valid months names, i.e. both full and abbreviated.
Make a translation table to make it easy to quickly remove any punctuation from the text.
Split the text, and extract only the date parts by using a function with a regular expression to spot days or months.
Sort the list based on whether or not the part is a digit, this will group months to the front and digits to the end.
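Pair the months with the days (when no year is given, the first month is reused if there are more days than months). Month names are normalised to their abbreviated form, e.g. August to Aug, and each pair is converted into a datetime object.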
If a date appears to be before the previous one, add a whole year.
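The last step can be isolated into a small helper. The version below is a simplified variant of the idea rather than the loop used in the answer: it keeps adding a year until each date is no earlier than the one before it. The name roll_years is made up for illustration.

```python
from datetime import datetime

def roll_years(dates):
    """Push any date that precedes its predecessor into a later year."""
    fixed = []
    previous = None
    for d in dates:
        # Keep adding a year until the sequence is non-decreasing.
        while previous is not None and d < previous:
            d = d.replace(year=d.year + 1)
        fixed.append(d)
        previous = d
    return fixed

# '15 march, 10 feb, 5 jan', all initially placed in 2017.
raw = [datetime(2017, 3, 15), datetime(2017, 2, 10), datetime(2017, 1, 5)]
rolled = roll_years(raw)
print([d.strftime("%d/%m/%Y") for d in rolled])
```

Note that replace(year=...) raises ValueError for 29 February when the target year is not a leap year, so a real implementation would need to handle that case.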