I have a large txt file (a log file) where each entry starts with a timestamp such as Sun, 17 Mar 2013 18:58:06.
I want to split the file into multiple txt files by month and year (mm/yy), with the entries sorted.
The general code I planned is below, but I do not know how to implement the matching. I know how to split a file by number of lines etc., but not by a specified timestamp.
import re
f = open("log.txt", "r")
my_regex = re.compile('regex goes here')
body = []
for line in f:
    if my_regex.match(line):
        if body:
            write_one(body)
        body = []
    body.append(line)
f.close()
Example lines from the txt file:
2Sun, 17 Mar 2013 18:58:06 Pro IDS2.0 10E22E37-B2A1-4D55-BE20-84661D420196 nCWgKUtjalmYx053ykGeobwgWW V3
3Sun, 17 Mar 2013 19:17:33 <AwaitingDHKey c i FPdk 1:0 pt 0 Mrse> 0000000000000000000000000000000000000000 wo>
HomeKit keychain state:HomeKit: mdat=2017-01-01 01:41:47 +0000,cdat=2017-01-01 01:41:47 +0000,acct=HEDF3,class=genp,svce=AirPort,labl=HEDF3
4Sun, 13 Apr 2014 19:10:26 values in decoded form...
oak: <C: gen:'[ 21:10 5]' ak>
<PI#0x7fc01dc05d90: [name: Bourbon] [--SrbK-] [spid: zP8H/Rpy] [os: 15G31] [devid: 49645DA6] [serial: C17J9LGKDTY3] -
5Sun, 16 Feb 2014 18:59:41 tLastKVSKeyCleanup:
ak|nCWgKUtjalmYx053ykGeobwgWW:sk1Kv+37Clci7VwR2IGa+DNVEA: DHMessage (0x02): 112
You could use a regex (such as [0-9]{4} ([01]\d|2[0123]):([012345]\d):([012345]\d) ), but in the example posted the date is always at the beginning of the line. If that is the case, you could just use the position in the string to parse the date.
import datetime
lines =[]
lines.append("2Sun, 17 Mar 2013 18:58:06 Pro IDS2.0 10E22E37-B2A1-4D55-BE20-84661D420196 nCWgKUtjalmYx053ykGeobwgWW V3")
lines.append("3Sun, 17 Mar 2013 19:17:33 <AwaitingDHKey c i FPdk 1:0 pt 0 Mrse> 0000000000000000000000000000000000000000 wo> HomeKit keychain state:HomeKit: mdat=2017-01-01 01:41:47 +0000,cdat=2017-01-01 01:41:47 +0000,acct=HEDF3,class=genp,svce=AirPort,labl=HEDF3")
lines.append("4Sun, 13 Apr 2014 19:10:26 values in decoded form... oak: <C: gen:'[ 21:10 5]' ak> <PI#0x7fc01dc05d90: [name: Bourbon] [--SrbK-] [spid: zP8H/Rpy] [os: 15G31] [devid: 49645DA6] [serial: C17J9LGKDTY3] -")
for l in lines:
    datetime_object = datetime.datetime.strptime(l[6:26], '%d %b %Y %H:%M:%S')
    print(datetime_object)
This gives the correct output for the three examples you provided:
2013-03-17 18:58:06
2013-03-17 19:17:33
2014-04-13 19:10:26
The datetime object has attributes such as month and year, so you can use a simple equality check to see whether two dates fall in the same month and/or year.
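Putting the pieces together, here is a minimal sketch of the split itself. It assumes that every entry starts with optional digits followed by a timestamp like Sun, 17 Mar 2013 18:58:06, and that any other line continues the previous entry. The names STAMP, group_by_month and write_monthly_files, and the output file naming, are made up for illustration.

```python
import re
from collections import defaultdict
from datetime import datetime

# Assumption: an entry starts with optional digits, then e.g. "Sun, 17 Mar 2013 18:58:06".
STAMP = re.compile(r'^\d*([A-Za-z]{3}, \d{1,2} [A-Za-z]{3} \d{4} \d{2}:\d{2}:\d{2})')

def group_by_month(lines):
    """Return {(year, month): [entry_text, ...]}, entries sorted by timestamp."""
    entries = []  # [timestamp, text] pairs, one per log entry
    for line in lines:
        m = STAMP.match(line)
        if m:
            stamp = datetime.strptime(m.group(1), '%a, %d %b %Y %H:%M:%S')
            entries.append([stamp, line])
        elif entries:
            entries[-1][1] += line  # continuation line of the current entry
    entries.sort(key=lambda e: e[0])
    months = defaultdict(list)
    for stamp, text in entries:
        months[(stamp.year, stamp.month)].append(text)
    return dict(months)

def write_monthly_files(months, prefix='log'):
    """Write one file per month, named e.g. log_03_13.txt (hypothetical naming)."""
    for (year, month), texts in months.items():
        name = '{}_{:02d}_{:02d}.txt'.format(prefix, month, year % 100)
        with open(name, 'w') as out:
            out.writelines(texts)
```

Typical use would be `with open("log.txt") as f: write_monthly_files(group_by_month(f))`. Everything is held in memory, so a very large log may need a streaming variant.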
Related
Trying to make an RSS feed reader using django, feedparser and dateutil
Getting this error: can't compare offset-naive and offset-aware datetimes
I just have five feeds right now. These are the datetimes from the feeds..
Sat, 10 Sep 2022 23:08:59 -0400
Sun, 11 Sep 2022 04:08:30 +0000
Sun, 11 Sep 2022 13:12:18 +0000
2022-09-10T01:01:16+00:00
Sat, 17 Sep 2022 11:27:15 EDT
I was able to order the first four feeds and then I got the error when I added the last one.
## create a list of lists - each inner list holds entries from a feed
parsed_feeds = [feedparser.parse(url)['entries'] for url in feed_urls]
## put all entries in one list
parsed_feeds2 = [item for feed in parsed_feeds for item in feed]
## sort entries by date
parsed_feeds2.sort(key=lambda x: dateutil.parser.parse(x['published']), reverse=True)
How can I make all the datetimes from the feeds the same so they can be ordered?
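One way to make the values comparable is to turn every parsed date into an aware UTC datetime before sorting. The error most likely appears because the EDT abbreviation in the last feed was not resolved to an offset, which leaves that one datetime naive. The sketch below uses only the standard library rather than dateutil; the helper name to_utc is made up, and treating zone-less dates as UTC is an assumption.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

def to_utc(text):
    """Parse an ISO 8601 or RFC 2822 style date string into an aware UTC datetime."""
    try:
        parsed = datetime.fromisoformat(text)
    except ValueError:
        parsed = parsedate_to_datetime(text)  # understands "-0400", "+0000", "EDT"
    if parsed.tzinfo is None:
        # Assumption: a date without zone information is already in UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.astimezone(timezone.utc)

published = [
    "Sat, 10 Sep 2022 23:08:59 -0400",
    "Sun, 11 Sep 2022 04:08:30 +0000",
    "Sun, 11 Sep 2022 13:12:18 +0000",
    "2022-09-10T01:01:16+00:00",
    "Sat, 17 Sep 2022 11:27:15 EDT",
]
newest_first = sorted(published, key=to_utc, reverse=True)
print(newest_first[0])  # Sat, 17 Sep 2022 11:27:15 EDT
```

For the feed entries, the same key would be `key=lambda x: to_utc(x['published'])`.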
I am trying to parse the response of an API. It takes a batch of phone numbers and returns information on their status, i.e. active or not.
This is what the response looks like:
# API call
s.get('https://api/data/stuff')
# response
',MSISDN,Status,Error Code,Error Text,Original Network,Current Network,Current Country,Roaming
Country,Type,Date Checked\n447541255456,447541255456,Undelivered,27,Absent Subscriber,O2
(UK),,,,Mobile,Wed Oct 9 2019 12:26:51 GMT+0000 (UTC)\n447856999555,447856999555,Undelivered,1,Dead,O2
(UK),,,,Mobile,Wed Oct 9 2019 12:26:51 GMT+0000
(UTC)\n447854111222,447854111222,Undelivered,1,Dead,Orange,,,,Mobile,Wed Oct 9 2019 12:26:51 GMT+0000
(UTC)\n'
I can see that MSISDN,Status,Error Code,Error Text,Original Network,Current Network,Current Country,Roaming
Country,Type,Date Checked are headers, and the rest are the rows.
But I can't get this into a structure I can read easily, such as a dataframe.
There were some suggested answers while typing this question, which use import io and pd.read_table etc. but I couldn't get any of them to work.
I guess I could save it as a txt file then read it back in as a comma separated csv. But is there a native pandas or other easier way to do this?
Here's the response string pasted directly into stack overflow with no tidying:
',MSISDN,Status,Error Code,Error Text,Original Network,Current Network,Current Country,Roaming Country,Type,Date Checked\n447541255456,447541255456,Undelivered,27,Absent Subscriber,O2 (UK),,,,Mobile,Wed Oct 9 2019 12:26:51 GMT+0000 (UTC)\n447856999555,447856999555,Undelivered,1,Dead,O2 (UK),,,,Mobile,Wed Oct 9 2019 12:26:51 GMT+0000 (UTC)\n447854111222,447854111222,Undelivered,1,Dead,Orange,,,,Mobile,Wed Oct 9 2019 12:26:51 GMT+0000 (UTC)\n'
I believe you need:
import pandas as pd
from io import StringIO
# .text is the response body as a str (assuming s is a requests session)
df = pd.read_csv(StringIO(s.get('https://api/data/stuff').text))
Or try:
df = pd.read_csv('https://api/data/stuff')
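To see what the in-memory file trick does without calling the API, here is a small sketch that parses a shortened copy of the response string with the standard library's csv module. The variable names body and rows are made up; pd.read_csv(io.StringIO(body)) should read the same text into a dataframe.

```python
import csv
import io

# Shortened copy of the response body from the question (two data rows).
body = (
    ",MSISDN,Status,Error Code,Error Text,Original Network,Current Network,"
    "Current Country,Roaming Country,Type,Date Checked\n"
    "447541255456,447541255456,Undelivered,27,Absent Subscriber,O2 (UK),,,,"
    "Mobile,Wed Oct 9 2019 12:26:51 GMT+0000 (UTC)\n"
    "447854111222,447854111222,Undelivered,1,Dead,Orange,,,,"
    "Mobile,Wed Oct 9 2019 12:26:51 GMT+0000 (UTC)\n"
)

# io.StringIO wraps the string so it can be read like an open file.
rows = list(csv.DictReader(io.StringIO(body)))
for row in rows:
    print(row["MSISDN"], row["Status"], row["Error Text"])
```

The first header is empty because the response starts with a comma, so that column ends up under the key "".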
I went through multiple links before posting this question, so please read through. Below are the two answers which solved 90% of my problem:
parse multiple dates using dateutil
How to parse multiple dates from a block of text in Python (or another language)
Problem: I need to parse multiple dates in multiple formats in Python.
Solution from the above links: it works for most cases, but there are still certain formats which I am not able to parse.
Formats which still can't be parsed are:
text ='I want to visit from May 16-May 18'
text ='I want to visit from May 16-18'
text ='I want to visit from May 6 May 18'
I have also tried regex, but since dates can come in any format, I ruled that option out because the code was getting very complex. Hence, please suggest modifications to the code presented in the links so that the above 3 formats can also be handled.
This kind of problem is always going to need tweaking with new edge cases, but the following approach is fairly robust:
from itertools import groupby, zip_longest
from datetime import datetime
import calendar
import string
import re

def get_date_part(x):
    # Return the month name or the leading number of a token, else False
    if x.lower() in month_list:
        return x
    day = re.match(r'(\d+)(\b|st|nd|rd|th)', x, re.I)
    if day:
        return day.group(1)
    return False

def month_full(month):
    # Normalise a full or abbreviated month name to its abbreviation, e.g. 'Aug'
    try:
        return datetime.strptime(month, '%B').strftime('%b')
    except ValueError:
        return datetime.strptime(month, '%b').strftime('%b')

tests = [
    'I want to visit from May 16-May 18',
    'I want to visit from May 16-18',
    'I want to visit from May 6 May 18',
    'May 6,7,8,9,10',
    '8 May to 10 June',
    'July 10/20/30',
    'from June 1, july 5 to aug 5 please',
    '2nd March to the 3rd January',
    '15 march, 10 feb, 5 jan',
    '1 nov 2017',
    '27th Oct 2010 until 1st jan',
    '27th Oct 2010 until 1st jan 2012'
]

cur_year = 2017
month_list = [m.lower() for m in list(calendar.month_name) + list(calendar.month_abbr) if len(m)]
# Translation table mapping every punctuation character to a space
remove_punc = str.maketrans(string.punctuation, ' ' * len(string.punctuation))

for date in tests:
    date_parts = [get_date_part(part) for part in date.translate(remove_punc).split() if get_date_part(part)]
    days = []
    months = []
    years = []
    # Months sort to the front, numbers to the end
    for k, g in groupby(sorted(date_parts, key=lambda x: x.isdigit()), lambda y: not y.isdigit()):
        values = list(g)
        if k:
            months = [month_full(v) for v in values]
        else:
            for v in values:
                if 1900 <= int(v) <= 2100:
                    years.append(int(v))
                else:
                    days.append(v)
    if days and months:
        if years:
            dates_raw = [datetime.strptime('{} {} {}'.format(m, d, y), '%b %d %Y') for m, d, y in zip_longest(months, days, years, fillvalue=years[0])]
        else:
            dates_raw = [datetime.strptime('{} {}'.format(m, d), '%b %d').replace(year=cur_year) for m, d in zip_longest(months, days, fillvalue=months[0])]
            years = [cur_year]
        # Fix for jumps in year
        dates = []
        start_date = datetime(years[0], 1, 1)
        next_year = years[0] + 1
        for d in dates_raw:
            if d < start_date:
                d = d.replace(year=next_year)
                next_year += 1
            start_date = d
            dates.append(d)
        print("{} -> {}".format(date, ', '.join(d.strftime("%d/%m/%Y") for d in dates)))
This converts the test strings as follows:
I want to visit from May 16-May 18 -> 16/05/2017, 18/05/2017
I want to visit from May 16-18 -> 16/05/2017, 18/05/2017
I want to visit from May 6 May 18 -> 06/05/2017, 18/05/2017
May 6,7,8,9,10 -> 06/05/2017, 07/05/2017, 08/05/2017, 09/05/2017, 10/05/2017
8 May to 10 June -> 08/05/2017, 10/06/2017
July 10/20/30 -> 10/07/2017, 20/07/2017, 30/07/2017
from June 1, july 5 to aug 5 please -> 01/06/2017, 05/07/2017, 05/08/2017
2nd March to the 3rd January -> 02/03/2017, 03/01/2018
15 march, 10 feb, 5 jan -> 15/03/2017, 10/02/2018, 05/01/2019
1 nov 2017 -> 01/11/2017
27th Oct 2010 until 1st jan -> 27/10/2010, 01/01/2011
27th Oct 2010 until 1st jan 2012 -> 27/10/2010, 01/01/2012
This works as follows:
First create a list of valid months names, i.e. both full and abbreviated.
Make a translation table to make it easy to quickly remove any punctuation from the text.
Split the text, and extract only the date parts by using a function with a regular expression to spot days or months.
Sort the list based on whether or not the part is a digit, this will group months to the front and digits to the end.
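Pair each month with a day (and a year, if one is given), reusing the first month (or year) to fill the gaps when there are more days than months. Convert months into abbreviated form, e.g. August to Aug, and convert each pair into a datetime object.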
If a date appears to be before the previous one, add a whole year.
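The last step, the year rollover, can be isolated as a small function. This is a simplified variant for illustration, not the exact loop above: it bumps a date by whole years until it is no earlier than the date before it. The name roll_years is made up, and the sketch does not handle 29 February.

```python
from datetime import datetime

def roll_years(dates):
    """Push each date forward by whole years until the list is non-decreasing."""
    result = []
    previous = None
    for d in dates:
        while previous is not None and d < previous:
            d = d.replace(year=d.year + 1)  # raises ValueError for 29 Feb in a non-leap year
        previous = d
        result.append(d)
    return result

# '15 march, 10 feb, 5 jan', all initially placed in 2017
raw = [datetime(2017, 3, 15), datetime(2017, 2, 10), datetime(2017, 1, 5)]
print(', '.join(d.strftime("%d/%m/%Y") for d in roll_years(raw)))
# 15/03/2017, 10/02/2018, 05/01/2019
```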
I have a .txt file called dates.txt. It contains lines like this:
Fri Jan 31 05:51:59 +0000 2014
Fri Jan 31 05:01:39 +0000 2014
Thu Jan 30 14:31:21 +0000 2014
Sat Feb 01 06:53:10 +0000 2014
How do I sort these by date, from oldest to newest? I'm pretty sure you have to use the datetime and strptime functions.
from datetime import datetime as dt
def sortFile(infilepath, outfilepath, fmt):
    lines = []
    with open(infilepath) as infile:
        for line in infile:
            lines.append(dt.strptime(line, fmt))  # read the file, parsing each line into a datetime
    lines.sort()  # sort the datetime objects
    with open(outfilepath, 'w') as outfile:
        for line in lines:
            outfile.write(line.strftime(fmt))  # write each datetime back out in the same format
Now, you can call it like this:
sortFile('path/to/input', '/path/to/output', "%a %b %d %H:%M:%S %z %Y\n")
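A variant that sorts the original text lines, using the parsed datetime only as the sort key, might look like the sketch below. It strips each line first, so a missing newline on the last line or a blank line does not break the parsing. The names FMT and sort_date_lines are made up for illustration.

```python
from datetime import datetime

FMT = "%a %b %d %H:%M:%S %z %Y"

def sort_date_lines(lines, fmt=FMT):
    """Return the non-empty lines, stripped, ordered from oldest to newest."""
    stripped = [line.strip() for line in lines if line.strip()]
    return sorted(stripped, key=lambda s: datetime.strptime(s, fmt))

sample = [
    "Fri Jan 31 05:51:59 +0000 2014\n",
    "Fri Jan 31 05:01:39 +0000 2014\n",
    "Thu Jan 30 14:31:21 +0000 2014\n",
    "Sat Feb 01 06:53:10 +0000 2014",
]
for line in sort_date_lines(sample):
    print(line)
```

With a file, `sort_date_lines(open('dates.txt'))` would give the same list, which can then be written back with one line per entry.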
I'd like to use Python to analyse /var/log/monthly.out on OS X to export user accounting totals. The log file looks like this:
Mon Feb 1 09:12:41 GMT 2016
Rotating fax log files:
Doing login accounting:
total 688.31
example 401.12
_mbsetupuser 287.10
root 0.05
admin 0.04
-- End of monthly output --
Tue Feb 16 14:27:21 GMT 2016
Rotating fax log files:
Doing login accounting:
total 0.00
-- End of monthly output --
Thu Mar 3 09:37:31 GMT 2016
Rotating fax log files:
Doing login accounting:
total 377.92
example 377.92
-- End of monthly output --
I was able to extract the username / totals pairs with this regex:
\t(\w*)\W*(\d*\.\d{2})
In Python:
>>> import re
>>> re.findall(r'\t(\w*)\W*(\d*\.\d{2})', open('/var/log/monthly.out', 'r').read())
[('total', '688.31'), ('example', '401.12'), ('_mbsetupuser', '287.10'), ('root', '0.05'), ('admin', '0.04'), ('total', '0.00'), ('total', '377.92'), ('example', '377.92')]
But I can't figure out how to extract the date line in such a way where it's attached to the username / totals pairs for that month.
Use str.split().
import re
re_user_amount = r'\s+(\w+)\s+(\d*\.\d{2})'
re_date = r'\w{3}\s+\w{3}\s+\d+\s+\d\d:\d\d:\d\d \w+ \d{4}'
with open('/var/log/monthly.out', 'r') as f:
    content = f.read()
sections = content.split('-- End of monthly output --')
for section in sections:
    date = re.findall(re_date, section)
    matches = re.findall(re_user_amount, section)
    print(date, matches)
If you want to turn the date string into an actual datetime, check out Converting string into datetime.
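For the date format in this log, that conversion could be sketched as follows. The helper name parse_log_date is made up; %Z matches the zone name (GMT here), and the result is a naive datetime.

```python
from datetime import datetime

def parse_log_date(text):
    """Parse a line like 'Mon Feb  1 09:12:41 GMT 2016' into a naive datetime."""
    return datetime.strptime(text.strip(), "%a %b %d %H:%M:%S %Z %Y")

print(parse_log_date("Mon Feb  1 09:12:41 GMT 2016"))  # 2016-02-01 09:12:41
```

Since re.findall() returns a list, the value to pass in would be date[0] whenever the list is not empty.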
Well, regex is rarely a magical cure for everything. Regexes are a great tool
for simple string parsing, but they should not replace good old programming!
So if you look at your data, you'll notice that each report always starts with a date and ends with the
-- End of monthly output -- line. So a nice way to handle that would be to split your data
by each monthly output.
Let's start with your data:
>>> s = """\
... Mon Feb 1 09:12:41 GMT 2016
...
... Rotating fax log files:
...
... Doing login accounting:
... total 688.31
... example 401.12
... _mbsetupuser 287.10
... root 0.05
... admin 0.04
...
... -- End of monthly output --
...
... Tue Feb 16 14:27:21 GMT 2016
...
... Rotating fax log files:
...
... Doing login accounting:
... total 0.00
...
... -- End of monthly output --
...
... Thu Mar 3 09:37:31 GMT 2016
...
... Rotating fax log files:
...
... Doing login accounting:
... total 377.92
... example 377.92
...
... -- End of monthly output --"""
And let's split it based on that end of month line:
>>> reports = s.split('-- End of monthly output --')
>>> reports
['Mon Feb 1 09:12:41 GMT 2016\n\nRotating fax log files:\n\nDoing login accounting:\n total 688.31\n example 401.12\n _mbsetupuser 287.10\n root 0.05\n admin 0.04\n\n', '\n\nTue Feb 16 14:27:21 GMT 2016\n\nRotating fax log files:\n\nDoing login accounting:\n total 0.00\n\n', '\n\nThu Mar 3 09:37:31 GMT 2016\n\nRotating fax log files:\n\nDoing login accounting:\n total 377.92\n example 377.92\n\n', '']
Then you can separate the accounting data from the rest of the log:
>>> report = reports[0]
>>> head, tail = report.split('Doing login accounting:')
Now let's extract the date line:
>>> date_line = head.strip().split('\n')[0]
And fill up a dict with those username/totals pairs:
>>> accounting = dict(zip(tail.split()[::2], tail.split()[1::2]))
The trick here is to use zip() to create pairs out of two slices of tail.split(). The left
side of each pair comes from the slice starting at index 0 and taking every second item (the usernames); the right
side comes from the slice starting at index 1 and taking every second item (the totals). Which gives:
{'admin': '0.04', 'root': '0.05', 'total': '688.31', '_mbsetupuser': '287.10', 'example': '401.12'}
So now that's done, you can do that in a for loop:
import datetime
def parse_monthly_log(log_path='/var/log/monthly.out'):
    with open(log_path, 'r') as log:
        reports = log.read().strip('\n ').split('-- End of monthly output --')
    for report in filter(lambda it: it, reports):
        head, tail = report.split('Doing login accounting:')
        date_line = head.strip().split('\n')[0]
        accounting = dict(zip(tail.split()[::2], tail.split()[1::2]))
        yield {
            # a double space before a one-digit day becomes ' 0', e.g. 'Feb  1' -> 'Feb 01'
            'date': datetime.datetime.strptime(date_line.replace('  ', ' 0'), '%a %b %d %H:%M:%S %Z %Y'),
            'accounting': accounting
        }
>>> import pprint
>>> pprint.pprint(list(parse_monthly_log()), indent=2)
[ { 'accounting': { '_mbsetupuser': '287.10',
'admin': '0.04',
'example': '401.12',
'root': '0.05',
'total': '688.31'},
'date': datetime.datetime(2016, 2, 1, 9, 12, 41)},
{ 'accounting': { 'total': '0.00'},
'date': datetime.datetime(2016, 2, 16, 14, 27, 21)},
{ 'accounting': { 'example': '377.92', 'total': '377.92'},
'date': datetime.datetime(2016, 3, 3, 9, 37, 31)}]
And there you go with a pythonic solution without a single regex.
N.B.: I had to do a little trick with the datetime: because the log contains a day number padded with a space and not a zero (as strptime expects), I used the string .replace() method to change a double space into a space followed by 0 within the date string.
N.B.: the filter() and the strip() used for the for report… loop remove leading and trailing empty reports, depending on how the log file starts or ends.
Here's something shorter:
import re
with open("/var/log/monthly.out") as f:
    months = map(str.strip, f.read().split("-- End of monthly output --"))
for sec in filter(None, months):  # skip empty sections
    date = sec.splitlines()[0]
    accs = re.findall(r"\n\s+(\w+)\s+([\d\.]+)", sec)
    print(date, accs)
This divides the file content into months, extracts the date of each month and searches for all accounts in each month.
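If structured records are more convenient than printed tuples, the same split can feed a small parser. This is a sketch under assumptions: account lines are taken to be indented, as in the real log, and the names END_MARK, ACCOUNT and parse_monthly are made up for illustration.

```python
import re
from datetime import datetime

END_MARK = "-- End of monthly output --"
# Assumption: account lines are indented, e.g. "\ttotal       688.31"
ACCOUNT = re.compile(r"^[ \t]+(\w+)[ \t]+(\d+\.\d+)[ \t]*$", re.MULTILINE)

def parse_monthly(text):
    """Return a list of (datetime, {user: total}) tuples, one per monthly section."""
    records = []
    for section in text.split(END_MARK):
        section = section.strip()
        if not section:
            continue  # skip the empty piece after the last end marker
        date = datetime.strptime(section.splitlines()[0], "%a %b %d %H:%M:%S %Z %Y")
        totals = {user: float(amount) for user, amount in ACCOUNT.findall(section)}
        records.append((date, totals))
    return records
```

Calling `parse_monthly(open("/var/log/monthly.out").read())` would then keep each date attached to the totals of its own section.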
You may want to try the following regex, which is not so elegant though:
import re
string = """
Mon Feb 1 09:12:41 GMT 2016
Rotating fax log files:
Doing login accounting:
total 688.31
example 401.12
_mbsetupuser 287.10
root 0.05
admin 0.04
-- End of monthly output --
Tue Feb 16 14:27:21 GMT 2016
Rotating fax log files:
Doing login accounting:
total 0.00
-- End of monthly output --
Thu Mar 3 09:37:31 GMT 2016
Rotating fax log files:
Doing login accounting:
total 377.92
example 377.92
-- End of monthly output --
"""
pattern = r'(\w+\s+\w+\s+[\d:\s]+[A-Z]{3}\s+\d{4})[\s\S]+?((?:\w+)\s+(?:[0-9.]+))\s+(?:((?:\w+)\s*(?:[0-9.]+)))?\s+(?:((?:\w+)\s*(?:[0-9.]+)))?\s*(?:((?:\w+)\s+(?:[0-9.]+)))?\s*(?:((?:\w+)\s*(?:[0-9.]+)))?'
print(re.findall(pattern, string))
Output:
[('Mon Feb 1 09:12:41 GMT 2016', 'total 688.31', 'example 401.12', '_mbsetupuser 287.10', 'root 0.05', 'admin 0.04'),
('Tue Feb 16 14:27:21 GMT 2016', 'total 0.00', '', '', '', ''),
('Thu Mar 3 09:37:31 GMT 2016', 'total 377.92', 'example 377.92', '', '', '')]