I am trying to parse a log file based on grouping; below is the input file to parse.
I can't aggregate the multiple groups my regex captures into the expected output, and I need some recommendations on how to print the data that way. (Note: group 2 can contain other strings in the actual log file.)
# Parse out the timestamp, e.g. Jan 20 03:25:08, to capture two groups
Example groups
1.) Jan 20 03:25 2.) logrotate
1.) Jan 20 05:03 2.) ntpd
logfile= """Jan 20 03:25:08 fakehost logrotate: ALERT exited abnormally with [1]
Jan 20 03:25:08 fakehost run-parts(/etc/cron.daily)[20447]: finished logrotate
Jan 20 03:26:21 fakehost anacron[28969]: Job 'cron.daily' terminated
Jan 20 03:26:21 fakehost anacron[28969]: Normal exit (1 job run)
Jan 20 03:30:01 fakehost CROND[31462]: (root) CMD (/usr/lib64/sa/sa1 1 1)
Jan 20 03:30:01 fakehost CROND[31461]: (root) CMD (/var/system/bin/sys-cmd -F
Jan 20 05:03:03 fakehost ntpd[3705]: synchronized to time.faux.biz, stratum 2
"""
Expected output:
minute,total_count,logrotate,CROND,ntpd,anacron,run-parts
Jan 20 03:25,2,1,0,0,0,1
Jan 20 03:26,2,0,0,0,2,0
Jan 20 03:30,2,0,2,0,0,0
Jan 20 05:03,1,0,0,1,0,0
This is my code:
import re

output = {}
regex = re.compile(r'^(\w+ \d+ \d+:\d+):\d+ \w+ (\w+).*$')
with open("logfile", "r+") as myfile:
    for log_line in myfile:
        match = regex.match(log_line)
        if match:
            if match.group(1) and match.group(2):
                print(match.groups())
                # Stuck here trying to arrange the data; these lines raise a
                # KeyError because the inner dict is never initialized
                output[match.group(1)]['total_count'] += 1
                output[match.group(1)][match.group(2)] += 1
for k, v in output.items():
    print('{0} {1}'.format(k, v))
You can build a list of per-minute dictionaries as you read the file:
import re

output = []
regex = re.compile(r'^(\w+ \d+ \d+:\d+):\d+ \w+ (\w+).*$')
with open("logfile.txt", "r+") as myfile:
    for log_line in myfile:
        match = regex.match(log_line)
        if match:
            if match.group(1) and match.group(2):
                dataDict = {'minute': match.group(1), 'total_count': 1}
                dataDict[match.group(2)] = 1
                lastInsertedIndex = len(output) - 1
                if len(output) > 0:  # data exists, check whether the same minute already exists
                    if output[lastInsertedIndex]['minute'] == match.group(1):  # same minute, update existing data
                        lastInsertedIndexDict = output[lastInsertedIndex]
                        if match.group(2) in lastInsertedIndexDict:
                            lastInsertedIndexDict[match.group(2)] += 1  # updating the group(2) count
                        else:
                            lastInsertedIndexDict[match.group(2)] = 1
                        lastInsertedIndexDict['total_count'] += 1  # updating the total count
                    else:  # new minute, simply append
                        output.append(dataDict)
                else:  # output list is empty
                    output.append(dataDict)
for data in output:
    print(data)
The idea here is: once we have match.groups(), create a dictionary with the minute as a key and a total_count value of 1, then set match.group(2) as a new key with a value of 1.
Since the data is in increasing order of time, we only need to check whether the previously inserted entry is for the same minute or a different one.
If it is the same minute, increment the dictionary's total_count and match.group(2) values by 1.
If it is a different minute, simply append the new dictionary to the output list.
Currently the output list prints keys and values. If you want to print only the values, change print(data) in the last line to print(data.values()).
Just to mention, I have assumed that you are not facing any issue with the regex and that the regex you provided fulfills your requirement.
If you do run into a regex issue or need help with it, let me know in a comment.
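For comparison, here is a minimal sketch of another way to do the aggregation, assuming a logfile.txt input and hard-coding the column order from the expected output. It uses collections.Counter, and it widens the second group to [\w.-]+ so that hyphenated names such as run-parts are captured in full (a plain \w+ stops at the hyphen):
import re
from collections import Counter

regex = re.compile(r'^(\w+ \d+ \d+:\d+):\d+ \S+ ([\w.-]+)')
columns = ['logrotate', 'CROND', 'ntpd', 'anacron', 'run-parts']

counts = {}  # one Counter per minute; dicts keep insertion order on Python 3.7+
with open('logfile.txt') as myfile:
    for log_line in myfile:
        match = regex.match(log_line)
        if match:
            minute, process = match.group(1), match.group(2)
            counts.setdefault(minute, Counter())[process] += 1

print('minute,total_count,' + ','.join(columns))
for minute, counter in counts.items():
    row = [str(counter[name]) for name in columns]  # a Counter returns 0 for missing keys
    print('{0},{1},{2}'.format(minute, sum(counter.values()), ','.join(row)))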
This is my code:
tasks_file = open("tasks.txt", "r+")
for line in tasks_file:
    info = "Assigned to | Task | Task description | Date assigned | Due date | Task complete"
    string1 = info.split("|")
    for i in string1:
        print(i)
    string2 = line.split(",")
    for x in string2:
        print(x)
    output = i + ":" + x
    print(output)
So basically what I'm trying to do is combine each item in string1 with the corresponding item in string2, but the code only combines the last items and not the whole thing. Please assist in any way that you can. Thanks
task file contents:
admin, Register Users with taskManager.py, Use taskManager.py to add the usernames and passwords for all team members that will be using this program., 10 Oct 2019, 20 Oct 2019, No
admin, Assign initial tasks, Use taskManager.py to assign each team member with appropriate tasks, 10 Oct 2019, 25 Oct 2019, No
Desired output (an example of what the output should look like):
Task: Assign initial task
Assigned to: Admin
Date assigned: 10 oct 2019
Due date: 25 oct 2019
Task complete: No
outputs = []
for i, x in zip(string1, string2):
    outputs.append(i + ":" + x)
print(outputs)
outputs will be a list of the concatenated strings. You can combine the individual outputs together if you like with the str.join() method:
",".join(outputs)
will combine all the outputs together, with commas in between.
If you want to be fancy, you can do it all on nearly one line:
info = "Assigned to | Task | Task description | Date assigned | Due date | Task complete".split('|')
with open("tasks.txt") as fp:
for line in fp:
print(",".join(i + ":" + x for i, x in zip(info, line.split(','))))
But only if that doesn't confuse you or anyone else who reads your code.
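If you want output shaped like the example in the question, with one labelled field per line (in the header's order) and the stray whitespace stripped, a small variation on the same zip idea works. This is just a sketch assuming the tasks.txt layout shown above:
info = "Assigned to | Task | Task description | Date assigned | Due date | Task complete".split("|")
labels = [label.strip() for label in info]

with open("tasks.txt") as fp:
    for line in fp:
        fields = [field.strip() for field in line.split(",")]
        for label, field in zip(labels, fields):
            print(label + ": " + field)
        print()  # blank line between records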
I have converted a PDF bank statement to a txt file. Here is a snippet of the .txt file:
15 Apr 20DDOPEN 100.00DDBENNON WATER SRVCS29.00DDBG BUSINESS106.00BPC BOB PETROL MINISTRY78.03BPC BARBARA STREAMING DATA30.50CRPAYPAL Z4J22FR450.00CRPAYNAL AAWDL4Z4J22222KHMG30.0019,028.4917 Apr 20CRCASH IN AT HSBC BANK
What is the easiest way of rewriting the text file in Python to insert a new line at certain points, i.e. wherever a number 'xx.xx' is followed by a new entry such as a date 'xx Apr'?
For example the text to become:
15 Apr 20DDOPEN 100.00
BENNON WATER SRVCS29.00
DDBG BUSINESS106.00...(etc)
I am just trying to make a PDF more readable and useful when working amongst my other files.
If you know of another PDF to txt python converter which works better, I would also be interested.
Thanks for your help
The first step would be getting the text file into Python:
with open("file.txt") as file:
    data = file.read()
This next part I initially thought you wouldn't be able to do, but in your example each transaction contains a number XX.XX. The important thing to notice here is that there is a '.' in each number.
Using Python's string find method, you can iteratively look for that '.' and add a newline character two characters later. You can change my indices below to remove the DD as well if you want.
index = 0
while index != -1:
    index = data.find('.', index)
    if index != -1:
        data = data[:index + 3] + '\n' + data[index + 3:]
        index += 4  # jump past the inserted newline so the search advances
Then you need to write the new data back to the file:
with open('file.txt', 'w') as file:
    file.write(data)
For the given input the following should work:
import re

counter = 0
l = "15 Apr 20DDOPEN 100.00DDBENNON WATER SRVCS29.00DDBG BUSINESS106.00BPC BOB PETROL MINISTRY78.03BPC BARBARA STREAMING DATA30.50CRPAYPAL Z4J22FR450.00CRPAYNAL AAWDL4Z4J22222KHMG30.0019,028.4917 Apr 20CRCASH IN AT HSBC BANK"
nums = re.finditer(r"\d+\.\d+", l)
for elem in nums:
    idx = elem.span()[1] + counter
    l = l[:idx] + '\n' + l[idx:]
    counter += 1
print(l)
The output is:
15 Apr 20DDOPEN 100.00
DDBENNON WATER SRVCS29.00
DDBG BUSINESS106.00
BPC BOB PETROL MINISTRY78.03
BPC BARBARA STREAMING DATA30.50
CRPAYPAL Z4J22FR450.00
CRPAYNAL AAWDL4Z4J22222KHMG30.0019
,028.4917
Apr 20CRCASH IN AT HSBC BANK
Then you should easily be able to write it line by line to a file.
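If the amounts always have exactly two decimal places, the same splitting can be done with a single re.sub call. A minimal sketch under that assumption (amounts with thousands separators, such as 19,028.49 in the sample, would still need extra handling):
import re

l = "15 Apr 20DDOPEN 100.00DDBENNON WATER SRVCS29.00DDBG BUSINESS106.00"

# \1 re-inserts the matched amount, followed by a newline
print(re.sub(r"(\d+\.\d{2})", r"\1\n", l))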
Say for example, I have the following strings and an input 4.0, which represents seconds:
John Time Made 11:05:20 in 2010
5.001 Kelly #1
6.005 Josh #8
And would like the following result:
John Time Made 11:05:24 in 2010 #Input 4.0 is added to the seconds of 11:05:20
1.001 Kelly #1 #4.0 is subtracted from the first number 5.001 = 1.001
2.005 Josh #8 #4.0 is subtracted from the first number 6.005 = 2.005
How can I recognize the hours:minutes:seconds in the first line, and #.### in the rest to add/subtract the input number?
Thank you in advance and will accept/upvote answer
This solution should work if your complete data has the same format as the sample you provided. It assumes the data is in the input.txt file.
val_to_add = 4

with open('input.txt') as fin:
    # processing the first line
    first_line = fin.readline().strip()
    splitted = first_line.split(' ')
    # get hour, minute, second corresponding to the time (11:05:20)
    time_values = splitted[3].split(':')
    # seconds is the last element
    seconds = int(time_values[-1])
    # add the value
    new_seconds = seconds + val_to_add
    # simple modular arithmetic to avoid values >= 60 for minute and second;
    # this part could also be solved with datetime or some other lib, but it's
    # not that complex, so it's done in a couple of lines
    seconds = new_seconds % 60  # if we get > 59 seconds, keep the remainder and carry the rest to minutes
    new_minutes = int(time_values[1]) + new_seconds // 60  # carry the minutes produced by the seconds addition
    minutes = new_minutes % 60  # same treatment as for seconds
    hours = int(time_values[0]) + new_minutes // 60
    # convert back to strings so we can join them, zero-padding 1-digit numbers
    time_values[0] = str(hours).rjust(2, '0')
    time_values[1] = str(minutes).rjust(2, '0')
    time_values[2] = str(seconds).rjust(2, '0')
    new_time_val = ':'.join(time_values)  # join the values to follow the HH:MM:SS format
    splitted[3] = new_time_val  # replace the old time with the new one (with the value added)
    first_line_modified = ' '.join(splitted)  # join the modified list back together
    print(first_line_modified)
    # processing the other lines
    for line in fin:
        # take only the first (0th) value, subtract val_to_add, and round the
        # result to 3 digits (to avoid too many decimal places)
        stripped = line.strip()
        splitted = stripped.split(' ')
        splitted[0] = str(round(float(splitted[0]) - val_to_add, 3))
        modified_line = ' '.join(splitted)
        print(modified_line)
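As the comment in the code notes, the carry arithmetic can also be delegated to the datetime module. Here is a minimal sketch of that variant for the first line's time (a hypothetical time_str stands in for splitted[3], and it assumes the time never rolls past midnight):
import datetime

val_to_add = 4
time_str = '11:05:20'  # stands in for splitted[3] from the code above

# parse HH:MM:SS, add the seconds, and format back
t = datetime.datetime.strptime(time_str, '%H:%M:%S')
new_time_str = (t + datetime.timedelta(seconds=val_to_add)).strftime('%H:%M:%S')
print(new_time_str)  # 11:05:24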
Although regex was discouraged in the comments, it can be used to parse the time strings into datetime objects, perform the necessary calculations on them, and then print the results in the required format:
# datetime module for time calculations
import datetime
# regex module
import re

# seconds to add to the time
myinp = 4

# List of data strings
# data = 'John Time Made 11:05:20 in 2010', '5.001 Kelly', '6.005 Josh'
with open('data.txt') as f:
    data = f.readlines()

new_data = []
# iterate through the list of data strings
for time in data:
    try:
        # First check for the 'HH:MM:SS' time format in the data string;
        # regex taken from this question: http://stackoverflow.com/questions/8318236/regex-pattern-for-hhmmss-time-string
        match = re.findall(r"([0-1]?\d|2[0-3]):([0-5]?\d):([0-5]?\d)", time)
        # this regex returns a list of tuples of strings, [('HH', 'MM', 'SS')],
        # which we join back together with ':' (colon) separators
        t = ':'.join(match[0])
        # create a datetime object by indexing the first matched time in the list,
        # taken from this answer: http://stackoverflow.com/questions/100210/what-is-the-standard-way-to-add-n-seconds-to-datetime-time-in-python
        # Indexing may raise an IndexError, which we catch in the `except` clause below
        orig = datetime.datetime(100, 1, 1, int(match[0][0]), int(match[0][1]), int(match[0][2]))
        # Add the number of seconds to the datetime object,
        # taken from this answer: http://stackoverflow.com/questions/656297/python-time-timedelta-equivalent
        newtime = (orig + datetime.timedelta(0, myinp)).time()
        # replace the time in the original data string with the new time
        new_data.append(time.replace(t, str(newtime)))
    # catch the IndexError, in which case we look for float-formatted seconds only
    except IndexError:
        # look for float-formatted seconds (s.xxx),
        # taken from this answer: http://stackoverflow.com/questions/4703390/how-to-extract-a-floating-number-from-a-string
        match = re.findall(r"\d+\.\d+", time)
        # create a datetime object from the first matched time in the list,
        # specifying only seconds and microseconds (converted from the millisecond part: milli * 1000)
        orig = datetime.datetime(100, 1, 1, second=int(match[0].split('.')[0]),
                                 microsecond=int(match[0].split('.')[1]) * 1000)
        # Subtract the seconds from the datetime object, similar to the time addition in the `try` clause above
        newtime = orig - datetime.timedelta(0, myinp)
        # format the new time as seconds plus the milliseconds converted back from microseconds
        newtime_fmt = newtime.second + newtime.microsecond / 1000000.
        # Get the seconds value (index 0) by splitting the original string at the space between the seconds and the name
        t = time.split(' ')[0]
        # replace the time in the original data string with the new time
        new_data.append(time.replace(t, str(newtime_fmt)))

with open('new_data.txt', 'w') as nf:
    for newline in new_data:
        nf.write(newline)
The new_data.txt file contents should then read:
John Time Made 11:05:24 in 2010
1.001 Kelly
2.005 Josh
I have a large number of text files to read from in Python. Each file is structured as the following sample:
------------------------------------------------------------------------------
\\
Paper: some_integer
From: <some_email_address>
Date: Wed, 4 Apr 2001 12:08:13 GMT (27kb)
Date (revised v2): Tue, 8 May 2001 10:39:33 GMT (27kb)
Title: some_title
Authors: name_1, name_2
Comments: 28 pages, JHEP latex
\\
blablabla (this is a multiline abstract of the paper)
blablabla
blablabla
\\
I would like to automatically extract and store (e.g., as a list) the Title, Authors, and abstract (the text between the second and third \\ - note that it starts with an indent) from each text file. Also note that the white line between Date (revised) and Title is really there (it is not a typo that I introduced).
My attempts so far have involved (I am showing the steps for a single text file, say the first file in the list):
filename = os.listdir(path)[0]
test = pd.read_csv(filename, header=None, delimiter="\t")
Which gives me:
0
0 ----------------------------------------------...
1 \\
2 Paper: some_integer
3 From: <some_email_address>
4 Date: Wed, 4 Apr 2001 12:08:13 GMT (27kb)
5 Date (revised v2): Tue, 8 May 2001 10:39:33 G...
6 Title: some_title...
7 Authors: name_1, name_2
8 Comments: 28 pages, JHEP latex
9 Report-no: DUKE-CGTP-00-01
10 \\
11 blabla...
12 blabla...
13 blabla...
14 \\
I can then select a given row (e.g., the one featuring the title) with:
test[test[0].str.contains("Title")].to_string()
But it is truncated, it is not a clean string (some attributes show up), and I find this entire pandas-based approach quite tedious, actually... There must be an easier way to directly select the rows of interest from the text file using regex. At least I hope so...
You could process the file line by line:
import re

data = {}
temp_s = ''
with open('myfile.txt', 'r') as infile:
    for line in infile:
        if ":" in line:
            key, value = line.split(':', 1)  # split on the first colon only
            data[key] = value.strip()
        elif re.search(r'\w+', line):
            # first line of the abstract: collect lines until one with no word characters
            match = re.search(r'(\w.*)', line)
            temp_s += match.group(1)
            for line in infile:
                match = re.search(r'(\w.*)', line)
                if match:
                    temp_s += ' ' + match.group(1)
                else:
                    break
data['abstract'] = temp_s
How about iterating over each line in the file and splitting on the first ": " when it is present in the line, collecting the results of the splits in a dictionary:
with open("input.txt") as f:
data = dict(line.strip().split(": ", 1) for line in f if ": " in line)
As a result, the data would contain:
{
'Comments': '28 pages, JHEP latex',
'Paper': 'some_integer',
'From': '<some_email_address>',
'Date (revised v2)': 'Tue, 8 May 2001 10:39:33 GMT (27kb)',
'Title': 'some_title',
'Date': 'Wed, 4 Apr 2001 12:08:13 GMT (27kb)',
'Authors': 'name_1, name_2'
}
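That covers the header fields; to also pull out the abstract (the indented block between the second and third \\), here is a small follow-up sketch in the same spirit, assuming the dividers are the literal \\ lines shown in the question:
with open("input.txt") as f:
    text = f.read()

# the file is divided by lines containing only backslashes;
# the abstract is the block between the second and third divider
blocks = [b.strip() for b in text.split("\\\\")]
if len(blocks) >= 3:
    abstract = " ".join(blocks[2].split())  # collapse the indented multi-line text
    print(abstract)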
If your files really always have the same structure, you could come up with:
# -*- coding: utf-8 -*-
import re
string = """
------------------------------------------------------------------------------
\\
Paper: some_integer
From: <some_email_address>
Date: Wed, 4 Apr 2001 12:08:13 GMT (27kb)
Date (revised v2): Tue, 8 May 2001 10:39:33 GMT (27kb)
Title: some_title
Authors: name_1, name_2
Comments: 28 pages, JHEP latex
\\
blablabla (this is the abstract of the paper)
\\
"""
rx = re.compile(r"""
    ^Title:\s(?P<title>.+)[\n\r]      # Title at the beginning of a line
    Authors:\s(?P<authors>.+)[\n\r]   # Authors: ...
    Comments:\s(?P<comments>.+)[\n\r] # ... and so on ...
    .*[\n\r]
    (?P<abstract>.+)""",
    re.MULTILINE | re.VERBOSE)  # MULTILINE so that the caret matches at any line start,
                                # VERBOSE for this commented layout
for match in rx.finditer(string):
    print(match.group('title'), match.group('authors'), match.group('abstract'))
    # some_title name_1, name_2 blablabla (this is the abstract of the paper)
This approach takes Title as the anchor (beginning of a line) and skims the text afterwards. The named groups may not really be necessary but make the code easier to understand. The pattern [\n\r] looks for newline characters.
See a demo on regex101.com.
This pattern will get you started:
\\[^\\].*[^\\]+Title:\s+(\S+)\s+Authors:\s+(.*)[^\\]+\\+\s+([^\\]*)\n\\
Assume 'txtfile.txt' has the format shown at the top. If using Python 2.7.x:
import re

with open('txtfile.txt', 'r') as f:
    input_string = f.read()

p = r'\\[^\\].*[^\\]+Title:\s+(\S+)\s+Authors:\s+(.*)[^\\]+\\+\s+([^\\]*)\n\\'
print re.findall(p, input_string)
Output:
[('some_title', 'name_1, name_2', 'blablabla (this is a multiline abstract of the paper)\n blablabla\n blablabla')]
I've read all of the articles I could find and even understood a few of them, but as a Python newb I'm still a little lost and hoping for help :)
I'm working on a script to parse items of interest out of an application-specific log file. Each line begins with a time stamp, which I can match, and I can define two things to identify what I want to capture: some partial content and a string that marks the end of what I want to extract.
My issue is multi-line entries: in most cases every log line is terminated with a newline, but some entries contain SQL that may have newlines within it and therefore spans multiple lines in the log.
So, in a simple case I may have this:
[8/21/13 11:30:33:557 PDT] 00000488 SystemOut O 21 Aug 2013 11:30:33:557 [WARN] [MXServerUI01] [CID-UIASYNC-17464] BMXAA6720W - USER = (ABCDEF) SPID = (2526) app (ITEM) object (ITEM) : select * from item where ((status != 'OBSOLETE' and itemsetid = 'ITEMSET1') and (exists (select 1 from maximo.invvendor where (exists (select 1 from maximo.companies where (( contains(name,' $AAAA ') > 0 )) and (company=invvendor.manufacturer and orgid=invvendor.orgid))) and (itemnum = item.itemnum and itemsetid = item.itemsetid)))) and (itemtype in (select value from synonymdomain where domainid='ITEMTYPE' and maxvalue = 'ITEM')) order by itemnum asc (execution took 2083 milliseconds)
This all appears as one line which I can match with this:
re.compile('\[(0?[1-9]|[12][0-9]|3[01])(\/)(0?[1-9]|[12][0-9]|3[01])(\/)([0-9]{2}).*(milliseconds)')
However, in some cases there may be line breaks in the SQL; I still want to capture such entries (and potentially replace the line breaks with spaces). I am currently reading the file a line at a time, which obviously isn't going to work, so...
Do I need to process the whole file in one go? They are typically 20mb in size. How do I read the entire file and iterate through it looking for single or multi-line blocks?
How would I write a multi-line RegEx that would match the whole thing, whether it is on one line or spread across multiple lines?
My overall goal is to parameterize this so I can use it for extracting log entries that match different patterns of the starting string (always the start of a line), the ending string (where I want to capture to) and a value that is between them as an identifier.
Thanks in advance for any help!
Chris.
import sys, getopt, os, re

sourceFolder = 'C:/MaxLogs'
logFileName = sourceFolder + "/Test.log"
lines = []

print "--- START ----"
lineStartsWith = re.compile(r'\[(0?[1-9]|[12][0-9]|3[01])(\/)(0?[1-9]|[12][0-9]|3[01])(\/)([0-9]{2})(\ )')
lineContains = re.compile(r'.*BMXAA6720W.*')
lineEndsWith = re.compile(r'(?:.*milliseconds.*)')

with open(logFileName, 'r') as f:
    for line in f:
        if lineStartsWith.match(line) and lineContains.match(line):
            if lineEndsWith.match(line):
                print 'Full Line Found'
                print line
                print "- Record Separator -"
            else:
                print 'Partial Line Found'
                print line
                print "- Record Separator -"

print "--- DONE ----"
Next step: for my partial line I'll continue reading until I find lineEndsWith and assemble the lines into one block.
I'm no expert so suggestions are always welcome!
UPDATE - So I have it working, thanks to all the responses that helped direct things. I realize it isn't pretty and I need to clean up my if/elif mess and make it more efficient, but IT'S WORKING! Thanks for all the help.
import sys, getopt, os, re

sourceFolder = 'C:/MaxLogs'
logFileName = sourceFolder + "/Test.log"

print "--- START ----"
lineStartsWith = re.compile(r'\[(0?[1-9]|[12][0-9]|3[01])(\/)(0?[1-9]|[12][0-9]|3[01])(\/)([0-9]{2})(\ )')
lineContains = re.compile(r'.*BMXAA6720W.*')
lineEndsWith = re.compile(r'(?:.*milliseconds.*)')
lines = []
multiLine = False

with open(logFileName, 'r') as f:
    for line in f:
        if lineStartsWith.match(line) and lineContains.match(line) and lineEndsWith.match(line):
            lines.append(line.replace("\n", " "))
        elif lineStartsWith.match(line) and lineContains.match(line) and not multiLine:
            # Found the start of a multi-line entry
            multiLineString = line
            multiLine = True
        elif multiLine and not lineEndsWith.match(line):
            multiLineString = multiLineString + line
        elif multiLine and lineEndsWith.match(line):
            multiLineString = multiLineString + line
            multiLineString = multiLineString.replace("\n", " ")
            lines.append(multiLineString)
            multiLine = False

for line in lines:
    print line
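Since the update mentions wanting to clean up the if/elif chain, here is one possible restructuring, a sketch (not the author's code) that accumulates each entry in a buffer and flushes it whenever the end marker appears, keeping the same three compiled patterns and the Python 2 prints of the original:
import re

logFileName = 'C:/MaxLogs/Test.log'
lineStartsWith = re.compile(r'\[(0?[1-9]|[12][0-9]|3[01])(\/)(0?[1-9]|[12][0-9]|3[01])(\/)([0-9]{2})(\ )')
lineContains = re.compile(r'.*BMXAA6720W.*')
lineEndsWith = re.compile(r'(?:.*milliseconds.*)')

lines = []
buffer = ''  # holds a (possibly multi-line) entry in progress

with open(logFileName, 'r') as f:
    for line in f:
        if buffer:
            buffer += line  # continuation of a multi-line entry
        elif lineStartsWith.match(line) and lineContains.match(line):
            buffer = line  # start of a new entry of interest
        if buffer and lineEndsWith.match(line):
            lines.append(buffer.replace("\n", " "))  # entry complete: flush it
            buffer = ''

for line in lines:
    print line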
Do I need to process the whole file in one go? They are typically 20mb in size. How do I read the entire file and iterate through it looking for single or multi-line blocks?
There are two options here.
You could read the file block by block, making sure to attach any "leftover" bit at the end of each block to the start of the next one, and search each block. Of course you will have to figure out what counts as "leftover" by looking at what your data format is and what your regex can match, and in theory it's possible for multiple blocks to all count as leftover…
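For that first option, here is a rough sketch of block-by-block reading. The names and block sizes are my own, and it assumes no single match is longer than the overlap; the tail of each block is re-scanned with the next one so matches straddling a block boundary are not lost:
import re

def search_in_blocks(path, compiled_re, chunk=1 << 20, overlap=1 << 16):
    # Assumption: no single match is longer than `overlap` characters.
    leftover = ''
    with open(path, 'r') as f:
        while True:
            block = f.read(chunk)
            data = leftover + block
            if not block:
                # end of file: scan whatever is left in full, then stop
                for m in compiled_re.finditer(data):
                    yield m.group(0)
                return
            cutoff = max(len(data) - overlap, 0)
            keep_from = cutoff
            for m in compiled_re.finditer(data):
                if m.end() > cutoff:
                    # this match reaches into the tail and may continue into
                    # the next block, so defer it to the next pass
                    keep_from = min(keep_from, m.start())
                    break
                yield m.group(0)
            leftover = data[keep_from:]

# usage: for entry in search_in_blocks('Test.log', compiled_re): ...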
Or you could just mmap the file. An mmap acts like a bytes (or like a str in Python 2.x), and leaves it up to the OS to handle paging blocks in and out as necessary. Unless you're trying to deal with absolutely huge files (gigabytes in 32-bit, even more in 64-bit), this is trivial and efficient:
import mmap

with open('bigfile', 'rb') as f:
    with mmap.mmap(f.fileno(), length=0, access=mmap.ACCESS_READ) as m:
        for match in compiled_re.finditer(m):
            do_stuff(match)
In older versions of Python, mmap isn't a context manager, so you'll need to wrap contextlib.closing around it (or just use an explicit close if you prefer).
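For those older versions, a minimal sketch of the contextlib.closing variant (with the same placeholder compiled_re and do_stuff as above):
import contextlib
import mmap

with open('bigfile', 'rb') as f:
    with contextlib.closing(mmap.mmap(f.fileno(), length=0, access=mmap.ACCESS_READ)) as m:
        for match in compiled_re.finditer(m):
            do_stuff(match)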
How would I write a multi-line RegEx that would match the whole thing, whether it is on one line or spread across multiple lines?
You could use the DOTALL flag, which makes the . match newlines. You could instead use the MULTILINE flag and put appropriate $ and/or ^ characters in, but that makes simple cases a lot harder, and it's rarely necessary. Here's an example with DOTALL (using a simpler regexp to make it more obvious):
>>> s1 = """[8/21/13 11:30:33:557 PDT] 00000488 SystemOut O 21 Aug 2013 11:30:33:557 [WARN] [MXServerUI01] [CID-UIASYNC-17464] BMXAA6720W - USER = (ABCDEF) SPID = (2526) app (ITEM) object (ITEM) : select * from item where ((status != 'OBSOLETE' and itemsetid = 'ITEMSET1') and (exists (select 1 from maximo.invvendor where (exists (select 1 from maximo.companies where (( contains(name,' $AAAA ') > 0 )) and (company=invvendor.manufacturer and orgid=invvendor.orgid))) and (itemnum = item.itemnum and itemsetid = item.itemsetid)))) and (itemtype in (select value from synonymdomain where domainid='ITEMTYPE' and maxvalue = 'ITEM')) order by itemnum asc (execution took 2083 milliseconds)"""
>>> s2 = """[8/21/13 11:30:33:557 PDT] 00000488 SystemOut O 21 Aug 2013 11:30:33:557 [WARN] [MXServerUI01] [CID-UIASYNC-17464] BMXAA6720W - USER = (ABCDEF) SPID = (2526) app (ITEM) object (ITEM) : select * from item where ((status != 'OBSOLETE' and itemsetid = 'ITEMSET1') and
(exists (select 1 from maximo.invvendor where (exists (select 1 from maximo.companies where (( contains(name,' $AAAA ') > 0 )) and (company=invvendor.manufacturer and orgid=invvendor.orgid))) and (itemnum = item.itemnum and itemsetid = item.itemsetid)))) and (itemtype in (select value from synonymdomain where domainid='ITEMTYPE' and maxvalue = 'ITEM')) order by itemnum asc (execution took 2083 milliseconds)"""
>>> r = re.compile(r'\[(.*?)\].*?milliseconds\)', re.DOTALL)
>>> r.findall(s1)
['8/21/13 11:30:33:557 PDT']
>>> r.findall(s2)
['8/21/13 11:30:33:557 PDT']
As you can see the second .*? matched the newline just as easily as a space.
If you're just trying to treat a newline as whitespace, you don't need either; '\s' already catches newlines.
For example:
>>> s1 = 'abc def\nghi\n'
>>> s2 = 'abc\ndef\nghi\n'
>>> r = re.compile(r'abc\s+def')
>>> r.findall(s1)
['abc def']
>>> r.findall(s2)
['abc\ndef']
You can read an entire file into a string and then use re.split to make a list of all the entries separated by times. Here's an example:
f = open(...)
allLines = f.read()  # read the whole file into one string
entries = re.split(regex, allLines)
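One caveat: a plain re.split discards the timestamps themselves. Here is a hedged sketch using a lookahead instead, so each entry keeps its leading timestamp (it assumes the [M/D/YY ...] format from the question, and Python 3.7+, where re.split can split on a zero-width match):
import re

log = ("[8/21/13 11:30:33:557 PDT] 00000488 SystemOut first entry\n"
       "spills onto a second line\n"
       "[8/21/13 11:30:34:001 PDT] 00000489 SystemOut second entry\n")

# split *before* each timestamp instead of consuming it
entries = re.split(r'(?=\[\d{1,2}/\d{1,2}/\d{2} )', log)
for entry in filter(None, entries):
    print(entry.replace('\n', ' ').strip())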