Regex vs readline for text processing - python

I have some text to process (router output) and want to generate a useful data structure from it: a dictionary with interface names as keys and packet counts as values. I have two approaches to do the same task. I would like to know which one I should use for efficiency, and which one is more likely to fail on larger data samples.
ReadLine1 gets a list of lines from splitlines, processes the output, and writes into the dictionary with the interface name as the key and the next three items as the values.
Readline2 uses the re module to match groups, and from the groups it writes the dictionary keys and values.
The input self.output to these functions will be something like this:
message =
"""
Interface 1/1\n\t
input : 1234\n\t
output : 3456\n\t
dropped : 12\n
\n
Interface 1/2\n\t
input : 7123\n\t
output : 2345\n\t
dropped : 31\n\t
"""
def ReadLine1(self):
    lines = self.output.splitlines()
    for index, line in enumerate(lines):
        if "Interface" in line:
            valuelist = []
            for i in [1, 2, 3]:
                valuelist.append(lines[index + i].split(':')[1].strip())
            self.IFlist[line.split()[1]] = valuelist
    return self.IFlist
def Readline2(self):
    #print repr(self.output)
    n = re.compile(r"\n*Interface (./.)\n\s*input : ([0-9]+)\n\s*output : ([0-9]+)\n\s*dropped : ([0-9]+)",
                   re.MULTILINE | re.DOTALL)
    blocks = self.output.split('\n\n')
    for block in blocks:
        m_object = re.match(n, block)
        self.IFlist[m_object.group(1)] = [m_object.group(i) for i in (2, 3, 4)]

Both of your methods rely on specific aspects of the format to achieve the parsing, and if that format were changed or broken, either method could break as well.
For example, if you added a space to the (apparently) empty line between the two entries, then blocks = self.output.split('\n\n') would fail to find two consecutive newline characters, and the regex version would miss the second entry:
{'1/1': ['1234', '3456', '12']}
Or if you added an extra newline between input and output like this:
Interface 1/2
input : 7123

output : 2345
dropped : 31
The regex's \s* would handle the extra newline fine, but the non-regex parsing assumes that lines[index+i].split(':') has an index [1], so it would raise an IndexError with that data.
Or if you added some extra space at the end of any line, the regex would fail to see the newline right after the content, re.match(n, block) would return None, and the next line would raise an AttributeError: 'NoneType' object has no attribute 'group'.
Or if you changed Interface to interface for one of the entries (no longer a capital I), the regex version would raise the same error as above, but the non-regex version would simply ignore that entry.
While testing I found that the regex was easier to break with small edits to the sample message, but I also found that the version I made using a generator expression and str.partition was significantly more robust than both of them:
def readline3(self):
    gen_lines = (line for line in self.output.splitlines()
                 if line and not line.isspace())
    try:
        while True:  # ends when next() raises StopIteration
            start, _, key = next(gen_lines).partition(" ")
            if start == "Interface":
                self.IFlist[key] = [next(gen_lines).rpartition(" : ")[2]
                                    for _ in "123"]
    except StopIteration:  # reached end of output
        return self.IFlist
This succeeded in every case mentioned above and a few more, and since the only method it relies on is str.partition, which always returns a 3-item tuple, there is nothing to raise an unexpected error unless self.output is something other than a string.
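The robustness claim rests on a property worth spelling out: str.partition and str.rpartition never raise, they just return empty strings when the separator is missing. A quick illustration:

```python
# str.partition / str.rpartition always return a 3-item tuple,
# even when the separator is absent, so indexing [2] can never fail
print("Interface 1/1".partition(" "))        # ('Interface', ' ', '1/1')
print("no separator here".partition(" : "))  # ('no separator here', '', '')
print("input : 1234".rpartition(" : ")[2])   # 1234
```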
Also, running a benchmark using timeit, your readline1 was consistently faster than readline2, and my readline3 usually took slightly longer than readline1:
#using the default 1000000 loops using 'message'
<function readline1 at 0x100756f28>
11.225649802014232
<function readline2 at 0x1057e3950>
14.838601427007234
<function readline3 at 0x1057e39d8>
11.693351223017089

Related

Removing \n from readlines with rstrip("\n")

I'm reading two files that contain strings, line by line, and grouping them: here in my code I'm grouping one line from the proxy file with three lines from the websites file.
The major problem is that when I try to print the lines from the second file, I can't remove the \n.
I have tried to use rstrip("\n") but I got this exception:
AttributeError: 'tuple' object has no attribute 'rstrip'
This is my code:
proxy = open("proxy.txt", 'r')
readproxy = proxy.readlines()
sites = open("websites.txt", 'r')
readsites = sites.readlines()
for i, j in zip(readproxy, zip(*(iter(readsites),) * 3)):
    try:
        i = i.rstrip("\n")
        #j = j.rstrip("\n")
        print(i, j)
    except ValueError:
        print('whatever')
I removed the \n from i successfully, but couldn't remove it from j.
A tuple doesn't have a method that magically propagates to all its string elements.
To do that, you have to apply rstrip to every item in a list comprehension, creating a new list:
j = [x.rstrip("\n") for x in j]
which rebuilds the list with each item stripped. Convert to a tuple if really needed:
j = tuple(x.rstrip("\n") for x in j)
Note that you could avoid this post-processing entirely by doing:
readsites = sites.read().splitlines()
instead of sites.readlines(), since splitlines splits into lines while removing the termination character when called with no argument. The only drawback is when the file is huge: the memory needed by read plus split is briefly doubled. In that case, you could do:
with open("websites.txt", 'r') as sites:
    readsites = [line.rstrip("\n") for line in sites]
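For illustration, here is the difference between the approaches on an in-memory file (io.StringIO stands in for the real websites.txt):

```python
import io

data = "line1\nline2\nline3\n"

# readlines() keeps the terminator on every line
assert io.StringIO(data).readlines() == ["line1\n", "line2\n", "line3\n"]

# read().splitlines() strips the terminators, at the cost of briefly
# holding the content twice (the read result plus the split result)
assert io.StringIO(data).read().splitlines() == ["line1", "line2", "line3"]

# stripping while iterating keeps memory usage per-line
assert [line.rstrip("\n") for line in io.StringIO(data)] == ["line1", "line2", "line3"]
```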

converting matrix from logfile

I have a matrix written in this format inside a log file:
2014-09-08 14:10:20,107 - root - INFO - [[ 8.30857546 0.69993454 0.20645551
77.01797674 13.76705776]
[ 8.35205432 0.53417203 0.19969048 76.78598173 14.12810144]
[ 8.37066492 0.64428449 0.18623849 76.4181809 14.3806312 ]
[ 8.50493296 0.5110043 0.19731849 76.45838604 14.32835821]
[ 8.18900791 0.4955451 0.22524777 76.96966663 14.12053259]]
...some text
2014-09-08 14:12:22,211 - root - INFO - [[ 3.25142253e+01 1.11788106e+00 1.51065008e-02 6.16496299e+01
4.70315726e+00]
[ 3.31685887e+01 9.53522041e-01 1.49767860e-02 6.13449154e+01
4.51799710e+00]
[ 3.31101827e+01 1.09729703e+00 5.03347259e-03 6.11818594e+01
4.60562742e+00]
[ 3.32506957e+01 1.13837592e+00 1.51783456e-02 6.08651657e+01
4.73058437e+00]
[ 3.26809490e+01 1.06617279e+00 1.00110121e-02 6.17429172e+01
4.49994994e+00]]
I am writing this matrix using the python logging package:
logging.info(conf_mat)
However, logging.info does not give me a way to write the matrix in a %.3f float format. So I decided to parse the log file this way:
conf_mat = [[]]
cf = '[+-]?(?=\d*[.eE])(?=\.?\d)\d*\.?\d*(?:[eE][+-]?\d+)?'
with open(sys.argv[1]) as f:
    for line in f:
        epoch = re.findall(ep, line)  # find lines starting with epoch for other stuff
        if epoch:
            error_line = next(f)  # grab the next line, which is the error line
            error_value = error_line[error_line.rfind('=')+1:]
            data_points.append(map(float, epoch[0]+(error_value,)))  # get the error value for the specific epoch
            for i in range(N):
                cnf_mline = next(f)
                match = re.findall(cf, cnf_mline)
                if match:
                    conf_mat[count].append(map(float, match))
                else:
                    conf_mat.append([])
                    count += 1
However, the regex does not catch the line break inside the matrix, which shows up when I try to convert the matrix using
conf_mtx = np.array(conf_mat)
Your regex string cf needs to be a raw string literal:
cf = r'[+-]?(?=\d*[.eE])(?=\.?\d)\d*\.?\d*(?:[eE][+-]?\d+)?'
in order to work properly. Backslash \ characters are interpreted as escape sequences in "regular" strings, but should not be in regexes. You can read about raw string literals at the top of the re module's documentation, and in this excellent SO answer. Alex Martelli explains them quite well, so I won't repeat everything he says here. Suffice it to say that were you not to use a raw literal, you'd have to escape each and every one of your backslashes with another backslash, and that just gets ugly and annoying fast.
As for the rest of your code, it won't run without more information. The N in for i in range(N): is undefined, as is count a few lines later. Calling cnf_mline = next(f) really doesn't make sense at all, because you're going to run out of lines in the file (by calling next repeatedly) before you can iterate over all of them using the for line in f: command. It's unclear whether your data really has that line break in the second half where one of the members of the list is on the next line, I assume that's the case because of the next attempt.
I think you should first clean up your input file into a regular format; then you'll have a much easier time running regular expressions on it. To work on subsequent lines without exhausting your generator through excessive calls to next(), check out itertools.tee(). It returns n independent iterators from a single iterable, allowing you to advance the second one line ahead of the first. Alternatively, you could read your file's lines into a list and just operate on indices i and i+1. Either way: strip each line, join them together, and write to a new file or list. You can then rewrite your matching loop to simply pull each number of the appropriate format out and insert it into your matrix at the correct position. The good news is that your regex caught everything I threw at it, so you won't need to modify anything there.
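As a sketch of that clean-up step: join the wrapped physical lines back into one logical line, then extract the rows. The joining strategy and the bracket-extraction regex here are my assumptions; only the number regex comes from the question.

```python
import re

# a two-row excerpt in the log's wrapped format (sample data)
log = (
    "2014-09-08 14:10:20,107 - root - INFO - [[ 8.30857546 0.69993454 0.20645551\n"
    "  77.01797674 13.76705776]\n"
    " [ 8.35205432 0.53417203 0.19969048 76.78598173 14.12810144]]\n"
)

num_re = re.compile(r'[+-]?(?=\d*[.eE])(?=\.?\d)\d*\.?\d*(?:[eE][+-]?\d+)?')

# 1. join the wrapped physical lines into one logical line
joined = " ".join(line.strip() for line in log.splitlines())
# 2. pull out each bracketed row, then the floats inside it
rows = [[float(n) for n in num_re.findall(row)]
        for row in re.findall(r'\[([^][]+)\]', joined)]

print(rows)  # two rows of five floats each
```

From there, np.array(rows) should produce a clean 2x5 matrix.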
Good luck!

Loop the remaining elements within a loop

I have the following text:
ERROR: <C:\Includes\Library1.inc:123> This is the Error
Call Trace:
<C:\Includes\Library2.inc:456>
<C:\Includes\Library2.inc:789>
<C:\Code\Main.ext:12>
<Line:1>
ERROR: <C:\Includes\Library2.inc:2282> Another Error
Call Trace:
<C:\Code\Main.ext:34>
<C:\Code\Main.ext:56>
<C:\Code\Main.ext:78>
<Line:1>
ERROR: <C:\Code\Main.ext:90> Error Three
I would like to extract the following information:
line, Error = 12, This is the Error
line, Error = 34, Another Error
line, Error = 90, Error Three
Here is how far I got:
theText = 'ERROR: ...'
ERROR_RE = re.compile(r'^ERROR: <(?P<path>.*):(?P<line>[0-9]+)> (?P<error>.*)$')
mainName = '\\Main.ext'
# Go through each line
for fullline in theText.splitlines():
    match = ERROR_RE.match(fullline)
    if match:
        path, line, error = match.group('path'), match.group('line'), match.group('error')
        if path.endswith(mainName):
            callSomething(line, error)
        # else check next line for 'Call Trace:'
        # check next lines for mainName and get the linenumber
        # callSomething(linenumber, error)
What is the pythonic way to loop the remaining elements within a loop?
Solution:
http://codepad.org/BcYmybin
The direct answer to your question, regarding how to loop over the remaining lines, is: change the start of the loop to
lines = theText.splitlines()
for (linenum, fullline) in enumerate(lines):
Then, after a match, you can get at the remaining lines by looking at lines[j] in an inner loop where j starts at linenum+1 and runs until the next match.
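That inner-loop pattern might look like this (the names and sample text are illustrative):

```python
theText = ("ERROR: first\n"
           "detail a\n"
           "detail b\n"
           "ERROR: second\n"
           "detail c")

results = []
lines = theText.splitlines()
for linenum, fullline in enumerate(lines):
    if fullline.startswith("ERROR:"):
        # collect the lines belonging to this match
        details = []
        for j in range(linenum + 1, len(lines)):
            if lines[j].startswith("ERROR:"):  # next match ends this block
                break
            details.append(lines[j])
        results.append((fullline, details))

print(results)
```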
However, a slicker way to solve the problem is to first split the text into blocks. There are many ways to do this, however, being a former perl user, my impulse is to use regular expressions.
# Split into blocks that start with /^ERROR/ and run until either the next
# /^ERROR/ or until the end of the string.
#
# (?m) - lets '^' and '$' match the beginning/end of each line
# (?s) - lets '.' match newlines
# ^ERROR - triggers the beginning of the match
# .*? - grab characters in a non-greedy way, stopping when the following
# expression matches
# (?=^ERROR|$(?!\n)) - match until the next /^ERROR/ or the end of string
# $(?!\n) - match end of string. Normally '$' suffices but since we turned
# on multiline mode with '(?m)' we have to use '(?!\n)$ to prevent
# this from matching end-of-line.
blocks = re.findall('(?ms)^ERROR.*?(?=^ERROR|$(?!\n))', theText)
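Applied to a shortened version of the sample text, that split produces one block per ERROR line (theText here is abbreviated for illustration):

```python
import re

theText = ("ERROR: <C:\\Includes\\Library1.inc:123> This is the Error\n"
           "Call Trace:\n"
           "<C:\\Code\\Main.ext:12>\n"
           "<Line:1>\n"
           "ERROR: <C:\\Code\\Main.ext:90> Error Three")

# each block runs from one ^ERROR to just before the next (or end of string)
blocks = re.findall('(?ms)^ERROR.*?(?=^ERROR|$(?!\n))', theText)
print(len(blocks))  # 2
print(blocks[1])    # ERROR: <C:\Code\Main.ext:90> Error Three
```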
Replace this:
# else check next line for 'Call Trace:'
# check next lines for mainName and get the linenumber
# callSomething(linenumber, error)
With this:
match = stackframe_re.match(fullline)
if match and error:  # error is set from earlier, when you matched ERROR_RE
    path, line = match.group('path'), match.group('line')
    if path.endswith(mainName):
        callSomething(line, error)
        error = None  # don't report this error again if you see main again
Note the indentation. Also initialize error = None before the loop begins and set error = None after the first call to callSomething. In general, the code I've suggested should work for properly-formatted data, but you might want to improve it so it doesn't give misleading results if the data does not match the format you expect.
You will have to write stackframe_re, but it should be an RE that matches, for example,
<C:\Includes\Library2.inc:789>
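A minimal stackframe_re might look like this (a sketch; the group names mirror ERROR_RE from the question):

```python
import re

# matches one call-trace frame such as <C:\Includes\Library2.inc:789>
stackframe_re = re.compile(r'^\s*<(?P<path>.*):(?P<line>[0-9]+)>$')

m = stackframe_re.match(r'<C:\Includes\Library2.inc:789>')
print(m.group('path'))  # C:\Includes\Library2.inc
print(m.group('line'))  # 789

# note: this also matches <Line:1>, with path 'Line';
# the path.endswith(mainName) check filters such frames out
```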
I don't really understand what you mean when you say "loop the remaining elements within a loop". A loop continues to the remaining elements by default.

Conversion of Multiple Strings To ASCII

This seems fairly trivial but I can't seem to work it out
I have a text file with the contents:
B>F
I am reading this with the code below, stripping the '>' and trying to convert the strings into their corresponding ASCII value, minus 65 to give me a value that will correspond to another list index
def readRoute():
    routeFile = open('route.txt', 'r')
    for line in routeFile.readlines():
        route = line.strip('\n' '\r')
        route = line.split('>')
        #startNode, endNode = route
        startNode = ord(route[0])-65
        endNode = ord(route[1])-65
        # Debug (this comment is just to explain the print values below)
        print 'Route Entered:'
        print line
        print startNode, ',', endNode, '\n'
        return [startNode, endNode]
However, I'm having trouble doing the conversion nicely, because the text file only contains one line at the moment; ideally I need it to support more than one line and run a block of code for each line.
For example it could contain:
B>F
A>D
C>F
E>D
So I would want to run the same code outside this function four times with the different inputs.
Is anyone able to give me a hand?
Edit:
Not sure I made my issue that clear, sorry.
What I need is to parse the text file (possibly containing one line or multiple lines like above). I am able to do it for one line with the lines
startNode = ord(route[0])-65
endNode = ord(route[1])-65
But I get errors when trying to do more than one line, because ord() ends up receiving the wrong input.
If I have (below) in the route.txt
B>F
A>D
This is the error it gives me:
line 43, in readRoute endNode = ord(route[1])-65
TypeError: ord() expected a character, but string of length 2 found
My code above should read the route.txt file and see that B>F is the first route, strip the '>', and convert the B and F to ASCII, so 66 and 70 respectively, then subtract 65 from both to give 1 and 5 (in this example).
The 1 and 5 are indexes into another "array" (a list of lists) used for computations and other things.
Once the other code has completed, it can then go to the next line in route.txt, which could be A>D, and perform the above again.
Perhaps this will work for you. I turned the file read into a generator, so you can do as you please with the parsed results in the for loop.
def readRoute(file_name):
    with open(file_name, 'r') as r:
        for line in r:
            yield (ord(line[0])-65, ord(line[2])-65)

filename = 'route.txt'
for startnode, endnode in readRoute(filename):
    print startnode, endnode
If you can't change readRoute, change the contents of the file before each call. Better yet, make readRoute take the filename as a parameter (default it to 'route.txt' to preserve the current behavior) so you can have it process other files.
What about something like this? It takes the routes defined in your file and turns them into path objects with start and end member variables. As an added bonus PathManager.readFile() allows you to load multiple route files without overwriting the existing paths.
import re

class Path:
    def __init__(self, start, end):
        self.start = ord(start) - 65  # Scale the values as desired
        self.end = ord(end) - 65      # Scale the values as desired

class PathManager:
    def __init__(self):
        self.expr = re.compile("^([A-Za-z])[>]([A-Za-z])$")  # looks for string "C>C"
                                                             # where C is a char
        self.paths = []

    def do_logic_routine(self, start, end):
        # Do custom logic here that will execute before the next line is read
        # Return True for 'continue reading' or False to stop parsing the file
        return True

    def readFile(self, path):
        file = open(path, "r")
        for line in file:
            item = self.expr.match(line.strip())  # strip whitespace before parsing
            if item:
                '''
                item.group(0) is *not* used here; it matches the whole expression
                item.group(1) matches the first parenthesis in the regular expression
                item.group(2) matches the second
                '''
                self.paths.append(Path(item.group(1), item.group(2)))
                if not self.do_logic_routine(self.paths[-1].start, self.paths[-1].end):
                    break

# Running the example
MyManager = PathManager()
MyManager.readFile('route.txt')
for path in MyManager.paths:
    print "Start: %s End: %s" % (path.start, path.end)
Output is:
Start: 1 End: 5
Start: 0 End: 3
Start: 2 End: 5
Start: 4 End: 3

Python: Read large file in chunks

Hey there, I have a rather large file that I want to process using Python and I'm kind of stuck as to how to do it.
The format of my file is like this:
0 xxx xxxx xxxxx xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
1 xxx xxxx xxxxx xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
So I basically want to read in the chunk from 0 to 1, do my processing on it, then move on to the chunk between 1 and 2.
So far I've tried using a regex to match the number and then keep iterating, but I'm sure there has to be a better way of going about this. Any suggestion/info would be greatly appreciated.
If they are all within the same line, that is, there are no line breaks between "1." and "2.", then you can iterate over the lines of the file like this:
for line in open("myfile.txt"):
    # do stuff
The line will be disposed of and overwritten at each iteration, meaning you can handle large file sizes with ease. If they're not on the same line:
for line in open("myfile.txt"):
    if ...:  # regex to match start of new string
        parsed_line = line
    else:
        parsed_line += line
and the rest of your code.
Why don't you just read the file char by char using file.read(1)?
Then, in each iteration, you could check whether you have arrived at the char 1. Then you have to make sure that storing the string is fast.
If the "N " can only start a line, then why not use the "simple" solution? (It sounds like this is already being done; I am trying to reinforce/support it ;-))
That is, just read a line at a time and build up the data representing the current N object. After, say, N=0 and N=1 are loaded, process them together, then move on to the next pair (N=2, N=3). The only thing that is even remotely tricky is making sure not to throw out a read line. (The line whose read determined the end condition -- e.g. "N " -- also contains the data for the next N.)
Unless seeking is required (or IO caching is disabled, or there is an absurd amount of data per item), there is really no reason not to use readline, AFAIK.
Happy coding.
Here is some off-the-cuff code, which likely contains multiple errors. In any case, it shows the general idea using a minimized side-effect approach.
# given an input and previous item data, return either
# [item_number, data, next_overflow] if another item is read
# or None if there are no more items
def read_item(inp, overflow):
    data = overflow or ""
    # this can be replaced with any method to "read the header";
    # the regex is just "the easiest". the contract is:
    # given "N ....", return N. given anything else, return None
    def get_num(d):
        m = re.match(r"(\d+) ", d)
        return int(m.group(1)) if m else None
    for line in inp:
        if data and get_num(line) is not None:
            # already in an item (have data); current line "overflows".
            # item number is still at start of current data
            return [get_num(data), data, line]
        # not in item, or new item not found yet
        data += line
    # at end of input, with data. only returns above
    # if a "new" item was encountered; this covers the case of
    # no more items (or no items at all)
    if data:
        return [get_num(data), data, None]
    else:
        return None
And usage might be akin to the following, where f represents an open file:
# check for error conditions (e.g. None returned)
# note feed-through of "overflow"
num1, data1, overflow = read_item(f, None)
num2, data2, overflow = read_item(f, overflow)
If the format is fixed, why not just read 3 lines at a time with readline()?
If the file is small, you could read the whole file in and split() on the number digits (you might want to use strip() to get rid of whitespace and newlines), then fold over the list to process each string in it. You'll probably have to check that the resulting string is not empty, in case two digits were next to each other.
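The split idea can be sketched with re.split, keeping each number attached to its chunk (the "N " at start-of-line record format is assumed from the question):

```python
import re

text = ("0 xxx xxxx first record data\n"
        "continuation of record 0\n"
        "1 yyy yyyy second record data\n"
        "continuation of record 1\n")

# split before every line that starts with "N "; the zero-width
# lookahead keeps the number attached to its chunk
chunks = [c for c in re.split(r'(?m)^(?=\d+ )', text) if c]
print(chunks[0])
print(chunks[1])
```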
If the file's content can be loaded into memory, and that's what you answered, then the following code (which needs filename defined) may be a solution.
import re

regx = re.compile(r'^((\d+).*?)(?=^\d|\Z)', re.DOTALL | re.MULTILINE)

with open(filename) as f:
    text = f.read()

def treat(inp, regx=regx):
    m1 = regx.search(inp)
    numb, chunk = m1.group(2, 1)
    li = [chunk]
    for mat in regx.finditer(inp, m1.end()):
        n, ch = mat.group(2, 1)
        if int(n) == int(numb) + 1:
            yield ''.join(li)
            numb = n
            li = []
        li.append(ch)
        chunk = ch
    yield ''.join(li)

for y in treat(text):
    print repr(y)
This code, run on a file containing :
1 mountain
orange 2
apple
produce
2 gas
solemn
enlightment
protectorate
3 grimace
song
4 snow
wheat
51 guludururu
kelemekinonoto
52asabi dabada
5 yellow
6 pink
music
air
7 guitar
blank 8
8 Canada
9 Rimini
produces:
'1 mountain\norange 2\napple\nproduce\n'
'2 gas\nsolemn\nenlightment\nprotectorate\n'
'3 grimace\nsong\n'
'4 snow\nwheat\n51 guludururu\nkelemekinonoto\n52asabi dabada\n'
'5 yellow\n'
'6 pink \nmusic\nair\n'
'7 guitar\nblank 8\n'
'8 Canada\n'
'9 Rimini'
