Python string manipulation -- performance problems

I have the following piece of code that I execute around 2 million times in my application to parse that many records. This part seems to be the bottleneck and I was wondering if anyone could help me by suggesting some nifty tricks that could make these simple string manipulations faster.
try:
    data = []
    start = 0
    end = 0
    for info in self.Columns():
        end = start + (info.columnLength)
        slice = line[start:end]
        if slice == '' or len(slice) != info.columnLength:
            raise 'Wrong Input'
        if info.hasSignage:
            if(slice[0:1].strip() != '+' and slice[0:1].strip() != '-'):
                raise 'Wrong Input'
        if not info.skipColumn:
            data.append(slice)
        start = end
    parsedLine = data
except:
    parsedLine = False

def fubarise(data):
    try:
        if nasty(data):
            raise ValueError("Look, Ma, I'm doing a big fat GOTO ...") # sheesh #1
        more_of_the_same()
        parsed_line = data
    except ValueError:
        parsed_line = False
        # so it can be a "data" or False -- sheesh #2
    return parsed_line
There is no point in having different error messages in the raise statement; they are never seen. Sheesh #3.
Update: Here is a suggested improvement which uses struct.unpack to partition input lines rapidly. It also illustrates better exception handling, under the assumption that the writer of the code is also running it and stopping on the first error is acceptable. A robust implementation which logs all errors in all columns of all lines for a user audience is another matter. Note that typically the error checking for each column would be much more extensive; checking for a leading sign but not checking whether the column contains a valid number seems a little odd.
import struct

def unpacked_records(self):
    cols = self.Columns()
    unpack_fmt = ""
    sign_checks = []
    start = 0
    for colx, info in enumerate(cols, 1):
        clen = info.columnLength
        if clen < 1:
            raise ValueError("Column %d: Bad columnLength %r" % (colx, clen))
        if info.skipColumn:
            unpack_fmt += str(clen) + "x"
        else:
            unpack_fmt += str(clen) + "s"
        if info.hasSignage:
            sign_checks.append(start)
        start += clen
    expected_len = start
    unpack = struct.Struct(unpack_fmt).unpack
    for linex, line in enumerate(self.whatever_the_list_of_lines_is, 1):
        if len(line) != expected_len:
            raise ValueError(
                "Line %d: Actual length %d, expected %d"
                % (linex, len(line), expected_len))
        if not all(line[i] in '+-' for i in sign_checks):
            raise ValueError("Line %d: At least one column fails sign check" % linex)
        yield unpack(line) # a tuple
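For readers unfamiliar with struct format strings: "Ns" keeps N bytes and "Nx" skips N pad bytes, so skipped columns never produce a slice at all. A tiny illustration (the format and data here are invented, not from the question):

import struct

# "3s" keeps 3 bytes, "2x" skips 2 pad bytes, "4s" keeps 4 bytes
rec = struct.Struct("3s2x4s")
print(rec.unpack(b"+12XX-3.4"))   # ('+12', '-3.4') -- bytes objects on Python 3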

what about (using some classes to have an executable example):
class Info(object):
    columnLength = 5
    hasSignage = True
    skipColumn = False

class Something(object):
    def Columns(self):
        return [Info()]*4
    def bottleneck(self):
        try:
            data = []
            start = 0
            end = 0
            line = '+this-is just a line for testing'
            for info in self.Columns():
                start = end
                collength = info.columnLength
                end = start + collength
                if info.skipColumn: # start with this
                    continue
                elif collength == 0:
                    raise ValueError('Wrong Input')
                # only now slicing, because it is probably the most expensive part
                slice = line[start:end]
                if len(slice) != collength:
                    raise ValueError('Wrong Input')
                elif info.hasSignage and slice[0] not in '+-': # bit more compact
                    raise ValueError('Wrong Input')
                else:
                    data.append(slice)
            parsedLine = data
        except:
            parsedLine = False

Something().bottleneck()
edit:
when the length of slice is 0, slice[0] does not exist, so collength == 0 has to be checked first
edit2:
You are using this bit of code for many, many lines, but the column info does not change, right? That allows you to:
pre-calculate a list of start points for each column (no more need to calculate start and end inside the loop);
have .Columns() return only the columns that are not skipped and have a columnLength > 0 (or do you really need to raise an error for length == 0 on every line?);
check the mandatory length of each line, which is known in advance and the same for every line, once before looping over the column infos (a sketch follows).
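Here is a minimal sketch of those three points combined (assuming the same hypothetical Info objects as above, with cols computed once via self.Columns()):

cols = self.Columns()              # computed once, outside the per-line loop
bounds = []
start = 0
for info in cols:                  # walk all columns so the offsets stay correct
    end = start + info.columnLength
    if not info.skipColumn:
        bounds.append((start, end, info.hasSignage))
    start = end
expected_len = start               # identical for every line

def parse(line):
    if len(line) != expected_len:  # one length check replaces all per-slice checks
        return False
    if any(sign and line[s] not in '+-' for s, e, sign in bounds):
        return False
    return [line[s:e] for s, e, sign in bounds]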
edit3:
I wonder how you will know which index in data belongs to which column if you use skipColumn...

EDIT: I'm changing this answer a bit. I'll leave the original answer below.
In my other answer I commented that the best thing would be to find a built-in Python module that would do the unpacking. I couldn't think of one, but perhaps I should have searched Google for one. @John Machin provided an answer that showed how to do it: use the Python struct module. Since that is written in C, it should be faster than my pure Python solution. (I haven't actually measured anything, so that is a guess.)
I do agree that the logic in the original code is "un-Pythonic". Returning a sentinel value isn't best; it's better to either return a valid value or raise an exception. Another way to do it is to return a list of valid values, plus another list of invalid values. Since @John Machin offered code to yield up valid values, I thought I'd write a version here that returns two lists.
NOTE: Perhaps the best possible answer would be to take @John Machin's answer and modify it to save the invalid values to a file for possible later review. His answer yields up answers one at a time, so there is no need to build a large list of parsed records; and saving the bad lines to disk means there is no need to build a possibly-large list of bad lines.
import struct

def parse_records(self):
    """
    returns a tuple: (good, bad)
    good is a list of valid records (as tuples)
    bad is a list of tuples: (line_num, line, err)
    """
    cols = self.Columns()
    unpack_fmt = ""
    sign_checks = []
    start = 0
    for colx, info in enumerate(cols, 1):
        clen = info.columnLength
        if clen < 1:
            raise ValueError("Column %d: Bad columnLength %r" % (colx, clen))
        if info.skipColumn:
            unpack_fmt += str(clen) + "x"
        else:
            unpack_fmt += str(clen) + "s"
        if info.hasSignage:
            sign_checks.append(start)
        start += clen
    expected_len = start
    unpack = struct.Struct(unpack_fmt).unpack
    good = []
    bad = []
    for line_num, line in enumerate(self.whatever_the_list_of_lines_is, 1):
        if len(line) != expected_len:
            bad.append((line_num, line, "bad length"))
            continue
        if not all(line[i] in '+-' for i in sign_checks):
            bad.append((line_num, line, "sign check failed"))
            continue
        good.append(unpack(line))
    return good, bad
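A hedged usage sketch; obj, process, and log_file are invented stand-ins, and the record source is the placeholder used in the code above:

good, bad = obj.parse_records()
for record in good:
    process(record)               # each record is a tuple of column strings
for line_num, line, err in bad:
    log_file.write("line %d: %s: %r\n" % (line_num, err, line))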
ORIGINAL ANSWER TEXT:
This answer should be a lot faster if the self.Columns() information is identical over all the records. We do the processing of the self.Columns() information one time, and build a couple of lists that contain just what we need to process a record.
This code shows how to compute parsedLine but doesn't actually yield it up or return it or do anything with it. Obviously you would need to change that.
def parse_records(self):
    cols = self.Columns()
    slices = []
    sign_checks = []
    start = 0
    for info in cols:
        if info.columnLength < 1:
            raise ValueError, "bad columnLength"
        end = start + info.columnLength
        if not info.skipColumn:
            tup = (start, end)
            slices.append(tup)
        if info.hasSignage:
            sign_checks.append(start)
        start = end
    expected_len = end # or use (end - 1) to not count a newline
    try:
        for line in self.whatever_the_list_of_lines_is:
            if len(line) != expected_len:
                raise ValueError, "wrong length"
            if not all(line[i] in '+-' for i in sign_checks):
                raise ValueError, "wrong input"
            parsedLine = [line[s:e] for s, e in slices]
    except ValueError:
        parsedLine = False

Don't compute start and end every time through this loop.
Compute them exactly once, prior to the loop, from self.Columns(). (Whatever that is. If Columns is a class with static values, that's silly. If it's a function with a name that begins with a capital letter, that's confusing.)
if slice == '' or len(slice) != info.columnLength can only happen if line is too short compared to the total size required by Columns. Check once, outside the loop.
slice[0:1].strip() != '+' sure looks like .startswith() (see the sketch after this list).
if not info.skipColumn: apply this filter before even starting the loop, by removing those columns from self.Columns().
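For example, the two sign comparisons collapse into one call, since .startswith() accepts a tuple of alternative prefixes:

slice = '+1234'
if not slice.startswith(('+', '-')):
    raise ValueError('Wrong Input')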

First thing I would consider is slice = line[start:end]. Slicing creates new string objects; you could try to avoid explicitly constructing line[start:end] and examine its contents manually.
Why are you doing slice[0:1]? That yields a subsequence containing a single item of slice, so it can probably be checked more efficiently, e.g. by indexing.
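You can check that yourself with a quick timeit sketch (exact numbers vary by machine, but the single index plus membership test avoids building and stripping intermediate strings):

import timeit

setup = "s = '+1234'"
# original: two slices, two strips, two comparisons
print(timeit.timeit("s[0:1].strip() != '+' and s[0:1].strip() != '-'", setup))
# single index plus membership test
print(timeit.timeit("s[0] not in '+-'", setup))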

I want to tell you to use some sort of built-in Python feature to split the string, but I can't think of one. So I'm left with just trying to reduce the amount of code you have.
When we are done, end should be pointing at the end of the string; if this is the case, then all of the .columnLength values must have been okay. (Unless one was negative or something!)
Since this has a reference to self, it must be a snippet from a member function. So, instead of raising exceptions, you could just return False to exit the function early as an error flag. But I like the debugging potential of changing the except clause to not catch the exception anymore, and getting a stack trace letting you identify where the problem came from.
@Remi used slice[0] in '+-' where I used slice.startswith(('+', '-')). I think I like @Remi's code better there, but I left mine unchanged just to show you a different way. The .startswith() way will work for prefixes longer than length 1, but since this is only a string of length 1 the terse solution works.
try:
    line = line.strip('\n')
    data = []
    start = 0
    for info in self.Columns():
        end = start + info.columnLength
        slice = line[start:end]
        if info.hasSignage and not slice.startswith(('+', '-')):
            raise ValueError, "wrong input"
        if not info.skipColumn:
            data.append(slice)
        start = end
    if end != len(line): # the newline was stripped, so the columns must cover the whole line
        raise ValueError, "bad .columnLength"
    parsedLine = data
except ValueError:
    parsedLine = False

Related

More stable/efficient cartesian product function to determine all combinations n in length

I am working on an error correction method for DNA data storage. It works by identifying errors and then rapidly substituting all possible combinations of bases into error positions until the data decodes. My code correctly identifies errors and substitutes them using an itertools.product loop with one iterable repeated until n in length, n being the number of errors.
for error in itertools.product(' ATGC', repeat=len(errorPos)):
    pos = 0
    for base in error:
        if base == " ":
            fseqL[errorPos[pos]] = ""
        else:
            fseqL[errorPos[pos]] = base
        pos = pos + 1
    ffseq = ""
    ffseq = ffseq.join(fseqL)
    try:
        Decoder.dna2binary(ffseq)
    except:
        pLeft = pLeft - 1
        print("Incorrect decode: %s possibilities left" % (pLeft))
    else:
        print("Data decoded")
        now = datetime.now()
        start = open('data\\log.txt', 'r+')
        end = start.read()
        start.write("\nend: " + str(now))
        start.close()
        break
This program works well with small numbers of errors, but processing time increases exponentially as the error count grows. The speed isn't a problem, but after about 10-15 errors it becomes unstable and locks up. Is there a better way to find this combination (a NumPy function, a different method, etc.)?

Reading data from a text file in Python according to the parameters provided

I have a text file something like this
Mqtt_allowed=true
Mqtt_host=192.168.0.1
Mqtt_port=2223
<=============>
cloud_allowed=true
cloud_host=m12.abc.com
cloud_port=1232
<=============>
local_storage=true
local_path=abcd
I need to get each value for the parameter provided by the user.
What I am doing right now is:
def search(param):
    try:
        with open('config.txt') as configuration:
            for line in configuration:
                if not line:
                    continue
                function, f_input = line.split("=")
                if function == param:
                    result = f_input.split()
                    break
            else:
                result = "0"
    except FileNotFoundError:
        print("File not found: ")
    return result

mqttIsAllowed = search("Mqtt_allowed")
print(mqttIsAllowed)
Now when I call any of the Mqtt values it works fine, but when I call cloud or anything after the <=============> separator it throws an error. Thanks
Just skip all the lines starting with <:
if not line or line.lstrip().startswith("<"):
    continue
Or, if you really, really want to match the separator exactly:
if line.strip() == "<=============>":
    continue
I think the first variant is better because if someone slightly modified the separator by accident, the second piece of code won't work at all.
Because you are trying to split on the = character in a style that seems to be standard INI format, it is safe to assume that your pairs will be at max size 2. I'm not a fan of using methods that rely on character checking (unless specifically called for), so give this a whirl:
def search(param):
    result = '0' # declare here
    try:
        with open('config.txt') as configuration:
            for line in configuration:
                if not line:
                    continue
                f_pair = line.strip().split("=") # remove \r\n, \n
                if len(f_pair) > 2: # your separator will be much longer
                    continue
                elif f_pair[0] == param:
                    result = f_pair[1]
                    # result = f_input.split() # why the 'split()' here?
                    break
    except FileNotFoundError:
        print("File not found: ")
    return result

mqttIsAllowed = search("Mqtt_allowed")
I'm pretty sure the error you were getting was a ValueError: too many values to unpack.
Here is how I know that:
When you call this function for any of the Mqtt_* values, the loop never encounters the separator string <=============>. As soon as you try to call anything below that first separator (for example a cloud_* key), the loop eventually reaches the first separator and tries to execute:
function, f_input = line.split('=')
But that won't work; in fact, it will tell you:
ValueError: too many values to unpack (expected 2)
And that is because you are forcing the split() call to push into only 2 variables, but a split('=') on your separator string will return a list of 14 elements (a '<', a '>' and 12 ''). Thus, doing what I have posted above ensures that your split('=') still goes off, but checks whether you hit a separator or not.
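You can verify that in the interpreter:

>>> "<=============>".split("=")
['<', '', '', '', '', '', '', '', '', '', '', '', '', '>']
>>> len("<=============>".split("="))
14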

Python - Perform file check based on format of 3 values then perform tasks

All,
I am trying to write a python script that will go through a crime file and separate the file based on the following items: UPDATES, INCIDENTS, and ARRESTS. The reports that I generally receive either show these sections as I have previously listed or as **UPDATES**, **INCIDENTS**, or **ARRESTS**. I have already started to write the following script to separate the files based on the format with the **. However, I was wondering if there is a better way to check the files for both formats at the same time? Also, sometimes there is no UPDATES or ARRESTS section, which causes my code to break. Is there a check I can do for this case, and if so, how can I still get the INCIDENTS section without the other two?
with open('CrimeReport20150518.txt', 'r') as f:
    content = f.read()
print content.index('**UPDATES**')
print content.index('**INCIDENTS**')
print content.index('**ARRESTS**')
updatesLine = content.index('**UPDATES**')
incidentsLine = content.index('**INCIDENTS**')
arrestsLine = content.index('**ARRESTS**')
#print content[updatesLine:incidentsLine]
updates = content[updatesLine:incidentsLine]
#print updates
incidents = content[incidentsLine:arrestsLine]
#print incidents
arrests = content[arrestsLine:]
print arrests
You are currently using .index() to locate the headings in the text. The documentation states:
Like find(), but raise ValueError when the substring is not found.
That means that you need to catch the exception in order to handle it. For example:
try:
    updatesLine = content.index('**UPDATES**')
    print "Found updates heading at", updatesLine
except ValueError:
    print "Note: no updates"
    updatesLine = -1
From here you can determine the correct indexes for slicing the string based on which sections are present.
Alternatively, you could use the .find() method referenced in the documentation for .index().
Return -1 if sub is not found.
Using find you can just test the value it returned.
updatesLine = content.find('**UPDATES**')
# the following is straightforward, but unwieldy
if updatesLine != -1:
    if incidentsLine != -1:
        updates = content[updatesLine:incidentsLine]
    elif arrestsLine != -1:
        updates = content[updatesLine:arrestsLine]
    else:
        updates = content[updatesLine:]
Either way, you'll have to deal with all combinations of which sections are and are not present to determine the correct slice boundaries.
I would prefer to approach this using a state machine. Read the file line by line and add the line to the appropriate list. When a header is found then update the state. Here is an untested demonstration of the principle:
data = {
    'updates': [],
    'incidents': [],
    'arrests': [],
}
state = None
with open('CrimeReport20150518.txt', 'r') as f:
    for line in f:
        # lines read from a file keep their trailing newline, so compare stripped copies
        if line.strip() == '**UPDATES**':
            state = 'updates'
        elif line.strip() == '**INCIDENTS**':
            state = 'incidents'
        elif line.strip() == '**ARRESTS**':
            state = 'arrests'
        else:
            if state is None:
                print "Warn: no header seen; skipping line"
            else:
                data[state].append(line)
print ''.join(data['arrests'])
Try using content.find() instead of content.index(). Instead of breaking when the string isn't there, it returns -1. Then you can do something like this:
updatesLine = content.find('**UPDATES**')
incidentsLine = content.find('**INCIDENTS**')
arrestsLine = content.find('**ARRESTS**')
if incidentsLine != -1 and arrestsLine != -1:
    # Do what you normally do
    updates = content[updatesLine:incidentsLine]
    incidents = content[incidentsLine:arrestsLine]
    arrests = content[arrestsLine:]
elif incidentsLine != -1:
    # Do whatever you need to do to files that don't have an arrests section here
    pass
elif arrestsLine != -1:
    # Handle files that don't have an incidents section here
    pass
else:
    # Handle files that are missing both
    pass
Your solution generally looks OK to me as long as the sections always come in the same order and the files don't get too big. You can get real feedback at stack exchange's code review https://codereview.stackexchange.com/

Loop thru a list and stop when the first string is found

I have a list and I want to extract to another list the data that exists between top_row and bottom_row.
I know the top_row, and the bottom_row corresponds to the last row whose data[0] is an integer (the next row is made of strings, but there are also later rows with integers which I'm not interested in).
I've tried several things, but without success:
for row, data in enumerate(fileData):
    if row > row_elements: #top_row
        try:
            n = int(data[0])
            aux = True
        except:
            n = 0
        while aux: #until it finds the bottom_row
            elements.append(data)
The problem is that it never iterates to the second row; if I replace while with if I get all rows whose first column is an integer.
fileData is like:
*Element, type=B31H
1, 1, 2
2, 2, 3
.
.
.
359, 374, 375
360, 375, 376
*Elset, elset=PART-1-1_LEDGER-1-LIN-1-2-RAD-2__PICKEDSET2, generate
I'm only interested in rows with first column values equal to 1 to 360.
Many thanks!
The code you've posted is confusing. For example, "aux" is a poorly-named variable. And the loop really wants to start with a specific element of the input, but it loops over everything until it finds the iteration it wants, turning what might be a constant-time operation into a linear one. Let's try rewriting it:
for record in fileData[row_elements:]: # skip first row_elements (might need +1?)
    try:
        int(record[0])
    except ValueError:
        break # found bottom_row, stop iterating
    elements.append(record)
If no exception is thrown in the try part, then you basically end up with an endless loop, given that aux will always be True.
I'm not perfectly sure what you are doing in your code, since the shape of the data isn't clear and some things (like n?) are never used, but in general you can stop a running loop (both for and while loops) with the break statement:
for row, data in enumerate(fileData):
    if conditionToAbortTheLoop:
        break
So in your case, I would guess something like this would work:
for row, data in enumerate(fileData):
    if row > row_elements: # below `top_row`
        try:
            int(data[0])
        except ValueError:
            break # not an int value, `bottom_row` found
        # if we get here, we're between the top and bottom row.
        elements.append(data)
Will this work?
for row, data in enumerate(fileData):
    if row > row_elements: #top_row
        try:
            n = int(data[0])
            elements.append(data)
        except ValueError:
            continue
Or what about:
elements = [int(data[0]) for data in fileData if data[0].isdigit()]
By the way, if you care to follow the convention of most python code, you can rename fileData to file_data.
Use a generator:
def isInteger(testInput):
    try:
        int(testInput)
        return True
    except ValueError:
        return False

def integersOnly(fileData):
    element = next(fileData)
    while isInteger(element):
        yield element
        element = next(fileData)
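A quick usage sketch (the sample data here is made up; fileData must be an iterator, not a list):

fileData = iter(['1', '2', '360', '*Elset, generate', '7'])
print(list(integersOnly(fileData)))   # ['1', '2', '360'] -- stops at the first non-integer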

Python, I need the following code to finish quicker

I need the following code to finish quicker without threads or multiprocessing. If anyone knows of any tricks, that would be greatly appreciated. Maybe for i in enumerate(), or changing the list to a string before calculating; I'm not sure.
For the example below, I have attempted to recreate the variables using a random sequence, however this has rendered some of the conditions inside the loop useless ... which is ok for this example, it just means the 'true' application for the code will take slightly longer.
Currently on my i7, the example below (which will mostly bypass some of its conditions) completes in 1 second, I would like to get this down as much as possible.
import random
import time
import collections
import cProfile

def random_string(length=7):
    """Return a random string of given length"""
    return "".join([chr(random.randint(65, 90)) for i in range(length)])

LIST_LEN = 18400
original = [[random_string() for i in range(LIST_LEN)] for j in range(6)]
LIST_LEN = 5
SufxList = [random_string() for i in range(LIST_LEN)]
LIST_LEN = 28
TerminateHook = [random_string() for i in range(LIST_LEN)]
#^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Exclude above from benchmark
#^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Exclude above from benchmark
ListVar = original[:]
for b in range(len(ListVar)):
    for c in range(len(ListVar[b])):
        #If its an int ... remove
        try:
            int(ListVar[b][c].replace(' ', ''))
            ListVar[b][c] = ''
        except: pass
        #if any second sufxList delete
        for d in range(len(SufxList)):
            if ListVar[b][c].find(SufxList[d]) != -1: ListVar[b][c] = ''
        for d in range(len(TerminateHook)):
            if ListVar[b][c].find(TerminateHook[d]) != -1: ListVar[b][c] = ''
    #remove all '' from list
    while '' in ListVar[b]: ListVar[b].remove('')
    print(ListVar[b])
ListVar = original[:]
That makes a shallow copy of original, so your changes to the second-level lists are going to affect the original also. Are you sure that is what you want? Much better would be to build the new modified list from scratch.
for b in range(len(ListVar)):
    for c in range(len(ListVar[b])):
Yuck: whenever possible iterate directly over lists.
        #If its an int ... remove
        try:
            int(ListVar[b][c].replace(' ', ''))
            ListVar[b][c] = ''
        except: pass
You want to ignore spaces in the middle of numbers? That doesn't sound right. If the numbers can be negative you may want to use the try..except, but if they are only positive just use .isdigit().
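That is, .isdigit() rejects a leading sign:

>>> '123'.isdigit()
True
>>> '-123'.isdigit()
False
>>> int('-123')   # the try/except around int() handles the sign
-123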
        #if any second sufxList delete
        for d in range(len(SufxList)):
            if ListVar[b][c].find(SufxList[d]) != -1: ListVar[b][c] = ''
Is that just bad naming? SufxList implies you are looking for suffixes; if so, just use .endswith() (and note that you can pass a tuple in to avoid the loop). If you really do want to find whether the suffix is anywhere in the string, use the in operator.
        for d in range(len(TerminateHook)):
            if ListVar[b][c].find(TerminateHook[d]) != -1: ListVar[b][c] = ''
Again use the in operator. Also any() is useful here.
#remove all '' from list
while '' in ListVar[b]: ListVar[b].remove('')
and that while is O(n^2) i.e. it will be slow. You could use a list comprehension instead to strip out the blanks, but better just to build clean lists to begin with.
print(ListVar[b])
I think maybe your indentation was wrong on that print.
Putting these suggestions together gives something like:
suffixes = tuple(SufxList)
newListVar = []
for row in original:
    newRow = []
    newListVar.append(newRow)
    for value in row:
        if (not value.isdigit() and
                not value.endswith(suffixes) and
                not any(th in value for th in TerminateHook)):
            newRow.append(value)
    print(newRow)
