Fastest way of processing regexp - python

I have a script in python to process a log file - it parses the values and joins them simply with a tab.
p = re.compile(
"([0-9/]+) ([0-9]+):([0-9]+):([0-9]+) I.*"+
"worker\\(([0-9]+)\\)(?:#([^]]*))?.*\\[([0-9]+)\\] "+
"=RES= PS:([0-9]+) DW:([0-9]+) RT:([0-9]+) PRT:([0-9]+) IP:([^ ]*) "+
"JOB:([^!]+)!([0-9]+) CS:([\\.0-9]+) CONV:([^ ]*) URL:[^ ]+ KEY:([^/]+)([^ ]*)"
)
for line in sys.stdin:
    line = line.strip()
    if len(line) == 0: continue
    result = p.match(line)
    if result != None:
        print "\t".join([x if x is not None else "." for x in result.groups()])
However, the script is quite slow and takes a long time to process the data.
How can I achieve the same behaviour in a faster way? Perl/SED/PHP/Bash/...?
Thanks

It is hard to know without seeing your input, but it looks like your log file is made up of fields that are separated by spaces and do not contain any spaces internally. If so, you could split on whitespace first to put the individual log fields into an array. i.e.
line.split() #Split based on whitespace
or
line.split(' ') #Split based on a single space character
After that, use a few small regexes or even simple string operations to extract the data from the fields that you want.
It would likely be much more efficient, because the bulk of the line processing is done with a simple rule. You wouldn't have the pitfalls of potential backtracking, and you would have more readable code that is less likely to contain mistakes.
I don't know Python, so I can't write out a full code example, but that is the approach I would take in Perl.
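A minimal Python sketch of the split-first idea follows. It assumes, as the answer does, that the fields are separated by whitespace and contain none internally, and that the interesting fields look like NAME:value (as the PS:, DW:, RT: parts of the regex suggest). The function name parse_fields and the sample line are made up for illustration; they are not taken from the real log.

```python
def parse_fields(line):
    """Split a log line on whitespace and collect NAME:value tokens.

    Hypothetical helper: assumes fields contain no internal spaces.
    """
    fields = {}
    for token in line.split():
        # partition() splits at the first ':' only, so values such as
        # URLs that contain further colons stay intact.
        key, sep, value = token.partition(":")
        if sep and key.isupper():
            fields[key] = value
    return fields


# Made-up sample line, only loosely modelled on the regex in the question.
sample = "2012/01/01 12:34:56 =RES= PS:10 DW:20 RT:30 IP:1.2.3.4"
print(parse_fields(sample))
```

Tokens such as the time stamp are skipped because their part before the colon is not an upper-case name; anything that needs finer parsing can then be handled with a small regex on that one field.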

I write Perl, not Python, but I recently used this technique to parse very big logs:
Divide the input file into chunks (for example, FileLen/NumProcessors bytes each).
Adjust the start and end of every chunk to a \n so that each worker gets only complete lines.
fork() to create NumProcessors workers, each of which reads its own byte range from the file and writes its own output file.
Merge the output files if needed.
Of course, you should also work on optimizing the regexp; for example, use .* less, because it causes a lot of backtracking, which is slow. But in any case the bottleneck is almost certainly the CPU time spent in this regexp, so spreading the work over 8 CPUs should help.
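The chunking steps above can be sketched in Python roughly as follows. This is only an illustration of the byte-range idea, not the Perl code the answer refers to; the function names chunk_ranges and lines_in_range are made up here.

```python
import os


def chunk_ranges(path, num_chunks):
    """Return (start, end) byte ranges that each begin at the start of a line."""
    size = os.path.getsize(path)
    boundaries = [0]
    with open(path, "rb") as f:
        for i in range(1, num_chunks):
            # Jump to the rough split point, then move on to the next line start.
            f.seek(size * i // num_chunks)
            f.readline()
            pos = f.tell()
            if boundaries[-1] < pos < size:
                boundaries.append(pos)
    boundaries.append(size)
    return list(zip(boundaries[:-1], boundaries[1:]))


def lines_in_range(path, start, end):
    """Yield the complete lines (as bytes) stored between start and end."""
    with open(path, "rb") as f:
        f.seek(start)
        while f.tell() < end:
            yield f.readline()
```

Each (start, end) pair could then be handed to a separate worker process, for example through multiprocessing.Pool, with every worker writing its own output file to be merged at the end.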

In Perl it is possible to use precompiled regexps which are much faster if you are using them many times.
http://perldoc.perl.org/perlretut.html#Compiling-and-saving-regular-expressions
"The qr// operator showed up in perl 5.005. It compiles a regular expression, but doesn't apply it."
If the data is large, it is worth processing it in parallel by splitting it into pieces. There are several modules on CPAN that make this easier.

Related

fastest way to find one of several substrings in string

I'm doing a lot of file processing where I look for one of several substrings in each line. So I have code equivalent to this:
with open(file) as infile:
    for line in infile:
        for key in MY_SUBSTRINGS:
            if key in line:
                print(key, line)
MY_SUBSTRINGS is a list of 6-20 substrings. Substrings vary in length 10-30 chars and may contain spaces.
I'd really like to find a much faster way of doing this. Files have many 100k lines in them. Lines are typically 150 chars. The user has to wait 30 s to a minute while a file is processed. The above is not the only thing taking time, but it takes quite a lot. I'm doing various other processing on a line-by-line basis, so it's not appropriate to search the whole file at once.
I've tried the regex and ahocorasick answers from here but they both come out slower in my tests:
Fastest way to check whether a string is a substring in a list of strings
Any suggestions for faster methods?
I'm not quite sure of the best way to share example datasets. A logcat off an Android phone would be an example. One that's at least 200k lines long.
Then search for 10 strings like:
(NL80211_CMD_TRIGGER_SCAN) received for
Trying to associate with
Request to deauthenticate
interface state UNINITIALIZED->ENABLED
I tried regexes like this:
match_str = "|".join(MY_SUBSTRINGS)
regex = re.compile(match_str)
with open(file) as infile:
    for line in infile:
        match = regex.search(line)
        if match:
            print(match.group(0))
I would build a regular expression to search through the file.
Make sure that you're not looping over the search terms one at a time when you use a regex.
If all of your expressions are combined into one regexp, it would look something like this:
import re
line = 'fsjdk abc def abc jkl'
re.findall(r'(abc|def)', line)
https://docs.python.org/3/library/re.html
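One thing to watch when joining literal search strings with "|": characters such as ( and ) are regex syntax, so a string like "(NL80211_CMD_TRIGGER_SCAN) received for" would be read as a group rather than as literal parentheses. A small sketch using re.escape, with the example strings from the question (the helper name find_key is made up):

```python
import re

MY_SUBSTRINGS = [
    "(NL80211_CMD_TRIGGER_SCAN) received for",
    "Trying to associate with",
    "Request to deauthenticate",
    "interface state UNINITIALIZED->ENABLED",
]

# Escape each literal so regex metacharacters match themselves,
# then compile the alternation once, outside the per-line loop.
PATTERN = re.compile("|".join(re.escape(s) for s in MY_SUBSTRINGS))


def find_key(line):
    """Return the first search string found in line, or None."""
    match = PATTERN.search(line)
    return match.group(0) if match else None
```

Whether this beats the plain "key in line" loop depends on the data, so it is worth timing both on a real file.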
If you need it to run still faster, consider processing the data concurrently with threads. This is a much broader topic, but a good first step is to look at your problem and consider what the bottleneck might be.
If the issue is that your loop is starved for disk throughput on the read, you can first run through the file, split it up into chunks, and then map those chunks to worker threads that process the data like a queue.
I would definitely need more detail on your problem to understand exactly what kind of issue you're trying to solve. And there are people here who would love to dig into a challenge.

Getting a regex trie to run faster?

I have a 50 MB regex trie that I'm using to split phrases apart.
Here is the relevant code:
import io
import re
with io.open('REGEXES.rx.txt', encoding='latin-1') as myfile:
    regex = myfile.read()
while True:
    Password = input("Enter a phrase to be split: ")
    Words = re.findall(regex, Password)
    print(Words)
Since the regex is so large, this takes forever!
Here is the code I'm trying now, with re.compile(TempRegex):
import io
import re
with io.open('REGEXES.rx.txt', encoding='latin-1') as myfile:
    TempRegex = myfile.read()
regex = re.compile(TempRegex)
while True:
    Password = input("Enter a phrase to be split: ")
    Words = re.findall(regex, Password)
    print(Words)
What I'm trying to do is check whether an entered phrase is a combination of names. For example, I want the phrase "johnsmith123" to return ['john', 'smith', '123']. The regex file was created by a tool from a word list of every first and last name from Facebook. Essentially, I want to see if an entered phrase is a combination of words from that word list ... If johns and mith are names in the list, then I would want "johnsmith123" to return ['john', 'smith', '123', 'johns', 'mith'].
I don't think that regex is the way to go here. It seems to me that all you are trying to do is to find a list of all of the substrings of a given string that happen to be names.
If the user's input is a password or passphrase, that implies a relatively short string. It's easy to break that string up into the set of possible substrings, and then test that set against another set containing the names.
The number of substrings in a string of length n is n(n+1)/2. Assuming that no one is going to enter more than say 40 characters you are only looking at 820 substrings, many of which could be eliminated as being too short. Here is some code to do that:
def substrings(s, min_length=1):
    for start in range(len(s)):
        for length in range(min_length, len(s)-start+1):
            yield s[start:start+length]
So the problem then is loading the names into a suitable data structure. Your regex is 50MB, but considering the snippet that you showed in one of your comments, the amount of actual data is going to be a lot smaller than that due to the overhead of the regex syntax.
If you just used text files with one name per line you could do this:
names = set(word.strip().lower() for word in open('names.txt'))
def substrings(s, min_length=1):
    for start in range(len(s)):
        for length in range(min_length, len(s)-start+1):
            yield s[start:start+length]
s = 'johnsmith123'
print(sorted(names.intersection(substrings(s))))
Might give output:
['jo', 'john', 'johns', 'mi', 'smith']
I doubt that there will be memory issues given the likely small data set, but if you find that there's not enough memory to load the full data set at once you could look at using sqlite3 with a simple table to store the names. This will be slower to query, but it will fit in memory.
Another way could be to use the shelve module to create a persistent dictionary with names as keys.
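For the sqlite3 variant, a rough sketch could look like the following. The table name names and the helpers build_name_db and names_in are assumptions made for illustration, and an in-memory database is used here only to keep the example self-contained; a real setup would pass a file path.

```python
import sqlite3


def substrings(s, min_length=1):
    """Yield every substring of s that is at least min_length long."""
    for start in range(len(s)):
        for length in range(min_length, len(s) - start + 1):
            yield s[start:start + length]


def build_name_db(names, db_path=":memory:"):
    """Store the names in a one-column table; the primary key gives an index."""
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS names (name TEXT PRIMARY KEY)")
    conn.executemany(
        "INSERT OR IGNORE INTO names (name) VALUES (?)",
        ((n.lower(),) for n in names),
    )
    conn.commit()
    return conn


def names_in(conn, candidates):
    """Return the sorted candidates that are present in the names table."""
    found = []
    for candidate in set(candidates):
        row = conn.execute(
            "SELECT 1 FROM names WHERE name = ?", (candidate,)
        ).fetchone()
        if row is not None:
            found.append(candidate)
    return sorted(found)


conn = build_name_db(["john", "smith", "johns", "mith"])
print(names_in(conn, substrings("johnsmith123", min_length=2)))
```

Each candidate costs one indexed lookup, so a 40-character input means at most a few hundred small queries.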
Python's regex engine does not implement regular expressions in the strict sense, since it includes features such as lookbehind, capture groups and back references, and uses backtracking to match the leftmost valid branch instead of the longest.
If you use a true regex engine, you will almost always get better results if your regex does not require those features.
One of the most important qualities of a true regular expression engine is that it always returns a result in time proportional to the length of the input, using only a fixed amount of memory while matching.
I've written one myself using a DFA implemented in C (but usable from python via cffi), which will have optimal asymptotic performance, but I haven't tried constant-factor improvements such as vectorization and assembly generation. I didn't make a generally usable API though since I only need to call it from within my library, but it shouldn't be too hard to figure out from the examples. (Note that search can be implemented as match with .* up front, then match backward, but for my purpose I would rather return a single character as an error token). Link to my project
You might also consider building the DFA offline and using it for multiple runs of your program - but this is what flex does so there was no point in me doing that for my project, so maybe just use that if you're comfortable with C? Of course you'd almost certainly have to write a fair bit of custom C code to use my project anyway ...
If you compile it, the regex pattern is compiled into bytecode that is then run by a matching engine. If you don't compile it explicitly, the pattern has to be looked up (and, when it is not cached, compiled again) on every call. That's why a compiled pattern is faster when you use the same regex for many different records.

Best way to chunk a large string by line

I have a large file (400+ MB) that I'm reading in from S3 using get_contents_as_string(), which means that I end up with the entire file in memory as a string. I'm running several other memory-intensive operations in parallel, so I need a memory-efficient way of splitting the resulting string into chunks by line number. Is split() efficient enough? Or is something like re.finditer() a better way to go?
I see three options here, from the most memory-consuming to the least:
1. split will create a copy of your file as a list of strings, meaning an additional 400 MB used. Easy to implement, takes RAM.
2. Use re or simply iterate over the string and memorize \n positions: for i, c in enumerate(s): if c == '\n': newlines.append(i+1).
3. The same as point 2, but with the string stored as a file on HDD. Slow but really memory efficient, also addressing the disadvantage of Python strings - they're immutable, and if one wants to make some changes, the interpreter will create a copy. Files don't suffer from this, allowing in-place operations without loading the whole file at all.
I would also suggest to encapsulate solutions 2 or 3 into a separate class in order to keep newline indexes and the string contents consistent. Proxy pattern and the idea of lazy evaluation would fit here, I think.
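A sketch of what such a class for option 2 might look like is shown below. The class name LineIndex and its methods are made up for illustration; it stores only the line-start offsets and slices the original string on demand, so no second copy of the whole text is created.

```python
class LineIndex:
    """Keep a string together with the offsets where its lines start."""

    def __init__(self, text):
        self._text = text
        self._starts = [0]
        pos = text.find("\n")
        while pos != -1:
            self._starts.append(pos + 1)
            pos = text.find("\n", pos + 1)
        # A trailing newline does not start another (empty) line.
        if len(self._starts) > 1 and self._starts[-1] == len(text):
            self._starts.pop()

    def __len__(self):
        return len(self._starts)

    def _end_of(self, line_number):
        """Offset just past the given 0-based line."""
        nxt = line_number + 1
        return self._starts[nxt] if nxt < len(self._starts) else len(self._text)

    def line(self, line_number):
        """Return one line (0-based), including its newline if present."""
        return self._text[self._starts[line_number]:self._end_of(line_number)]

    def chunk(self, first, last):
        """Return lines first..last-1 as one string, built with a single slice."""
        return self._text[self._starts[first]:self._end_of(last - 1)]
```

The index costs one integer per line, which is usually far less than the list of line copies that split() would produce.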
You could try reading the file line by line:
f = open(filename)
partialstring = f.readline()
See
https://docs.python.org/2/tutorial/inputoutput.html#reading-and-writing-files

How to search for string in Python by removing line breaks but return the exact line where the string was found?

I have a bunch of PDF files that I have to search against a set of keywords. I have to extract the exact line where each keyword was found. I first used xpdf's pdftotext to convert each file to text. (I tried solr but had a tough time tailoring the output/schema to suit my requirements.)
import sys
file_name = sys.argv[1]
searched_string = sys.argv[2]
result = [(line_number+1, line) for line_number, line in enumerate(open(file_name)) if searched_string.lower() in line.lower()]
#print result
for each in result:
    print each[0], each[1]
ThinkCode:~$ python find_string.py sample.txt "String Extraction"
The problem I have with this is the case where the search string is broken across the end of a line:
If you are going to index large binary files, remember to change the
size limits. String
Extraction is a common problem
If I am searching for 'String Extraction', I will miss this keyword if I use the code presented above. What is the most efficient way of achieving this without making 2 copies of text file (one for searching the keyword to extract the line (number) and the other for removing line breaks and finding the keyword to eliminate the case where the keyword spans across 2 lines).
Much appreciated guys!
Note: Some considerations without any code, but I think they belong to an answer rather than to a comment.
My idea would be to search only for the first keyword; if a match is found, search for the second. This allows you to, if the match is found at the end of the line, take into consideration the next line and do line concatenation only if a match is found in first place*.
Edit:
Coded a simple example and ended up using a different algorithm; the basic idea behind it is this code snippet:
import re

def iterwords(fh):
    for number, line in enumerate(fh):
        for word in re.split(r'\s+', line.strip()):
            yield number, word
It iterates over the file handle and produces a (line_number, word) tuple for each word in the file.
The matching afterwards becomes pretty easy; you can find my implementation as a gist on github. It can be run as follows:
python search.py 'multi word search string' file.txt
There is one main problem with the linked code, for which I didn't write a workaround, for both performance and complexity reasons. Can you figure it out? (Spoiler: try to search for a sentence whose first word appears two times in a row in the file)
* I didn't perform any testing on my own, but this article and the python wiki suggest that string concatenation is not that efficient in python (I don't know how current that information is).
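Since the gist itself is not reproduced here, the following is only a guess at how the matching step could be built on top of a (line_number, word) stream; it is not the author's implementation. It keeps a sliding window of the last few words, which also copes with a first word that is repeated. The name find_phrase is made up, and line numbers are 1-based here, unlike the snippet above.

```python
from collections import deque


def iterwords(lines):
    """Yield (line_number, lowercased_word) for every word, 1-based lines."""
    for number, line in enumerate(lines, start=1):
        for word in line.split():
            yield number, word.lower()


def find_phrase(lines, phrase):
    """Return the line numbers on which an occurrence of phrase starts."""
    target = phrase.lower().split()
    if not target:
        return []
    # The deque drops the oldest word automatically once it is full.
    window = deque(maxlen=len(target))
    hits = []
    for number, word in iterwords(lines):
        window.append((number, word))
        if [w for _, w in window] == target:
            hits.append(window[0][0])
    return hits


text = [
    "If you are going to index large binary files, remember to change the",
    "size limits. String",
    "Extraction is a common problem",
]
print(find_phrase(text, "String Extraction"))
```

Words are compared whole, so punctuation attached to a word (as in "limits.") has to appear in the search phrase too, unless it is stripped first.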
There may be a better way of doing it, but my suggestion would be to start by taking in two lines (let's call them line1 and line2), concatenating them into line3 or something similar, and then search that resultant line.
Then you'd assign line2 to line1, get a new line2, and repeat the process.
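A rough sketch of this two-line approach, assuming the search string never spans more than two lines (the function name find_with_lookback is made up):

```python
def find_with_lookback(lines, needle):
    """Return 1-based line numbers where needle starts, allowing one line break."""
    needle = needle.lower()
    hits = []
    prev = None
    for number, raw in enumerate(lines, start=1):
        cur = raw.strip().lower()
        if needle in cur:
            # Ordinary match inside a single line.
            hits.append(number)
        elif prev is not None and needle not in prev and needle in prev + " " + cur:
            # Match only appears once the previous line is joined on,
            # so it must start on the previous line.
            hits.append(number - 1)
        prev = cur
    return hits
```

Only the previous line is kept in memory, so the file is still read once and no second copy of the text is needed.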
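If you search the whole text rather than one line at a time, use \s between the words of your expression; it matches any whitespace, including newlines.
The re.MULTILINE flag (http://docs.python.org/library/re.html#re.MULTILINE) only changes how ^ and $ match, so add it if you also need to anchor at line starts or ends.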

Efficient way to do a large number of search/replaces in Python?

I'm fairly new to Python, and am writing a series of scripts to convert between some proprietary markup formats. I'm iterating line by line over files and then doing a large number (100-200) of substitutions that fall into 4 categories:
line = line.replace("-","<EMDASH>") # Replace single character with tag
line = line.replace("<\\#>","#") # tag with single character
line = line.replace("<\\n>","") # remove tag
line = line.replace("\xe1","•") # replace non-ascii character with entity
The str.replace() function seems to be pretty efficient (fairly low in the numbers when I examine profiling output), but is there a better way to do this? I've seen the re.sub() method with a function as an argument, but am unsure whether this would be better. I guess it depends on what kind of optimizations Python does internally. I thought I would ask for some advice before creating a large dict that might not be very helpful!
Additionally I do some parsing of tags (that look somewhat like HTML, but are not HTML). I identify tags like this:
m = re.findall('(<[^>]+>)',line)
And then do ~100 search/replaces (mostly removing matches) within the matched tags as well, e.g.:
m = re.findall('(<[^>]+>)',line)
for tag in m:
    tag_new = re.sub("\*t\([^\)]*\)","",tag)
    tag_new = re.sub("\*p\([^\)]*\)","",tag_new)
    # do many more searches...
    if tag != tag_new:
        line = line.replace(tag,tag_new,1) # potentially problematic
Any thoughts of efficiency here?
Thanks!
str.replace() is more efficient if you're going to do basic search and replaces, and re.sub is (obviously) more efficient if you need complex pattern matching (because otherwise you'd have to use str.replace several times).
I'd recommend you use a combination of both. If you have several patterns that all get replaced by one thing, use re.sub. If you just have some cases where you just need to replace one specific tag with another, use str.replace.
You can also improve efficiency by using larger strings (call re.sub once instead of once for each line). Increases memory use, but shouldn't be a problem unless the file is HUGE, but also improves execution time.
If you don't actually need the regex and are just doing literal replacing, string.replace() will almost certainly be faster. But even so, your bottleneck here will be file input/output, not string manipulation.
The best solution though would probably be to use cStringIO
Depending on the ratio of relevant-to-not-relevant portions of the text you're operating on (and whether or not the parts each substitution operates on overlap), it might be more efficient to try to break down the input into tokens and work on each token individually.
Since each replace() in your current implementation has to examine the entire input string, that can be slow. If you instead broke down that stream into something like...
[<normal text>, <tag>, <tag>, <normal text>, <tag>, <normal text>]
# from an original "<normal text><tag><tag><normal text><tag><normal text>"
...then you could simply look to see if a given token is a tag, and replace it in the list (and then ''.join() at the end).
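One way to get such a token list is re.split with a capturing group, which keeps the tags in the result. The sketch below assumes the tag pattern from the question; the function name transform_line and the two callbacks are made up for illustration.

```python
import re

TAG_RE = re.compile(r"(<[^>]+>)")


def transform_line(line, fix_tag, fix_text):
    """Split line into text and tag tokens and rewrite each kind separately."""
    # With one capturing group, split() returns text at even indexes
    # and the captured tags at odd indexes.
    tokens = TAG_RE.split(line)
    out = []
    for i, token in enumerate(tokens):
        out.append(fix_tag(token) if i % 2 else fix_text(token))
    return "".join(out)


result = transform_line(
    "a-b<x*t(1)>c",
    fix_tag=lambda tag: re.sub(r"\*t\([^)]*\)", "", tag),
    fix_text=lambda text: text.replace("-", "<EMDASH>"),
)
print(result)
```

Because tags are rewritten in place in the token list, this also avoids the line.replace(tag, tag_new, 1) step that the question flags as potentially problematic.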
You can pass a function object to re.sub instead of a substitution string, it takes the match object and returns the substitution, so for example
>>> r = re.compile(r'<(\w+)>|(-)')
>>> r.sub(lambda m: '(%s)' % (m.group(1) if m.group(1) else 'emdash'), '<atag>-<anothertag>')
'(atag)(emdash)(anothertag)'
Of course you can use a more complex function object, this lambda is just an example.
Using a single regex that does all the substitution should be slightly faster than iterating the string many times, but if a lot of substitutions are performed the overhead of calling the function object that computes the substitution may be significant.
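For the purely literal replacements, the same idea can be driven by a dict, roughly like this. The mapping below uses three of the examples from the question, and the names REPLACEMENTS and replace_all are made up. Note that this is a single pass: text produced by one replacement is not scanned again, which differs from a chain of str.replace() calls.

```python
import re

# Literal search string -> replacement (examples taken from the question).
REPLACEMENTS = {
    "-": "<EMDASH>",
    "<\\#>": "#",
    "<\\n>": "",
}

# Longest keys first, so a longer literal wins over a shorter one
# that starts at the same position.
PATTERN = re.compile(
    "|".join(re.escape(key) for key in sorted(REPLACEMENTS, key=len, reverse=True))
)


def replace_all(line):
    """Apply every literal replacement in one pass over line."""
    return PATTERN.sub(lambda m: REPLACEMENTS[m.group(0)], line)


print(replace_all("a-b<\\#>c<\\n>"))
```

Whether this is actually faster than 100-200 separate str.replace() calls is something to measure with the profiler on real input.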
