Fastest way to find one of several substrings in a string - Python

I'm doing a lot of file processing where I look for one of several substrings in each line. So I have code equivalent to this:
with open(file) as infile:
    for line in infile:
        for key in MY_SUBSTRINGS:
            if key in line:
                print(key, line)
MY_SUBSTRINGS is a list of 6-20 substrings. Substrings vary in length 10-30 chars and may contain spaces.
I'd really like to find a much faster way of doing this. Files have many hundreds of thousands of lines, and lines are typically about 150 characters. The user has to wait 30 seconds to a minute while a file processes. The search above is not the only thing taking time, but it accounts for quite a lot of it. I'm doing various other processing on a line-by-line basis, so it's not appropriate to search the whole file at once.
I've tried the regex and ahocorasick answers from here but they both come out slower in my tests:
Fastest way to check whether a string is a substring in a list of strings
Any suggestions for faster methods?
I'm not quite sure of the best way to share example datasets. A logcat off an Android phone would be an example. One that's at least 200k lines long.
Then search for 10 strings like:
(NL80211_CMD_TRIGGER_SCAN) received for
Trying to associate with
Request to deauthenticate
interface state UNINITIALIZED->ENABLED
I tried regexes like this:
import re

match_str = "|".join(MY_SUBSTRINGS)
regex = re.compile(match_str)
with open(file) as infile:
    for line in infile:
        match = regex.search(line)
        if match:
            print(match.group(0))

I would build a regular expression to search through the file.
Make sure that you're not testing each of the search terms in its own loop when you use regex. If all of your expressions are combined into one regexp, it would look something like this:
import re
line = 'fsjdk abc def abc jkl'
re.findall(r'(abc|def)', line)
https://docs.python.org/3/library/re.html
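Applied to the question's substrings, the combined pattern should have each literal string escaped first (otherwise something like "(NL80211_CMD_TRIGGER_SCAN)" is parsed as a regex group rather than literal parentheses), and it should be compiled once outside the per-line loop. A minimal sketch, with the file name assumed:

import re

MY_SUBSTRINGS = [
    "(NL80211_CMD_TRIGGER_SCAN) received for",
    "Trying to associate with",
    "Request to deauthenticate",
    "interface state UNINITIALIZED->ENABLED",
]

# Escape each substring so characters such as '(' are treated literally,
# then compile the alternation a single time.
pattern = re.compile("|".join(re.escape(s) for s in MY_SUBSTRINGS))

with open("logcat.txt") as infile:   # file name assumed for illustration
    for line in infile:
        match = pattern.search(line)
        if match:
            print(match.group(0), line, end="")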
If you need it to run still faster, consider processing the file concurrently. This is a much broader topic, but one approach is to first take a look at your problem and work out where the bottleneck is.
If the issue is that your loop is starved for disk throughput on the read, you could first run through the file, split it into chunks, and then hand those chunks to worker threads that process the data like a queue.
We'd definitely need some more detail on your problem to understand exactly what kind of issue you're looking to solve, and there are people here who would love to dig into a challenge.

Related

Python : Fast search a valid substring for text from list of substrings

I need a fast and efficient method for finding which pattern string, from a list of many pattern strings, is a valid substring of a given string.
Conditions -
I have a list of 100 pattern strings added in a particular sequence (known).
The test case file is 35 GB in size and contains long strings on subsequent lines.
Ask -
I have to traverse the file and, for each line, search for a pattern string that is a valid substring of that line (whichever comes first in the list of 100 pattern strings).
Example -
pattern_strings = ["earth is round and huge","earth is round", "mars is small"]
Testcase file contents -
Among all the planets, the earth is round and mars is small.
..
..
Hence for the first line, the string at index 1 should qualify the condition.
Currently, I am trying to do a linear search -
def search(line, list_of_patterns):
    for pat in list_of_patterns:
        if pat in line:
            return pat
        else:
            continue
    return -1
The current run time is 21 minutes. The intent is to reduce it further. Need suggestions!
One trick I know of, though it has nothing to do with changing your existing code, is to run your code with PyPy rather than the standard CPython interpreter. That alone can significantly speed up execution.
https://www.pypy.org/features.html
As I have installed and used it myself, I can tell you that installation is fairly simple.
This is one option if you do not want to change your code.
Another suggestion would be to time your code or use profilers to see where the bottleneck is and what is taking a relatively long amount of time.
Code-wise, you could avoid the explicit for loop and try these methods: https://betterprogramming.pub/how-to-replace-your-python-for-loops-with-map-filter-and-reduce-c1b5fa96f43a
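For the search function shown in the question, for example, the explicit loop can be folded into a single next() call over a generator expression (a sketch only; whether this is measurably faster would need profiling):

def search(line, list_of_patterns):
    # Return the first pattern (in list order) that is a substring of the line, or -1 if none match.
    return next((pat for pat in list_of_patterns if pat in line), -1)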
A final option would be to write that piece of code in a faster, more performant language such as C++ and call the resulting executable (.exe on Windows) from Python.

Getting a regex trie to run faster?

I have a 50mb regex trie that I'm using to split phrases apart.
Here is the relevant code:
import io
import re

with io.open('REGEXES.rx.txt', encoding='latin-1') as myfile:
    regex = myfile.read()

while True:
    Password = input("Enter a phrase to be split: ")
    Words = re.findall(regex, Password)
    print(Words)
Since the regex is so large, this takes forever!
Here is the code I'm trying now, with re.compile(TempRegex):
import io
import re

with io.open('REGEXES.rx.txt', encoding='latin-1') as myfile:
    TempRegex = myfile.read()

regex = re.compile(TempRegex)

while True:
    Password = input("Enter a phrase to be split: ")
    Words = re.findall(regex, Password)
    print(Words)
What I'm trying to do is I'm trying to check to see if an entered phrase is a combination of names. For example, the phrase "johnsmith123" to return ['john', 'smith', '123']. The regex file was created by a tool from a word list of every first and last name from Facebook. I want to see if an entered phrase is a combination of words from that wordlist essentially ... If johns and mith are names in the list, then I would want "johnsmith123" to return ['john', 'smith', '123', 'johns', 'mith'].
I don't think that regex is the way to go here. It seems to me that all you are trying to do is to find a list of all of the substrings of a given string that happen to be names.
If the user's input is a password or passphrase, that implies a relatively short string. It's easy to break that string up into the set of possible substrings, and then test that set against another set containing the names.
The number of substrings in a string of length n is n(n+1)/2. Assuming that no one is going to enter more than say 40 characters you are only looking at 820 substrings, many of which could be eliminated as being too short. Here is some code to do that:
def substrings(s, min_length=1):
    for start in range(len(s)):
        for length in range(min_length, len(s)-start+1):
            yield s[start:start+length]
So the problem then is loading the names into a suitable data structure. Your regex is 50MB, but considering the snippet that you showed in one of your comments, the amount of actual data is going to be a lot smaller than that due to the overhead of the regex syntax.
If you just used text files with one name per line you could do this:
names = set(word.strip().lower() for word in open('names.txt'))
def substrings(s, min_length=1):
    for start in range(len(s)):
        for length in range(min_length, len(s)-start+1):
            yield s[start:start+length]

s = 'johnsmith123'
print(sorted(names.intersection(substrings(s))))
Might give output:
['jo', 'john', 'johns', 'mi', 'smith']
I doubt that there will be memory issues given the likely small data set, but if you find that there's not enough memory to load the full data set at once, you could look at using sqlite3 with a simple table to store the names. This will be slower to query, but it avoids holding the full data set in memory.
Another way could be to use the shelve module to create a persistent dictionary with names as keys.
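A minimal sketch of the shelve idea, assuming a names.txt word list with one name per line (the file names are illustrative, not from the question):

import shelve

def substrings(s, min_length=1):
    # Same generator as above: every substring of s of at least min_length characters.
    for start in range(len(s)):
        for length in range(min_length, len(s)-start+1):
            yield s[start:start+length]

# Build the persistent dictionary once; only the keys matter.
with shelve.open('names.db') as db, open('names.txt') as fh:
    for name in fh:
        db[name.strip().lower()] = True

# Later runs can reopen it read-only and test membership without loading everything into memory.
with shelve.open('names.db', flag='r') as db:
    s = 'johnsmith123'
    print(sorted(sub for sub in substrings(s, min_length=2) if sub in db))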
Python's regex engine does not actually implement regular expressions in the formal sense: it includes features such as lookbehind, capture groups and back references, and it uses backtracking to match the leftmost valid branch instead of the longest.
If you use a true regex engine, you will almost always get better results if your regex does not require those features.
One of the most important qualities of a true regular expression is that it will always return a result in time proportional to the length of the input, using only constant extra memory.
I've written one myself using a DFA implemented in C (but usable from python via cffi), which will have optimal asymptotic performance, but I haven't tried constant-factor improvements such as vectorization and assembly generation. I didn't make a generally usable API though since I only need to call it from within my library, but it shouldn't be too hard to figure out from the examples. (Note that search can be implemented as match with .* up front, then match backward, but for my purpose I would rather return a single character as an error token). Link to my project
You might also consider building the DFA offline and using it for multiple runs of your program - but this is what flex does so there was no point in me doing that for my project, so maybe just use that if you're comfortable with C? Of course you'd almost certainly have to write a fair bit of custom C code to use my project anyway ...
If you compile it, the regex pattern is compiled into bytecodes once and then run by the matching engine. If you don't compile it, the same regex is loaded over and over each time it is called. That's why the compiled version is considerably faster when you are applying the same regex to many different records.

Fastest way of processing regexp

I have a script in python to process a log file - it parses the values and joins them simply with a tab.
import re
import sys

p = re.compile(
    "([0-9/]+) ([0-9]+):([0-9]+):([0-9]+) I.*"+
    "worker\\(([0-9]+)\\)(?:#([^]]*))?.*\\[([0-9]+)\\] "+
    "=RES= PS:([0-9]+) DW:([0-9]+) RT:([0-9]+) PRT:([0-9]+) IP:([^ ]*) "+
    "JOB:([^!]+)!([0-9]+) CS:([\\.0-9]+) CONV:([^ ]*) URL:[^ ]+ KEY:([^/]+)([^ ]*)"
)

for line in sys.stdin:
    line = line.strip()
    if len(line) == 0:
        continue
    result = p.match(line)
    if result != None:
        print "\t".join([x if x is not None else "." for x in result.groups()])
However, the script behaves quite slowly and takes a long time to process the data.
How can I achieve the same behaviour in faster way? Perl/SED/PHP/Bash/...?
Thanks
It is hard to know without seeing your input, but it looks like your log file is made up of fields that are separated by spaces and do not contain any spaces internally. If so, you could split on whitespace first to put the individual log fields into an array. i.e.
line.split() #Split based on whitespace
or
line.split(' ') #Split based on a single space character
After that, use a few small regexes or even simple string operations to extract the data from the fields that you want.
It would likely be much more efficient, because the bulk of the line processing is done with a simple rule. You wouldn't have the pitfalls of potential backtracking, and you would have more readable code that is less likely to contain mistakes.
I don't know Python, so I can't write out a full code example, but that is the approach I would take in Perl.
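(For reference, a rough Python sketch of that split-first idea; the field positions below are invented for illustration and would need to match the real log layout.)

import re
import sys

TIME_RE = re.compile(r'([0-9]+):([0-9]+):([0-9]+)')   # a small regex applied to a single field

for line in sys.stdin:
    fields = line.split()                  # split on whitespace first
    if len(fields) < 2:                    # skip lines that don't fit the assumed layout
        continue
    date, clock = fields[0], fields[1]     # field positions are assumptions
    m = TIME_RE.match(clock)
    if m:
        print('\t'.join([date] + list(m.groups())))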
I'm writing Perl, not Python, but I recently used this technique to parse very big logs:
Divide the input file into chunks (for example, FileLen/NumProcessors bytes each).
Adjust the start and end of every chunk to the nearest \n so each worker gets complete lines.
fork() to create NumProcessors workers, each of which reads its own byte range from the file and writes its own output file.
Merge the output files if needed.
Sure, you should work on optimizing the regexp too; for example, use .* less, since it causes a lot of backtracking, which is slow. But either way, the bottleneck will almost certainly be CPU time spent in this regexp, so spreading the work across 8 CPUs should help.
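A rough Python analogue of that chunking scheme, using multiprocessing instead of fork() (the file name, worker count and trimmed-down pattern are assumptions; each worker reads its whole byte range at once, so chunks should be kept to a reasonable size):

import os
import re
from multiprocessing import Pool

LOG_PATH = 'input.log'     # assumed file name
NUM_WORKERS = 8
# Trimmed-down example pattern; the full pattern from the question would go here.
PATTERN = re.compile(r'=RES= PS:([0-9]+) DW:([0-9]+) RT:([0-9]+)')

def chunk_bounds(path, n):
    # Split the file into n byte ranges, nudging each boundary forward to the next newline.
    size = os.path.getsize(path)
    cuts = [0]
    with open(path, 'rb') as f:
        for i in range(1, n):
            f.seek(size * i // n)
            f.readline()                      # finish the current line so no line is split
            cuts.append(min(f.tell(), size))
    cuts.append(size)
    return [(a, b) for a, b in zip(cuts[:-1], cuts[1:]) if a < b]

def process_chunk(bounds):
    # Read one byte range, apply the regex line by line, and return the extracted rows.
    start, end = bounds
    rows = []
    with open(LOG_PATH, 'rb') as f:
        f.seek(start)
        data = f.read(end - start)            # the whole chunk is held in memory
    for raw in data.splitlines():
        m = PATTERN.search(raw.decode('utf-8', errors='replace'))
        if m:
            rows.append('\t'.join(m.groups()))
    return rows

if __name__ == '__main__':
    with Pool(NUM_WORKERS) as pool:
        for rows in pool.map(process_chunk, chunk_bounds(LOG_PATH, NUM_WORKERS)):
            if rows:
                print('\n'.join(rows))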
In Perl it is possible to use precompiled regexps which are much faster if you are using them many times.
http://perldoc.perl.org/perlretut.html#Compiling-and-saving-regular-expressions
"The qr// operator showed up in perl 5.005. It compiles a regular expression, but doesn't apply it."
If the data is large, then it is worth processing it in parallel by splitting the data into pieces. There are several modules on CPAN which make this easier.

How to search for string in Python by removing line breaks but return the exact line where the string was found?

I have a bunch of PDF files that I have to search against a set of keywords. I have to extract the exact line where a keyword was found. I first used xpdf's pdf2text to convert each PDF to text. (I tried Solr but had a tough time tailoring the output/schema to suit my requirement.)
import sys

file_name = sys.argv[1]
searched_string = sys.argv[2]
result = [(line_number+1, line) for line_number, line in enumerate(open(file_name)) if searched_string.lower() in line.lower()]

#print result
for each in result:
    print each[0], each[1]
ThinkCode:~$ python find_string.py sample.txt "String Extraction"
The problem I have with this is the case where the search string is broken across the end of a line:
If you are going to index large binary files, remember to change the
size limits. String
Extraction is a common problem
If I am searching for 'String Extraction', I will miss this occurrence with the code above. What is the most efficient way of handling this without making two copies of the text file (one for searching for the keyword to extract the line number, and another with line breaks removed to catch the case where the keyword spans two lines)?
Much appreciated guys!
Note: Some considerations without any code, but I think they belong to an answer rather than to a comment.
My idea would be to search only for the first word of the keyword; if a match is found, search for the second. This lets you, when the match falls at the end of the line, take the next line into consideration, and do the line concatenation only if a match was found in the first place.*
Edit:
Coded a simple example and ended up using a different algorithm; the basic idea behind it is this code snippet:
import re

def iterwords(fh):
    for number, line in enumerate(fh):
        for word in re.split(r'\s+', line.strip()):
            yield number, word
It iterates over the file handler and produces a (line_number, word) tuple for each word in the file.
The matching afterwards becomes pretty easy; you can find my implementation as a gist on github. It can be run as follows:
python search.py 'multi word search string' file.txt
There is one main concern with the linked code, I didn't code a workaround both for performance and complexity reasons. Can you figure it out? (Spoiler: try to search for a sentence whose first word appears two times in a row in the file)
* I didn't perform any testing on my own, but this article and the Python wiki suggest that string concatenation is not that efficient in Python (I don't know how current that information is).
There may be a better way of doing it, but my suggestion would be to start by taking in two lines (let's call them line1 and line2), concatenating them into line3 or something similar, and then searching that resultant line.
Then you'd assign line2 to line1, get a new line2, and repeat the process.
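A minimal sketch of that two-line window (the extracted text is read into memory, and a match is reported on the line where it starts; the names are mine):

def search_spanning(path, needle):
    # Print line numbers where `needle` occurs, even when it wraps onto the next line.
    needle = needle.lower()
    with open(path) as fh:
        lines = [line.rstrip('\n') for line in fh]
    for i, line in enumerate(lines):
        if needle in line.lower():
            print(i + 1, line)
        elif i + 1 < len(lines):
            joined = (line + ' ' + lines[i + 1]).lower()
            pos = joined.find(needle)
            if 0 <= pos < len(line):          # the phrase starts here and spills onto the next line
                print(i + 1, line)

search_spanning('sample.txt', 'String Extraction')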
Use the flag re.MULTILINE when compiling your expressions: http://docs.python.org/library/re.html#re.MULTILINE
Then use \s to represent all white space (including new lines).
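A short sketch of that approach, run against the whole text at once and recovering the starting line number afterwards (the file name is assumed):

import re

keyword = 'String Extraction'
# Any run of whitespace, including a line break, may separate the words of the keyword.
pattern = re.compile(r'\s+'.join(map(re.escape, keyword.split())), re.MULTILINE)

with open('sample.txt') as fh:
    text = fh.read()

for match in pattern.finditer(text):
    line_number = text.count('\n', 0, match.start()) + 1   # line on which the match begins
    print(line_number, match.group(0))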

re.compile(pattern, file) calls cause the system to crash

I have a file I need to parse. The parsing is built up incrementally, such that on each iteration the expressions become more case-specific.
The code segment which overloads the system looks roughly like this:
for item in ret:
    pat = r'a\sstyle=".+class="VEAPI_Pushpin"\sid="msftve(.+?)".+>%s<' % item[1]
    r = re.compile(pat, re.DOTALL)
    match = r.findall(f)
The file is a rather large HTML file (parsed from bing maps), and each answer must match its exact id.
Before applying this change, the workflow was fine. Is there anything I can do to avoid the crash, or to optimize the code?
My only guess is that you are getting too many matches and running out of memory. Though this doesn't seem very reasonable, it might be the case. Try using finditer instead of findall to get one match at a time without creating a monster list of matches. If that doesn't fix your problem, you might have stumbled on a more serious bug in the re module.
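For reference, a sketch of the finditer variant, wrapping the question's pattern in a generator so matches are produced one at a time rather than collected into a list:

import re

def iter_pushpins(page_text, item_name):
    # Lazily yield pushpin ids instead of building them all up with findall().
    pat = r'a\sstyle=".+class="VEAPI_Pushpin"\sid="msftve(.+?)".+>%s<' % item_name
    for match in re.finditer(pat, page_text, re.DOTALL):
        yield match.group(1)

# usage sketch: for pin_id in iter_pushpins(f, item[1]): ...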
