I need to search for a regular expression pattern in a huge list of file names, so time complexity is a very important factor for me. Currently, I am using this code to find the pattern:
for name in mylist:
    if reobj.match(name):
        # do some stuff and return the result
So, I think the reobj.match call is taking O(n^2) time. Can anyone suggest another pattern-searching algorithm in Python to make the operation faster?
Please note that the list is not sorted, so I can't do a binary search here.
A few other ideas I have relate to maintaining an index on the file names, but again the concern there is the regular expression patterns. How can I index on patterns?
I want to find the prefix of a word for NLP purposes (I'm interested in morphological negation).
For example, I want to know that "unable" is negative, but "university" does not have any sort of negation. I have been using Python's startswith function so far, but obviously there can be some issues.
Does anyone have any experience with finding prefixes of words? I feel like there should be some library or api, but I'm not sure.
Thanks!
Short of a full morphological analyser, you can work around this with exception lists and longest matching.
For example: you assume un- expresses negation. Find longer prefixes (such as uni-) and match those first, before looking at un-. There will be a handful of exceptions, such as uninteresting, which you can check for separately; this will be a fairly smallish list. Then, once all the uni- words have been dealt with, anything starting with un- is a candidate, though there will also be exceptions, such as under.
A slightly better solution is possible if you have a basic word list: cut off un- from the beginning of the string, and check whether the remainder is in your word list. University will become iversity, which is not in your list, so it's not the un- prefix. However, uninteresting will become interesting, which is, so here you have found a valid prefix. All you need for this is a list of non-negated words. You can of course also use this for other prefixes, such as the alpha privative: in atypical, the remainder typical will be in your list.
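A minimal sketch of that check, assuming word_list is a set of known non-negated words (the prefix list and the words below are purely illustrative):
NEGATION_PREFIXES = ("un", "in", "a")  # illustrative, not exhaustive

def negation_prefix(word, word_list):
    """Return the negation prefix if stripping it leaves a known word, else None."""
    for prefix in NEGATION_PREFIXES:
        if word.startswith(prefix) and word[len(prefix):] in word_list:
            return prefix
    return None

word_list = {"able", "interesting", "typical"}  # toy stand-in for a real word list
print(negation_prefix("unable", word_list))      # un
print(negation_prefix("university", word_list))  # None
print(negation_prefix("atypical", word_list))    # a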
If you don't have such a list, simply split your text into tokens, sort and unique them, and then scan down the line of words beginning with your candidate prefixes. It's a bit tedious, but the number of relevant words is not that big. It's what we all did in NLP 30 years ago... :)
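If it helps, that fallback is only a few lines as well (the sample sentence is made up):
import re

text = "An unable student at the university found it uninteresting and atypical."
tokens = sorted(set(re.findall(r"[a-z]+", text.lower())))
# scan this short candidate list by hand for genuine negations
print([t for t in tokens if t.startswith(("un", "a"))])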
I need to do text matching over a large number of strings (between each pair) and find the overlapping subsequences. I wanted to know whether the Knuth-Morris-Pratt algorithm would be best for this job, considering that I want this functionality in Python and it should be scalable over a large set of strings. I am looking for advice: is this the best way to go about it, or is there a better way to do string matching that is both scalable and efficient?
TL;DR: scalable + efficient = RegEx.
First of all, I recommend reading Regular Expression Matching Can Be Simple And Fast.
RegEx would probably be the most scalable solution, since it's not just for matching: it also provides group-capturing and back-referencing.
Furthermore, Python's re module is written in C and will probably be faster than most of the code you'll write yourself in Python.
For simple substring searching you could indeed use the Knuth-Morris-Pratt algorithm, but when it comes to real-world words and phrases (which are not so repetitive), you may find that RegEx is better on average.
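To make that concrete, here is a rough sketch (the strings are made up) of plain substring search with a compiled pattern, which is usually worth trying before hand-rolling KMP:
import re

needle = "pratt"
haystack = "knuth morris pratt algorithm, pratt parsing"

# escape the needle in case it contains regex metacharacters
pattern = re.compile(re.escape(needle))
print([m.start() for m in pattern.finditer(haystack)])  # [13, 30]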
I have the following text
text = "This is a string with C1234567 and CM123456, CM123, F1234567 and also M1234, M123456"
And I would like to extract this list of substrings
['C1234567', 'CM123456', 'F1234567']
This is what I came up with
new_string = re.compile(r'\b(C[M0-9]\d{6}|[FM]\d{7})\b')
new_string.findall(text)
However, I was wondering if there's a way to do this faster since I'm interested in performing this operation tens of thousands of times.
I thought I could use ^ to match the beginning of the string, but the regex I came up with
new_string = re.compile(r'\b(^C[M0-9]\d{6}|^[FM]\d{7})\b')
doesn't return anything anymore. I know this is a very basic question, but I'm not sure how to use ^ properly.
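Here is a minimal reproduction of that behaviour with the same sample text (as I understand it, ^ anchors the match at the start of the whole string):
import re

text = "This is a string with C1234567 and CM123456, CM123, F1234567 and also M1234, M123456"
anchored = re.compile(r'\b(^C[M0-9]\d{6}|^[FM]\d{7})\b')

print(anchored.findall(text))          # [] -- the IDs are not at position 0
print(anchored.findall("C1234567 x"))  # ['C1234567']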
Good and bad news. Bad news: the regex looks pretty good, so it's going to be hard to improve. Good news: I have some ideas :) I would try a little outside-the-box thinking if you are looking for performance. I do Extract Transform Load work, and a lot of it with Python.
You are already using re.compile (big help).
The regex engine works left to right, so short-circuit where you can. That doesn't seem to apply here.
If you have a big chunk of data that you are going to loop over multiple times, clean it up front ONCE of stuff you KNOW won't match. Think of an HTML page where you only want what's in HEAD and need to run many regexes over that section: extract that section once and regex only that section, not the whole page. Seems obvious, but it isn't always. :)
Use some metrics and give cProfile a try. Maybe there is some logic around where you are regexing that you can speed up. At the very least you can find your bottleneck; maybe the regex isn't the problem at all.
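For a quick baseline before reaching for cProfile, something like this (using the pattern and sample text from the question) times the extraction directly:
import re
import timeit

text = "This is a string with C1234567 and CM123456, CM123, F1234567 and also M1234, M123456"
pattern = re.compile(r'\b(C[M0-9]\d{6}|[FM]\d{7})\b')

# total seconds for 10,000 runs of the extraction
print(timeit.timeit(lambda: pattern.findall(text), number=10_000))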
Brief version
I have a collection of python code (part of instructional materials). I'd like to build an index of when the various python keywords, builtins and operators are used (and especially: when they are first used). Does it make sense to use ast to get proper tokenization? Is it overkill? Is there a better tool? If not, what would the ast code look like? (I've read the docs but I've never used ast).
Clarification: This is about indexing python source code, not English text that talks about python.
Background
My materials are in the form of IPython notebooks, so even if I had an indexing tool I'd need to do some coding anyway to get the source code out. And I don't have an indexing tool; googling "python index" doesn't turn up anything with the intended sense of "index".
I thought, "it's a simple task, let's write a script for the whole thing so I can do it exactly the way I want". But of course I want to do it right.
The dumb solution: read the file, tokenize on whitespace and word boundaries, index. But this gets confused by the contents of strings (when does for really get introduced for the first time?), and of course attached operators are not separated: text+="this" will be tokenized as ["text", '+="', "this"].
Next idea: I could use ast to parse the file, then walk the tree and index what I see. This looks like it would involve ast.parse() and ast.walk(). Something like this:
import ast

for source in source_files:
    with open(source) as fp:
        code = fp.read()
    tree = ast.parse(code)
    for node in ast.walk(tree):
        ...  # Get the node's keyword, identifier etc., and its line number -- how?
        print(term, source, line)  # I do know how to make an index
So, is this a reasonable approach? Is there a better one? How should this be done?
Did you search on "index" alone, or for "indexing tool"? I would think that your main problem would be to differentiate a language concept from its natural language use.
I expect that your major difficulty here is not traversing the text, but the pattern-matching needed to find these things. For instance, how do you recognize the introduction of for loops? This would be the word for "near" the word loop, with a for statement "soon" after. That statement would be a line beginning with for and ending with a colon.
That is just one pattern, albeit one with many variations. However, consider what it takes to differentiate that from a list comprehension, and that from a generator expression (both explicit and built-in).
Will you have directed input? I'm thinking that a list of topics and keywords is essential, not all of which are in the language's terminal tokens -- although a full BNF grammar would likely include them.
Would you consider a mark-up indexing tool? Sometimes, it's easier to place a mark at each critical spot, doing it by hand, and then have the mark-up tool extract an index from that. Such tools have been around for at least 30 years. These are also found with a search for "indexing tools", adding "mark-up" or "tagging" to the search.
Got it. I thought you wanted to parse both, using the code as the primary key for introduction. My mistake. Too much contact with the Society for Technical Communication. :-)
Yes, AST is overkill -- internally. Externally it works: it gives you a tree including those critical non-terminal tokens (such as "list comprehension"), and it's easy to get given a BNF and the input text.
This would give you a sequential list of parse trees. Your coding would consist of traversing the trees to make an index of each new concept from your input list. Once you find a concept, you index the instance, remove it from the input list, and continue until you run out of sample code or input items.
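For the mechanics, a rough sketch along the lines of the question's own snippet (assuming the notebook code has already been extracted to .py files; names like index_nodes are illustrative):
import ast

def index_nodes(source_path):
    """Yield (concept, line) pairs for one source file -- a rough sketch."""
    with open(source_path) as fp:
        tree = ast.parse(fp.read())
    for node in ast.walk(tree):
        if hasattr(node, "lineno"):
            # The node's class name (For, ListComp, FunctionDef, ...) stands in
            # for the concept; names and keywords can be refined from here.
            yield type(node).__name__, node.lineno

# for source in source_files:
#     for concept, line in index_nodes(source):
#         print(concept, source, line)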
I have a list of regular expressions and a string. I want to know which of the regular expressions (possibly more than one) match the string, if any. The trivial solution would be to try the regular expressions one by one, but this part is performance-critical... Is there a faster solution? Maybe by combining the regular expressions into a single state machine somehow?
Reason: the regular expressions are user-supplied filters that match incoming strings. When a message with the string arrives, the system needs to perform additional user-specified actions.
There are up to 10,000 regular expressions available. They are user-supplied and can be somewhat simplified if necessary (.* should be allowed, though :) ). The list of regexes is saved in MongoDB, but I can also prefetch them and perform the search inside Python if necessary.
Similar questions:
similar to this question, but the constraints are different: fuzzy matching is not enough, number of regular expressions is much lower (up to 10k)
similar to this question, but I can pre-process the regexes if necessary
similar to this question, but I need to find all matching regular expressions
I would appreciate some help.
First of all, if you have on the order of 10K regular expressions, you should definitely prefetch them and keep them compiled (using re.compile) in memory. As a second step, I recommend thinking about parallelism. Threads in Python aren't strong due to the GIL, so use multiple processes instead. And third, I would think about scaling across several servers, using ZeroMQ (or another MQ) for communication.
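A minimal sketch of the first point, with made-up patterns standing in for the ones fetched from MongoDB:
import re

# illustrative placeholders for the user-supplied filters
raw_patterns = [r"error: .*", r"user \d+ logged in", r"GET /api/.*"]
compiled = [(p, re.compile(p)) for p in raw_patterns]

def matching_patterns(message):
    """Return every pattern that matches the incoming string."""
    return [p for p, rx in compiled if rx.search(message)]

print(matching_patterns("user 42 logged in"))  # ['user \\d+ logged in']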
As an interesting scientific task, you can try to build a regexp parser which arranges similar regexps into trees:
A
|-B
|-C
|-D
|-E
So if regexp A matches the string, then B, C, D and E match it too, and you will be able to reduce the number of checks. IMHO, this task would take a lot of time; using a bunch of servers will be cheaper and faster.