I'm trying to take Regex substring one mismatch in any location of string and scale it to a big-data situation where I can:
Match all instances of a long substring such as SSQPSPSQSSQPSS (allowing at most one mismatch within the substring) against a much larger string such as SSPQQQQPSSSSQQQSSQPSPSQSSQPSSQPPSSSSQPSPSQSSQPSSSSQPSPSQSSQPSSSSQPSPSQSSQPSSSSQSPSSQSSQPSS.
In reality, my substrings and the strings I match them against run to hundreds and sometimes thousands of letters, and I want to allow for mismatches.
How can I scale the regex approach from Regex substring one mismatch in any location of string to a problem of this size? Is there an efficient way to go about this?
You may try this:
>>> import re
>>> s = "SSPQQQQPSSSSQQQSSQPSPSQSSQPSSQPPSSSSQPSPSQSSQPSSSSQPSPSQSSQPSSSSQPSPSQSSQPSSSSQSPSSQSSQPSS"
>>> re.findall(r'(?=(SSQPSPSQSSQPSS|[A-Z]SQPSPSQSSQPSS|S[A-Z]QPSPSQSSQPSS|SS[A-Z]PSPSQSSQPSS))', s)
['SSQPSPSQSSQPSS', 'SSQPSPSQSSQPSS', 'SSQPSPSQSSQPSS', 'SSQPSPSQSSQPSS']
Likewise, extend the alternation by replacing each remaining position with [A-Z] (the pattern above only covers the first three positions).
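Writing the alternation out by hand does not scale, but you can generate it programmatically. A minimal sketch (one_mismatch_pattern is a hypothetical helper, not a library function):

import re

def one_mismatch_pattern(sub):
    # the exact substring, plus one variant per position with that
    # position replaced by a wildcard character class
    variants = [sub] + [sub[:i] + '[A-Z]' + sub[i + 1:] for i in range(len(sub))]
    return r'(?=(%s))' % '|'.join(variants)

s = "SSPQQQQPSSSSQQQSSQPSPSQSSQPSSQPPSSSSQPSPSQSSQPSSSSQPSPSQSSQPSSSSQPSPSQSSQPSSSSQSPSSQSSQPSS"
# finds every 14-letter window within one mismatch of the target
print(re.findall(one_mismatch_pattern('SSQPSPSQSSQPSS'), s))

Note, though, that the generated pattern grows quadratically with the substring length, so for substrings of hundreds or thousands of letters the third-party regex module's fuzzy matching, e.g. regex.findall(r'(?:SSQPSPSQSSQPSS){s<=1}', s, overlapped=True), where s<=1 allows at most one substitution, is likely to scale better.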
I'm working on a small project and need a regular expression that accepts strings containing each character of a given alphabet at least once.
So for the alphabet {J, K, L} I would need a RegEx that accepts strings containing J one or more times AND K one or more times, AND L one or more times, in any order, with any amount of duplicate characters before, after, or in-between.
I'm pretty inexperienced with RegEx and so have trouble finding "lateral thinking" solutions to many problems. My first approach to this was therefore pretty brute-force: I took each possible "base" string, for example,
JKL, JLK, KJL, KLJ, LKJ, LJK
and allowed any string that could be built up from one of those starting points. However, the resulting regular expression* (despite working) ends up very long and full of redundancy. Not to mention that this approach becomes completely untenable once the alphabet has more than a handful of characters.
I spent a few hours trying to find a more elegant approach, but I have yet to find one that still accepts every possible string. Is there a method or technique I could be using to get this done in a way that's more elegant and scalable (to larger alphabets)?
*For reference, my regular expression for the listed example:
((J|K|L)*J(J|K|L)*K(J|K|L)*L(J|K|L)*)|
((J|K|L)*J(J|K|L)*L(J|K|L)*K(J|K|L)*)|
((J|K|L)*K(J|K|L)*J(J|K|L)*L(J|K|L)*)|
((J|K|L)*K(J|K|L)*L(J|K|L)*J(J|K|L)*)|
((J|K|L)*L(J|K|L)*J(J|K|L)*K(J|K|L)*)|
((J|K|L)*L(J|K|L)*K(J|K|L)*J(J|K|L)*)
This is a typical use-case for a lookahead. You can simply use ^(?=[^J]*J)(?=[^K]*K)(?=[^L]*L) to check all your conditions. If your string also must contain only these characters, you can append [JKL]+$ to it.
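A sketch of this in Python, generating the lookaheads for an arbitrary alphabet (assuming the alphabet contains no regex metacharacters):

import re

alphabet = 'JKL'
# one lookahead per required character, then restrict the string to the alphabet
pattern = ''.join(f'(?=[^{c}]*{c})' for c in alphabet) + f'[{alphabet}]+$'

print(bool(re.match(pattern, 'KLLJ')))  # True: J, K and L all present
print(bool(re.match(pattern, 'JJKK')))  # False: no L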
If using regex is not a requirement you could also check for the characters individually:
text = ...
alphabet = 'JKL'
assert all(character in text for character in alphabet)
Or if you do not want to allow characters that are not in the alphabet:
assert set(alphabet) == set(text)
Suppose I want to return all occurrences of 'lep' in a string in Python, but not if an occurrence is in a substring like 'filepath' or 'telephone'. Right now I am using a combination of negative lookahead/lookbehind:
(?<!te|fi)lep(?!hone|ath)
However, I do want 'telepath' and 'filephone' as well as 'filep' and 'telep'. I've seen similar questions but not one that addresses this type of combination of lookahead/behind.
Thanks!
You can place lookaheads inside lookbehinds (and vice-versa; any combination, really, so long as every lookbehind has a fixed length). That allows you to combine the two conditions into one (doesn't begin with X and end with Y):
lep(?<!telep(?=hone))(?<!filep(?=ath))
Putting the lookbehinds last is more efficient, too. I would advise doing it that way even if there's no suffix (for example, lep(?<!filep) to exclude filep).
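For instance, in Python, whose re module accepts a lookahead inside a lookbehind (the assertion consumes no width, so the lookbehind stays fixed-length):

import re

pattern = re.compile(r'lep(?<!telep(?=hone))(?<!filep(?=ath))')
for word in ['telephone', 'filepath', 'telepath', 'filephone', 'filep', 'telep']:
    print(word, bool(pattern.search(word)))
# only 'telephone' and 'filepath' print False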
However, generating the regexes from user input like lep -telephone -filepath promises to be finicky and tedious. If you can, it would be much easier to search for the unwanted terms first and eliminate them. For example, search for:
(?:telephone|filepath|(lep))
If the search succeeds and group(1) is not None, it's a hit.
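In Python, that check might look like this:

import re

pattern = re.compile(r'(?:telephone|filepath|(lep))')
text = 'filep telephone telepath filepath filephone'

# a hit only counts when the 'lep' group itself participated in the match
hits = [m.start(1) for m in pattern.finditer(text) if m.group(1) is not None]
print(hits)  # offsets of 'lep' inside 'filep', 'telepath' and 'filephone'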
Having some arbitrary string such as
hello hello hello I am I am I am your string string string string of strings
Can I somehow find repetitive sub-strings delimited by spaces (EDIT)? In this case they would be 'hello', 'I am' and 'string'.
I have been wondering about this for some time, but I still cannot find a real solution.
I have also read some articles on the topic and came across suffix trees, but can they help even though I need to find every repetition, e.g. with a repetition count higher than two?
If so, is there some library for Python that can handle suffix trees and perform operations on them?
Edit: I am sorry I was not clear enough. Just to make it clear: I am looking for repetitive sub-strings, i.e. sequences in the string that, in terms of regular expressions, could be replaced by + or {} quantifiers. So if I had to write a regular expression for the string above, I would do
(hello ){3}(I am ){3}your (string ){4}of strings
To find two or more characters that repeat two or more times, each delimited by spaces, use:
(.{2,}?)(?:\s+\1)+
Here's a working example with your test string: http://bit.ly/17cKX62
EDIT: made the quantifier in the capture group reluctant by adding ?, so it matches the shortest possible unit (i.e. "string" rather than "string string").
EDIT 2: added a required space delimiter for cleaner results
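In Python, a quick check of that pattern against the question's string (re.findall returns only the captured unit):

import re

text = "hello hello hello I am I am I am your string string string string of strings"
print(re.findall(r'(.{2,}?)(?:\s+\1)+', text))
# ['hello', 'I am', 'string']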
There is a big string and I need to find all substrings containing exactly N words (if possible).
For example:
big_string = "The most elegant way to find n words in String with the particular word"
N = 2
find_sub(big_string, 'find', N=2) # => ['way to find n words']
I've tried to solve it with regular expressions, but it turned out to be more complex than I expected at first. Is there an elegant solution around that I've just overlooked?
Update:
By word we mean anything delimited by \b.
The N parameter indicates how many words should appear on each side of 'find'.
For your specific example (if we use the "word" definition of regular expressions, i.e. anything containing letters, digits and underscores) the regex would look like this:
r'(?:\w+\W+){2}find(?:\W+\w+){2}'
\w matches one of said word characters. \W matches any other character. I think it's obvious where in the pattern your parameters go. You can use the pattern with re.search or re.findall.
The issue arises if there are fewer than the desired number of words around your query (i.e. if it's too close to one end of the string). But you should be able to get away with:
r'(?:\w+\W+){0,2}find(?:\W+\w+){0,2}'
thanks to the greediness of the repetition. Note that in any case, if you want multiple results, matches can never overlap. So with the first pattern, if two occurrences of find are too close to each other, you will only get the first match; with the second, you won't get the full n words before the second find (the ones already consumed by the first match will be missing). In particular, if two occurrences of find are so close together that the second find is already part of the first match, then you can't get the second match at all.
If you want to treat a word as anything that is not a white-space character, the approach looks similar:
r'(?:\S+\s+){0,2}find(?:\s+\S+){0,2}'
For anything else you will have to come up with the character classes yourself, I guess.
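Wrapped up as a function, a sketch along the lines of the question's find_sub (re.escape guards against metacharacters in the search word):

import re

def find_sub(big_string, word, N=2):
    # up to N \w+ words (with their \W+ separators) on each side of `word`
    pattern = r'(?:\w+\W+){0,%d}%s(?:\W+\w+){0,%d}' % (N, re.escape(word), N)
    return re.findall(pattern, big_string)

big_string = "The most elegant way to find n words in String with the particular word"
print(find_sub(big_string, 'find', N=2))  # ['way to find n words']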
Let's say I have a paragraph of text like so:
Snails can be found in a very wide
range of environments including
ditches, deserts, and the abyssal
depths of the sea. Numerous kinds of snail can
also be found in fresh waters. (source)
I have 10,000 regex rules to match text, which can overlap. For example, the regex /Snails? can/i will find two matches (italicized in the text). The regex /can( also)? be/i has two matches (bolded).
After iterating through my regexes and finding matches, what is the best data structure to use so that, given some position in the text, it returns all regexes that matched there? For example, if I want the matches for line 1, character 8 (0-based, which is the a in can), I would get a match for both regexes described above.
I can create a hashmap (key: character location, value: set of all matching regexes). Is this optimal? Is there a better way to parse the text with thousands of regexes (rather than looping through each one)?
Thanks!
Storing all of the matches in a dictionary will work, but it means you'll have to keep all of the matches in memory at the same time. If your data is small enough to easily fit into memory, don't worry about it. Just do what works and move on.
If you do need to reduce memory usage or increase speed, it really depends on how you are using the data. For example, if you process positions from the beginning of the text to the end, you could use re.finditer to process all of the regexes iteratively and avoid keeping matches in memory longer than needed.
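As a minimal sketch of that dictionary, with two rules standing in for the question's 10,000 and a flat character offset instead of line/column coordinates:

import re
from collections import defaultdict

text = ("Snails can be found in a very wide range of environments "
        "including ditches, deserts, and the abyssal depths of the sea. "
        "Numerous kinds of snail can also be found in fresh waters.")
rules = [re.compile(r'Snails? can', re.I), re.compile(r'can( also)? be', re.I)]

# map each character offset covered by a match to the set of matching rule ids
matches_at = defaultdict(set)
for rule_id, rule in enumerate(rules):
    for m in rule.finditer(text):
        for pos in range(m.start(), m.end()):
            matches_at[pos].add(rule_id)

print(sorted(matches_at[8]))  # the 'a' in the first 'can' -> [0, 1]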
I'm assuming that your regexes do not cross sentence boundaries. In that case you could:
1) break your text into an array of sentences
2) for each sentence, simply record which regexes (by id) have matched
3) when you want to see a match, run the regex again
"Store less / compute more" solution.