Inverse regex match on group in Python - python

I see a lot of similarly worded questions, but I've had a strikingly difficult time coming up with the syntax for this.
Given a list of words, I want to print all the words that do not have special characters.
I have a regex which identifies words with special characters \w*[\u00C0-\u01DA']\w*. I've seen a lot of answers with fairly straightforward scenarios like a simple word. However, I haven't been able to find anything that negates a group - I've seen several different sets of syntax to include the negative lookahead ?!, but I haven't been able to come up with a syntax that works with it.
In my case, given a string like "should print nŌt thìs", it should print "should" and "print" but not the other two words. re.findall("(\w*[\u00C0-\u01DA']\w*)", paragraph.text) gives you the words with special characters - I just want to invert that.

For this particular case, you can simply specify the regular alphabet range in your search:
a = "should print nŌt thìs"
re.findall(r"(\b[A-Za-z]+\b)", a)
# ['should', 'print']
Of course you can add digits or anything else you want to match as well.
As for negative lookaheads, they use the syntax (?!...), with ? before !, and they must be in parentheses. To use one here, you can use:
r"\b(?!\w*[À-ǚ])\w*"
This:
Checks for a word boundary \b, like a space or the start of the input string.
Does the negative lookahead and stops the match if it finds any special character preceded by 0 or more word characters. You have to include the \w* because (?![À-ǚ]) would only check for the special character being the first letter in the word.
Finally, if it makes it past the lookahead, it matches any word characters.
Demo. Note in regex101.com you must specify Python flavor for \b to work properly with special characters.
There is a third option as well:
r"\b[^À-ǚ\s]*\b"
The middle part [^À-ǚ\s]* means match any character other than special characters or whitespace an unlimited number of times.
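As a quick check of the first two patterns, here is a minimal sketch run against the sample string. It assumes Python 3, where \w and \b are Unicode-aware for str patterns. It also swaps \w* for \w+ in the lookahead version, since with \w* findall can also return empty matches at the end of each word; that swap is my adjustment, not part of the patterns above.

```python
import re

text = "should print nŌt thìs"

# Option 1: only plain ASCII letters between word boundaries.
ascii_only = re.findall(r"\b[A-Za-z]+\b", text)

# Option 2: the negative lookahead rejects any word that contains a
# character in the range À-ǚ. \w+ (instead of \w*) is an adjustment
# made here so that no empty strings are returned.
no_special = re.findall(r"\b(?!\w*[À-ǚ])\w+", text)

print(ascii_only)
print(no_special)
```

Both lists should contain only the two plain words from the sample.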

I know this is not a regex, but just a completely different idea you may not have had besides using regexes. I suppose it would be also much slower but I think it works:
>>> import unicodedata as ud
>>> [word for word in ['Cá', 'Lá', 'Aqui']\
if any(['WITH' in ud.name(letter) for letter in word])]
['Cá', 'Lá']
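To reverse the filter, negate the whole test with not any(...), or equivalently use all('WITH' not in ud.name(letter) for letter in word). Simply swapping in 'WITH' not in inside any would keep every word that has at least one plain letter.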
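To invert the filter so that only words without such letters remain, one option is to negate the whole any(...) test. This is only a sketch: the helper name has_modified_letter is made up for illustration, and the NFC normalisation step is an extra assumption so that a letter followed by a combining mark is looked up as one precomposed character.

```python
import unicodedata as ud

def has_modified_letter(word):
    """Return True if any character's Unicode name contains 'WITH'."""
    # Normalise to NFC so letter + combining-mark pairs become single
    # precomposed characters (e.g. 'á') before their names are looked up.
    composed = ud.normalize("NFC", word)
    # ud.name(ch, "") returns "" for unnamed characters instead of raising.
    return any("WITH" in ud.name(ch, "") for ch in composed)

words = ["Cá", "Lá", "Aqui"]
plain_words = [w for w in words if not has_modified_letter(w)]
print(plain_words)
```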

Related

Excluding a string in a regex expression

I currently have the following regular expression:
/(_|[a-z]|[A-Z])(_|[a-z]|[A-Z]|[0-9])*/
I would like the expression not to match with "PI", however I failed to do so.
To clarify, I would like the following to be valid:
_PI, abcPI, PIpipipi
I just don't want to accept PI when it's on its own.
Before jumping to the solution, please have a look at your regex: wrapping single-range character classes in alternation groups is an inefficient way of writing regex patterns. You may simply merge a group such as ([A-Z]|[0-9]|_)+ into [A-Z0-9_]+.
The solution may be a word boundary with a negative lookahead after it:
r"\b(?!PI\b)[_a-zA-Z][_a-zA-Z0-9]*"
See the regex demo. You may replace [a-zA-Z0-9_] with \w:
re.compile(r"\b(?!PI\b)[_a-zA-Z]\w*") # In Python 2.x, re.UNICODE is not enabled by default
re.compile(r"\b(?!PI\b)[_a-zA-Z]\w*", re.A) # In Python 3.x, make \w match ASCII only
Details
\b - word boundary
(?!PI\b) - immediately to the right, there can't be PI as a whole word
[_a-zA-Z] - an ASCII letter or _
[_a-zA-Z0-9]* - 0 or more underscores, ASCII letters or digits.
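A small sketch of how that pattern behaves on the strings from the question, assuming Python 3 (re.A keeps \w and \b ASCII-only). The variable names are just for illustration.

```python
import re

identifier = re.compile(r"\b(?!PI\b)[_a-zA-Z]\w*", re.A)

# Whole-string check with fullmatch: only a bare "PI" is rejected.
candidates = ["_PI", "abcPI", "PIpipipi", "PI"]
accepted = [s for s in candidates if identifier.fullmatch(s)]
print(accepted)

# Scanning a longer string with findall skips the standalone PI as well.
found = identifier.findall("PI _PI abcPI PIpipipi")
print(found)
```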
Submitting another answer:
^(((?!PI).)*)$|^.*(PI).+$|^.+(PI).*$
I broke it down into 3 cases using OR |:
1) Match a string that doesn't contain PI at all.
^(((?!PI).)*)$
2) Match a string that has PI in it with at least one character after it, and optionally any characters before it.
^.*(PI).+$
3) Match a string that has PI in it with at least one character before it, and optionally any characters after it.
^.+(PI).*$
Here it is with test cases:
https://regex101.com/r/7rzqpe/3
Please comment if you find a missing edge case.
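For anyone who wants to try the three-case pattern locally, here is a minimal sketch in Python. Note that, as written, this pattern does not restrict the other characters to identifier characters, so it only answers "is this string something other than a bare PI".

```python
import re

three_case = re.compile(r"^(((?!PI).)*)$|^.*(PI).+$|^.+(PI).*$")

samples = ["_PI", "abcPI", "PIpipipi", "abc", "PI"]
# Keep every sample for which one of the three alternatives matches.
accepted = [s for s in samples if three_case.search(s)]
print(accepted)
```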
Not very nice, but I'll add it anyway for variety:
/^([A-OQ-Za-z_][A-Za-z0-9_]*|P([A-HJ-Za-z0-9_][A-Za-z0-9_]*|I[A-Za-z0-9_]+)?)$/

Why does this regular expression to match two consecutive words not work?

There is a similar question here: Regular Expression For Consecutive Duplicate Words. This addresses the general question of how to solve this problem, whereas I am looking for specific advice on why my solution does not work.
I'm using Python regex, and I'm trying to match all consecutively repeated words, such as "to to" and "this this" in:
I am struggling to to make this this work
I tried:
[A-Za-z0-9]* {2}
This is the logic behind this choice of regex: The '[A-Za-z0-9]*' should match any word of any length, and '[A-Za-z0-9]* ' makes it consider the space at the end of the word. Hence [A-Za-z0-9]* {2} should flag a repetition of the previous word with a space at the end. In other words it says "For any word, find cases where it is immediately repeated after a space".
How is my logic flawed here? Why does this regex not work?
[A-Za-z0-9]* {2}
Quantifiers in regular expressions will always only apply to the element right in front of them. So a \d+ will look for one or more digits but x\d+ will look for a single x, followed by one or more digits.
If you want a quantifier to apply to more than just a single thing, you need to group it first, e.g. (x\d)+. This is a capturing group, so it will actually capture that in the result. This is sometimes undesired if you just want to group things to apply a common quantifier. In that case, you can prefix the group with ?: to make it a non-capturing group: (?:x\d)+.
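A short sketch of the difference, using the x\d example from above plus the pattern from the question. The sample strings are made up for illustration.

```python
import re

# A quantifier binds only to the single element before it.
single = re.fullmatch(r"x\d+", "x1x2")        # None: one x, then digits only
grouped = re.fullmatch(r"(?:x\d)+", "x1x2")   # matches: (x, digit) repeated

# A repeated capturing group keeps only its last repetition in findall.
captured = re.findall(r"(x\d)+", "x1x2")
whole = re.findall(r"(?:x\d)+", "x1x2")

# The pattern from the question: {2} applies to the space alone,
# so it looks for a word followed by two spaces.
question = re.findall(r"[A-Za-z0-9]* {2}", "to to  make")

print(single, bool(grouped), captured, whole, question)
```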
So, going back to your regular expression, you would have to do it like this:
([A-Za-z0-9]* ){2}
However, this does not actually have any check that the second matched word is the same as the first one. If you want to match for that, you will need to use backreferences. Backreferences allow you to reference a previously captured group within the expression, looking for it again. In your case, this would look like this:
([A-Za-z0-9]*) \1
The \1 will reference the first capturing group, which is ([A-Za-z0-9]*). So the group will match the first word. Then, there is a space, followed by a backreference to the first word again. So this will look for a repetition of the same word separated by a space.
As bobble bubble points out in the comments, there is still a lot one can do to improve the regular expression. While my main concern was to explain the various concepts without focusing too much on your particular example, I guess I still owe you a more robust regular expression for matching two consecutive words within a string that are separated by a space. This would be my take on that:
\b(\w+)\s\1\b
There are a few things that are different to the previous approach: First of all, I’m looking for word boundaries around the whole expression. The \b matches basically when a word starts or ends. This will prevent the expression from matching within other words, e.g. neither foo fooo nor foo oo would be matched.
Then, the regular expression requires at least one character. So empty words won’t be matched. I’m also using \w here which is a more flexible way of including alphanumerical characters. And finally, instead of looking for an actual space, I accept any kind of whitespace between the words, so this could even match tabs or line breaks. It might make sense to add a quantifier there too, i.e. \s+ to allow multiple whitespace characters.
Of course, whether this works better for you, depends a lot on your actual requirements which we won’t be able to tell just from your one example. But this should give you a few ideas on how to continue at least.
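To see what the word boundaries buy you, here is a minimal sketch comparing the bounded pattern (with the suggested \s+ variation) against the same pattern without \b. The test strings other than the one from the question are invented.

```python
import re

bounded = re.compile(r"\b(\w+)\s+\1\b")
unbounded = re.compile(r"(\w+)\s+\1")

# With boundaries, partial overlaps inside other words are rejected.
print(bounded.search("foo fooo"))
print(bounded.search("foo oo"))

# Without them, both strings produce a (probably unwanted) match.
loose_a = unbounded.search("foo fooo").group()
loose_b = unbounded.search("foo oo").group()
print(loose_a, "|", loose_b)

# \s+ also accepts a tab between the repeated words.
repeats = bounded.findall("I am struggling to to make this\tthis work")
print(repeats)
```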
You can match a previous capture group with \1 for the first group, \2 for the second, etc...
import re
s = "I am struggling to to make this this work"
matches = re.findall(r'([A-Za-z0-9]+) \1', s)
print(matches)
>>> ['to', 'this']
If you want both occurrences, add a capture group around \1:
matches = re.findall(r'([A-Za-z0-9]+) (\1)', s)
print(matches)
>>> [('to', 'to'), ('this', 'this')]
At a glance it looks like this will match any two words, not repeated words. If I recall correctly asterisk (*) will match zero or more times, so perhaps you should be using plus (+) for one or more. Then you need to provide a capture and re-use the result of the capture. Additionally the \w can be used for alphanumerical characters for clarity. Also \b can be used to match empty string at word boundary.
Something along the lines of the example below will get you part of the way.
>>> import re
>>> p = re.compile(r'\b(\w+) \1\b')
>>> p.findall('fa fs bau saa saa fa bau eek mu muu bau')
['saa']
These pages may offer some guidance:
Python regex cheat sheet
RegExp match repeated characters
Regular Expression For Consecutive Duplicate Words.
This should work: \b([A-Za-z0-9]+)\s+\1\b
\b matches a word boundary, \s matches whitespace and \1 specifies the first capture group.
>>> s = 'I am struggling to to make this this work'
>>> re.findall(r'\b([A-Za-z0-9]+)\s+\1\b', s)
['to', 'this']
Here is a simple solution not using RegEx.
sentence = 'I am struggling to to make this this work'
def find_duplicates_in_string(words):
    """ Takes in a string and returns any duplicate words
    i.e. "this this"
    """
    duplicates = []
    words = words.split()
    for i in range(len(words) - 1):
        prev_word = words[i]
        word = words[i + 1]
        if word == prev_word:
            duplicates.append(word)
    return duplicates
print(find_duplicates_in_string(sentence))

Python RE, is \b ever useful to indicate end of a word

I understand that \b can represent either the beginning or the end of a word. When would \b be required to represent the end? I'm asking because it seems that it's always necessary to have \s to indicate the end of the word, therefore eliminating the need to have \b. Like in the case below, one with a '\b' to end the inner group, the other without, and they get the same result.
import re
m = re.search(r'(\b\w+\b)\s+\1', 'Cherry tree blooming will begin in in later March')
print(m.group())
m = re.search(r'(\b\w+)\s+\1', 'Cherry tree blooming will begin in in later March')
print(m.group())
\s is just whitespace. You can have word boundaries that aren't whitespace (punctuation, etc.) which is when you need to use \b. If you're only matching words that are delimited by whitespace then you can just use \s; and in that case you don't need the \b.
import re
sentence = 'Non-whitespace delimiters: Commas, semicolons; etc.'
print(re.findall(r'(\b\w+)\s+', sentence))
print(re.findall(r'(\b\w+\b)+', sentence))
Produces:
['whitespace']
['Non', 'whitespace', 'delimiters', 'Commas', 'semicolons', 'etc']
Notice how trying to catch word endings with just \s ends up missing most of them.
Consider wanting to match the word "march":
>>> regex = re.compile(r'\bmarch\b')
It can come at the end of the sentence...
>>> regex.search('I love march')
<_sre.SRE_Match object at 0x10568e4a8>
Or the beginning ...
>>> regex.search('march is a great month')
<_sre.SRE_Match object at 0x10568e440>
But if I don't want to match things like marching, word boundaries are the most convenient:
>>> regex.search('my favorite pass-time is marching')
>>>
You might be thinking "But I can get all of these things using r'\s+march\s+'" and you're kind of right... The difference is in what matches. With the \s+, you also might be including some whitespace in the match (since that's what \s+ means). This can make certain things like search for a word and replace it more difficult because you might have to manage keeping the whitespace consistent with what it was before.
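The search-and-replace point can be sketched like this. The sample sentences and the replacement word are arbitrary choices for illustration.

```python
import re

text = "in march we march on"

# \b is zero-width, so only the word itself is replaced.
with_boundaries = re.sub(r"\bmarch\b", "april", text)

# \s+ consumes the surrounding whitespace, which then gets replaced too.
with_whitespace = re.sub(r"\s+march\s+", "april", text)

# Longer words that merely start with "march" are left alone.
partial = re.sub(r"\bmarch\b", "april", "marching in march")

print(with_boundaries)
print(with_whitespace)
print(partial)
```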
It's not because it's at the end of the word, it's because you know what comes after the word. In your example:
m = re.search(r'(\b\w+\b)\s+\1', 'Cherry tree blooming will begin in in later March')
...the first \b is necessary to prevent a match starting with the in in begin. The second one is redundant because you're explicitly matching the non-word characters (\s+) that follow the word. Word boundaries are for situations where you don't know what the character on the other side will be, or even if there will be a character there.
Where you should be using another one is at the end of the regex. For example:
m = re.search(r'(\b\w+)\s+\1\b', "Let's go to the theater")
Without the second \b, you would get a false positive for the theater.
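Here is that false positive as a small runnable sketch (Python 3 syntax; the variable names are just for illustration).

```python
import re

sentence = "Let's go to the theater"

# No trailing \b: "the" followed by the start of "theater" counts as a repeat.
loose = re.search(r"(\b\w+)\s+\1", sentence)

# Trailing \b: the backreference must end at a word boundary, so no match.
strict = re.search(r"(\b\w+)\s+\1\b", sentence)

print(loose.group() if loose else None)
print(strict)
```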
"I understand that \b can represent either the beginning or the end of a word. When would \b be required to represent the end?"
\b is never required to represent the end, or beginning, of a word. To answer your bigger question, it's only useful during development -- when working with natural language, you'll ultimately need to replace \b with something else. Why?
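The \b operator matches a word boundary as you've discovered. But a key concept here is, "What is a word?" In ASCII mode (the default for Python 2 str patterns, or with re.ASCII in Python 3) the answer is the very narrow set [A-Za-z0-9_]; Python 3 str patterns widen this to Unicode letters and digits, but either way a "word" here is not a natural language word but something closer to a computer language identifier. The \b operator exists for a formal language's parser.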
This means it doesn't handle common natural language situations like:
The word let's becomes two words, 'let' & 's', if \b represents the boundaries of a word. Also consider that titles like Mr. & Mrs. lose their period.
Similarly, if \b represents the start of a word, then the apostrophe in these cases will be lost: 'twas 'bout 'cause
Hyphenated words suffer at the hand of \b as well, e.g. mother-in-law (unless you want her to suffer.)
Unfortunately, you can't simply augment \b by including it in a character set as it doesn't represent a character. You may be able to combine it with other characters via alternation in a zero-width assertion.
When working with natural language, the \b operator is great for quickly prototyping an idea, but ultimately, probably not what you want. Ditto \w, but, since it represents a character, it's more easily augmented.
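As a rough illustration of the splitting problem, and of one way \w might be augmented: the second pattern below is my own assumption (allow an apostrophe or hyphen between runs of word characters), not something from the answer, and it does not cover leading apostrophes like 'twas or trailing periods like Mr.

```python
import re

phrase = "let's ask my mother-in-law"

# Plain word boundaries split on the apostrophe and the hyphens.
plain_tokens = re.findall(r"\b\w+\b", phrase)

# Hypothetical augmentation: permit ' or - between runs of word characters.
joined_tokens = re.findall(r"\w+(?:['-]\w+)*", phrase)

print(plain_tokens)
print(joined_tokens)
```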

Python Regex to capture single character alphabeticals

Why doesn't the below regex print True?
print(re.compile(r'^\b[a-z]\b$').search('(s)'))
I want to match single char alphabeticals that may have non alphanumeric characters before and after, but do not have any more alphanumeric characters anywhere in the string. So the following should be matches:
'b'
'b)'
'(b)'
'b,'
and the following should be misses:
'b(s)'
'blah(b)'
'bb)'
'b-b'
'bb'
The solutions here don't work.
The ^ at the beginning and $ at the end cause the expression to match only if the entire string is a single character. (Thus, they make each \b redundant.) Remove the anchors to match inside a larger string:
print(re.compile(r'\b[a-z]\b').search('b(s)'))
Alternatively, ensure only one character like:
print(re.compile(r'^\W*[a-z]\W*$').match('b(s)'))
Note that in the first case, 'b-b' and 'blah(b)' will match because they contain single alphabetical characters not touching others inside them. In the second case, 'b(s)' will not be a match, because it contains two alphabetical characters, but the other four cases will match correctly, and all of the no-match cases will return None (false logical value) as intended.
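The second pattern can be checked against the full list from the question with a short sketch like this (Python 3 syntax; the list names are made up).

```python
import re

single_letter = re.compile(r"^\W*[a-z]\W*$")

should_match = ["b", "b)", "(b)", "b,"]
should_miss = ["b(s)", "blah(b)", "bb)", "b-b", "bb"]

# Collect every test string the pattern accepts.
matched = [s for s in should_match + should_miss if single_letter.match(s)]
print(matched)
```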
Ok here is the answer:
print(re.compile(r'^[(,\[]?[a-z][),;\]]?[,;]?$').search('(s)'))
It catches a variety of complex patterns for single character alphanumerics. I realize this is different than what I asked for but in reality it works better.

Extracting whole words

I have a large set of real-world text that I need to pull words out of to input into a spell checker. I'd like to extract as many meaningful words as possible without too much noise. I know there's plenty of regex ninjas around here, so hopefully someone can help me out.
Currently I'm extracting all alphabetical sequences with '[a-z]+'. This is an okay approximation, but it drags a lot of rubbish out with it.
Ideally I would like some regex (doesn't have to be pretty or efficient) that extracts all alphabetical sequences delimited by natural word separators (such as [/-_,.: ] etc.), and ignores any alphabetical sequences with illegal bounds.
However I'd also be happy to just be able to get all alphabetical sequences that ARE NOT adjacent to a number. So for instance 'pie21' would NOT extract 'pie', but 'http://foo.com' would extract ['http', 'foo', 'com'].
I tried lookahead and lookbehind assertions, but they were applied per-character (so for example re.findall('(?<!\d)[a-z]+(?!\d)', 'pie21') would return 'pi' when I want it to return nothing). I tried wrapping the alpha part as a term ((?:[a-z]+)) but it didn't help.
More detail: The data is an email database, so it's mostly plain English with normal numbers, but occasionally there's rubbish strings like GIHQ4NWL0S5SCGBDD40ZXE5IDP13TYNEA and AC7A21C0 that I'd like to ignore completely. I'm assuming any alphabetical sequence with a number in it is rubbish.
If you restrict yourself to ASCII letters, then use (with the re.I option set)
\b[a-z]+\b
\b is a word boundary anchor, matching only at the start and end of alphanumeric "words". So \b[a-z]+\b matches pie, but not pie21 or 21pie.
To also allow other non-ASCII letters, you can use something like this:
\b[^\W\d_]+\b
which also allows accented characters etc. You may need to set the re.UNICODE option, especially when using Python 2, in order to allow the \w shorthand to match non-ASCII letters.
[^\W\d_] as a negated character class allows any alphanumeric character except for digits and underscore.
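A minimal sketch of both patterns on a sample built from the question's examples. It assumes Python 3, where \w and \b already handle non-ASCII letters for str patterns, and it writes the accented letter as an escape so the example does not depend on how the source file is encoded.

```python
import re

# "caf\u00e9" is "café" with a precomposed e-acute.
text = "pie21 21pie http://foo.com pie AC7A21C0 caf\u00e9"

# ASCII letters only, case-insensitive, bounded on both sides.
ascii_words = re.findall(r"\b[a-z]+\b", text, re.I)

# Any letter: word characters minus digits and underscore.
unicode_words = re.findall(r"\b[^\W\d_]+\b", text)

print(ascii_words)
print(unicode_words)
```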
Are you familiar with word boundaries (\b)? You can extract words by using \b around the sequence and matching the alphabet within:
\b([a-zA-Z]+)\b
For instance, this will grab whole words but stop at tokens such as hyphens, periods, semi-colons, etc.
You can read about the \b sequence, and others, over at the Python manual
EDIT Also, if you're looking to avoid a number following or preceding the match, you can use a negative look-ahead/behind:
(?!\d) # negative look-ahead for numbers
(?<!\d) # negative look-behind for numbers
What about:
import re
yourString="pie 42 http://foo.com GIHQ4NWL0S5SCGBDD40ZXE5IDP13TYNEA pie42"
list(filter(lambda x: re.match(r"^[a-zA-Z]+$", x), set(re.split(r"[\s:/,.]", yourString))))
Note that:
split explodes your string into potential candidates => returns a list of "potential words"
set removes duplicates => transforms the list into a set, thus removing entries appearing more than once. This step is not mandatory.
filter reduces the number of candidates : takes a list, applies a test function to each element, and returns a list of the elements passing the test. In our case, the test function is "anonymous"
lambda : anonymous function, taking an item and checking if it's a word (upper or lower letters only)
EDIT : added some explanations
Sample code
print re.search(ur'(?u)ривет\b', ur'Привет')
print re.search(ur'(?u)\bривет\b', ur'Привет')
or
s = ur"abcd ААБВ"
import re
rx1 = re.compile(ur"(?u)АБВ")
rx2 = re.compile(ur"(?u)АБВ\b")
rx3 = re.compile(ur"(?u)\bАБВ\b")
print rx1.findall(s)
print rx2.findall(s)
print rx3.findall(s)
