I'm working on a regular expression that finds "he" or "she" as standalone words surrounded by whitespace, so it should not match "he" inside other words. It is searching through a book.
I have tried the '+' 'and'
import re
def q9():
    pattern = r'\s(he)\s'
    return re.compile(pattern)
This returns 1371 matches when it should return 2000. (The expected count doesn't really apply to you unless you know the book.)
Use this:
re.compile(r'\bs?he\b', re.I)
re.I does case-insensitive matching, \b is a word boundary, and s?he means the s is optional while he must always be matched. An equivalent and more readable way to write this is r'\b(she|he)\b'.
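To see why the whitespace-only pattern undercounts, here is a small sketch. The sample sentence is made up for illustration (it is not from the book in the question); it just shows that \s(he)\s misses matches at the start of the text, next to punctuation, and in "she".

```python
import re

# Made-up sample text, not taken from the book in the question.
sample = "He said she left. Then he, not she, waved; he smiled."

# Original attempt: needs whitespace on both sides and is case-sensitive.
strict = re.findall(r'\s(he)\s', sample)

# Word-boundary version from the answer above.
loose = re.findall(r'\bs?he\b', sample, re.I)

print(strict)  # only the "he" that has a space on each side
print(loose)   # every standalone he/she, regardless of case or punctuation
```

Note that "Then" is not matched by either pattern, because there is no word boundary between "T" and "h".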
I am looking for expressions as Vc Am in texts and for that I have
rex = r"(\(?)(?<!([A-Za-z0-9]))[A-Z][a-z](?!([A-Za-z0-9]))(\)?)"
explanation:
[A-Z][a-z] = Cap followed by lower case letter
(?<!([A-Za-z0-9])) -> lookbehind not being a letter or number
(?!([A-Za-z0-9]))(\)?) _ Look ahead not being letter or number
# all that optionally within parentheses
import re
text="this is Vc and not Cr nor Pb"
matches = re.finditer(rex,text)
What I want to achieve is exclude a list of terms like Cr or Pb.
How should I include exceptions in the expression?
thanks
First, let's shorten your RegEx:
(?<!([A-Za-z0-9])) -> lookbehind not being a letter or number
(?!([A-Za-z0-9]))(\)?) -> look ahead not being letter or number
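These are so common that there is a RegEx feature for them: word boundaries, \b. Like lookarounds they have zero width, and they match only between a word character and a non-word character (or the start or end of the string). Note that \b also counts the underscore as a word character, which your lookarounds do not.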
Your RegEx then becomes \b[A-Z][a-z]\b; looking at this RegEx (and your examples), it appears you want to match certain element abbreviations?
Now you can simply use a lookbehind:
\b[A-Z][a-z](?<!Cr|Pb)\b
to assert that the element is neither chromium (Cr) nor lead (Pb).
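A quick check of the lookbehind version against the sample text from the question. One thing to keep in mind: Python's re module only accepts fixed-width lookbehinds, so this works because every excluded symbol here is exactly two characters long.

```python
import re

text = "this is Vc and not Cr nor Pb"

# Two-letter token at word boundaries whose last two characters
# are not "Cr" or "Pb" (all alternatives have the same width).
pattern = re.compile(r'\b[A-Z][a-z](?<!Cr|Pb)\b')

found = pattern.findall(text)
print(found)
```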
Just for fun:
Alternatively, if you want a less readable (but more portable) RegEx that makes do with fewer advanced RegEx features (not every engine supports lookaround), you can use character sets as per the following observations:
If the first letter is not a C or P, the second letter may be any lowercase letter;
If the first letter is a C, the second letter may not be an r;
If the first letter is a P, the second letter may not be a b.
Using character sets, this gives us:
[ABD-OQ-Z][a-z]
C[a-qs-z]
P[ac-z]
Operator precedence works as expected here: concatenation (implicit) has higher precedence than alternation (|). This makes the RegEx [ABD-OQ-Z][a-z]|C[a-qs-z]|P[ac-z]. Wrapping this in word boundaries using a group gives us \b([ABD-OQ-Z][a-z]|C[a-qs-z]|P[ac-z])\b.
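As a sanity check, the character-set version can be run in Python as well. The sample string below is an assumption for illustration; it mixes the two excluded symbols with a few others that start with C or P.

```python
import re

# Lookaround-free pattern built from the three character-set cases.
portable = re.compile(r'\b([ABD-OQ-Z][a-z]|C[a-qs-z]|P[ac-z])\b')

# Illustrative input, not from the question.
text = "Vc Am Cr Pb Ca Pt Cu"

# findall returns the contents of the single capture group.
found = portable.findall(text)
print(found)
```

Cr and Pb are skipped, while Ca, Cu and Pt still match because only the specific second letters r (after C) and b (after P) are excluded.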
You might write the pattern without using the superfluous capture groups, and exclude matching Cr or Pb:
\(?(?<![A-Za-z0-9])(?!Cr\b|Pb\b)[A-Z][a-z](?![A-Za-z0-9])\)?
See a regex demo for the matches.
If you are not interested in matching the parentheses, and you also do not want to allow an underscore along with the letters or numbers, you can use a word boundary instead:
\b(?!Cr\b|Pb\b)[A-Z][a-z]\b
Explanation
\b A word boundary to prevent a partial word match
(?! Negative lookahead
Cr\b|Pb\b Match either Cr or Pb
) Close the lookahead
[A-Z][a-z] Match a single uppercase and single lowercase char
\b A word boundary
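If the exclusion list is longer or changes often, the lookahead can be assembled from a Python list. This is only a sketch: the variable names and the extended sample text are assumptions for illustration.

```python
import re

# Hypothetical exclusion list; extend as needed.
excluded = ["Cr", "Pb"]

# Escape each term and join them into one alternation for the lookahead.
alternatives = "|".join(re.escape(term) for term in excluded)
pattern = re.compile(rf'\b(?!(?:{alternatives})\b)[A-Z][a-z]\b')

# Sample text from the question, extended with a parenthesised symbol.
text = "this is Vc and not Cr nor Pb, but (Am) counts"

found = pattern.findall(text)
print(found)
```

The inner group is non-capturing, so findall returns the whole match rather than the group contents.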
I have a problem where I want to match any number of German words inside square brackets, ignoring case. The expression should only match spaces and words, nothing else, i.e. no punctuation marks or parentheses.
E.g :
The expression ['über das thema schreibt'] should be matched with ['Über', 'das', 'Thema', 'schreibt']
I have one list with items of the former order and another with the latter order, as long as the words are same, they both should match.
The code I tried with is -
regex = re.findall('[(a-zA-Z_äöüÄÖÜß\s+)]', str(term))
or
re.findall('[(\S\s+)]', str(term))
But they are not working. Kindly help me find a solution
In the simplest form, \w+ works for finding words (on Python 2 this needs the Unicode flag for non-ASCII characters; on Python 3, str patterns are Unicode-aware by default), but since you want them to be within the square brackets (and quotes, I assume), you'd need something a bit more complex:
\[(['\"])((\w+\s?)+)\1\]
\[ and \] are used to match the square brackets
['\"] matches either quote and the \1 makes sure the same quote is on the other end
\w+ captures 1 word. The \s? is for an optional space.
The whole string is in the second group which you can split to get the list
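import re
text = "['über das thema schreibt']"
# Raw string, so \1 reaches the regex engine as a backreference
regex = re.compile(r"\[(['\"])((\w+\s?)+)\1\]", flags=re.U)
match = regex.match(text)
if match:
    print(match.group(2).split())
(Use a raw string here: in a normal string literal, "\1" is the control character \x01 rather than a backreference, which is likely why \1 did not seem to work in the terminal for me.)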
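I found the easiest solution to it:
# Wrapped in a function (the name is illustrative) so that return is valid
def lists_match(list1, list2):
    for a, b in zip(list1, list2):
        reg_a = re.findall(r'[(\w\s+)]', str(a).lower())
        reg_b = re.findall(r'[(\w\s+)]', str(b).lower())
        # Stop at the first pair that differs instead of returning after one pair
        if reg_a != reg_b:
            return False
    return True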
Updated based on comments to match each word. This simply ignores spaces, single quotes and square braces
import re
text = "['über das thema schreibt']"
re.findall("([a-zA-Z_äöüÄÖÜß]+)", str(text))
# ['über', 'das', 'thema', 'schreibt']
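Building on the findall approach, here is a rough sketch of the comparison the question asks for: two items match when they contain the same words in the same order, ignoring case. The helper name words is hypothetical, and [^\W\d_]+ is used as a shorthand for "one or more Unicode letters" so the umlauts do not need to be listed by hand.

```python
import re

def words(term):
    """Return the words in term, case-folded, ignoring quotes and brackets."""
    # [^\W\d_] = word characters minus digits and underscore, i.e. letters.
    return [w.casefold() for w in re.findall(r"[^\W\d_]+", str(term))]

first = ['über das thema schreibt']
second = ['Über', 'das', 'Thema', 'schreibt']

same = words(first) == words(second)
print(words(first))
print(same)
```

casefold() is used instead of lower() because it also normalises ß to ss, which may or may not be what you want for German text.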
If you are solving a case-sensitivity issue, add the regex flag re.IGNORECASE,
like
re.findall('[(\S\s+)]', str(term),re.IGNORECASE)
If that does not help, you might need to consider converting the strings to Unicode.
Would like to find the following pattern in a string:
word-word-word++ or -word-word-word++
So that it iterates the -word or word- pattern until the end of the substring.
The string is quite large and contains many words with those patterns.
The following has been tried:
p = re.compile('(?:\w+\-)*\w+\s+=', re.IGNORECASE)
result = p.match(data)
but it returns None. Does anyone know the answer?
Your regex will only match the first pattern, match() will only find one occurrence, and that only if it is immediately followed by some whitespace and an equals sign.
Also, in your example you implied you wanted three or more words, so here's a version that was changed in the following ways:
match both patterns (note the leading -?)
match only if there are at least three words to the pattern ({2,} instead of +)
match even if there's nothing after the pattern (the \b matches a word boundary. It is not really necessary here, since the preceding \w+ guarantees we are at a word boundary anyway)
returns all matches instead of only the first one.
Here's the code:
#!/usr/bin/python
import re
data=r"foo-bar-baz not-this -this-neither nope double-dash--so-nope -yeah-this-even-at-end-of-string"
p = re.compile(r'-?(?:\w+-){2,}\w+\b', re.IGNORECASE)
print(p.findall(data))
# prints ['foo-bar-baz', '-yeah-this-even-at-end-of-string']
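To illustrate why the original attempt returned None: match() only tries the pattern at the very start of the string, while search() scans forward and findall() collects every hit. The sample strings below are assumptions, since the real data was not shown.

```python
import re

# Pattern from the question, written as a raw string.
p = re.compile(r'(?:\w+-)*\w+\s+=')

# Illustrative input: the hyphenated name is not at the start.
line = "see foo-bar-baz = 2"

at_start = p.match(line)    # None: "see " is not followed by "="
anywhere = p.search(line)   # finds the first occurrence anywhere
every = p.findall("a-b = 1, c-d-e = 2")

print(at_start)
print(anywhere.group())
print(every)
```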
Can someone help me with this kind of regular expression matching?
For example, I'm searching through list containing different strings with a letter iterating at the end of the string:
MonsterA
MonsterB
MonsterC
HeroA
HeroB
HeroC
...
What I need this script to return is only the preceding part of the string, in this example Monster and Hero.
If you absolutely need a regex:
re.match(r"(.*)[A-Z]", word).group(1)
But it is not the most efficient if you just want to remove the last character.
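For comparison, a sketch of the non-regex route next to an anchored regex. It assumes every name ends in exactly one suffix letter, as in the list from the question.

```python
import re

names = ["MonsterA", "MonsterB", "HeroC"]

# Plain slicing: drop the last character, whatever it is.
sliced = [name[:-1] for name in names]

# Regex alternative: remove one uppercase letter only at the end.
stripped = [re.sub(r"[A-Z]$", "", name) for name in names]

# Collapse duplicates to get the distinct base names.
unique = sorted(set(sliced))

print(sliced)
print(unique)
```

The re.sub variant leaves a name untouched if it does not end in an uppercase letter, whereas slicing always removes the last character.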
You could use a positive lookahead assertion (?=...) to check that the word ends in a single uppercase character, and then use word boundaries \b...\b to ensure it does not match patterns that aren't whole words:
>>> text = "This re will match MonsterA and HeroB but not heroC or MonsterCC"
>>> re.findall(r"\b[A-Z][a-z]+(?=[A-Z]\b)", text)
['Monster', 'Hero']
re.findall returns all such matches in a list.
I'm having trouble matching a string with regexp (I'm not that experienced with regexp). I have a string which contains a forward slash after each word and a tag. An example:
led/O by/O Timothy/PERSON R./PERSON Geithner/PERSON ,/O the/O president/O of/O the/O New/ORGANIZATION
In those strings, I am only interested in all strings that precede /PERSON. Here's the regexp pattern that I came up with:
(\w)*\/PERSON
And my code:
match = re.findall(r'(\w)*\/PERSON', string)
Basically, I am matching any word that comes before /PERSON. The output:
>>> reg
['Timothy', '', 'Geithner']
My problem is that the second match, matched to an empty string as for R./PERSON, the dot is not a word character. I changed my regexp to:
match = re.findall(r'(\w|.*?)\/PERSON', string)
But the match now is:
['led/O by/O Timothy', ' R.', ' Geithner']
It is taking everything prior to the first /PERSON which includes led/O by/O instead of just matching Timothy. Could someone please help me on how to do this matching, while including a full stop as an abbreviation? Or at least, not have an empty string match?
Thanks,
Match everything but a space character ([^ ]*). You also need the star (*) inside the capture:
match = re.findall(r'([^ ]*)\/PERSON', string)
Firstly, (\w|.) matches "a word character, or any character" (dot matches any character which is why you're getting those spaces).
Escaping this with a backslash will do the trick: (\w|\.)
Second, as @Ionut Hulub points out, you may want to use + instead of * to ensure you match something. Quantifiers are greedy by default, so the engine will always try to match as much as it can before the slash.
If you want to match any non-whitespace character you can use \S instead of (\w|\.), which may actually be what you want.
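Both suggestions can be tried on the sample string from the question. Note that the quantifier sits inside the capture group (with a non-capturing inner group for the alternation), so the whole token is captured rather than just its last character.

```python
import re

string = ("led/O by/O Timothy/PERSON R./PERSON Geithner/PERSON ,/O "
          "the/O president/O of/O the/O New/ORGANIZATION")

# Word characters or a literal dot, one or more times, before /PERSON.
with_dot = re.findall(r'((?:\w|\.)+)/PERSON', string)

# Any run of non-whitespace characters before /PERSON.
non_space = re.findall(r'(\S+)/PERSON', string)

print(with_dot)
print(non_space)
```

Using + instead of * is what prevents an empty-string match for a token the character class cannot cover.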