Regex to match preceding word - python

I'm attempting to extract the word 'Here', because it begins with a capital letter and occurs immediately before the word 'now'.
Here is my attempt, based on the regex from "regex match preceding word but not word itself":
import re
sentence = "this is now test Here now tester"
print(re.compile('\w+(?= +now\b)').match(sentence))
None is printed in the above example.
Have I implemented the regex correctly?

The following works for the given example:
Regex:
re.search(r'\b[A-Z][a-z]+(?= now)', sentence).group()
Output:
'Here'
Explanation:
\b imposes word boundary
[A-Z] requires that word begins with capital letter
[a-z]+ followed by 1 or more lowercase letters (modify as necessary)
(?= now) positive look-ahead assertion to match now with leading whitespace
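As a rough sketch of why the original attempt prints None: re.match only tries the pattern at position 0 of the string, and in a non-raw string literal '\b' is a backspace character rather than a word boundary. The variable names below are illustrative only and are not from the question.

```python
import re

sentence = "this is now test Here now tester"

# re.match anchors at the start of the string, where "this" is not
# followed by " now", so nothing matches.
with_match = re.match(r'\w+(?= +now\b)', sentence)
print(with_match)  # None

# re.search scans the whole string, but the first word before "now" is "is".
first_hit = re.search(r'\w+(?= +now\b)', sentence).group()
print(first_hit)  # is

# Requiring a leading capital letter narrows the match to "Here".
capitalized = re.search(r'\b[A-Z][a-z]+(?= +now\b)', sentence).group()
print(capitalized)  # Here
```

Using a raw string (r'...') keeps \b as a word boundary in every pattern above.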

Related

Vowels not at the end or start of the words in string

I am trying to find the words in string not starting or ending with letters 'aıoueəiöü'. But regex fails to find words when I use this code:
txt = "Nasa has fixed a problem with malfunctioning equipment on a new rocket designed to take astronauts to the Moon."
re.findall(r"\b[^aıoueəiöü]\w+[^aıoueəiöü]\b",txt)
Instead, it works fine when whitespace character \s is added in negation part:
re.findall(r"\b[^aıoueəiöü\s]\w+[^aıoueəiöü\s]\b",txt)
I cannot understand the issue in first example of code, why should I specify whitespace characters too?
Note that [^aıoueəiöü] matches any char other than a, ı, o, u, e, ə, i, ö and ü. It can match a whitespace, a digit, punctuation, etc.
Also, your regex only matches strings of at least three chars; you need to adjust it to match one- and two-char strings, too.
You do not have to rely on excluding whitespace from the pattern. Since you only want to match word chars other than vowels, add \W rather than \s:
\b[^\Waıoueəiöü](?:\w*[^\Waıoueəiöü])?\b
See the regex demo.
Details:
\b - a word boundary
[^\Waıoueəiöü] - any word char except a letter from the aıoueəiöü set
(?:\w*[^\Waıoueəiöü])? - an optional occurrence of
\w* - any zero or more word chars
[^\Waıoueəiöü] - any word char except a letter from the aıoueəiöü set
\b - a word boundary
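A minimal sketch of the pattern above applied to the sentence from the question. The VOWELS constant and variable names are assumptions made for readability; the regex itself is the one from the answer.

```python
import re

txt = ("Nasa has fixed a problem with malfunctioning equipment on a new rocket "
       "designed to take astronauts to the Moon.")

VOWELS = "aıoueəiöü"  # helper constant, not part of the original post

# A plain negated class also matches whitespace, which is why the
# first attempt in the question misbehaves.
space_hits = re.findall(rf"[^{VOWELS}]", " ")
print(space_hits)  # [' ']

# Adding \W to the negated class restricts it to word chars only.
pattern = re.compile(rf"\b[^\W{VOWELS}](?:\w*[^\W{VOWELS}])?\b")
words = pattern.findall(txt)
print(words)
# ['has', 'fixed', 'problem', 'with', 'malfunctioning', 'new', 'rocket', 'designed', 'Moon']
```

Note that the set only lists lowercase vowels, so a word starting with an uppercase vowel would still match unless the set is extended or re.I is used.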

Missing something in the regex?

I'm trying to use this regex
art\..*[A-Z].*\s
to extract the text in bold here
some text bla art. 100 of Important_text other text bla
Basically, I would like to extract all the text that follow this pattern:
*art.* *number* *whatever* *first word that starts in uppercase*
But it's not working as expected. Any suggestion?
With your shown samples, please try the following.
\bart\..*?\d+.*?[A-Z]\w*
Explanation:
\b ##mentioning word boundary here.
art\. ##Looking for word art with a literal dot here.
.*?\d+ ##Using non-greedy approach for matching 1 or more digits.
.*?[A-Z]\w* ##Using non-greedy approach to match 1 capital letter followed by word characters.
You can match art. then match until the first digits and then match until the first occurrence of an uppercase char.
\bart\.\D*\d+[^A-Z]*[A-Z]\S*
The pattern matches
\bart\. Match art. preceded by a word boundary
\D*\d+ Match 0+ times a non digit, followed by 1+ digits
[^A-Z]* Match 0+ times any char except A-Z
[A-Z]\S* Match a char A-Z followed by optional non whitespace chars.
If the word has to start with A-Z you can assert a whitespace boundary to the left using (?<!\S) before matching an uppercase char A-Z.
\bart\.\D*\d+[^A-Z]*(?<!\S)[A-Z]\S*
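The three patterns from the answers can be compared with a short sketch like the one below. The first sample is the one from the question; the second sample ("fooBar") is a made-up input added only to show how the whitespace-boundary variant differs.

```python
import re

sample = "some text bla art. 100 of Important_text other text bla"

lazy = r'\bart\..*?\d+.*?[A-Z]\w*'
negated = r'\bart\.\D*\d+[^A-Z]*[A-Z]\S*'
strict = r'\bart\.\D*\d+[^A-Z]*(?<!\S)[A-Z]\S*'

# On the sample from the question all three agree.
results = [re.search(p, sample).group() for p in (lazy, negated, strict)]
print(results)  # three times 'art. 100 of Important_text'

# Hypothetical input where the first uppercase letter sits mid-word.
tricky = "art. 5 fooBar Baz"
loose_hit = re.search(negated, tricky).group()
strict_hit = re.search(strict, tricky)
print(loose_hit)   # art. 5 fooBar
print(strict_hit)  # None, because [^A-Z]* cannot skip past the 'B'
```

So the (?<!\S) variant is stricter: it rejects an uppercase letter in the middle of a word instead of moving on to the next capitalized word.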

A word starting with t but ends with other than e

I am trying to create a regex that matches words starting with t or T and not ending with the letter e. I tried the code below so far, but it's not giving me the desired result. Could anyone show me what exactly is missing here?
my_str = my_file.read()
word = re.findall("[tT].*[^e]$", my_str)
print(word)
You can use
\bt(?:[a-z]*[a-df-z])?\b
\bt[a-z]*\b(?<!e)
Just for completeness, here is a regex to match any word starting with a Cyrillic т and not ending with a Cyrillic е:
\bт[^\W\d_]*\b(?<!е)
See the regex demo #1, regex demo #2 and a Cyrillic regex demo.
If you need a case insensitive matching, add re.I:
re.findall(r'\bt(?:[a-z]*[a-df-z])?\b', text, re.I)
And a note on word boundaries: if the words can be glued to _ or digits, use letter boundaries rather than word boundaries:
r'(?<![a-z])t(?:[a-z]*[a-df-z])?(?![a-z])'
r'(?<![^\W\d_])т[^\W\d_]*(?![^\W\d_])(?<!е)' # Unicode letter boundaries
Regex details
\b - word boundary (start of string or a position immediately after a char other than a digit, letter, underscore)
(?<![a-z]) ((?<![^\W\d_]) is a Unicode aware equivalent) - a negative lookbehind that matches a location that is not immediately preceded with a letter
t - a t letter
(?:[a-z]*[a-df-z])? - an optional non-capturing group matching 0 or more letters and then a letter other than e
\b - word boundary
(?![a-z]) ((?![^\W\d_]) is a Unicode aware equivalent) - a negative lookahead that matches a location that is not immediately followed with a letter.
Also,
\bt[a-z]*\b(?<!e) matches a word boundary, t, any zero or more lowercase ASCII letters (any ASCII letters with re.I), then a word boundary marks the end of a word and the negative lookbehind (?<!e) fails the match if there is e at the end of the word
[^\W\d_]* - matches zero or more Unicode letters.
See a Python demo:
import re
text = r't, train => main,teene!'
cyr_text = r'таня тане работе'
print( re.findall(r'\bt(?:[a-z]*[a-df-z])?\b', text, re.I) )
# => ['t', 'train']
print( re.findall(r'\bt[a-z]*\b(?<!e)', text, re.I) )
# => ['t', 'train']
print( re.findall(r'\bт[^\W\d_]*\b(?<!е)', cyr_text, re.I) )
# => ['таня']
print( re.findall(r'(?<![^\W\d_])т[^\W\d_]*(?![^\W\d_])(?<!е)', cyr_text, re.I) )
# => ['таня']
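To illustrate the note on letter boundaries, here is a small sketch comparing the \b version with the lookaround version. The sample string is invented for the example and is not from the original post.

```python
import re

# Hypothetical input: "train" is glued to an underscore and a digit.
glued = "item_train2 tune"

word_bounded = re.findall(r'\bt(?:[a-z]*[a-df-z])?\b', glued, re.I)
letter_bounded = re.findall(r'(?<![a-z])t(?:[a-z]*[a-df-z])?(?![a-z])', glued, re.I)

print(word_bounded)    # [] because _ and digits count as word chars
print(letter_bounded)  # ['train']
```

"tune" is rejected by both patterns because it ends with e, and the t inside "item" is rejected because a letter precedes it.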
There is also another way of doing it:
re.findall(r"\b[Tt]+[a-zA-Z]*[^Ee\s]\b", my_str)
Maybe:
[\W]([Tt]\w*[^e])[\W]
Any non word character followed by (capture: Tt, some optional word characters, not e) followed by first non word character

Matching an apostrophe only within a word or string

I'm looking for a Python regex that can match 'didn't' and returns only the character that is immediately preceded by an apostrophe, like 't, but not the 'd or t' at the beginning and end.
I have tried (?=.*\w)^(\w|')+$ but it only matches the apostrophe at the beginning.
Some more examples:
'I'm' should only match 'm and not 'I
'Erick's' should only return 's and not 'E
The text will always start and end with an apostrophe and can include apostrophes within the text.
To match an apostrophe inside a whole string, match it anywhere but at the start/end of the string:
(?!^)'(?!$)
See the regex demo.
Often, the apostrophe is searched for only inside a word (in fact, a pair of words where the second one is shortened); then you may use
\b'\b
See this regex demo. Here, the ' is preceded and followed with a word boundary, so the ' must have a letter, digit or _ char on each side. Yes, _ char and digits are allowed to be on both sides.
If you need to match a ' only between two letters, use
(?<=[A-Za-z])'(?=[A-Za-z]) # ASCII only
(?<=[^\W\d_])'(?=[^\W\d_]) # Any Unicode letters
See this regex demo.
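The difference between these variants can be checked with a short sketch that reports the index of each matched apostrophe. The helper function and the second sample string are assumptions for illustration only.

```python
import re

def apostrophe_positions(pattern, text):
    """Return the start index of every match of pattern in text."""
    return [m.start() for m in re.finditer(pattern, text)]

quoted = "'didn't'"        # sample from the question
mixed = "it's 5'3 a_'b"    # hypothetical sample with digits and underscore

# Anywhere except the very start or end of the string.
inside_string = apostrophe_positions(r"(?!^)'(?!$)", quoted)
print(inside_string)  # [5]

# Between two word chars: letters, digits and _ all qualify.
between_word_chars = apostrophe_positions(r"\b'\b", mixed)
print(between_word_chars)  # [2, 6, 11]

# Between two letters only.
between_ascii = apostrophe_positions(r"(?<=[A-Za-z])'(?=[A-Za-z])", mixed)
between_unicode = apostrophe_positions(r"(?<=[^\W\d_])'(?=[^\W\d_])", mixed)
print(between_ascii, between_unicode)  # [2] [2]
```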
As for this current question, here is a bunch of possible solutions:
import re
s = "'didn't'"
print(s.strip("'")[s.strip("'").find("'")+1])
print(re.search(r'\b\'(\w)', s).group(1))
print(re.search(r'\b\'([^\W\d_])', s).group(1))
print(re.search(r'\b\'([a-z])', s, flags=re.I).group(1))
print(re.findall(r'\b\'([a-z])', "'didn't know I'm a student'", flags=re.I))
The s.strip("'")[s.strip("'").find("'")+1] gets the character after the first ' after stripping the leading/trailing apostrophes.
The re.search(r'\b\'(\w)', s).group(1) solution gets the word (i.e. [a-zA-Z0-9_], can be adjusted from here) char after a ' that is preceded with a word char (due to the \b word boundary).
The re.search(r'\b\'([^\W\d_])', s).group(1) is almost identical to the above solution, it only fetches a letter character as [^\W\d_] matches any char other than a non-word, digit and _.
Note that the re.search(r'\b\'([a-z])', s, flags=re.I).group(1) solution is next to identical to the above one, but you cannot make it Unicode aware with re.UNICODE.
The last re.findall(r'\b\'([a-z])', "'didn't know I'm a student'", flags=re.I) just shows how to fetch multiple letter chars from a string input.

Regex to match a string with 2 capital letters only

I want to write a regex which will match a string only if the string consists of two capital letters.
I tried [A-Z]{2}, [A-Z]{2, 2} and [A-Z][A-Z], but these also match a string like 'CAS', while I am looking to match only if the string is exactly two capital letters, like 'CA'.
You could use anchors:
^[A-Z]{2}$
^ matches the beginning of the string, while $ matches its end.
Note in your attempts, you used [A-Z]{2, 2} which should actually be [A-Z]{2,2} (without space) to mean the same thing as the others.
You need to add word boundaries,
\b[A-Z]{2}\b
Explanation:
\b Matches between a word character and a non-word character.
[A-Z]{2} Matches exactly two capital letters.
\b Matches between a word character and a non-word character.
You could try:
\b[A-Z]{2}\b
\b matches a word boundary.
Try:
^[A-Z][A-Z]$
Just added start and end points for the string.
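A brief sketch contrasting the two approaches from the answers: anchors validate the whole string, while word boundaries find two-letter uppercase words inside longer text. The sample sentence is invented, and re.fullmatch is shown as an alternative that the answers above do not mention.

```python
import re

anchored = re.compile(r'^[A-Z]{2}$')
bounded = re.compile(r'\b[A-Z]{2}\b')

# Anchors: the entire string must be two capital letters.
whole_ok = bool(anchored.search('CA'))
whole_bad = bool(anchored.search('CAS'))
print(whole_ok, whole_bad)  # True False

# Word boundaries: pick out two-letter uppercase words in a sentence.
found = bounded.findall('CA is in the USA, not CAS')
print(found)  # ['CA']

# Caveat: $ also matches just before a trailing newline, whereas
# re.fullmatch requires the pattern to cover the whole string.
newline_search = bool(anchored.search('CA\n'))
newline_full = re.fullmatch(r'[A-Z]{2}', 'CA\n')
print(newline_search, newline_full)  # True None
```

Which one fits depends on whether the input is a single value to validate or free text to scan.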
