Missing something in the regex? - python

I'm trying to use this regex
art\..*[A-Z].*\s
to extract the text in bold here
some text bla art. 100 of Important_text other text bla
Basically, I would like to extract all the text that follows this pattern:
*art.* *number* *whatever* *first word that starts in uppercase*
But it's not working as expected. Any suggestion?

With your shown samples, please try following.
\bart\..*?\d+.*?[A-Z]\w*
Explanation:
\b ##Word boundary, so that art is not matched in the middle of a longer word such as part.
art\. ##Match the word art followed by a literal dot.
.*?\d+ ##Lazily match any characters up to the first run of 1 or more digits.
.*?[A-Z]\w* ##Lazily match any characters up to the first capital letter, then the word characters that follow it.
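As a quick check of the explanation above, here is a minimal Python sketch that runs both the pattern from the question and the lazy pattern from this answer against the sample sentence. The variable names are chosen here for illustration and are not part of the original posts.

```python
import re

# Sample sentence taken from the question.
text = "some text bla art. 100 of Important_text other text bla"

# Pattern from the question: both .* are greedy and the trailing \s forces
# the match to run on to the last whitespace character it can reach.
greedy_pattern = re.compile(r"art\..*[A-Z].*\s")

# Pattern from the answer: lazy quantifiers stop at the first digits and
# then at the first capitalised word.
lazy_pattern = re.compile(r"\bart\..*?\d+.*?[A-Z]\w*")

greedy_result = greedy_pattern.search(text).group(0)
lazy_result = lazy_pattern.search(text).group(0)

print(repr(greedy_result))  # overshoots past the capitalised word
print(repr(lazy_result))    # stops right after the capitalised word
```

The greedy version overshoots because `.*` consumes as much as possible before backtracking, which is why the lazy `.*?` form is used in the answer.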

You can match art. then match until the first digits and then match until the first occurrence of an uppercase char.
\bart\.\D*\d+[^A-Z]*[A-Z]\S*
The pattern matches
\bart\. Match art. preceded by a word boundary
\D*\d+ Match 0+ times a non digit, followed by 1+ digits
[^A-Z]* Match 0+ times any char except A-Z
[A-Z]\S* Match a char A-Z followed by optional non whitespace chars.
If the word has to start with A-Z you can assert a whitespace boundary to the left using (?<!\S) before matching an uppercase char A-Z.
\bart\.\D*\d+[^A-Z]*(?<!\S)[A-Z]\S*
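The difference between the two patterns in this answer can be sketched in Python as follows. The second sample string ("eBook") is a made-up input used only to show the effect of the (?<!\S) assertion, and the helper name `first_match` is hypothetical.

```python
import re

# Both patterns are copied from the answer above.
plain = re.compile(r"\bart\.\D*\d+[^A-Z]*[A-Z]\S*")
word_start = re.compile(r"\bart\.\D*\d+[^A-Z]*(?<!\S)[A-Z]\S*")


def first_match(pattern, text):
    """Return the matched text, or None when the pattern does not match."""
    m = pattern.search(text)
    return m.group(0) if m else None


question_sample = "some text bla art. 100 of Important_text other text bla"
# Invented sample: the first uppercase char sits in the middle of a word.
mid_word_sample = "see art. 5 of eBook Reader"

print(first_match(plain, question_sample))
print(first_match(word_start, question_sample))
print(first_match(plain, mid_word_sample))       # accepts the mid-word capital
print(first_match(word_start, mid_word_sample))  # None, see note below
```

Note that with the lookbehind the second sample does not match at all: [^A-Z]* cannot skip past the B in eBook, so the pattern does not move on to the next capitalised word. Whether that is the desired behaviour depends on the data.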

Related

word boundary \b doesn't work on string with dot in Python regex [duplicate]

For example: George R.R. Martin
I want to match only George and Martin.
I have tried: \w+\b. But it doesn't work!
The pattern \w+\b. matches 1+ word chars followed by a word boundary and then any single char, which must be a non-word char because of the preceding \b. Nothing in this pattern is negated, and it also misses an important point: a literal dot in a regex pattern must be escaped.
You may use a negative lookahead (?!\.):
var s = "George R.R. Martin";
console.log(s.match(/\b\w+\b(?!\.)/g));
Details:
\b - leading word boundary
\w+ - 1+ word chars
\b - trailing word boundary
(?!\.) - there must be no . after the last word char matched.
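Since the question is about Python while the snippet above is JavaScript, here is a rough Python equivalent using re.findall. It is only a sketch; the variable names are chosen here for illustration.

```python
import re

s = "George R.R. Martin"

# Same pattern as in the JavaScript snippet: a whole word that is not
# immediately followed by a literal dot.
names = re.findall(r"\b\w+\b(?!\.)", s)

print(names)
```

re.findall returns every non-overlapping match, which plays the role of the g flag in the JavaScript version.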

Email Validation using Python Regular Expression

I have 9 email patterns. I expect:
myname#domainemail.com
my.name#domainemail.com
my.name1#domainemail.com
my_name.1#domainemail.com
are valid emails.
and
my-name#domainemail.com
my.name.1#domainemail.com
domainname.1#domainemail.com
1myname#domainemail.com
1.myname#domainemail.com
are not valid emails.
Then, I have made script of regex like:
regex = r"(^[a-zA-Z_]+[\.]?[a-z0-9]+)#([\w.]+\.[\w.]+)$"
But, email domainname.1#domainemail.com is still valid.
How to make the right pattern regex so that email become not valid, and all of email patterns can fit to my expectation?
For the example data, the optional part before the # can take one of two forms: either one or more underscore-separated parts followed by a dot and letters or digits, or a dot followed by a part that starts with a letter a-z.
^[a-zA-Z]+(?:(?:_[a-zA-Z0-9]+)+\.[A-Za-z0-9]+|\.[a-zA-Z][a-zA-Z0-9]*)?#(?:[a-zA-Z0-9]+\.)*[a-zA-Z0-9]{2,}$
Explanation
^ Start of string
[a-zA-Z]+ Match 1+ times a char a-z or A-Z
(?: Non capture group
(?:_[a-zA-Z0-9]+)+ Repeat 1+ times an underscore followed by 1+ chars a-z, A-Z or digits 0-9
\.[A-Za-z0-9]+ Match a dot and 1+ chars a-z, A-Z or digits 0-9
| Or
\.[a-zA-Z][a-zA-Z0-9]* Match a dot, a single char a-z or A-Z, and 0+ chars a-z, A-Z or digits
)? Close group and make it optional
# Match literally
(?:[a-zA-Z0-9]+\.)* Repeat 0+ times 1+ chars a-z, A-Z or digits followed by a dot
[a-zA-Z0-9]{2,} Match a char a-z, A-Z or a digit 2 or more times
$ End of string
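Use the following regex pattern with the gmi flags (in Python, g has no equivalent flag, while m and i correspond to re.MULTILINE and re.IGNORECASE):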
^[a-z]+(?:(?:\.[a-z]+)+\d*|(?:_[a-z]+)+(?:\.\d+)?)?#(?!.*\.\.)[^\W_][a-z\d.]+[a-z\d]{2}$
https://regex101.com/r/xoVprE/4
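To check the two suggested patterns against the nine addresses from the question, a small Python harness along these lines could be used. The names `PATTERN_A`, `PATTERN_B` and `is_valid` are introduced here for illustration, and the # separator is kept exactly as it appears in the examples above.

```python
import re

# Pattern from the first answer, copied verbatim.
PATTERN_A = re.compile(
    r"^[a-zA-Z]+(?:(?:_[a-zA-Z0-9]+)+\.[A-Za-z0-9]+|\.[a-zA-Z][a-zA-Z0-9]*)?"
    r"#(?:[a-zA-Z0-9]+\.)*[a-zA-Z0-9]{2,}$"
)

# Pattern from the second answer; the m and i flags map to re.M and re.I.
PATTERN_B = re.compile(
    r"^[a-z]+(?:(?:\.[a-z]+)+\d*|(?:_[a-z]+)+(?:\.\d+)?)?"
    r"#(?!.*\.\.)[^\W_][a-z\d.]+[a-z\d]{2}$",
    re.MULTILINE | re.IGNORECASE,
)

expected_valid = [
    "myname#domainemail.com",
    "my.name#domainemail.com",
    "my.name1#domainemail.com",
    "my_name.1#domainemail.com",
]
expected_invalid = [
    "my-name#domainemail.com",
    "my.name.1#domainemail.com",
    "domainname.1#domainemail.com",
    "1myname#domainemail.com",
    "1.myname#domainemail.com",
]


def is_valid(pattern, address):
    """True when the whole address matches the given compiled pattern."""
    return pattern.match(address) is not None


for address in expected_valid + expected_invalid:
    print(address, is_valid(PATTERN_A, address), is_valid(PATTERN_B, address))
```

Both patterns are tailored to these nine examples; they are not general-purpose email validators.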

Regular Expression for a string contains if characters all in capital python

I'm extracting the paragraphs of text that follow headings like "OBSERVATION #1" or "OBSERVATION #2" in the output of a library like PyPDF2.
However, the extraction can introduce errors, so a heading may look like "OBSERVA'TION #2", and I have to avoid matching things like "Suite #300". So the rule is: "IF THERE ARE LETTERS, THEY ARE ALL CAPITALS".
Currently the Python code snippet looks like this:
inspection_observation=pdfFile.getPage(z).extractText()
if 'OBSERVATION' in inspection_observation:
    for finding in re.findall(r"[OBSERVATION] #\d+(.*?) OBSERVA'TION #\d?", inspection_observation, re.DOTALL):
        #print inspection_observation;
        print finding;
Please advise on the appropriate regular expression for this case.
If the word must contain at least one capital letter and may also contain a ', you could use a character class that lists the allowed characters, together with a positive lookahead that asserts at least one capital is present.
Then you can capture the content between those capital words and use a positive lookahead to check that what follows is either another capital word followed by # and 1+ digits, or the end of the string. This regex relies on re.DOTALL so that the dot also matches a newline.
(?=[A-Z']*[A-Z])[A-Z']+\s+#\d+(.*?(?=[A-Z']*[A-Z][A-Z']*\s+#\d+|$))
Explanation
(?=[A-Z']*[A-Z]) Positive lookahead to assert that what follows contains at least one char A-Z, optionally preceded by chars A-Z or '
[A-Z']+\s+#\d+ Match 1+ times A-Z or ', then 1+ whitespace chars, # and 1+ digits
( Capture group
.*? Match any character, as few times as possible
(?= Positive lookahead to assert what follows is
[A-Z']*[A-Z][A-Z']* Match an uppercase char A-Z where A-Z or ' can occur before and after
\s+#\d+ Match 1+ whitespace chars, # and 1+ digits
|$ Or assert the end of the string
) Close the lookahead
) Close capture group
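A minimal Python sketch of how this pattern could be applied with re.findall is shown below. The sample text is invented for illustration (it is not real PyPDF2 output), and the names `HEADING_PATTERN`, `page_text` and `findings` are assumptions made here.

```python
import re

# Pattern from the answer above, compiled with re.DOTALL so that "." also
# matches newlines inside a finding.
HEADING_PATTERN = re.compile(
    r"(?=[A-Z']*[A-Z])[A-Z']+\s+#\d+(.*?(?=[A-Z']*[A-Z][A-Z']*\s+#\d+|$))",
    re.DOTALL,
)

# Invented stand-in for the text extracted from one PDF page.
page_text = (
    "OBSERVATION #1 The floor\nwas dirty.\n"
    "OBSERVA'TION #2 Located at Suite #300. Records missing."
)

# findall returns the capture group, i.e. the text after each heading;
# strip() removes the surrounding whitespace left by the lazy match.
findings = [body.strip() for body in HEADING_PATTERN.findall(page_text)]

for finding in findings:
    print(repr(finding))
```

In this sample "Suite #300" is not treated as a heading, because "Suite" contains lowercase letters before the whitespace and #.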

Regex to match preceding word

I'm attempting to extract the word 'Here', because it starts with a capital letter and occurs before the word 'now'.
Here is my attempt, based on the regex from the question "regex match preceding word but not word itself":
import re
sentence = "this is now test Here now tester"
print(re.compile('\w+(?= +now\b)').match(sentence))
None is printed in the above example.
Have I implemented the regex correctly?
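Your attempt prints None because re.match only looks for a match at the start of the string, and the first word 'this' is not followed by 'now'. Use re.search instead, and require a leading capital letter so that 'is' is not picked up. The following works for the given example: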
Regex:
re.search(r'\b[A-Z][a-z]+(?= now)', sentence).group()
Output:
'Here'
Explanation:
\b imposes word boundary
[A-Z] requires that word begins with capital letter
[a-z]+ followed by 1 or more lowercase letters (modify as necessary)
(?= now) positive look-ahead assertion that a space and now follow, without including them in the match
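The behaviour of the attempt from the question and of the pattern from this answer can be compared with a short Python sketch. The variable names are chosen here for illustration.

```python
import re

sentence = "this is now test Here now tester"

# re.match is anchored at the start of the string, so the pattern from the
# question finds nothing there.
anchored = re.match(r"\w+(?= +now\b)", sentence)

# re.search scans the whole string, but without the capital-letter
# requirement it stops at the first word that precedes "now".
first_word = re.search(r"\w+(?= +now\b)", sentence).group()

# Pattern from the answer: a capitalised word directly followed by " now".
capitalised = re.search(r"\b[A-Z][a-z]+(?= now)", sentence).group()

print(anchored)
print(first_word)
print(capitalised)
```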

extract string using regular expression

fix_release='Ubuntu 16.04 LTS'
p = re.compile(r'(Ubuntu)\b(\d+[.]\d+)\b')
fix_release = p.search(fix_release)
logger.info(fix_release) #fix_release is None
I want to extract the string 'Ubuntu 16.04'.
But the result is None. How can I extract the correct string?
You confused the word boundary \b with whitespace. \b matches the position between a word character and a non-word character and consumes no characters, so it cannot match the space between Ubuntu and 16.04. You can simply use r'Ubuntu \d+\.\d+' for your case:
fix_release='Ubuntu 16.04 LTS'
p = re.compile(r'Ubuntu \d+\.\d+')
p.search(fix_release).group(0)
# 'Ubuntu 16.04'
Try this Regex:
Ubuntu\s*\d+(?:\.\d+)?
Explanation:
Ubuntu - matches Ubuntu literally
\s* - matches 0+ occurrences of a white-space, as many as possible
\d+ - matches 1+ digits, as many as possible
(?:\.\d+)? - matches a . followed by 1+ digits, as many as possible. A ? at the end makes this part optional.
Note: In your regex, you are using \b for the spaces. \b returns 0-length matches between a word character and a non-word character. You can use \s instead.
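The point about \b versus \s can be checked with a few lines of Python. This is only a sketch; the second input string ("Ubuntu 18") is an invented example showing the optional minor version, and the variable names are chosen here.

```python
import re

fix_release = "Ubuntu 16.04 LTS"

# Pattern from the question: \b consumes nothing, so \d+ is tried directly
# against the space after "Ubuntu" and the search fails.
with_boundary = re.search(r"(Ubuntu)\b(\d+[.]\d+)\b", fix_release)

# Pattern from this answer: \s* consumes the whitespace.
with_whitespace = re.search(r"Ubuntu\s*\d+(?:\.\d+)?", fix_release)

# Invented input without a minor version, to show the optional group.
major_only = re.search(r"Ubuntu\s*\d+(?:\.\d+)?", "Ubuntu 18 server")

print(with_boundary)
print(with_whitespace.group(0))
print(major_only.group(0))
```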
