extract string using regular expression - python

import re
fix_release = 'Ubuntu 16.04 LTS'
p = re.compile(r'(Ubuntu)\b(\d+[.]\d+)\b')
fix_release = p.search(fix_release)
logger.info(fix_release)  # fix_release is None
I want to extract the string 'Ubuntu 16.04', but the result is None. How can I extract the correct substring?

You confused the word boundary \b with whitespace. \b matches the boundary between a word character and a non-word character and consumes zero characters, so it cannot match the space itself. You can simply use r'Ubuntu \d+\.\d+' for your case:
import re
fix_release = 'Ubuntu 16.04 LTS'
p = re.compile(r'Ubuntu \d+\.\d+')
p.search(fix_release).group(0)
# 'Ubuntu 16.04'
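To see why \b cannot stand in for a space, here is a small check (a sketch; the variable names are only illustrative) that lists where \b matches in the sample string and confirms that each match is empty:

```python
import re

sample = 'Ubuntu 16.04'

# Collect the position and the matched text of every \b match.
boundaries = [(m.start(), m.group(0)) for m in re.finditer(r'\b', sample)]
positions = [pos for pos, _ in boundaries]
print(positions)  # [0, 6, 7, 9, 10, 12]

# The pattern from the question fails: after "Ubuntu" the next character
# is a space, and \b consumes nothing, so \d+ never gets to see a digit.
original = re.search(r'(Ubuntu)\b(\d+[.]\d+)\b', 'Ubuntu 16.04 LTS')
print(original)  # None
```

Every matched text in boundaries is the empty string, which is what "consumes zero characters" means in practice.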

Try this Regex:
Ubuntu\s*\d+(?:\.\d+)?
Explanation:
Ubuntu - matches Ubuntu literally
\s* - matches 0+ occurrences of a white-space, as many as possible
\d+ - matches 1+ digits, as many as possible
(?:\.\d+)? - matches a . followed by 1+ digits, as many as possible. A ? at the end makes this part optional.
Note: Your regex uses \b where the spaces are. \b is a zero-length match at the boundary between a word character and a non-word character, so it never consumes whitespace. Use \s instead.
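In Python this pattern could be wrapped as follows. This is a minimal sketch; the helper name extract_release is made up for illustration and is not part of the question:

```python
import re

# \s* tolerates zero or more spaces; the minor version part is optional.
RELEASE_RE = re.compile(r'Ubuntu\s*\d+(?:\.\d+)?')

def extract_release(text):
    """Return the 'Ubuntu <version>' substring, or None if there is none."""
    match = RELEASE_RE.search(text)
    return match.group(0) if match else None

print(extract_release('Ubuntu 16.04 LTS'))  # Ubuntu 16.04
print(extract_release('Ubuntu16 LTS'))      # Ubuntu16
print(extract_release('Debian 9'))          # None
```

Checking the match object before calling group() avoids an AttributeError when the input contains no release string.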


Regex python ignore word followed by given character

I have the regex (?<=^|(?<=[^a-zA-Z0-9-_\.]))#([A-Za-z]+[A-Za-z0-9-_]+)(?!\w).
Given the string #first#nope #second#Hello #my-friend, email# whats.up#example.com #friend, what can I do to exclude the strings #first and #second, since they are not whole words on their own?
In other words, exclude them because they are immediately followed by #.
You can use
(?<![a-zA-Z0-9_.-])#(?=([A-Za-z]+[A-Za-z0-9_-]*))\1(?![#\w])
(?a)(?<![\w.-])#(?=([A-Za-z][\w-]*))\1(?![#\w])
Details:
(?<![a-zA-Z0-9_.-]) - a negative lookbehind that matches a location that is not immediately preceded by an ASCII digit, an ASCII letter, _, . or -
# - a # char
(?=([A-Za-z]+[A-Za-z0-9_-]*)) - a positive lookahead with a capturing group inside that captures one or more ASCII letters and then zero or more ASCII letters, digits, - or _ chars
\1 - the Group 1 value (backreferences are atomic, no backtracking is allowed through them)
(?![#\w]) - a negative lookahead that fails the match if there is a word char (letter, digit or _) or a # char immediately to the right of the current location.
Note that I put the hyphens at the end of the character classes. This is best practice, because a hyphen in that position is always a literal hyphen and never a range operator.
The (?a)(?<![\w.-])#(?=([A-Za-z][\w-]*))\1(?![#\w]) alternative uses shorthand character classes and the (?a) inline modifier (the equivalent of re.ASCII / re.A), which makes \w match only ASCII chars, as in the original version. Remove (?a) if you plan to match any Unicode digits/letters.
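Both variants can be tried on the string from the question. A minimal sketch with illustrative variable names; findall returns the Group 1 values, i.e. the tag names without the #:

```python
import re

text = '#first#nope #second#Hello #my-friend, email# whats.up#example.com #friend'

# Explicit ASCII character classes.
explicit = re.compile(r'(?<![a-zA-Z0-9_.-])#(?=([A-Za-z]+[A-Za-z0-9_-]*))\1(?![#\w])')
# Shorthand classes, kept ASCII-only by the (?a) inline flag.
shorthand = re.compile(r'(?a)(?<![\w.-])#(?=([A-Za-z][\w-]*))\1(?![#\w])')

explicit_tags = explicit.findall(text)
shorthand_tags = shorthand.findall(text)
print(explicit_tags)   # ['my-friend', 'friend']
print(shorthand_tags)  # ['my-friend', 'friend']
```

#first and #second are rejected because the name captured in the lookahead is directly followed by #, and the regex cannot backtrack into the lookahead to try a shorter name.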
Another option is to assert a whitespace boundary to the left, and assert no word char or # sign to the right.
(?<!\S)#([A-Za-z]+[\w-]+)(?![#\w])
The pattern matches:
(?<!\S) Negative lookbehind, assert not a non whitespace char to the left
# Match literally
([A-Za-z]+[\w-]+) Capture group1, match 1+ chars A-Za-z and then 1+ word chars or -
(?![#\w]) Negative lookahead, assert not # or word char to the right
Or match a non word boundary \B before the # instead of a lookbehind.
\B#([A-Za-z]+[\w-]+)(?![#\w])
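The two alternatives agree on the string from the question, but they are not interchangeable. The sketch below (variable names and the see.#tag input are made up for illustration) shows one case where they differ:

```python
import re

text = '#first#nope #second#Hello #my-friend, email# whats.up#example.com #friend'

ws_boundary = re.compile(r'(?<!\S)#([A-Za-z]+[\w-]+)(?![#\w])')
non_word_boundary = re.compile(r'\B#([A-Za-z]+[\w-]+)(?![#\w])')

ws_tags = ws_boundary.findall(text)
nwb_tags = non_word_boundary.findall(text)
print(ws_tags)   # ['my-friend', 'friend']
print(nwb_tags)  # ['my-friend', 'friend']

# \B only requires that the char before # is not a word char, so a
# preceding dot is accepted; (?<!\S) requires whitespace or the string start.
ws_after_dot = ws_boundary.findall('see.#tag')
nwb_after_dot = non_word_boundary.findall('see.#tag')
print(ws_after_dot)   # []
print(nwb_after_dot)  # ['tag']
```

Which one fits depends on whether a tag directly after punctuation should count.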

Missing something in the regex?

I'm trying to use this regex
art\..*[A-Z].*\s
to extract the text in bold here
some text bla art. 100 of Important_text other text bla
Basically, I would like to extract all the text that follow this pattern:
*art.* *number* *whatever* *first word that starts in uppercase*
But it's not working as expected. Any suggestion?
With the samples you have shown, please try the following:
\bart\..*?\d+.*?[A-Z]\w*
Explanation:
\b ##Word boundary, so that art is not matched inside a longer word.
art\. ##Match the word art followed by a literal dot.
.*?\d+ ##Lazily match up to the first run of 1 or more digits.
.*?[A-Z]\w* ##Lazily match up to the first capital letter, then any word characters that follow it.
You can match art. then match until the first digits and then match until the first occurrence of an uppercase char.
\bart\.\D*\d+[^A-Z]*[A-Z]\S*
The pattern matches
\bart\. Match art. preceded by a word boundary
\D*\d+ Match 0+ times a non digit, followed by 1+ digits
[^A-Z]* Match 0+ times any char except A-Z
[A-Z]\S* Match a char A-Z followed by optional non whitespace chars.
If the word has to start with A-Z you can assert a whitespace boundary to the left using (?<!\S) before matching an uppercase char A-Z.
\bart\.\D*\d+[^A-Z]*(?<!\S)[A-Z]\S*
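All three patterns from the answers can be checked against the sample line. A minimal sketch; the second input with fooBar is an invented example that shows what the whitespace-boundary assertion changes:

```python
import re

sample = 'some text bla art. 100 of Important_text other text bla'

lazy = re.compile(r'\bart\..*?\d+.*?[A-Z]\w*')
negated = re.compile(r'\bart\.\D*\d+[^A-Z]*[A-Z]\S*')
word_start = re.compile(r'\bart\.\D*\d+[^A-Z]*(?<!\S)[A-Z]\S*')

results = [p.search(sample).group(0) for p in (lazy, negated, word_start)]
print(results[0])  # art. 100 of Important_text

# An uppercase char in the middle of a word satisfies the second pattern.
mid_word = negated.search('art. 5 fooBar Baz').group(0)
print(mid_word)  # art. 5 fooBar

# The third pattern rejects that input entirely: [^A-Z]* cannot skip past
# the B in fooBar, and that B is not at the start of a word.
strict = word_start.search('art. 5 fooBar Baz')
print(strict)  # None
```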

word boundary \b doesn't work on string with dot in Python regex [duplicate]

For example: George R.R. Martin
I want to match only George and Martin.
I have tried: \w+\b. But it doesn't work!
The pattern \w+\b. matches 1+ word chars followed by a word boundary and then any single char, which can only be a non-word char because \b must sit between the last word char and it. Nothing in this pattern excludes anything, and it also misses an important point: to match a literal dot, the . in the pattern must be escaped.
You may use a negative lookahead (?!\.):
var s = "George R.R. Martin";
console.log(s.match(/\b\w+\b(?!\.)/g));
Details:
\b - leading word boundary
\w+ - 1+ word chars
\b - trailing word boundary
(?!\.) - there must be no . after the last word char matched.
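Since the question is about Python, the same pattern can be used with re.findall. A small sketch that also shows what the original attempt returns:

```python
import re

s = 'George R.R. Martin'

# Whole words that are not followed by a dot.
names = re.findall(r'\b\w+\b(?!\.)', s)
print(names)  # ['George', 'Martin']

# The original attempt: the unescaped . consumes the char after each word,
# and Martin is lost because no char follows it.
attempt = re.findall(r'\w+\b.', s)
print(attempt)  # ['George ', 'R.', 'R.']
```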

Python regex: removing all special characters and numbers NOT attached to words

I am trying to remove all special characters and numbers in Python, except numbers that are directly attached to words.
I have succeeded in doing this for all special characters and numbers, whether attached to words or not. How can I do it so that numbers attached to words are not matched?
Here's what I did:
import regex as re
string = "win32 backdoor guid:64664646 DNS-lookup h0lla"
re.findall(r'[^\p{P}\p{S}\s\d]+', string.lower())
I get as output
win backdoor guid DNS lookup h lla
But I want to get:
win32 backdoor guid DNS lookup h0lla
demo: https://regex101.com/r/x4HrGo/1
To match alphanumeric strings or only letter words you may use the following pattern with re:
import re
# ...
re.findall(r'(?:[^\W\d_]+\d|\d+[^\W\d_])[^\W_]*|[^\W\d_]+', text.lower())
Details
(?:[^\W\d_]+\d|\d+[^\W\d_])[^\W_]* - either 1+ letters followed by a digit, or 1+ digits followed by a letter, and then 0+ letters/digits
| - or
[^\W\d_]+ - any 1+ Unicode letters
NOTE It is equivalent to \d*[^\W\d_][^\W_]* pattern posted by PJProudhon, that matches any 1+ alphanumeric character chunks with at least 1 letter in them.
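Applied to the string from the question, both patterns give the wanted tokens. A minimal sketch using the standard re module; the variable names are illustrative:

```python
import re

text = 'win32 backdoor guid:64664646 DNS-lookup h0lla'

# Letters mixed with digits, or letters only; pure digit runs never match.
mixed_or_letters = r'(?:[^\W\d_]+\d|\d+[^\W\d_])[^\W_]*|[^\W\d_]+'
# Shorter equivalent: optional leading digits, one letter, then letters/digits.
at_least_one_letter = r'\d*[^\W\d_][^\W_]*'

tokens = re.findall(mixed_or_letters, text.lower())
tokens_short = re.findall(at_least_one_letter, text.lower())
print(tokens)  # ['win32', 'backdoor', 'guid', 'dns', 'lookup', 'h0lla']
print(' '.join(tokens_short))  # win32 backdoor guid dns lookup h0lla
```

The tokens are lowercase only because the input is lowercased first, as in the question's code.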
You could give a try to \b\d*[^\W\d_][^\W_]*\b
Decomposition:
\b # word boundary
\d* # zero or more digits
[^\W\d_] # one alphabetic character
[^\W_]* # zero or more alphanumeric characters
\b # word boundary
For beginners:
[^\W] is a typical double-negated construct: \W is the negation of \w (which matches any alphanumeric character plus _, i.e. [a-zA-Z0-9_] in ASCII mode), so [^\W] matches any character that is not a non-word character, which is exactly what \w matches.
The construct proves useful here because further exclusions can be added to the class:
Any alphanumeric character = [^\W_] matches any character which is not non-[alphanumeric or _] and is not _.
Any alphabetic character = [^\W\d_] matches any character which is not non-[alphanumeric or _] and is not a digit (\d) and is not _.
Edit:
When _ is also considered a word delimiter, just drop the word boundaries, which do not trigger next to that character, and use \d*[^\W\d_][^\W_]*.
The default greediness of the star operator ensures that all relevant characters are actually matched.
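The effect of the word boundaries around _ can be seen on a small input. This is a sketch with a made-up string, not the question's data:

```python
import re

text = 'win32 foo_bar 64664646 h0lla'

# _ is a word char, so there is no \b between "foo" and "_":
# neither half of foo_bar can satisfy both boundaries.
with_boundaries = re.findall(r'\b\d*[^\W\d_][^\W_]*\b', text)
print(with_boundaries)  # ['win32', 'h0lla']

# Without the boundaries, _ simply acts as a delimiter.
without_boundaries = re.findall(r'\d*[^\W\d_][^\W_]*', text)
print(without_boundaries)  # ['win32', 'foo', 'bar', 'h0lla']
```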
Try this RegEx instead:
([A-Za-z]+(\d)*[A-Za-z]*)
You can expand it from here, for example by swapping the * and + on the first and last sets so that strings like "win32" and "01ex" are captured equally.

Regular Expression for a string contains if characters all in capital python

I'm extracting the paragraph of text that follows a heading like "OBSERVATION #1" or "OBSERVATION #2" in the output of a library like PyPDF2.
However, the extraction can introduce errors, so a heading may come out as "OBSERVA'TION #2", and I have to avoid matching things like "Suite #300". So the rule is: if there are letters, they must all be capitals.
Currently the Python code snippet looks like this:
inspection_observation = pdfFile.getPage(z).extractText()
if 'OBSERVATION' in inspection_observation:
    for finding in re.findall(r"[OBSERVATION] #\d+(.*?) OBSERVA'TION #\d?", inspection_observation, re.DOTALL):
        # print(inspection_observation)
        print(finding)
Please advise on the appropriate regular expression for this case.
If there should be a capital and the word can contain a ', you could use a character class where you can list the characters that are allowed and a positive lookahead.
Then you can capture the content between those capital words and use a positive lookahead to check if what follows is another capital word followed by # and 1+ digits or the end of the string. This regex makes use of re.DOTALL where the dot matches a newline.
(?=[A-Z']*[A-Z])[A-Z']+\s+#\d+(.*?(?=[A-Z']*[A-Z][A-Z']*\s+#\d+|$))
Explanation
(?=[A-Z']*[A-Z]) Positive lookahead, assert that what follows contains at least one char A-Z, optionally preceded by A-Z or ' chars
[A-Z']+\s+#\d+ Match 1+ times A-Z or ', then 1+ whitespace chars, # and 1+ digits
( Capture group
.*? Match any char, as few times as possible
(?= Positive lookahead, assert that what follows is
[A-Z']*[A-Z][A-Z']* A word of chars A-Z and ' that contains at least one uppercase char A-Z
\s+#\d+ Followed by 1+ whitespace chars, # and 1+ digits
|$ Or the end of the string
) Close lookahead
) Close capture group
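Here is how the pattern could be used from Python. This is a sketch only: the sample text is invented to resemble extracted PDF output, and the variable names are illustrative:

```python
import re

# Invented sample text, not taken from a real PDF.
page_text = ("OBSERVATION #1 The floor was wet. Suite #300 noted. "
             "OBSERVA'TION #2 Labels were missing.")

heading_re = re.compile(
    r"(?=[A-Z']*[A-Z])[A-Z']+\s+#\d+(.*?(?=[A-Z']*[A-Z][A-Z']*\s+#\d+|$))",
    re.DOTALL,
)

# findall returns the Group 1 value for each heading; strip the
# whitespace left over next to the headings.
findings = [body.strip() for body in heading_re.findall(page_text)]
for finding in findings:
    print(finding)
# The floor was wet. Suite #300 noted.
# Labels were missing.
```

"Suite #300" stays inside the first finding because Suite contains lowercase letters, so it never satisfies the all-capitals heading part of the pattern.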
