Remove words from string except within quotes [duplicate] - python

I would like a Python regular expression that matches a given word when it is not between single quotes. I've tried to use the negative lookahead (?! ...), but without success.
In the sample below, I would like to match every foe except the one inside the quoted string on the 4th line.
Plus, the text is given as one big string.
Here is the regex101 link, and the sample text is below:
var foe = 10;
foe = "";
dark_vador = 'bad guy'
foe = ' I\'m your father, foe ! '
bar = thingy + foe

The regex solution below will work in most cases, but it might break if unbalanced single quotes appear outside of string literals, e.g. in comments.
A usual regex trick to match strings in context is to match and capture what you need to keep, and to only match what you need to replace.
Here is a sample Python demo:
import re
rx = r"('[^'\\]*(?:\\.[^'\\]*)*')|\b{0}\b"
s = r"""
var foe = 10;
foe = "";
dark_vador = 'bad guy'
foe = ' I\'m your father, foe ! '
bar = thingy + foe"""
toReplace = "foe"
res = re.sub(rx.format(toReplace), lambda m: m.group(1) if m.group(1) else 'NEWORD', s)
print(res)
See the Python demo
The regex will look like
('[^'\\]*(?:\\.[^'\\]*)*')|\bfoe\b
See the regex demo.
The ('[^'\\]*(?:\\.[^'\\]*)*') part captures single-quoted string literals into Group 1, and if it matches, the captured text is simply put back into the result. \bfoe\b matches the whole word foe in any other context, and that match is replaced with another word.
NOTE: To also match double quoted string literals, use r"('[^'\\]*(?:\\.[^'\\]*)*'|\"[^\"\\]*(?:\\.[^\"\\]*)*\")".
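As a rough sketch of how the NOTE above could be combined with the replacement step, the helper below keeps both single- and double-quoted literals and replaces the word everywhere else. The function name replace_outside_quotes and the sample string are made up for illustration; they are not part of the original answer.

```python
import re

def replace_outside_quotes(text, word, replacement):
    # Group 1 captures a single- or double-quoted literal (backslash escapes allowed);
    # the second alternative matches the bare word outside any such literal.
    pattern = re.compile(
        r"""('[^'\\]*(?:\\.[^'\\]*)*'|"[^"\\]*(?:\\.[^"\\]*)*")|\b%s\b"""
        % re.escape(word)
    )
    # If group 1 matched, put the literal back unchanged; otherwise substitute.
    return pattern.sub(lambda m: m.group(1) or replacement, text)

sample = "foe = 'a foe'; bar = \"foe\" + foe"
result = replace_outside_quotes(sample, "foe", "NEW")
print(result)  # NEW = 'a foe'; bar = "foe" + NEW
```

The same caveat applies as for the original pattern: unbalanced quotes outside string literals can still throw it off.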

You can try this:
((?!\'[\w\s]*)foe(?![\w\s]*\'))

How about this regular expression:
>>> s = '''var foe = 10;
foe = "";
dark_vador = 'bad guy'
' I\m your father, foe ! '
bar = thingy + foe'''
>>>
>>> re.findall(r'(?!\'.*)foe(?!.*\')', s)
['foe', 'foe', 'foe']
The trick here is to make sure the expression does not match a foe that sits between a leading and a trailing ', and to remember to account for the characters in between, hence the .* in the expression.

((?!\'[\w\s]*[\\']*[\w\s]*)foe(?![\w\s]*[\\']*[\w\s]*\'))

Capture group 1 of the following regular expression will contain matches of 'foe'.
r'^(?:[^'\n]|\\')*(?:(?<!\\)'(?:[^'\n]|\\')*(?:(?<!\\)')(?:[^'\n]|\\')*)*\b(foe)\b'
Start your engine!
Python's regex engine performs the following operations.
^ : assert beginning of string
(?: : begin non-capture group
[^'\n] : match any char other than single quote and line terminator
| : or
\\' : match '\' then a single quote
) : end non-capture group
* : execute non-capture group 0+ times
(?: : begin non-capture group
(?<!\\) : next char is not preceded by '\' (negative lookbehind)
' : match single quote
(?: : begin non-capture group
[^'\n] : match any char other than single quote and line terminator
| : or
\\' : match '\' then a single quote
) : end non-capture group
* : execute non-capture group 0+ times
(?: : begin non-capture group
(?<!\\) : next char is not preceded by '\' (negative lookbehind)
' : match single quote
) : end non-capture group
(?: : begin non-capture group
[^'\n] : match any char other than single quote and line terminator
| : or
\\' : match '\' then a single quote
) : end non-capture group
* : execute non-capture group 0+ times
) : end non-capture group
* : execute non-capture group 0+ times
\b(foe)\b : match 'foe' in capture group 1

Related

Regex having optional groups with non-capturing groups

I have a regex with multiple optional non-capturing groups. All of these groups can occur, but don't have to. The regex should use non-capturing groups so that the whole string is returned.
When I set the last group also as optional, the Regex will have several grouped results. When I set the first group as not-optional, the Regex matches. Why is that?
The input will be something like input_text = "xyz T1 VX N1 ", expected output T1 VX N1.
regexs = {
    "allOptional": 'p?(?:T[X0-4]?)?\\s?(?:V[X0-2])?\\s?(?:N[X0-3])?',
    "lastNotOptional": 'p?(?:T[X0-4]?)?\\s?(?:V[X0-2])?\\s?(?:N[X0-3])',
    "firstNotOptional": 'p?(?:T[X0-4]?)\\s?(?:V[X0-2])?\\s?(?:N[X0-3])?',
}
for key, regex in regexs.items():
    matches = re.findall(regex, input_text)
# Results
allOptional = ['', '', '', ' ', 'T1 VX N1', '']
lastNotOptional = ['T1 VX N1']
firstNotOptional = ['T1 VX N1']
Thanks in advance!
I suggest
\b(?=\w)p?(?:T[X0-4]?)?\s?(?:V[X0-2])?\s?(?:N[X0-3])?\b(?<=\w)
See the regex demo.
An alternative is a combination of lookarounds that makes sure the match is immediately preceded by a whitespace char or the start of the string and that the first char of the match is a non-whitespace char, plus another lookaround combination (at the end of the pattern) that makes sure the last char of the match is a non-whitespace char and that a whitespace char or the end of the string follows:
(?<!\S)(?=\S)p?(?:T[X0-4]?)?\s?(?:V[X0-2])?\s?(?:N[X0-3])?(?!\S)(?<=\S)
See this regex demo.
The main points here are the two specific word/whitespace boundary combinations:
\b(?=\w) at the start makes sure the word boundary position is matched, that is immediately followed with a word char
\b(?<=\w) at the end asserts the position at the word boundary, with a word char immediately on the left
(?<!\S)(?=\S) - a position that is at the start of string, or immediately after a whitespace and that is immediately followed with a non-whitespace char
(?!\S)(?<=\S) - a position that is at the end of string, or immediately before a whitespace and that is immediately preceded with a non-whitespace char.
See a Python demo:
import re
input_text = "xyz T1 VX N1 G1"
pattern = r'\b(?=\w)p?(?:T[X0-4]?)?\s?(?:V[X0-2])?\s?(?:N[X0-3])?\b(?<=\w)'
print(re.findall(pattern, input_text))
# => ['T1 VX N1']
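The whitespace-boundary variant can be checked the same way. This is only a sketch that reuses the sample input from the demo above; because every group is optional, the lookarounds are what prevent empty matches.

```python
import re

input_text = "xyz T1 VX N1 G1"
# (?<!\S)(?=\S) anchors the start; (?!\S)(?<=\S) anchors the end of the match.
pattern = r'(?<!\S)(?=\S)p?(?:T[X0-4]?)?\s?(?:V[X0-2])?\s?(?:N[X0-3])?(?!\S)(?<=\S)'
matches = re.findall(pattern, input_text)
print(matches)  # ['T1 VX N1']
```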

What is the correct way of grabbing an inner string in regular expressions for Python for multiple conditions

I would like to return all strings within the specified starting and end strings.
Given a string libs = 'libr(lib1), libr(lib2), libr(lib3), req(reqlib), libra(nonlib)'.
From the above libs string I would like to search for strings that are in between libr( and ) or the string between req( and ).
I would like to return ['lib1', 'lib2', 'lib3', 'reqlib']
import re
libs = 'libr(lib1), libr(lib2), libr(lib3), req(reqlib), libra(nonlib)'
pat1 = r'libr+\((.*?)\)'
pat2 = r'req+\((.*?)\)'
pat = f"{pat1}|{pat2}"
re.findall(pat, libs)
The code above currently returns [('lib1', ''), ('lib2', ''), ('lib3', ''), ('', 'reqlib')] and I am not sure how I should fix this.
Try this regex
(?:(?<=libr\()|(?<=req\())[^)]+
Explanation:
(?:(?<=libr\()|(?<=req\())
(?<=libr\() - positive lookbehind that matches the position which is immediately preceded by text libr(
| - or
(?<=req\() - positive lookbehind that matches the position which is immediately preceded by text req(
[^)]+ - matches 1+ occurrences of any character which is not a ). So, this will match everything until it finds the next )
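A short Python check of this pattern, using the libs string from the question. Python's re module requires fixed-width lookbehinds, which is why the two lookbehinds are written as separate alternatives rather than as one lookbehind containing libr|req.

```python
import re

libs = 'libr(lib1), libr(lib2), libr(lib3), req(reqlib), libra(nonlib)'
# No capture groups, so findall returns the matched text itself.
result = re.findall(r'(?:(?<=libr\()|(?<=req\())[^)]+', libs)
print(result)  # ['lib1', 'lib2', 'lib3', 'reqlib']
```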
You can do it like this:
pat1 = r'(?<=libr\().*?(?=\))'
pat2 = r'(?<=req\().*?(?=\))'
It uses positive lookbehind (?<=) and positive lookahead (?=).
.*? : selects all characters in between. I'll name it "content"
(?<=libr\() : "content" must be preceded by libr( (we escape the ( )
(?=\)) : "content" must be followed by ) (the ) is escaped too)
Complete code:
import re
libs = 'libr(lib1), libr(lib2), libr(lib3), req(reqlib), libra(nonlib)'
pat1 = r'(?<=libr\().*?(?=\))'
pat2 = r'(?<=req\().*?(?=\))'
pat = f"{pat1}|{pat2}"
result = re.findall(pat, libs)
print(result)
Output:
['lib1', 'lib2', 'lib3', 'reqlib']
I think a common way to do this is to use an alternation for the words that must precede the part you want to capture:
\b(?:libr|req)\(([^)]+)
See the online demo
\b - Word-boundary.
(?: - Open non-capture group:
libr|req - Match "libr" or "req".
) - Close non-capture group.
\( - A literal opening parenthesis.
( - Open a capture group:
[^)]+ - Match 1+ characters apart from a closing parenthesis.
) - Close capture group.
A Python demo:
import re
libs = 'libr(lib1), libr(lib2), libr(lib3), req(reqlib), libra(nonlib)'
lst = re.findall(r'\b(?:libr|req)\(([^)]+)', libs)
print(lst)
Prints:
['lib1', 'lib2', 'lib3', 'reqlib']

How to match everything up to double newline "\n\n" using regex in Python?

Suppose I have the following Python string
str = """
....
Dummyline
Start of matching
+----------+----------------------------+
+ test + 1234 +
+ test2 + 5678 +
+----------+----------------------------+

Finish above. Do not match this
+----------+----------------------------+
+ dummy1 + 00000000000 +
+ dummy2 + 12345678910 +
+----------+----------------------------+
"""
and I want to match everything that the first table has. I could use a regex that starts matching from "Start" and matches everything until it finds a double newline \n\n.
I found some tips on how to do this in another stackoverflow post (How to match "anything up until this sequence of characters" in a regular expression?), but it doesn't seem to be working for the double newline case.
I thought of the following code
pattern = re.compile(r"Start[^\n\n]")
matches = pattern.finditer(str)
where basically [^x] means match everything until character x is found. But this works only for single characters, not for strings ("\n\n" in this case).
Does anybody have an idea how to do this?
You can match Start to the end of the line, and then match all following lines that start with a newline that is not immediately followed by another newline, using a negative lookahead (?!\r?\n):
^Start .*(?:\r?\n(?!\r?\n).*)*
Explanation
^Start .* Match Start from the start of the string ^ and 0+ times any char except a newline
(?: Non capture group
\r?\n Match a newline
(?!\r?\n) Negative lookahead, assert what is directly to the right is not a newline
.* Match 0+ times any character except a newline
)* Close the non capturing group and repeat 0+ times to get all the lines
Regex demo
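A minimal Python sketch of this pattern follows. The shortened sample text is made up for illustration, and re.MULTILINE is an assumption added here so that ^ can match at the start of the "Start" line, which is not the first line of the string.

```python
import re

text = (
    "Dummyline\n"
    "Start of matching\n"
    "+------+------+\n"
    "+ test + 1234 +\n"
    "+------+------+\n"
    "\n"
    "Finish above. Do not match this\n"
)

# Consume whole lines until a newline is directly followed by another newline.
pattern = re.compile(r"^Start .*(?:\r?\n(?!\r?\n).*)*", re.MULTILINE)
m = pattern.search(text)
block = m.group(0) if m else None
print(block)
```

The match stops right before the blank line, so the "Finish above" line is not included.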

Creating regular expression for extracting specific measurements

I am trying to extract measurements from a file using Python. I want to extract them with specification words. For example:
Width 3.5 in
Weight 10 kg
I used the following code:
p = re.compile('\b?:Length|Width|Height|Weight (?:\.\d{1,2}|\d{1,4}\.?\d{0,2}|\d{5}\.?\d?|\d{6}\.?) (?:in|oz|lbs|VAC|Hz|amps|H.P.)\b')
print(p.findall(text))
However, it only outputs the first word (just "Height" or "Length") and completely misses the rest. Is there something I should fix in the above regular expression?
=====
UPDATE:
For some reason, online regex tester and my IDE give me completely different results for the same pattern:
expression = r"""\b
(?:
[lL]ength\ +(?P<Length>\d+(?:\.\d+)?|\d+-\d+\/\d+)\ +(?:in|ft|cm|m)|
[wW]idth\ +(?P<Width>\d+(?:\.\d+)?)\ +(?:in|ft|cm|m)|
[wW]eight\ +(?P<Weight>\d+(?:\.\d+)?|\d+-\d)\ +(?:oz|lb|g|kg)|
Electrical\ +(?P<Electrical>[^ ]+)\ +(?:VAC|Hz|[aA]mps)
)
\b
"""
print(re.findall(expression,text,flags=re.X|re.MULTILINE|re.I))
returns me [('17-13/16', '', '', '')] for the same input.
Is there something I should update?
Consider using the following regular expression, which ties the format of the values and the units of measurement to the element being matched.
\b
(?:
Length\ +(?P<Length>\d+(?:\.\d+)?)\ +(?:in|ft|cm|m)|
Width\ +(?P<Width>\d+(?:\.\d+)?)\ +(?:in|ft|cm|m)|
Weight\ +(?P<Weight>\d+)\ +(?:oz|lb|g|kg)
)
\b
I've written this with the x ("extended", or verbose) flag, which ignores whitespace in the pattern, to make it easier to read. For that reason I needed to escape the space characters. (Alternatively, I could have put each in a character class.)
As seen, "Length" and "Width" require the value to be an integer or a float and the units to be any of "in", "ft", "cm" or "m", whereas "Weight" requires the value to be an integer and the units to be any of "oz", "lb", "g" or "kg". It could of course be extended in the obvious way.
Start your engine!
Python's regex engine performs the following operations.
\b : assert word boundary
(?: : begin non-capture group
Length\ + : match 'Length' then 1+ spaces
(?P<Length> : begin named capture group 'Length'
\d+ : match 1+ digits
(?:\.\d+)? : optionally match '.' then 1+ digits
) : close named capture group
\ + : match 1+ spaces
(?:in|ft|cm|m) : match 'in', 'ft', 'cm' or 'm' in a non-capture group
| : or
Width\ + : similar to above
(?P<Width> : ""
\d+ : ""
(?:\.\d+)? : ""
) : ""
\ + : ""
(?:in|ft|cm|m) : ""
| : ""
Weight\ + : ""
(?P<Weight>\d+) : match 1+ digits in capture group 'Weight'
\ + : similar to above
(?:oz|lb|g|kg) : ""
) : end non-capture group
\b : assert word boundary
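Here is a hedged sketch of how this pattern might be used from Python. It assumes the (?P<name>...) spelling of named groups, which is the form Python's re module accepts, and it uses the two sample lines from the question as input. Iterating with finditer and reading groupdict() avoids the tuples full of empty strings that findall returns when a pattern has several capture groups.

```python
import re

pattern = re.compile(
    r"""\b
    (?:
      Length\ +(?P<Length>\d+(?:\.\d+)?)\ +(?:in|ft|cm|m)|
      Width\ +(?P<Width>\d+(?:\.\d+)?)\ +(?:in|ft|cm|m)|
      Weight\ +(?P<Weight>\d+)\ +(?:oz|lb|g|kg)
    )
    \b
    """,
    re.VERBOSE,
)

text = "Width 3.5 in\nWeight 10 kg"
found = {}
for m in pattern.finditer(text):
    # Keep only the named group that actually took part in this match.
    found.update({k: v for k, v in m.groupdict().items() if v is not None})
print(found)  # {'Width': '3.5', 'Weight': '10'}
```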
To allow "Length" to be expressed in fractional amounts, change
(?P<Length>
\d+
(?:\.\d+)?
)
to
(?P<Length>
\d+
(?:\.\d+)?
| : or
\d+-\d+\/\d+ : match 1+ digits, '-', 1+ digits, '/', 1+ digits
)
To add an element to the alternation for "Electrical", append a pipe (|) at the end of the "Weight" row and insert the following before the last right parenthesis.
Electrical\ + : match 'Electrical' then 1+ spaces
(?P<Electrical> : begin named capture group 'Electrical'
[^ ]+ : match 1+ characters other than spaces
) : close named capture group
\ + : match 1+ spaces
(?:VAC|Hz|[aA]mps) : match 'VAC', 'Hz', 'amps' or 'Amps' in a non-capture group
Here I've made the electrical value merely a string of characters other than spaces because values for 'Hz' (e.g., 50-60) are different from those for 'VAC' and 'amps'. That could be fine-tuned if necessary.
There are a few issues with the pattern:
You can not put a quantifier ? after the word boundary
The alternatives Length|Width etc should be within a grouping structure
Add kg at the last alternation
Escape the dots to match them literally
Assert a whitespace boundary at the end (?!\S) because H.P. is one of the options and will not match when using \b and followed by a space for example
For example
\b(?:Length|Width|Height|Weight) (?:\.\d{1,2}|\d{1,4}\.?\d{0,2}|\d{5}\.?\d?|\d{6}\.?) (?:in|oz|lbs|VAC|Hz|amps|H\.P\.|kg)(?!\S)
Regex demo | Python demo
Also note Wiktor Stribiżew's comment about \b. This page explains the difference.
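For reference, a small Python check of the corrected pattern against the two sample lines from the question. The pattern is written as a raw string so that \b reaches the regex engine as a word boundary.

```python
import re

text = "Width 3.5 in\nWeight 10 kg"
pattern = re.compile(
    r"\b(?:Length|Width|Height|Weight) "
    r"(?:\.\d{1,2}|\d{1,4}\.?\d{0,2}|\d{5}\.?\d?|\d{6}\.?) "
    r"(?:in|oz|lbs|VAC|Hz|amps|H\.P\.|kg)(?!\S)"
)
matches = pattern.findall(text)
print(matches)  # ['Width 3.5 in', 'Weight 10 kg']
```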

need regex expression to avoid " \n " character

I want to apply a regex to the string below in Python, where I only want to capture Model Number : 123. I tried the regex below, but it didn't give me the result I wanted.
string = """Model Number : 123
Serial Number : 456"""
model_number = re.findall(r'(?s)Model Number:.*?\n',string)
The output is Model Number : 123\n. How can I avoid the \n at the end of the output?
Remove the DOTALL (?s) inline modifier to avoid matching a newline char with ., add \s* after Number and use .* instead of .*?\n:
r'Model Number\s*:.*'
See the regex demo
Here, Model Number will match a literal substring, \s* will match 0+ whitespaces, : will match a colon and .* will match 0 or more chars other than line break chars.
Python demo:
import re
s = """Model Number : 123
Serial Number : 456"""
model_number = re.findall(r'Model Number\s*:.*',s)
print(model_number) # => ['Model Number : 123']
If you need to extract just the number use
r'Model Number\s*:\s*(\d+)'
See another regex demo and this Python demo.
Here, (\d+) will capture 1 or more digits and re.findall will only return these digits. Or, use it with re.search and once the match data object is obtained, grab it with match.group(1).
NOTE: If the string appears at the start of the string, use re.match. Or add ^ at the start of the pattern and use re.M flag (or add (?m) at the start of the pattern).
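The re.search route mentioned above could look like this; a minimal sketch using the same sample string, with a guard for the case where nothing matches.

```python
import re

s = """Model Number : 123
Serial Number : 456"""

m = re.search(r'Model Number\s*:\s*(\d+)', s)
# group(1) holds only the digits captured by (\d+).
number = m.group(1) if m else None
print(number)  # 123
```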
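You can use the strip() function on each match (re.findall returns a list):
model_number = [m.strip() for m in model_number]
This removes leading and trailing whitespace, including the trailing newline.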
