Creating regular expression for extracting specific measurements - python

I am trying to extract measurements from a file using Python. I want to extract them with specification words. For example:
Width 3.5 in
Weight 10 kg
I used the following code:
p = re.compile('\b?:Length|Width|Height|Weight (?:\.\d{1,2}|\d{1,4}\.?\d{0,2}|\d{5}\.?\d?|\d{6}\.?) (?:in|oz|lbs|VAC|Hz|amps|H.P.)\b')
print(p.findall(text))
However, it only outputs the first word (just "Height" or "Length") and completely misses the rest. Is there something I should fix in the above regular expression?
=====
UPDATE:
For some reason, online regex tester and my IDE give me completely different results for the same pattern:
expression = r"""\b
(?:
[lL]ength\ +(?P<Length>\d+(?:\.\d+)?|\d+-\d+\/\d+)\ +(?:in|ft|cm|m)|
[wW]idth\ +(?P<Width>\d+(?:\.\d+)?)\ +(?:in|ft|cm|m)|
[wW]eight\ +(?P<Weight>\d+(?:\.\d+)?|\d+-\d)\ +(?:oz|lb|g|kg)|
Electrical\ +(?P<Electrical>[^ ]+)\ +(?:VAC|Hz|[aA]mps)
)
\b
"""
print(re.findall(expression,text,flags=re.X|re.MULTILINE|re.I))
returns me [('17-13/16', '', '', '')] for the same input.
Is there something I should update?

Consider using the following regular expression, which ties the format of the values and the units of measurement to the element being matched.
\b
(?:
Length\ +(?<Length>\d+(?:\.\d+)?)\ +(?:in|ft|cm|m)|
Width\ +(?<Width>\d+(?:\.\d+)?)\ +(?:in|ft|cm|m)|
Weight\ +(?<Weight>\d+)\ +(?:oz|lb|g|kg)
)
\b
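I've written this with the x ("extended") flag (which ignores whitespace) to make it easier to read. For that reason I had to escape the space characters. (Alternatively, I could have put each in a character class.) Note that Python's built-in re module spells named groups (?P<name>...); the (?<name>...) form used here is the PCRE-style spelling, so add the P when using re.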
As seen, "Length" and "Width" require the value to be an integer or a float and the units to be any of "in", "ft", "cm" or "m", whereas "Weight" requires the value to be an integer and the units to be any of "oz", "lb", "g" or "kg". It could of course be extended in the obvious way.
Python's regex engine performs the following operations.
\b : assert word boundary
(?: : begin non-capture group
Length\ + : match 'Length' then 1+ spaces
(?<Length> : begin named capture group 'Length'
\d+ : match 1+ digits
(?:\.\d+)? : optionally match '.' then 1+ digits
) : close named capture group
\ + : match 1+ spaces
(?:in|ft|cm|m) : match 'in', 'ft', 'cm' or 'm' in a
non-capture group
| : or
Width\ + : similar to above
(?<Width> : ""
\d+ : ""
(?:\.\d+)? : ""
) : ""
\ + : ""
(?:in|ft|cm|m) : ""
| : ""
Weight\ + : ""
(?<Weight>\d+) : match 1+ digits in capture group 'Weight'
\ + : similar to above
(?:oz|lb|g|kg) : ""
) : end non-capture group
\b : assert word boundary
To allow "Length" to be expressed in fractional amounts, change
(?<Length>
\d+
(?:\.\d+)?
)
to
(?<Length>
\d+
(?:\.\d+)?
| : or
\d+-\d+\/\d+ : match 1+ digits, '-' 1+ digits, '/', 1+ digits
)
To add an element to the alternation for "Electrical", append a pipe (|) at the end of the "Weight" row and insert the following before the last right parenthesis.
Electrical\ + : match 'Electrical' then 1+ spaces
(?<Electrical> : begin capture group 'Electrical'
[^ ]+ : match 1+ characters other than spaces
) : close named capture group
\ + : match 1+ spaces
(?:VAC|Hz|[aA]mps) : match 'VAC', 'Hz' or 'amps' in a
non-capture group
Here I've made the electrical value merely a string of characters other than spaces because values for 'Hz' (e.g., 50-60) are different from those for 'VAC' and 'amps'. That could be fine-tuned if necessary.
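For use with Python's built-in re module, the pattern above can be sketched as follows. This is a minimal sketch, assuming named groups are written (?P<name>...) as re requires; the sample text and the names MEASUREMENT and extract_measurements are made up for illustration. It iterates with finditer and reads m.lastgroup, so only the group that actually matched is reported, instead of the tuples padded with empty strings that findall returns.

```python
import re

# Verbose pattern: whitespace is ignored, so literal spaces are escaped.
MEASUREMENT = re.compile(r"""
    \b
    (?:
        Length\ +(?P<Length>\d+(?:\.\d+)?|\d+-\d+/\d+)\ +(?:in|ft|cm|m)|
        Width\ +(?P<Width>\d+(?:\.\d+)?)\ +(?:in|ft|cm|m)|
        Weight\ +(?P<Weight>\d+)\ +(?:oz|lb|g|kg)
    )
    \b
""", re.VERBOSE)


def extract_measurements(text):
    """Return {element name: value} for every measurement found in text."""
    # lastgroup is the name of the one named group that took part in the match.
    return {m.lastgroup: m.group(m.lastgroup) for m in MEASUREMENT.finditer(text)}


sample = "Length 17-13/16 in\nWidth 3.5 in\nWeight 10 kg"  # hypothetical input
measurements = extract_measurements(sample)
print(measurements)
```

If the same element can appear more than once in the input, collecting the values into lists per name would be a safer choice than a plain dict.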

There are a few issues with the pattern:
You cannot put a quantifier ? after the word boundary
The alternatives Length|Width etc. should be within a grouping structure
Add kg to the last alternation
Escape the dots to match them literally
Assert a whitespace boundary at the end with (?!\S), because H.P. is one of the options and \b fails after its final dot when a space follows
For example
\b(?:Length|Width|Height|Weight) (?:\.\d{1,2}|\d{1,4}\.?\d{0,2}|\d{5}\.?\d?|\d{6}\.?) (?:in|oz|lbs|VAC|Hz|amps|H\.P\.|kg)(?!\S)
Regex demo | Python demo
Also note Wiktor Stribiżew's comment about \b. This page explains the difference.
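One more pitfall worth checking in the code from the question: the pattern is written as an ordinary string literal, where Python turns \b into a backspace character before re ever sees it. The following is a minimal sketch of the corrected pattern written as a raw string; the sample text is taken from the question and the variable names are illustrative only.

```python
import re

# In a normal string literal "\b" is a backspace character (0x08);
# in a raw string it stays a backslash plus "b", which re reads as a word boundary.
assert "\b" == "\x08"
assert r"\b" == "\\b"

pattern = re.compile(
    r"\b(?:Length|Width|Height|Weight) "
    r"(?:\.\d{1,2}|\d{1,4}\.?\d{0,2}|\d{5}\.?\d?|\d{6}\.?) "
    r"(?:in|oz|lbs|VAC|Hz|amps|H\.P\.|kg)(?!\S)"
)

text = "Width 3.5 in\nWeight 10 kg"  # sample lines from the question
matches = pattern.findall(text)
print(matches)
```

Because the pattern has no capturing groups, findall returns the whole matched substrings.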

Related

How to create regex where there can be specific char between characters in pattern and I have one "wildcard" char?

I want to create regex in python where I'm given a substring and I want to find it in my string. Characters in substring and my string are always either D, T or F. There are two conditions for match:
After every character in given substring there can occur char '-' (I don't know how to approach this one especially)
Every character can be either the character I'm looking at or 'X' so X is a "wildcard" (I know I can use '|' for that so it would be I believe ([DTF]|X))
So what I mean is if I'm given DTTFDD as substring other proper matches would be:
D-TTFDD
DXTFDD
Edit: These matches can occur in bigger string such as FTDTTDFDD-TTFXDTFTFD
How can I put all of this together?
Looks like you could try:
[DX]-?(?:[TX]-?){2}[FX]-?(?:[DX]-?){2}
See the online demo
[DX]-? - A literal "D" or "X" followed by an optional hyphen.
(?: - Open non-capture group:
[TX]-? - A literal "T" or "X" followed by an optional hyphen.
){2} - Close non-capture group and match twice.
[FX]-? - A literal "F" or "X" followed by an optional hyphen.
(?: - Open non-capture group:
[DX]-? - A literal "D" or "X" followed by an optional hyphen.
){2} - Close non-capture group and match twice.
A little less verbose without the non-capture groups:
[DX]-?[TX]-?[TX]-?[FX]-?[DX]-?[DX]-?
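Since the substring is given at run time, the pattern can be generated rather than typed by hand. This is a minimal sketch, assuming every character of the substring may be replaced by X and may be followed by one optional hyphen; build_pattern is a hypothetical helper name.

```python
import re


def build_pattern(substring):
    """Turn e.g. 'DTTFDD' into [DX]-?[TX]-?[TX]-?[FX]-?[DX]-?[DX]-?"""
    # Each character c becomes "c or X, then an optional hyphen".
    return re.compile("".join("[{}X]-?".format(re.escape(c)) for c in substring))


pattern = build_pattern("DTTFDD")
found = pattern.search("FTDTTDFDD-TTFXDTFTFD")
print(found.group() if found else None)
```

Note that this form also accepts a hyphen after the last character; drop the final -? from the joined string if that is not wanted.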

Regular expression for printing integers within brackets

First time ever using regular expressions and can't get it working although there's quite a few examples in stackoverflow already.
How can I extract integers which are in a string inside bracket?
Example:
dijdi[d43] d5[55++][ 43] [+32]dm dij [ -99]x
would return
[43, 32, -99]
'+' and '-' is okay, if it's in the beginning of the brackets, but not okay if it's in the middle or end. If the '+' sign is in the beginning, it should not be taken into account. (+54 --> 54)
Been trying :
re.findall('\[[-]?\d+\]',str)
but it's not working the way I want.
If you need to fail the match in [ +-34 ] (i.e. if you needn't extract a negative number if there is a + before it) you will need to use
\[\s*(?:\+|(-))?(\d+)\s*]
and when getting a match, concat the Group 1 and Group 2 values. See this regex demo.
Details
\[ - a [ char
\s* - 0+ whitespaces
(?:\+|(-))? - an optional + char, or an optional - char captured into Group 1
(\d+) - Capturing group 2: 1+ digits
\s* - 0+ whitespaces
] - a ] char.
In Python,
import re
text = "dijdi[d43] d5[55++][ 43] [+32]dm dij [ -99]x"
numbers_text = [f"{x}{y}" for x, y in re.findall(r'\[\s*(?:\+|(-))?(\d+)\s*]', text)]
numbers = list(map(int, numbers_text))
# => [43, 32, -99]
If you want to extract integers from a string the code that I use is this:
def stringToNumber(inputStr):
    myNumberList = []
    for s in inputStr.split():
        newString = ''.join(i for i in s if i.isdigit())
        if (len(newString) != 0):
            myNumberList.append(newString)
    return myNumberList
I hope it works for you.
If you've not done so I suggest you switch to the PyPI regex module. Using it here with regex.findall and the following regular expression allows you to extract just what you need.
r'\[ *\+?\K-?\d+(?= *\])'
At the regex tester pass your cursor across the regex for details about individual tokens.
The regex engine performs the following operations.
\[ : match '['
\ * : match 0+ spaces
\+? : optionally match '+'
\K : forget everything matched so far and reset
start of match to current position
-? : optionally match '-'
\d+ : match 1+ digits
(?= *\]) : use positive lookahead to assert the last digit
: matched is followed by 0+ spaces then ']'
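\K is not available in the built-in re module. A rough equivalent, assuming a capture group is acceptable in place of \K and the lookahead, could look like the sketch below; it uses the sample string from the question.

```python
import re

text = "dijdi[d43] d5[55++][ 43] [+32]dm dij [ -99]x"

# The group captures only the optional '-' and the digits; a leading '+' is
# matched but left outside the group, mirroring what \K achieves.
numbers = [int(n) for n in re.findall(r"\[ *\+?(-?\d+) *\]", text)]
print(numbers)
```

With a single capturing group, re.findall returns just the group contents, so no post-processing beyond int() is needed.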

Remove words from string except within quotes [duplicate]

I would like a Python regular expression that matches a given word that's not between simple quotes. I've tried to use the (?! ...) but without success.
In the following screenshot, I would like to match all foe except the one in the 4th line.
Plus, the text is given as one big string.
Here is the link regex101 and the sample text is below:
var foe = 10;
foe = "";
dark_vador = 'bad guy'
foe = ' I\'m your father, foe ! '
bar = thingy + foe
A regex solution below will work in most cases, but it might break if the unbalanced single quotes appear outside of string literals, e.g. in comments.
A usual regex trick for matching text in a specific context is to match what you need to replace, and to match and capture what you need to keep.
Here is a sample Python demo:
import re
rx = r"('[^'\\]*(?:\\.[^'\\]*)*')|\b{0}\b"
s = r"""
var foe = 10;
foe = "";
dark_vador = 'bad guy'
foe = ' I\'m your father, foe ! '
bar = thingy + foe"""
toReplace = "foe"
res = re.sub(rx.format(toReplace), lambda m: m.group(1) if m.group(1) else 'NEWORD', s)
print(res)
See the Python demo
The regex will look like
('[^'\\]*(?:\\.[^'\\]*)*')|\bfoe\b
See the regex demo.
The ('[^'\\]*(?:\\.[^'\\]*)*') part captures single-quoted string literals into Group 1; if it matches, the literal is just put back into the result. \bfoe\b matches the whole word foe in any other context, and that match is replaced with another word.
NOTE: To also match double quoted string literals, use r"('[^'\\]*(?:\\.[^'\\]*)*'|\"[^\"\\]*(?:\\.[^\"\\]*)*\")".
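The capture-and-keep trick, extended to both quote styles as the note suggests, can be wrapped in a small helper. This is a sketch only: replace_outside_quotes and the constant names are hypothetical, and re.escape is added on the assumption that the word to replace may contain regex metacharacters.

```python
import re

# String literals with backslash escapes, one pattern per quote style.
SINGLE_QUOTED = r"'[^'\\]*(?:\\.[^'\\]*)*'"
DOUBLE_QUOTED = r'"[^"\\]*(?:\\.[^"\\]*)*"'


def replace_outside_quotes(text, word, replacement):
    """Replace whole-word occurrences of word that are not inside a string literal."""
    pattern = re.compile(
        "(" + SINGLE_QUOTED + "|" + DOUBLE_QUOTED + r")|\b" + re.escape(word) + r"\b"
    )
    # Group 1 is set only when a quoted literal matched: put it back unchanged.
    return pattern.sub(lambda m: m.group(1) if m.group(1) else replacement, text)


sample = r"""foe = ' I\'m your father, foe ! ' + "foe" + foe"""
result = replace_outside_quotes(sample, "foe", "NEWORD")
print(result)
```

As with the original answer, unbalanced quotes outside string literals (for example in comments) can still throw the matching off.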
You can try this:-
((?!\'[\w\s]*)foe(?![\w\s]*\'))
How about this regular expression:
>>> s = '''var foe = 10;
foe = "";
dark_vador = 'bad guy'
' I\m your father, foe ! '
bar = thingy + foe'''
>>>
>>> re.findall(r'(?!\'.*)foe(?!.*\')', s)
['foe', 'foe', 'foe']
The trick here is to make sure the expression does not match any occurrence that has a leading and a trailing ', and to remember to account for the characters in between, hence the .* in the lookaheads.
((?!\'[\w\s]*[\\']*[\w\s]*)foe(?![\w\s]*[\\']*[\w\s]*\'))
Capture group 1 of the following regular expression will contain matches of 'foe'.
r"^(?:[^'\n]|\\')*(?:(?<!\\)'(?:[^'\n]|\\')*(?:(?<!\\)')(?:[^'\n]|\\')*)*\b(foe)\b"
Python's regex engine performs the following operations.
^ : assert beginning of string
(?: : begin non-capture group
[^'\n] : match any char other than single quote and line terminator
| : or
\\' : match '\' then a single quote
) : end non-capture group
* : execute non-capture group 0+ times
(?: : begin non-capture group
(?<!\\) : next char is not preceded by '\' (negative lookbehind)
' : match single quote
(?: : begin non-capture group
[^'\n] : match any char other than single quote and line terminator
| : or
\\' : match '\' then a single quote
) : end non-capture group
* : execute non-capture group 0+ times
(?: : begin non-capture group
(?<!\\) : next char is not preceded by '\' (negative lookbehind)
' : match single quote
) : end non-capture group
(?: : begin non-capture group
[^'\n] : match any char other than single quote and line terminator
| : or
\\' : match '\' then a single quote
) : end non-capture group
* : execute non-capture group 0+ times
) : end non-capture group
* : execute non-capture group 0+ times
\b(foe)\b : match 'foe' in capture group 1

need regex expression to avoid " \n " character

I want to apply a regex to the string below in Python, where I only want to capture Model Number : 123. I tried the regex below, but it didn't fetch the result.
string = """Model Number : 123
Serial Number : 456"""
model_number = re.findall(r'(?s)Model Number:.*?\n',string)
The output is as follows: Model Number : 123\n. How can I avoid the \n at the end of the output?
Remove the DOTALL (?s) inline modifier to avoid matching a newline char with ., add \s* after Number and use .* instead of .*?\n:
r'Model Number\s*:.*'
See the regex demo
Here, Model Number will match a literal substring, \s* will match 0+ whitespaces, : will match a colon and .* will match 0 or more chars other than line break chars.
Python demo:
import re
s = """Model Number : 123
Serial Number : 456"""
model_number = re.findall(r'Model Number\s*:.*',s)
print(model_number) # => ['Model Number : 123']
If you need to extract just the number use
r'Model Number\s*:\s*(\d+)'
See another regex demo and this Python demo.
Here, (\d+) will capture 1 or more digits and re.findall will only return these digits. Or, use it with re.search and once the match data object is obtained, grab it with match.group(1).
NOTE: If the match must appear at the start of the string, use re.match. Or add ^ at the start of the pattern and use the re.M flag (or add (?m) at the start of the pattern) to match at the start of any line.
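The re.search variant with a line anchor can be sketched like this, using the string from the question; the None fallback is an assumption about how a missing match should be handled.

```python
import re

s = """Model Number : 123
Serial Number : 456"""

# re.M makes ^ match at the start of every line, not only at the start of s.
match = re.search(r"^Model Number\s*:\s*(\d+)", s, flags=re.M)
model_number = match.group(1) if match else None
print(model_number)
```

Checking the match object before calling group(1) avoids an AttributeError when the label is absent.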
You can also call strip() on each match:
[m.strip() for m in model_number]
This removes leading and trailing whitespace, including the trailing \n.

Regular Expression for a string contains if characters all in capital python

I'm extracting textual paragraph followed by text like "OBSERVATION #1" or "OBSERVATION #2" in the output from library like PyPDF2.
However, the extraction can introduce errors, so a heading may come out as "OBSERVA'TION #2", and I have to avoid matching text like "Suite #300". So the rule is: "IF THERE IS A CHARACTER, IT WOULD BE IN CAPITAL".
Currently the python code snippet like
inspection_observation=pdfFile.getPage(z).extractText()
if 'OBSERVATION' in inspection_observation:
    for finding in re.findall(r"[OBSERVATION] #\d+(.*?) OBSERVA'TION #\d?", inspection_observation, re.DOTALL):
        #print inspection_observation;
        print finding;
Please advise on the appropriate regular expression for this case.
If there should be a capital and the word can contain a ', you could use a character class where you can list the characters that are allowed and a positive lookahead.
Then you can capture the content between those capital words and use a positive lookahead to check if what follows is another capital word followed by # and 1+ digits or the end of the string. This regex makes use of re.DOTALL where the dot matches a newline.
(?=[A-Z']*[A-Z])[A-Z']+\s+#\d+(.*?(?=[A-Z']*[A-Z][A-Z']*\s+#\d+|$))
Explanation
(?=[A-Z']*[A-Z]) Positive lookahead to assert that what follows contains at least one char A-Z, where a ' can occur before it
[A-Z']+\s+#\d+ Match 1+ times A-Z or ', then 1+ whitespace chars, # and 1+ digits
( Capture group
.*? Match any char, as few times as possible
(?= Positive lookahead to assert that what follows is
[A-Z']*[A-Z][A-Z']* Match uppercase chars A-Z where a ' can occur before and after
\s+#\d+ Match 1+ whitespace chars, # and 1+ digits
|$ Or assert the end of the string
) Close positive lookahead
) Close capture group
Regex demo
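Applied in Python, the pattern could be used roughly as follows. This is a sketch with made-up page text standing in for the PyPDF2 output; HEADING_BLOCK and findings are illustrative names, and strip() is added on the assumption that surrounding whitespace is not wanted.

```python
import re

HEADING_BLOCK = re.compile(
    r"(?=[A-Z']*[A-Z])[A-Z']+\s+#\d+(.*?(?=[A-Z']*[A-Z][A-Z']*\s+#\d+|$))",
    re.DOTALL,  # let . cross line breaks inside a finding
)

# Hypothetical extracted text: one clean heading, one with a stray apostrophe.
page_text = (
    "OBSERVATION #1 First finding at Suite #300. "
    "OBSERVA'TION #2 Second finding."
)

findings = [body.strip() for body in HEADING_BLOCK.findall(page_text)]
print(findings)
```

"Suite #300" stays inside the first finding because "Suite" contains lowercase letters and so cannot satisfy the all-capitals heading part.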
