Regular expression for printing integers within brackets - python

This is my first time using regular expressions, and I can't get it working, although there are quite a few examples on Stack Overflow already.
How can I extract integers which are in a string inside bracket?
Example:
dijdi[d43] d5[55++][ 43] [+32]dm dij [ -99]x
would return
[43, 32, -99]
'+' and '-' are okay if they are at the beginning of the brackets, but not okay if they are in the middle or at the end. If the '+' sign is at the beginning, it should not be taken into account. (+54 --> 54)
Been trying :
re.findall('\[[-]?\d+\]',str)
but it's not working the way I want.

If you need to fail the match in [ +-34 ] (i.e. if you needn't extract a negative number if there is a + before it) you will need to use
\[\s*(?:\+|(-))?(\d+)\s*]
and when getting a match, concat the Group 1 and Group 2 values. See this regex demo.
Details
\[ - a [ char
\s* - 0+ whitespaces
(?:\+|(-))? - an optional non-capturing group matching either a + char or a - char; the - is captured into Group 1 (an empty string in the re.findall output when it is absent)
(\d+) - Capturing group 2: 1+ digits
\s* - 0+ whitespaces
] - a ] char.
In Python,
import re
text = "dijdi[d43] d5[55++][ 43] [+32]dm dij [ -99]x"
numbers_text = [f"{x}{y}" for x, y in re.findall(r'\[\s*(?:\+|(-))?(\d+)\s*]', text)]
numbers = list(map(int, numbers_text))
print(numbers)  # => [43, 32, -99]
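If a doubled sign such as [ +-34 ] does not need to be rejected, a simpler single-group pattern may be enough. The sketch below compares the two; the `lenient` pattern and the variable names are assumptions for illustration, not part of the answer above.

```python
import re

text = "dijdi[d43] d5[55++][ 43] [+32]dm dij [ -99]x"

# Simpler variant (assumed): an optional "+" is skipped and the optional "-"
# is captured together with the digits, so findall returns plain strings.
lenient = r'\[\s*\+?(-?\d+)\s*]'
# Stricter pattern from the answer: at most one sign is allowed.
strict = r'\[\s*(?:\+|(-))?(\d+)\s*]'

lenient_numbers = [int(m) for m in re.findall(lenient, text)]
strict_numbers = [int(sign + digits) for sign, digits in re.findall(strict, text)]
print(lenient_numbers)  # [43, 32, -99]
print(strict_numbers)   # [43, 32, -99]

# The two differ only on doubled signs such as "[ +-34 ]".
edge_case = "[ +-34 ]"
lenient_edge = re.findall(lenient, edge_case)  # ['-34']
strict_edge = re.findall(strict, edge_case)    # []
```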

If you want to extract integers from a string the code that I use is this:
def stringToNumber(inputStr):
    myNumberList = []
    for s in inputStr.split():
        newString = ''.join(i for i in s if i.isdigit())
        if (len(newString) != 0):
            myNumberList.append(newString)
    return myNumberList
I hope it works for you.

If you've not done so I suggest you switch to the PyPI regex module. Using it here with regex.findall and the following regular expression allows you to extract just what you need.
r'\[ *\+?\K-?\d+(?= *\])'
regex engine <¯\(ツ)/¯> Python code
At the regex tester pass your cursor across the regex for details about individual tokens.
The regex engine performs the following operations.
\[ : match '['
\ * : match 0+ spaces
\+? : optionally match '+'
\K : forget everything matched so far and reset the start of the match to the current position
-? : optionally match '-'
\d+ : match 1+ digits
(?= *\]) : use positive lookahead to assert that the last digit matched is followed by 0+ spaces then ']'

Related

How to use "?" in regular expression to change a qualifier to be non-greedy and find a string in the middle of the data? [duplicate]

I have a text like this;
[Some Text][1][Some Text][2][Some Text][3][Some Text][4]
I want to match [Some Text][2] with this regex;
/\[.*?\]\[2\]/
But it returns [Some Text][1][Some Text][2]
How can i match only [Some Text][2]?
Note : There can be any character in Some Text including [ and ] And the numbers in square brackets can be any number not only 1 and 2. The Some Text that i want to match can be at the beginning of the line and there can be multiple Some Texts
The \[.*?\]\[2\] pattern works like this:
\[ - finds the leftmost [ (as the regex engine processes the string input from left to right)
.*? - matches any 0+ chars other than line break chars, as few as possible, but as many as needed for a successful match, as there are subsequent patterns, see below
\]\[2\] - ][2] substring.
So, the .*? gets expanded upon each failure until it finds the leftmost ][2]. Note the lazy quantifiers do not guarantee the "shortest" matches.
Solution
Instead of a .*? (or .*) use negated character classes that match any char but the boundary char.
\[[^\]\[]*\]\[2\]
See this regex demo.
Here, .*? is replaced with [^\]\[]* - 0 or more chars other than ] and [.
Other examples:
Strings between angle brackets: <[^<>]*> matches <...> with no < and > inside
Strings between parentheses: \([^()]*\) matches (...) with no ( and ) inside
Strings between double quotation marks: "[^"]*" matches "..." with no " inside
Strings between curly braces: \{[^{}]*} matches {...} with no { and } inside
In other situations, when the starting pattern is a multichar string or complex pattern, use a tempered greedy token, (?:(?!start).)*?. To match abc 1 def in abc 0 abc 1 def, use abc(?:(?!abc).)*?def.
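The difference between the three approaches can be checked with a short sketch. It assumes Python's re module rather than the JavaScript used in the question (these particular patterns behave the same way in both), and the variable names are illustrative.

```python
import re

s = "[Some Text][1][Some Text][2][Some Text][3][Some Text][4]"

# Lazy dot: the match still starts at the leftmost "[" and expands to "][2]".
lazy = re.search(r'\[.*?\]\[2\]', s).group()

# Negated character class: the match cannot cross a "[" or "]".
negated = re.search(r'\[[^\]\[]*\]\[2\]', s).group()

# Tempered greedy token: each consumed char must not start another "abc".
tempered = re.search(r'abc(?:(?!abc).)*?def', 'abc 0 abc 1 def').group()

print(lazy)      # [Some Text][1][Some Text][2]
print(negated)   # [Some Text][2]
print(tempered)  # abc 1 def
```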
You could try the below regex,
(?!^)(\[[A-Z].*?\]\[\d+\])

Pandas regex to remove digits before consecutive dots

I have a string Node57Name123.grpObject12.grp23Symbol43.shape52.anotherobject25.shape23.
I want to remove all the numbers that come right before a dot, at the end of a word.
The first part of the string, i.e. "Node57Name123", should be ignored.
Digits inside words should not be removed.
I tried re.sub(r"\d+","",string), but it removed all the other digits as well.
The output should look like this: "Node57Name123.grpObject.grp23Symbol.shape.anotherobject.shape"
Can you please point me in the right direction?
You can use
re.sub(r'^([^.]*\.)|\d+(?![^.])', r'\1', text)
See the regex demo.
Details:
^([^.]*\.) - zero or more chars other than a dot and then a . char at the start of the string captured into Group 1 (referred to with \1 from the replacement pattern)
| - or
\d+(?![^.]) - one or more digits followed with a dot or end of string (=(?=\.|$)).
See the Python demo:
import re
text = r'Node57Name123.grpObject12.grp23Symbol43.shape52.anotherobject25.shape23'
print( re.sub(r'^([^.]*\.)|\d+(?![^.])', r'\1', text) )
## => Node57Name123.grpObject.grp23Symbol.shape.anotherobject.shape
Just to give you a non-regex alternative, using rstrip(): we can feed this function a set of characters to remove from the right of the string, e.g. rstrip('0123456789'). Alternatively, we can use the digits constant from the string module:
from string import digits
s = 'Node57Name123.grpObject12.grp23Symbol43.shape52.anotherobject25.shape23'
x = '.'.join([s.split('.')[0]] + [i.rstrip(digits) for i in s.split('.')[1:]])
print(x)
Prints:
Node57Name123.grpObject.grp23Symbol.shape.anotherobject.shape
EDIT:
If you must use a regular pattern, it seems that the following covers your sample:
(\.[^.]*?)\d+\b
Replace with the 1st capture group, see the online demo
( - Open capture group:
\.[^.]*? - A literal dot followed by 0+ non-dot characters (lazy).
) - Close capture group.
\d+\b - Match 1+ digits up to a word-boundary.
A sample:
import re
s = 'Node57Name123.grpObject12.grp23Symbol43.shape52.anotherobject25.shape23'
x = re.sub(r'(\.[^.]*?)\d+\b', r'\1', s)
print(x)
Prints:
Node57Name123.grpObject.grp23Symbol.shape.anotherobject.shape

Extracting the text after the initial substrings between square brackets

I would like to extract the substring from the string, such as
Case 1:
text = "some_txt" # → some_txt
Case 2:
text = "[info1]some_txt" # → some_txt
Case 3:
text = "[info1][info2] some_text" # → some_text
Case 4:
text = "[info1][info2] some_text_with_[___]_abc" # → some_text_with_[___]_abc
What I did was
match = re.search("^\[.+\] (.*)", text)
if match:
    result = match.group(1)
It works okay except case 4, which gives abc only. I want to get some_text_with_[___]_abc instead.
Any help will be greatly appreciated.
With your current code, you can use
r"^(?:\[[^][]+](?:\s*\[[^][]+])*)?\s*(.*)"
See the regex demo.
If you are not actually interested in whether there is a match or not, you may use re.sub to remove these bracketed substrings from the start of the string using
re.sub(r'^\[[^][]+](?:\s*\[[^][]+])*\s*', '', text)
See another regex demo.
Regex details
^ - start of string
(?:\[[^][]+](?:\s*\[[^][]+])*)? - an optional occurrence of
\[[^][]+] - a [, then any one or more chars other than [ and ] as many as possible and then a ]
(?:\s*\[[^][]+])* - zero or more occurrences of zero or more whitespaces and then a [, then any one or more chars other than [ and ] as many as possible and then a ]
\s* - zero or more whitespaces
(.*) - Group 1: any zero or more chars other than line break chars, as many as possible.
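As a quick check, the pattern can be run against the four cases from the question. This is a minimal sketch; the helper name `strip_bracket_prefix` and the constant `PREFIX_RE` are hypothetical.

```python
import re

# Optional run of leading [..] groups, optional whitespace, then the rest.
PREFIX_RE = re.compile(r'^(?:\[[^][]+](?:\s*\[[^][]+])*)?\s*(.*)')

def strip_bracket_prefix(text):
    # The bracketed prefix is optional, so the pattern matches any string.
    return PREFIX_RE.search(text).group(1)

cases = [
    "some_txt",
    "[info1]some_txt",
    "[info1][info2] some_text",
    "[info1][info2] some_text_with_[___]_abc",
]
results = [strip_bracket_prefix(c) for c in cases]
print(results)
# ['some_txt', 'some_txt', 'some_text', 'some_text_with_[___]_abc']
```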

Creating regular expression for extracting specific measurements

I am trying to extract measurements from a file using Python. I want to extract them with specification words. For example:
Width 3.5 in
Weight 10 kg
I used the following code:
p = re.compile('\b?:Length|Width|Height|Weight (?:\.\d{1,2}|\d{1,4}\.?\d{0,2}|\d{5}\.?\d?|\d{6}\.?) (?:in|oz|lbs|VAC|Hz|amps|H.P.)\b')
print(p.findall(text))
However, it only outputs the first word (just "Height" or "Length") and completely misses the rest. Is there something I should fix in the above regular expression?
=====
UPDATE:
For some reason, online regex tester and my IDE give me completely different results for the same pattern:
expression = r"""\b
(?:
[lL]ength\ +(?P<Length>\d+(?:\.\d+)?|\d+-\d+\/\d+)\ +(?:in|ft|cm|m)|
[wW]idth\ +(?P<Width>\d+(?:\.\d+)?)\ +(?:in|ft|cm|m)|
[wW]eight\ +(?P<Weight>\d+(?:\.\d+)?|\d+-\d)\ +(?:oz|lb|g|kg)|
Electrical\ +(?P<Electrical>[^ ]+)\ +(?:VAC|Hz|[aA]mps)
)
\b
"""
print(re.findall(expression,text,flags=re.X|re.MULTILINE|re.I))
returns me [('17-13/16', '', '', '')] for the same input.
Is there something I should update?
Consider using the following regular expression, which ties the format of the values and the units of measurement to the element being matched.
\b
(?:
Length\ +(?<Length>\d+(?:\.\d+)?)\ +(?:in|ft|cm|m)|
Width\ +(?<Width>\d+(?:\.\d+)?)\ +(?:in|ft|cm|m)|
Weight\ +(?<Weight>\d+)\ +(?:oz|lb|g|kg)
)
\b
I've written this with the x ("extended") flag (which ignores whitespace) to make it easier to read. For that reason I needed to escape the space characters. (Alternatively, I could have put each in a character class.)
As seen, "Length" and "Width" require the value to be an integer or a float and the units to be any of "in", "ft", "cm" or "m", whereas "Weight" requires the value to be an integer and the units to be any of "oz", "lb", "g" or "kg". It could of course be extended in the obvious way.
Start your engine!
The regex engine performs the following operations. (Python's re module spells named groups (?P<Length>...) rather than (?<Length>...).)
\b : assert word boundary
(?: : begin non-capture group
Length\ + : match 'Length' then 1+ spaces
(?<Length> : begin named capture group 'Length'
\d+ : match 1+ digits
(?:\.\d+)? : optionally match '.' then 1+ digits
) : close named capture group
\ + : match 1+ spaces
(?:in|ft|cm|m) : match 'in', 'ft', 'cm' or 'm' in a non-capture group
| : or
Width\ + : similar to above
(?<Width> : ""
\d+ : ""
(?:\.\d+)? : ""
) : ""
\ + : ""
(?:in|ft|cm|m) : ""
| : ""
Weight\ + : ""
(?<Weight>\d+) : match 1+ digits in capture group 'Weight'
\ + : similar to above
(?:oz|lb|g|kg) : ""
) : end non-capture group
\b : assert word boundary
To allow "Length" to be expressed in fractional amounts, change
(?<Length>
\d+
(?:\.\d+)?
)
to
(?<Length>
\d+
(?:\.\d+)?
| : or
\d+-\d+\/\d+ : match 1+ digits, '-' 1+ digits, '/', 1+ digits
)
To add an element to the alternation for "Electrical", append a pipe (|) at the end of the "Weight" row and insert the following before the last right parenthesis.
Electrical\ + : match 'Electrical' then 1+ spaces
(?<Electrical> : begin capture group 'Electrical'
[^ ]+ : match 1+ characters other than spaces
) : close named capture group
\ + : match 1+ spaces
(?:VAC|Hz|[aA]mps) : match 'VAC', 'Hz' or 'amps' in a
non-capture group
Here I've made the electrical value merely a string of characters other than spaces because values for 'Hz' (e.g., 50-60) are different from those for 'VAC' and 'amps'. That could be fine-tuned if necessary.
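Put together for Python, the pattern might look like the sketch below. It assumes the (?P<name>...) group syntax that Python's re module requires, a made-up sample text, and re.finditer with lastgroup in place of re.findall, so that each match yields only the group that actually participated instead of a tuple padded with empty strings.

```python
import re

MEASUREMENT_RE = re.compile(r"""
    \b
    (?:
        Length\ +(?P<Length>\d+(?:\.\d+)?|\d+-\d+/\d+)\ +(?:in|ft|cm|m) |
        Width\ +(?P<Width>\d+(?:\.\d+)?)\ +(?:in|ft|cm|m)               |
        Weight\ +(?P<Weight>\d+)\ +(?:oz|lb|g|kg)                       |
        Electrical\ +(?P<Electrical>[^ ]+)\ +(?:VAC|Hz|amps)
    )
    \b
""", re.VERBOSE | re.IGNORECASE)

# Hypothetical input; the real file layout is not shown in the question.
text = """Length 17-13/16 in
Width 3.5 in
Weight 10 kg
Electrical 120 VAC"""

found = {}
for m in MEASUREMENT_RE.finditer(text):
    # lastgroup is the name of the one named group that matched.
    found[m.lastgroup] = m.group(m.lastgroup)

print(found)
# {'Length': '17-13/16', 'Width': '3.5', 'Weight': '10', 'Electrical': '120'}
```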
There are a few issues with the pattern:
You cannot put a quantifier ? after the word boundary
The alternatives Length|Width etc. should be within a grouping structure
Add kg to the last alternation
Escape the dots to match them literally
Assert a whitespace boundary at the end with (?!\S), because H.P. is one of the options and will not match with \b when it is followed by a space, for example
For example
\b(?:Length|Width|Height|Weight) (?:\.\d{1,2}|\d{1,4}\.?\d{0,2}|\d{5}\.?\d?|\d{6}\.?) (?:in|oz|lbs|VAC|Hz|amps|H\.P\.|kg)(?!\S)
Regex demo | Python demo
Also note Wiktor Stribiżew's comment about \b. This page explains the difference.
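A short sketch of the corrected pattern in Python follows. The sample strings are made up for illustration; the last two searches show why (?!\S) is used at the end instead of \b.

```python
import re

pattern = re.compile(
    r'\b(?:Length|Width|Height|Weight) '
    r'(?:\.\d{1,2}|\d{1,4}\.?\d{0,2}|\d{5}\.?\d?|\d{6}\.?) '
    r'(?:in|oz|lbs|VAC|Hz|amps|H\.P\.|kg)(?!\S)'
)

# Hypothetical input: the last line ends in "in." and should be rejected.
text = "Width 3.5 in\nWeight 10 kg\nHeight 12 in."
matches = pattern.findall(text)
print(matches)  # ['Width 3.5 in', 'Weight 10 kg']

# After the final dot of "H.P." a space follows: two non-word chars in a
# row, so \b fails there, while (?!\S) only requires "no non-space next".
with_word_boundary = re.search(r'H\.P\.\b', 'rated 2 H.P. motor')
with_ws_boundary = re.search(r'H\.P\.(?!\S)', 'rated 2 H.P. motor')
print(with_word_boundary)        # None
print(with_ws_boundary.group())  # H.P.
```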

Invalid pattern in look-behind

Why does this regex work in Python but not in Ruby:
/(?<!([0-1\b][0-9]|[2][0-3]))/
Would be great to hear an explanation and also how to get around it in Ruby
EDIT w/ the whole line of code:
re.sub(r'(?<!([0-1\b][0-9]|[2][0-3])):(?!([0-5][0-9])((?i)(am)|(pm)|(a\.m)|(p\.m)|(a\.m\.)|(p\.m\.))?\b)' , ':\n' , s)
Basically, I'm trying to add '\n' when there is a colon and it is not a time.
Ruby's regex engine doesn't allow capturing groups in negative look-behinds.
If you need grouping, you can use a non-capturing group (?:):
[8] pry(main)> /(?<!(:?[0-1\b][0-9]|[2][0-3]))/
SyntaxError: (eval):2: invalid pattern in look-behind: /(?<!(:?[0-1\b][0-9]|[2][0-3]))/
[8] pry(main)> /(?<!(?:[0-1\b][0-9]|[2][0-3]))/
=> /(?<!(?:[0-1\b][0-9]|[2][0-3]))/
Docs:
(?<!subexp) negative look-behind
Subexp of look-behind must be fixed-width.
But top-level alternatives can be of various lengths.
ex. (?<=a|bc) is OK. (?<=aaa(?:b|cd)) is not allowed.
In negative look-behind, capturing group isn't allowed,
but non-capturing group (?:) is allowed.
Learned from this answer.
According to the Onigmo regex documentation, capturing groups are not supported in negative lookbehinds. Although this restriction is common among regex engines, not all of them treat it as an error, hence the difference you see between the re and Onigmo regex libraries.
Now, as for your regex, it does not work correctly in either Ruby or Python: the \b inside a character class in a Python and Ruby regex matches a BACKSPACE (\x08) char, not a word boundary. Moreover, when you use a word boundary after an optional non-word char, if the char appears in the string a word char must appear immediately to the right of that non-word char. The word boundary must be moved to right after m before \.?.
Another flaw with the current approach is that lookbehinds are not the best to exclude certain contexts like here. E.g. you can't account for a variable amount of whitespaces between the time digits and am / pm. It is better to match the contexts you do not want to touch and match and capture those you want to modify. So, we need two main alternatives here, one matching am/pm in time strings and another matching them in all other contexts.
Your pattern also has too many alternatives that can be merged using character classes and ? quantifiers.
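Both \b pitfalls mentioned above can be reproduced in a few lines. This is a minimal sketch with made-up sample strings and illustrative variable names.

```python
import re

# Inside a character class, \b is the backspace char (\x08), not a boundary.
backspace_match = re.search(r'[\b]', 'a\x08b')
boundary_in_class = re.search(r'[\b]', 'a b')
print(backspace_match is not None)  # True
print(boundary_in_class)            # None

sample = '10 a.m. today'

# \b after the optional dot: a dot followed by a space is not a word
# boundary, so the engine backtracks and leaves the final dot unmatched.
boundary_last = re.search(r'a\.?m\.?\b', sample).group()
# \b moved right after "m": the trailing dot is now part of the match.
boundary_moved = re.search(r'a\.?m\b\.?', sample).group()
print(boundary_last)   # a.m
print(boundary_moved)  # a.m.
```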
Regex demo
\b((?:[01]?[0-9]|2[0-3]):[0-5][0-9]\s*[pa]\.?m\b\.?)|\b[ap]\.?m\b\.?
\b - word boundary
((?:[01]?[0-9]|2[0-3]):[0-5][0-9]\s*[pa]\.?m\b\.?) - capturing group 1:
(?:[01]?[0-9]|2[0-3]) - an optional 0 or 1 and then any digit or 2 and then a digit from 0 to 3
:[0-5][0-9] - : and then a number from 00 to 59
\s* - 0+ whitespaces
[pa]\.?m\b\.? - a or p, an optional dot, m, a word boundary, an optional dot
| - or
\b[ap]\.?m\b\.? - word boundary, a or p, an optional dot, m, a word boundary, an optional dot
Python fixed solution:
import re
text = 'am pm P.M. 10:56pm 10:43 a.m.'
rx = r'\b((?:[01]?[0-9]|2[0-3]):[0-5][0-9]\s*[pa]\.?m\b\.?)|\b[ap]\.?m\b\.?'
result = re.sub(rx, lambda x: x.group(1) if x.group(1) else "\n", text, flags=re.I)
Ruby solution:
text = 'am pm P.M. 10:56pm 10:43 a.m.'
rx = /\b((?:[01]?[0-9]|2[0-3]):[0-5][0-9]\s*[pa]\.?m\b\.?)|\b[ap]\.?m\b\.?/i
result = text.gsub(rx) { $1 || "\n" }
Output:
"\n \n \n 10:56pm 10:43 a.m."
For sure, @mrzasa found the problem.
But, taking a guess at your intent to replace a non-time colon with ':\n', it could be done like this. It does a little whitespace trimming as well.
(?i)(?<!\b[01][0-9])(?<!\b[2][0-3])([^\S\r\n]*:)[^\S\r\n]*(?![0-5][0-9](?:[ap]\.?m\b\.?)?)
PCRE - https://regex101.com/r/7TxbAJ/1 Replace $1\n
Python - https://regex101.com/r/w0oqdZ/1 Replace \1\n
Readable version
(?i)
(?<!
    \b [01] [0-9]
)
(?<!
    \b [2] [0-3]
)
(                            # (1 start)
    [^\S\r\n]*
    :
)                            # (1 end)
[^\S\r\n]*
(?!
    [0-5] [0-9]
    (?: [ap] \.? m \b \.? )?
)
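For reference, here is how the pattern might be applied with re.sub in Python. It is a sketch on a made-up input string; the names `COLON_RE`, `s` and `result` are illustrative.

```python
import re

COLON_RE = re.compile(
    r'(?i)(?<!\b[01][0-9])(?<!\b[2][0-3])([^\S\r\n]*:)[^\S\r\n]*'
    r'(?![0-5][0-9](?:[ap]\.?m\b\.?)?)'
)

# Hypothetical input: one ordinary colon and one colon inside a time.
s = "Note: see 10:30 pm"

# Keep the colon (Group 1), drop the horizontal whitespace after it,
# and append a newline; the colon in "10:30" is left untouched.
result = COLON_RE.sub(r'\1\n', s)
print(repr(result))  # 'Note:\nsee 10:30 pm'
```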
