Issue with python regex query: why does (-)? not capture a match? [duplicate] - python

This question already has answers here:
re.findall behaves weird
(3 answers)
Closed 2 years ago.
I want to capture numbers and number ranges from a list: ["op.15", "Op.16-17", "Op16,17,18"]
match = re.compile(r"\d+[-]?\d+").findall(text)
This gives the correct result:
op.15 ['15']
Op.16-17 ['16-17']
Op16,17,18 ['16', '17', '18']
but this doesn't work:
match = re.compile(r"\d+(-)?\d+").findall(text)
op.15 ['']
Op.16-17 ['-']
Op16,17,18 ['', '', '']
What's the issue here? I want to add alternatives to -, such as "to" (i.e. -|to), which doesn't work inside [].

The documentation for findall in re module says
Return a list of all non-overlapping matches in the string. If one or
more capturing groups are present in the pattern, return a list of
groups; this will be a list of tuples if the pattern has more than one
group. Empty matches are included in the result.
In your first regex you don't provide any capture groups, so you get a list of non-overlapping full matches, i.e. one or more digits, followed by zero or one hyphen, followed by one or more digits.
In your second regex you changed the [ ], which means "match any character in this set", to ( ), which is a capture group. So now you are saying: match one or more digits, then capture zero or one hyphen, then match one or more digits.
Since the pattern now contains a capture group, findall no longer returns the full non-overlapping match, as the documentation says; it returns only the capture group, i.e. whatever is inside the ( ). That is an empty string if there was no hyphen, or - if there was one.
To fix the issue, use a non-capturing group: r"\d+(?:-)?\d+".
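A minimal sketch comparing the two kinds of group with findall. The extra input "Op.16 to 17" and the `\s*to\s*` alternative are assumptions added for illustration; they cover the "to" case mentioned in the question.

```python
import re

texts = ["op.15", "Op.16-17", "Op16,17,18", "Op.16 to 17"]

# Capturing group: findall returns only what the group captured.
capturing = re.compile(r"\d+(-)?\d+")

# Non-capturing group: findall returns the whole match,
# and the group can hold alternatives such as "-" or "to".
non_capturing = re.compile(r"\d+(?:-|\s*to\s*)?\d+")

for text in texts:
    print(text, capturing.findall(text), non_capturing.findall(text))
```

Like the original pattern, this still needs at least two digits per match, because `\d+` appears on both sides of the optional separator.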

Related

Regex group doesn't capture all of matched part of string [duplicate]

This question already has an answer here:
Why Does a Repeated Capture Group Return these Strings?
(1 answer)
Closed 1 year ago.
I have the following regex: '(/[a-zA-Z]+)*/([a-zA-Z]+)\.?$'.
Given the string "/foo/bar/baz", I expect the first captured group to be "/foo/bar". However, I get the following:
>>> import re
>>> regex = re.compile(r'(/[a-zA-Z]+)*/([a-zA-Z]+)\.?$')
>>> match = regex.match('/foo/bar/baz')
>>> match.group(1)
'/bar'
Why isn't the whole expected group being captured?
Edit: It's worth mentioning that the strings I'm trying to match are parts of URLs. To give you an idea, it's the part of the URL that would be returned from window.location.pathname in javascript, only without file extensions.
This part of the pattern repeats a capture group:
(/[a-zA-Z]+)*
However, as already discussed in another thread, quoting @ByteCommander:
If your capture group gets repeated by the pattern (you used the + quantifier on the surrounding non-capturing group), only the last value that matches it gets stored.
That is why you only see the last match, "/bar". What you can do instead is take advantage of the greedy matching of .* up to the last / with the pattern (/.*)/:
regex = re.compile(r'(/.*)/([a-zA-Z]+)\.?$')
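A small sketch of the behaviour described above, using the path from the question. The third pattern is an alternative not taken from the answers: it wraps the whole repetition in an outer capturing group and makes the inner one non-capturing.

```python
import re

path = "/foo/bar/baz"

# A repeated capturing group keeps only its last iteration.
repeated = re.match(r"(/[a-zA-Z]+)*/([a-zA-Z]+)\.?$", path)
print(repeated.group(1))  # '/bar'

# Greedy .* runs to the end and backtracks to the last '/'.
greedy = re.match(r"(/.*)/([a-zA-Z]+)\.?$", path)
print(greedy.group(1))  # '/foo/bar'

# Alternative (assumption, not from the answers): capture the repetition as a whole.
wrapped = re.match(r"((?:/[a-zA-Z]+)*)/([a-zA-Z]+)\.?$", path)
print(wrapped.group(1))  # '/foo/bar'
```

The wrapped form keeps the stricter letters-only check on every path segment, which the `.*` version gives up.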
You don't need the * between the two expressions here; also move the first / into the brackets:
>>> regex = re.compile(r'([/a-zA-Z]+)/([a-zA-Z]+)\.?$')
>>> regex.match('/foo/bar/baz').group(1)
'/foo/bar'
In this case, you may not need a regex at all.
You can simply use the split function.
text = "/foo/bar/baz"
"/".join(text.split("/", 3)[:3])
output:
/foo/bar
text.split("/", 3) splits the string at most three times, i.e. at the first three occurrences of /, and then you can join the desired elements.
As suggested by Niel, you can use a negative index to extract everything but the last part of a URL (or a path).
In this case the generic approach would be:
text = "/foo/bar/baz/boo/bye"
"/".join(text.split("/", -1)[:-1])
Output:
/foo/bar/baz/boo
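As a possible shortcut (not mentioned in the answers above), str.rsplit with a limit of 1 splits only at the last /, so no join is needed. A minimal sketch:

```python
path = "/foo/bar/baz/boo/bye"

# Split once, starting from the right: everything before the last "/" and the final part.
parent, last = path.rsplit("/", 1)

print(parent)  # /foo/bar/baz/boo
print(last)    # bye
```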

regex - Extract complete word base on match string [duplicate]

This question already has answers here:
re.findall behaves weird
(3 answers)
Closed 2 years ago.
Can someone please help me with this? I'm trying to extract the word from a given sentence that contains one of the units G, GM, KG, L, ML, PCS together with a number.
I am able to match the string, but I'm not sure how to extract the complete word.
For example, if my input is "This packet contains 250G Dates", the output should be 250G.
Another example: for "You paid for 2KG Apples", the output should be 2KG.
With my regular expression I'm getting only part of the match, not the complete word :(
import re
val = 'FUJI ALUMN FOIL CAKE, 240G, CHCLTE'
key_vals = ['G','GM','KG','L','ML','PCS']
re.findall(r"\d+\.?\d*(\s|G|KG|GM|L|ML|PCS)\s?", val)
This regex will not get you what you want:
re.findall(r"\d+\.?\d*(\s|G|KG|GM|L|ML|PCS)\s?", val)
Let's break it down:
\d+: one or more digits
\.?: a dot (optional, as indicated by the question mark)
\d*: zero or more digits
(\s|G|KG|GM|L|ML|PCS): a group of alternatives, but whitespace is one of the alternatives when it should be outside the group. What you probably want is to allow optional whitespace between the number and the unit, i.e. 240G or 240 G
\s?: optional whitespace
A better expression for your purpose could be:
re.findall(r"\d+\s*(?:G|KG|GM|L|ML|PCS)", val)
That means: one or more digits, followed by optional whitespace and then either of these units: G|KG|GM|L|ML|PCS.
Note the presence of ?: to indicate a non-capturing group. Without it, findall would return only the unit: ['G'].
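A hedged variation on the expression above. Python tries alternatives left to right, so with G listed before GM an input like 500GM would match only 500G. The sketch below lists the longer units first, adds a word boundary, and allows an optional decimal part. The name UNIT_PATTERN and the last two sample strings are assumptions for illustration.

```python
import re

# Longer units first so "GM" is not cut short to "G"; \b stops at the end of the unit.
UNIT_PATTERN = re.compile(r"\d+(?:\.\d+)?\s*(?:KG|GM|G|ML|L|PCS)\b")

samples = [
    "This packet contains 250G Dates",
    "You paid for 2KG Apples",
    "FUJI ALUMN FOIL CAKE, 240G, CHCLTE",
    "Sugar 500GM",      # hypothetical input
    "Milk 1.5 L pack",  # hypothetical input
]

for sample in samples:
    print(UNIT_PATTERN.findall(sample))
```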
Try using this Regex:
\d+\s*(G|KG|GM|L|ML|PCS)\s?
It matches every string that starts with at least one digit followed by one of the units. There can also be whitespace between the digits and the unit and after the unit. Note that the parentheses here form a capturing group, so re.findall returns only the unit; use re.search and group(0), or re.finditer, to get the whole match.
Adjust this as you like :)
Use non-capturing parentheses (?:...) instead of the normal ones. Without capturing groups, findall returns the string(s) which match the whole pattern.

Regex get all possible occurrence in Python

I have a string s = '10000'.
Using only Python's re.findall, I need to count how many times 0\d0 occurs in the string s.
For example, for the string s = '10000' it should return 2.
Explanation:
the first occurrence is 1[000]0, while the second occurrence is 10[000] (the two occurrences overlap).
I just need the number of occurrences; I'm not interested in the matched text.
I've tried the following regex statements:
re.findall(r'(0\d0)', s) #output: ['000']
re.findall(r'(0\d0)*', s) #output: ['', '', '000', '', '', '']
Finally, how can I make this regex generic, so that it matches any digit, then any digit (possibly the same one), then the first digit again?
How to get all possible occurrences?
The regex
As I mentioned in my comment, you can use the following pattern:
(?=(0\d0))
How it works:
(?=...) is a positive lookahead ensuring what follows matches. This doesn't consume characters (allowing us to check for a match at each position in the string as a regex would otherwise resume pattern matching after the consumed characters).
(0\d0) is a capture group matching 0, then any digit, then 0
The code
Your code becomes:
re.findall(r'(?=(0\d0))', s)
The result is:
['000', '000']
The documentation for Python's re.findall states the following:
If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group.
This means that our matches are the results of capture group 1 rather than the full match as many would expect.
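Since the question asks only for the count, a minimal sketch is to take the length of the list returned by findall. The plain pattern is included for contrast; it consumes the characters it matches and so finds only one occurrence.

```python
import re

s = '10000'

# Lookahead: zero-width, so a match can start at every position.
overlapping = re.findall(r'(?=(0\d0))', s)
print(overlapping)       # ['000', '000']
print(len(overlapping))  # 2

# Without the lookahead the first match consumes "000", leaving only "0".
non_overlapping = re.findall(r'0\d0', s)
print(len(non_overlapping))  # 1
```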
How to generalize the pattern?
The regex
You can use the following pattern:
(\d)\d\1
How this works:
(\d) captures any digit into capture group 1
\d matches any digit
\1 is a backreference that matches the same text as most recently matched by capture group 1
The code
Your code becomes:
x = re.findall(r'(?=((\d)\d\2))', s)
print([n[0] for n in x])
Note: The code above has two capture groups, so we need to change the backreference to \2 to match correctly. Since we now have two capture groups, we will get tuples as the documentation states and can use list comprehension to get the expected results.
The result is:
['000', '000']
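The generalized pattern can be wrapped in a small helper, sketched below. The function name overlapping_windows and the second input string are hypothetical and used only for illustration.

```python
import re

def overlapping_windows(s):
    """Return every 3-digit window whose first and last digits are equal."""
    # Group 1 is the whole window, group 2 is the repeated digit.
    pairs = re.findall(r'(?=((\d)\d\2))', s)
    return [whole for whole, _digit in pairs]

print(overlapping_windows('10000'))  # ['000', '000']
print(overlapping_windows('12131'))  # ['121', '131']
print(len(overlapping_windows('10000')))  # 2
```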

Use regex to identify 4 to 5 numbers that are (consecutive, i.e no whitespace or special characters included), without including preceding 0's

I am trying to use regular expressions to identify 4 to 5 digit numbers. The code below works in all cases except when consecutive 0's precede a one-, two- or three-digit number. I don't want '0054', '0008', or '0009' to match, but I do want '10354', '10032', '9005', and '9000' to match. Is there a good way to implement this using regular expressions? Here is my current code, which works for most cases except when leading 0's pad a shorter number out to 4 or 5 characters.
import re
line = 'US Machine Operations | 0054'
match = re.search(r'\d{4,5}', line)
if match is None:
    print(0)
else:
    print(int(match[0]))
You may use
(?<!\d)[1-9]\d{3,4}(?!\d)
See the regex demo.
NOTE: In Pandas str.extract, you must wrap the part you want to be returned with a capturing group, a pair of unescaped parentheses. So, you need to use
(?<!\d)([1-9]\d{3,4})(?!\d)
       ^            ^
Example:
df2['num_col'] = df2.Warehouse.str.extract(r'(?<!\d)([1-9]\d{3,4})(?!\d)', expand = False).astype(float)
Since you can simply use a capturing group here, you may also use an equivalent regex:
(?:^|\D)([1-9]\d{3,4})(?!\d)
Details
(?<!\d) - no digit immediately to the left
or (?:^|\D) - start of string or non-digit char (a non-capturing group is used so that only 1 capturing group could be accommodated in the pattern and let str.extract only extract what needs extracting)
[1-9] - a non-zero digit
\d{3,4} - three or four digits
(?!\d) - no digit immediately to the right is allowed
Python demo:
import re
s = "US Machine Operations | 0054 '0054','0008',or '0009' to be a match, but i would want '10354' or '10032', or '9005', or '9000'"
print(re.findall(r'(?<!\d)[1-9]\d{3,4}(?!\d)', s))
# => ['10354', '10032', '9005', '9000']
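Applied to the code from the question, the pattern could be used roughly as follows. This is a sketch: the helper name extract_number is hypothetical, and it keeps the question's behaviour of yielding 0 when nothing matches.

```python
import re

# No digit before, a non-zero first digit, 3 or 4 more digits, no digit after.
NUMBER_PATTERN = re.compile(r'(?<!\d)[1-9]\d{3,4}(?!\d)')

def extract_number(line):
    """Return the first 4 to 5 digit number without leading zeros, or 0 if none."""
    match = NUMBER_PATTERN.search(line)
    return int(match[0]) if match else 0

print(extract_number('US Machine Operations | 0054'))   # 0
print(extract_number('US Machine Operations | 10354'))  # 10354
print(extract_number('US Machine Operations | 9000'))   # 9000
```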

Why does this regex to find repeated characters fail?

I'm trying to build a regex to match any occurrence of two or more repeated alphanumeric characters. The following regex fails:
import re
s = '__commit__'
m = re.search(r'([a-zA-Z0-9])\1\1', s)
But when I change it to this it works:
m = re.search(r'([a-zA-Z0-9])\1+', s)
I'm pretty baffled as to why this is the way it is. Can anyone provide some insight?
Look at this line.
m = re.search(r'([a-zA-Z0-9])\1\1', s)
You are using a group and two backreferences (a backreference matches the same text that the group already matched). So it matches only when at least three identical consecutive characters appear. You can do:
m = re.search(r'([a-zA-Z0-9])\1', s)
This matches when at least two identical consecutive characters appear.
However, the following one is much better.
m = re.search(r'([a-zA-Z0-9])\1+', s)
That's because you are now matching one or more backreferences (\1+), i.e. at least two identical consecutive characters.
\1 is a back-reference to the text matched by the first capturing group. So the original regex, which does not work for you, essentially means:
Match three consecutive occurrences of the same alphanumeric character. The group ([a-zA-Z0-9]) matches a single character in a-z, A-Z or 0-9, and the two \1 in your regex are two back-references to that matched character.
In the second regex the back-reference \1 is followed by a +, which means match at least one more occurrence of the captured character, so a string matching this pattern must contain at least 2 identical consecutive characters.
Hope this helps.
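A short sketch of the three patterns side by side on the string from the question. The second test string 'aaab' is an assumption added to show what \1+ does with a longer run.

```python
import re

s = '__commit__'

# Needs three identical characters in a row; '__commit__' has none that qualify.
three = re.search(r'([a-zA-Z0-9])\1\1', s)
print(three)  # None

# Needs exactly two identical characters in a row.
two = re.search(r'([a-zA-Z0-9])\1', s)
print(two.group(0))  # 'mm'

# Two or more: \1+ keeps consuming repeats of the captured character.
run = re.search(r'([a-zA-Z0-9])\1+', 'aaab')
print(run.group(0))  # 'aaa'
```

The underscores do not count here because _ is not in the character class [a-zA-Z0-9].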
