Regex for parsing uid from URL


I am trying to parse UIDs from URLs. However, regex is not something I am good at, so I am looking for some help.
Example Input:
https://example.com/d/iazs9fEil/somethingelse?foo=bar
Example Output:
iazs9fEil
What I've tried so far is
([/d/]+[\d\x])\w+
This somewhat works, but returns the match with the /d/ prefix, so the output is /d/iazs9fEil.
How to change the regex to not contain the /d/ prefix?
EDIT:
I've tried the regex ([^/d/]+[\d\x])\w+, which outputs the correct string, iazs9fEil, but also returns the rest of the URL, here somethingelse?foo=bar.

In short, you may use
match = re.search(r'/d/(\w+)', your_string)  # Look for a match
if match:  # Check if there is a match first
    print(match.group(1))  # Now, get Group 1 value
See the regex demo.
NOTE
/ is not any special metacharacter, do not escape it in Python string patterns
([/d/]+[\d\x])\w+ does not do what you expect. [/d/]+ is a character class, so it matches one or more / or d characters in any order and amount, not the literal sequence /d/. [\d\x] is meant to match a digit or x, but Python rejects \x as an incomplete escape (sre_constants.error: incomplete escape \x) because \x must be followed by two hex digits. \w+ then matches one or more word characters. Because [/d/]+ consumes the slashes and the d, the /d/ prefix becomes part of the match and of Group 1.

Try (?<=/d/)[^/]+
Explanation:
(?<=/d/) - positive lookbehind, asserts that what precedes is /d/
[^/]+ - match one or more characters other than /, so it matches everything until /
Demo
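A minimal sketch of the lookbehind approach in Python, using the example URL from the question. The variable names url and uid are only illustrative.

```python
import re

url = "https://example.com/d/iazs9fEil/somethingelse?foo=bar"

# (?<=/d/) asserts that "/d/" sits immediately before the match
# without including it; [^/]+ then consumes up to the next slash.
match = re.search(r"(?<=/d/)[^/]+", url)
uid = match.group(0) if match else None
print(uid)
```

Note that [^/]+ only stops at a slash, so a URL such as .../d/abc?foo=bar (no slash after the UID) would also pick up the query string; [^/?]+ is one way to tighten it if that case matters.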

You could use a capturing group:
https?://.*?/d/([^/\s]+)
Regex demo
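As a rough illustration, the capturing-group pattern can be combined with re.findall to pull every UID out of a longer text. The second URL below is made up for the example.

```python
import re

# Sample text; the example.org URL is a hypothetical second input.
text = (
    "see https://example.com/d/iazs9fEil/somethingelse?foo=bar "
    "and http://example.org/x/d/Ab12_cd"
)

# With exactly one capturing group, findall returns just that group
# for each match, so the scheme and host are dropped automatically.
uids = re.findall(r"https?://.*?/d/([^/\s]+)", text)
print(uids)
```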

Related

Python path regex optional match

I have path strings like these:
tree/bee.horse_2021/moose/loo.se
bee.horse_2021/moose/loo.se
bee.horse_2021/mo.ose/loo.se
The path can be arbitrarily long after moose. Sometimes the first part of the path such as tree/ is missing, sometimes not. I want to capture tree in the first group if it exists and bee.horse in the second.
I came up with this regex, but it doesn't work:
path_regex = r'^(?:(.*)/)?([a-zA-Z]+\.[a-zA-Z]+).+$'
What am I missing here?

You can restrict the characters to be matched in the first capture group.
For example, you could match any character except / or . using a negated character class [^/\n.]+
^(?:([^/\n.]+)/)?([a-zA-Z]+\.[a-zA-Z]+).*$
Regex demo
Or you can restrict the characters to match word characters \w+ only
^(?:(\w+)/)?([a-zA-Z]+\.[a-zA-Z]+).*$
Regex demo
Note that in your pattern, the .+ at the end matches at least a single character. If you want to make that part optional, you can change it to .*
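The following sketch runs the negated-character-class version against the three sample paths and, for comparison, shows what the original pattern captures. The names PATH_RE, ORIGINAL_RE and paths are only for illustration.

```python
import re

# Restricted first group: it cannot cross a "/" or a ".".
PATH_RE = re.compile(r"^(?:([^/\n.]+)/)?([a-zA-Z]+\.[a-zA-Z]+).*$")
# Pattern from the question: the greedy (.*) runs up to the last "/".
ORIGINAL_RE = re.compile(r"^(?:(.*)/)?([a-zA-Z]+\.[a-zA-Z]+).+$")

paths = [
    "tree/bee.horse_2021/moose/loo.se",
    "bee.horse_2021/moose/loo.se",
    "bee.horse_2021/mo.ose/loo.se",
]

results = [PATH_RE.match(p).groups() for p in paths]
for path, groups in zip(paths, results):
    print(path, "->", groups)

# The original pattern swallows most of the path into group 1.
original_groups = ORIGINAL_RE.match(paths[0]).groups()
print(original_groups)
```

When the optional prefix is absent, group 1 is None rather than an empty string, so check for that before using it.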

Regex - Word boundary not working even with raw-string

I'm coding a set of regex to match dates in text using python. One of my regex was designed to match dates in the format MM/YYYY only. The regex is the following:
r'\b((?:(?:0)[0-9])|(?:(?:1)[0-2])|(?:(?:[1-9])))(?:\/|\\)(\d{4})\b'
Looks like the word boundary is not working as it is matching parts of dates like 12/02/2020 (it should not match this date format at all).
In the attached image, only the second pattern should have been recognized. The first one should not have matched at all, not even in part.
Remembering that the regex should match the MM/YYYY pattern in strings like:
"The range of dates go from 21/02/2020 to 21/03/2020 as specified above."
Can you help me find the error in my pattern to make it match only my goal format?

A word boundary, in most regex dialects, is a position between \w and \W (non-word char), or at the beginning or end of a string if it begins or ends (respectively) with a word character ([0-9A-Za-z_]).
What is a word boundary in regex?
What happens is that the / (or \) separator is not part of \w, so every separator in your string creates a word boundary next to the adjacent digits.
You have not provided the full string you are matching, but the example you posted can be solved just by using the anchors ^ and $:
^((?:(?:0)[0-9])|(?:(?:1)[0-2])|(?:(?:[1-9])))(?:\/|\\)(\d{4})$
https://regex101.com/r/xncZNN/1
edit:
Working from your full example and your regex, I did some clean-up because the pattern was a bit confusing, but I think I understood what you were trying to match.
Here is the new pattern:
(?<=^|[a-zA-Z ])(0[0-9]|1[0-2]|[1-9])(?:\/|\\)([\d]{4})(?=[a-zA-Z ]|$)
I have replaced the word boundaries with a positive lookbehind (?<=...) and a positive lookahead (?=...) that spell out which characters may appear before and after the date. You can adjust them to your specific need and add other characters such as digits. Note that Python's re module requires fixed-width lookbehinds and rejects (?<=^|[a-zA-Z ]); in Python, write the prefix as (?:^|(?<=[a-zA-Z ])) instead.
https://regex101.com/r/xncZNN/4

The problem is that \b\d{2}/\d{4}\b matches 02/2000 in the string 01/02/2000 because the first forward slash is a word break. The solution is to identify the characters that should not precede and follow the match and use negative lookarounds in place of word breaks. Here you could use the regular expression
r'(?<![\d/])(?:0[1-9]|1[0-2])/\d{4}(?![\d/])'
The negative lookbehind, (?<![\d/]), prevents the two digits representing the month from being preceded by a digit or forward slash; the negative lookahead, (?![\d/]), prevents the four digits representing the year from being followed by a digit or forward slash.
Regex demo
Python demo
If 6/2000 is to be matched as well as 06/2000, change (?:0[1-9] to (?:0?[1-9].
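A short sketch of the negative-lookaround pattern in use. The first sentence is the one from the question; the second sentence and the names MM_YYYY, full_dates and month_year are assumptions made for the example.

```python
import re

# MM/YYYY that is neither preceded nor followed by a digit or a slash.
MM_YYYY = re.compile(r"(?<![\d/])(?:0[1-9]|1[0-2])/\d{4}(?![\d/])")

full_dates = "The range of dates go from 21/02/2020 to 21/03/2020 as specified above."
month_year = "Invoices are due 06/2021 and 12/2022."

# Parts of DD/MM/YYYY dates are rejected by the lookbehind.
no_matches = MM_YYYY.findall(full_dates)
# Standalone MM/YYYY values are found.
matches = MM_YYYY.findall(month_year)
print(no_matches, matches)
```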

How to say "match anything until a specific character, then work your way backwards"?

I am often faced with patterns where the part which is interesting is delimited by a specific character, the rest does not matter. A typical example:
/dev/sda1 472437724 231650856 216764652 52% /
I would like to extract 52 (which can also be 9, or 100 - so 1 to 3 digits) by saying "match anything, then when you get to % (which is unique in that line), see before for the matches to extract".
I tried to code this as .*(\d*)%.* but the group is not matched:
.* match anything, any number of times
% ... until you get to the literal % (the \d is also matched by .* but my understanding is that once % is matched, the regex engine will work backwards, since it now has an "anchor" on which to analyze what was before -- please tell if this reasoning is incorrect, thank you)
(\d*) ... and now before that % you had a (\d*) to match and group
.* ... and the rest does not matter (match everything)

Your regex does not work because .* matches too much and the group matches too little. Because of the * quantifier, the group (\d*) is allowed to match nothing, which leaves everything to be matched by the .*.
Your description of .* is also somewhat incorrect. It actually matches everything up to the end of the line, then moves backwards until the rest of the pattern, (\d*)%.*, can match. For more info, see here.
In fact, I think your text can be matched simply by:
(\d{1,3})%
And getting group 1.
The logic of "keep looking until you find..." is kind of baked into the regex engine, so you don't need to explicitly say .* unless you want it in the match. In this case you just want the number before the % right?
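To make the difference concrete, here is a small sketch that runs both the original pattern and the simpler one against the sample line. The variable names are illustrative only.

```python
import re

line = "/dev/sda1 472437724 231650856 216764652 52% /"

# Original attempt: the greedy .* keeps the digits for itself and
# (\d*) is satisfied with an empty match right before the %.
greedy = re.match(r".*(\d*)%.*", line)
empty_group = greedy.group(1)

# Simpler pattern: let re.search scan for 1 to 3 digits followed by %.
direct = re.search(r"(\d{1,3})%", line)
percent = direct.group(1) if direct else None

print(repr(empty_group), percent)
```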

If you only want to extract the number, I would use a lookahead:
import re
pattern = r"\d+(?=%)"
string = "/dev/sda1 472437724 231650856 216764652 52% /"
returnedMatches = re.findall(pattern, string)  # ['52']
The positive lookahead (?=%) requires a % right after the digits without including it in the match. Using \d+ rather than \d* avoids an extra empty match at the position of the % itself.

In your pattern, the .* first matches to the end of the string. It then backtracks, giving up as little as possible, until it can match 0+ digits followed by a %.
The % is matched because matching 0+ digits is ok. Then you match again .* till the end of the string. There is a capturing group, only it is empty.
What you might do is add a word boundary or a space before the digits:
.* (\d{1,3})%.* or .*\b(\d{1,3})%.*
Regex demo 1 Or regex demo 2
Note that using .* (greedy) you will get the last instance of the digits and the % sign.
If you would make it non greedy, you would match the first occurrence:
.*?(\d{1,3})%.*
Regex demo
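The greedy versus non-greedy behaviour only shows up when a line contains more than one percentage, so this sketch uses a made-up line with two of them (the input string is an assumption, not from the question).

```python
import re

# Hypothetical input with two percentages on one line.
line = "cpu 10% disk 52%"

# Greedy .* backtracks from the end, so the LAST number wins.
last = re.match(r".*\b(\d{1,3})%.*", line).group(1)

# Lazy .*? expands from the start, so the FIRST number wins.
first = re.match(r".*?(\d{1,3})%.*", line).group(1)

print(last, first)
```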

By default regex matches as greedily as possible. The initial .* in your regex sequence is matching everything up to the %:
"/dev/sda1 472437724 231650856 216764652 52"
This is acceptable for the regex, because it just chooses to have the next pattern, (\d*), match 0 characters.
In this scenario a couple of options could work for you. I would most recommend to use the previous spaces to define a sequence which "starts with a single space, contains any number of digits in the middle, and ends with a percentage symbol":
' (\d*)%'

Try this:
.*(\b\d{1,3}(?=\%)).*
demo

How to search/extract patterns in a string?

I have a pattern I want to search for in my message.
The patterns are:
1. "aaa-b3-c"
2. "a3-b6-c"
3. "aaaa-bb-c"
I know how to search for one of the patterns, but how do I search for all 3?
Also, how do you identify and extract dates in this format: 5/21 or 5/21/2019.
found = re.findall(r'.{3}-.{2}-.{1}', message)

Try this :
found = re.findall(r'a{2,4}-b{2}-c', message)

You could use
a{2,4}-bb-c
as a pattern.
Now you need to check the match for truthiness:
match = re.search(pattern, string)
if match:
    pass  # do something here
As of Python 3.8 you can use the walrus operator as in
if (match := re.search(pattern, string)) is not None:
    pass  # do something here

try this:
re.findall(r'a.*-b.*-c',message)

The first part could use the quantifier {2,4} instead of 3. The dot matches any character except a newline, whereas [a-zA-Z0-9] matches an upper- or lowercase character a-z or a digit:
\b[a-zA-Z0-9]{2,4}-[a-zA-Z0-9]{2}-[a-zA-Z0-9]\b
Demo
You could add word boundaries \b or anchors ^ and $ on either side if the characters should not be part of a longer word.
For the dates, you could use \d with a quantifier to match the digits, plus an optional group to match the part with / and 4 digits:
\d{1,2}/\d{2}(?:/\d{4})?
Regex demo
Note that the pattern does not validate the date itself. Perhaps this page can help you create or customize a more specific date format.
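A quick sketch of both patterns with re.findall. The message text is invented for the example, and the names CODE_RE and DATE_RE are illustrative.

```python
import re

# Hypothetical message containing the three codes and both date forms.
message = "ids aaa-b3-c, a3-b6-c and aaaa-bb-c, due 5/21 or 5/21/2019"

# 2 to 4 alphanumerics, then 2, then 1, separated by hyphens.
CODE_RE = re.compile(r"\b[a-zA-Z0-9]{2,4}-[a-zA-Z0-9]{2}-[a-zA-Z0-9]\b")
# M/DD with an optional /YYYY part; this checks shape, not validity.
DATE_RE = re.compile(r"\d{1,2}/\d{2}(?:/\d{4})?")

codes = CODE_RE.findall(message)
dates = DATE_RE.findall(message)
print(codes)
print(dates)
```

Because the optional year is a non-capturing group, findall returns the whole matched date rather than just the group contents.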

Here, we might write three expressions, one per pattern, and join them with alternation (logical OR); if more patterns turn up, we can simply add further alternatives, similar to:
([a-z]+-[a-z]+[0-9]+-[a-z]+)
([a-z]+[0-9]+-[a-z]+[0-9]+-[a-z])
([a-z]+-[a-z]+-[a-z])
which would add to:
([a-z]+-[a-z]+[0-9]+-[a-z]+)|([a-z]+[0-9]+-[a-z]+[0-9]+-[a-z])|([a-z]+-[a-z]+-[a-z])
Then, we might want to bound it with start and end chars:
^([a-z]+-[a-z]+[0-9]+-[a-z]+)$|^([a-z]+[0-9]+-[a-z]+[0-9]+-[a-z])$|^([a-z]+-[a-z]+-[a-z])$
or
^(([a-z]+-[a-z]+[0-9]+-[a-z]+)|([a-z]+[0-9]+-[a-z]+[0-9]+-[a-z])|([a-z]+-[a-z]+-[a-z]))$
RegEx
If this expression wasn't desired, it can be modified or changed in regex101.com.

Python Not Extracting Expected Pattern

I'm new to RegEx and I am trying to perform a simple match to extract a list of items using re.findall. However, I am not getting the expected result. Can you please help explain why I am also getting the first piece of this string based on the below regex pattern and what I need to modify to get the desired output?
import re
string = '''aaaa_1y345_xyz_orange_bar_1
aaaa_123a5542_xyz_orange_bar_1
bbbb_1z34512_abc_purple_bar_1'''
print(re.findall('_\w+_\w+_bar_\d+', string))
Current Output:
['_1y345_xyz_orange_bar_1', '_123a5542_xyz_orange_bar_1', '_1z34512_abc_purple_bar_1']
Desired Output:
['_xyz_orange_bar_1', '_xyz_orange_bar_1', '_abc_purple_bar_1']

The \w pattern matches letters, digits and _ symbol. Depending on the Python version and options used, the letters and digits it can match may be from the whole Unicode range or just ASCII.
So, the best way to fix the issue is by replacing \w with [^\W_]:
import re
string = '''aaaa_1y345_xyz_orange_bar_1
aaaa_123a5542_xyz_orange_bar_1
bbbb_1z34512_abc_purple_bar_1'''
print(re.findall(r'_[^\W_]+_[^\W_]+_bar_[0-9]+', string))
# => ['_xyz_orange_bar_1', '_xyz_orange_bar_1', '_abc_purple_bar_1']
See the Python demo.
Details:
_ - an underscore
[^\W_]+ - 1 or more chars that are either digits or letters (a [^ starts the negated character class, \W matches any non-word char, and _ is added to match any word chars other than _)
_[^\W_]+ - same as above
_bar_ - a literal substring _bar_
[0-9]+ - 1 or more ASCII digits.
See the regex demo.
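To see what each class actually accepts, this small sketch compares \w with [^\W_] on a handful of characters, with and without re.ASCII. The sample string is made up for illustration.

```python
import re

# letter, underscore, digit, hyphen, accented letter (e with acute)
sample = "a_1-\u00e9"

# \w accepts letters, digits and the underscore (Unicode-aware for str).
word_chars = re.findall(r"\w", sample)
# [^\W_] is \w minus the underscore.
alnum_chars = re.findall(r"[^\W_]", sample)
# With re.ASCII, only ASCII letters and digits remain.
ascii_alnum = re.findall(r"[^\W_]", sample, flags=re.ASCII)

print(word_chars, alnum_chars, ascii_alnum)
```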

_[a-z]+_\w+_bar_\d+ should work.
import re
string = '''aaaa_1y345_xyz_orange_bar_1
aaaa_123a5542_xyz_orange_bar_1
bbbb_1z34512_abc_purple_bar_1'''
print(re.findall(r'_[a-z]+_\w+_bar_\d+', string))
Output:
['_xyz_orange_bar_1', '_xyz_orange_bar_1', '_abc_purple_bar_1']

Your problem is that the regular expression is greedy and tries to match as much as possible. Sometimes this can be fixed by adding a ? (question mark) after the + (plus) sign. However, in your current pattern that is not doable (in any simple way, at least; it can likely be done with some lookahead). Instead, you can choose another pattern that explicitly forbids matching the _ (underscore) character:
import re
string = '''aaaa_1y345_xyz_orange_bar_1
aaaa_123a5542_xyz_orange_bar_1
bbbb_1z34512_abc_purple_bar_1'''
print(re.findall(r'_[^_\W]+_[^_\W]+_bar_\d+', string))
This will match what you hope for. The [^ ... ] construct negates the class, so [^_\W] means: not an underscore and not a non-word character, i.e. a letter or digit.

The problem with your code is that \w pattern is equivalent to the following set of characters: [a-zA-Z0-9_]
I guess you need to match the same set but without an underscore:
import re
string = '''aaaa_1y345_xyz_orange_bar_1
aaaa_123a5542_xyz_orange_bar_1
bbbb_1z34512_abc_purple_bar_1'''
print(re.findall(r'_[a-zA-Z0-9]+_[a-zA-Z0-9]+_bar_\d+', string))
The output:
['_xyz_orange_bar_1', '_xyz_orange_bar_1', '_abc_purple_bar_1']

Your \w usage is too permissive. It will find not only letters, but numbers and underscores as well. From the docs:
When the LOCALE and UNICODE flags are not specified, matches any alphanumeric character and the underscore; this is equivalent to the set [a-zA-Z0-9_]. With LOCALE, it will match the set [0-9_] plus whatever characters are defined as alphanumeric for the current locale. If UNICODE is set, this will match the characters [0-9_] plus whatever is classified as alphanumeric in the Unicode character properties database.
Instead, use explicit character classes to match.
_[a-z]+_[a-z]+_bar_[0-9]+
If you actually need the complete matching of \w without the underscore, you can change the character groupings to:
[a-zA-Z0-9]
