I am trying to write a regex that matches a url of the following format:
/api/v1/users/<mongo_object_id>/submissions
Where an example of a mongo_object_id is 556b352f87d4693546d31185.
I have cooked up the following pattern, but it does not seem to work.
/api/v1/users\\/(?=[a-f\\d]{24}$)(\\d+[a-f]|[a-f]+\\d)\\/submissions
Any help is appreciated.
This will do (assuming the ID is always 24 hex characters). The r prefix makes the pattern a raw string, so the backslashes do not need to be doubled:
r'\/api\/v1\/users\/([a-f\d]{24})\/submissions'
Python console:
>>> re.findall(r'\/api\/v1\/users\/([a-f\d]{24})\/submissions','/api/v1/users/556b352f87d4693546d31185/submissions')
['556b352f87d4693546d31185']
It looks like an object's ID is a hexadecimal number, which means that it's matched by something as simple as this:
[0-9a-f]+
If you want to make sure it's always 24 characters:
[0-9a-f]{24}
Toss that between the slashes:
/api/v1/users/([0-9a-f]{24})/submissions
And it should work.
Note: Python's re module does not treat / as a special character, so the slashes do not need escaping. With a raw string, you can do this:
import re
re.findall(r'/api/v1/users/([0-9a-f]{24})/submissions', url)
or
re.findall(r'/api/v1/users/([0-9a-f]{24})/submissions', url, re.I)
if you wanna make the whole thing case-insensitive.
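As a minimal sketch of how the pattern above could be wrapped up for reuse: the names SUBMISSIONS_URL and extract_user_id are made up for illustration, and re.fullmatch is used here (instead of re.findall) on the assumption that the whole path should match, not just a part of it.

```python
import re

# 24 lowercase hexadecimal characters, as in the example id from the question.
SUBMISSIONS_URL = re.compile(r'/api/v1/users/([0-9a-f]{24})/submissions')

def extract_user_id(path):
    """Return the 24-character id if the whole path matches, else None."""
    m = SUBMISSIONS_URL.fullmatch(path)
    return m.group(1) if m else None

print(extract_user_id('/api/v1/users/556b352f87d4693546d31185/submissions'))
print(extract_user_id('/api/v1/users/556b352f/submissions'))  # id too short
```

Pass re.I to re.compile if uppercase hex digits should be accepted as well.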
I am trying to get a substring between two markers using re in Python, for example:
import re
test_str = "#$ -N model_simulation 2022"
# these two lines work
# the output is: model_simulation
print(re.search("-N(.*)2022",test_str).group(1))
print(re.search(" -N(.*)2022",test_str).group(1))
# these two lines give the error: 'NoneType' object has no attribute 'group'
print(re.search("$ -N(.*)2022",test_str).group(1))
print(re.search("#$ -N(.*)2022",test_str).group(1))
I read the documentation of re here. It says that "#" is intentionally ignored so that the outputs look neater.
But in my case, I do need to include "#" and "$". I need them to identify the part of the string that I want, because the "-N" is not unique in my entire text string for real work.
Is there a way to force re to include those? Or is there a different way without using re?
Thanks.
You can escape both with \, for example,
print(re.search(r"\#\$ -N(.*)2022", test_str).group(1))
# output model_simulation
You can remove the special meaning by prefixing the character with a backslash: \$. This way, you can match a literal dollar sign in a given string.
# add backslash before # and $
# the output is: model_simulation
print(re.search(r"\$ -N(.*)2022", test_str).group(1))
print(re.search(r"\#\$ -N(.*)2022", test_str).group(1))
In regular expressions, $ anchors the match at the end of the string. So 'foo' would match foo anywhere in the string, but 'foo$' only matches foo if it appears at the end. To match a literal $ character instead, escape it by prefixing it with a backslash.
# is only the start of a comment in verbose mode using re.VERBOSE (which also ignores spaces), otherwise it just matches a literal #.
In general, it is also good practice to use raw string literals for regular expressions (r'foo'). Python then leaves backslashes alone, so its own escape processing does not conflict with the regex syntax (that way you don't have to type \\\\ to match a single backslash \).
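The points above can be checked with a short sketch. It reuses test_str from the question; the name verbose is made up for illustration, and .strip() is added only because the captured group includes the surrounding spaces.

```python
import re

test_str = "#$ -N model_simulation 2022"

# Unescaped $ is an end-of-string anchor, so " -N" can never follow it.
print(re.search(r"$ -N(.*)2022", test_str))  # None

# Escaped \$ is a literal dollar sign; # needs no escape in normal mode.
print(re.search(r"#\$ -N(.*)2022", test_str).group(1).strip())

# In verbose mode, whitespace is ignored and an unescaped # starts a comment,
# so both the # and the space have to be written explicitly.
verbose = re.compile(r"""
    \#\$      # literal hash and dollar
    [ ]-N     # a space, then -N
    (.*)      # the text to capture
    2022
""", re.VERBOSE)
print(verbose.search(test_str).group(1).strip())
```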
Instead of re.search, it looks like you actually want re.fullmatch, which matches only if the whole string matches.
So I would write your code like this:
print(re.search(r"\$ -N(.*)2022", test_str).group(1)) # This one would not work with fullmatch, because it doesn't match at the start
print(re.fullmatch(r"#\$ -N(.*)2022", test_str).group(1))
In a comment you mentioned that the string you need to match changes all the time. In that case, re.escape may prove useful.
Example:
prefix = '#$ -N'
postfix = '2022'
print(re.fullmatch(re.escape(prefix) + '(.*)' + re.escape(postfix), test_str).group(1))
I have been meaning to extract the month name from the following string with regex and despite the fact that my regex works on a platform like regex101, I can't seem to be able to extract the word "August".
import re
s = "word\anyword\2021\August\202108_filename.csv"
re.findall("\d+\\([[:alpha:]]+)\\\d+", s)
Which results in the following error:
error: unbalanced parenthesis at position 17
I also tried using re.compile, re.escape as per suggestions of the previous posts dealing with the same error but none of them seems to work.
Any help and also a little explanation on why this isn't working is greatly appreciated.
You can use
import re
s = r"word\anyword\2021\August\202108_filename.csv"
m = re.search(r"\d+\\([a-zA-Z]+)\\\d+", s)
if m:
    print(m.group(1))
See the Python demo.
There are three main problems here:
The input string should be the same as the one used at regex101.com, i.e. you need to make sure you are using literal backslashes in the Python code, hence the raw string literals for both the input text and the regex.
The POSIX character classes are not supported by Python re, so [[:alpha:]]+ should be replaced with an equivalent pattern, say, [A-Za-z]+ or [^\W\d_]+.
Since it seems like you only expect a single match (there is only one August (month) name in the string), you do not need re.findall, you can use re.search. Only use re.findall when you need to extract multiple matches from a string.
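To illustrate the first two points, here is a small sketch. The variable names plain, raw and month are made up for illustration, and the first string is shortened so that it only contains escape sequences Python recognizes (\a is the bell character, \202 is an octal escape).

```python
import re

plain = "word\anyword\2021"   # \a and \202 are interpreted as escape sequences
raw = r"word\anyword\2021"    # the r prefix keeps both backslashes

print("\\" in plain)          # False: no backslash survives in the plain string
print(raw.count("\\"))        # 2

# [^\W\d_]+ matches letters only (word characters minus digits and underscore).
s = r"word\anyword\2021\August\202108_filename.csv"
m = re.search(r"\d+\\([^\W\d_]+)\\\d+", s)
month = m.group(1) if m else None
print(month)
```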
Also, see these posts:
Python regex - r prefix
What does the "r" in pythons re.compile(r' pattern flags') mean?
What exactly do "u" and "r" string flags do, and what are raw string literals?
I am using python 2.7 with scrapy .20
I have this text
0552121152, +97143321090
I want to get the value before the comma and the value after it.
My regular expression is
\s*(.*), for the one before
and ,\s*(.*) for the one after
In both cases I get the whole text back.
Why?
Edit
For those who need the full details, here they are:
This is the page I am scraping http://www.justproperty.com/apartments/old-town/1092713-amazing-2br-apartment-in-the-old-town-island-with-a-burj-khalifa-and-address-hotel-view.html
and this is my scrapy code in cmd in shell:
s = sel.xpath("normalize-space(.//span[@class='content_agent']/span[last()]/span[2]/text())")
s.re(r'\s*(.*),')
Note
I am not asking for a solution to get these values; I am asking why the regular expressions above are not working.
Try doing it this way:
import re
string = "0552121152, +97143321090"
pattern = re.compile(r"\s*(.*),\s*(.*)")
print pattern.search(string).groups()
Result:
('0552121152', '+97143321090')
[Finished in 0.3s]
The only difference might be because I used the .groups() method, that's why I'm getting it. If you can reproduce this on your end, then it's not your regular expression that is wrong but your coding style.
Other than that, I fail to see how the regex fails at all.
That's because . also matches , and the quantifier is greedy. Use something more specific, like [^,] or \d, instead of the . symbol.
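A minimal sketch of the difference, using the sample text from the question. The extra string "a, b, c" and the variable names are made up for illustration; the split at the end is just one possible non-regex alternative.

```python
import re

text = "0552121152, +97143321090"

# Greedy .* runs to the LAST comma; [^,]* cannot cross a comma at all.
greedy = re.search(r"(.*),", "a, b, c").group(1)
bounded = re.search(r"([^,]*),", "a, b, c").group(1)
print(repr(greedy), repr(bounded))

# [^,]+ stops at the separator, so each group holds one value.
before, after = re.search(r"\s*([^,]+),\s*(.*)", text).groups()
print(before)
print(after)

# Without a regex: split once on the comma and trim the pieces.
before2, after2 = (part.strip() for part in text.split(",", 1))
```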
Something like this would store the part before the comma in \1 and the part after the comma in \2
/(\s?[\d]+)(?:[, ]+)([+\d]+)/
This looks for an optional whitespace character followed by one or more digits. Then it looks for one or more commas or spaces, followed by a run of characters made up of digits and/or plus signs.
I am trying to match the following line in python however this line is not working.
Example text is:
usr/local/java/latest/bin/java-Djava.util.logging.config.file=/usr/local/tomcat/foobarcode/conf/
Example python:
re.match(r"/usr/local/java/latest/bin/java[-]Djava[.]util[.]logging[.]config[.]file[=]/usr/local/tomcat/(\.*)/conf/\.*", pidInfo)
Any help that can be provided will be greatly appreciated. Thanks!
The problem is that you escape the .. That makes it a literal period. You want to leave it unescaped so that it becomes the wildcard:
re.match(r"/usr/local/java/latest/bin/java-Djava[.]util[.]logging[.]config[.]file=/usr/local/tomcat/(.*)/conf/.*", pidInfo)
Also, your input example does not contain the leading /, but that might be a copy-paste mistake. (And note that there is no need to escape = and -).
Here's the fixed version of what you're trying to do:
>>> s = '/usr/local/java/latest/bin/java-Djava.util.logging.config.file=/usr/local/tomcat/foobarcode/conf/'
>>> re.match(r'/usr/local/java/latest/bin/java-Djava\.util\.logging\.config\.file=/usr/local/tomcat/(.*)/conf/.*', s)
<_sre.SRE_Match object at 0x1004a0648>
Issues:
Do not escape the . when you want it to act as a wildcard
Do escape it as \. when it should match a literal dot in the filename ([.] works just fine, it is just less elegant IMO)
The root / is missing from the filename
The - and = characters do not need escaping
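An alternative sketch that sidesteps manual escaping: let re.escape handle the literal part and write only the capturing part by hand. The names pid_info, prefix and pattern are made up for illustration, and [^/]+ is swapped in for (.*) on the assumption that the captured directory name contains no slash.

```python
import re

pid_info = ("/usr/local/java/latest/bin/java"
            "-Djava.util.logging.config.file=/usr/local/tomcat/foobarcode/conf/")

# Literal text before the part we want; re.escape takes care of the dots.
prefix = ("/usr/local/java/latest/bin/java"
          "-Djava.util.logging.config.file=/usr/local/tomcat/")

# [^/]+ captures a single path component (an assumption, see above).
pattern = re.compile(re.escape(prefix) + r"([^/]+)/conf/")

m = pattern.match(pid_info)
app_name = m.group(1) if m else None
print(app_name)
```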
I'm trying to write a regex to match any path that contains /? to determine whether it is a querystring or not.
a sample string to be matched would be this: /mysite/path/to/whatever/?page=1
so far I thought this would match re.match(r'/\?', '/mysite/path/to/whatever/?page=1')
but it doesn't seem to be matching
This code is already written for you. No need to reinvent the wheel:
import urlparse
print urlparse.urlparse('/mysite/path/to/whatever/?page=1')
http://docs.python.org/library/urlparse.html#module-urlparse
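The snippet above is Python 2. In Python 3 the same functionality lives in urllib.parse; a rough equivalent, with parts and has_query as made-up names, might look like this:

```python
from urllib.parse import urlsplit, parse_qs

parts = urlsplit('/mysite/path/to/whatever/?page=1')

print(parts.path)             # the part before the ?
print(parts.query)            # the raw query string, empty if there is none
print(parse_qs(parts.query))  # query parameters as a dict of lists

# A non-empty query string means the URL carries parameters.
has_query = bool(parts.query)
```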
Your problem is that you're using re.match. That function looks for matches at the beginning of the string. So, either you change your regexp to '.*/\?', or use re.search instead.
You don't need a regular expression here. Just use the in operator: '/?' in the_string.
The problem is that re.match only looks at the beginning of the string.
You could use re.search instead, if you need the power of REs.
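A quick sketch comparing the three approaches mentioned in these answers on the sample path from the question (the variable names are made up for illustration):

```python
import re

path = '/mysite/path/to/whatever/?page=1'

# re.match only tries at position 0, where the string starts with "/m", not "/?".
matched_at_start = re.match(r'/\?', path)

# re.search scans the whole string for the first occurrence.
found_anywhere = re.search(r'/\?', path)

# For a fixed substring, the in operator is enough.
contains = '/?' in path

print(matched_at_start)
print(found_anywhere is not None)
print(contains)
```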