python regex dash - python

I am trying to match the following line in Python, but my pattern is not working.
Example text is:
usr/local/java/latest/bin/java-Djava.util.logging.config.file=/usr/local/tomcat/foobarcode/conf/
Example python:
re.match(r"/usr/local/java/latest/bin/java[-]Djava[.]util[.]logging[.]config[.]file[=]/usr/local/tomcat/(\.*)/conf/\.*", pidInfo)
Any help that can be provided will be greatly appreciated. Thanks!

The problem is that you escaped the . in \.*, which makes it a literal period. Leave it unescaped so that it acts as the wildcard:
re.match(r"/usr/local/java/latest/bin/java-Djava[.]util[.]logging[.]config[.]file=/usr/local/tomcat/(.*)/conf/.*", pidInfo)
Also, your input example does not contain the leading /, but that might be a copy-paste mistake. (And note that there is no need to escape = and -).

Here's the fixed version of what you're trying to do:
>>> s = '/usr/local/java/latest/bin/java-Djava.util.logging.config.file=/usr/local/tomcat/foobarcode/conf/'
>>> re.match(r'/usr/local/java/latest/bin/java-Djava\.util\.logging\.config\.file=/usr/local/tomcat/(.*)/conf/.*', s)
<_sre.SRE_Match object at 0x1004a0648>
Issues:
You must not escape the . where it is meant to act as a wildcard
You do need to escape the . as \. where it stands for a literal dot in the filename ([.] works just fine, it is just less elegant IMO)
The root / is missing from the filename in your example text
The - and = characters do not need escaping
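To see the corrected pattern return the captured directory name rather than just a match object, here is a minimal sketch. It assumes the input really does start with a leading /; the variable names pid_info and app_name are only illustrative.

```python
import re

# Sample command line, assumed to include the leading slash.
pid_info = (
    "/usr/local/java/latest/bin/java"
    "-Djava.util.logging.config.file=/usr/local/tomcat/foobarcode/conf/"
)

# Escaped dots (\.) match literal periods; the unescaped .* is the wildcard.
pattern = re.compile(
    r"/usr/local/java/latest/bin/java-Djava\.util\.logging\.config\.file="
    r"/usr/local/tomcat/(.*)/conf/.*"
)

match = pattern.match(pid_info)
app_name = match.group(1) if match else None
print(app_name)  # foobarcode
```

Checking the match before calling group() avoids the AttributeError you would otherwise get on None when the line does not match.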

Related

force re.search to include # and $

I am trying to get a substring between two markers using re in Python, for example:
import re
test_str = "#$ -N model_simulation 2022"
# these two lines work
# the output is: model_simulation
print(re.search("-N(.*)2022",test_str).group(1))
print(re.search(" -N(.*)2022",test_str).group(1))
# these two lines give the error: 'NoneType' object has no attribute 'group'
print(re.search("$ -N(.*)2022",test_str).group(1))
print(re.search("#$ -N(.*)2022",test_str).group(1))
I read the documentation of re here. It says that "#" is intentionally ignored so that the outputs look neater.
But in my case, I do need to include "#" and "$". I need them to identify the part of the string that I want, because the "-N" is not unique in my entire text string for real work.
Is there a way to force re to include those? Or is there a different way without using re?
Thanks.
You can escape both with \, for example,
print(re.search(r"\#\$ -N(.*)2022",test_str).group(1))
# output model_simulation
You can remove the special meaning by prefixing the character with a backslash: \$. This way, you can match a literal dollar sign in a given string.
# add backslash before # and $
# the output is: model_simulation
print(re.search(r"\$ -N(.*)2022",test_str).group(1))
print(re.search(r"\#\$ -N(.*)2022",test_str).group(1))
In regular expressions, $ signals the end of the string. So 'foo' would match foo anywhere in the string, but 'foo$' only matches foo if it appears at the end. To match a literal $ character instead, escape it by prefixing it with a backslash.
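A small sketch of the difference, reusing the test string from the question (the variable names unescaped and escaped are just for illustration):

```python
import re

test_str = "#$ -N model_simulation 2022"

# Unescaped: $ asserts end-of-string, so nothing can follow it and the search fails.
unescaped = re.search(r"$ -N(.*)2022", test_str)

# Escaped: \$ matches the literal dollar sign.
escaped = re.search(r"\$ -N(.*)2022", test_str)

print(unescaped)                 # None
print(escaped.group(1).strip())  # model_simulation
```

Note that the captured group includes the surrounding spaces, hence the strip().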
# is only the start of a comment in verbose mode using re.VERBOSE (which also ignores spaces), otherwise it just matches a literal #.
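To illustrate the verbose-mode point, the sketch below matches the same string twice: once in default mode, where # is an ordinary character, and once with re.VERBOSE, where # and the space must be escaped to be taken literally. The layout of the verbose pattern is my own choice, not something from the question.

```python
import re

test_str = "#$ -N model_simulation 2022"

# Default mode: # is an ordinary character and matches itself.
plain = re.search(r"#\$ -N(.*)2022", test_str)

# Verbose mode: whitespace is ignored and # starts a comment,
# so both must be escaped to be matched literally.
verbose = re.search(
    r"""
    \#\$\ -N   # literal marker at the start
    (.*)       # captured text
    2022       # closing marker
    """,
    test_str,
    re.VERBOSE,
)

print(plain.group(1).strip())
print(verbose.group(1).strip())
```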
In general, it is also good practice to use raw string literals for regular expressions (r'foo'). Python then leaves backslashes alone, so they don't conflict with regex escapes (that way you don't have to type \\\\ to match a single backslash \).
Instead of re.search, it looks like you actually want re.fullmatch, which matches only if the whole string matches.
So I would write your code like this:
print(re.search(r"\$ -N(.*)2022", test_str).group(1)) # This one would not work with fullmatch, because it doesn't match at the start
print(re.fullmatch(r"#\$ -N(.*)2022", test_str).group(1))
In a comment you mentioned that the string you need to match changes all the time. In that case, re.escape may prove useful.
Example:
prefix = '#$ -N'
postfix = '2022'
print(re.fullmatch(re.escape(prefix) + '(.*)' + re.escape(postfix), test_str).group(1))
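If the markers change often, the re.escape idea can be wrapped in a small helper. This is only a sketch: the function name between is hypothetical, and it uses re.search rather than re.fullmatch so the markers may appear anywhere in the text.

```python
import re

def between(text, prefix, postfix):
    """Return the text between two literal markers, or None if absent."""
    # re.escape neutralises any regex metacharacters in the markers.
    pattern = re.escape(prefix) + r"(.*)" + re.escape(postfix)
    match = re.search(pattern, text)
    return match.group(1).strip() if match else None

test_str = "#$ -N model_simulation 2022"
print(between(test_str, "#$ -N", "2022"))  # model_simulation
print(between(test_str, "#$ -X", "2022"))  # None
```

Returning None instead of raising keeps the caller free of the 'NoneType' object has no attribute 'group' error from the question.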

Regular expression error: unbalanced parenthesis at position n

I have been meaning to extract the month name from the following string with regex and despite the fact that my regex works on a platform like regex101, I can't seem to be able to extract the word "August".
import re
s = "word\anyword\2021\August\202108_filename.csv"
re.findall("\d+\\([[:alpha:]]+)\\\d+", s)
Which results in the following error:
error: unbalanced parenthesis at position 17
I also tried using re.compile, re.escape as per suggestions of the previous posts dealing with the same error but none of them seems to work.
Any help and also a little explanation on why this isn't working is greatly appreciated.
You can use
import re
s = r"word\anyword\2021\August\202108_filename.csv"
m = re.search(r"\d+\\([a-zA-Z]+)\\\d+", s)
if m:
    print(m.group(1))
See the Python demo.
There are three main problems here:
The input string should be the same as used at regex101.com, i.e. you need to make sure you are using literal backslashes in the Python code, hence the use of raw string literals for both the input text and regex
The POSIX character classes are not supported by Python re, so [[:alpha:]]+ should be replaced with some equivalent pattern, say, [A-Za-z]+ or [^\W\d_]+
Since it seems like you only expect a single match (there is only one August (month) name in the string), you do not need re.findall, you can use re.search. Only use re.findall when you need to extract multiple matches from a string.
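As for the error message itself: in the non-raw pattern, Python collapses \\( to \(, so the regex engine sees an escaped (literal) opening parenthesis, and the later ) at position 17 has nothing to close. The sketch below shows how the r prefix changes what a literal contains, and uses the [^\W\d_]+ alternative mentioned above; the variable name month is illustrative.

```python
import re

# In a normal literal, \a is a single bell character; in a raw literal it
# stays as a backslash followed by the letter a.
print(len("\a"), len(r"\a"))  # 1 2

s = r"word\anyword\2021\August\202108_filename.csv"

# [^\W\d_]+ matches letters only (word characters minus digits and underscore),
# standing in for the unsupported POSIX class [[:alpha:]].
m = re.search(r"\d+\\([^\W\d_]+)\\\d+", s)
month = m.group(1) if m else None
print(month)  # August
```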
Also, see these posts:
Python regex - r prefix
What does the "r" in pythons re.compile(r' pattern flags') mean?
What exactly do "u" and "r" string flags do, and what are raw string literals?

how re.compile works

I am a bit confused by the re.compile call below. I know it compiles a pattern that detects all non-printable characters, but I'm not sure what the argument passed to re.compile means. Thank you!
re_print = re.compile('[^%s]' % re.escape(string.printable))
Breaking this down to see if it helps. Run the following code in your python3 interpreter:
import string
import re
# This will be the contents of the variable referenced
print(string.printable)
# This is what happens after all those characters are escaped by re
print(re.escape(string.printable))
# This is the whole value you are giving to re.compile (the re_print):
print('[^%s]' % re.escape(string.printable))
# Note: the leading ^ negates the character class, so it matches anything NOT printable
re_print is likely being used to check some text for non-printable characters (those not in string.printable). Some of the printable characters, such as ], ^, - and \, have a special meaning in a regex, so they must be escaped first; otherwise re would interpret them as syntax and the character class would not behave as expected.
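A typical use might look like the following sketch. The sample text and the variable name cleaned are made up for illustration; note that string.printable only covers ASCII, so accented letters count as "non-printable" here too.

```python
import re
import string

# Character class that matches any single character NOT in string.printable.
re_print = re.compile('[^%s]' % re.escape(string.printable))

text = "caf\u00e9 \x00ok\x07"

# List the offending characters, then strip them out.
print(re_print.findall(text))
cleaned = re_print.sub('', text)
print(cleaned)  # caf ok
```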

Python - regex to match url with mongo object id

I am trying to write a regex that matches a url of the following format:
/api/v1/users/<mongo_object_id>/submissions
Where an example of a mongo_object_id is 556b352f87d4693546d31185.
I have cooked up the following pattern, but it does not seem to work.
/api/v1/users\\/(?=[a-f\\d]{24}$)(\\d+[a-f]|[a-f]+\\d)\\/submissions
Any help is appreciated.
This will do (assuming 24 hex characters). The r prefix makes the pattern a raw string, so backslashes do not need to be doubled (escaping / is harmless but not required in Python):
r'\/api\/v1\/users\/([a-f\d]{24})\/submissions'
Python console:
>>> re.findall(r'\/api\/v1\/users\/([a-f\d]{24})\/submissions','/api/v1/users/556b352f87d4693546d31185/submissions')
['556b352f87d4693546d31185']
It looks like an object's ID is a hexadecimal number, which means that it's matched by something as simple as this:
[0-9a-f]+
If you want to make sure it's always 24 characters:
[0-9a-f]{24}
Toss that between the slashes:
/api/v1/users/([0-9a-f]{24})/submissions
And it should work.
Note: In Python you do not have to escape the slashes, because patterns are ordinary strings rather than /.../-delimited literals. You can do this:
import re
re.findall(r'/api/v1/users/([0-9a-f]{24})/submissions', url)
or
re.findall(r'/api/v1/users/([0-9a-f]{24})/submissions', url, re.I)
if you wanna make the whole thing case-insensitive.
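If the goal is to validate a whole URL path rather than find ids inside a longer string, re.fullmatch may be a better fit than re.findall. A minimal sketch, where the names USER_SUBMISSIONS and extract_object_id are hypothetical:

```python
import re

# Compiled once; matches the whole path and captures the 24-hex-digit id.
USER_SUBMISSIONS = re.compile(r"/api/v1/users/([0-9a-f]{24})/submissions")

def extract_object_id(path):
    """Return the object id from the path, or None if the path does not fit."""
    match = USER_SUBMISSIONS.fullmatch(path)
    return match.group(1) if match else None

print(extract_object_id("/api/v1/users/556b352f87d4693546d31185/submissions"))
print(extract_object_id("/api/v1/users/not-an-id/submissions"))  # None
```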

Why this regular expression is not working

I am using Python 2.7 with Scrapy 0.20.
I have this text:
0552121152, +97143321090
I want to get the value before the comma and the value after it.
My regular expression is
\s*(.*), for the one before
and ,\s*(.*) for the one after
In both cases I get the whole text.
Why is that?
Edit
For those who need the full details, here they are:
This is the page I am scraping http://www.justproperty.com/apartments/old-town/1092713-amazing-2br-apartment-in-the-old-town-island-with-a-burj-khalifa-and-address-hotel-view.html
and this is my scrapy code in cmd in shell:
s = sel.xpath("normalize-space(.//span[@class='content_agent']/span[last()]/span[2]/text())")
s.re(r'\s*(.*),')
Note
I am not asking for a way to get these values; I am asking why the regular expressions I mentioned are not working.
Try doing it this way:
import re
string = "0552121152, +97143321090"
pattern = re.compile(r"\s*(.*),\s*(.*)")
print pattern.search(string).groups()
Result:
('0552121152', '+97143321090')
[Finished in 0.3s]
The only difference might be because I used the .groups() method, that's why I'm getting it. If you can reproduce this on your end, then it's not your regular expression that is wrong but your coding style.
Other than that, I fail to see how the regex fails at all.
That's because . also matches , and .* is greedy. Use something more specific, like [^,] or \d, instead of the . symbol.
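A short sketch of the greedy behaviour. The first part uses the text from the question; the second string, multi, is an invented example with more than one comma, since that is where .* and [^,]+ visibly diverge.

```python
import re

text = "0552121152, +97143321090"

# [^,]+ cannot run past a comma, unlike the greedy .*
before = re.search(r"\s*([^,]+),", text).group(1)
after = re.search(r",\s*([^,]+)", text).group(1)
print(before, after)

# With more than one comma, .* backtracks only as far as the LAST comma.
multi = "a, b, c"
greedy = re.search(r"(.*),", multi).group(1)
specific = re.search(r"([^,]+),", multi).group(1)
print(greedy)    # a, b
print(specific)  # a
```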
Something like this would store the part before the comma in \1 and the part after the comma in \2
/(\s?[\d]+)(?:[, ]+)([+\d]+)/
This looks for an optional whitespace character followed by some digits. Then it looks for one or more spaces or commas, followed by another string made up of digits and/or plus signs.
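In Python the same pattern is written without the surrounding / delimiters, and the two parts come back as groups 1 and 2. A minimal sketch using the text from the question (the names first and second are illustrative):

```python
import re

text = "0552121152, +97143321090"

# Group 1: optional whitespace plus digits; group 2: digits and/or plus signs.
pattern = re.compile(r"(\s?[\d]+)(?:[, ]+)([+\d]+)")

m = pattern.search(text)
first, second = m.groups()
print(first)   # 0552121152
print(second)  # +97143321090
```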
