how re.compile works

how re.compile works - python

i got some confused in re.compile function bellow, i know it compile to detect all non printable chars. but i'm not sure about the meaning of the parameter put in the compile function. thank you guys!
re_print = re.compile('[^%s]' % re.escape(string.printable))

Breaking this down to see if it helps. Run the following code in your python3 interpreter:
import string
import re
# This will be the contents of the variable referenced
print(string.printable)
# This is what happens after all those characters are escaped by re
print(re.escape(string.printable)
# This is the whole value you are giving to re.compile (the re_print):
print('[^%s]' % re.escape(string.printable))
# Note the ^ in front means anything NOT printables
re_print is likely being used to check some text for non-printable characters (not in string.printable), but some of them need to be escaped or if not, re will not get the expected results because the special characters may be interpreted as a regex statement.

Related

force re.search to include # and $

I am trying to get a substring between two markers using re in Python, for example:
import re
test_str = "#$ -N model_simulation 2022"
# these two lines work
# the output is: model_simulation
print(re.search("-N(.*)2022",test_str).group(1))
print(re.search(" -N(.*)2022",test_str).group(1))
# these two lines give the error: 'NoneType' object has no attribute 'group'
print(re.search("$ -N(.*)2022",test_str).group(1))
print(re.search("#$ -N(.*)2022",test_str).group(1))
I read the documentation of re here. It says that "#" is intentionally ignored so that the outputs look neater.
But in my case, I do need to include "#" and "$". I need them to identify the part of the string that I want, because the "-N" is not unique in my entire text string for real work.
Is there a way to force re to include those? Or is there a different way without using re?
Thanks.

You can escape both with \, for example,
print(re.search("\#\$ -N(.*)2022",test_str).group(1))
# output model_simulation

You can get rid of the special meaning by using the backslash prefix: $. This way, you can match the dollar symbol in a given string
# add backslash before # and $
# the output is: model_simulation
print(re.search("\$ -N(.*)2022",test_str).group(1))
print(re.search("\#\$ -N(.*)2022",test_str).group(1))

In regular expressions, $ signals the end of the string. So 'foo' would match foo anywhere in the string, but 'foo$' only matches foo if it appears at the end. To solve this, you need to escape it by prefixing it with a backslash. That way it will match a literal $ character
# is only the start of a comment in verbose mode using re.VERBOSE (which also ignores spaces), otherwise it just matches a literal #.
In general, it is also good practice to use raw string literals for regular expressions (r'foo'), which means Python will let backslashes alone so it doesn't conflict with regular expressions (that way you don't have to type \\\\ to match a single backslash \).
Instead of re.search, it looks like you actually want re.fullmatch, which matches only if the whole string matches.
So I would write your code like this:
print(re.search(r"\$ -N(.*)2022", test_str).group(1)) # This one would not work with fullmatch, because it doesn't match at the start
print(re.fullmatch(r"#\$ -N(.*)2022", test_str).group(1))
In a comment you mentioned that the string you need to match changes all the time. In that case, re.escape may prove useful.
Example:
prefix = '#$ - N'
postfix = '2022'
print(re.fullmatch(re.escape(prefix) + '(.*)' + re.escape(postfix), tst_str).group(1))

Regular expression error: unbalanced parenthesis at position n

I have been meaning to extract the month name from the following string with regex and despite the fact that my regex works on a platform like regex101, I can't seem to be able to extract the word "August".
import re
s = "word\anyword\2021\August\202108_filename.csv"
re.findall("\d+\\([[:alpha:]]+)\\\d+", s)
Which results in the following error:
error: unbalanced parenthesis at position 17
I also tried using re.compile, re.escape as per suggestions of the previous posts dealing with the same error but none of them seems to work.
Any help and also a little explanation on why this isn't working is greatly appreciated.

You can use
import re
s = r"word\anyword\2021\August\202108_filename.csv"
m = re.search(r"\d+\\([a-zA-Z]+)\\\d+", s)
if m:
print(m.group(1))
See the Python demo.
There are three main problems here:
The input string should be the same as used at regex101.com, i.e. you need to make sure you are using literal backslashes in the Python code, hence the use of raw string literals for both the input text and regex
The POSIX character classes are not supported by Python re, so [[:alpha:]]+ should be replaced with some equivalent pattern, say, [A-Za-z]+ or [^\W\d_]+
Since it seems like you only expect a single match (there is only one August (month) name in the string), you do not need re.findall, you can use re.search. Only use re.findall when you need to extract multiple matches from a string.
Also, see these posts:
Python regex - r prefix
What does the "r" in pythons re.compile(r' pattern flags') mean?
What exactly do "u" and "r" string flags do, and what are raw string literals?

Need to Escape the Character After Special Characters in Python's regex?

I have the following python code:
#!/usr/bin/python
# -*- coding: utf-8 -*-
import re
line = 'div><div class="fieldRow jr_name"><div class="fieldLabel">name<'
regex0 = re.compile('(.+?)\v class="fieldLabel">name.+?', re.VERBOSE | re.UNICODE)
regex1 = re.compile('(.+?)v class="fieldLabel">name.+?', re.VERBOSE | re.UNICODE)
regex2 = re.compile('(.+?) class="fieldLabel">name.+?', re.VERBOSE | re.UNICODE)
m0 = regex0.match(line)
m1 = regex1.match(line)
m2 = regex2.match(line)
if m0:
print 'regex0 is good'
else:
print 'regex0 is no good'
if m1:
print 'regex1 is good'
else:
print 'regex1 is no good'
if m2:
print 'regex2 is good'
else:
print 'regex2 is no good'
The output is
regex0 is good
regex1 is no good
regex2 is good
I don't quite understand why I need to escape the character 'v' after "(.+?)" in regex0. If I don't escape, which will become regex1, then the matching will fail. However, for space right after "(.+?)" in regex3, I don't have to escape.
Any idea?
Thanks in advance.

So, there are some issues with your approach
The ones that contribute to your specific complaint are:
You do not mark te regexp string as raw (r' prefix) - that makes the Python compiler change some "\" prefixed characters inside the string before they even reach the re.match call.
"\v" happens to be one such character - a vertical tab that is replaced by "\0x0b"
You use the "re.VERBOSE" flag - that simply tells the regexp engine to ignore any whitesapce character. "\v" being a vertical tab is one character in this class and is ignored.
So, there is your match for regex0: the letter "v" os never seem as such.
Now, for the possible fixes on you approach, in the order that you should be trying to do them:
1) Don't use regular expressions to parse HTML. Really. There are a lot of packages that can do a good job on parsing HTML, and in missing those you can use stdlib's own HTMLParser (html.parser in Python3);
2) If possible, use Python 3 instead of Python 2 - you will be bitten on the first non-ASCII character inside yourt HTML body if you go on with the naive approach of treating Python2 strings as "real life" text. Python 3 automatic encoding handling (and explicit settings allowed to you when it is not automatic) .
Since you are probably not changing anyway, so try to use regex.findall instead of regex.match - this returns a list of matching strings and could retreive the attributes you are looking at once, without searching from the beggining of the file, or depending on line-breaks inside the HTML.

There is a special symbol in Python regex \v, about which you can read here:
https://docs.python.org/2/library/re.html
Python regex usually are written in r'your regex' block, where "r" means raw string. (https://docs.python.org/3/reference/lexical_analysis.html)
In your code all special characters should be escaped to be understood as normal characters. E.g. if you write s - this is space, \s is just "s". To make it work in an opposite way use raw strings.
The line below is the solution you need, I believe.
regex1 = re.compile(r'(.+?)v class="fieldLabel">name.+?', re.VERBOSE | re.UNICODE)

pep8 warning on regex string in Python, Eclipse

Why is pep8 complaining on the next string in the code?
import re
re.compile("\d{3}")
The warning I receive:
ID:W1401 Anomalous backslash in string: '\d'. String constant might be missing an r prefix.
Can you explain what is the meaning of the message? What do I need to change in the code so that the warning W1401 is passed?
The code passes the tests and runs as expected. Moreover \d{3} is a valid regex.

"\d" is same as "\\d" because there's no escape sequence for d. But it is not clear for the reader of the code.
But, consider \t. "\t" represent tab chracter, while r"\t" represent literal \ and t character.
So use raw string when you mean literal \ and d:
re.compile(r"\d{3}")
or escape backslash explicitly:
re.compile("\\d{3}")

Python is unable to parse '\d' as an escape sequence, that's why it produces a warning.
After that it's passed down to regex parser literally, works fine as an E.S. for regex.

python regex dash

I am trying to match the following line in python however this line is not working.
Example text is:
usr/local/java/latest/bin/java-Djava.util.logging.config.file=/usr/local/tomcat/foobarcode/conf/
Example python:
re.match(r"/usr/local/java/latest/bin/java[-]Djava[.]util[.]logging[.]config[.]file[=]/usr/local/tomcat/(\.*)/conf/\.*", pidInfo)
Any help that can be provided will be greatly appreciated. Thanks!

The problem is that you escape the .. That makes it a literal period. You want to leave it unescaped so that it becomes the wildcard:
re.match(r"/usr/local/java/latest/bin/java-Djava[.]util[.]logging[.]config[.]file=/usr/local/tomcat/(.*)/conf/.*", pidInfo)
Also, your input example does not contain the leading /, but that might be a copy-paste mistake. (And note that there is no need to escape = and -).

Here's the fixed version of what you're trying to do:
>>> s = '/usr/local/java/latest/bin/java-Djava.util.logging.config.file=/usr/local/tomcat/foobarcode/conf/'
>>> re.match(r'/usr/local/java/latest/bin/java-Djava\.util\.logging\.config\.file=/usr/local/tomcat/(.*)/conf/.*', s)
<_sre.SRE_Match object at 0x1004a0648>
Issues:
You do not need to escape the . as a wildcard
You do need to (elegantly) escape \. when used as dots in filename ([.] works just fine, just less elegant IMO)
Missing root / in filename
- and = chars do not need escaping

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.