how to match 'space' symbol (and only space) in python2 regex? - python

I am trying to use Python v2 (2.7.5 specifically) and the 're' module for regex matching. My problem is that for my application I need to match the 'space' symbol (i.e. 0x20 in hex) and ONLY that symbol as part of the match string. The first thing I tried was '\s', and that does not work because it also matches newline, carriage return, tab and form feed.
The end requirement is to match a string where the first three characters are digits ('\d'), there is a comma (',') and then eight symbols that are either digits ('\d') or spaces (???).
Any suggestions on how to do that? What I have already tried...
C:\Users\jlaird>python
Python 2.7.5 (default, May 15 2013, 22:43:36) [MSC v.1500 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import re
>>> goodstring = '333,000000 2'
>>> badstring = '333,000000\t2'
>>> print badstring
333,000000 2
>>> sRegex = '\d\d\d,[\s\d][\s\d][\s\d][\s\d][\s\d][\s\d][\s\d][\s\d]'
>>> cRegex = re.compile(sRegex)
>>> cRegex.match(goodstring)
<_sre.SRE_Match object at 0x023A7A30>
>>> cRegex.match(badstring)
<_sre.SRE_Match object at 0x025E82C0>
>>>
I want 'badstring' to evaluate to None because it has the tab character instead of the space. How can I do this?

Thanks jonrsharpe...works. It is always something simple that I make complicated. Sorry...
>>> sRegex = '\d\d\d,[ \d][ \d][ \d][ \d][ \d][ \d][ \d][ \d]'
>>> cRegex = re.compile(sRegex)
>>> cRegex.match(goodstring)
<_sre.SRE_Match object at 0x023A7A30>
>>> cRegex.match(badstring)
>>>
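For reference, the same idea can be written more compactly with counted repetition. This is a minimal sketch rather than the asker's exact pattern: the `{3}` and `{8}` quantifiers and the trailing `$` (which rejects anything after the eighth character) are assumptions added here, and the variable names are illustrative.

```python
import re

# Three digits, a comma, then exactly eight characters that are each
# either a literal space (0x20) or a digit. Inside [...] a plain space
# matches only the space character, unlike \s.
pattern = re.compile(r'\d{3},[ \d]{8}$')

good = '333,000000 2'    # space in the payload
bad = '333,000000\t2'    # tab in the payload

good_match = pattern.match(good)
bad_match = pattern.match(bad)

print(good_match is not None)   # True
print(bad_match is None)        # True
```

Drop the `$` if trailing characters after the eight-character field should be allowed, as in the original pattern.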

Related

Evaluate ANSI escapes in Python string

Say I have the string '\033[2KResolving dependencies...\033[2KResolving dependencies...'
In the Python console, I can print this, and it'll only display once
Python 3.10.9 (main, Jan 19 2023, 07:59:38) [GCC 9.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> output = '\033[2KResolving dependencies...\033[2KResolving dependencies...'
>>> print(output)
Resolving dependencies...
Is there a way to get a string that consists solely of the printed output? In other words, I would like there to be some function
def evaluate_ansi_escapes(input: str) -> str:
    ...
such that evaluate_ansi_escapes(output) == 'Resolving dependencies...' (ideally with the correct amount of whitespace in front)
edit: I've come up with the following stopgap solution
import re
def evaluate_ansi_escapes(input: str) -> str:
    erases_regex = r"^.*(\\(033|e)|\x1b)\[2K"
    erases = re.compile(erases_regex)
    no_erases = []
    for line in input.split("\n"):
        while len(erases.findall(line)) > 0:
            line = erases.sub("", line)
        no_erases.append(line)
    return "\n".join(no_erases)
This does successfully produce output that is close enough to what I want:
>>> evaluate_ansi_escapes(output)
'Resolving dependencies...'
But I would love to know if there is a less hacky way to solve this problem, or if the whitespace preceding 'Resolving dependencies...' can be captured as well.
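One possible approach, sketched here, is to replay the escape sequence instead of deleting text with a regex. ESC[2K erases the whole line but leaves the cursor where it is, so already-written characters become blanks and later text continues at the same column. The helper name `evaluate_erase_line` is hypothetical, and the sketch handles only ESC[2K: carriage returns, cursor movement and every other escape sequence are assumed not to occur in the input.

```python
import re

ERASE_LINE = '\x1b[2K'  # ESC [ 2 K: erase the whole line, cursor stays put


def evaluate_erase_line(text: str) -> str:
    """Hypothetical helper: replay only ESC[2K on each line of text."""
    rendered = []
    for line in text.split('\n'):
        cells = []  # characters currently visible on this line
        # The capturing group makes re.split keep the escape sequences.
        for chunk in re.split(r'(\x1b\[2K)', line):
            if chunk == ERASE_LINE:
                # Blank out what was written; the cursor column is unchanged.
                cells = [' '] * len(cells)
            else:
                cells.extend(chunk)
        rendered.append(''.join(cells))
    return '\n'.join(rendered)


output = '\033[2KResolving dependencies...\033[2KResolving dependencies...'
print(repr(evaluate_erase_line(output)))
```

For the sample string this yields 'Resolving dependencies...' preceded by 25 spaces, one for each character erased by the second ESC[2K.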

Python automatically converting some strings to raw strings?

Python seems to be automatically converting strings (not just input) into raw strings. Can somebody explain what is happening here?
Python 3.7.1 (v3.7.1:260ec2c36a, Oct 20 2018, 14:57:15) [MSC v.1915 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> s = '\stest'
>>> s
'\\stest'
# looks like a raw string
>>> print(s)
\stest
>>> s = '\ntest'
>>> s
'\ntest'
# this one doesn't
>>> s = '\n test'
>>> s
'\n test'
>>> s = r'\n test'
>>> s
'\\n test'
>>> print(s)
\n test
The question marked as a duplicate for this one seems to be useful, but then I do not understand why
>>> s = '\n test'
>>> s
'\n test'
>>> repr(s)
"'\\n test'"
does not get two backslashes when called, and does when repr() is called on it.
\n is a valid escape sequence and '\n' is a length 1 string (the newline character). In contrast, \s is an invalid escape sequence, so Python assumes that what you wanted there was a two-character string: a backslash character plus an s character.
>>> len('\s')
2
What you saw on terminal output was just the usual representation for such a length 2 string. Note that the correct way to create the string which Python gave you back here would have been with r'\s' or with '\\s'.
>>> r'\s' == '\\s' == '\s'
True
This is deprecated behavior. Since Python 3.6, an invalid escape sequence triggers a DeprecationWarning (a SyntaxWarning as of Python 3.12), and a future version of Python is expected to make it a syntax error.
Since you're using v3.7.1, you could enable warnings if you want to be informed about such uses of deprecated features:
$ python -Wall
>>> '\s'
<stdin>:1: DeprecationWarning: invalid escape sequence \s
'\\s'
As for your subsequent question after the edit:
>>> s = '\n test'
>>> s # this prints the repr(s)
'\n test'
>>> repr(s) # this prints the repr(repr(s))
"'\\n test'"

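The difference between a string and its repr can be checked directly. The short sketch below (variable names are illustrative) compares lengths, and avoids the deprecated '\s' literal by spelling the backslash explicitly.

```python
two_chars = '\\s'          # backslash + 's', the same value as r'\s'
newline_str = '\n test'    # starts with a real newline character

print(len(two_chars))              # 2
print(two_chars == r'\s')          # True
print(len(newline_str))            # 6
print(repr(newline_str))           # '\n test'
print(len(repr(newline_str)))      # 9: quotes, backslash and n are all characters
```

The interactive prompt shows repr(value), which is why typing repr(s) at the prompt displays one more layer of quoting and escaping than typing s.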
Removing non integers from a grep obtained string w/ Python and Bash

I am using grep to grab the text out of a file:
NELECT = 44.0000 total number of electrons,
and I need to save the number as a variable. I have tried a handful of methods I have found here such as using filters and findall. For some reason I can only get it to separate one zero.
So far the code looks like this:
wd=os.getcwd()
electrons=str(os.system("grep 'NELECT' "+wd+"/OUTCAR"))
VBM=(re.findall('\d+', electrons))
print VBM
And in return I get ['0'].
The result of os.system is the exit status of the command, not the output of the command -- see https://docs.python.org/3/library/os.html#os.system
$ cat OUTCAR
NELECT = 44.0000 total number of electrons,
$ python
Python 2.7.12 (default, Dec 4 2017, 14:50:18)
[GCC 5.4.0 20160609] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import os
>>> result = os.system("grep 'NELECT' "+os.getcwd()+"/OUTCAR")
NELECT = 44.0000 total number of electrons,
>>> result
0
The "NELECT" line was just printed by grep to stdout, but not captured in the result variable
>>> from subprocess import check_output
>>> result2 = check_output(["grep", "NELECT", os.getcwd()+"/OUTCAR"])
>>> result2
'NELECT = 44.0000 total number of electrons,\n'
>>> import re
>>> re.findall(r'\d+', result2)
['44', '0000']
Or, don't call out to grep, read the file yourself
>>> import os
>>> import re
>>> with open(os.getcwd() + "/OUTCAR") as f:
...     for line in f:
...         if "NELECT" in line:
...             digits = re.findall(r'\d+', line)
...             break
...
>>> digits
['44', '0000']
Or, maybe don't use a regular expression:
>>> words = line.split()
>>> words[2]
'44.0000'
>>> int(float(words[2]))
44
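Putting the file-reading and split() steps together, a small helper might look like the sketch below. The function name `read_nelect` is hypothetical, and it assumes the value is always the third whitespace-separated token on the first line containing NELECT, as in the sample line above.

```python
def read_nelect(path):
    """Return the NELECT value from an OUTCAR-style file, or None if absent."""
    with open(path) as f:
        for line in f:
            if "NELECT" in line:
                # Tokens look like ['NELECT', '=', '44.0000', 'total', ...]
                return float(line.split()[2])
    return None
```

Calling int(read_nelect("OUTCAR")) would then give the electron count as an integer, provided the file exists in the working directory.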
Are you sure that electrons actually contains the grep output? For me this regex returns a list with two elements, ['44', '0000'], and that is the expected behavior. So most probably there is something wrong with the grep call.
Your regex won't retrieve the whole 44.0000, as \d+ catches only continuous runs of digits, not the dot. To get the whole number use something like \b\d+\.\d+\b, which means: any word (\b marks a word boundary; the dot must be escaped because an unescaped . in a regex matches any character) that contains at least one digit, a dot and at least one more digit. If the fractional part is optional, use something like \b(\d+(?:\.\d+)?)\b ((?:...) creates a group that is not captured, so your output will still be a single-element list).
Note that re.findall returns a list of string matches. To retrieve a number from the first match: float(VBM[0])
Edit. Forgot to add: avoid using the print statement; it works oddly with tuples and is completely removed in Python 3. Python 2 support ends in 2020, so it's better to prepare. You can replace the print statement with the Python 3 print function by adding from __future__ import print_function at the beginning of the file.
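As a quick check of the optional-fraction pattern, the sketch below runs it against the sample line from the question. The outer capturing group is dropped here (an assumption made for brevity); with no capturing groups, findall returns the whole match.

```python
import re

line = 'NELECT = 44.0000 total number of electrons,'

# One or more digits, an optional fractional part, bounded by word boundaries.
matches = re.findall(r'\b\d+(?:\.\d+)?\b', line)
print(matches)             # ['44.0000']

nelect = float(matches[0])
print(int(nelect))         # 44
```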

How should I remove one string from the start of another given that I know the longer string matches case-insensitively?

Suppose I have a workflow that involves examining the start of a long string (LS, say) to see whether it begins with a shorter string SS. If it does, I chop off the matching part of LS and do something with the remaining part. Otherwise, I do something else. (The specific case that prompted this question was a parsing library.)
def do_thing(LS, SS):
    if (LS.startswith(SS)):
        action_on_match(LS[len(SS):])
    else:
        action_on_no_match()
This is straightforward. Now, though, suppose that I want to do the same thing but this time I want the strings to be matched case-insensitively. It is possible to test whether "LS.startswith(SS) but case-insensitively". But how should I determine how much of LS to "chop off" when I pass it in to action_on_match()? It isn't sufficient to just use len(SS) as it was before, because if I'm uppercasing or lowercasing or casefolding things, then the length of the matching prefix of LS might not be what I expect: changing the case of a string can change its length. It is important that the part of LS passed to action_on_match() be exactly what the program received as input (after the cutoff point, of course).
Answerers have suggested using lower() and preserving the use of len(SS), but this will not work:
Python 3.4.2 (v3.4.2:ab2c023a9432, Oct 6 2014, 22:15:05) [MSC v.1600 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> def action_on_match (s): return "Match: %s" % s
...
>>> def action_on_no_match (): return "No match"
...
>>> def do_thing (LS, SS):
...     if LS.lower().startswith(SS.lower()):
...         return action_on_match(LS[len(SS):])
...     else:
...         return action_on_no_match()
...
>>> do_thing('i\u0307asdf', '\u0130')
'Match: \u0307asdf'
>>>
Here we expect to see 'Match: asdf', but there is an extra character.
Simple enough:
def do_thing(LS, SS):
    if LS.lower().startswith(SS.lower()):
        action_on_match(LS[len(SS):])
    else:
        action_on_no_match()
All I'm doing is lower-casing both LS and SS and then comparing them. This will be much slower than a regex solution for very long strings, as it has to convert the entire string to lowercase first.
A regex solution would look like this:
import re
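def do_thing(LS, SS):
    # re.match() already anchors at the start of LS; re.escape() keeps
    # regex metacharacters in SS from being interpreted as pattern syntax.
    if re.match(re.escape(SS), LS, re.I):
        action_on_match(LS[len(SS):])
    else:
        action_on_no_match()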
Performance
For short strings (len(LS) == 8 characters) over 1000000 iterations:
lower() method: 0.86s (winner)
re method: 1.91s
For long strings (len(LS) == 600 characters) over 1000000 iterations:
lower() method: 2.54s
re method: 1.96s (winner)
Unicode combining characters
For unicode combining characters, the data needs to be normalised first. This means converting any precomposed character into its component parts. You will find for example:
>>> from unicodedata import normalize
>>> '\u0130' == 'I\u0307'
False
>>> normalize("NFD", '\u0130') == normalize("NFD", 'I\u0307')
True
You will need to perform this normalisation process on your inputs:
SS = normalize("NFD", SS)
LS = normalize("NFD", LS)
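Combined with the lower() comparison, the normalisation step might be sketched as follows. The helper name `strip_prefix_nfd` is hypothetical; `normalize` is the function from the standard unicodedata module. Note two assumptions: the remainder returned is the NFD-normalised text, which may differ from the raw input, and lower() is assumed not to change the length of the normalised prefix.

```python
from unicodedata import normalize


def strip_prefix_nfd(ls, ss):
    """Return ls without the case-insensitive prefix ss, or None if no match."""
    # Decompose precomposed characters so both strings use the same code points.
    ls = normalize("NFD", ls)
    ss = normalize("NFD", ss)
    if ls.lower().startswith(ss.lower()):
        return ls[len(ss):]
    return None


print(strip_prefix_nfd('i\u0307asdf', '\u0130'))  # asdf
```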
Just use str.lower, the length of "FOO" is going to be the same as "foo".lower():
LS.lower().startswith(SS.lower())
def do_thing(ls, ss):
    if ls.startswith(ss):
        action_on_match(ls[len(ss):])
    else:
        action_on_no_match()
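If the remainder must be exactly the characters received as input, one alternative (not taken from the answers above) is to walk LS one character at a time and compare each character's case-folded form against the case-folded SS. The helper name `strip_prefix_casefold` is hypothetical; the sketch relies on str.casefold(), available since Python 3.3, and assumes no Unicode normalisation is wanted.

```python
def strip_prefix_casefold(ls, ss):
    """Return the untouched tail of ls after a caseless prefix ss, else None."""
    target = ss.casefold()
    folded_len = 0   # how much of target has been matched so far
    end = 0          # how many characters of ls have been consumed
    while folded_len < len(target) and end < len(ls):
        piece = ls[end].casefold()   # may be longer than one character
        if not target.startswith(piece, folded_len):
            return None
        folded_len += len(piece)
        end += 1
    if folded_len != len(target):
        return None   # ls ran out before the whole prefix was matched
    return ls[end:]


print(strip_prefix_casefold('i\u0307asdf', '\u0130'))  # asdf
```

Because the cut-off index is counted in characters of the original LS, a character whose folded form is longer than itself (such as 'ß', which folds to 'ss') does not shift the slice.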

Python regular expression gives unexpected result

I'm trying to create an svn pre-commit hook, but can't get my regular expression to work as expected. It should print False for messages that do not look like "DEV-5 | some message". Why do I get True here?
Python 2.7.1+ (r271:86832, Apr 11 2011, 18:05:24)
[GCC 4.5.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import re
>>> p = re.compile("^\[[A-Z]+-[0-9]+\] | .+$", re.DOTALL)
>>> message = "test message"
>>> match = p.search(message)
>>> bool(match)
True
>>> p = re.compile("^[A-Z]+-[0-9]+ \| .+$", re.DOTALL)
>>> print p.search("test message")
None
>>> print p.search("DEV-5 | some message")
<_sre.SRE_Match object at 0x800eb78b8>
you don't need \[ and \]
you need to escape |
The culprit is the trailing " | .+$" which is matching ' message' as an alternative to the first regex. As Roman pointed out you meant to match literal '|' so you have to escape it as '\|'.
To see what was being matched, you can do:
print match.group()
' message'
By the way, a faster non-regex way to handle only lines containing a vertical bar would be to use line.split('|'):
for line in ...:
    parts = line.split('|', 1)
    if len(parts) == 1: continue
    (code, mesg) = parts
I haven't run the code, but I suspect that the part after the alternative (|) in your regexp matches any nonempty string starting with a space, in this case it's " message".
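The effect of the unescaped | can be reproduced side by side. The sketch below reuses the two patterns from the question, written as raw strings; the variable names are illustrative.

```python
import re

# Unescaped: "|" splits the pattern into two alternatives,
#   ^\[[A-Z]+-[0-9]+\]<space>   OR   <space>.+$
unescaped = re.compile(r"^\[[A-Z]+-[0-9]+\] | .+$", re.DOTALL)

# Escaped: "\|" is a literal vertical bar inside one single pattern.
escaped = re.compile(r"^[A-Z]+-[0-9]+ \| .+$", re.DOTALL)

stray = unescaped.search("test message")
print(repr(stray.group()))                            # ' message'
print(escaped.search("test message"))                 # None
print(bool(escaped.search("DEV-5 | some message")))   # True
```

The first result comes entirely from the second alternative, which matches any space followed by at least one character up to the end of the string.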
