Python PLY syntax error: can't parse d[0-9]+

I'm trying to parse this list:
d0,d1,d2,d3,....d456,d457....
To parse this with python-ply, I tried each of the following token expressions:
t_DID = r'[d][0-9]+'
t_DID = r'd[0-9]+'
t_DID = r'\d[0-9]+'
But each of them gives me an error.
When I enter 1, I get: DEBUG:root:Syntax error at '1'
And when I enter d, I get: DEBUG:root:Syntax error at 'd'
What would be the correct token for this pattern?
How can I resolve this?

None of those patterns match either d or 1.
r'[d][0-9]+' and r'd[0-9]+' match a d followed by at least one digit. So they will match d1 or d234, but they won't match d because it is not followed by a digit, and they won't match 1 because it doesn't start with d.
r'\d[0-9]+' matches a digit (\d) followed by at least one more digit. So it won't match any string starting with d, and it won't match 1 because it requires at least two digits. But it will match 12, 274 and 29847502948375029384750293485702938750493875.
You can read about Python regular expressions in the Python docs for the re module, which also describe the \ escape codes, including \d.
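The behaviour described above can be checked directly with the standard re module. This is only a small sketch: the sample inputs and the helper name matching_inputs are made up for illustration, and re.fullmatch is used so that a pattern counts as matching only if it covers the whole input.

```python
import re

# The three patterns from the question.
patterns = [r'[d][0-9]+', r'd[0-9]+', r'\d[0-9]+']
# Hypothetical sample inputs, chosen to cover the cases discussed above.
inputs = ['d', '1', 'd1', 'd234', '12']


def matching_inputs(pattern, candidates):
    """Return the candidates that the pattern matches in full."""
    return [s for s in candidates if re.fullmatch(pattern, s)]


results = {p: matching_inputs(p, inputs) for p in patterns}
for pattern, matched in results.items():
    print(pattern, matched)
# [d][0-9]+ ['d1', 'd234']
# d[0-9]+ ['d1', 'd234']
# \d[0-9]+ ['12']
```

Neither d nor 1 appears in any of the three result lists, which is consistent with the two syntax errors reported in the question.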
It's easy to build an interactive tool which lets you experiment with Python regular expressions. Here's a very simple example, which could be improved a lot:
$ python3
Python 3.6.9 (default, Nov 7 2019, 10:44:02)
[GCC 8.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import re
>>> import readline
>>> def try_regex(regex):
...     r = re.compile(regex)
...     try:
...         while True:
...             match = r.match(input('--> '))
...             if match:
...                 print(f"Matched {match.end()} characters: {match[0]}")
...             else:
...                 print("No match")
...     except EOFError:
...         pass
...
>>> try_regex(r'd[0-9]+')
--> d1
Matched 2 characters: d1
--> d123
Matched 4 characters: d123
--> 1
No match
--> d
No match
--> d123 abc
Matched 4 characters: d123
--> d123abc
Matched 4 characters: d123
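To see how the d[0-9]+ pattern behaves on the whole comma-separated list from the question, here is a rough tokenizer sketch. It is not PLY and does not reproduce PLY's API; it only uses the standard re module, and the names TOKEN_RE, tokenize and the token names DID, COMMA and SKIP are assumptions made for this example.

```python
import re

# One alternative per token kind; SKIP covers spaces and tabs between tokens.
TOKEN_RE = re.compile(r'(?P<DID>d[0-9]+)|(?P<COMMA>,)|(?P<SKIP>[ \t]+)')


def tokenize(text):
    """Split text into (kind, value) pairs, raising on unexpected input."""
    tokens = []
    pos = 0
    while pos < len(text):
        match = TOKEN_RE.match(text, pos)
        if match is None:
            raise SyntaxError(f"Syntax error at {text[pos]!r}")
        if match.lastgroup != 'SKIP':
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens


print(tokenize('d0,d1, d456'))
# [('DID', 'd0'), ('COMMA', ','), ('DID', 'd1'), ('COMMA', ','), ('DID', 'd456')]
```

With this sketch, a bare d or a bare 1 raises SyntaxError, because neither is a complete DID token.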

Related

How to extract a part of url from dictionary value in Python?

I have a dictionary where the key's value is
https://service-dmn1-region.com/info 4169 description
I'm interested in extracting dmn1-region from the URL part and printing 4169 description as it is. So I intend to print the result as:
dmn1-region :4169 description
Is this possible without a complex regular expression? The script is in Python, and I tried this:
import re
print re.sub('https://','',dictionary[key])
This just removes the https:// part and shows the result as service-dmn1-region.com/info 4169 description. But I'm not sure how to get the intended output above.
The key-value pairs in the dictionary look like this:
dictionary = {'service': 'https://service-dmn1-region.com/info 4169 description',
              'service1': 'https://service1-dmn2-region2.com/info 5123 someDescription',
              'service2': 'https://dmn1-region-service2.com/info'}
Any insights and help would be very much appreciated.
Given the information and the fact that you don't want to use regular expressions, you could do something like this:
dictionary = {'service': 'https://service-dmn1-region.com/info 4169 description',
              'service1': 'https://service1-dmn2-region2.com/info 5123 someDescription'}

def extract(key, s):
    info = '/info'
    service = key + '-'
    return s[s.find('service') + len(service):s.find('.com')], s[s.find(info) + len(info):].strip()

for key, value in dictionary.items():
    region, info = extract(key, value)
    print('{0}:{1}'.format(region, info))
Output
dmn2-region2:5123 someDescription
dmn1-region:4169 description
Note that the urls are the values of the dictionary and not the keys.
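Another regex-free option, sketched here under the assumption that each value is a URL optionally followed by a space and a description, is to let urllib.parse pick out the host name. The function name extract_region is hypothetical, and stripping the last dot-separated label as the top-level domain is a simplification that happens to fit the sample data.

```python
from urllib.parse import urlparse


def extract_region(key, value):
    """Return (region, description) for a 'URL description' string."""
    # Everything before the first space is the URL, the rest is the description.
    url, _, description = value.partition(' ')
    host = urlparse(url).hostname          # e.g. 'service-dmn1-region.com'
    name = host.rsplit('.', 1)[0]          # drop the trailing '.com'
    prefix = key + '-'
    if name.startswith(prefix):            # drop the leading '<key>-'
        name = name[len(prefix):]
    return name, description


dictionary = {
    'service': 'https://service-dmn1-region.com/info 4169 description',
    'service1': 'https://service1-dmn2-region2.com/info 5123 someDescription',
}
for key, value in dictionary.items():
    region, description = extract_region(key, value)
    print('{0} :{1}'.format(region, description))
# dmn1-region :4169 description
# dmn2-region2 :5123 someDescription
```

A value such as 'https://dmn1-region-service2.com/info', where the key is a suffix instead of a prefix and there is no description, is returned with the host name unchanged and an empty description; handling that case would need an extra rule.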
I'd use something like:
import re
for k, v in dictionary.items():  # .iteritems() for py2
    print(re.sub(r"^.*?{}-([^.]+).*?(\d+)\s(.*?)$".format(k), r"\1 :\2 \3", v))
dmn1-region :4169 description
dmn2-region2 :5123 someDescription
For values of the form https://service-dmn1-region.com/info 4169 description
you could just match on ^[^-]+-([^.]+)[^\s]+ (.*)$
[harald@localhost ~]$ python3
Python 3.6.6 (default, Jul 19 2018, 14:25:17)
[GCC 8.1.1 20180712 (Red Hat 8.1.1-5)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import re
>>> val = 'https://service-dmn1-region.com/info 4169 description'
>>> res = re.match(r'^[^-]+-([^.]+)[^\s]+ (.*)$', val)
>>> res.group(1)
'dmn1-region'
>>> res.group(2)
'4169 description'
where ^[^-]+ matches, from the start of the input (the initial ^), anything that isn't a hyphen ([^-]+), so https://service.
Next you specify that one hyphen must follow (^[^-]+-) and that you wish to capture everything after it that isn't a dot, with ([^.]+). (As you have guessed by now, starting a pattern written inside brackets [] with ^ negates it.)
That leads us to ^[^-]+-([^.]+). Next you wish to ignore everything up to the next whitespace, since that separates the URL from the other values in the string, so you add a pattern for anything that is not whitespace (\s), an extra [^\s]+, giving ^[^-]+-([^.]+)[^\s]+
You then follow this with the whitespace separator (if you expect more than one whitespace character you could use \s+ instead of a literal space) and add a final catch-all capture pattern (.*), which captures 4169 description (the dot stands for any character here) up to the end of the input $, leading you to ^[^-]+-([^.]+)[^\s]+ (.*)$.
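The pattern built up above can be wrapped in a small helper and applied to every value in the dictionary from the question. This is just a sketch; the names PATTERN and summarize are made up, and values that do not fit the "URL, space, description" shape simply return None.

```python
import re

# The pattern derived step by step above, compiled once.
PATTERN = re.compile(r'^[^-]+-([^.]+)[^\s]+ (.*)$')


def summarize(value):
    """Return 'region :description', or None if the value does not match."""
    match = PATTERN.match(value)
    if match is None:
        return None
    return '{} :{}'.format(match.group(1), match.group(2))


dictionary = {
    'service': 'https://service-dmn1-region.com/info 4169 description',
    'service1': 'https://service1-dmn2-region2.com/info 5123 someDescription',
    'service2': 'https://dmn1-region-service2.com/info',
}
for key, value in dictionary.items():
    print(key, '->', summarize(value))
# service -> dmn1-region :4169 description
# service1 -> dmn2-region2 :5123 someDescription
# service2 -> None
```

The third value has no space-separated description, so the literal space in the pattern cannot match and the helper returns None.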

Removing non integers from a grep obtained string w/ Python and Bash

I am using grep to grab the text out of a file:
NELECT = 44.0000 total number of electrons,
and I need to save the number as a variable. I have tried a handful of methods I have found here such as using filters and findall. For some reason I can only get it to separate one zero.
So far the code looks like this:
wd=os.getcwd()
electrons=str(os.system("grep 'NELECT' "+wd+"/OUTCAR"))
VBM=(re.findall('\d+', electrons))
print VBM
And in return I get ['0'].
The result of os.system is the exit status of the command, not the output of the command -- see https://docs.python.org/3/library/os.html#os.system
$ cat OUTCAR
NELECT = 44.0000 total number of electrons,
$ python
Python 2.7.12 (default, Dec 4 2017, 14:50:18)
[GCC 5.4.0 20160609] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import os
>>> result = os.system("grep 'NELECT' "+os.getcwd()+"/OUTCAR")
NELECT = 44.0000 total number of electrons,
>>> result
0
The "NELECT" line was just printed by grep to stdout, but not captured in the result variable
>>> from subprocess import check_output
>>> result2 = check_output(["grep", "NELECT", os.getcwd()+"/OUTCAR"])
>>> result2
'NELECT = 44.0000 total number of electrons,\n'
>>> import re
>>> re.findall(r'\d+', result2)
['44', '0000']
Or, don't call out to grep, read the file yourself
>>> import os
>>> import re
>>> with open(os.getcwd() + "/OUTCAR") as f:
...     for line in f:
...         if "NELECT" in line:
...             digits = re.findall(r'\d+', line)
...             break
...
>>> digits
['44', '0000']
Or, maybe don't use a regular expression:
>>> words = line.split()
>>> words[2]
'44.0000'
>>> int(float(words[2]))
44
Are you sure that electrons contains the output you expect? For me this regex returns a list with two elements, ['44', '0000'], and that's the expected behavior. So most probably something is wrong with the grep call.
Your regex won't retrieve the whole 44.0000, because \d+ matches only continuous runs of digits, not the dot. To get the whole number use something like \b\d+\.\d+\b, which means: any word (\b marks a word boundary; the dot must be escaped because an unescaped . in a regex matches any character) that consists of at least one digit, a dot, and at least one more digit. If the dot is optional, use something like \b(\d+(?:\.\d+)?)\b ((?:...) creates a group that is not captured, so your output will still be a single-element list).
Note that re.findall returns a list of string matches. To retrieve the number from the first match: float(VBM[0])
Edit. Forgot to add: avoid using the print statement; it works oddly with tuples and was removed in Python 3. Python 2 support ends in 2020, so it's better to prepare. You can replace the print statement with the Python 3 print function by adding from __future__ import print_function at the beginning of the file.
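Putting the pieces of this answer together, a minimal sketch that pulls the NELECT value out of some lines and converts it to a float might look like this. The names NUMBER_RE and parse_nelect are hypothetical, and the sample line is the one quoted in the question; in practice the lines would come from the open OUTCAR file.

```python
import re

# A run of digits with an optional fractional part, as discussed above.
NUMBER_RE = re.compile(r'\b(\d+(?:\.\d+)?)\b')


def parse_nelect(lines):
    """Return the first number on the NELECT line as a float, or None."""
    for line in lines:
        if 'NELECT' in line:
            numbers = NUMBER_RE.findall(line)
            if numbers:
                return float(numbers[0])
    return None


sample = ['NELECT = 44.0000 total number of electrons,\n']
print(NUMBER_RE.findall(sample[0]))  # ['44.0000']
print(parse_nelect(sample))          # 44.0
```

Because a file object iterates over its lines, parse_nelect(f) inside a with open(...) as f: block should work the same way as with the list used here.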

how to match 'space' symbol (and only space) in python2 regex?

I am trying to use Python v2 (2.7.5 specifically) and the 're' module for regex matching. My problem is that for my application I need to match the 'space' symbol (i.e. 0x20 in hex) and ONLY that symbol as part of the match string. The first thing I tried for that was '\s', and that does not work because it also matches the newline, carriage return, tab and form feed characters.
The end requirement is to match a string where the first three characters are digits ('\d'), there is a comma (',') and then eight symbols that are either digits ('\d') or spaces (???).
Any suggestions on how to do that? What I have already tried...
C:\Users\jlaird>python
Python 2.7.5 (default, May 15 2013, 22:43:36) [MSC v.1500 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import re
>>> goodstring = '333,000000 2'
>>> badstring = '333,000000\t2'
>>> print badstring
333,000000 2
>>> sRegex = '\d\d\d,[\s\d][\s\d][\s\d][\s\d][\s\d][\s\d][\s\d][\s\d]'
>>> cRegex = re.compile(sRegex)
>>> cRegex.match(goodstring)
<_sre.SRE_Match object at 0x023A7A30>
>>> cRegex.match(badstring)
<_sre.SRE_Match object at 0x025E82C0>
>>>
I want 'badstring' to evaluate to None because it has the tab character instead of the space. How can I do this?
Thanks jonrsharpe...works. It is always something simple that I make complicated. Sorry...
>>> sRegex = '\d\d\d,[ \d][ \d][ \d][ \d][ \d][ \d][ \d][ \d]'
>>> cRegex = re.compile(sRegex)
>>> cRegex.match(goodstring)
<_sre.SRE_Match object at 0x023A7A30>
>>> cRegex.match(badstring)
>>>
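The same pattern can be written more compactly with repetition counts. The sketch below is written for Python 3 rather than the 2.7.5 used above, and the name RECORD_RE is made up for the example; the character class [ \d] contains a literal space, so a tab no longer matches.

```python
import re

# Three digits, a comma, then exactly eight characters that are each
# either a literal space (0x20) or a digit.
RECORD_RE = re.compile(r'\d{3},[ \d]{8}')

goodstring = '333,000000 2'
badstring = '333,000000\t2'

print(RECORD_RE.match(goodstring) is not None)  # True
print(RECORD_RE.match(badstring) is not None)   # False
```

Note that match only anchors at the start of the string, just like the original session, so trailing characters after the eighth symbol are still accepted.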

How should I remove one string from the start of another given that I know the longer string matches case-insensitively?

Suppose I have a workflow that involves examining the start of a long string (LS, say) to see whether it begins with a shorter string SS. If it does, I chop off the matching part of LS and do something with the remaining part. Otherwise, I do something else. (The specific case that prompted this question was a parsing library.)
def do_thing(LS, SS):
    if LS.startswith(SS):
        action_on_match(LS[len(SS):])
    else:
        action_on_no_match()
This is straightforward. Now, though, suppose that I want to do the same thing but this time I want the strings to be matched case-insensitively. It is possible to test whether "LS.startswith(SS) but case-insensitively". But how should I determine how much of LS to "chop off" when I pass it in to action_on_match()? It isn't sufficient to just use len(SS) as it was before, because if I'm uppercasing or lowercasing or casefolding things, then the length of the matching prefix of LS might not be what I expect: changing the case of a string can change its length. It is important that the part of LS passed to action_on_match() be exactly what the program received as input (after the cutoff point, of course).
Answerers have suggested using lower() and preserving the use of len(SS), but this will not work:
Python 3.4.2 (v3.4.2:ab2c023a9432, Oct 6 2014, 22:15:05) [MSC v.1600 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> def action_on_match (s): return "Match: %s" % s
...
>>> def action_on_no_match (): return "No match"
...
>>> def do_thing (LS, SS):
...     if LS.lower().startswith(SS.lower()):
...         return action_on_match(LS[len(SS):])
...     else:
...         return action_on_no_match()
...
>>> do_thing('i\u0307asdf', '\u0130')
'Match: \u0307asdf'
>>>
Here we expect to see 'Match: asdf', but there is an extra character.
Simple enough:
def do_thing(LS, SS):
    if LS.lower().startswith(SS.lower()):
        action_on_match(LS[len(SS):])
    else:
        action_on_no_match()
All I'm doing is lower-casing both LS and SS and then comparing them. This will be much slower than a regex solution for very long strings, as it has to convert the entire string to lowercase first.
A regex solution would look like this:
import re

def do_thing(LS, SS):
    # re.escape keeps any regex metacharacters in SS from being interpreted.
    if re.match(re.escape(SS), LS, re.I):
        action_on_match(LS[len(SS):])
    else:
        action_on_no_match()
Performance
For short strings (len(LS) == 8 characters) over 1000000 iterations:
lower() method: 0.86s (winner)
re method: 1.91s
For long strings (len(LS) == 600 characters) over 1000000 iterations:
lower() method: 2.54s
re method: 1.96s (winner)
Unicode combining characters
For unicode combining characters, the data needs to be normalised first. This means converting any precomposed character into its component parts. You will find for example:
>>> from unicodedata import normalize
>>> '\u0130' == 'I\u0307'
False
>>> normalize("NFD", '\u0130') == normalize("NFD", 'I\u0307')
True
You will need to perform this normalisation process on your inputs:
SS = normalize("NFD", SS)
LS = normalize("NFD", LS)
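As a rough illustration of the normalisation step, the sketch below normalises both strings to NFD before comparing and slicing. The function name strip_prefix_normalized is hypothetical. One caveat, stated as an assumption about what is acceptable: the remainder it returns is taken from the normalised copy of LS, which may differ from the exact input the question asks to preserve.

```python
import unicodedata


def strip_prefix_normalized(ls, ss):
    """Return the NFD-normalised remainder of ls after ss, or None."""
    ls = unicodedata.normalize('NFD', ls)
    ss = unicodedata.normalize('NFD', ss)
    if ls.lower().startswith(ss.lower()):
        return ls[len(ss):]
    return None


# The example from the question: U+0130 decomposes to 'I' + U+0307.
print(strip_prefix_normalized('i\u0307asdf', '\u0130'))  # asdf
print(strip_prefix_normalized('xyz', 'foo'))             # None
```

After normalisation the prefix '\u0130' becomes the two-character sequence 'I\u0307', so len(ss) lines up with the two characters consumed from the long string in this example.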
Just use str.lower; for ASCII text the length of "FOO" is going to be the same as the length of "FOO".lower():
LS.lower().startswith(SS.lower())
def do_thing(ls, ss):
    if ls.lower().startswith(ss.lower()):
        action_on_match(ls[len(ss):])
    else:
        action_on_no_match()
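None of the answers above computes how many characters of the original LS were actually matched. One way to do that, sketched here as an assumption-laden alternative and not as any answerer's method, is to case-fold LS one character at a time until the folded prefix equals the folded SS. The function name split_casefold_prefix is made up, and the approach relies on str.casefold working character by character.

```python
def split_casefold_prefix(ls, ss):
    """Return the untouched remainder of ls after a caseless match of ss.

    Returns None when ls does not start with ss, or when the match would
    end in the middle of a character's case-folded expansion.
    """
    target = ss.casefold()
    if not target:
        return ls
    folded = ''
    for end, char in enumerate(ls, start=1):
        folded += char.casefold()
        if folded == target:
            return ls[end:]          # slice the original, unmodified input
        if not target.startswith(folded):
            return None
    return None


print(split_casefold_prefix('i\u0307asdf', '\u0130'))  # asdf
print(split_casefold_prefix('Strasse 1', 'STRA\u00dfE'))  # ' 1' (leading space kept)
print(split_casefold_prefix('xyz', 'foo'))             # None
```

The remainder is always a slice of the input as received, which is the property the question asks for. Canonically equivalent but differently composed inputs would still need the normalisation step described earlier.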

Python regular expression gives unexpected result

I'm trying to create an svn pre-commit hook, but can't get my regular expression to work as expected. It should print False for messages that do not look like "DEV-5 | some message". Why do I get True here?
Python 2.7.1+ (r271:86832, Apr 11 2011, 18:05:24)
[GCC 4.5.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import re
>>> p = re.compile("^\[[A-Z]+-[0-9]+\] | .+$", re.DOTALL)
>>> message = "test message"
>>> match = p.search(message)
>>> bool(match)
True
>>> p = re.compile("^[A-Z]+-[0-9]+ \| .+$", re.DOTALL)
>>> print p.search("test message")
None
>>> print p.search("DEV-5 | some message")
<_sre.SRE_Match object at 0x800eb78b8>
You don't need \[ and \].
You need to escape |.
The culprit is the trailing " | .+$" which is matching ' message' as an alternative to the first regex. As Roman pointed out you meant to match literal '|' so you have to escape it as '\|'.
To see what was being matched, you can do:
print match.group()
' message'
By the way, a faster non-regex way to handle only lines containing a vertical bar would use line.split('|'):
for line in ...:
    parts = line.split('|', 1)
    if len(parts) == 1:
        continue
    (code, mesg) = parts
I haven't run the code, but I suspect that the part after the alternative (|) in your regexp matches any nonempty string starting with a space, in this case it's " message".
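The difference between the two patterns can be reproduced in a few lines. This sketch is written for Python 3, and the names BROKEN, FIXED and is_valid_message are made up for the example; the patterns themselves are the ones from the session above.

```python
import re

# Unescaped '|' splits the pattern into two alternatives:
#   '^\[[A-Z]+-[0-9]+\] '   or   ' .+$'
BROKEN = re.compile(r"^\[[A-Z]+-[0-9]+\] | .+$", re.DOTALL)
# Escaped '\|' matches a literal vertical bar.
FIXED = re.compile(r"^[A-Z]+-[0-9]+ \| .+$", re.DOTALL)


def is_valid_message(message):
    """Return True for messages shaped like 'DEV-5 | some message'."""
    return FIXED.search(message) is not None


print(repr(BROKEN.search("test message").group()))  # ' message'
print(is_valid_message("test message"))             # False
print(is_valid_message("DEV-5 | some message"))     # True
```

The first print shows that the broken pattern matched only through its second alternative, a space followed by the rest of the string.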
