How to use variable in python regex? - python

I am trying to process user input in regex from variable. After a lot of searching I have come up with following:
Explaination of code variables:
step is a string used as input for regex
e.g.
replace|-|space ,
replace|*|null,
replace|/|\|squot|space
b is a list of elements. Element is fetched and modified as per regex.
i is integer received from other function to access list b using i as index
I process the above string to get array, then use the last element of array as substitution string
First element is deleted as it is not required.
All other elements need to be replaced with substitution string.
def replacer(step,i,b):
steparray = step.split('|')
del steparray[0]
final = steparray.pop()
if final == "space":
subst = u" "
elif final == "squot":
subst = u"'"
elif final == "dquot":
subst = u"\""
else:
subst = u"%s"%final
for input in xrange(0,len(steparray)):
test=steparray[input]
regex = re.compile(ur'%s'%test)
b[i] = re.sub(regex, subst, b[i])
print b[i]
However, when I run above code, following error is shown:
File "CSV_process.py", line 78, in processor
replacer(step,i,b)
File "CSV_process.py", line 115, in replacer
regex = re.compile(ur'%s'%test)
File "/usr/lib/python2.7/re.py", line 190, in compile
return _compile(pattern, flags)
File "/usr/lib/python2.7/re.py", line 242, in _compile
raise error, v # invalid expression
sre_constants.error: nothing to repeat
I tried a lot but dont understand how regex works. Please help with error.
Final requirement is to get a special character from user input and replace it with another character (again from user input)
PS: Also, the code does not have 242 lines but error is on line 242. Is the error occurring after end of array in for loop?

Some special characters like * should be escaped to match literally.
>>> import re
>>> re.compile('*')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Python27\lib\re.py", line 194, in compile
return _compile(pattern, flags)
File "C:\Python27\lib\re.py", line 251, in _compile
raise error, v # invalid expression
sre_constants.error: nothing to repeat
Using re.escape, you can escape them:
>>> print(re.escape('*'))
\*
>>> re.compile(re.escape('*'))
<_sre.SRE_Pattern object at 0x000000000273DF10>
BTW, if you want to simply replace them, regular expression is not necessary. Why don't you use str.replace?
replaced_string = string_object.replace(old, new)

Related

Python re search on binary strings

I am trying to use python re to find binary substrings but I get a somewhat puzzling error.
Here is a small example to demonstrate the issue (python3):
import re
memory = b"\x07\x00\x42\x13"
query1 = (7).to_bytes(1, byteorder="little", signed=False)
query2 = (42).to_bytes(1, byteorder="little", signed=False)
# Works
for match in re.finditer(query1, memory):
print(match.group(0))
# Causes error
for match in re.finditer(query2, memory):
print(match.group(0))
The first loop correctly prints b'\x07' while the second gives the following error:
Traceback (most recent call last):
File "<stdin>", line 2, in <module>
File "/usr/lib/python3.7/re.py", line 230, in finditer
return _compile(pattern, flags).finditer(string)
File "/usr/lib/python3.7/re.py", line 286, in _compile
p = sre_compile.compile(pattern, flags)
File "/usr/lib/python3.7/sre_compile.py", line 764, in compile
p = sre_parse.parse(p, flags)
File "/usr/lib/python3.7/sre_parse.py", line 930, in parse
p = _parse_sub(source, pattern, flags & SRE_FLAG_VERBOSE, 0)
File "/usr/lib/python3.7/sre_parse.py", line 426, in _parse_sub
not nested and not items))
File "/usr/lib/python3.7/sre_parse.py", line 651, in _parse
source.tell() - here + len(this))
re.error: nothing to repeat at position
For context I am trying to find specific integers within the memory space of a program in similar fashion to tools like cheat engine. It is being done using python scripts within gdb.
-- Note 1 --
I have a suspicion that this may be related to the fact that 42 is representable in ascii as * while 7 is not. For example if you print the query strings you get:
>>> print(query1)
b'\x07'
>>> print(query2)
b'*'
-- Note 2 --
Actually it looks like this is unrelated to whether the string is representable in ascii. If you run:
import re
memory = b"\x07\x00\x42\x13"
for i in range(255):
query = i.to_bytes(1, byteorder="little", signed=False)
try:
for match in re.finditer(query, memory):
pass
except:
print(str(i) + " failed -- as ascii: " + chr(i))
It gives:
40 failed -- as ascii: (
41 failed -- as ascii: )
42 failed -- as ascii: *
43 failed -- as ascii: +
63 failed -- as ascii: ?
91 failed -- as ascii: [
92 failed -- as ascii: \
All of the failed bytes represent characters which are special to re syntax. This makes me think that python re is first printing the query string and then parsing it to do that search. I guess that is not entirely unreasonable but still odd.
Actually in writing this question I've found a solution which is to first wrap the query in re.escape(query) which will insert a \ before each special character but I will still post this question in case it may be helpful to others or if anyone has more to add.
\x42 is corresponds to *, which is a special regex character. You can instead use
re.finditer(re.escape(query2), memory)
which will escape the query (convert * to \*) and find the character * in the string.

Python regex problem when using the sub() method to replace content in html code to respective translations [duplicate]

This question already has an answer here:
python re.sub group: number after \number
(1 answer)
Closed 3 years ago.
I am a beginner at Python and met with some coding problem that I can't solve.
What I have:
the source sentences and their respective translations in two columns in a spreadsheet;
the html code which contains sentences and html tags
What I'm trying to do: use Python regex method - sub() to find and replace english sentences to their respective translated sentences.
For example: three sentences in html codes -
Pumas are large animals.
They are found in America.
They don't eat grass
I have the translations of each sentence in the html code. I want to replace the sentences one at a time and also keep the html tags. Normally I can use the sub() method like this:
regex1 = re.compile(r'(\>.*)SOURCE_SENTENCE_HERE ?(.*\<)')
resultCode = regex1.sub(r'\1TRANSLATION_SENTENCE_HERE\2', originalHtmlCode)
I've written a python script to do this. I save the html code in a txt file and access it in my Python code (succeeded). Then I create a dictionary to store the source-target paires in the spreadsheet mentioned above (succeeded). Lastly, I use rexgex sub() method to find and replace the sentences in the html code (failed). This last part didn't work at all for some reason. Link to my Python code - https://pastebin.com/ZSUNB4yg or below:
import re, openpyxl, pyperclip
buynavFile = open('C:\\Users\\zs\\Documents\\PythonScripts\\buynavCode.txt')
buynavCode = buynavFile.read()
buynavFile.close()
wb = openpyxl.load_workbook('buynavSegments.xlsx')
sheet = wb.get_sheet_by_name('Sheet1')
segDict = {}
maxRow = sheet.max_row
for i in range(2, maxRow + 1):
segDict[sheet.cell(row=i, column=3).value] = sheet.cell(row=i, column=4).value
for k, v in segDict.items():
k = '(\\>.*)' + str(k) + ' ?(.*\\<)'
v = '\\1' + str(v) + '\\2'
buynavRegex = re.compile(k)
buynavResult = buynavRegex.sub(v, buynavCode)
pyperclip.copy(buynavResult)
print('Result copied to clipboard')
Error message below:
Traceback (most recent call last):
File "C:\Users\zs\Documents\PythonScripts\buynav.py", line 20, in
buynavResult = buynavRegex.sub(v, buynavCode)
File "C:\Users\zs\AppData\Local\Programs\Python\Python36\lib\re.py",
line 326, in _subx
template = _compile_repl(template, pattern)
File "C:\Users\zs\AppData\Local\Programs\Python\Python36\lib\re.py",
line 317, in _compile_repl
return sre_parse.parse_template(repl, pattern)
File
"C:\Users\zs\AppData\Local\Programs\Python\Python36\lib\sre_parse.py",
line 943, in parse_template
addgroup(int(this[1:]), len(this) - 1)
File
"C:\Users\zs\AppData\Local\Programs\Python\Python36\lib\sre_parse.py",
line 887, in addgroup
raise s.error("invalid group reference %d" % index, pos)
sre_constants.error: invalid group reference 11 at position 1
Could someone enlighten me on this please? I would really appreciate it.
Consider if you want to use a replacement text where you have to put the contents of group 1 and concatenate them to the string 2. You could write r'\12' but this wont work because the regex parser will think that you are referencing group 12 instead of the group 1 followed by the string 2!
>>> re.sub(r'(he)llo', r'\12', 'hello')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python3.6/re.py", line 191, in sub
return _compile(pattern, flags).sub(repl, string, count)
File "/usr/lib/python3.6/re.py", line 326, in _subx
template = _compile_repl(template, pattern)
File "/usr/lib/python3.6/re.py", line 317, in _compile_repl
return sre_parse.parse_template(repl, pattern)
File "/usr/lib/python3.6/sre_parse.py", line 943, in parse_template
addgroup(int(this[1:]), len(this) - 1)
File "/usr/lib/python3.6/sre_parse.py", line 887, in addgroup
raise s.error("invalid group reference %d" % index, pos)
sre_constants.error: invalid group reference 12 at position 1
You can solve this using the \g<1> syntax to refer to the group: r'\g<1>2':
>>> re.sub(r'(he)llo', r'\g<1>2', 'hello')
'he2'
In your case your replacement string contains dynamic contents like str(v) which can be anything. If it happens to start with a number you end up in the case described before so you want to use \g<1> to avoid this issue.

Python lex - TypeError: Unknown text

I'm trying to write a simple lex parser. The cope is currently:
from ply import lex
tokens = (
'COMMENT',
'OTHER'
)
t_COMMENT = r'^\#.*\n'
t_OTHER = r'^[^\#].*\n'
def t_error(t):
raise TypeError("Unknown text '%s'" % (t.value,))
lex.lex()
lex.input(yaml)
for tok in iter(lex.token, None):
print repr(tok.type), repr(tok.value)
But is fails to parse simple input file:
# This is a real comment
#And this one also
#/*
# *
# *Variable de feeu
# */
ma_var: True
It is done, over, kaput
With the following output:
l
'COMMENT' '# This is a real comment\n'
Traceback (most recent call last):
File "parser_adoc.py", line 62, in <module>
main2()
File "parser_adoc.py", line 57, in main2
for tok in iter(lex.token, None):
File "/usr/lib/python2.7/site-packages/ply/lex.py", line 384, in token
newtok = self.lexerrorf(tok)
File "parser_adoc.py", line 44, in t_error
raise TypeError("Unknown text '%s'" % (t.value,))
TypeError: Unknown text '#And this one also
#/*
# *
# *Variable de feeu
# */
ma_var: True
this is done
'
So in summary, I defined 2 regex:
One for line beginning with #
One for lines beginning not with #
But it's not working.
I don't understand what's wrong with my regex.
Could you help?
Simon
In python regexes (which PLY uses), ^ refers to the beginning of the string, not the beginning of the line, unless multi-line mode has been set. So since both of your rules start with ^, they can only match on the first line.
You could fix this by wrapping your regexes in (?m:...), which enables multi-line mode, but that's not even necessary here. Instead you can just remove the ^ from the beginning of your rules and it will work as you intend. Since both of your rules always match the entire line, the next token will always start at the beginning of the line - no need to anchor them.

What is the python vesion of this?

How would you do this in python?
(It goes through a file and print the string in between author": ", " and text": ", \ and then print them to their files)
Here is an example string before it goes through this:
{"text": "Love this series!\ufeff", "time": "Hace 11 horas", "author": "HasRah", "cid": "UgyvXmvSiMjuDrOQn-l4AaABAg"}
#!/bin/bash
cat html.txt | awk -F 'author": "' {'print $2'} | cut -d '"' -f1 >> users.txt
cat html.txt | awk -F 'text": "' {'print $2'} | cut -d '\' -f1 >> comments.txt
I tried to do it like this in python (Didn't work):
import re
start = '"author": "'
end = '", '
st = open("html.txt", "r")
s = st.readlines()
u = re.search('%s(.*)%s' % (start, end), s).group(1)
#print u.group(1)
Not sure if I'm close.
I get this error code:
Traceback (most recent call last):
File "test.py", line 9, in <module>
u = re.search('%s(.*)%s' % (start, end), s).group(1)
File "/usr/lib/python2.7/re.py", line 146, in search
return _compile(pattern, flags).search(string)
TypeError: expected string or buffer`
Before getting into any of this: As chepner pointed out in a comment, this input looks like, and therefore is probably intended to be, JSON. Which means you shouldn't be parsing it with regular expressions; just parse it as JSON:
>>> s = ''' {"text": "Love this series!\ufeff", "time": "Hace 11 horas", "author": "HasRah", "cid": "UgyvXmvSiMjuDrOQn-l4AaABAg"}'''
>>> obj = json.loads(s)
>>> obj['author']
'HasRah'
Actually, it's not clear whether your input is a JSON file (a file containing one JSON text), or a JSONlines file (a file containing a bunch of lines, each of which is a JSON text with no embedded newlines).1
For the former, you want to parse it like this:
obj = json.load(st)
For the latter, you want to loop over the lines, and parse each one like this:
for line in st:
obj = json.loads(line)
… or, alternatively, you can get a JSONlines library off PyPI.
But meanwhile, if you want to understand what's wrong with your code:
The error message is telling you the problem, although maybe not in the user-friendliest way:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python2.7/re.py", line 148, in search
return _compile(pattern, flags).search(string)
TypeError: expected string or bytes-like object
See the docs for search make clear:
re.search(pattern, string, flags=0)
Scan through string looking for the first location where the regular expression pattern produces a match, and return a corresponding MatchObject instance…
You haven't passed it a string, you've passed it a list of strings. That's the whole point of readlines, after all.
There are two obvious fixes here.
First, you could read the whole file into a single string, instead of reading it into a list of strings:
s = st.read()
u = re.search('%s(.*)%s' % (start, end), s).group(1)
Alternatively, you could loop over the lines, trying to match each one. And, if you do this, you still don't need readlines(), because a file is already an iterable of lines:
for line in st:
u = re.search('%s(.*)%s' % (start, end), line).group(1)
While we're at it, if any of your lines don't match the pattern, this is going to raise an AttributeError. After all, search returns None if there's no match, but then you're going to try to call None.group(1).
There are two obvious fixes here as well.
You could handle that error:
try:
u = re.search('%s(.*)%s' % (start, end), line).group(1)
except AttributeError:
pass
… or you could check whether you got a match:
m = re.search('%s(.*)%s' % (start, end), line)
if m:
u = m.group(1)
1. In fact, there are at least two other formats that are nearly, but not quite, identical to JSONlines. I think that if you only care about reading, not creating files, and you don't have any numbers, you can parse all of them with a loop around json.loads or with a JSONlines library. But if you know who created the file, and know that they intended it to be, say, NDJ rather than JSONlines, you should read the docs on NDJ, or get a library made for NDJ, rather than just trusting that some guy on the internet thinks it's OK to treat it as JSONlines.

python regular expression error

I am trying do a pattern search and if match then set a bitarray on the counter value.
runOutput = device[router].execute (cmd)
runOutput = output.split('\n')
print(runOutput)
for this_line,counter in enumerate(runOutput):
print(counter)
if re.search(r'dev_router', this_line) :
#want to use the counter to set something
Getting the following error:
if re.search(r'dev_router', this_line) :
2016-07-15T16:27:13: %ERROR: File
"/auto/pysw/cel55/python/3.4.1/lib/python3.4/re.py", line 166,
in search 2016-07-15T16:27:13: %-ERROR: return _compile(pattern,
flags).search(string)
2016-07-15T16:27:13: %-ERROR: TypeError: expected string or buffer
You mixed up the arguments for enumerate() - first goes the index, then the item itself. Replace:
for this_line,counter in enumerate(runOutput):
with:
for counter, this_line in enumerate(runOutput):
You are getting a TypeError in this case because this_line is an integer and re.search() expects a string as a second argument. To demonstrate:
>>> import re
>>>
>>> this_line = 0
>>> re.search(r'dev_router', this_line)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/user/.virtualenvs/so/lib/python2.7/re.py", line 146, in search
return _compile(pattern, flags).search(string)
TypeError: expected string or buffer
By the way, modern IDEs like PyCharm can detect this kind of problems statically:
(Python 3.5 is used for this screenshot)

Categories