ElementTree.ParseError: reference to invalid character number - python

I get
ElementTree.ParseError: reference to invalid character number
when parsing XML that contains the following as a tag value: loca&#1;t
My code looks like:
respXML = httpResponse.content
#also possible respXML = httpResponse.content.decode("utf-8")
#but both get the same error
#this line throws the error
respRoot = ET.fromstring(respXML)
How can I bulletproof my parser against seemingly invalid character numbers?

That looks like HTML. Try running html.unescape on the input string before anything else. In Python 3.4+ the html module that provides it is part of the standard library.
https://pypi.python.org/pypi/html
>>> import html
>>> test = "locat"
>>> html.unescape(test)
'local'
Then convert some known Unicode characters to their ASCII equivalents, e.g.
“ => "
’ => '
...
Finally, replace double spaces with a single space.
Since it'll be pretty cumbersome to address everything successfully upfront - I recommend placing specific exceptions and writing the bad line to file.
One by one address each error in the output file by adding more rules.
Good luck.

I sometimes find it useful to preserve the original character references by rewriting them with a regex pattern, such as re.sub(r'&#([a-zA-Z0-9]+);?', r'[#\1;]', s). For example, with
import html
import re
from xml.etree import ElementTree as ET
s = "<Tag>loca&#1;t</Tag>"
using html.unescape produces
ET.fromstring(html.unescape(s)).text
#Out: 'locat'
but the regex pattern mentioned produces
ET.fromstring(re.sub(r'&#([a-zA-Z0-9]+);?', r'[#\1;]', s)).text
#Out: 'loca[#1;]t'
which preserves the "bad characters".
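As a rough sketch of the "bulletproofing" the question asks about, one option is to drop only those numeric character references that XML 1.0 does not allow, and leave valid ones untouched. This assumes Python 3; the helper names strip_invalid_charrefs and _is_valid_xml_char are made up for this illustration, and the allowed ranges follow the Char production of the XML 1.0 specification.

```python
import re
import xml.etree.ElementTree as ET

# Matches decimal (&#65;) and hexadecimal (&#x41;) character references.
_CHARREF = re.compile(r'&#(?:x([0-9A-Fa-f]+)|([0-9]+));')


def _is_valid_xml_char(cp):
    """Return True if the code point is allowed by the XML 1.0 Char production."""
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def strip_invalid_charrefs(text, replacement=''):
    """Replace character references that XML 1.0 forbids; keep valid ones as-is."""
    def fix(match):
        hex_digits, dec_digits = match.groups()
        cp = int(hex_digits, 16) if hex_digits else int(dec_digits)
        return match.group(0) if _is_valid_xml_char(cp) else replacement
    return _CHARREF.sub(fix, text)


cleaned = strip_invalid_charrefs('<Tag>loca&#1;t</Tag>')
root = ET.fromstring(cleaned)  # parses now that &#1; is gone
```

Passing a visible replacement such as '?' instead of the empty string keeps a trace of where a character was removed.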

Related

Pyparsing finds first occurrence in file

I'm parsing a file via
output=wildcard.parseFile(myfile)
print output
And I get only the first match of the string.
I have a big config file to parse, with "entries" which are surrounded by braces.
I expect to see all the matches that are in file or exception for not matching.
How do I achieve that?
By default, pyparsing will find the longest match, starting at the first character, and ignore whatever follows it. So, if your parser is given by num = Word('0123456789'), parsing "462" and parsing "462-780" both return the same value. However, if the parseAll=True option is passed, the parser will attempt to parse the entire string. In this case, "462" would be matched, but parsing "462-780" would raise a ParseException, because the parser doesn't know how to deal with the dash.
I would recommend constructing something that will match the entirety of the file, then using the parseAll=True flag in parseFile(). If I understand your description of each entry being separated by braces correctly, one could do the following.
entire_file = OneOrMore('[' + wildcard + ']')
output = entire_file.parseFile(myfile, parseAll=True)
print output

Django: Bad group name

I faced an error on "bad group name".
Here is the code:
for qitem in q['display']:
    if qitem['type'] == 1:
        for keyword in keywordTags.split('|'):
            p = re.compile('^' + keyword + '$')
            newstring=''
            for word in qitem['value'].split():
                if word[-1:] == ',':
                    word = word[0:len(word)-1]
                    newstring += (p.sub('<b>'+word+'</b>', word) + ', ')
                else:
                    newstring += (p.sub('<b>'+word+'</b>', word) + ' ')
            qitem['value']=newstring
And here's the error:
error at /result/1/
bad group name
Request Method: GET
Django Version: 1.4.1
Exception Type: error
Exception Value: bad group name
Exception Location: C:\Python27\lib\re.py in _compile_repl, line 257
Python Executable: C:\Python27\python.exe
Python Version: 2.7.3
Python Path: ['D:\ExamPapers', 'C:\Windows\SYSTEM32\python27.zip',
'C:\Python27\DLLs', 'C:\Python27\lib',
'C:\Python27\lib\plat-win', 'C:\Python27\lib\lib-tk',
'C:\Python27', 'C:\Python27\lib\site-packages']
Server time: Sun,3 Mar 2013 15:31:05 +0800
Traceback:
C:\Python27\lib\site-packages\django\core\handlers\base.py in get_response
response = callback(request, *callback_args, **callback_kwargs)
D:\ExamPapers\views.py in result
newstring += (p.sub('<b>'+word+'</b>', word) + ' ')
In summary, the error is at:
newstring += (p.sub('<b>'+word+'</b>', word) + ' ')
So you're trying to highlight occurrences of a set of keywords in bold. Right now this code is broken in several ways. You're using the re module to match the keywords, but you're also splitting both the keywords and the text into individual words. You don't need to do both, and the interaction between these two approaches is what is causing your problems.
Regular expressions can match several alternative strings at once; that's what they're good for! So instead of "^keyword$" to match just "keyword", you could use "^(keyword|hello)$" to match either "keyword" or "hello" (the parentheses matter: without them, "^keyword|hello$" means "starts with keyword" or "ends with hello"). The ^ and $ characters only match the beginning and end of the entire string, but what you probably wanted was to match at word boundaries, and for that you can use \b, like this: r"\b(keyword|hello)\b". Note the r before the string in the last example. It stands for "raw" and turns off Python's usual handling of backslashes, which conflicts with regular expression syntax; it's good practice to always use the r prefix when a string contains a regular expression. The parentheses group the alternatives together.
The regular expression sub method substitutes whatever the pattern matched with another string. The replacement string may contain "back references" that insert parts of the original match. Those parts are called "groups" and are marked with parentheses in the pattern; in the example above there is only one set of parentheses, so it is group 1 and the back reference is \1. The actual error message you asked about comes from your replacement string: it is built from the word itself, so when it contained something that looked like a back reference, re tried to resolve it, and there were no groups in your regular expression.
Using that you do something like this:
keywordMatcher = re.compile(r"\b(keyword|hello)\b")
value = keywordMatcher.sub(r"<b>\1</b>", value)
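To make the cause of the error concrete, here is a small sketch (Python 3; the input word is a made-up example, and the exact wording of the error message differs between Python versions). It shows that text interpolated into the replacement string is parsed as a template, while a callable replacement is used verbatim.

```python
import re

pattern = re.compile(r'^foo$')

# Hypothetical input that happens to look like a group reference.
word = r'\g<oops!>'

# Building the replacement template from untrusted text: re parses the
# template, finds an invalid group reference, and raises re.error.
try:
    pattern.sub('<b>' + word + '</b>', 'foo')
    raised = False
except re.error:
    raised = True

# A callable replacement is not parsed as a template, so the word is
# inserted exactly as written.
safe = pattern.sub(lambda m: '<b>' + word + '</b>', 'foo')
```

In the keywordMatcher approach above the problem does not arise, because the replacement r"<b>\1</b>" is a fixed string and the user-supplied text only ever appears on the pattern side.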
Another thing that isn't directly related to what you're asking, but is incredibly important: you are taking plain text strings (I assume) and turning them into HTML. This opens the door to script injection vulnerabilities, and if you don't take the time to understand and avoid them, bad guys will be able to hack the applications you build. They can do this in an automated way, so even if you think your app is too small for anyone to notice, it can still get hacked and used for all sorts of bad things. Don't let this happen! The basic rule is that it's OK to convert text to HTML, but you need to "escape" it first. This is very simple:
from django.utils import html
html_safe = html.escape(my_text)
All this does is convert characters like < to &lt;, which the browser will show as < but won't interpret as the beginning of a tag. So if a bad guy types <script> into one of your forms and it gets processed by your code, it is emitted as &lt;script&gt;, which the browser displays as the text <script> and does not execute as a script.
Likewise, if you use text in a regular expression and you don't intend it to contain special regular expression characters, then you must escape that too! You can do this using re.escape:
import re
my_regexp = re.compile(r"\b%s\b" % (re.escape(my_word),))
Ok, so now we've got that out of the way here is a method you could use to do what you wanted:
value = "this is my super duper testing thingy"
keywords = "super|my|test"
from django.utils import html
import re
# first we must split up the keywords
keywords = keywords.split("|")
# Next we must make each keyword safe for use in a regular expression,
# this is similar to the HTML escaping we discussed above but not to
# be confused with it.
keywords = [re.escape(k) for k in keywords]
# Now we reform the keywordTags string, but this time we know each keyword is regexp-safe
keywords = "|".join(keywords)
# Finally we create a regular expression that matches *any* of the keywords
keywordMatcher = re.compile(r'\b(%s)\b' % (keywords,))
# We are going to make the value into HTML (by adding <b> tags) so must first escape it
value = html.escape(value)
# We can then apply the regular expression to the value. We use a "back reference" `\1` to say
# that each keyword found should be replaced with itself wrapped in a <b> tag
value = keywordMatcher.sub(r"<b>\1</b>", value)
print value
I urge you to take the time to understand what this does, otherwise you're just going to get yourself into a mess! It's always easier to just cut and paste and move on, but this leads to crappy broken code and, worst of all, means you yourself don't improve and don't learn. All great coders started off as beginner coders who took the time to understand things :)
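For reference, the same steps can be wrapped into a function that does not depend on Django. This is only a sketch for Python 3: it swaps django.utils.html.escape for the standard library's html.escape, and highlight_keywords is a name invented here, not part of any library.

```python
import html
import re


def highlight_keywords(text, keyword_tags):
    """Wrap whole-word keyword matches in <b> tags, escaping everything else."""
    # Make each keyword safe to embed in a regular expression.
    keywords = [re.escape(k) for k in keyword_tags.split("|") if k]
    if not keywords:
        return html.escape(text)
    matcher = re.compile(r"\b(%s)\b" % "|".join(keywords))
    # Escape first so that the only markup in the result is the <b> we add.
    return matcher.sub(r"<b>\1</b>", html.escape(text))


result = highlight_keywords("this is my super duper testing thingy", "super|my|test")
```

One limitation of escaping before matching: a keyword such as "lt" or "amp" would also match inside the entities that escaping produces, so this sketch is only suitable when the keywords cannot collide with entity names.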

Python replace with re-using unknown strings

I have an XML in which I'd like to rename one of the tag groups like this:
<string>ABC</string>
<string>unknown string</string>
should be
<xyz>ABC</xyz>
<xyz>unknown string</xyz>
ABC is always the same, so that's no issue. However, "unknown string" is always different, but since I need this information extracted, I also want to keep the same string in the replacement.
Here's what I got so far:
import re
#open the xml file for reading:
file = open('path/file','r+')
#convert to string:
data = file.read()
file.write(re.sub("<string>ABC</string>(\s+)<string>(.*)</string>","<xyz>ABC</xyz>[\1]<xyz>[\2]</xyz>",data))
print (data)
file.close()
I tried to use capture groups, but didn't do it correctly. The string is replaced with weird symbols in my XML. Plus, it's printed twice. I have both the unchanged and the changed version in my XML, which I don't want.
The problem you're experiencing is not due to your regex pattern. The backslashes (\) in your strings are escaping the characters that follow them, which produces the weird symbols that you see.
>>> print "hello\1world"
helloworld
>>> print r"hello\1world"
hello\1world
Always use the raw string notation to define your re patterns.
>>> data = """
... <string>ABC</string>
... <string>unknown string</string>
... """
>>> print re.sub(r"<string>ABC</string>(\s+)<string>(.*)</string>",r"<xyz>ABC</xyz>\1<xyz>\2</xyz>",data)
<xyz>ABC</xyz>
<xyz>unknown string</xyz>
Why are you including the content in your replacement operation? All you need to do is:
Replace <string> by <xyz>.
Replace </string> by </xyz>.
It would take two operations but the intent of your code would be clear and you don't need to know what unknown string is.
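The duplicated content mentioned in the question comes from how the file is written rather than from the regex: after read(), the position of a file opened with 'r+' is at the end, so write() appends the new text after the old. A minimal sketch of the whole round trip in Python 3 follows; the function names are invented for this example.

```python
import re


def rename_string_tags(data):
    """Rename the <string>ABC</string>/<string>...</string> pair to <xyz> tags."""
    pattern = r"<string>ABC</string>(\s+)<string>(.*)</string>"
    return re.sub(pattern, r"<xyz>ABC</xyz>\1<xyz>\2</xyz>", data)


def rewrite_file(path):
    """Apply rename_string_tags to a file in place, without duplicating content."""
    with open(path, "r+") as f:
        data = f.read()
        f.seek(0)                       # go back to the start before writing
        f.write(rename_string_tags(data))
        f.truncate()                    # drop leftover bytes from the old content
```

Reading the file, closing it, and reopening it with mode 'w' would work just as well; seek and truncate simply keep everything within one open call.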

Find a string with newline or whitespace to fix broken xml input

Hello, I am having trouble finding a string in a file when it is followed by whitespace or a newline.
I want to find the broken tag
</answ
so that I can replace it later. The XML file looks like the following:
"
Normally I thought I could find this with
search = i.find('</answ ')
#or newline by:
vorkommen = i.find('</answ \n ')
But both return -1, and that can't be right.
Thanks a lot for any help!
You could broaden your set of whitespace characters to include tabs as follows.
import re
search = re.search(r'</answ\s', i).start()
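Building on that pattern, here is a short sketch that lists every broken occurrence and removes the stray whitespace. The sample text is invented, since the original file contents are not shown, and it assumes the only damage is whitespace inserted right after "</answ".

```python
import re

# Hypothetical sample input with two broken closing tags and one intact one.
text = "<answ>one</answ\n>\n<answ>two</answ >\n<answ>three</answ>"

broken = re.compile(r'</answ\s+')

# Offsets of every broken tag; finditer avoids the AttributeError that
# re.search(...).start() raises when nothing matches.
positions = [m.start() for m in broken.finditer(text)]

# Remove the whitespace that was inserted into the tag.
repaired = broken.sub('</answ', text)
```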
Why don't you use an xml parser to locate the errors?
etree.fromstring(u'<foo>text</fo\no>') raises XMLSyntaxError: expected '>', line 2, column 1, so as long as you're keeping your text in some kind of stream, you can manipulate it to remove the newline, and re-parse.
The exception raised sets the position property, as well as the code property.
Alternatively, you can configure lxml to try to be more robust:
In [39]: parser = etree.XMLParser(recover=True)
In [40]: etree.fromstring(u'<foo>text</fo\no>', parser)
Out[40]: <Element foo at 0x55fd798>
See: http://lxml.de/parsing.html and also, the API reference at http://lxml.de/api/index.html and http://lxml.de/api.html#error-handling-on-exceptions

Converting html entities into their values in python

I use this regex on some input,
[^a-zA-Z0-9##]
However this ends up removing lots of html special characters within the input, such as
#227;, #1606;, #1588; (i had to remove the & prefix so that it wouldn't
show up as the actual value..)
is there a way that I can convert them to their values so that it will satisfy the regexp expression? I also have no idea why the text decided to be so big.
Given that your text appears to have numeric-coded, not named, entities, you can first convert your byte string that includes xml entity defs (ampersand, hash, digits, semicolon) to unicode:
import re
xed_re = re.compile(r'&#(\d+);')
def usub(m): return unichr(int(m.group(1)))
s = '&#227;, &#1606;, &#1588;'
u = xed_re.sub(usub, s)
If your terminal emulator can display arbitrary Unicode glyphs, a print u will then show
ã, ن, ش
In any case, you can now, if you wish, use your original RE and you won't accidentally "catch" the entities, only ascii letters, digits, and the couple of punctuation characters you listed. (I'm not sure that's what you really want -- why not accented letters but just ascii ones, for example? -- but, if it is what you want, it will work).
If you do have named entities in addition to the numeric-coded ones, you can also apply the htmlentitydefs standard library module recommended in another answer (it only deals with named entities which map to Latin-1 code points, however).
You can adapt the following script:
import htmlentitydefs
import re

def substitute_entity (match):
    name = match.group (1)
    if name in htmlentitydefs.name2codepoint:
        return unichr (htmlentitydefs.name2codepoint[name])
    elif name.startswith ('#'):
        try:
            return unichr (int (name[1:]))
        except:
            pass
    return '?'

print re.sub ('&(#?\\w+);', substitute_entity, 'x &laquo; y &wat; z &#123;')
Produces the following answer here:
x « y ? z {
EDIT: I understood the question as "how to get rid of HTML entities before further processing", hope I haven't wasted time on answering a wrong question ;)
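The snippets above are written for Python 2 (unichr, htmlentitydefs). As a hedged Python 3 counterpart: chr replaces unichr, and the standard library's html.unescape decodes numeric and named references in one call. The function name decode_numeric_entities is made up for this sketch.

```python
import html
import re


def decode_numeric_entities(text):
    """Replace decimal references such as &#227; with the character they name."""
    return re.sub(r'&#(\d+);', lambda m: chr(int(m.group(1))), text)


s = '&#227;, &#1606;, &#1588;'
by_regex = decode_numeric_entities(s)

# html.unescape gives the same result for these references and also handles
# named ones such as &laquo;. Unknown names such as &wat; are left unchanged.
by_stdlib = html.unescape(s)
mixed = html.unescape('x &laquo; y &wat; z &#123;')
```

Unlike the substitute_entity script, html.unescape does not turn unknown entities into '?', so a follow-up pass is needed if those should be removed as well.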
Without knowing what the expression is being used for I can't tell exactly what you need.
This will match special characters or strings of characters excluding letters, digits, #, and #:
[^a-zA-Z0-9##]*|#[0-9A-Za-z]+;
