I am getting xml data from an application, which I want to parse in python:
#!/usr/bin/python
import xml.etree.ElementTree as ET
import re
xml_file = 'tickets_prod.xml'
xml_file_handle = open(xml_file,'r')
xml_as_string = xml_file_handle.read()
xml_file_handle.close()
xml_cleaned = re.sub(u'[^\x01-\x7f]+',u'',xml_as_string)
root = ET.fromstring(xml_cleaned)
It works for smaller datasets with example data, but when I go to real live data, I get
xml.parsers.expat.ExpatError: not well-formed (invalid token): line 364658, column 72
Looking at the xml file, I see this line 364658:
WARNING - (1 warnings in check_logfiles.protocol-2013-05-28-12-53-46) - ^[[0:36mnotice: Scope(Class[Hwsw]): Not required on ^[[0m</description>
I guess it is the ^[ which makes python choke - it is also highlighted blue in vim. Now I was hoping that I could clean the data with my regex substitution, but that did not work.
The best thing would be fixing the application which generated the xml, but that is out of scope. So I need to deal with the data as it is. How can I work around this? I could live with just throwing away "illegal" characters.
You already do:
xml_cleaned = re.sub(u'[^\x01-\x7f]+',u'',xml_as_string)
but the character ^[ is probably Python's \x1b (the ESC byte that starts ANSI colour codes). If xml.parsers.expat chokes on it, you simply need to clean more aggressively, accepting only a few characters below 0x20 (space). For example:
xml_cleaned = re.sub(u'[^\n\r\t\x20-\x7f]+',u'',xml_as_string)
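As a quick sanity check (Python 3 shown, where str is already unicode), the stricter pattern does strip the ESC byte, though the rest of the ANSI colour sequence stays behind as ordinary text:

```python
import re

# A line containing an ANSI colour sequence, as in the log above.
raw = "notice: \x1b[0;36mScope(Class[Hwsw]): Not required \x1b[0m"

# Keep tab/newline/CR plus printable ASCII; drop everything else,
# including the ESC byte that made expat choke.
cleaned = re.sub(r'[^\n\r\t\x20-\x7f]+', '', raw)
print(cleaned)
```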
I know this is pretty old, but stumbled upon the following url that has a list of all of the primary characters and their encodings.
https://medium.com/interview-buddy/handling-ascii-character-in-python-58993859c38e
from lxml import etree
import xml.etree.ElementTree as ET
tree2 = ET.parse(r'C:\Users\W\Desktop\220-01.xml')
root = tree2.getroot()
txt = ""
for c in root:
    txt += c.text
    break
I wrote this code a month ago, so apologies for importing both libraries; I think I only use one. I only need to read from the first root element, but the issue is that the text is stored with special characters, for example:
"\n\n\nPatient went"
Is there a way to get rid of the \n's? I have a similar issue with other special characters too. I want the text to look exactly like it does within the xml document, because the indices are very important for my work.
Thanks
EDIT: I found a working solution for the time being, after some more searching I ran into this post: replace special characters in a string python
And with the suggestion from user 'Kobi K' what I did was replace all the \n's with " " 's, this somehow maintained the integrity of the document and the indices still match how I want.
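That trick works because replace swaps one character for one character, so every offset in the string stays put; a minimal sketch:

```python
# Hypothetical snippet of extracted element text.
text = "\n\n\nPatient went"
fixed = text.replace("\n", " ")  # one-for-one swap: length is unchanged

# Every index still points at the same position as before.
assert len(fixed) == len(text)
assert fixed.index("Patient") == text.index("Patient")
print(repr(fixed))
```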
I get
ElementTree.ParseError: reference to invalid character number
when parsing XML that contains the following as a tag value: locat
My code looks like:
respXML = httpResponse.content
#also possible respXML = httpResponse.content.decode("utf-8")
#but both get the same error
#this line throws the error
respRoot = ET.fromstring(respXML)
How can I bulletproof my parser against seemingly invalid character numbers?
That looks like HTML-escaped data. See if running the input string through the html package before anything else helps.
https://pypi.python.org/pypi/html
>>> import html
>>> test = "&#108;&#111;&#99;&#97;&#108;"
>>> html.unescape(test)
'local'
Then convert some known unicode characters to their ASCII equivalents, e.g.
“ => "
’ => '
...
Finally replace double spaces to single space.
Since it'll be pretty cumbersome to address everything successfully upfront, I recommend catching specific exceptions and writing each bad line to a file.
One by one address each error in the output file by adding more rules.
Good luck.
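Put together, the suggested pipeline might look like this (the replacement table is an assumption of mine; extend it as new bad lines turn up in the output file):

```python
import html
import re

# Typographic characters to normalize; add entries as new ones appear.
REPLACEMENTS = {
    "\u201c": '"',  # left double quote
    "\u201d": '"',  # right double quote
    "\u2018": "'",  # left single quote
    "\u2019": "'",  # right single quote
}

def clean(raw):
    text = html.unescape(raw)            # resolve &#...; references first
    for src, dst in REPLACEMENTS.items():
        text = text.replace(src, dst)    # map known unicode lookalikes
    return re.sub(r" {2,}", " ", text)   # collapse runs of spaces

print(clean("&#8220;hello&#8221;  world"))
```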
I sometimes find it useful to preserve the original input characters with a regex pattern such as re.sub(r'&#([a-zA-Z0-9]+);?', r'[#\1;]', s). For example, with
from xml.etree import ElementTree as ET
import re
s = "<Tag>loca&#1;t</Tag>"
using html.unescape produces
ET.fromstring(html.unescape(s)).text
#Out: 'loca\x01t'
but the regex pattern mentioned produces
ET.fromstring(re.sub(r'&#([a-zA-Z0-9]+);?', r'[#\1;]', s)).text
#Out: 'loca[#1;]t'
which preserves the "bad characters".
I have been wrestling with decoding and encoding in Python, and I can't quite figure out how to resolve my problem. I am looping over xml text files (sample) that are apparently coded in utf-8, using Beautiful Soup to parse each file, then looking to see if any sentence in the file contains one or more words from two different lists of words. Because the xml files are from the eighteenth century, I need to retain the em dashes that are in the xml. The code below does this just fine, but it also retains a pesky box character that I wish to remove. I believe the box character is this character.
(You can find an example of the character I wish to remove in line 3682 of the sample file above. On this webpage, the character looks like an 'or' pipe, but when I read the xml file in Komodo, it looks like a box. When I try to copy and paste the box into a search engine, it looks like an 'or' pipe. When I print to console, though, the character looks like an empty box.)
To sum up, the code below runs without errors, but it prints the empty box character that I would like to remove.
for work in glob.glob(pathtofiles):
    openfile = open(work)
    readfile = openfile.read()
    stringfile = str(readfile)
    decodefile = stringfile.decode('utf-8', 'strict') #is this the dodgy line?
    soup = BeautifulSoup(decodefile)
    textwithtags = soup.findAll('text')
    textwithtagsasstring = str(textwithtags)
    #this method strips everything between anglebrackets as it should
    textwithouttags = stripTags(textwithtagsasstring)
    #clean text
    nonewlines = textwithouttags.replace("\n", " ")
    noextrawhitespace = re.sub(' +',' ', nonewlines)
    print noextrawhitespace #the boxes appear
I tried to remove the boxes by using
noboxes = noextrawhitespace.replace(u"\u2610", "")
But Python threw an error flag:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 280: ordinal not in range(128)
Does anyone know how I can remove the boxes from the xml files? I would be grateful for any help others can offer.
The problem is that you're mixing unicode and str. Whenever you do that, Python has to convert one to the other, which it does by using sys.getdefaultencoding(), which is usually ASCII, which is almost never what you want.*
If the exception comes from this line:
noboxes = noextrawhitespace.replace(u"\u2610", "")
… the fix is simple… except that you have to know whether noextrawhitespace is supposed to be a unicode object or a UTF-8-encoded str object. If the former, it's this:
noboxes = noextrawhitespace.replace(u"\u2610", u"")
If the latter, it's this:
noboxes = noextrawhitespace.replace(u"\u2610".encode('utf-8'), "")
But really, you have to get all of the strings consistent in your code; mixing the two up is going to cause problems in more places than this one.
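For what it's worth, Python 3 makes the mismatch explicit: mixing str and bytes raises a TypeError instead of silently decoding with ASCII, which is why the consistency rule is easier to follow there. A small illustration:

```python
text = "abc\u2610def"           # str (unicode text)
data = text.encode("utf-8")     # bytes (encoded form)

# Consistent types work on either side:
assert text.replace("\u2610", "") == "abcdef"
assert data.replace("\u2610".encode("utf-8"), b"") == b"abcdef"

# Mixing them fails loudly rather than guessing an encoding:
mixed_ok = True
try:
    data.replace("\u2610", b"")
except TypeError:
    mixed_ok = False
print(mixed_ok)
```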
Since I don't have your XML files to test, I wrote my own:
<xml>
<text>abc☐def</text>
</xml>
Then, I added these two lines to the bottom of your code (and a bit to the top to just open my file instead of globbing for whatever):
noboxes = noextrawhitespace.replace(u"\u2610".encode('utf-8'), "")
print noboxes
The output is now:
[<text>abc☐def</text>]
[<text>abc☐def</text>]
[<text>abcdef</text>]
So, I think that's what you want here.
* Sure sometimes you want ASCII… but those aren't usually the times when you have unicode objects…
Give this a try:
noextrawhitespace.replace("\\u2610", "")
I think you are just missing that extra '\'
This might also work.
print(noextrawhitespace.decode('unicode_escape').encode('ascii','ignore'))
Reading your sample, the following are the non-ASCII characters in the document:
0x2223 DIVIDES
0x2022 BULLET
0x3009 RIGHT ANGLE BRACKET
0x25aa BLACK SMALL SQUARE
0x25ca LOZENGE
0x3008 LEFT ANGLE BRACKET
0x2014 EM DASH
0x2026 HORIZONTAL ELLIPSIS
\u2223 is the actual character in question in line 3682, and it is being used as a soft hyphen. The others are used in markup for tagging illegible characters, such as:
<GAP DESC="illegible" RESP="oxf" EXTENT="4+ letters" DISP="\u2022\u2022\u2022\u2022\u2026"/>
Here's some code to do what your code is attempting. Make sure to process in Unicode:
from bs4 import BeautifulSoup
import re
with open('k000039.000.xml') as f:
soup = BeautifulSoup(f) # BS figures out the encoding
text = u''.join(soup.strings) # strings is a generator for just the text bits.
text = re.sub(ur'\s+',ur' ',text) # Simplify all white space.
text = text.replace(u'\u2223',u'') # Get rid of the DIVIDES character.
print text
Output:
[[truncated]] reckon my self a Bridegroom too. Buckle. I doubt Kickey won't find him such. [Aside.] Mrs. Sago. Well,—poor Keckky's bound to good Behaviour, or she had lost quite her Puddy's Favour. Shall I for this repine at Fortune?—No. I'm glad at Heart that I'm forgiven so. Some Neighbours Wives have but too lately shown, When Spouse had left 'em all their Friends were flown. Then all you Wives that wou'd avoid my Fate. Remain contented with your present State FINIS.
I use lxml to read an XML file which has a structure like the one below:
<domain>http://www.trademe.co.nz</domain>
<start>http://www.trademe.co.nz/Browse/CategoryAttributeSearchResults.aspx?search=1&cid=5748&sidebar=1&rptpath=350-5748-4233-&132=FLAT&134=&153=&29=&122=0&122=0&59=0&59=0&178=0&178=0&sidebarSearch_keypresses=0&sidebarSearch_suggested=0</start>
and my python code is:
from lxml import etree
tree = etree.parse('metaWeb.xml')
When I run it I get
entityref: expecting ';' error
However, when I remove the & symbols from the xml file, everything is fine.
How can I solve this error?
The problem is that this isn't valid XML. In XML, a & symbol always starts an entity reference, like &#1234; for the character U+04D2 (aka Ӓ), &quot; for the character ", or some custom entity defined in your document/DTD/schema.*
If you want to put a literal & into a string, you have to replace it with something else, typically &amp;, which is a character entity reference for the ampersand character.
So, if you're sure there are no actual entity references in your document, just un-escaped ampersands, you can fix it pretty simply:
with open('metaWeb.xml') as f:
    xml = f.read().replace('&', '&amp;')
tree = etree.fromstring(xml)
However, a better solution, if possible, is to fix whatever program is generating this incorrect XML.
* This is slightly misleading: strictly speaking, a numeric character reference is not actually an entity reference. Also, a character entity reference like &quot; or &amp; is the same as any other reference with replacement text; the entities just happen to be implicitly defined by the XML/HTML base DTDs. But lxml, like most XML software, uses the term "entity reference" slightly more broadly than the standard.
Replace & with &amp; in your xml file; otherwise your xml is not compliant with the XML standard.
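One caveat: if the file might already contain some valid references alongside the bare ampersands, a blanket replace would corrupt them (an existing &amp; would become &amp;amp;). A common workaround, not from the original answers, is a negative-lookahead regex that escapes only the bare ones:

```python
import re
import xml.etree.ElementTree as ET

broken = "<start>a=1&cid=5748&amp;x=2</start>"

# Escape & only when it does not already begin a character or entity reference.
fixed = re.sub(r"&(?!(?:#\d+|#x[0-9a-fA-F]+|\w+);)", "&amp;", broken)

print(fixed)
print(ET.fromstring(fixed).text)
```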
I'm trying to parse an XML file that's over 2GB with Python's lxml library. Unfortunately, the XML file does not have a line telling the character encoding, so I have to manually set it. While iterating through the file though, there are still some strange characters that come up once in a while.
I'm not sure how to determine the character encoding of the line, but furthermore, lxml will raise an XMLSyntaxError from the scope of the for loop. How can I properly catch this error, and deal with it correctly? Here's a simplistic code snippet:
from lxml import etree
etparse = etree.iterparse(file("my_file.xml", 'r'), events=("start",), encoding="CP1252")
for event, elem in etparse:
if elem.tag == "product":
print "Found the product!"
elem.clear()
This eventually produces the error:
XMLSyntaxError: PCDATA invalid Char value 31, line 1565367, column 50
That line of the file looks like this:
% sed -n "1565367 p" my_file.xml
<romance_copy>Ravioli Florentine. Tender Ravioli Filled With Creamy Ricotta Cheese And
The 'F' of filled actually looks like this in my terminal:
The right thing to do here is make sure that the creator of the XML file makes sure that:
A.) that the encoding of the file is declared
B.) that the XML file is well formed (no invalid control characters, no characters that fall outside the encoding scheme, all elements properly closed, etc.)
C.) use a DTD or an XML schema if you want to ensure that certain attributes/elements exist, have certain values or correspond to a certain format (note: this will take a performance hit)
So, now to your question. LXml supports a whole bunch of arguments when you use it to parse XML. Check out the documentation. You will want to look at these two arguments:
recover: try hard to parse through broken XML
huge_tree: disable security restrictions and support very deep trees and very long text content (only affects libxml2 2.7+)
They will help you to some degree, but certain invalid characters can just not be recovered from, so again, ensuring that the file is written correctly is your best bet to clean/well working code.
Ah yeah, and one more thing: 2GB is huge. I assume you have a list of similar elements in this file (for example, a list of books). Try to split the file up with a regex at the OS level, then start multiple processes to parse the pieces. That way you will be able to use more of the cores on your box, and the processing time will go down. Of course, you then have to deal with the complexity of merging the results back together. I cannot make this trade-off for you, but wanted to give it to you as food for thought.
Addition to post:
If you have no control over the input file and it contains bad characters, I would try to replace/remove these bad characters by iterating over the string before parsing it as a file. Here is a code sample that removes the control characters you won't need:
#control characters below 0x20, other than tab, newline and carriage
#return, are invalid in XML 1.0 and are removed here
import fileinput
import sys
for line in fileinput.input(xmlInputFileLocation, inplace=1):
    sys.stdout.write(''.join(c for c in line if ord(c) >= 0x20 or c in '\t\n\r'))
I ran into this too, getting \x16 in data (the unicode 'synchronous idle' or 'SYN' character, displayed in the xml as ^V), which leads to an error when parsing the xml: XMLSyntaxError: PCDATA invalid Char value 22. The 22 is because ord('\x16') is 22.
The answer from #michael put me on the right track. But some control characters below 32 are fine, like the return or the tab, and a few higher characters are still bad. So:
# Get list of bad characters that would lead to XMLSyntaxError.
# Calculated manually like this:
from lxml import etree
from StringIO import StringIO
BAD = []
for i in range(0, 10000):
    try:
        x = etree.parse(StringIO('<p>%s</p>' % unichr(i)))
    except etree.XMLSyntaxError:
        BAD.append(i)
This leads to a list of 31 characters that can be hardcoded instead of doing the above calculation in code:
BAD = [
0, 1, 2, 3, 4, 5, 6, 7, 8,
11, 12,
14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31,
# Two are perfectly valid characters but go wrong for different reasons.
# 38 is '&' which gives: xmlParseEntityRef: no name.
# 60 is '<' which gives a different error: StartTag: invalid element name.
]
BAD_BASESTRING_CHARS = [chr(b) for b in BAD]
BAD_UNICODE_CHARS = [unichr(b) for b in BAD]
Then use it like this:
def remove_bad_chars(value):
    # Remove bad control characters.
    if isinstance(value, unicode):
        for char in BAD_UNICODE_CHARS:
            value = value.replace(char, u'')
    elif isinstance(value, basestring):
        for char in BAD_BASESTRING_CHARS:
            value = value.replace(char, '')
    return value
If value is 2 Gigabyte you might need to do this in a more efficient way, but I am ignoring that here, although the question mentions it. In my case, I am the one creating the xml file, but I need to deal with these characters in the original data, so I will use this function before putting data in the xml.
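On Python 3, where str is always unicode, the same table can be applied in a single pass with str.translate, which avoids one full scan of the string per bad character; a sketch addressing the efficiency concern above:

```python
# Same code points as the hardcoded BAD list above.
BAD = [0, 1, 2, 3, 4, 5, 6, 7, 8, 11, 12] + list(range(14, 32))

# Mapping a code point to None deletes that character.
DELETE_TABLE = {code: None for code in BAD}

def remove_bad_chars(value):
    return value.translate(DELETE_TABLE)

print(remove_bad_chars("abc\x00\x16def"))
```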
Found this thread from Google, and while @Michael's answer ultimately led me to a solution (to my problem at least), I wanted to provide a bit more of a copy/paste answer here for issues that can be solved this simply:
from lxml import etree
# Create a parser
parser = etree.XMLParser(recover=True)
parsed_file = etree.parse('/path/to/your/janky/xml/file.xml', parser=parser)
I was facing an issue where I had no control over the XML pre-processing and was being given a file with invalid characters. #Michael's answer goes on to elaborate on a way to approach invalid characters from which recover=True can't address. Fortunately for me, this was enough to keep things moving along.
The codecs module in Python's standard library supplies an EncodedFile class that works as a wrapper to a file.
You should pass an object of this class to lxml, set to replace unknown characters with XML character entities --
Try doing this:
from lxml import etree
import codecs
enc_file = codecs.EncodedFile(file("my_file.xml"), "ASCII", "ASCII", "xmlcharrefreplace")
etparse = etree.iterparse(enc_file, events=("start",), encoding="CP1252")
...
The "xmlcharrefreplace" constant passed is the "errors" parameter, and specifies what to do with unknown characters. It could be "strict" (raises an error), "ignore" (leave as is), "replace" (replaces char with "?"), "xmlrefreplace" (creates an "&#xxxx;" xml reference) or "backslahreplace" (creates a Python valid backslash reference). For more information, check:
http://docs.python.org/library/codecs.html
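The effect of each errors value can be checked directly with str.encode (Python 3 shown; "xmlcharrefreplace" is the one that keeps the data round-trippable through an XML parser):

```python
s = "caf\u00e9"  # 'café', with one non-ASCII character

print(s.encode("ascii", "xmlcharrefreplace"))  # numeric character reference
print(s.encode("ascii", "replace"))            # question mark
print(s.encode("ascii", "backslashreplace"))   # Python backslash escape
```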