Python reading from an xml file without the special characters - python

from lxml import etree
import xml.etree.ElementTree as ET
tree2 = ET.parse(r'C:\Users\W\Desktop\220-01.xml')
root = tree2.getroot()
txt = ""
for c in root:
    txt += c.text
    break
I wrote the code above a month ago, so apologies for importing both libraries; I think I only use one. I only need to read the text of the first element under the root, but the text comes back with special characters in it, for example:
"\n\n\nPatient went"
Is there a way to get rid of the \n's? I have a similar issue with other special characters too. I want the text to look exactly as it does in the XML document, because the indices are very important for my work.
Thanks
EDIT: I found a working solution for the time being. After some more searching I ran into this post: replace special characters in a string python
Following the suggestion from user 'Kobi K', I replaced every \n with a single space " ". Each replacement swaps one character for one character, so the text keeps its length and the indices still match what I want.
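A minimal sketch of that workaround, assuming the text sits in the first child of the root. The sample document and the TEXT element name are made up for illustration; they are not taken from the real file.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the real file; the element name is an assumption.
sample = "<root><TEXT>\n\n\nPatient went home.\n</TEXT></root>"
root = ET.fromstring(sample)

raw = root[0].text or ""        # text of the first child, newlines included
flat = raw.replace("\n", " ")   # one character in, one character out

print(repr(raw))
print(repr(flat))
# Offsets are unchanged because the length is unchanged.
print(len(raw) == len(flat), raw.index("Patient") == flat.index("Patient"))
```

One caveat: an XML parser reports a Windows line ending (\r\n) as a single \n, so offsets measured against the raw bytes of such a file may not line up with offsets in the parsed text.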

Related

re.sub() does not keep blanks and new lines

I have an xml file with the following line :
<CREATION_DATE>2009-12-20T10:47:07.000Z</CREATION_DATE>
That I would like to replace with the following :
<CREATION_DATE>XXX</CREATION_DATE>
Thought it would be pretty straightforward using the re module in the python script I'm supposed to modify. I did something of the sort:
if '</CREATION_DATE>' in ligne:
    out_lines[i] = re.sub(r'(^.*<CREATION_DATE>).*(</CREATION_DATE>.*$)', r'\1XXX\2', ligne)
The field with the date is correctly replaced, but the trailing new line and indentation are lost in the process. I tried converting ligne and the result of the sub function to a raw string with .encode('string-escape'), with no success. I am a noob in python, but I am a bit accustomed to regex's, and I really cannot see what it is I am doing wrong.
A simpler and more reliable alternative for replacing the text of an XML element is to use an XML parser. There is even one in the Python Standard Library:
>>> import xml.etree.ElementTree as ET
>>>
>>> s = '<ROOT><CREATION_DATE>2009-12-20T10:47:07.000Z</CREATION_DATE></ROOT>'
>>> root = ET.fromstring(s)
>>> root.find("CREATION_DATE").text = 'XXX'
>>> ET.tostring(root)
'<ROOT><CREATION_DATE>XXX</CREATION_DATE></ROOT>'
As stated in comments, the variable ligne was stripped of blanks and new lines with ligne = ligne.strip() elsewhere in the code... I am not deleting my question though because alecxe's answer on the xml module is very informative.
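To illustrate that diagnosis, here is a small sketch with a made-up sample line. The substitution keeps the indentation (it is captured in the first group) and never touches the trailing newline; both disappear only when the line is stripped beforehand.

```python
import re

pattern = r'(^.*<CREATION_DATE>).*(</CREATION_DATE>.*$)'

# Hypothetical input line with leading indentation and a trailing newline.
ligne = "    <CREATION_DATE>2009-12-20T10:47:07.000Z</CREATION_DATE>\n"

kept = re.sub(pattern, r'\1XXX\2', ligne)              # whitespace survives
stripped = re.sub(pattern, r'\1XXX\2', ligne.strip())  # whitespace already gone

print(repr(kept))
print(repr(stripped))
```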

ElementTree.ParseError: reference to invalid character number

I get
ElementTree.ParseError: reference to invalid character number
when parsing XML that contains the following as a tag value: loca&#1;t
My code looks like:
respXML = httpResponse.content
#also possible respXML = httpResponse.content.decode("utf-8")
#but both get the same error
#this line throws the error
respRoot = ET.fromstring(respXML)
How can I bulletproof my parser against seemingly invalid character numbers?
That looks like an HTML character reference. Try running html.unescape on the input string before anything else.
https://pypi.python.org/pypi/html
>>> import html
>>> test = "loca&#1;t"
>>> html.unescape(test)
'locat'
Then convert some known Unicode characters to their ASCII equivalents, e.g.
“ => "
’ => '
...
Finally, replace double spaces with a single space.
Since it will be pretty cumbersome to address everything successfully up front, I recommend catching the specific exceptions and writing the offending line to a file.
Then address each error in the output file one by one by adding more rules.
Good luck.
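The catch-and-log advice could be sketched as follows. The helper name parse_or_log and the log format are assumptions made for illustration, and the input is assumed to be a str (decode httpResponse.content first if it is bytes). The position attribute of ET.ParseError gives the line and column of the failure.

```python
import io
import xml.etree.ElementTree as ET

def parse_or_log(xml_text, log):
    """Return the parsed root, or None after writing the offending line to log.

    `log` is any writable text file object (hypothetical helper, not a library API).
    """
    try:
        return ET.fromstring(xml_text)
    except ET.ParseError as err:
        line_no, _column = err.position      # line numbers start at 1
        lines = xml_text.splitlines()
        bad_line = lines[line_no - 1] if 0 < line_no <= len(lines) else ""
        log.write(f"{err}\n{bad_line}\n")
        return None

log = io.StringIO()                          # stand-in for an open log file
good = parse_or_log("<Tag>fine</Tag>", log)
bad = parse_or_log("<Tag>loca&#1;t</Tag>", log)
print(good.text, bad)
print(log.getvalue())
```

Each logged line can then be turned into a new cleaning rule before the input is parsed again.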
I sometimes find it useful to preserve the original input characters with a regex substitution such as re.sub(r'&#([a-zA-Z0-9]+);?', r'[#\1;]', s). For example, with
from xml.etree import ElementTree as ET
import html
import re
s = "<Tag>loca&#1;t</Tag>"
using html.unescape produces
ET.fromstring(html.unescape(s)).text
#Out: 'locat'
but the regex pattern mentioned produces
ET.fromstring(re.sub(r'&#([a-zA-Z0-9]+);?', r'[#\1;]', s)).text
#Out: 'loca[#1;]t'
which preserves the "bad characters".

Strip all html lines/code from string in python

Given the following string parsed from an email body...
s = """Keep all of this <h1>But remove this including the tags</h1> this is still good, but
<p>there also might be new lines like this that need to be removed</p>
<p> and even other lines like this all the way down here with whitespace after being parsed from email that need to be removed.</p>
But this is still okay."""
How do I remove all the html code and lines from the string to simply return "Keep all of this this is still good But this is still okay." on one line? I've looked at bleach and lxml but they are simply just removing the html <> and returning what's inside, whereas I don't want any of it.
You can still use lxml to get all of the root element's text nodes:
import lxml.html
html = '''
Keep all of this <h1>But remove this including the tags</h1> this is still good, but
<p>there also might be new lines like this that need to be removed</p>
<p> and even other lines like this all the way down here with whitespace after being parsed from email that need to be removed.</p>
But this is still okay.
'''
root = lxml.html.fromstring('<div>' + html + '</div>')
text = ' '.join(t.strip() for t in root.xpath('text()') if t.strip())
Seems to work fine:
>>> text
'Keep all of this this is still good, but But this is still okay.'
Simple solution that requires no external packages:
import re
while '<' in s:
    s = re.sub('<.+?>.+?<.+?>', '', s)
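Not very efficient, since it passes over the target string many times, but it should work. Note that there must be absolutely no < or > characters in the string other than those belonging to tags; a stray < that the pattern cannot remove would keep the loop running forever.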
This one?
import re
s = "..."  # Your string here
print(re.sub(r'[\s\n]*<.+?>.+?<.+?>[\s\n]*', ' ', s))
Edit: Just made a few mods to @BoppreH's answer, albeit with an extra space.
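As a further option that avoids regular expressions, the standard library's html.parser can track nesting depth and keep only the text that sits outside every element. This is a different technique from the answers above, offered as a sketch: the class name TopLevelText and the function strip_elements are invented for illustration, and the code assumes every opened tag is eventually closed (an unclosed tag such as a bare <br> would throw the depth count off).

```python
from html.parser import HTMLParser

class TopLevelText(HTMLParser):
    """Collect only the text that is not inside any element."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        self.depth += 1

    def handle_endtag(self, tag):
        self.depth = max(0, self.depth - 1)

    def handle_data(self, data):
        # Keep text only at depth 0, with runs of whitespace collapsed.
        if self.depth == 0 and data.strip():
            self.parts.append(" ".join(data.split()))

def strip_elements(s):
    parser = TopLevelText()
    parser.feed(s)
    parser.close()
    return " ".join(parser.parts)

s = """Keep all of this <h1>But remove this including the tags</h1> this is still good, but
<p>there also might be new lines like this that need to be removed</p>
<p> and even other lines like this all the way down here with whitespace after being parsed from email that need to be removed.</p>
But this is still okay."""

result = strip_elements(s)
print(result)
```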

read xml file using lxml get error EntityRef

I use lxml to read an XML file which has a structure like the one below:
<domain>http://www.trademe.co.nz</domain>
<start>http://www.trademe.co.nz/Browse/CategoryAttributeSearchResults.aspx?search=1&cid=5748&sidebar=1&rptpath=350-5748-4233-&132=FLAT&134=&153=&29=&122=0&122=0&59=0&59=0&178=0&178=0&sidebarSearch_keypresses=0&sidebarSearch_suggested=0</start>
and my python code is:
from lxml import etree
tree = etree.parse('metaWeb.xml')
When I run it I get
entityref: expecting ';' error
However, when I remove the & symbols from the XML file, everything is fine.
How can I solve that error?
The problem is that this isn't valid XML. In XML, a & symbol always starts an entity reference, like &#1234; for the character U+04D2 (aka Ӓ), &quot; for the character ", or some custom entity defined in your document/DTD/schema.*
If you want to put a literal & into a string, you have to replace it with something else, typically &amp;, which is a character entity reference for the ampersand character.
So, if you're sure there are no actual entity references in your document, just un-escaped ampersands, you can fix it pretty simply:
with open('metaWeb.xml') as f:
    xml = f.read().replace('&', '&amp;')
tree = etree.fromstring(xml)
However, a better solution, if possible, is to fix whatever program is generating this incorrect XML.
* This is slightly misleading: a numeric character reference is not actually an entity reference. Also, a character entity reference like &quot; or &amp; is the same as any other reference with replacement text; the entities just happen to be implicitly defined by the XML/HTML base DTDs. But lxml, like most XML software, uses the term "entity reference" slightly more broadly than the standard.
Replace & with &amp; in your XML file; otherwise your XML does not comply with the XML standard.
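If the file might already contain some correctly escaped references, a blanket replace would double-escape them. One possible refinement is to escape only ampersands that do not start something shaped like a reference. The sketch below uses the standard library parser to stay self-contained; the pattern, the function name escape_bare_ampersands and the sample string are assumptions for illustration.

```python
import re
import xml.etree.ElementTree as ET

# An ampersand NOT followed by "name;", "#digits;" or "#xhex;".
BARE_AMP = re.compile(r'&(?!(?:[A-Za-z_][\w.-]*|#[0-9]+|#x[0-9A-Fa-f]+);)')

def escape_bare_ampersands(text):
    """Escape stray ampersands, leaving existing references alone."""
    return BARE_AMP.sub('&amp;', text)

# Made-up sample: one bare ampersand, one escaped, one numeric reference.
broken = '<start>http://example.com/?search=1&cid=5748&amp;x=&#65;</start>'
fixed = escape_bare_ampersands(broken)
root = ET.fromstring(fixed)

print(fixed)
print(root.text)
```

This is a heuristic: a query parameter that happens to look like a reference (for example &copy; inside a URL) would be left untouched and could still fail to parse.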

Parsing xml with "not well-formed" characters in python

I am getting xml data from an application, which I want to parse in python:
#!/usr/bin/python
import xml.etree.ElementTree as ET
import re
xml_file = 'tickets_prod.xml'
xml_file_handle = open(xml_file,'r')
xml_as_string = xml_file_handle.read()
xml_file_handle.close()
xml_cleaned = re.sub(u'[^\x01-\x7f]+',u'',xml_as_string)
root = ET.fromstring(xml_cleaned)
It works for smaller datasets with example data, but when I go to real live data, I get
xml.parsers.expat.ExpatError: not well-formed (invalid token): line 364658, column 72
Looking at the xml file, I see this line 364658:
WARNING - (1 warnings in check_logfiles.protocol-2013-05-28-12-53-46) - ^[[0:36mnotice: Scope(Class[Hwsw]): Not required on ^[[0m</description>
I guess it is the ^[ which makes python choke - it is also highlighted blue in vim. Now I was hoping that I could clean the data with my regex substitution, but that did not work.
The best thing would be fixing the application which generated the xml, but that is out of scope. So I need to deal with the data as it is. How can I work around this? I could live with just throwing away "illegal" characters.
You already do:
xml_cleaned = re.sub(u'[^\x01-\x7f]+',u'',xml_as_string)
but the character ^[ is probably the escape character, \x1b in Python. If xml.parsers.expat chokes on it, you simply need to clean up more, by accepting only some of the characters below 0x20 (space). For example:
xml_cleaned = re.sub(u'[^\n\r\t\x20-\x7f]+',u'',xml_as_string)
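Note that the character class above also discards every non-ASCII character. If accented letters and similar text should survive, one option is to remove only the characters that XML 1.0 excludes from its Char production (tab, line feed, carriage return, and the ranges U+0020 to U+D7FF, U+E000 to U+FFFD and U+10000 to U+10FFFF are allowed). A sketch for Python 3 strings, with an invented function name and sample:

```python
import re
import xml.etree.ElementTree as ET

# Matches any character that is NOT allowed by the XML 1.0 Char production.
INVALID_XML_CHARS = re.compile(
    r'[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]'
)

def strip_invalid_xml_chars(text):
    """Drop control characters and other code points XML 1.0 forbids."""
    return INVALID_XML_CHARS.sub('', text)

# Made-up sample containing ESC (\x1b) and a legitimate non-ASCII letter.
dirty = '<description>\x1b[0:36mnotice: caf\u00e9\x1b[0m</description>'
root = ET.fromstring(strip_invalid_xml_chars(dirty))
print(root.text)
```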
I know this is pretty old, but stumbled upon the following url that has a list of all of the primary characters and their encodings.
https://medium.com/interview-buddy/handling-ascii-character-in-python-58993859c38e
