Parse an XML file in Python

I need to parse an XML file. Which method would be best for my case: beautifulsoup4, ElementTree, etc.? It's a pretty big file.
I have Windows 10 64-bit running Python 2.7.11 32-bit.
XML file:
http://pastebin.com/jTDRwCZr
I'm trying to get the output below from the XML file. It contains different languages, using div xml:lang="English" for English. Any help on how I can use BeautifulSoup with lxml to achieve this? Thanks for your time.
<tt xmlns="http://www.w3.org/2006/04/ttaf1" xmlns:tts="http://www.w3.org/2006/04/ttaf1#styling">
  <head>
    <styling>
      <style id="1" tts:textOutline='#000000 2px 2px' tts:color="white"/>
    </styling>
  </head>
  <body>
    <div xml:lang="English">
      <p begin="00:00:28.966" end="00:00:31.385" style="1">
        text text text...
      </p>
    </div>
  </body>
</tt>

The file that you link to is not that large that you need to worry about alternative methods of parsing and processing it.
Assuming that you are trying to remove all non-English language divs you can do it with BeautifulSoup:
from bs4 import BeautifulSoup

with open('input.xml') as infile:
    soup = BeautifulSoup(infile, 'lxml')

for e in soup.find_all('div', attrs={'xml:lang': lambda value: value != 'English'}):
    _ = e.extract()

with open('output.xml', 'w') as outfile:
    outfile.write(soup.prettify(soup.original_encoding))
In the code above, soup.find_all() finds all divs whose xml:lang attribute has a value other than 'English'. It then removes the matching elements with extract(). Finally, the resulting document is written out to a new file using the same encoding as the input (otherwise it would default to UTF-8).

The DOM approach is usually quick and easy to use (up to about 10 MB). However, for really big XML files (> 50 MB), a DOM approach cannot be used, since it parses the entire XML document into memory. It can take up to 3-4 GB of RAM to parse only around 100 MB of data, and it gets significantly slower.
So the other option is iterative or event-based parsing of the XML file.
For iterative parsing, the ElementTree or lxml approaches can be used. Plain ElementTree is quite slow, so I would recommend cElementTree: the same API, but implemented in C, which makes it significantly faster than ElementTree.
I recently used ElementTree for parsing XML files larger than 100 MB and it has been working really well for me so far! I'm not sure about lxml.
I would check online for more information on how to use the XML parsing APIs.
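As an illustration, here is a minimal sketch of event-based parsing with iterparse, assuming a hypothetical large file 'big.xml' whose repeating elements are called 'record' (both names are placeholders):

try:
    import xml.etree.cElementTree as ET  # C implementation on Python 2
except ImportError:
    import xml.etree.ElementTree as ET  # Python 3 picks the C version automatically

# Handle each record as its closing tag is seen, then discard it,
# so memory use stays flat regardless of file size.
for event, elem in ET.iterparse('big.xml', events=('end',)):
    if elem.tag == 'record':
        do_something_with(elem)  # hypothetical handler
        elem.clear()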

Related

Why does BeautifulSoup change the HTML?

I have an HTML file. I am trying to open it and read the contents like this:
with open("M_ALARM_102.HTML", "r") as f:
    contents = f.read()

print(contents)
When I print the contents with the code above, it prints perfectly. But when I pass the contents to BeautifulSoup and print the soup, it changes the HTML code:
soup = BeautifulSoup(contents, "html.parser")
print(soup)
Here is the output from BeautifulSoup:
ÿþ<html>
<head>
<meta charset="UTF-8">
<title>ARRÊT SERVOS</title>
<style type="text/css">
I do not understand why it is doing this. I need to extract 3 tags from it, but it keeps giving None as output.
Can anyone help me, please?
&lt; is the < symbol and &gt; is the > symbol. It is an escaping mechanism used for security, to protect a web site from XSS (Cross-Site Scripting) attacks.
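A tiny illustration of that escaping, using Python 3's html module:

from html import escape, unescape

# escape() replaces the characters that are special in HTML with entities.
print(escape("<script>alert('x')</script>"))
# -> &lt;script&gt;alert(&#x27;x&#x27;)&lt;/script&gt;

# unescape() reverses the substitution.
print(unescape("&lt;b&gt;"))
# -> <b>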
It might be that the parser used by BeautifulSoup did not recognize that file as html.
I see two "strange" characters in that output: ÿþ. They look like a BOM (byte order mark) that something added to the file, while the parser expected valid UTF-8; ÿþ (0xFF 0xFE) is exactly what the UTF-16 little-endian BOM looks like when displayed as single-byte characters.
There is a good chance that this is the problem.
One way to fix the BOM problem is to open the file in notepad, and save it as UTF-8. Notepad is pretty good at doing this kind of stuff.
You might also be able to fix it by opening the file in Python as UTF-16, using with open("M_ALARM_102.HTML", "r", encoding="utf-16") as f:. Note that here you specify the encoding directly (see the Python documentation on Unicode for more).
Note that I did not personally try the latter approach, so I am not sure it will actually remove the BOM -- the best option is still to not introduce it at all in your workflow.
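A short sketch of that latter approach, not tested against the original file; "utf-16" consumes the BOM while decoding (for a UTF-8 file with a BOM, "utf-8-sig" would be the analogous choice):

from bs4 import BeautifulSoup

# Decode as UTF-16 so the ÿþ (0xFF 0xFE) BOM is consumed by the codec
# instead of leaking into the parsed text.
with open("M_ALARM_102.HTML", "r", encoding="utf-16") as f:
    contents = f.read()

soup = BeautifulSoup(contents, "html.parser")
print(soup.title)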

Failing to parse 1 MB XML File with BS4

I have two SVG maps of the world, downloaded here. My goal is to do some editing of these maps in Python, working with them via BeautifulSoup4. This works perfectly with the low-res file (132.5 KB). However, the BS4 parser (using lxml) fails entirely when I attempt to use it with the high-res file (1.2 MB).
The code is as such:
import lxml
from bs4 import BeautifulSoup as Soup
with open('worldHigh.svg','r') as f:
handler = f.read()
soup = Soup(handler,'xml')
print(soup.prettify())
When I run that with the worldHigh.svg file, the only thing that is printed is
<?xml version="1.0" encoding="utf-8"?>
When I run the equivalent, but changing worldHigh.svg for worldLow.svg, it prints the XML correctly (as desired).
Both SVG files work fine when opened by themselves (i.e., they show the map). However, one fails when I try to parse it while the other succeeds, and I am at a loss as to what is going wrong. I would understand if the parser failed at large sizes, but 1.2 MB does not seem large.
The XML parser needs the raw sequence of undecoded bytes. Use open(..., 'rb') when parsing XML.
The reason why one file worked and the other didn't is that worldHigh.svg has a BOM at the beginning of the file.
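A minimal sketch of the suggested fix: hand the parser the raw bytes and let lxml deal with the BOM and the encoding declaration itself:

from bs4 import BeautifulSoup as Soup

# 'rb', not 'r': the parser gets undecoded bytes, BOM and all.
with open('worldHigh.svg', 'rb') as f:
    soup = Soup(f.read(), 'xml')

print(soup.prettify())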

How to auto-close xml tags in truncated file?

I receive an email when a system in my company generates an error. This email contains XML all crammed onto a single line.
I wrote a Notepad++ Python script that parses out everything except the XML and pretty-prints it. Unfortunately some of the emails contain too much XML data and it gets truncated. In general, the truncated data isn't that important to me. I would like to be able to just auto-close any open tags so that my Python script works. It doesn't need to be smart or correct; it just needs to make the XML well-formed enough that the script runs. Is there a way to do this?
I am open to Python scripts, online apps, downloadable apps, etc.
I realize that the right solution is to get the non-truncated XML, but pulling the right lever to get that done would be far more work than just dealing with it.
Use Beautiful Soup
>>> import bs4
>>> s= bs4.BeautifulSoup("<asd><xyz>asd</xyz>")
>>> s
<html><head></head><body><asd><xyz>asd</xyz></asd></body></html>
>>> s.body.contents[0]
<asd><xyz>asd</xyz></asd>
Notice that it closed the "asd" tag automagically.
To create a Notepad++ script to handle this:
Download the Beautiful Soup tarball and extract the files.
Copy the bs4 directory to your PythonScript/scripts folder.
In Notepad++, add the following code to your Python script:
#import Beautiful Soup
import bs4
#get text in document
text = editor.getText()
#soupify it to fix XML
soup = bs4.BeautifulSoup(text)
#convert soup object to string again
text = str(soup)
#clear editor and replace bad xml with fixed xml
editor.clearAll()
editor.addText(text)
#change language to xml
notepad.menuCommand( MENUCOMMAND.LANG_XML )
#soup has its own prettify, but I like the XML tools version better
notepad.runMenuCommand('XML Tools', 'Pretty print (XML only - with line breaks)', 1)
If you have BeautifulSoup and lxml installed, it's straightforward:
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup("""
... <?xml version="1.0" encoding="utf-8"?>
... <a>
... <b>foo</b>
... <c>bar</""", "xml")
>>> soup
<?xml version="1.0" encoding="utf-8"?>
<a>
<b>foo</b>
<c>bar</c></a>
Note the second "xml" argument to the constructor to avoid the XML being interpreted as HTML.

Parsing log4j in Python

I have a log file (log4j xml-ish format) that I am trying to pull info out of and use in my Python module. Could I treat this file as if it were XML? My gut is telling me no... If not, what is the best way to parse the data? Below is a section of the log file. The file does not include your standard doctype or version headers, which is why I said "xml-ish."
<log4j:event
logger="com.hp.cp.elk.impl.subscriptions.AsyncSimpleSubscriptionManager"
timestamp="1352320517430" level="DEBUG" thread="Thread-77">
<log4j:message><![CDATA[Broadcasting signals to subscribers...]]></log4j:message>
</log4j:event>
<log4j:event logger="com.hp.cp.jdf.idp.queue.IDPJobProgressMonitor"
timestamp="1352320517430" level="DEBUG" thread="IDPJobProgressMonitorThread">
<log4j:message><![CDATA[[JDFQueueEntry[ --> JDFAutoQueueEntry[ --> JDFElement[
--> <?xml version="1.0" encoding="UTF-8"?><QueueEntry
xmlns="http://www.CIP4.org/JDFSchema_1_1"
DescriptiveName="H44E61-6.pdf" DeviceID="HPPRO1-SM1"
EndTime="2012-11-07T10:58:18-08:00" JobID="Default" Priority="50"
QueueEntryID="d5fbbe98a1194e0da573b51a0c8040fb" Status="Completed"
SubmissionTime="2012-11-06T16:35:06-08:00"> <Comment AgentName="CIP4 JDF Writer
Java" AgentVersion="1.4a BLD 63" ID="c_121106_163506894_000804"
Name="JobSpec">WBG_4C_Flat_21up_BusCards_Duplex</Comment>
</QueueEntry>
] ] ]] queue entries changed.]]></log4j:message>
</log4j:event>
<log4j:event logger="com.hp.cp.jdf.idp.queue.IDPJobProgressMonitor"
timestamp="1352320517430" level="DEBUG" thread="IDPJobProgressMonitorThread">
<log4j:message><![CDATA[no active queue entries changed.]]></log4j:message>
</log4j:event>
Sorry for the messy code; I just wanted to make sure you all can get an idea of the formatting. Anyway, I'm currently just trying to pull the value from QueueEntryID="d5fbbe98a1194e0da573b51a0c8040fb". Any suggestions? Thank you!
I would imagine that you could use standard XML tools like DOM or SAX to parse this. Otherwise, have fun with re or htmllib.
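For the standard-tools route, here is a minimal sketch, assuming the events above are saved to a hypothetical file 'events.log'. The fragments lack a root element and a declaration for the log4j: prefix, so wrap them in a root that supplies both, then fish QueueEntryID out of the message text (the CDATA content, which the parser exposes as plain text) with a regex:

import re
import xml.etree.ElementTree as ET

LOG4J_NS = 'http://jakarta.apache.org/log4j/'  # assumed namespace URI

with open('events.log') as f:  # hypothetical file name
    raw = f.read()

# Supply the missing root element and log4j namespace declaration.
wrapped = '<events xmlns:log4j="%s">%s</events>' % (LOG4J_NS, raw)
root = ET.fromstring(wrapped)

for message in root.iter('{%s}message' % LOG4J_NS):
    # The embedded QueueEntry XML sits inside CDATA, i.e. as plain text,
    # so a regex is simpler here than a second parse.
    match = re.search(r'QueueEntryID="([^"]+)"', message.text or '')
    if match:
        print(match.group(1))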

xml to Python data structure using lxml

How can I convert xml to Python data structure using lxml?
I have searched high and low but can't find anything.
Input example
<ApplicationPack>
<name>Mozilla Firefox</name>
<shortname>firefox</shortname>
<description>Leading Open Source internet browser.</description>
<version>3.6.3-1</version>
<license name="Firefox EULA">http://www.mozilla.com/en-US/legal/eula/firefox-en.html</license>
<ms-license>False</ms-license>
<vendor>Mozilla Foundation</vendor>
<homepage>http://www.mozilla.org/firefox</homepage>
<icon>resources/firefox.png</icon>
<download>http://download.mozilla.org/?product=firefox-3.6.3&os=win&lang=en-GB</download>
<crack required="0"/>
<install>scripts/install.sh</install>
<postinstall name="Clean Up"></postinstall>
<run>C:\\Program Files\\Mozilla Firefox\\firefox.exe</run>
<uninstall>c:\\Program Files\\Mozilla Firefox\\uninstall\\helper.exe /S</uninstall>
<requires name="autohotkey" />
</ApplicationPack>
>>> from lxml import etree
>>> treetop = etree.fromstring(anxmlstring)
converts the xml in the string to a Python data structure, and so does
>>> othertree = etree.parse(somexmlurl)
where somexmlurl is the path to a local XML file or the URL of an XML file on the web.
The Python data structure these functions provide (known as an "element tree", whence the etree module name) is well documented here -- all the classes, functions, methods, etc. that the Python data structure in question supports. It closely matches one supported in the Python standard library, by the way.
If you want a different Python data structure, you'll have to walk through the element tree that lxml returns and build your own structure from the information you collect. lxml can't do that for you specifically, but it does offer several helpers for finding information in the parsed structure, so collecting that information is a flexible, easy task (again, see the documentation URL above).
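For example, a short sketch of those helpers against the ApplicationPack input above, assuming it has been saved to a hypothetical file 'pack.xml' (with the bare ampersands in the download URL escaped as &amp;, since the sample is not well-formed as shown):

from lxml import etree

tree = etree.parse('pack.xml')
root = tree.getroot()

print(root.findtext('name'))             # Mozilla Firefox
print(root.findtext('version'))          # 3.6.3-1
print(root.find('license').get('name'))  # Firefox EULA
print(root.xpath('//requires/@name'))    # ['autohotkey']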
It's not entirely clear what kind of data structure you're looking for, but here's a link to a code sample to convert XML to python dictionary of lists via lxml.etree.
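In the same spirit, here is a minimal sketch of such a conversion, turning an element tree into nested dicts, with repeated tags collected into lists and attributes merged into leaf nodes:

from lxml import etree

def element_to_dict(elem):
    children = list(elem)
    if not children:
        # Leaf node: return its text, merging in any attributes.
        if elem.attrib:
            d = dict(elem.attrib)
            if elem.text and elem.text.strip():
                d['text'] = elem.text.strip()
            return d
        return elem.text.strip() if elem.text else None
    d = {}
    for child in children:
        value = element_to_dict(child)
        if child.tag in d:
            # A repeated tag becomes a list of values.
            if not isinstance(d[child.tag], list):
                d[child.tag] = [d[child.tag]]
            d[child.tag].append(value)
        else:
            d[child.tag] = value
    return d

root = etree.parse('pack.xml').getroot()  # same hypothetical file as above
print(element_to_dict(root))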
