Failing to parse 1 MB XML File with BS4 - python

I have two SVG maps of the world, downloaded here. My goal is to do some editing of these maps in Python, working with them via BeautifulSoup4. This works perfectly with the low-res file (132.5 KB). However, the BS4 parser (using lxml) fails entirely when I attempt to use it with the high-res file (1.2 MB).
The code is as follows:
import lxml
from bs4 import BeautifulSoup as Soup

with open('worldHigh.svg', 'r') as f:
    handler = f.read()

soup = Soup(handler, 'xml')
print(soup.prettify())
When I run that with the worldHigh.svg file, the only thing that is printed is
<?xml version="1.0" encoding="utf-8"?>
When I run the equivalent code, but swapping worldHigh.svg for worldLow.svg, it prints the XML correctly (as desired).
Both SVG files work fine when opened by themselves (i.e., they show the map). However, one fails when I try to parse it, the other succeeds. I am at a loss for what is going wrong. I would understand if the parser fails at large sizes, but 1.2 MB does not seem large.

The XML parser needs the raw sequence of unencoded bytes, so use open(..., 'rb') when parsing XML.
The reason one file worked and the other didn't is that worldHigh.svg has a BOM (byte order mark) at the beginning of the file.
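A minimal sketch of the fix, assuming the file names from the question: open the file in binary mode and let the parser work out the encoding (and the BOM) itself.

from bs4 import BeautifulSoup as Soup

# Binary mode hands the parser the raw bytes, so the leading BOM
# is handled as part of encoding detection instead of breaking parsing.
with open('worldHigh.svg', 'rb') as f:
    soup = Soup(f.read(), 'xml')

print(soup.prettify())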

Related

cannot open files at BeautifulSoup lxml parser

I am trying to read the contents of an HTML file with BeautifulSoup, but I'm receiving a UnicodeDecodeError.
I also tried changing the parser to html.parser instead of lxml, but it doesn't work.
However, if I use the requests library to fetch the URL, it works; it only fails when I read the HTML file locally.
answer:
I needed to specify the encoding; the open call should look something like this: with open('lap.html', encoding="utf8") as html_file:
You are passing a file object to BeautifulSoup; instead you have to pass the content of the file.
Try:
content = html_file.read()
source = BeautifulSoup(content, 'lxml')
First of all, fix the typo soruce to source and put spaces around the equals sign. Then find out what cannot be handled by the encoding you use, because that error refers to a character which can't be decoded or encoded.
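Putting those answers together, a minimal sketch (assuming the lap.html file name from the question): open the file with an explicit encoding, read its content, and pass that string to BeautifulSoup.

from bs4 import BeautifulSoup

# An explicit encoding avoids the UnicodeDecodeError on systems
# whose default file encoding is not UTF-8.
with open('lap.html', encoding='utf8') as html_file:
    content = html_file.read()

source = BeautifulSoup(content, 'lxml')
print(source.title)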

Python parse xml file

I need to parse an XML file; which method would be best for my case: beautifulsoup4, ElementTree, etc.? It's a pretty big file.
I have Windows 10 64-bit running Python 2.7.11 32-bit.
XML file:
http://pastebin.com/jTDRwCZr
I'm trying to get this output from the XML file. It contains different languages, using <div xml:lang="English"> for English. Any help on how I can use BeautifulSoup with lxml to achieve this? Thanks for your time.
<tt xmlns="http://www.w3.org/2006/04/ttaf1" xmlns:tts="http://www.w3.org/2006/04/ttaf1#styling">
  <head>
    <styling>
      <style id="1" tts:textOutline='#000000 2px 2px' tts:color="white"/>
    </styling>
  </head>
  <body>
    <div xml:lang="English">
      <p begin="00:00:28.966" end="00:00:31.385" style="1">
        text text text...
      </p>
    </div>
  </body>
</tt>
The file that you link to is not so large that you need to worry about alternative methods of parsing and processing it.
Assuming that you are trying to remove all non-English language divs you can do it with BeautifulSoup:
from bs4 import BeautifulSoup

with open('input.xml') as infile:
    soup = BeautifulSoup(infile, 'lxml')

for e in soup.find_all('div', attrs={'xml:lang': lambda value: value != 'English'}):
    _ = e.extract()

with open('output.xml', 'w') as outfile:
    outfile.write(soup.prettify(soup.original_encoding))
In the code above, soup.find_all() finds all divs that have an xml:lang attribute with a value other than 'English'. It then removes the matching elements with extract(). Finally, the resulting document is written out to a new file using the same encoding as the input (otherwise it would default to UTF-8).
Usually the DOM approach is quick and easy to use (up to ~10 MB). However, for really big XML files (> 50 MB) the DOM approach cannot be used, since it parses the entire XML document into memory; it can take up to 3-4 GB of RAM to parse only about 100 MB of data, and it gets significantly slower.
So the other option is iterative or event-based parsing of XML files.
For iterative parsing, the ElementTree or lxml approaches can be used. Plain ElementTree is quite slow, so I would recommend cElementTree: a similar API, but implemented in C, which is significantly faster than ElementTree.
I recently used ElementTree for parsing XML files larger than 100 MB, and it has been working really well for me so far! I'm not sure about lxml.
I would check online for more information on how to use the XML parsing APIs.
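As a sketch of the iterative approach, assuming the TTML structure from the sample above and an input.xml file name (on Python 2.7, as in the question, you can import xml.etree.cElementTree instead for speed): iterparse processes elements as they are completed and lets you discard them, so the whole tree never sits in memory at once.

import xml.etree.ElementTree as ET

# Namespaces from the sample document above.
TT = '{http://www.w3.org/2006/04/ttaf1}'
XML = '{http://www.w3.org/XML/1998/namespace}'

for event, elem in ET.iterparse('input.xml', events=('end',)):
    if elem.tag == TT + 'div':
        if elem.get(XML + 'lang') == 'English':
            for p in elem.findall(TT + 'p'):
                print(p.text)
        elem.clear()  # drop the element's children once processed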

Reading a page and parsing it with minidom.parse or minidom.parseString in Python?

I have tried both of these snippets:
import urllib
from xml.dom import minidom
res = urllib.urlopen('https://www.google.com/webhp#q=apple&start=10')
dom = minidom.parse(res)
which gives me the error xml.parsers.expat.ExpatError: syntax error: line 1, column 0
Or this:
import urllib
from xml.dom import minidom
res = urllib.urlopen('https://www.google.com/webhp#q=apple&start=10')
dom = minidom.parseString(res.read())
which gives me the same error. res.read() reads fine and is a string.
I would like to parse through the code later. How can I do this using xml.dom.minidom?
The reason you're getting this error is that the page isn't valid XML. It's HTML 5. The doctype right at the top tells you this, even if you ignore the content type. You can't parse HTML with an XML parser.*
If you want to stick with what's in the stdlib, you can use html.parser (Python 3.x) / HTMLParser (2.x).** However, you may want to consider third-party libraries like lxml (which, despite the name, can parse HTML), html5lib, or BeautifulSoup (which wraps up a lower-level parser in a really nice interface).
* Well, unless it's XHTML, or the XML output of HTML5, but that's not the case here.
** Do not use htmllib unless you're using an old version of Python without a working HTMLParser. This module is deprecated for a reason.
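A minimal sketch of the BeautifulSoup route, reusing the URL from the question (note that the fragment after # is never sent to the server, and Google builds much of this page with JavaScript, so the static HTML may not contain the results you expect):

import urllib  # Python 2, as in the question; use urllib.request on Python 3
from bs4 import BeautifulSoup

res = urllib.urlopen('https://www.google.com/webhp#q=apple&start=10')
soup = BeautifulSoup(res.read(), 'html.parser')  # an HTML parser, not an XML one

print(soup.title.string)
for link in soup.find_all('a'):
    print(link.get('href'))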

How to auto-close xml tags in truncated file?

I receive an email when a system in my company generates an error. This email contains XML all crammed onto a single line.
I wrote a Notepad++ Python script that parses out everything except the XML and pretty-prints it. Unfortunately some of the emails contain too much XML data and it gets truncated. In general, the truncated data isn't that important to me. I would like to be able to just auto-close any open tags so that my Python script works. It doesn't need to be smart or correct; it just needs to make the XML well-formed enough that the script runs. Is there a way to do this?
I am open to Python scripts, online apps, downloadable apps, etc.
I realize that the right solution is to get the non-truncated xml, but pulling the right lever to get things done will be far more work than just dealing with it.
Use Beautiful Soup
>>> import bs4
>>> s = bs4.BeautifulSoup("<asd><xyz>asd</xyz>")
>>> s
<html><head></head><body><asd><xyz>asd</xyz></asd></body></html>
>>> s.body.contents[0]
<asd><xyz>asd</xyz></asd>
Notice that it closed the "asd" tag automagically.
To create a Notepad++ script to handle this:
Download the Beautiful Soup tarball and extract the files.
Copy the bs4 directory to your PythonScript/scripts folder.
In Notepad++, add the following code to your Python script:
#import Beautiful Soup
import bs4
#get text in document
text = editor.getText()
#soupify it to fix XML
soup = bs4.BeautifulSoup(text)
#convert soup object to string again
text = str(soup)
#clear editor and replace bad xml with fixed xml
editor.clearAll()
editor.addText(text)
#change language to xml
notepad.menuCommand( MENUCOMMAND.LANG_XML )
#soup has its own prettify, but I like the XML tools version better
notepad.runMenuCommand('XML Tools', 'Pretty print (XML only - with line breaks)', 1)
If you have BeautifulSoup and lxml installed, it's straightforward:
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup("""
... <?xml version="1.0" encoding="utf-8"?>
... <a>
... <b>foo</b>
... <c>bar</""", "xml")
>>> soup
<?xml version="1.0" encoding="utf-8"?>
<a>
<b>foo</b>
<c>bar</c></a>
Note the second "xml" argument to the constructor to avoid the XML being interpreted as HTML.

xml to Python data structure using lxml

How can I convert xml to Python data structure using lxml?
I have searched high and low but can't find anything.
Input example
<ApplicationPack>
  <name>Mozilla Firefox</name>
  <shortname>firefox</shortname>
  <description>Leading Open Source internet browser.</description>
  <version>3.6.3-1</version>
  <license name="Firefox EULA">http://www.mozilla.com/en-US/legal/eula/firefox-en.html</license>
  <ms-license>False</ms-license>
  <vendor>Mozilla Foundation</vendor>
  <homepage>http://www.mozilla.org/firefox</homepage>
  <icon>resources/firefox.png</icon>
  <download>http://download.mozilla.org/?product=firefox-3.6.3&os=win&lang=en-GB</download>
  <crack required="0"/>
  <install>scripts/install.sh</install>
  <postinstall name="Clean Up"></postinstall>
  <run>C:\\Program Files\\Mozilla Firefox\\firefox.exe</run>
  <uninstall>c:\\Program Files\\Mozilla Firefox\\uninstall\\helper.exe /S</uninstall>
  <requires name="autohotkey" />
</ApplicationPack>
>>> from lxml import etree
>>> treetop = etree.fromstring(anxmlstring)
converts the XML in the string to a Python data structure, and so does
>>> othertree = etree.parse(somexmlurl)
where somexmlurl is the path to a local XML file or the URL of an XML file on the web.
The Python data structure these functions provide (known as an "element tree", whence the etree module name) is well documented here: all the classes, functions, methods, etc., that it supports. It closely matches one supported in the Python standard library, by the way.
If you want a different Python data structure, you'll have to walk through the element tree that lxml returns and build your own structure from the information you collect. lxml can't do that for you specifically, but it offers several helpers for finding information in the parsed structure, so collecting that information is a flexible, easy task (again, see the documentation linked above).
It's not entirely clear what kind of data structure you're looking for, but here's a link to a code sample to convert XML to python dictionary of lists via lxml.etree.
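For illustration, a hedged sketch of that walk (the helper name element_to_dict is hypothetical, not part of lxml; the file name is assumed), building a plain dictionary from the ApplicationPack example above:

from lxml import etree

def element_to_dict(elem):
    # Hypothetical helper: leaf elements become their text,
    # elements with children become nested dicts (attributes
    # and repeated tags are ignored for brevity).
    if len(elem) == 0:
        return elem.text
    return {child.tag: element_to_dict(child) for child in elem}

tree = etree.parse('applicationpack.xml')  # assumed file name
pack = element_to_dict(tree.getroot())
print(pack['name'])  # Mozilla Firefox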
