BeautifulSoup XML Only printing first line - python

I'm using BeautifulSoup4 (And lxml) to parse an XML file, for some reason when I print soup.prettify() it only prints the first line:
from bs4 import BeautifulSoup
f = open('xmlDoc.xml', "r")
soup = BeautifulSoup(f, 'xml')
print(soup.prettify())
#>>> <?xml version="1.0" encoding="utf-8"?>
Any idea why it's not grabbing everything?
UPDATE:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<!-- Data Junction generated file.
Macro type "1000" is reserved. -->
<djmacros>
<macro name="Test" type="5000" value="TestValue">
<description>test</description>
</macro>
<macro name="AnotherTest" type="0" value="TestValue2"/>
<macro name="TestLocation" type="1000" value="C:\RandomLocation">
<description> </description>
</macro>
<djmacros>

The file position is at EOF:
>>> soup = BeautifulSoup("", 'xml')
>>> soup.prettify()
'<?xml version="1.0" encoding="utf-8"?>\n'
Or the content is not valid xml:
>>> soup = BeautifulSoup("no <root/> element", 'xml')
>>> soup.prettify()
'<?xml version="1.0" encoding="utf-8"?>\n'

As per J.F. Sebastian's answer, the XML is invalid.
Your final tag is incorrect:
<djmacros>
The correct tag is:
</djmacros>
You can confirm this with an XML validator, e.g. http://www.w3schools.com/xml/xml_validator.asp
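You can also check this locally with the standard library's parser, which reports the same problem; a minimal sketch (document shortened for brevity):

```python
import xml.etree.ElementTree as ET

# Reproduces the bug above: the document "closes" with <djmacros>
# instead of </djmacros>, so the root element is never closed.
bad_xml = "<djmacros><macro name='Test'/><djmacros>"

try:
    ET.fromstring(bad_xml)
    print("valid")
except ET.ParseError as err:
    print("invalid XML:", err)
```

With the closing slash added, the same document parses cleanly.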

I had the same problem with a valid XML file.
The problem was that the XML file is encoded in UTF-8 with BOM.
I discovered that by printing the raw content:
content = open(path, "r").read()
print(content)
And I got this (see this thread: What's "" sign at the beginning of my source file?):
<?xml version="1.0" encoding="utf-8"?>

If the file is encoded as UTF-8 with BOM ("UTF-8-BOM") rather than plain UTF-8, parsing may fail even if the XML is otherwise valid.
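One way around it, assuming Python 3: decode with the utf-8-sig codec, which strips a leading BOM when present and is harmless otherwise. A minimal sketch:

```python
# A UTF-8 file with a BOM starts with the bytes EF BB BF.
raw = b'\xef\xbb\xbf<?xml version="1.0" encoding="utf-8"?><root/>'

with_bom = raw.decode("utf-8")         # keeps the BOM as '\ufeff'
without_bom = raw.decode("utf-8-sig")  # strips it

print(repr(with_bom[:6]))
print(repr(without_bom[:5]))  # '<?xml'
```

For files, `open(path, encoding="utf-8-sig")` does the same, and the BOM-free text can then be handed to BeautifulSoup.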

Related

Write over xml root, but keep header comment

I have a script set up that uses parse from the lxml library in Python to read in XML files, with some logic to remove a specified attribute. I would like to write over the XML elements but keep a header comment.
Example.xml
<?xml version="1.0" encoding="utf-8"?>
<!-- Original Header
Some more info -->
<Foo Name = "Bar" Pet = "Able">
<Foo2 Name = "Bar2" />
<Foo3 Name = "Bar3" />
</Foo>
I would like to write back into the xml file after going through the processing logic such that the Example.xml would look like this:
<?xml version="1.0" encoding="utf-8"?>
<!-- Original Header
Some more info -->
<Foo Name = "Bar">
<Foo2 Name = "Bar2" />
<Foo3 Name = "Bar3" />
</Foo>
Removing the attribute is something I have already figured out. Writing back into the same XML file while preserving the header comment is what I can't find a solution for. Worst case, I make an output folder and manually run a BeyondCompare on the files, but I would like to automate this completely.
You can do it very easily with BeautifulSoup using the lxml parser. First, open the file for reading, parse the contents using BeautifulSoup, make changes as required then open the file for writing and write to the file.
from bs4 import BeautifulSoup

with open('./Example.xml', 'r') as f:
    xml = f.read()

soup = BeautifulSoup(xml, 'lxml-xml')
foo = soup.find('Foo')
del foo['Pet']

with open('./Example.xml', 'w') as f:
    f.write(soup.prettify())
It can be written more concisely as:
from bs4 import BeautifulSoup

with open('./Example.xml', 'r') as f:
    soup = BeautifulSoup(f.read(), 'xml')

del soup.find('Foo')['Pet']

with open('./Example.xml', 'w') as f:
    f.write(soup.prettify())

lxml.etree.XMLSyntaxError, Document labelled UTF-16 but has UTF-8 content

lxml.etree.XMLSyntaxError, Document labelled UTF-16 but has UTF-8 content
I am getting this error using the lxml library in Python. Other solutions/hacks replace utf-16 with utf-8 in the file itself. What is the pythonic way to solve this?
python code:
import lxml.etree as etree
tree = etree.parse("req.xml")
req.xml:
<?xml version="1.0" encoding="utf-16"?>
<test
xmlns:xsd="http://www.w3.org/2001/XMLSchema"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
</test>
Have a look at the documentation of the XMLParser constructor:
>>> help(etree.XMLParser)
Among other options, there is an encoding parameter, which allows you to "override the document encoding", as the docs say.
That's exactly what you need:
parser = etree.XMLParser(encoding='UTF-8')
tree = etree.parse("req.xml", parser=parser)
If the error message is right (ie. there aren't any other problems with the document), then I expect this to work.
You can also parse the XML content using BeautifulSoup.
NOTE: if the document is mislabelled utf-16 but actually contains utf-8, you can work around the error by decoding the raw bytes yourself before parsing, so the bogus encoding declaration is never consulted.
So below is the code:
sample.xml contains following data:
<?xml version="1.0" encoding="utf-16"?>
<test
xmlns:xsd="http://www.w3.org/2001/XMLSchema"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
</test>
Code:
from bs4 import BeautifulSoup

with open("sample.xml", "rb") as f:               # open the xml file in binary mode
    content = f.read().decode('utf-8', 'ignore')  # decode the bytes ourselves

soup = BeautifulSoup(content, 'html.parser')      # html.parser ignores the encoding declaration
data = [tag.attrs for tag in soup.findAll("test")]
print(data)
Output:
[{'xmlns:xsd': 'http://www.w3.org/2001/XMLSchema', 'xmlns:xsi': 'http://www.w3.org/2001/XMLSchema-instance'}]

BeautifulSoup does not parse content of the tag like first-name that contains '-'

Hi i have a response like below
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<person>
<first-name>hede</first-name>
<last-name>hodo</last-name>
<headline>Python Developer at hede</headline>
<site-standard-profile-request>
<url>http://www.linkedin.com/profile/view?id=hede&authType=godasd*</url>
</site-standard-profile-request>
</person>
And I want to parse the content returned from linkedin api.
I am using beautifulsoup like below
ipdb> hede = BeautifulSoup(response.content)
ipdb> hede.person.headline
<headline>Python Developer at hede</headline>
But when i do
ipdb> hede.person.first-name
*** NameError: name 'name' is not defined
Any ideas ?
Python attribute names cannot contain a hyphen.
Instead, use
hede.person.findChild('first-name')
Also, to parse XML with BeautifulSoup, use
hede = bs.BeautifulSoup(content, 'xml')
or if you have lxml installed,
hede = bs.BeautifulSoup(content, 'lxml')
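The underlying reason for the NameError: Python tokenizes first-name as a subtraction, so it evaluates hede.person.first and then looks up a variable called name. A small sketch with the stdlib ast module makes this visible:

```python
import ast

# 'hede.person.first-name' parses as (hede.person.first) - (name),
# i.e. a binary subtraction, not a single attribute lookup.
tree = ast.parse("hede.person.first-name", mode="eval")
print(type(tree.body).__name__)     # BinOp
print(type(tree.body.op).__name__)  # Sub
```

That is why tag names containing hyphens must go through find()/findChild() rather than attribute access.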

Can CDATA sections be preserved by BeautifulSoup?

I'm using BeautifulSoup to read, modify, and write an XML file. I'm having trouble with CDATA sections being stripped out. Here's a simplified example.
The culprit XML file:
<?xml version="1.0" ?>
<foo>
<bar><![CDATA[
!##$%^&*()_+{}|:"<>?,./;'[]\-=
]]></bar>
</foo>
And here's the Python script.
from bs4 import BeautifulSoup
xmlfile = open("cdata.xml", "r")
soup = BeautifulSoup( xmlfile, "xml" )
print(soup)
Here's the output. Note the CDATA section tags are missing.
<?xml version="1.0" encoding="utf-8"?>
<foo>
<bar>
!##$%^&*()_+{}|:"<>?,./;'[]\-=
</bar>
</foo>
I also tried printing soup.prettify(formatter="xml") and got the same result with slightly different whitespace. There isn't much in the docs about reading in CDATA sections, so maybe this is an lxml thing?
Is there a way to tell BeautifulSoup to preserve CDATA sections?
Update Yes, it's an lxml thing. http://lxml.de/api.html#cdata So, the question becomes, is it possible to tell BeautifulSoup to initialize lxml with strip_cdata=False?
In my case, if I use
soup = BeautifulSoup( xmlfile, "lxml-xml" )
then the CDATA sections are preserved and accessible.
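BeautifulSoup itself does not expose a way to pass strip_cdata through to lxml, but if lxml is installed you can see the flag's effect directly; a sketch with plain lxml.etree:

```python
import lxml.etree as etree

xml = b"<foo><bar><![CDATA[ <>&'\" ]]></bar></foo>"

# The default parser replaces CDATA sections with ordinary escaped text...
stripped = etree.tostring(etree.fromstring(xml))

# ...while strip_cdata=False keeps them intact in the serialized output.
parser = etree.XMLParser(strip_cdata=False)
kept = etree.tostring(etree.fromstring(xml, parser))

print(stripped)
print(kept)
```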

How to get XML tag value in Python

I have some XML in a unicode-string variable in Python as follows:
<?xml version='1.0' encoding='UTF-8'?>
<results preview='0'>
<meta>
<fieldOrder>
<field>count</field>
</fieldOrder>
</meta>
<result offset='0'>
<field k='count'>
<value><text>6</text></value>
</field>
</result>
</results>
How do I extract the 6 in <value><text>6</text></value> using Python?
With lxml:
import lxml.etree
# xmlstr is your xml in a string
root = lxml.etree.fromstring(xmlstr)
textelem = root.find('result/field/value/text')
print(textelem.text)
Edit: But I imagine there could be more than one result...
import lxml.etree
# xmlstr is your xml in a string
root = lxml.etree.fromstring(xmlstr)
results = root.findall('result')
textnumbers = [r.find('field/value/text').text for r in results]
BeautifulSoup is the simplest way to parse XML, as far as I know...
Assuming you have read the introduction, simply use:
soup = BeautifulSoup(xmlstr, 'xml')
print(soup.find('text').string)
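For reference, the standard library's xml.etree.ElementTree handles this document too, without any third-party install; a minimal sketch using the XML from the question:

```python
import xml.etree.ElementTree as ET

xmlstr = """<?xml version='1.0' encoding='UTF-8'?>
<results preview='0'>
  <meta>
    <fieldOrder>
      <field>count</field>
    </fieldOrder>
  </meta>
  <result offset='0'>
    <field k='count'>
      <value><text>6</text></value>
    </field>
  </result>
</results>"""

root = ET.fromstring(xmlstr)
values = [t.text for t in root.iter("text")]  # every <text> element, at any depth
print(values)  # ['6']
```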
