I have a text file where I have some XML data and some HTML data. Both start with "<". Now I want to extract only XML data and save it in another file. How can I do it?
File example:
xyz data:
<note>
<to>john</to>
<from>doe</from>
<heading>Reminder</heading>
<body>Don't forget me this weekend!</body>
</note>
xyz data
<bold>xyz</bold>
text
text
text
<bold>xyz</bold>
again XML data
Note: This file is in .txt format.
I would treat your whole input not as XML, but as an HTML fragment. HTML can contain non-standard elements, so <note> etc. is fine.
For convenience I suggest pyquery (link) to deal with HTML. It works pretty much the same way as jQuery, so if you've worked with that before, it should be familiar.
It's pretty straight-forward. Load your data, wrap it in "<html></html>", parse it, query it.
from pyquery import PyQuery as pq
data = """xyz data:
<note>
<to>john</to>
<from>doe</from>
<heading>Reminder</heading>
<body>Don't forget me this weekend!</body>
</note>
xyz data
<bold>xyz</bold>
text
text
text
<bold>xyz</bold>
again XML data"""
doc = pq(f"<html><body>{data}</body></html>")
note = doc.find("note")
print(note.find("body").text())
which prints "Don't forget me this weekend!".
Related
I have multiple xml files in which the tags are not closed properly. They are often moved to the next line.
My XML
<note>
<to>Tove</to>
<from>Jani</from>
<heading>Reminder</heading>
<body>Don't forget me this weekend!</body>
<item id='my-img' href='images/dp.jpg' media-type='image/jpeg'/
>
</note>
Is there any pythonic way of converting it to desired XML like below? :
<note>
<to>Tove</to>
<from>Jani</from>
<heading>Reminder</heading>
<body>Don't forget me this weekend!</body>
<item id='my-img' href='images/dp.jpg' media-type='image/jpeg'/>
</note>
I have tried using beautifulsoup. But it doesn't seem to format this properly.
The simplest way i could think of would be to remove line breaks entirely, load the xml and save a prettyfied version. The default repr only fixes the tags but keeps the format the same.
from bs4 import BeautifulSoup as bs
with open("file.xml","r") as f:
xml_str = f.read(); # Load file
xml = bs(xml_str, "lxml") # parse file
with open("file.xml","w") as f:
f.write(repr(xml)) # write pretty file
Note: we are using the lxml parser which needs to be installed using
pip3 install lxml
Edit: no need to remove the line breaks
It all depends on how many xml fields you need to fix and if there is a pattern in it. From what you put as an example, might be easier to use find and replace (for example using regular expressions).
Let's take the StackOverflow public export of Tags, which looks something like this:
<?xml version="1.0" encoding="utf-8"?>
<tags>
<row Id="1" TagName=".net" Count="303362" ExcerptPostId="3624959" WikiPostId="3607476" />
<row Id="2" TagName="html" Count="1038358" ExcerptPostId="3673183" WikiPostId="3673182" />
<row Id="3" TagName="javascript" Count="2130783" ExcerptPostId="3624960" WikiPostId="3607052" />
...
Let's assume that this object wouldn't fit in memory, but since it's line-separated I think it'd be OK to process without doing too much trickery. What might be a good approach to process a file like this? My thought was just to process it by row, building faking the xml structure, something like:
for line in file:
node = etree.fromstring('<x>%s</x>' % line).find('row')
...
Is this a common approach for handling xml that is "row-oriented" that is too big to fit in memory? I see this commonly with DB exports (actually i think the db client I use does that format, though I never use xml-exports from a db).
SAX is a abbreviation of Simple API to XML and provides a very fast and lightweight access to XML data. In contrast to DOM, which reads the whole XML document and creates a in-memory tree representation from it, SAX gives the program available information immediately after they are read from the document during parsing. The workflow is a follows - the parser reads the document and each time it finds a distinct type of XML data, such as start of an element, end of an element, processing instruction, etc. it passes the information to the so called handler
See a concrete example here: http://python.zirael.org/e-sax1.html
I have an xml file with this as the header
<?xml version="1.0" encoding="utf-8"?>
<?xml-stylesheet type='text/xsl' href='\\segotn12805\ppr\PPRData3\StyleSheet\PPRData3.xslt'?>
when I modify the file I use .write (for example)
mytree.write('output.xml')
but the output file does not contain the header info.
The first two lines of the output file look like this
<ns0:pprdata xmlns:ns0="http://ManHub.PPRData">
<ns0:Group name="Models">
any ideas on how I can add the header info to the output file?
The first line is the XML declaration. It is optional, and a parser will assume UTF-8 if not specified.
The second line is a processing instruction.
It would be helpful if you provided more code to show what you are doing, but I suspect that you are using ElementTree. The documentation has this note indicating that by default these are skipped:
Note Not all elements of the XML input will end up as elements of the parsed tree. Currently, this module skips over any XML comments, processing instructions, and document type declarations in the input. Nevertheless, trees built using this module’s API rather than parsing from XML text can have comments and processing instructions in them; they will be included when generating XML output. A document type declaration may be accessed by passing a custom TreeBuilder instance to the XMLParser constructor.
As suggested in this answer, you might want to try using lxml
I'm trying to build a xml file in python so I can write it out to a file, but I'm getting complications with new lines and tabbing etc...
I cannot use a module to do this - because Im using a cut down version of python 2. It must all be in pure python.
For instance, how is it possible to create a xml file with this type of formatting, which keeps all the new lines and tabs (whitespace)?
e.g.
<?xml version="1.0" encoding="UTF-8"?>
<myfiledata>
<mydata>
blahblah
</mydata>
</myfiledata>
I've tried enclosing each line
' <myfiledata>' +\n
' blahblah' +\n
etc.
However, the output Im getting from the script is not anything close to how it looks in my python file, there is extra white space and the new lines arent properly working.
Is there any definitive way to do this? I would rather be editing a file that looks somewhat like what I will end up with - for clarity sake...
You can use XMLGenerator from saxutils to generate the XML and xml.dom.minidom to parse it and print the pretty xml (both modules from standard library in Python 2).
Sample code creating a XML and pretty-printing it:
from __future__ import print_function
from xml.sax.saxutils import XMLGenerator
import io
import xml.dom.minidom
def pprint_xml_string(s):
"""Pretty-print an XML string with minidom"""
parsed = xml.dom.minidom.parse(io.BytesIO(s))
return parsed.toprettyxml()
# create a XML file in-memory:
fp = io.BytesIO()
xg = XMLGenerator(fp)
xg.startDocument()
xg.startElement('root', {})
xg.startElement('subitem', {})
xg.characters('text content')
xg.endElement('subitem')
xg.startElement('subitem', {})
xg.characters('text content for another subitem')
xg.endElement('subitem')
xg.endElement('root')
xg.endDocument()
# pretty-print it
xml_string = fp.getvalue()
pretty_xml = pprint_xml_string(xml_string)
print(pretty_xml)
Output is:
<?xml version="1.0" ?>
<root>
<subitem>text content</subitem>
<subitem>text content for another subitem</subitem>
</root>
Note that the text content elements (wrapped in <subitem> tags) aren't indented because doing so would change their content (XML doesn't ignore whitespace like HTML does).
The answer was to use xml.element.tree and from xml.dom import minidom
Which are all available on python 2.5
I receive an email when a system in my company generates an error. This email contains XML all crammed onto a single line.
I wrote a notepad++ Python script that parses out everything except XML and pretty prints it. Unfortunately some of the emails contain too much XML data and it gets truncated. In general, the truncated data isn't that important to me. I would like to be able to just auto-close any open tags so that my Python script works. It doesn't need to be smart or correct, it just needs to make the xml well-enough formed that the script runs. Is there a way to do this?
I am open to Python scripts, online apps, downloadable apps, etc.
I realize that the right solution is to get the non-truncated xml, but pulling the right lever to get things done will be far more work than just dealing with it.
Use Beautiful Soup
>>> import bs4
>>> s= bs4.BeautifulSoup("<asd><xyz>asd</xyz>")
>>> s
<html><head></head><body><asd><xyz>asd</xyz></asd></body></html>
>>
>>> s.body.contents[0]
<asd><xyz>asd</xyz></asd>
Notice that it closed the "asd" tag automagically"
To create a notepad++ script to handle this,
download the tarball and extract the files
Copy the bs4 directory to your PythonScript/scripts folder.
In notepad++ add the following code to your python script
#import Beautiful Soup
import bs4
#get text in document
text = editor.getText()
#soupify it to fix XML
soup = bs4.BeautifulSoup(text)
#convert soup object to string again
text = str(soup)
#clear editor and replace bad xml with fixed xml
editor.clearAll()
editor.addText(text)
#change language to xml
notepad.menuCommand( MENUCOMMAND.LANG_XML )
#soup has its own prettify, but I like the XML tools version better
notepad.runMenuCommand('XML Tools', 'Pretty print (XML only - with line breaks)', 1)
If you have BeautifulSoup and lxml installed, it's straightforward:
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup("""
... <?xml version="1.0" encoding="utf-8"?>
... <a>
... <b>foo</b>
... <c>bar</""", "xml")
>>> soup
<?xml version="1.0" encoding="utf-8"?>
<a>
<b>foo</b>
<c>bar</c></a>
Note the second "xml" argument to the constructor to avoid the XML being interpreted as HTML.