Find all titles in an XML file with ElementTree from a bz2 file - python

I'm new to parsing in XML and am stuck with my code regarding finding all titles (title tags) in an XML. This is what I came up with, but it is returning just an empty list, while there should be titles in there.
import bz2
from xml.etree import ElementTree as etree
def parse_xml(filename):
    with bz2.BZ2File(filename) as f:
        doc = etree.parse(f)
    titles = doc.findall('.//{http://www.mediawiki.org/xml/export-0.7/}title')
    print titles[:10]
Can someone tell me why this is not working properly? Just to be clear: I need the text inside every title tag, stored in a list, taken from an XML file compressed with bz2 (from what I have read, the best way is to do this without decompressing the file first).
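One possible cause, offered as an assumption because the dump itself is not shown, is that the namespace declared in the file is not exactly http://www.mediawiki.org/xml/export-0.7/ (the version in the URI varies between dumps). The sketch below, written for Python 3, avoids hard-coding the namespace by comparing only the local part of each tag, and streams the compressed file with iterparse. The function name find_titles and the "page" element used for freeing memory are illustrative assumptions.

```python
import bz2
from xml.etree import ElementTree as etree


def find_titles(filename):
    """Return the text of every <title> element, whatever its namespace."""
    titles = []
    with bz2.open(filename, "rb") as f:
        for event, elem in etree.iterparse(f, events=("end",)):
            # Namespaced tags look like '{namespace-uri}title';
            # keep only the part after the closing brace.
            local_name = elem.tag.rsplit("}", 1)[-1]
            if local_name == "title":
                titles.append(elem.text)
            elif local_name == "page":
                # Assumption: titles live inside <page> elements, so a
                # finished page can be emptied to keep memory use low.
                elem.clear()
    return titles
```

Calling find_titles('dump.xml.bz2')[:10] would then give the first ten titles; the file name here is only a placeholder.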

Related

How can I see the content of an XML file in python

I'm struggling to find a way of seeing the content of an XML file. I have done a lot of searching, and the only progress I am making is to keep running my code without any results.
Have a look at the BeautifulSoup package and use the lxml parser.
Based on this url:
https://pymotw.com/2/xml/etree/ElementTree/parse.html
the relevant code:
from xml.etree import ElementTree
with open('example.xml', 'rt') as f:
    tree = ElementTree.parse(f)
print tree
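This parses the file and prints the resulting ElementTree object, not the XML text itself. To see the XML, serialize the root element, for example with ElementTree.tostring(tree.getroot()).
The parsed tree is also what you use to search for elements.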
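Printing the tree object shows only its representation, so to view the content the root has to be serialized or walked. A minimal Python 3 sketch follows; the in-memory sample stands in for example.xml and its content is made up for illustration.

```python
import io
from xml.etree import ElementTree

# Stand-in for open('example.xml'); the content is illustrative only.
sample = io.StringIO(
    "<root><item id='1'>first</item><item id='2'>second</item></root>"
)

tree = ElementTree.parse(sample)
root = tree.getroot()

# Serialize the whole tree back to a string to view it.
xml_text = ElementTree.tostring(root, encoding="unicode")
print(xml_text)

# Walk every element to inspect tags, attributes and text.
for elem in root.iter():
    print(elem.tag, elem.attrib, elem.text)
```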

Python lxml error "namespace not defined."

I am being driven crazy by some oddly formed xml and would be grateful for some pointers:
The documents are defined like this:
<sphinx:document id="18059090929806848187">
<url>http://www.some-website.com</url>
<page_number>104</page_number>
<size>7865</size>
</sphinx:document>
Now, I need to read lots of these files (500m+, all gzip-compressed) and grab the text values from a few of the contained tags.
Sample code:
from lxml import objectify, etree
import gzip
with open('file_list','rb') as file_list:
    for file in file_list:
        in_xml = gzip.open(file.strip('\n'))
        xml2 = etree.iterparse(in_xml)
        for action, elem in xml2:
            if elem.tag == "page_number":
                print elem.text + str(file)
The first value of elem.text is returned, but only for the first file in the list, and it is quickly followed by the error:
lxml.etree.XMLSyntaxError: Namespace prefix sphinx on document is not defined, line 1, column 20
Please excuse my ignorance but xml really hurts my head and I have been struggling with this for a while. Is there a way that I can either define the namespace prefix or handle this in some other more intelligent manner?
Thanks
Your input file is not well-formed XML. I assume that it is a snippet from a larger XML document.
Your choices are:
Reconstruct the larger document. How you do this is specific to your application. You may have to consult with the people that created the file you are parsing.
Parse the file in spite of its errors. To do that, use the recover keyword from lxml.etree.iterparse:
xml2 = etree.iterparse(in_xml, recover=True)
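For the first choice, one way to "reconstruct" a parseable document is to wrap the fragment in a root element that declares the missing prefix. The sketch below uses only the Python 3 standard library. The namespace URI, the wrapper element name and the function name page_numbers are placeholders chosen for illustration, not anything the files or their producer define, and the approach assumes the fragment carries no XML declaration of its own.

```python
from xml.etree import ElementTree

# Hypothetical URI: any string works as long as the prefix is bound to it.
SPHINX_NS = "http://example.com/sphinx"

fragment = """<sphinx:document id="18059090929806848187">
<url>http://www.some-website.com</url>
<page_number>104</page_number>
<size>7865</size>
</sphinx:document>"""


def page_numbers(fragment_text):
    """Wrap the fragment so the 'sphinx' prefix is declared, then parse it."""
    wrapped = '<wrapper xmlns:sphinx="%s">%s</wrapper>' % (SPHINX_NS, fragment_text)
    root = ElementTree.fromstring(wrapped)
    # page_number carries no prefix, so it is matched by its plain name.
    return [elem.text for elem in root.iter("page_number")]


print(page_numbers(fragment))
```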

Python ElementTree doesn't seem to recognize text nodes

I am trying to parse a simple XML document located at http://www.webservicex.net/airport.asmx/getAirportInformationByAirportCode?airportCode=jfk using the ElementTree module. The code (so far):
import urllib2
from xml.etree import ElementTree
from xml.etree.ElementTree import Element
from xml.etree.ElementTree import SubElement
url = "http://www.webservicex.net/airport.asmx/getAirportInformationByAirportCode?airportCode=jfk"
s = urllib2.urlopen(url)
print s
document = ElementTree.parse(s)
root = document.getroot()
print root
dataset = SubElement(root, 'NewDataSet')
print dataset
table = SubElement(dataset, 'Table')
print table
airportName = SubElement(table, 'CityOrAirportName')
print airportName.text
The final line yields "None", not the name of the airport in the XML. Can anyone assist? This should be relatively simple, but I am missing something.
Look at the documentation for that module. It says, among other things:
The SubElement() function also provides a convenient way to create new sub-elements for a given element
In particular note the word create. You are creating a new element, not reading the elements that are already there.
If you want to locate certain elements within the parsed XML, read the rest of the documentation on that page to understand how to use the library to do that.
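A minimal Python 3 sketch of the difference between reading and creating. The inline XML imitates the NewDataSet / Table / CityOrAirportName nesting implied by the question's code; the real service response may be shaped differently (for example, wrapped in another element), and the values are sample data.

```python
from xml.etree import ElementTree

# Sample data shaped after the element names in the question (assumed layout).
xml_text = """<NewDataSet>
  <Table>
    <AirportCode>JFK</AirportCode>
    <CityOrAirportName>NEW YORK JOHN F KENNEDY</CityOrAirportName>
  </Table>
</NewDataSet>"""

root = ElementTree.fromstring(xml_text)

# find() returns the first matching existing child (or None); it creates nothing.
table = root.find("Table")
airport_name = table.findtext("CityOrAirportName")
print(airport_name)

# SubElement(), by contrast, appends a brand-new, empty element.
created = ElementTree.SubElement(table, "CityOrAirportName")
print(created.text)  # the new element has no text yet
```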

Trouble parsing html files (to csv) using ElementTree xpath in python

I am trying to parse a few thousand html files and dump the variables into a csv file (excel spreadsheet). I've come up against several roadblocks, the first of which was (thankfully) solved here a few days ago. The (hopefully) final roadblock is this: I cannot get it to properly parse the file using xpath. Below is a brief explanation, the python code and an example of the html code.
The trouble starts here:
for node in tree.iter():
    name = node.attrib.get('/html/body/table/tbody/tr/td/table/tbody/tr[3]/td/table/tbody/tr[2]/td[2]/table/tbody/tr[1]/td[1]/p/span')
    if category =='/html/body/center/table/tbody/tr/td/table/tbody/tr[3]/td/table/tbody/tr[2]/td[2]/table/tbody/tr[1]/td[1]/font':
        category=node.text
It runs, but does not parse. I do not get any traceback errors.
I think I am misunderstanding the logic of parsing with ElementTree.
There are several headers that are the same, so it is difficult to find a unique id/header. Here is an example of the html:
<span class="s1">Business: Give Back to the Community and Save Money
on Equipment, Technology, Promotional Products, and Market<span
class="Apple-converted-space"> </span></span>
For which the xpath is:
/html/body/table/tbody/tr/td/table/tbody/tr[3]/td/table/tbody/tr[2]/td[2]
/table/tbody/tr[1]/td[1]/p/span
I would like to scrape the text from this span (among others) and put it in the excel spreadsheet.
You can see an example of a similar page HERE
At any rate, because many spans/headers are not uniquely identified, I think I should use xpath. However, I have yet to figure out how to successfully use xpath commands with ElementTree. In searching the documentation, the answer to this question (as well as the logic) eludes me. I have read up on http://lxml.de/parsing.html as well as on this site and have yet to find something that works.
So far, the code iterates through all the files (in dropbox) nicely. It also creates the csv file and creates the headers (though not in separate columns, only as one line separated by semicolons-- but that should be easy to fix).
In sum, I would like it to parse the text from different lines in each file (webpage) and dump it into the excel file.
Any input would be greatly appreciated.
The python code:
import xml.etree.ElementTree as ET
import csv, codecs, os
from cStringIO import StringIO
# Note: you need to download and install this..
import unicodecsv
import lxml.html
# TODO: make into command line params (instead of constant)
CSV_FILE='output.csv'
HTML_PATH='/Users/C/data/Folder_NS'
f = open(CSV_FILE, 'wb')
w = unicodecsv.writer(f, encoding='utf-8', delimiter=';')
w.writerow(['file', 'category', 'about', 'title', 'subtitle', 'date', 'bodyarticle'])
# redundant declarations:
category=''
about=''
title=''
subtitle=''
date=''
bodyarticle=''
print "headers created"
allFiles = os.listdir(HTML_PATH)
#with open(CSV_FILE, 'wb') as csvfile:
print "all defined"
for file in allFiles:
    #print allFiles
    if '.html' in file:
        print "in html loop"
        tree = lxml.html.parse(HTML_PATH+"/"+file)
        print '===================='
        print 'Parsing file: '+file
        print '===================='
        for node in tree.iter():
            name = node.attrib.get('/html/body/table/tbody/tr/td/table/tbody/tr[3]/td/table/tbody/tr[2]/td[2]/table/tbody/tr[1]/td[1]/p/span')
            if category =='/html/body/center/table/tbody/tr/td/table/tbody/tr[3]/td/table/tbody/tr[2]/td[2]/table/tbody/tr[1]/td[1]/font':
                print 'Category:'
                category=node.text
f.close()
14 June 2015 (most recent change); I have just changed this section
for node in tree.iter():
    name = node.attrib.get('/html/body/table/tbody/tr/td/table/tbody/tr[3]/td/table/tbody/tr[2]/td[2]/table/tbody/tr[1]/td[1]/p/span')
    if category =='/html/body/center/table/tbody/tr/td/table/tbody/tr[3]/td/table/tbody/tr[2]/td[2]/table/tbody/tr[1]/td[1]/font':
        print 'Category:'
        category=node.text
to this:
for node in tree.iter():
    row = dict.fromkeys(cols)
    Category_name = tree.xpath('/html/body/table/tbody/tr/td/table/tbody/tr[3]/td/table/tbody/tr[2]/td[2]/table/tbody/tr[1]/td[1]/p/span')
    row['category'] = Category_name[0].text_content().encode('utf-8')
It still runs, but does not parse.
Try the following code:
from lxml import etree
import requests
from StringIO import StringIO
data = requests.get('http://www.usprwire.com/Detailed/Banking_Finance_Investment/Confused.com_reveals_that_Life_Insurance_is_more_than_a_form_of_future_protection_284764.shtml').content
parser = etree.HTMLParser()
root = etree.parse(StringIO(data), parser)
category = root.xpath('//table/td/font/text()')
print category[0]
It uses the requests library to download the html code of the page. You can choose whatever method fits your needs. The important part is the xpath, which searches for any <table> followed by <td> followed by <font> and returns a list with two elements. The first one contains the text and the second one is only whitespace.
Running it yields just the sentence you are looking for:
Banking, Finance & Investment: Confused.com reveals that Life Insurance is more than a form of future protection
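For context on why the loop in the question found nothing: node.attrib is a dictionary of an element's attributes, so calling attrib.get() with a path as the key always returns None. A path has to be passed to a search method instead. XPaths copied from a browser also tend to contain tbody steps that the browser added and that may be absent from the raw markup. The Python 3 sketch below shows both points with the standard library on a made-up, well-formed snippet; real-world HTML usually needs an HTML-aware parser such as the lxml one used in the answer above.

```python
from xml.etree import ElementTree

# Made-up, well-formed stand-in for one of the HTML pages.
page = """<html><body><table><tr><td>
<p><span class="s1">Business: Give Back to the Community</span></p>
</td></tr></table></body></html>"""

root = ElementTree.fromstring(page)

# attrib holds attributes such as class="s1"; a path is not a valid key.
for node in root.iter("span"):
    print(node.attrib.get("class"))       # attribute lookup works
    print(node.attrib.get("/html/body"))  # a path as key finds nothing

# Paths go to find()/findall(); ElementTree supports a limited XPath subset.
spans = root.findall("./body/table/tr/td/p/span")
texts = [span.text for span in spans]
print(texts)

# The same path with a browser-style tbody step matches nothing here,
# because the raw markup contains no <tbody> element.
with_tbody = root.findall("./body/table/tbody/tr/td/p/span")
print(with_tbody)
```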

Writing modified Beautiful Soup tree to file, while maintaining original XML formatting

We have an XML document that has a tag we wish to alter:
...<version>1.0</version>...
It's buried deep in the XML file, but we're successfully able to use Beautiful Soup to replace its contents with a command-line parameter.
The problem is that after modifying the tree, we need to write back to the file we read it from. But, we want to maintain the original formatting of the document. When I use:
fileForWriting = open(myXmlFile, 'w')
fileForWriting.write(soup.prettify())
The prettify() call breaks the formatting, and I end up with:
<version>
1.0
</version>
Is there any way to maintain the original formatting of the XML document, while replacing that single tag text?
Note: Using simply:
fileForWriting.write(str(soup))
Keeps the text and tags on the same line, but eliminates the indents and extra newlines that had been human-added for readability. Close, but no cigar.
By request, the entire script:
from BeautifulSoup import BeautifulSoup as bs
import sys
xmlFile = sys.argv[1:][0]
version = sys.argv[1:][1]
fileForReading = open(xmlFile, 'r')
xmlString = fileForReading.read()
fileForReading.close()
soup = bs(xmlString)
soup.findAll('version')[1].contents[0].replaceWith(version)
fileForWriting = open(xmlFile, 'w')
fileForWriting.write(str(soup))
fileForWriting.close()
The script is then run using:
python myscript.py someFile.xml 1.2
And if you use xml.etree.ElementTree, the tree.write(file) method replaces CRLF with LF, which also creates issues when importing the XML file into, for example, PyXB.
The solution I found is to use ElementTree only to find what I have to replace. Then I do source_XML = 'new value'.join(source_XML.split('what you need to replace')) and finally file.write(source_XML).
It's not nice, but it solves the issue. I do not care about the indentation, though, so I can't really say anything on that point. I would only use pprint.pprint() whenever I need to print it.
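The split/join approach above can be sketched as a plain text substitution that leaves every other character of the file untouched. This is an illustration for Python 3, not the answerer's exact code: the function name replace_tag_text, its parameters and the regular expression are assumptions, and a pattern like this only suits simple tags without attributes or nested markup.

```python
import re


def replace_tag_text(xml_text, tag, new_value, occurrence=0):
    """Replace the text of the Nth <tag>...</tag> without touching formatting."""
    pattern = re.compile(
        r"(<%s>)(.*?)(</%s>)" % (re.escape(tag), re.escape(tag)), re.S
    )
    matches = list(pattern.finditer(xml_text))
    if occurrence >= len(matches):
        raise ValueError("occurrence %d of <%s> not found" % (occurrence, tag))
    m = matches[occurrence]
    # Splice the new value between the original opening and closing tags.
    return xml_text[:m.start(2)] + new_value + xml_text[m.end(2):]


source = "<root>\r\n    <version>1.0</version>\r\n</root>\r\n"
print(repr(replace_tag_text(source, "version", "1.2")))
```

When reading and writing the real file in Python 3, opening it with open(path, newline='') turns off newline translation, so CRLF line endings are kept as they are.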
