Reading 1000s of XML documents with BeautifulSoup - python

I'm trying to read a bunch of xml files and do stuff to them. The first thing I want to do is rename them based on a number that's inside the file.
You can see a sample of the data here (warning: this will initiate a download of a 108 MB zip file!). That's a huge XML file with thousands of smaller XML files inside it. I've broken those out into individual files. I want to rename the files based on a number inside each one (part of preprocessing). I have the following code:
from __future__ import print_function
from bs4 import BeautifulSoup  # To get everything
import os

def rename_xml_files(directory):
    xml_files = [xml_file for xml_file in os.listdir(directory)]
    for filename in xml_files:
        filename = filename.strip()
        full_filename = directory + "/" + filename
        print(full_filename)
        f = open(full_filename, "r")
        xml = f.read()
        soup = BeautifulSoup(xml)
        del xml
        del soup
        f.close()
If I comment out the "soup =" and "del" lines, it works perfectly. If I add the "soup = ..." line back in, it works for a moment and then eventually craps out - it just crashes the Python kernel. I'm using Enthought Canopy, but I've tried running it from the command line and it craps out there, too.
I thought, perhaps, it was not deallocating the space for the variable "soup" so I tried adding the "del" commands. Same problem.
Any thoughts on how to circumvent this? I'm not stuck on BS. If there's a better way of doing this, I would love it, but I need a little sample code.

Try using cElementTree.parse() from Python's standard xml library instead of BeautifulSoup. 'Soup is great for parsing normal web pages, but cElementTree is blazing fast.
Like this:
import xml.etree.cElementTree as cET

# ...

def rename_xml_files(directory):
    xml_files = [xml_file for xml_file in os.listdir(directory)]
    for filename in xml_files:
        filename = filename.strip()
        full_filename = directory + "/" + filename
        print(full_filename)
        parsed = cET.parse(full_filename)
        del parsed
If your XML is formatted correctly, this should parse it. If your machine is still unable to handle all that data in memory, you should look into streaming the XML.

I would not split that master file into many small files and then process them further; I would process it all in one go.
I would just use a streaming-API XML parser to parse the master file, get the name, and write out the sub-files once, with the correct names.
There is no need for BeautifulSoup, which is primarily designed to handle HTML and uses a document model instead of a streaming parser.
There is no need to build an entire DOM all at once just to get a single element, as you are doing now.
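For concreteness, here is a minimal sketch of that streaming approach with ElementTree's iterparse. The tag names 'record' and 'doc-number' are placeholders I made up, since the structure of the master file isn't shown here; substitute whatever element actually wraps each sub-document and whatever element holds the identifying number.
import os
import xml.etree.cElementTree as cET

def split_master(master_path, out_dir):
    # Stream over the master file instead of loading it all into memory.
    for event, elem in cET.iterparse(master_path, events=('end',)):
        if elem.tag == 'record':                  # placeholder wrapper tag
            number = elem.findtext('doc-number')  # placeholder id element
            with open(os.path.join(out_dir, number + '.xml'), 'w') as out:
                out.write(cET.tostring(elem))
            elem.clear()                          # free the element as we go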

Related

How to open/save a docx file to be edited as XML and save the result as docx after editing, using Python

I have a docx file whose paragraphs I need to edit (the paragraphs might contain equations). I tried to do this with python-docx, but it was not successful, since editing the text of each paragraph and replacing it with the edited new paragraph requires calling p.add_paragraphs(editText(paragraph.text)), which ignores and omits any mathematical equation.
While searching for a way to achieve this, I found that it is possible through the XML, by finding <w:t> tags and editing their content like this:
tree = ET.parse(filename)
root = tree.getroot()
for par in root.findall('w:p'):
    if par.find('w:r'):
        myText = par.find('w:r').find('w:t')
        myText.text = editText(myText.text)
Then I must save the result as docx.
My question is: what should the format of filename be? Should it be a document.xml file? If so, how can I get that from my original document.docx file? And one more question: how can I save the result as a .docx file again?
To save the docx as XML, I tried document.save('Document2.xml'), but the content of the result was not correct.
Could you give me some advice on how to do this?
Not experienced with this at all, but perhaps this is what you were looking for?
https://virantha.com/2013/08/16/reading-and-writing-microsoft-word-docx-files-with-python/
From the article:
import os
import shutil
import tempfile
import zipfile
from lxml import etree

class DocsWriter:
    def __init__(self, docx_file):
        self.zipfile = zipfile.ZipFile(docx_file)

    def _write_and_close_docx(self, xml_content, output_filename):
        """Create a temp directory, expand the original docx zip,
        write the modified xml to word/document.xml,
        and zip it up as the new docx.
        """
        tmp_dir = tempfile.mkdtemp()
        self.zipfile.extractall(tmp_dir)
        with open(os.path.join(tmp_dir, 'word/document.xml'), 'w') as f:
            xmlstr = etree.tostring(xml_content, pretty_print=True)
            f.write(xmlstr)
        # Get a list of all the files in the original docx zipfile
        filenames = self.zipfile.namelist()
        # Now, create the new zip file and add all the files into the archive
        zip_copy_filename = output_filename
        with zipfile.ZipFile(zip_copy_filename, "w") as docx:
            for filename in filenames:
                docx.write(os.path.join(tmp_dir, filename), filename)
        # Clean up the temp dir
        shutil.rmtree(tmp_dir)
From what I can tell, this code block writes an xml document as .docx. Refer to the article for more context.
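To connect that back to the question, here is a rough sketch (not tested against a real document) of how one might read word/document.xml straight out of the .docx, run editText() over the <w:t> runs, and then hand the modified tree to a writer like the one above. editText() below is only a stand-in for the questioner's own function; the namespace URI is the standard WordprocessingML main namespace.
import zipfile
from lxml import etree

W_NS = 'http://schemas.openxmlformats.org/wordprocessingml/2006/main'

def editText(text):
    # Stand-in for the questioner's own editing function.
    return text

# A .docx is just a zip archive; read the main document part out of it.
with zipfile.ZipFile('document.docx') as zf:
    xml_content = etree.fromstring(zf.read('word/document.xml'))

# Edit every text run; equation content lives in m: elements, not w:t,
# so it is left untouched.
for t in xml_content.iter('{%s}t' % W_NS):
    if t.text:
        t.text = editText(t.text)

# Write the result back out as a .docx, e.g. with the helper shown above:
# DocsWriter('document.docx')._write_and_close_docx(xml_content, 'edited.docx')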
Python is not the best tool for this. Use VBA if you need to automate something in a Word document, or multiple Word documents. I can't tell what you are even trying to do here, but let's start at the beginning, with something simple. If, for instance, you want to loop through all paragraphs in your Word document, and select only the equations, you can run the code below to do just that.
Sub SelectAllEquations()
    Dim xMath As OMath
    Dim I As Integer
    With ActiveDocument
        .DeleteAllEditableRanges wdEditorEveryone
        For I = 1 To .OMaths.Count
            Set xMath = .OMaths.Item(I)
            xMath.Range.Paragraphs(1).Range.Editors.Add wdEditorEveryone
        Next
        .SelectAllEditableRanges wdEditorEveryone
        .DeleteAllEditableRanges wdEditorEveryone
    End With
End Sub
Again, I don't know what your end game is, but I think it's worthwhile to start with something like that, and build on your foundation.

Trouble parsing html files (to csv) using ElementTree xpath in python

I am trying to parse a few thousand html files and dump the variables into a csv file (excel spreadsheet). I've come up against several roadblocks--the first of which was (thankfully) solved here, a few days ago. The (hopefully) final roadblock is this: I cannot get it to parse the file properly using xpath. Below is a brief explanation, the python code and an example of the html code.
The trouble starts here:
for node in tree.iter():
    name = node.attrib.get('/html/body/table/tbody/tr/td/table/tbody/tr[3]/td/table/tbody/tr[2]/td[2]/table/tbody/tr[1]/td[1]/p/span')
    if category == '/html/body/center/table/tbody/tr/td/table/tbody/tr[3]/td/table/tbody/tr[2]/td[2]/table/tbody/tr[1]/td[1]/font':
        category = node.text
It runs, but does not parse. I do not get any traceback errors.
I think I am misunderstanding the logic of parsing with ElementTree.
There are several headers that are the same--it is therefore difficult to find a unique id/header. Here is an example of the html:
<span class="s1">Business: Give Back to the Community and Save Money
on Equipment, Technology, Promotional Products, and Market<span
class="Apple-converted-space"> </span></span>
For which the xpath is:
/html/body/table/tbody/tr/td/table/tbody/tr[3]/td/table/tbody/tr[2]/td[2]
/table/tbody/tr[1]/td[1]/p/span
I would like to scrape the text from this span (among others) and put it in the excel spreadsheet.
You can see an example of a similar page HERE
At any rate, because many spans/headers are not uniquely identified, I think I should use xpath. However, I have yet to figure out how to successfully use xpath commands with ElementTree. In searching the documentation, the answer to this question (as well as the logic) eludes me. I have read up on http://lxml.de/parsing.html as well as on this site and have yet to find something that works.
So far, the code iterates through all the files (in dropbox) nicely. It also creates the csv file and creates the headers (though not in separate columns, only as one line separated by semicolons -- but that should be easy to fix).
In sum, I would like it to parse the text from different lines in each file (webpage) and dump it into the excel file.
Any input would be greatly appreciated.
The python code:
import xml.etree.ElementTree as ET
import csv, codecs, os
from cStringIO import StringIO
# Note: you need to download and install this..
import unicodecsv
import lxml.html

# TODO: make into command line params (instead of constant)
CSV_FILE = 'output.csv'
HTML_PATH = '/Users/C/data/Folder_NS'

f = open(CSV_FILE, 'wb')
w = unicodecsv.writer(f, encoding='utf-8', delimiter=';')
w.writerow(['file', 'category', 'about', 'title', 'subtitle', 'date', 'bodyarticle'])

# redundant declarations:
category = ''
about = ''
title = ''
subtitle = ''
date = ''
bodyarticle = ''
print "headers created"

allFiles = os.listdir(HTML_PATH)
#with open(CSV_FILE, 'wb') as csvfile:
print "all defined"

for file in allFiles:
    #print allFiles
    if '.html' in file:
        print "in html loop"
        tree = lxml.html.parse(HTML_PATH + "/" + file)
        print '===================='
        print 'Parsing file: ' + file
        print '===================='
        for node in tree.iter():
            name = node.attrib.get('/html/body/table/tbody/tr/td/table/tbody/tr[3]/td/table/tbody/tr[2]/td[2]/table/tbody/tr[1]/td[1]/p/span')
            if category == '/html/body/center/table/tbody/tr/td/table/tbody/tr[3]/td/table/tbody/tr[2]/td[2]/table/tbody/tr[1]/td[1]/font':
                print 'Category:'
                category = node.text

f.close()
14 June 2015 (most recent change): I have just changed this section
for node in tree.iter():
    name = node.attrib.get('/html/body/table/tbody/tr/td/table/tbody/tr[3]/td/table/tbody/tr[2]/td[2]/table/tbody/tr[1]/td[1]/p/span')
    if category == '/html/body/center/table/tbody/tr/td/table/tbody/tr[3]/td/table/tbody/tr[2]/td[2]/table/tbody/tr[1]/td[1]/font':
        print 'Category:'
        category = node.text
to this:
for node in tree.iter():
    row = dict.fromkeys(cols)
    Category_name = tree.xpath('/html/body/table/tbody/tr/td/table/tbody/tr[3]/td/table/tbody/tr[2]/td[2]/table/tbody/tr[1]/td[1]/p/span')
    row['category'] = Category_name[0].text_content().encode('utf-8')
It still runs, but does not parse.
Try the following code:
from lxml import etree
import requests
from StringIO import StringIO
data = requests.get('http://www.usprwire.com/Detailed/Banking_Finance_Investment/Confused.com_reveals_that_Life_Insurance_is_more_than_a_form_of_future_protection_284764.shtml').content
parser = etree.HTMLParser()
root = etree.parse(StringIO(data), parser)
category = root.xpath('//table/td/font/text()')
print category[0]
It uses the requests library to download the HTML code of the page; you can choose whatever method fits your needs. The important part is the xpath, which searches for any <table> followed by a <td> followed by a <font>, and returns a list with two elements. The second element is blank characters and the first one contains the text.
Running it yields just the sentence you are looking for:
Banking, Finance & Investment: Confused.com reveals that Life Insurance is more than a form of future protection
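If you'd rather keep working from the files already on disk instead of downloading the pages again, roughly the same idea can be folded into the original loop. This is only a sketch, reusing the question's HTML_PATH and output.csv names and the answer's xpath; the remaining columns would be filled in the same way once their xpaths are known.
import os
from lxml import etree
import unicodecsv

HTML_PATH = '/Users/C/data/Folder_NS'
parser = etree.HTMLParser()

with open('output.csv', 'wb') as f:
    w = unicodecsv.writer(f, encoding='utf-8', delimiter=';')
    w.writerow(['file', 'category'])
    for name in os.listdir(HTML_PATH):
        if not name.endswith('.html'):
            continue
        root = etree.parse(os.path.join(HTML_PATH, name), parser)
        # Same xpath as above: any <table> -> <td> -> <font> text.
        hits = root.xpath('//table/td/font/text()')
        category = hits[0].strip() if hits else ''
        w.writerow([name, category])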

Using Python to comment out XML

I'm trying (and failing) to comment out the HornetQ configuration in a JBoss 6.2 domain.xml file. Instead of inserting a comment around the stanza I want to remove, I'm managing to delete everything remaining in the file.
The code I have so far is
from xml.dom import minidom
import os, time, shutil

domConf = '/home/test/JBoss/jboss-eap-6.2/domain/configuration/domain.xml'
commentSub = 'urn:jboss:domain:messaging:1.4'
now = str(int(time.time()))
bkup = domConf + now

shutil.copy2(domConf, bkup)

xmldoc = minidom.parse(domConf)
itemlist = xmldoc.getElementsByTagName('subsystem')

for s in itemlist:
    if commentSub in s.attributes['xmlns'].value:
        s.parentNode.insertBefore(xmldoc.createComment(s.toxml()), s)

file = open(domConf, "wb")
xmldoc.writexml(file)
file.write('\n')
file.close()
The configuration I'm trying to comment out is:
<subsystem xmlns="urn:jboss:domain:messaging:1.4">
    <hornetq-server>
        <persistence-enabled>true</persistence-enabled>
        <journal-type>NIO</journal-type>
        <journal-min-files>2</journal-min-files>
        <connectors>
        [....]
            </pooled-connection-factory>
        </jms-connection-factories>
    </hornetq-server>
</subsystem>
Thanks!
The problem you were running into was that the sections you are trying to comment out already contain XML comments. Nested comments are not allowed in XML. (See Nested comments in XML? for more info.)
I think what you need to do is this:
from xml.dom import minidom
import os, time, shutil

domConf = '/home/test/JBoss/jboss-eap-6.2/domain/configuration/domain.xml'
resultFile = 'result.xml'
commentSub = 'urn:jboss:domain:messaging:1.4'
now = str(int(time.time()))
bkup = domConf + now

shutil.copy2(domConf, bkup)

xmldoc = minidom.parse(domConf)
itemlist = xmldoc.getElementsByTagName('subsystem')

for s in itemlist:
    if commentSub in s.attributes['xmlns'].value:
        commentText = s.toxml()
        commentText = commentText.replace('--', '- -')
        s.parentNode.insertBefore(xmldoc.createComment(commentText), s)
        s.parentNode.removeChild(s)

file = open(resultFile, "wb")
xmldoc.writexml(file)
file.write('\n')
file.close()

shutil.copy2(resultFile, domConf)
This finds the subsystem element just as your code does, but before inserting the comment it changes any nested XML comments so they are no longer comments, by replacing '--' with '- -'. (Note this will probably break the XML file structure if you uncomment that section; you'll have to reverse the process if you want it to parse again.) After inserting the comment, the script deletes the original node. Then it writes everything to a temporary file, and uses shutil to copy it back over the original.
I tested this on my system, using the file you posted to the pastebin in the comment below, and it works.
Note that it's kind of a quick and dirty hack, because the script replaces '--' with '- -' everywhere in that section; if there is other text in an XML node that contains '--', it too will get replaced...
The right way to do this would probably be to use lxml's ElementTree implementation, use XPath to select only the comments within the section, and either delete or transform them appropriately, so you don't mess up non-commented text. But that's probably beyond the scope of what you asked. (Python's built-in ElementTree doesn't have a complete XPath implementation and probably can't be used to select comments.)
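For what it's worth, here is a rough sketch of that lxml route, assuming lxml is installed and that the messaging subsystem really is in the urn:jboss:domain:messaging:1.4 namespace. It serializes the element, defuses any '--' inside it the same way as above, and swaps the element for a single comment node.
from lxml import etree

MESSAGING_NS = 'urn:jboss:domain:messaging:1.4'
domConf = '/home/test/JBoss/jboss-eap-6.2/domain/configuration/domain.xml'

tree = etree.parse(domConf)
# Collect matches first so we are not mutating the tree while iterating it.
for el in list(tree.iter('{%s}subsystem' % MESSAGING_NS)):
    # '--' is not allowed inside an XML comment, so defuse nested comments first.
    text = etree.tostring(el).decode('utf-8').replace('--', '- -')
    comment = etree.Comment(text)
    comment.tail = el.tail  # keep the trailing newline/indentation
    el.getparent().replace(el, comment)

tree.write(domConf, xml_declaration=True, encoding='utf-8')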

Writing modified Beautiful Soup tree to file, while maintaining original XML formatting

We have an XML document that has a tag we wish to alter:
...<version>1.0</version>...
It's buried deep in the XML file, but we're successfully able to use Beautiful Soup to replace its contents with a command-line parameter.
The problem is that after modifying the tree, we need to write back to the file we read it from. But, we want to maintain the original formatting of the document. When I use:
fileForWriting = open(myXmlFile, 'w')
fileForWriting.write(soup.prettify())
The prettify() call breaks the formatting, and I end up with:
<version>
1.0
</version>
Is there any way to maintain the original formatting of the XML document, while replacing that single tag text?
Note: Using simply:
fileForWriting.write(str(soup))
Keeps the text and tags on the same line, but eliminates the indents and extra newlines that had been human-added for readability. Close, but no cigar.
By request, the entire script:
from BeautifulSoup import BeautifulSoup as bs
import sys
xmlFile = sys.argv[1:][0]
version = sys.argv[1:][1]
fileForReading = open(xmlFile, 'r')
xmlString = fileForReading.read()
fileForReading.close()
soup = bs(xmlString)
soup.findAll('version')[1].contents[0].replaceWith(version)
fileForWriting = open(xmlFile, 'w')
fileForWriting.write(str(soup))
fileForWriting.close()
The script is then run using:
python myscript.py someFile.xml 1.2
And if you use xml.etree.ElementTree, the tree.write(file) method replaces CRLF with LF only, which also creates issues when trying to import the XML file into, e.g., PyXB.
The solution I found is to use ElementTree just to find what I have to replace. Then I do source_XML = 'new value'.join(source_XML.split('what you need to replace')), and finally file.write(source_XML).
It's not nice, but it solves the issue. However, I don't care about the indentation, so on that I can't really say; I would just use pprint.pprint() whenever I need to print it.
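If the goal is just to swap the text of one known tag while leaving every byte of whitespace alone, a plain text substitution along those lines can be enough. A minimal sketch, assuming (as in the original script) that the second <version> element in the file is the one to change and that the script is invoked the same way, python patch_version.py someFile.xml 1.2:
import re
import sys

xml_path, new_version = sys.argv[1], sys.argv[2]

with open(xml_path, 'r') as fh:
    source_xml = fh.read()

# Swap only the text of the second <version> element (index 1, matching
# soup.findAll('version')[1] above), leaving all other formatting untouched.
matches = list(re.finditer(r'<version>(.*?)</version>', source_xml))
target = matches[1]
patched = source_xml[:target.start(1)] + new_version + source_xml[target.end(1):]

with open(xml_path, 'w') as fh:
    fh.write(patched)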

Scraping Multiple html files to CSV

I am trying to scrape rows off of over 1200 .htm files that are on my hard drive. On my computer they are here 'file:///home/phi/Data/NHL/pl07-08/PL020001.HTM'. These .htm files are sequential from *20001.htm until *21230.htm. My plan is to eventually toss my data in MySQL or SQLite via a spreadsheet app or just straight in if I can get a clean .csv file out of this process.
This is my first attempt at code (Python), scraping, and I just installed Ubuntu 9.04 on my crappy Pentium IV. Needless to say, I am a newb and have hit some roadblocks.
How do I get mechanize to go through all the files in the directory in order? Can mechanize even do this? Can mechanize/Python/BeautifulSoup read a 'file:///' style url, or is there another way to point it to /home/phi/Data/NHL/pl07-08/PL020001.HTM? Is it smart to do this in 100 or 250 file increments, or just send all 1230?
I just need rows that start with this "<tr class="evenColor">" and end with this "</tr>". Ideally I only want the rows that contain "SHOT"|"MISS"|"GOAL" within them, but I want the whole row (every column). Note that "GOAL" is in bold, so do I have to specify this? There are 3 tables per htm file.
Also I would like the name of the parent file (pl020001.htm) to be included in the rows I scrape so I can id them in their own column in the final database. I don't even know where to begin for that. This is what I have so far:
#!/usr/bin/python
from BeautifulSoup import BeautifulSoup
import re
from mechanize import Browser

mech = Browser()
url = "file:///home/phi/Data/NHL/pl07-08/PL020001.HTM"
##but how do I do multiple urls/files? PL02*.HTM?
page = mech.open(url)
html = page.read()
soup = BeautifulSoup(html)

##this confuses me and seems redundant
pl = open("input_file.html", "r")
chances = open("chancesforsql.csv", "w")

table = soup.find("table", border=0)
for row in table.findAll("tr", {"class": "evenColor"}):
    #should I do this instead of before?
    outfile = open("shooting.csv", "w")
    ##how do I end it?
Should I be using IDLE or something like it? just Terminal in Ubuntu 9.04?
You won't need mechanize. Since I do not exactly know the HTML content, I'd try to see what matches, first. Like this:
import glob
from BeautifulSoup import BeautifulSoup

for filename in glob.glob('/home/phi/Data/*.htm'):
    soup = BeautifulSoup(open(filename, "r").read())  # assuming some HTML
    for a_tr in soup.findAll("tr", attrs={"class": "evenColor"}):
        print a_tr
Then pick the stuff you want and write it to stdout with commas (and redirect it with > to a file), or write the CSV via Python, as in the sketch below.
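A rough sketch of that last step, building on the snippet above: keep only the rows that mention SHOT, MISS, or GOAL, and write each one to a CSV with the source file name as the first column. The events.csv name and the column layout are my own choices, not anything fixed by the question.
import csv
import glob
import os
from BeautifulSoup import BeautifulSoup

with open('events.csv', 'wb') as out:
    writer = csv.writer(out)
    for filename in glob.glob('/home/phi/Data/NHL/pl07-08/*.HTM'):
        soup = BeautifulSoup(open(filename, 'r').read())
        for a_tr in soup.findAll('tr', attrs={'class': 'evenColor'}):
            # Flatten each cell to plain text (this also drops the bold around GOAL).
            cells = [''.join(td.findAll(text=True)).strip().encode('utf-8')
                     for td in a_tr.findAll('td')]
            if any(event in ' '.join(cells) for event in ('SHOT', 'MISS', 'GOAL')):
                # Source file name goes in the first column so rows can be
                # traced back to their parent file later.
                writer.writerow([os.path.basename(filename)] + cells)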
MYYN's answer looks like a great start to me. One thing I'd point out that I've had luck with is:
import glob

for file_name in glob.glob('/home/phi/Data/*.htm'):
    #read the file and then parse with BeautifulSoup
I've found both the os and glob imports to be really useful for running through files in a directory.
Also, once you're using a for loop in this way, you have the file_name which you can modify for use in the output file, so that the output filenames will match the input filenames.
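A small sketch of that point, with the output directory /home/phi/Data/out used purely as an example: derive each output name from the input name so PL020001.HTM becomes PL020001.csv.
import os

def output_name_for(input_path, out_dir):
    # PL020001.HTM -> <out_dir>/PL020001.csv
    base = os.path.splitext(os.path.basename(input_path))[0]
    return os.path.join(out_dir, base + '.csv')

# e.g. output_name_for('/home/phi/Data/NHL/pl07-08/PL020001.HTM', '/home/phi/Data/out')
# -> '/home/phi/Data/out/PL020001.csv'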
