I am trying to parse an XML file using BeautifulSoup. Consider a sample input XML file as follows:
<DOC>
<DOCNO>1</DOCNO>
....
</DOC>
<DOC>
<DOCNO>2</DOCNO>
....
</DOC>
...
This file contains 130 <DOC> tags. However, when I try to parse it using BeautifulSoup's findAll function, it retrieves a seemingly random number of tags (usually between 15 and 25) but never 130. The code I used is as follows:
from bs4 import BeautifulSoup
z = open("filename").read()
soup = BeautifulSoup(z, "lxml")
print len(soup.findAll('doc'))
#more code involving manipulation of results
Can anybody tell me what I am doing wrong? Thanks in advance!
You are telling BeautifulSoup to use the HTML parser provided by lxml. If you have an XML document, you should stick to the XML parser option:
soup = BeautifulSoup(z, 'xml')
otherwise the parser will attempt to 'repair' the XML to fit HTML rules. XML parsing in BeautifulSoup is also handled by the lxml library.
Note that XML is case sensitive, so you'll need to search for the DOC element now, not doc.
For XML documents it may be that the ElementTree API offered by lxml is more productive; it supports XPath queries for example, while BeautifulSoup does not.
However, from your sample it looks like there is no single top-level element; it is as if your document consists of a whole series of XML documents instead. This makes your input invalid, and a parser may parse only the first element as the top-level document and stop there.
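If that is the case, one workaround is to wrap the file's contents in a single artificial root element before parsing. A minimal sketch with lxml (assuming the file is otherwise well-formed XML; the root tag name is arbitrary):

from lxml import etree

# wrap the series of <DOC> documents in a single artificial root element
wrapped = "<root>%s</root>" % open("filename").read()
tree = etree.fromstring(wrapped)

# XPath sees the whole tree; this should now report all 130 documents
print(len(tree.xpath("//DOC")))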
I am using BeautifulSoup to extract information from HTML files. I would like to be able to capture the location of the information, that is, the offset within the HTML file of the tag that corresponds to a BS tag object.
Is there a way to do this?
I am currently using the lxml parser as it is the default.
If I'm reading your question correctly, you are parsing some html with BeautifulSoup and then using the soup to identify a tag. Once you have the tag, you are trying to find the index position of the tag within the original html string.
The problem with capturing the index position of a tag using BeautifulSoup is that the soup alters the structure of the html according to the chosen parser, so lxml's parsing might not be a character-for-character representation of the original, especially after finding a tag within the soup.
It's iffy whether this will consistently work, but you might try using a string's find method to locate the position of your tag's text contents, which should remain largely unchanged.
from bs4 import BeautifulSoup

# html is a string containing your html document
soup = BeautifulSoup(html, 'lxml')
# target is the tag you want to find
target = soup.find('p')
# now we locate the text of the target inside of the html document
offset = html.find(target.text)
This method will not start at the beginning of the tag, but should be able to locate the tag's contents within the html.
If you wanted to know the index of a tag in the body of your soup, that would be much more feasible.
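For instance, a rough sketch of that alternative (hypothetical: it searches the soup's own re-serialization, so the offset refers to the rendered soup rather than the original file):

# position of the tag's markup within the soup's own serialization
rendered = str(soup)
offset_in_soup = rendered.find(str(target))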
I am trying to use BeautifulSoup to parse an HTML file that consists of many individual documents downloaded as a batch from LexisNexis (a legal database).
My first task is to split the HTML file into its constituent documents. I thought this would be easy since the documents are surrounded by <DOC NUMBER=1>body of the 1st document</DOC> and so on.
However, this <DOC> tag is an XML tag, not an HTML tag (all other tags in the file are HTML). Because of this, the regular HTML parser does not make the tag available in the parse tree.
How can I build a parser in bs4 that will pick up this XML tag?
I enclose the relevant section of the HTML file:
<!-- Hide XML section from browser <DOC NUMBER=1> <DOCFULL> --> BODY <!-- Hide XML section from browser </DOCFULL> </DOC> -->
You can specify xml in bs4 when your BeautifulSoup object is instantiated:
xml_soup = BeautifulSoup(xml_object, 'xml')
This should take care of your issue. You can use the xml_soup object to parse the remaining html; however, I'd recommend instantiating another soup object specifically for html:
soup = BeautifulSoup(html_object, 'html.parser')
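As a quick illustration of the difference (a hedged sketch using a simplified, well-formed sample; in your real file the tags sit inside HTML comments, so you may need to pull that section out first):

from bs4 import BeautifulSoup

snippet = '<DOC NUMBER="1"><DOCFULL>body of the 1st document</DOCFULL></DOC>'

# the XML parser is case sensitive and keeps the unknown DOC tag intact
xml_soup = BeautifulSoup(snippet, 'xml')
print(len(xml_soup.find_all('DOC')))  # 1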
I have a file which contains the names of scientists in the following format:
<scientist_names>
<scientist>abc</scientist>
</scientist_names>
I want to use Python to strip out the names of the scientists from the above format. How should I do it?
I would like to use regular expressions but don't know how to use them... please help
DO NOT USE REGULAR EXPRESSIONS! (all the reasons are well explained [here])
Use an xml/html parser, take a look at BeautifulSoup.
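For example, a minimal sketch with bs4 (assuming the snippet above is in a string and lxml is installed to back the "xml" parser):

from bs4 import BeautifulSoup

text = "<scientist_names> <scientist>abc</scientist> </scientist_names>"
soup = BeautifulSoup(text, "xml")
# find_all returns every <scientist> element in the document
for scientist in soup.find_all("scientist"):
    print(scientist.text)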
This is XML and you should use an XML parser like lxml instead of regular expressions (because XML is not a regular language).
Here is an example:
from lxml import etree

text = """<scientist_names> <scientist>abc</scientist> </scientist_names>"""
tree = etree.fromstring(text)
for scientist in tree.xpath("//scientist"):
    print scientist.text
As noted, this appears to be xml. In that case, you should use an xml parser to parse this document; I recommend lxml ( http://lxml.de ).
Given your requirements, you may find it more convenient to use SAX-style parsing rather than DOM-style. SAX parsing simply involves registering handlers that fire when the parser encounters a particular tag, which works well as long as the meaning of a tag does not depend on context and you have more than one type of tag to process (which may not be the case here).
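For illustration, a rough sketch of that event-driven style using lxml's iterparse (a hedged example; it streams the input and fires an event for each closing scientist tag):

from lxml import etree
from io import BytesIO

text = b"<scientist_names><scientist>abc</scientist></scientist_names>"
# fire an "end" event for each <scientist> element as the input streams by
for event, elem in etree.iterparse(BytesIO(text), tag="scientist"):
    print(elem.text)
    elem.clear()  # discard the handled element to keep memory use flat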
In case your input document may be incorrectly formed, you may wish to use Beautiful Soup: http://www.crummy.com/software/BeautifulSoup/bs3/documentation.html#Parsing XML
Here is a simple example that should handle the xml tags for you:
#import library to do http requests:
import urllib2
#import easy to use xml parser called minidom:
from xml.dom.minidom import parseString
#all these imports are standard on most modern python implementations
#download the file if it's not on the same machine otherwise just use a path:
file = urllib2.urlopen('http://www.somedomain.com/somexmlfile.xml')
#convert to string:
data = file.read()
#close the file because we don't need it anymore:
file.close()
#parse the xml you downloaded
dom = parseString(data)
#retrieve the first xml tag (<tag>data</tag>) that the parser finds with name tagName,
#in your case <scientist>:
xmlTag = dom.getElementsByTagName('scientist')[0].toxml()
#strip off the tag (<tag>data</tag> ---> data):
xmlData=xmlTag.replace('<scientist>','').replace('</scientist>','')
#print out the xml tag and data in this format: <tag>data</tag>
print xmlTag
#just print the data
print xmlData
If you find anything unclear just let me know
I need to parse through html, but I do not want the Python parsing library to attempt to "fix" the html. Any suggestions on a tool or method to use (in Python)? In my situation, if the html is malformed, my script needs to end the processing. I tried BeautifulSoup, but it fixed things that I did not want it to fix. I'm creating a tool to parse template files and output another, converted template style.
The book Foundations of Python Network Programming has a detailed comparison of what it looks like to scrape the same web page with Beautiful Soup and with the lxml library; but, in general, you will find that lxml is faster, more effective, and has an API which adheres closely to a Python standard (the ElementTree API, which comes with the Python Standard Library). See this blog post by the inimitable Ian Bicking for an idea of why you should be looking at lxml instead of the old-fashioned Beautiful Soup library for parsing HTML:
http://blog.ianbicking.org/2008/12/10/lxml-an-underappreciated-web-scraping-library/
I believe BeautifulStoneSoup can do this if you pass in a list of self-closing tags.
The most common shortcoming of BeautifulStoneSoup is that it doesn't know about self-closing tags. HTML has a fixed set of self-closing tags, but with XML it depends on what the DTD says. You can tell BeautifulStoneSoup that certain tags are self-closing by passing in their names as the selfClosingTags argument to the constructor:
from BeautifulSoup import BeautifulStoneSoup
xml = "<tag>Text 1<selfclosing>Text 2"
print BeautifulStoneSoup(xml).prettify()
# <tag>
# Text 1
# <selfclosing>
# Text 2
# </selfclosing>
# </tag>
print BeautifulStoneSoup(xml, selfClosingTags=['selfclosing']).prettify()
# <tag>
# Text 1
# <selfclosing />
# Text 2
# </tag>
I have downloaded the web page into an html file. I am wondering what's the simplest way to get the content of that page. By content, I mean I need the strings that a browser would display.
To be clear:
Input:
<html><head><title>Page title</title></head>
<body><p id="firstpara" align="center">This is paragraph <b>one</b>.
<p id="secondpara" align="blah">This is paragraph <b>two</b>.
</html>
Output:
Page title This is paragraph one. This is paragraph two.
Putting it together:
from BeautifulSoup import BeautifulSoup
import re

def removeHtmlTags(page):
    p = re.compile(r'''<(?:"[^"]*"['"]*|'[^']*'['"]*|[^'">])+>''')
    return p.sub('', page)

def removeHtmlTags2(page):
    soup = BeautifulSoup(page)
    return ''.join(soup.findAll(text=True))
Parse the HTML with Beautiful Soup.
To get all the text, without the tags, try:
''.join(soup.findAll(text=True))
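In full (a small sketch assuming BS3, with the page source in a string named page; with bs4 the import path and method names differ slightly):

from BeautifulSoup import BeautifulSoup

soup = BeautifulSoup(page)
# collect every text node and join them, roughly matching the desired output
print(' '.join(soup.findAll(text=True)))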
Personally, I use lxml because it's a swiss-army knife...
from lxml import html
print html.parse('http://someurl.at.domain').xpath('//body')[0].text_content()
This tells lxml to retrieve the page, locate the <body> tag then extract and print all the text.
I do a lot of page parsing and a regex is the wrong solution most of the time, unless it's a one-time-only need. If the author of the page changes their HTML you run a good risk of your regex breaking. A parser is a lot more likely to continue working.
The big problem with a parser is learning how to access the sections of the document you are after, but there are a lot of XPath tools you can use inside your browser that simplify the task.
You want to look at Extracting data from HTML documents - Dive into Python, because it does (almost) exactly what you want.
The best modules for this task are lxml or html5lib; Beautiful Soup is IMHO not worth using anymore. And for recursively nested markup, regular expressions are definitely the wrong method.
If I am understanding your question correctly, this can simply be done by using the urlopen function of urllib. Just have a look at this function to open a URL and read the response, which will be the html code of that page.
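For example (a minimal sketch using the Python 2 urllib2 module, matching the other answers here; the URL is a placeholder):

import urllib2

# fetch the page and read the raw html into a string
response = urllib2.urlopen('http://www.example.com/')
html = response.read()
response.close()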
The quickest way to get a usable sample of what a browser would display is to remove any tags from the html and print the rest. This can, for example, be done with Python's re module.
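A rough sketch of that approach (hedged: a naive regex like this is fine for a quick sample but unreliable on real-world HTML, as the other answers note):

import re

# strip everything that looks like a tag; crude but quick
text = re.sub(r'<[^>]+>', '', html)
print(text)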