Need python lxml syntax help for parsing html

I am brand new to python, and I need some help with the syntax for finding and iterating through html tags using lxml. Here are the use-cases I am dealing with:
HTML file is fairly well formed (but not perfect). Has multiple tables on screen, one containing a set of search results, and one each for a header and footer. Each result row contains a link for the search result detail.
I need to find the middle table with the search result rows (this one I was able to figure out):
self.mySearchTables = self.mySearchTree.findall(".//table")
self.myResultRows = self.mySearchTables[1].findall(".//tr")
I need to find the links contained in this table (this is where I'm getting stuck):
for searchRow in self.myResultRows:
    searchLink = patentRow.findall(".//a")
It doesn't seem to actually locate the link elements.
I need the plain text of the link. I imagine it would be something like searchLink.text if I actually got the link elements in the first place.
Finally, in the actual API reference for lxml, I wasn't able to find information on the find and the findall calls. I gleaned these from bits of code I found on google. Am I missing something about how to effectively find and iterate over HTML tags using lxml?

Okay, first, regarding parsing the HTML: if you follow the recommendation of zweiterlinde and S.Lott, at least use the version of BeautifulSoup included with lxml. That way you will also reap the benefit of a nice XPath or CSS selector interface.
However, I personally prefer Ian Bicking's HTML parser included in lxml.
Secondly, .find() and .findall() come from lxml trying to be compatible with ElementTree, and those two methods are described in XPath Support in ElementTree.
Those two functions are fairly easy to use, but they support only a very limited subset of XPath. I recommend using either the full lxml xpath() method or, if you are already familiar with CSS, the cssselect() method.
Here are some examples, with an HTML string parsed like this:
from lxml.html import fromstring
mySearchTree = fromstring(your_input_string)
Using the cssselect() method, your program would roughly look something like this:
# Find all 'a' elements inside 'tr' table rows with a css selector
for a in mySearchTree.cssselect('tr a'):
    print('found "%s" link to href "%s"' % (a.text, a.get('href')))
The equivalent using the xpath() method would be:
# Find all 'a' elements anywhere inside 'tr' table rows with xpath
for a in mySearchTree.xpath('.//tr//a'):
    print('found "%s" link to href "%s"' % (a.text, a.get('href')))
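One caveat: in lxml, a.text returns only the text that appears before the element's first child, so a link like <a>click <b>here</b></a> yields just "click ". If your result links may contain nested markup, a.text_content() returns all of the text inside the element.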

Is there a reason you're not using Beautiful Soup for this project? It will make dealing with imperfectly formed documents much easier.

Related

bs4 BeautifulSoup - can't find what looks like custom tag to save my life

I'm admittedly beginner to intermediate with Python and novice to BeautifulSoup/web-scraping. However, I have successfully built a couple of scrapers. Normal tags = no problem (e.g., div, a, li, etc)
However, I can't figure out how to reference this tag with .select or .find or attrs="" or anything:
..........
<react type="sad" msgid="25314120" num="2"
..........
I ultimately want what looks like the "num" attribute from whatever this ghastly thing is ... a "react" tag (though I don't think that's a thing?)?
.find() works the same way here as it does for other tags such as div, p, and a. So we search for the 'react' tag:
react_tag = soup.find('react')
Then access the num attribute like so:
num_value = react_tag['num']
Should print out:
2
As per the bs4 documentation, .find('tag') returns the first matching tag and .find_all('tag') returns a list of all matching tags in the html.
In your case, if there are multiple react tags, use this:
for reactTag in soup.find_all('react'):
    print(reactTag.get('num'))
To get only the first tag, use this:
print(soup.find('react').get('num'))
The user "s n" was spot on! These are dynamically created javascript which I didn't know anything about, but was pretty easy to figure out. Using the SeleniumLibrary in Python and a "headless" WebChromeDriver together, you can use Selenium selectors like Xpath and many others to find these tags.

BeautifulSoup, getting more returns than expected with regex

Using BeautifulSoup, I have the following line:
dimensions = SOUP.select(".specs__title > h4", text=re.compile(r'Dimensions'))
However, it's returning more than just the tags that have a text of 'Dimensions' as shown in these results:
[<h4>Dimensions</h4>, <h4>Details</h4>, <h4>Warranty / Certifications</h4>]
Am I using the regex incorrectly with the way SOUP works?
The select interface doesn't have a text keyword. Before we go further, the following assumes you are using BeautifulSoup 4.7+.
If you'd like to filter by text, you might be able to do something like this:
dimensions = SOUP.select(".specs__title > h4:contains(Dimensions)")
More information on the :contains() pseudo-class implementation is available here: https://facelessuser.github.io/soupsieve/selectors/#:contains.
EDIT: To clarify, there is no way to incorporate regex directly into a select call currently. You would have to filter the elements after the fact to use regex. In the future there may be a way to use regex via some custom pseudo-class, but currently there is no such feature available in Soup Sieve (Beautiful Soup's select implementation in 4.7+).
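For example, a minimal sketch of that filter-after-the-fact approach, reusing the SOUP object from the question:
import re

pattern = re.compile(r'Dimensions')
dimensions = [h4 for h4 in SOUP.select(".specs__title > h4")
              if pattern.search(h4.get_text())]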

Python Regular Expressions - extract every table cell content [duplicate]

This question was closed as a possible duplicate of:
RegEx match open tags except XHTML self-contained tags
If I have a string that looks something like...
"<tr><td>123</td><td>234</td>...<td>697</td></tr>"
Basically a table row with n cells.
What's the easiest way in python to get the values of each cell? That is, I just want the values "123", "234", "697" stored in a list or array or whatever is easiest.
I've tried to use regular expressions. When I use
re.match
I am not able to get it to find anything. If I try with
re.search
I can only get the first cell, but I want all the cells. If I can't do this with n cells, how would you do it with a fixed number of cells?
If that markup is part of a larger set of markup, you should prefer a tool with an HTML parser.
One such tool is BeautifulSoup.
Here's one way to find what you need using that tool:
>>> markup = '''"<tr><td>123</td><td>234</td>...<td>697</td></tr>"'''
>>> from bs4 import BeautifulSoup as bs
>>> soup = bs(markup)
>>> for i in soup.find_all('td'):
... print(i.text)
Result:
123
234
697
Don't do this. Just use a proper HTML parser, and use something like xpath to get the elements you want.
A lot of people like lxml. For this task, you will probably want to use the BeautifulSoup backend, or use BeautifulSoup directly, because this is presumably not markup from a source known to generate well-formed, valid documents.
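A minimal sketch of that approach with lxml (wrapping the fragment in a <table> so the HTML parser keeps the row structure):
from lxml import html

fragment = "<tr><td>123</td><td>234</td><td>697</td></tr>"
table = html.fromstring("<table>%s</table>" % fragment)
cells = [td.text_content() for td in table.xpath(".//td")]
print(cells)  # ['123', '234', '697']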
When using lxml, an element tree gets created. Each element in the element tree holds information about a tag.
from lxml import etree

root = etree.XML("<root><a x='123'>aText<b/><c/><b/></a></root>")
elements = root.findall(".//a")   # every 'a' element anywhere under root
tag = elements[0].tag             # 'a'
attr = elements[0].attrib         # {'x': '123'}

How to get the content of a Html page in Python

I have downloaded the web page into an html file. I am wondering what's the simplest way to get the content of that page. By content, I mean I need the strings that a browser would display.
To be clear:
Input:
<html><head><title>Page title</title></head>
<body><p id="firstpara" align="center">This is paragraph <b>one</b>.
<p id="secondpara" align="blah">This is paragraph <b>two</b>.
</html>
Output:
Page title This is paragraph one. This is paragraph two.
Putting it together:
from BeautifulSoup import BeautifulSoup
import re

def removeHtmlTags(page):
    # Strip anything that looks like a tag with a regex (fragile on malformed markup)
    p = re.compile(r'''<(?:"[^"]*"['"]*|'[^']*'['"]*|[^'">])+>''')
    return p.sub('', page)

def removeHtmlTags2(page):
    # Parse the page and join all of its text nodes
    soup = BeautifulSoup(page)
    return ''.join(soup.findAll(text=True))
Parse the HTML with Beautiful Soup.
To get all the text, without the tags, try:
''.join(soup.findAll(text=True))
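In current versions of Beautiful Soup (bs4), much the same result is available in a single call via get_text():
from bs4 import BeautifulSoup
print(BeautifulSoup(page, 'html.parser').get_text(' ', strip=True))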
Personally, I use lxml because it's a swiss-army knife...
from lxml import html
print(html.parse('http://someurl.at.domain').xpath('//body')[0].text_content())
This tells lxml to retrieve the page, locate the <body> tag then extract and print all the text.
I do a lot of page parsing and a regex is the wrong solution most of the time, unless it's a one-time-only need. If the author of the page changes their HTML you run a good risk of your regex breaking. A parser is a lot more likely to continue working.
The big problem with a parser is learning how to access the sections of the document you are after, but there are a lot of XPath tools you can use inside your browser that simplify the task.
You want to look at Extracting data from HTML documents in Dive Into Python, because it does (almost) exactly what you want.
The best modules for this task are lxml or html5lib; Beautiful Soup is, imho, not worth using anymore. And for recursively nested markup, regular expressions are definitely the wrong method.
If I am understanding your question correctly, this can simply be done using the urlopen function of urllib. Just have a look at that function to open a url and read the response, which will be the html code of that page.
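A minimal sketch of that (using the Python 3 location of urlopen; in Python 2 it lived at urllib.urlopen):
from urllib.request import urlopen

html = urlopen('http://example.com/').read().decode('utf-8')  # placeholder URL
print(html)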
The quickest way to get a usable sample of what a browser would display is to remove any tags from the html and print the rest. This can, for example, be done using python's re.
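A one-line sketch of that idea (essentially the removeHtmlTags function shown above, with the usual caveat that a regex will mangle unusual markup):
import re

text = re.sub(r'<[^>]+>', ' ', html)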

Getting the value of href attributes in all <a> tags in an html file with Python

I'm building an app in python, and I need to get the URLs of all links in one webpage. I already have a function that uses urllib to download the html file from the web and transform it to a list of strings with readlines().
Currently I have this code that uses regex (I'm not very good at it) to search for links in every line:
for line in lines:
    result = re.match('/href="(.*)"/iU', line)
    print(result)
This is not working, as it only prints "None" for every line in the file, but I'm sure there are at least 3 links in the file I'm opening.
Can someone give me a hint on this?
Thanks in advance
Beautiful Soup can do this almost trivially:
from BeautifulSoup import BeautifulSoup as soup
html = soup('<body>qweasd</body>')
print([tag.attrMap['href'] for tag in html.findAll('a', {'href': True})])
Another alternative to BeautifulSoup is lxml (http://lxml.de/):
import lxml.html
links = lxml.html.parse("http://stackoverflow.com/").xpath("//a/@href")
for link in links:
    print(link)
There's an HTML parser that comes standard in Python. Check out htmllib (Python 2 only; in Python 3 its role is filled by html.parser).
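For what it's worth, a minimal sketch of the same link-collecting idea with Python 3's built-in html.parser:
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    # Collect the href attribute of every <a> start tag encountered
    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self.urls.extend(v for k, v in attrs if k == 'href')

collector = LinkCollector()
collector.feed('<a href="one.html">1</a> <a href="two.html">2</a>')
print(collector.urls)  # ['one.html', 'two.html']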
As previously mentioned: regex does not have the power to parse HTML. Do not use regex for parsing HTML. Do not pass Go. Do not collect £200.
Use an HTML parser.
But for completeness, the primary problem is:
re.match ('/href="(.*)"/iU', line)
You don't use the “/.../flags” syntax for decorating regexes in Python. Instead, put the flags in a separate argument. Note also that re.match only matches at the start of the string, so for an href that can appear anywhere in the line you want re.search:
re.search('href="(.*)"', line, re.I|re.U)
Another problem is the greedy ‘.*’ pattern. If you have two hrefs in a line, it'll happily suck up all the content between the opening " of the first match and the closing " of the second match. You can use the non-greedy ‘.*?’ or, more simply, ‘[^"]*’ to only match up to the first closing quote.
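A quick demonstration of the difference, on a hypothetical one-line input:
import re

line = '<a href="one.html">1</a> <a href="two.html">2</a>'
print(re.findall(r'href="(.*)"', line))     # greedy: ['one.html">1</a> <a href="two.html']
print(re.findall(r'href="(.*?)"', line))    # non-greedy: ['one.html', 'two.html']
print(re.findall(r'href="([^"]*)"', line))  # character class: ['one.html', 'two.html']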
But don't use regexes for parsing HTML. Really.
What others haven't told you is that using regular expressions for this is not a reliable solution.
Using regular expressions will give you wrong results in many situations: if there are <a> tags that are commented out, if there is text in the page that includes the string "href=", or if there are <textarea> elements containing html code, among many others. Plus, the href attribute may exist on tags other than the anchor tag.
What you need for this is XPath, which is a query language for DOM trees, i.e. it lets you retrieve any set of nodes satisfying the conditions you specify (HTML attributes are nodes in the DOM).
XPath is a well-standardized language nowadays (W3C), and is well supported by all major languages. I strongly suggest you use XPath and not regexes for this.
adw's answer shows one example of using XPath for your particular case.
Don't divide the html content into lines, as there may be multiple matches in a single line. Also don't assume there are always quotes around the url.
Do something like this:
links = re.finditer(r' href="?([^\s"]+)', content)
for link in links:
    print(link.group(1))
Well, just for completeness I will add here what I found to be the best answer, and I found it in the book Dive Into Python, by Mark Pilgrim. (Note that this is Python 2 code; sgmllib was removed in Python 3.)
Here follows the code to list all URLs from a webpage:
from sgmllib import SGMLParser

class URLLister(SGMLParser):
    def reset(self):
        SGMLParser.reset(self)
        self.urls = []

    def start_a(self, attrs):
        # Collect the href value from every <a> start tag
        href = [v for k, v in attrs if k == 'href']
        if href:
            self.urls.extend(href)

import urllib
usock = urllib.urlopen("http://diveintopython.net/")
parser = URLLister()
parser.feed(usock.read())
usock.close()
parser.close()
for url in parser.urls:
    print(url)
Thanks for all the replies.
