The xpath for text I wish to extract is reliably located deep in the tree at
...table/tbody/tr[4]/td[2]
Specifically, td[2] is structured like so:
<td class="val">xyz</td>
I am trying to extract the text "xyz", but a broad search returns multiple results. For example, the following path returns 10 elements:
xpath('//td[@class="val"]')
... while a specific search doesn't return any elements. I am unsure why the following returns nothing.
xpath('//tbody/tr/td[@class="val"]')
One solution involves...
table = root.xpath('//table[@class="123"]')
# going down the tree
xyz = table[0][3][1]
print xyz.text
However, I am pretty sure this is extremely brittle. I would appreciate it if someone could tell me how to construct an XPath search that is both robust and relatively cheap on resources.
You haven't mentioned it explicitly, but if your target table and td tag classes are reliable then you could do something like:
//table[@class="123"]/descendant::td[@class="val"]
And you sidestep the issue of whether tbody is there or not.
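For instance, with lxml (a minimal sketch; page_source here is a placeholder name for your raw HTML string):
from lxml import etree

root = etree.HTML(page_source)  # page_source: your raw HTML string (placeholder)
cells = root.xpath('//table[@class="123"]/descendant::td[@class="val"]')
if cells:
    print cells[0].text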
However, there's no substitute for actually seeing the material you are trying to parse when recommending XPath queries...
...table/tbody/tr[4]/td[2]
I guess you found this XPath via a tool like Firebug. One thing to note about tools like Firebug (or other inspect tools within browsers) is that they use the DOM tree generated by the browser itself, and most (if not all) HTML parsers in browsers try hard to make the passed HTML valid. This often requires adding various tags the standard dictates.
<tbody> is one of these tags. <tr> tags are only allowed as children of <thead>, <tbody> or <tfoot> tags. Unfortunately, in my experience, you will rarely see one of these tags inside a <table> in the actual source, but a browser will add the necessary tags while parsing, since the standard requires it.
To cut this story short, there is probably no <tbody> tag in your actual source. That is why your XPath returns nothing.
As for generating XPath queries, this highly depends on the particular page/XML. In general, positional queries such as td[4] should be the last resort, since they tend to break easily when something is added before them. You should inspect the markup carefully and try to come up with queries that use attributes like id or class, since they add specificity more reliably than positional ones. But in the end, it all boils down to the specifics of the page in question.
This seems to be working
from lxml import etree
doc = etree.HTML('<html><body><table><tbody><tr><td>bad</td><td class="val">xyz</td></tr></tbody></table></body></html>')
print doc.xpath('//tbody/tr/td[@class="val"]')[0].text
output:
xyz
So what is your problem?
Related
I'm trying to get the number in all the <b> tags on this website. I want every single "qid" (question id), so I think I have to use qids = driver.find_elements_by_tag_name("b"). Based on other questions I've found, I also need to implement a for loop and then print(qids.get_attribute("text")), but my code can't even find elements with the <b> tag, since I keep getting a NoSuchElementException. The appearance of the website leads me to believe the content I'm looking for is within an iframe, but I'm not sure whether that affects the functionality of my code.
Here's a screencap of the website for reference
The HTML isn't of much use because the tag is its only defining trait:
<b>13570etc...</b>
Any help is much appreciated.
You could try searching by XPath:
driver.find_elements_by_xpath("//b")
Where // means "find all matching elements regardless of where they are in the document/current scope." Check out the XPath syntax here and mess around with a few different options.
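If the content really is inside an iframe, you also need to switch into the frame before searching, since the driver only sees the current document. A rough sketch (the frame locator below is a hypothetical placeholder; note too that element text comes from the .text property rather than get_attribute("text")):
# Hypothetical locator -- inspect the page for the iframe's real name/index.
driver.switch_to.frame("frame_name_or_index")

for b in driver.find_elements_by_xpath("//b"):
    print(b.text)

driver.switch_to.default_content()  # switch back out when done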
I am using the Python version of the lxml library. I am currently trying to parse the text from a table, but am running into a problem: some of the text is links.
For example, one of the cells may look something like this:
<td>
Can I kick it, <a>to all the people</a> who can quest like a <a>tribe</a> does
</td>
Say that after parsing the HTML, the td element is stored as foo. Then foo.text will not display the whole text, only the parts that aren't links. Moreover, if I find the link text using [i.text for i in foo.getchildren()], I no longer know the order in which to put the non-link text and link text.
Is there an easy way to get around this?
Well after searching for an hour, within 2 minutes of posting this question I have found the solution.
Use the method foo.text_content() and this will display what is needed.
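For example (a minimal sketch using lxml.html, which is where text_content() lives):
from lxml import html

doc = html.fromstring('<table><tr><td>Can I kick it, '
                      '<a>to all the people</a> who can quest '
                      'like a <a>tribe</a> does</td></tr></table>')
foo = doc.xpath('//td')[0]
print foo.text_content()
# Can I kick it, to all the people who can quest like a tribe does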
I am writing a Scrapy program to extract the data.
This is the URL, and I want to scrape the 20111028013117 (code) information. I have taken the XPath from the Firefox add-on XPather. This is the path:
/html/body/p/table/tbody/tr/td/table[2]/tbody/tr[1]/td/table[3]/tbody/tr/td[2]/table[1]/tbody/tr/td/table/tbody/tr/td[2]/table[3]/tbody/tr/td/table/tbody/tr[2]/td[2]
While I am trying to execute this
try:
    temp_list = hxs.select("/html/body/p/table/tbody/tr/td/table[2]/tbody/tr[1]/td/table[3]/tbody/tr/td[2]/table[1]/tbody/tr/td/table/tbody/tr/td[2]/table[3]/tbody/tr/td/table/tbody/tr[2]/td[2]").extract()
    print "temp_list:" + str(temp_list)
except:
    print "error"
It returns an empty list. I have been struggling to find an answer for this for the last 4 hours. I am a newbie to Scrapy, and even though I have handled issues very well in other projects, this one seems to be a bit difficult.
The reason your XPath doesn't work is because of tbody. You have to remove it and check whether you get the result you want.
You can read this in scrapy documentation: http://doc.scrapy.org/en/0.14/topics/firefox.html
Firefox, in particular, is known for adding <tbody> elements to
tables. Scrapy, on the other hand, does not modify the original page
HTML, so you won’t be able to extract any data if you use <tbody> in
your XPath expressions.
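For example, the same query from the question with every tbody step removed should at least have a chance of matching (still brittle, though):
temp_list = hxs.select("/html/body/p/table/tr/td/table[2]/tr[1]/td/table[3]"
                       "/tr/td[2]/table[1]/tr/td/table/tr/td[2]/table[3]"
                       "/tr/td/table/tr[2]/td[2]").extract()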
I see that the element you are hunting for is inside a <table>.
Firefox adds a tbody tag to every table, even if it does not exist in the source HTML code.
That might be the reason your XPath query works in the browser but fails in Scrapy.
As suggested, use other anchors in your xpath query.
You can extract the data more easily using a more robust XPath instead of taking the direct output from XPather.
For the data you are matching, this XPath would do a lot better:
//font[contains(text(),'Code')]/parent::td/following-sibling::td/font/text()
This will match the <font> tag containing "Code", then go up to its parent <td> and select the next <td> -> <font>, which contains the code you are looking for.
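In the spider from the question, that would plug straight into the existing select call, e.g.:
code = hxs.select("//font[contains(text(),'Code')]/parent::td"
                  "/following-sibling::td/font/text()").extract()
print "code: " + str(code)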
Have you tried removing a few node tags from the end of the query and re-running until you get a result? Do this several times until you get something, then cautiously add items back until the query is rectified.
Also, check that your target page validates as XHTML - an invalid page would probably upset the parser.
I have created a simple test harness in python for my ASP .net web site.
I would like to look up some HTML tags in the resulting page to find certain values.
What would be the best way of doing this in python?
e.g. (returned page):
<div id="ErrorPanel">An error occurred......</div>
would display (in std out from python):
Error: .....
or
<td id="dob">23/3/1985</td>
would display:
Date of birth: 23/3/1985
Do you want to parse XML, as you state in your question's title, or HTML, as you show in the text of the question? For the latter, I recommend BeautifulSoup -- download it and install it, then, once having made a soup object out of the HTML, you can easily locate the tag with a certain id (or other attribute), e.g.:
errp = soup.find(attrs={'id': 'ErrorPanel'})
if errp is not None:
    print 'Error:', errp.string
and similarly for the other case (easily tweakable e.g. into a loop if you're looking for non-unique attributes, and so on).
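For the date-of-birth case, the same pattern applies (a short sketch reusing the soup object from above):
dob = soup.find(attrs={'id': 'dob'})
if dob is not None:
    print 'Date of birth:', dob.string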
You can also do it with lxml. It handles HTML very well, and you can use CSS selectors for querying DOM, which makes it particularly attractive if you use libraries like jQuery regularly.
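For example (a minimal sketch; depending on your lxml version, CSS selector support may require the separate cssselect package, and page_source is a placeholder for the returned page):
from lxml import html

doc = html.fromstring(page_source)  # page_source: the returned page (placeholder)
for el in doc.cssselect('#ErrorPanel'):
    print 'Error:', el.text_content()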
Of course an HTML page can be parsed using any number of python parsers, but I'm surprised that there don't seem to be any public parsing scripts to extract meaningful content (excluding sidebars, navigation, etc.) from a given HTML doc.
I'm guessing it's something like collecting DIV and P elements and then checking them for a minimum amount of text content, but I'm sure a solid implementation would include plenty of things that I haven't thought of.
Try the Beautiful Soup library for Python. It has very simple methods to extract information from an html file.
Trying to generically extract data from webpages would require people to write their pages in a similar way... but there's an almost infinite number of ways to convey a page that looks identical, let alone all the combinations you can use to convey the same information.
Was there a particular type of information you were trying to extract or some other end goal?
You could try extracting any content in 'div' and 'p' markers and compare the relative sizes of all the information in the page. The problem then is that people probably group information into collections of 'div's and 'p's (or at least they do if they're writing well-formed HTML!).
Maybe if you formed a tree of how the information is related (nodes would be the 'p' or 'div' or whatever, and each node would contain the associated text), you could do some sort of analysis to identify the smallest 'p' or 'div' that encompasses what appears to be the majority of the information...?
[EDIT] Maybe if you can get it into the tree structure I suggested, you could then use a points system similar to SpamAssassin's. Define some rules that attempt to classify the information. Some examples:
+1 points for every 100 words
+1 points for every child element that has > 100 words
-1 points if the section name contains the word 'nav'
-2 points if the section name contains the word 'advert'
If you have lots of low-scoring rules which add up as you find more relevant-looking sections, I think that could evolve into a fairly powerful and robust technique.
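A toy sketch of that scoring idea with lxml (the weights and the 100-word threshold are just the illustrative numbers from the list above, nothing tuned; page_source is a placeholder name):
from lxml import html

def score(el):
    points = 0
    words = len(el.text_content().split())
    points += words // 100                         # +1 point per 100 words
    for child in el.findall('*'):
        if len(child.text_content().split()) > 100:
            points += 1                            # text-heavy child element
    name = ((el.get('id') or '') + ' ' + (el.get('class') or '')).lower()
    if 'nav' in name:
        points -= 1                                # probably navigation
    if 'advert' in name:
        points -= 2                                # probably advertising
    return points

doc = html.fromstring(page_source)  # page_source: the page HTML (placeholder)
candidates = [el for el in doc.iter() if el.tag in ('div', 'p')]
best = max(candidates, key=score)
print best.text_content()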
[EDIT2] Looking at Readability, it seems to be doing pretty much exactly what I just suggested! Maybe it could be improved to understand tables better?
Have a look at templatemaker: http://www.holovaty.com/writing/templatemaker/
It's written by one of the founders of Django. Basically you feed it a few example html files and it will generate a "template" that you can then use to extract just the bits that are different (which is usually the meaningful content).
Here's an example from the google code page:
# Import the Template class.
>>> from templatemaker import Template
# Create a Template instance.
>>> t = Template()
# Learn a Sample String.
>>> t.learn('<b>this and that</b>')
# Output the template so far, using the "!" character to mark holes.
# We've only learned a single string, so the template has no holes.
>>> t.as_text('!')
'<b>this and that</b>'
# Learn another string. The True return value means the template gained
# at least one hole.
>>> t.learn('<b>alex and sue</b>')
True
# Sure enough, the template now has some holes.
>>> t.as_text('!')
'<b>! and !</b>'
You might use the boilerpipe Web application to fetch and extract content on the fly.
(This is not specific to Python, as you only need to issue an HTTP GET request to a page on Google AppEngine.)
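A rough sketch with urllib2; the endpoint and query parameters below are my assumption of the public boilerpipe demo app and may have changed, so verify them before relying on this:
import urllib
import urllib2

# Assumed endpoint/parameters for the public boilerpipe demo -- verify before use.
params = urllib.urlencode({
    'url': 'http://example.com/some-article',  # placeholder article URL
    'extractor': 'ArticleExtractor',
    'output': 'text',
})
print urllib2.urlopen('http://boilerpipe-web.appspot.com/extract?' + params).read()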
What is meaningful and what is not depends on the semantics of the page. If the semantics are poor, your code won't be able to "guess" what is meaningful. I use Readability, which you linked in the comments, and I see that on many pages I try it does not provide any result at all, let alone a decent one.
If someone puts the content in a table, you're doomed. Try readability on a phpbb forum you'll see what I mean.
If you want to do it, go with a regexp on <p></p>, or parse the DOM.
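For the regexp route, a throwaway sketch (fragile by nature; attributes and nested markup will trip it up, which is why parsing the DOM is the safer choice; page_html is a placeholder name):
import re

# page_html: the raw HTML string (placeholder)
paragraphs = re.findall(r'<p[^>]*>(.*?)</p>', page_html, re.S | re.I)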
Goose is just the library for this task. To quote their README:
Goose will try to extract the following information:
Main text of an article
Main image of article
Any Youtube/Vimeo movies embedded in article
Meta Description
Meta tags
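A minimal usage sketch, assuming the python-goose package and a placeholder URL:
from goose import Goose

g = Goose()
article = g.extract(url='http://example.com/some-article')  # placeholder URL
print article.cleaned_text               # main text of the article
if article.top_image is not None:
    print article.top_image.src          # main image of the article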