Parsing HTML with XPath, Python and Scrapy

I am writing a Scrapy program to extract data from a page.
This is the url, and I want to scrape the code value, 20111028013117. I took the XPath from the Firefox add-on XPather. This is the path:
/html/body/p/table/tbody/tr/td/table[2]/tbody/tr[1]/td/table[3]/tbody/tr/td[2]/table[1]/tbody/tr/td/table/tbody/tr/td[2]/table[3]/tbody/tr/td/table/tbody/tr[2]/td[2]
When I try to execute this:
try:
    temp_list = hxs.select("/html/body/p/table/tbody/tr/td/table[2]/tbody/tr[1]/td/table[3]/tbody/tr/td[2]/table[1]/tbody/tr/td/table/tbody/tr/td[2]/table[3]/tbody/tr/td/table/tbody/tr[2]/td[2]").extract()
    print("temp_list:" + str(temp_list))
except Exception:
    print("error")
it returns an empty list. I have been struggling to find an answer for the last 4 hours. I am new to Scrapy; I have handled issues like this well in other projects, but this one seems more difficult.

Your XPath doesn't work because of tbody. Remove it and check whether you get the result you want.
You can read about this in the Scrapy documentation: http://doc.scrapy.org/en/0.14/topics/firefox.html
Firefox, in particular, is known for adding <tbody> elements to
tables. Scrapy, on the other hand, does not modify the original page
HTML, so you won’t be able to extract any data if you use <tbody> in
your XPath expressions.
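
The effect described in the quote can be reproduced outside Scrapy. The following is a minimal sketch using lxml; the HTML snippet is a made-up stand-in for the real page, which is an assumption, since the original source is not shown. Its source contains no <tbody>, and lxml does not add one.

```python
from lxml import html

# Hypothetical page source: note there is no <tbody> in the markup.
SOURCE = """
<html><body>
  <table>
    <tr><td>Code</td><td>20111028013117</td></tr>
  </table>
</body></html>
"""

doc = html.fromstring(SOURCE)

# Path as copied from a browser tool, including the <tbody> the browser added.
with_tbody = doc.xpath("//table/tbody/tr/td[2]/text()")

# Same path with the tbody step removed, matching the raw source.
without_tbody = doc.xpath("//table/tr/td[2]/text()")

print(with_tbody)     # []
print(without_tbody)  # ['20111028013117']
```

The first query is empty for the same reason the question's query is: the parser keeps the markup as written, so there is no tbody element to match.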

I see that the element you are hunting for is inside a <table>.
Firefox adds a tbody tag to every table, even if it does not exist in the source HTML.
That might be the reason your XPath query works in the browser but fails in Scrapy.
As suggested, use other anchors in your XPath query.

You can extract data with more ease using more robust XPaths instead of taking the direct output from XPather.
For the data you are matching, this XPath would do a lot better:
//font[contains(text(),'Code')]/parent::td/following-sibling::td/font/text()
This will match the <font> tag containing "Code", then go to the td tag above it and select the next td -> font, which contains the code you are looking for.
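
As a rough illustration of that expression, here is a sketch with lxml against an invented table. The layout (label and value each wrapped in a <font> inside adjacent cells) is an assumption based on the description above, not the real page.

```python
from lxml import html

# Hypothetical markup mimicking the label/value layout described above.
SOURCE = """
<html><body>
  <table>
    <tr>
      <td><font>Code</font></td>
      <td><font>20111028013117</font></td>
    </tr>
    <tr>
      <td><font>Name</font></td>
      <td><font>Example</font></td>
    </tr>
  </table>
</body></html>
"""

doc = html.fromstring(SOURCE)

# Anchor on the label text instead of on the position in the document.
codes = doc.xpath(
    "//font[contains(text(),'Code')]"
    "/parent::td/following-sibling::td/font/text()"
)

print(codes)  # ['20111028013117']
```

Because the query never mentions tbody or any table index, it keeps working if tables are added or removed elsewhere on the page.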

Have you tried removing a few node tags at the end of the query, and re-running until you get a result? Do this several times until you get something, then add items back in cautiously until the query is rectified.
Also, check that your target page validates as XHTML - an invalid page would probably upset the parser.

Related

How can I fetch the number in a b tag through selenium-python?

I'm trying to get the number in every <b> tag on this website. I want every single "qid" (question id), so I think I have to use qids = driver.find_elements_by_tag_name("b"). Based on other questions I've found, I also need a for loop and then print(qids.get_attribute("text")), but my code can't even seem to find the <b> elements: I keep getting a NoSuchElementException. The appearance of the website leads me to believe the content I'm looking for is inside an iframe, but I'm not sure whether that affects my code.
Here's a screencap of the website for reference
The html isn't of much use because the tag is its only defining trait:
<b>13570etc...</b>
Any help is much appreciated.

You could try searching by XPath:
driver.find_elements_by_xpath("//b")
Where // means "find all matching elements regardless of where they are in the document/current scope." Check out the XPath syntax here and mess around with a few different options.

xpath query on id //*[@id="page"] returns two elements

I'm trying to scrape the site ketabejam.ir.
I'm using Python 3.4.1, and for parsing I use lxml 3.4.1.
I parsed the page with the lxml.html.fromstring method.
When I load the document in my interpreter and run the following query to get the number of pages, so I can handle pagination:
s = doc.xpath("//*[@id='page']")
surprisingly I get this result:
>>> len(s) == 2
True
I got the address of the element from Firebug's minimal XPath.
When I choose the normal XPath, the query runs smoothly.
Is this a bug, or am I doing something wrong?

You can work around this in general by always doing something like:
s = doc.xpath("(//*[@id='page'])[1]")
...if you know you really just want the first node that matches, and can safely ignore any subsequent ones (which seems like a safe bet in this case).

Looking at the source of the page you linked, there are exactly two elements with that id. Most probably one is at the top of the table and the other at the bottom.
Firebug's "copy minimal XPath" option works from the id of the element. It is only available for elements that have an id attribute, and it creates an XPath in the format:
//*[@id="elementID"]
which is what you are getting.
Ideally, every HTML page should have only one element with a particular id; that is, ids should be unique across the page. It seems that Firebug's minimal XPath depends on that.
In your context, I think both elements return the same link, so you can use either to continue your scraping. Or, as you indicated, you can use the normal XPath instead.
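
To make the duplicate-id behaviour concrete, here is a small sketch with lxml. The markup is invented for illustration (a pager repeated above and below a table) and is only an assumption about how the real page is laid out.

```python
from lxml import html

# Hypothetical markup: the same id is used twice, which is invalid HTML
# but common in practice.
SOURCE = """
<html><body>
  <div id="page"><a href="?p=2">2</a></div>
  <table><tr><td>rows</td></tr></table>
  <div id="page"><a href="?p=2">2</a></div>
</body></html>
"""

doc = html.fromstring(SOURCE)

# An id-based query returns every element carrying that id.
all_matches = doc.xpath("//*[@id='page']")

# Wrapping the query in parentheses and indexing keeps only the first match.
first_only = doc.xpath("(//*[@id='page'])[1]")

print(len(all_matches))  # 2
print(len(first_only))   # 1
```

So the two results are not a bug in lxml: XPath does not assume ids are unique, and the parentheses are needed so that [1] applies to the whole result set rather than to each parent separately.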

Xpath wildcard in Selenium to capture multiple instances of results

I'm trying to gather some data from a site using Python, Selenium and XPath. There are multiple data points I want, and they are all in this structure:
/tr[1]/td
/tr[2]/td
/tr[3]/td
/tr[4]/td
I do not know how many <tr> elements there are, so I am trying to search in a way that gives me all results (hopefully in a list). How do I do that?
Here is my actual code, but it only gives me individual results. I'm new to web scraping and unsure whether the issue is with my XPath (not using wildcards correctly) or with my get_attribute call (if it gets innerHTML, does it only get it for the single entry?).
data = driver.find_element_by_xpath('//*[@id="a-stockFinancials_tabs"]/div[2]/div[1]/table/tbody/tr[5]/td').get_attribute("innerHTML")
print(data)

You should give find_elements_by_xpath a try.
I think, without seeing your full HTML, that this would work:
data = driver.find_elements_by_xpath('//*[@id="a-stockFinancials_tabs"]/div[2]/div[1]/table/tbody/tr/td')
for element in data:
    print(element.get_attribute("innerHTML"))
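
The difference between tr[5]/td and tr/td can be checked without a browser. The sketch below uses lxml as a stand-in for Selenium (an assumption made so the example runs on its own), with an invented four-row table; the XPath behaviour is the same in both tools.

```python
from lxml import html

# Hypothetical table standing in for the financials table in the question.
SOURCE = """
<html><body>
  <div id="a-stockFinancials_tabs">
    <table>
      <tr><td>10</td></tr>
      <tr><td>20</td></tr>
      <tr><td>30</td></tr>
      <tr><td>40</td></tr>
    </table>
  </div>
</body></html>
"""

doc = html.fromstring(SOURCE)

# A positional predicate pins the query to one row.
single = doc.xpath('//*[@id="a-stockFinancials_tabs"]/table/tr[2]/td/text()')

# Dropping the predicate matches the td of every row.
every_row = doc.xpath('//*[@id="a-stockFinancials_tabs"]/table/tr/td/text()')

print(single)     # ['20']
print(every_row)  # ['10', '20', '30', '40']
```

No wildcard is needed: a step without an index already matches all siblings with that name.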

xpath doesn't work in this website

I am scraping individual listing pages from justproperty.com (individual listing from the original question no longer active).
I want to get the value of the Ref field.
This is my XPath:
>>> sel.xpath('normalize-space(.//div[@class="info_div"]/table/tbody/tr/td[normalize-space(text())="Ref:"]/following-sibling::td[1]/text())').extract()[0]
This returns no results in Scrapy, despite working in my browser.

The following works perfectly in lxml.html (which modern Scrapy uses):
sel.xpath('.//div[@class="info_div"]//td[text()="Ref:"]/following-sibling::td[1]/text()')
Note that I'm using // to get from the div to the td instead of laying out the explicit path. I'd have to take a closer look at the document to grok why, but the path given for that part was incorrect.

Don't create XPath expressions by copying them from Firebug or Chrome Dev Tools; those tools show markup that the browser has already modified. Remove the /tbody axis step and you'll receive exactly what you're looking for.
normalize-space(.//div[@class="info_div"]/table/tr/td[
normalize-space(text())="Ref:"
]/following-sibling::td[1]/text())
Read Why does my XPath query (scraping HTML tables) only work in Firebug, but not the application I'm developing? for more details.
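
The following sketch shows the tbody-free expression at work and why normalize-space() can matter. It uses lxml directly rather than a Scrapy selector, and the markup and the value JP-12345 are invented for illustration, since the original listing is no longer available.

```python
from lxml import html

# Hypothetical listing markup: no <tbody>, and padded whitespace in the cells.
SOURCE = """
<html><body>
  <div class="info_div">
    <table>
      <tr><td> Ref: </td><td> JP-12345 </td></tr>
      <tr><td>Type:</td><td>Villa</td></tr>
    </table>
  </div>
</body></html>
"""

doc = html.fromstring(SOURCE)

# An exact comparison fails here because the cell text is " Ref: ".
exact = doc.xpath(
    './/div[@class="info_div"]//td[text()="Ref:"]'
    '/following-sibling::td[1]/text()'
)

# normalize-space() trims and collapses whitespace on both the label and the value.
normalized = doc.xpath(
    'normalize-space(.//div[@class="info_div"]/table/tr/td['
    'normalize-space(text())="Ref:"'
    ']/following-sibling::td[1]/text())'
)

print(exact)       # []
print(normalized)  # JP-12345
```

Note that the outer normalize-space() makes the query return a single string instead of a list of nodes.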

Another XPath that gets the same thing: (.//td[@class='titles']/../td[2])[1]
I tried your XPath using XPath Checker and it works fine.

Python lxml XPath with deep nesting with specific search

The xpath for text I wish to extract is reliably located deep in the tree at
...table/tbody/tr[4]/td[2]
Specifically, td[2] is structured like so
<td class="val">xyz</td>
I am trying to extract the text "xyz", but a broad search returns multiple results. For example the following path returns 10 elements.
xpath('//td[@class="val"]')
... while a specific search doesn't return any elements. I am unsure why the following returns nothing.
xpath('//tbody/tr/td[@class="val"]')
One solution involves:
table = root.xpath('//table[@class="123"]')
# going down the tree
xyz = table[0][3][1]
print(xyz.text)
However, I am pretty sure this is extremely brittle. I would appreciate it if someone could tell me how to construct an XPath search that is both robust and relatively cheap on resources.

You haven't mentioned it explicitly, but if your target table and td tag classes are reliable then you could do something like:
//table[@class="123"]/descendant::td[@class="val"]
And you half dodge the issue of tbody being there or not.
However, there's no substitute for actually seeing the material you are trying to parse for recommending XPATH queries...
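
As a quick check of the descendant:: suggestion, here is a sketch with lxml that runs the same query against two invented versions of the table, one with <tbody> in the source and one without. The markup is an assumption modelled on the class names in the question.

```python
from lxml import html

# Two hypothetical sources: the same table with and without an explicit <tbody>.
WITHOUT_TBODY = (
    '<html><body><table class="123">'
    '<tr><td>bad</td><td class="val">xyz</td></tr>'
    '</table></body></html>'
)
WITH_TBODY = (
    '<html><body><table class="123"><tbody>'
    '<tr><td>bad</td><td class="val">xyz</td></tr>'
    '</tbody></table></body></html>'
)

QUERY = '//table[@class="123"]/descendant::td[@class="val"]'

results = []
for source in (WITHOUT_TBODY, WITH_TBODY):
    doc = html.fromstring(source)
    # descendant:: matches at any depth, so an intermediate tbody is irrelevant.
    results.append([td.text for td in doc.xpath(QUERY)])

print(results)  # [['xyz'], ['xyz']]
```

The trade-off is that descendant:: also matches cells of any nested tables, so the class filters have to be specific enough on their own.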

...table/tbody/tr[4]/td[2]
I guess you found this XPath via a tool like Firebug. One thing to note about tools like Firebug (or other inspect tools within browsers) is that they use the DOM tree generated by the browser itself and most (if not all) HTML parsers in browsers would try hard to make the passed HTML valid. This often requires adding various tags the standard dictates.
<tbody> is one of these tags. In the DOM that a browser builds, <tr> elements always end up inside a <thead>, <tbody> or <tfoot>. In my experience, you will rarely see one of these tags written inside a <table> in the actual source, because the HTML standard allows the <tbody> tags to be omitted; the browser then adds the element itself while parsing.
To cut this story short, there is probably no <tbody> tag in your actual source. That is why your XPath returns nothing.
As for generating XPath queries, this depends heavily on the particular page or XML document. In general, positional queries such as td[4] should be the last resort, since they tend to break easily when something is added before them. Inspect the markup carefully and try to come up with queries that use attributes like id or class, since they add specificity more reliably than positional ones. In the end, it all boils down to the specifics of the page in question.

This seems to be working
from lxml import etree
doc = etree.HTML('<html><body><table><tbody><tr><td>bad</td><td class="val">xyz</td></tr></tbody></table></body></html>')
print(doc.xpath('//tbody/tr/td[@class="val"]')[0].text)
output:
xyz
So what is your problem?
