Scraping data via scrapy from table yields nothing - python

I am having issues extracting data from the table below.
https://tirewheelguide.com/sizes/perodua/myvi/2019/
I want to extract the tire sizes; in this example it would be 175/65 SR14:
<a style="text-decoration: underline;" href="https://tirewheelguide.com/tires/s/175-65-14/">175/65 SR14 </a>
Using the Scrapy shell, the expression
response.xpath('/html/body/div[2]/table[1]/tbody/tr[1]/td[1]/a[1]/text()').get()
yields nothing.
Do you know what I am doing wrong?

There is a problem with your XPath. Instead of this:
response.xpath('/html/body/div[2]/table[1]/tbody/tr[1]/td[1]/a[1]/text()').get()
use this:
response.xpath('//table[1]//td//a/text()').get()
Some websites don't build their tables properly, so in my XPath I don't go through html/body/div. There was also a problem with tr: the site creates multiple tr elements in the same row, which throws off the positional indexes. If you use the XPath I posted, it will work fine.
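As a quick check of the relative expression, here is a minimal sketch using lxml (the library underneath Scrapy's selectors) against a simplified fragment of the page; the real markup surrounding the table differs:

```python
from lxml import html

# Simplified stand-in for the page; only the table structure matters here.
doc = html.fromstring(
    "<html><body><div><table>"
    "<tr><td><a href='https://tirewheelguide.com/tires/s/175-65-14/'>175/65 SR14 </a></td></tr>"
    "</table></div></body></html>"
)

# Relative XPath anchored on the table itself, so it survives layout changes
# that would break a long absolute /html/body/div[2]/... path.
size = doc.xpath('//table[1]//td//a/text()')[0]
print(size.strip())  # 175/65 SR14
```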

Related

Is it possible to download the 'inspect element' data from a website?

I have been trying to access the inspect-element data from a certain website (the regular source code won't work for this). At first I tried rendering the JavaScript for the site. I've tried using selenium, pyppeteer, webbot, phantomjs, and requests_html + BeautifulSoup. None of these worked. Would it be possible to simply copy-paste this data using Python?
The data I need is from https://www.arcgis.com/apps/opsdashboard/index.html#/bda7594740fd40299423467b48e9ecf6 and looks like this:
<nav class="feature-list">
<span style="" id="ember683" class="flex-horizontal feature-list-item ember-view">
(all spans in this particular nav)

Scraping with Python - XPath issue

I'm currently researching scraping and I've been following a tutorial on YouTube. The tutorial uses Scrapy, and I've managed to scrape data from the website shown in the tutorial. However, now I've tried scraping another website with no success.
From my understanding, the problem is with the XPath that I'm using. I've tried several XPath testing/generator websites with no success.
This is the HTML in question:
<div class="price" currentmouseover="94">
<del currentmouseover="96">
<span class="woocommerce-Price-amount amount" currentmouseover="90"><span class="woocommerce-Price-currencySymbol">€</span>3.60</span>
</del>
<ins><span class="woocommerce-Price-amount amount" currentmouseover="123"><span class="woocommerce-Price-currencySymbol" currentmouseover="92">€</span>3.09</span></ins></div>
I'm currently using the following code:
def parse(self, response):
    for title in response.xpath("//div[@class='Price']"):
        yield {
            'title_text': title.xpath(".//span[@class='woocommerce-Price-amount amount']/text()").extract_first()
        }
I've also tried using //span[@class='woocommerce-Price-amount amount'].
I want my output to be '3.09', instead, I'm getting null when I export it to a JSON file. Can someone point me in the right direction?
Thanks in advance.
Update 1:
I've managed to fix the problem with Jack Fleeting's answer. Since I've had problems understanding XPath, I've been trying different websites to get a better understanding of how it works. Unfortunately, I'm stuck on another example.
<div class="add-product"><strong><small>€3.11</small> €3.09</strong></div>
I'm using the following snippet:
l.add_xpath('price', ".//div[@class='add-product']/strong[1]")
My expectation is to output 3.09, i.e. the actual (discounted) value of the item, but I'm getting both numbers. I've tried using a minimum function, but XPath 1.0 does not support it.
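For the updated snippet, one option is to select only the direct text node of the strong element, which excludes the old price inside small. A sketch using lxml (which backs Scrapy's selectors):

```python
from lxml import html

# The updated HTML fragment from the question.
doc = html.fromstring(
    '<div class="add-product"><strong><small>€3.11</small> €3.09</strong></div>'
)

# text() on <strong> yields only its own text node (" €3.09"),
# skipping the old price wrapped in <small>.
price = doc.xpath("//div[@class='add-product']/strong/text()")[0].strip()
print(price)  # €3.09
```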
Try this XPath expression and see if it works:
//div[@class='price']/ins/span
Note that price is lower case, as in your HTML.
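This expression can be verified outside of Scrapy; a quick sketch with lxml, appending /text() to pull out the number itself:

```python
from lxml import html

# The asker's HTML fragment (currentmouseover attributes trimmed for brevity).
doc = html.fromstring(
    '<div class="price">'
    '<del><span class="woocommerce-Price-amount amount">'
    '<span class="woocommerce-Price-currencySymbol">€</span>3.60</span></del>'
    '<ins><span class="woocommerce-Price-amount amount">'
    '<span class="woocommerce-Price-currencySymbol">€</span>3.09</span></ins>'
    '</div>'
)

# <ins> holds the current (discounted) price; /text() returns only the
# outer span's own text node, skipping the nested currency-symbol span.
print(doc.xpath("//div[@class='price']/ins/span/text()"))  # ['3.09']
```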

Xpath is correct but Scrapy doesn't work

I'm trying to download two fields from a webpage, I identify the XPath expressions for each one and then run the spider, but nothing is downloaded.
The webpage:
http://www.morningstar.es/es/funds/snapshot/snapshot.aspx?id=F0GBR04MZH
The field I want to itemize is ISIN.
The spider runs without errors, but the output is empty.
Here is the line code:
item['ISIN'] = response.xpath('//*[@id="overviewQuickstatsDiv"]/table/tbody/tr[5]/td[3]/text()').extract()
Try removing tbody from the XPath:
'//*[@id="overviewQuickstatsDiv"]/table//tr[5]/td[3]/text()'
Note that this tag is added by your browser while rendering the page; it's absent from the page source.
P.S. I suggest an even better XPath:
'//td[.="ISIN"]/following-sibling::td[contains(@class, "text")]/text()'
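The text-anchored expression can be sanity-checked against a minimal stand-in for the quick-stats table; a sketch with lxml, where the ISIN value is a made-up placeholder:

```python
from lxml import html

# Minimal stand-in for the Morningstar quick-stats table; the ISIN value
# "LU0123456789" is a hypothetical placeholder, not the fund's real ISIN.
doc = html.fromstring(
    '<div id="overviewQuickstatsDiv"><table>'
    '<tr><td>ISIN</td><td class="line text">LU0123456789</td></tr>'
    '</table></div>'
)

# Anchor on the label cell, then take its sibling value cell.
isin = doc.xpath('//td[.="ISIN"]/following-sibling::td[contains(@class, "text")]/text()')[0]
print(isin)  # LU0123456789
```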
I think response.selector was not used. Try this:
response.selector.xpath('//*[@id="overviewQuickstatsDiv"]/table/tbody/tr[5]/td[3]/text()').extract()

can not find table content (hidden table) when scrapy on a web

I am trying to scrape the following url (http://cmegroup.com/clearing/operations-and-deliveries/accepted-trade-types/block-data.html/#contractTypes=FUT&exchanges=XNYM&assetClassId=0). The table content is what I'm interested in; however, it looks like the table is hidden somewhere:
Right-clicking and inspecting the table, I can see the element in the dev tools (==$0).
But in the Scrapy shell, if I do response.xpath('//*[#table]'), it returns nothing, so I can't scrape the content this way.
Please help on this issue, thanks.
UPDATE: The final solution was to use Selenium (great tool) for this scraping task. Selenium is especially useful when page content such as tables is rendered by JavaScript; there are tons of Selenium instructions to be found in the community, here is one example.
The reason the table is empty is that you are scraping the wrong URL; the table's data is actually loaded from this one:
http://www.cmegroup.com/CmeWS/mvc/xsltTransformer.do?xlstDoc=/XSLT/md/blocks-records.xsl&url=/da/BlockTradeQuotes/V1/Block/BlockTrades?exchange=XCBT,XCME,XCEC,DUMX,XNYM&foi=FUT,OPT,SPD&assetClassId=0&tradeDate=05172018&sortCol=time&sortBy=desc
The "05172018" text in the URL above looks like a date filter with this format: MMDDYYYY.
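If that reading of the parameter is right, the tradeDate value can be generated with strftime; a sketch (the base URL is the one from the answer above and may have changed since):

```python
from datetime import date

# Hypothetical helper: build the data URL for a given trade date,
# assuming the tradeDate parameter really is MMDDYYYY.
BASE = ("http://www.cmegroup.com/CmeWS/mvc/xsltTransformer.do"
        "?xlstDoc=/XSLT/md/blocks-records.xsl"
        "&url=/da/BlockTradeQuotes/V1/Block/BlockTrades"
        "?exchange=XNYM&foi=FUT&assetClassId=0"
        "&tradeDate={}&sortCol=time&sortBy=desc")

def block_trades_url(d):
    return BASE.format(d.strftime("%m%d%Y"))  # MMDDYYYY

print(block_trades_url(date(2018, 5, 17)))  # ...tradeDate=05172018...
```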

Parsing HTML with XPath, Python and Scrapy

I am writing a Scrapy program to extract the data.
This is the url, and I want to scrape the 20111028013117 (code) information. I have taken the XPath from the Firefox add-on XPather. This is the path:
/html/body/p/table/tbody/tr/td/table[2]/tbody/tr[1]/td/table[3]/tbody/tr/td[2]/table[1]/tbody/tr/td/table/tbody/tr/td[2]/table[3]/tbody/tr/td/table/tbody/tr[2]/td[2]
While I am trying to execute this
try:
    temp_list = hxs.select("/html/body/p/table/tbody/tr/td/table[2]/tbody/tr[1]/td/table[3]/tbody/tr/td[2]/table[1]/tbody/tr/td/table/tbody/tr/td[2]/table[3]/tbody/tr/td/table/tbody/tr[2]/td[2]").extract()
    print "temp_list:" + str(temp_list)
except:
    print "error"
It returns an empty list, and I have been struggling to find an answer for the last 4 hours. I am a newbie to Scrapy; even though I have handled issues very well in other projects, this one seems to be a bit difficult.
The reason your XPath doesn't work is the tbody. You have to remove it and check whether you get the result you want.
You can read this in scrapy documentation: http://doc.scrapy.org/en/0.14/topics/firefox.html
Firefox, in particular, is known for adding <tbody> elements to
tables. Scrapy, on the other hand, does not modify the original page
HTML, so you won’t be able to extract any data if you use <tbody> in
your XPath expressions.
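The quoted behavior can be checked quickly outside the browser; a sketch using lxml, which, like Scrapy, parses the raw source without inserting tbody:

```python
from lxml import html

# Raw page source (no <tbody>), as Scrapy would see it.
doc = html.fromstring("<table><tr><td>value</td></tr></table>")

# The browser-style path finds nothing, because tbody only exists
# in the browser's rendered DOM, not in the source.
print(doc.xpath('//table/tbody/tr/td/text()'))  # []

# Dropping tbody (or using // to skip it) matches the cell.
print(doc.xpath('//table//tr/td/text()'))       # ['value']
```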
I see that the element you are hunting for is inside a <table>.
Firefox adds a tbody tag to every table, even if it does not exist in the source HTML code.
That might be the reason your XPath query works in the browser but fails in Scrapy.
As suggested, use other anchors in your xpath query.
You can extract data with more ease using more robust XPaths instead of taking the direct output from XPather.
For the data you are matching, this XPath would do a lot better:
//font[contains(text(),'Code')]/parent::td/following-sibling::td/font/text()
This will match the <font> tag containing "Code", then go to the td tag above it and select the next td -> font, which contains the code you are looking for.
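This anchored expression can be tried against a simplified stand-in for the page's deeply nested tables; a sketch with lxml, using the code value from the question:

```python
from lxml import html

# Simplified stand-in for the target page's nested tables; only the
# label/value pair from the question is reproduced here.
doc = html.fromstring(
    "<table><tr>"
    "<td><font>Code</font></td>"
    "<td><font>20111028013117</font></td>"
    "</tr></table>"
)

# Match the <font> containing "Code", step up to its <td>, then take
# the next <td>'s <font> text -- the code itself.
code = doc.xpath("//font[contains(text(),'Code')]/parent::td/following-sibling::td/font/text()")[0]
print(code)  # 20111028013117
```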
Have you tried removing a few node tags at the end of the query, and re-running until you get a result? Do this several times until you get something, then add items back in cautiously until the query is rectified.
Also, check that your target page validates as XHTML - an invalid page would probably upset the parser.