I am using scrapy to crawl this page
class QuotesSpider(scrapy.Spider):
    name = "tesco"
    start_urls = [
        'https://www.tesco.com/direct/tv-offer.event?icid=offers_trade_slot1',
    ]

    def parse(self, response):
        for quote in response.xpath('//li[contains(@class,"product-tile")]'):
            learningscrapyItem = crawlerItem()
            learningscrapyItem['title'] = quote.xpath('.//h3/a/text()').extract_first()
            price = quote.xpath('.//div[@class="buy-box-container"]/p[2]/text()').extract_first()
            learningscrapyItem['price'] = price.strip()
            yield (learningscrapyItem)
I am having issues with the price XPath, which only pulls some of the prices:
//div[@class="buy-box-container"]/p[2]/text()
After removing text() I think I can see why. The items whose price is extracted are structured like this:
<p class="price">
£189.00
</p>
The ones that are not are structured like:
<p class="price">
<span class="from">From</span>
£549.00
</p>
So the strip() appears to be removing these. Is there a way with XPath to get the text directly inside the paragraph tag, and not the text of the span within it?
Thanks.
The problem is that /text() matches only the direct text child nodes of the element and, as you correctly worked out, the second layout breaks the selector: there the first text node is just the whitespace before the span, so extract_first() returns it and strip() leaves an empty string.
I would get all the text nodes inside the "price" element and grab the amount with .re_first():
price = quote.xpath('.//div[@class="buy-box-container"]/p[2]//text()').re_first(r"\d+\.\d+")
Or, even simpler with a CSS selector instead of the XPath:
price = quote.css('.buy-box-container .price').re_first(r"\d+\.\d+")
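To see why the first text node is the culprit, here is a minimal sketch that uses only the standard library's ElementTree as a stand-in for Scrapy's selectors (an assumption made so it runs without Scrapy installed). The two markup strings are hypothetical, modelled on the layouts in the question, and the helper names are made up for illustration. The .text attribute plays roughly the role of /text() followed by extract_first(), and itertext() plays the role of //text().

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical markup modelled on the two price layouts in the question.
PLAIN = '<p class="price">\n  £189.00\n</p>'
WITH_SPAN = '<p class="price">\n  <span class="from">From</span>\n  £549.00\n</p>'


def first_direct_text(markup):
    """Text that sits before the first child element (the first direct text node)."""
    return ET.fromstring(markup).text


def price_from_all_text(markup):
    """Join every descendant text node, then take the first decimal number."""
    joined = "".join(ET.fromstring(markup).itertext())
    match = re.search(r"\d+\.\d+", joined)
    return match.group() if match else None


# The first direct text node of the second layout is only whitespace.
print(repr(first_direct_text(PLAIN).strip()))
print(repr(first_direct_text(WITH_SPAN).strip()))

# Searching all of the text finds the amount in both layouts.
print(price_from_all_text(PLAIN), price_from_all_text(WITH_SPAN))
```

Note that the regular expression drops the currency symbol, just as re_first(r"\d+\.\d+") does in the answer above.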
Try the following to get the prices you want.
Instead of using this:
quote.xpath('.//div[@class="buy-box-container"]/p[2]/text()').extract_first()
Try using this:
quote.xpath('.//div[@class="buy-box-container"]//p[@class="price"]/text()').extract()[-1]
I am using the following code to get values from a site
import scrapy

class scraping(scrapy.Spider):
    name = 'NewsSpider'
    start_urls = ['https://www.uol.com.br/']

    def parse(self, response):
        news = response.xpath('//article')
        for n in news:
            print({
                'Link': n.xpath("//a[@class='hyperlink headlineSub__link']").get(),
                'Title': n.xpath('//a/div/h3/text()').get(),
            })
For "Link" I am getting a lot of information, but I only want the link inside the href. Is it possible to get only that?
I have a sample of doing this very same thing. You should use something like this selector:
.css('a[href*=topic]::attr(href)')
The a tag in my case was something like <a ... href="topic/1321343">something</a>.
The key is a::attr(href).
Parse your response, narrow the selection down as far as you can, and then read the href value you want.
This is my solution on a project for scraping Microsoft Academia articles. The linked line gets items in "Related Topics" section.
Here is another example:
<span class="title">
</span>
Parsed by:
Link = Link1.css('span.title a::attr(href)').extract()[0]
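As a rough illustration of reading only the attribute rather than the whole element, here is a small sketch using the standard library's ElementTree as a stand-in for Scrapy (an assumption, so it runs without Scrapy). The markup, the example.com URL and the link_and_title helper are hypothetical. In Scrapy itself the same idea would be n.xpath('.//a/@href').get() or n.css('a::attr(href)').get().

```python
import xml.etree.ElementTree as ET

# Hypothetical article markup; the URL is a placeholder.
ARTICLE = ET.fromstring(
    "<article>"
    '<a class="hyperlink headlineSub__link" href="https://example.com/story">'
    "<div><h3>Headline</h3></div>"
    "</a>"
    "</article>"
)


def link_and_title(article):
    """Return the href attribute and heading text of the first link in the article."""
    anchor = article.find(".//a[@href]")  # first descendant <a> that has an href
    if anchor is None:
        return {"Link": None, "Title": None}
    return {
        "Link": anchor.get("href"),            # the attribute value only
        "Title": anchor.findtext("./div/h3"),  # text of the nested heading
    }


print(link_and_title(ARTICLE))
```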
I have such HTML code:
<li class="IDENTIFIER"><h5 class="hidden">IDENTIFIER</h5><p>
<span class="tooltip-iws" data-toggle="popover" data-content="SOME TEXT">
other text</span></p></li>
And I'd like to obtain the SOME TEXT from the data-content.
I wrote
target = soup.find('span', {'class' : 'tooltip-iws'})['data-content']
to get the data-content of the span, and I wrote
identifier_elt = soup.find("li", {'class': 'IDENTIFIER'})
to get the li with that class, but I'm not sure how to combine the two.
But the class tooltip-iws is not unique, and I would get extraneous results if I just used that (there are other spans, before the code snippet, with the same class)
That's why I want to specify my search within the class IDENTIFIER. How can I do that in BeautifulSoup?
Try using a CSS selector:
soup.select_one("li[class='IDENTIFIER'] > p > span")['data-content']
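The underlying idea is to find the li first and then search only inside it. In BeautifulSoup that can also be written by chaining the two calls from the question, soup.find("li", {"class": "IDENTIFIER"}).find("span", {"class": "tooltip-iws"})["data-content"]. Below is a minimal sketch of the same scoping using the standard library's ElementTree as a stand-in (an assumption, so it runs without bs4). The extra "OTHER" list item and the scoped_data_content helper are made up for illustration.

```python
import xml.etree.ElementTree as ET

# The second <li> is the snippet from the question; the first one is a
# hypothetical earlier span with the same class that should be ignored.
DOC = ET.fromstring(
    "<ul>"
    '<li class="OTHER"><p><span class="tooltip-iws" data-content="UNWANTED">x</span></p></li>'
    '<li class="IDENTIFIER"><h5 class="hidden">IDENTIFIER</h5><p>'
    '<span class="tooltip-iws" data-content="SOME TEXT">other text</span></p></li>'
    "</ul>"
)


def scoped_data_content(doc, li_class):
    """Find the <li> with the given class, then look for the span only inside it."""
    li = doc.find(f".//li[@class='{li_class}']")
    if li is None:
        return None
    span = li.find(".//span[@class='tooltip-iws']")
    return span.get("data-content") if span is not None else None


print(scoped_data_content(DOC, "IDENTIFIER"))
```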
Try using selectorlib; it should solve your issue. Comment if you need further assistance.
https://selectorlib.com/
I am trying to scrape the data about the circulars from my college's website using Scrapy for a project, but my spider is not scraping the data properly. There are a lot of blank elements, and I am also unable to scrape the 'href' attributes of the circulars for some reason. I assume my CSS selectors are wrong, but I cannot figure out exactly what I am doing wrong. I copied my CSS selectors using the 'Selector Gadget' Chrome extension. I am still learning Scrapy, so it would be great if you could explain what I was doing wrong.
The Website I am scraping data from is : https://www.imsnsit.org/imsnsit/notifications.php
My code is :
import scrapy
from ..items import CircularItem

class CircularSpider(scrapy.Spider):
    name = "circular"
    start_urls = [
        "https://www.imsnsit.org/imsnsit/notifications.php"
    ]

    def parse(self, response):
        items = CircularItem()
        all = response.css('tr~ tr+ tr font')
        for x in all:
            cirName = x.css('a font::text').extract()
            cirLink = x.css('.list-data-focus a').attrib['href'].extract()
            date = x.css('tr~ tr+ tr td::text').extract()
            items["Name"] = cirName
            items["href"] = cirLink
            items["Date"] = date
            yield items
I modified your parse callback and changed the CSS selectors to XPath. XPath selectors are worth learning: they are powerful and easy to use.
In general it is a bad idea to copy CSS or XPath selectors from automatic tools, because in some cases they give you incorrect results, or a path that matches a single element rather than a general one.
First I select every tr. If you look carefully, some of the tr elements are blank rows used as separators. You can filter them out by trying to select the date: if it is None, skip the row. After that you can select cirName and cirLink.
Also, the markup of this website is poor, which makes it hard to write proper selectors: the elements have few attributes such as class or id. This is the solution I came up with; I know it is not perfect.
def parse(self, response):
    all = response.xpath('//tr')  # select all table rows
    for x in all:
        date = x.xpath('.//td/font[@size="3"]/text()').get()  # separator rows have no date
        if not date:
            continue
        items = CircularItem()  # fresh item per row, so yielded items are not shared
        cirName = x.xpath('.//a/font/text()').get()
        cirLink = x.xpath('.//a[@title="NOTICES / CIRCULARS"]/@href').get()
        items["Name"] = cirName
        items["href"] = cirLink
        items["Date"] = date
        yield items
This is probably a very trivial question, but I am new to Scrapy. I've tried to find a solution to my problem, but I just can't see what is wrong with this code.
My goal is to scrape all of the opera shows from a given website. The data for every show is inside one div with the class "row-fluid row-performance ". I am trying to iterate over these divs to retrieve the data, but it doesn't work: every iteration gives me the content of the first div (I get the same show 19 times instead of 19 different items).
import scrapy
from ..items import ShowItem

class OperaSpider(scrapy.Spider):
    name = "opera"
    allowed_domains = ["http://www.opera.krakow.pl"]
    start_urls = [
        "http://www.opera.krakow.pl/pl/repertuar/na-afiszu/listopad"
    ]

    def parse(self, response):
        divs = response.xpath('//div[@class="row-fluid row-performance "]')
        for div in divs:
            item = ShowItem()
            item['title'] = div.xpath('//h2[@class="item-title"]/a/text()').extract()
            item['time'] = div.xpath('//div[@class="item-time vertical-center"]/div[@class="vcentered"]/text()').extract()
            item['date'] = div.xpath('//div[@class="item-date vertical-center"]/div[@class="vcentered"]/text()').extract()
            yield item
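Change the XPaths inside the for loop to start with .//, that is, put a dot in front of the double slash. An expression starting with // searches the whole document even when it is called on div, which is why every iteration returns the same result. You can also try extract_first() instead of extract() and see if that gives you better results.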
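The difference between searching the whole page and searching inside the current row can be sketched without Scrapy. The example below uses the standard library's ElementTree as a stand-in (an assumption), with hypothetical markup and show titles. The first function mimics the bug by querying the page on every iteration, and the second queries the current div, which is what the leading dot in .// does in Scrapy.

```python
import xml.etree.ElementTree as ET

# Hypothetical page with two show rows.
PAGE = ET.fromstring(
    "<body>"
    '<div class="row"><h2>Tosca</h2></div>'
    '<div class="row"><h2>Carmen</h2></div>'
    "</body>"
)


def titles_document_wide(page):
    """Mimics the bug: each iteration searches the whole page, so the first hit repeats."""
    return [page.find(".//h2").text for _ in page.findall(".//div[@class='row']")]


def titles_per_row(page):
    """Searches relative to the current row, so each row yields its own title."""
    return [div.find(".//h2").text for div in page.findall(".//div[@class='row']")]


print(titles_document_wide(PAGE))
print(titles_per_row(PAGE))
```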
I'm trying to extract a list of links in a web page using python selenium. All the links on the page have the following format in the source code:
Using the following line gives me all the elements on the page with tag name a:
driver.find_elements_by_tag_name("a")
The issue is that I need only a specific set of links, and all these links are within a table. The above code gives me all the links on the page, even those outside the table. Outline of the page source looks like this:
<html>
...
...
<frame name = "frame">
<a href = "unwantedLink">
<form name = "form">
<table name = "table">
<a href = "link1">
<a href = "link2">
<a href = "link3">
</table>
</form>
</frame>
...
</html>
I need link1,link2 and link3, but not unwantedLink. Both the required links and the unwanted link are in the same frame, so switching frames won't work. Is there a way to look for tag names a within the table but not within the parent frame?
Thanks
This should give you what you want:
driver.find_elements_by_css_selector("table[name='table'] a")
The table[name='table'] part selects only the table whose name attribute is set to "table". The rest of the selector then gets all a elements that are descendants of that table, so it does not matter whether the a elements are direct children of the table element or appear inside td elements.
Note that if you have more than one table that has a name attribute set to the value "table", you'll get more elements than you are actually looking for. (There are no guarantees of uniqueness for the name attribute.)