Conceptually simple question/idea.
Using Scrapy, how do I use LinkExtractor so that it only extracts and follows links matching a given CSS selector?
Seems trivial and like it should already be built in, but I don't see it. Is it?
It looks like I can use an XPath, but I'd prefer CSS selectors. It seems they are not supported?
Do I have to write a custom LinkExtractor to use CSS selectors?
From what I understand, you want something similar to restrict_xpaths, but providing a CSS selector instead of an XPath expression.
This is actually a built-in feature in Scrapy 1.0 (currently in a release candidate state); the argument is called restrict_css:
restrict_css
a CSS selector (or list of selectors) which defines regions inside the response where links should be extracted from. Has the same behaviour as restrict_xpaths.
The initial feature request:
CSS support in link extractors
I'm working on a scraper project, and one of the goals is to get every image link from a website's HTML & CSS. I was using BeautifulSoup & TinyCSS to do that, but now I'd like to switch everything to Selenium so that I can load the JS.
I can't find in the docs a way to target CSS properties without already knowing the tag/id/class. I can get the images from the HTML easily, but I need to target every "background-image" property in the CSS in order to get the URL from it.
ex: background-image: url("paper.gif");
Is there a way to do it, or should I loop over each element and check its corresponding CSS (which would be time-consuming)?
You can grab all the <style> tags and parse them, searching for what you need.
You can also download the CSS files via their resource URLs and parse those.
You can also create an XPath/CSS rule to find nodes that contain the property you're looking for.
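A hedged sketch of the first two approaches (placeholder URL; assumes Selenium 4 and a local chromedriver): pull the inline <style> blocks, regex out the background-image URLs, and collect the external stylesheet hrefs for a separate download-and-parse pass.

import re
from selenium import webdriver
from selenium.webdriver.common.by import By

# Matches url("..."), url('...') or url(...) after background-image:
URL_RE = re.compile(r'background-image\s*:\s*url\(["\']?(.*?)["\']?\)')

driver = webdriver.Chrome()
driver.get("http://example.com")  # placeholder URL

# Inline <style> blocks.
image_urls = []
for style in driver.find_elements(By.TAG_NAME, "style"):
    image_urls.extend(URL_RE.findall(style.get_attribute("innerHTML")))

# External stylesheets: grab the hrefs, then download and parse each one.
sheet_urls = [link.get_attribute("href")
              for link in driver.find_elements(By.CSS_SELECTOR, 'link[rel="stylesheet"]')]

driver.quit()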
I am trying to create a "universal" XPath, so that when I run the spider, it will be able to download the hotel name for each hotel on the list.
This is the XPath that I need to convert:
//*[@id="offerPage"]/div[3]/div[1]/div[1]/div/div/div/div/div[2]/div/div[1]/h3/a
Can anyone point me in the right direction?
This is the example of how they did it in the scrapy docs:
https://github.com/scrapy/quotesbot/blob/master/quotesbot/spiders/toscrape-xpath.py
For text, they have:
'text': quote.xpath('./span[@class="text"]/text()').extract_first(),
When you open "http://quotes.toscrape.com/" and copy the XPath for a quote's text, you get:
/html/body/div/div[2]/div[1]/div[1]/span[1]
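Run both in scrapy shell "http://quotes.toscrape.com/" to see the difference: today they return the same text, but only the attribute-based XPath survives layout changes.

# Attribute-based XPath from the quotesbot example (robust):
response.xpath('//span[@class="text"]/text()').extract_first()
# Positional path from the browser's "copy XPath" (brittle):
response.xpath('/html/body/div/div[2]/div[1]/div[1]/span[1]/text()').extract_first()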
When you look at the HTML you are scraping, just using "copy XPath" from the browser's source viewer is not enough.
You need to look at the attributes that the HTML tags have.
Of course, using just tag types as an XPath can work, but what if not every page you are going to scrape follows that pattern?
The Scrapy example you are using relies on the span's class attribute to point precisely at the target tag.
I suggest reading a bit more about XPath (for example here) to understand how flexible your search patterns can be.
If you want to go even broader, reading about DOM structure will also be useful. Let us know if you need more pointers.
I have the following HTML structure
I want to extract all the links with the class dev-link:
<a class="dev-link" href="mailto:info@jourist.com" rel="nofollow" title="Photoshoot"></a>
I am using the code below to extract the link in Scrapy:
response.css('.dev-link::attr(href)').extract()
I am getting the correct output, but is this the right way to use CSS selectors?
As you can see in the Scrapy documentation, there are two ways to extract data: CSS selectors and XPath selectors. Both work correctly, but XPath needs some practice to master. In my opinion, XPath is more powerful; in special cases you can extract data more easily than with a CSS selector (though of course you can usually get the same data with CSS selectors too).
What you did is correct:
link = response.css('.dev-link::attr(href)').extract_first()
and you can also get it with the following:
link = response.xpath('//a[contains(@class, "dev-link")]/@href').extract_first()
I have created a spider which is supposed to crawl multiple websites, and I need to define different rules for each URL in the start_urls list.
start_urls = [
    "http://URL1.com/foo",
    "http://URL2.com/bar",
]
rules = [
    Rule(LinkExtractor(restrict_xpaths="//" + xpathString + "/a"),
         callback="parse_object", follow=True)
]
The only thing that needs to change in the rule is the XPath string for restrict_xpaths. I've already come up with a function that can get the XPath I want dynamically from any website.
I figured I can just get the current URL that the spider will be scraping and pass it through the function and then pass the resulting xpath to the rule.
Unfortunately, I've been searching, and it seems this isn't possible, since Scrapy uses a scheduler and compiles all the start_urls and rules right from the start. Is there any workaround to achieve what I'm trying to do?
I assume you are using CrawlSpider.
By default, CrawlSpider rules are applied for all pages (whatever the domain) your spider is crawling.
If you are crawling multiple domains in the start URLs and want different rules for each domain, you won't be able to tell Scrapy which rule(s) to apply to which domain (I mean, it's not available out of the box).
You can run your spider with one start URL at a time (and domain-specific rules, built dynamically at init time), and run multiple spiders in parallel.
Another option is to subclass CrawlSpider and customize it for your needs:
- build rules as a dict using domains as keys, with the values being the list of rules to apply for that domain (see the _compile_rules method),
- and apply different rules depending on the domain of the response (see _requests_to_follow).
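A minimal sketch of the first option: one spider instance per start URL, with the rule built at init time. Here xpath_for() stands in for the asker's function that derives the XPath for a given site (a hypothetical name):

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class PerSiteSpider(CrawlSpider):
    name = "per_site"

    def __init__(self, start_url=None, *args, **kwargs):
        self.start_urls = [start_url]
        xpath_string = xpath_for(start_url)  # hypothetical helper
        # Rules must be set before super().__init__(), which compiles them.
        self.rules = (
            Rule(LinkExtractor(restrict_xpaths="//" + xpath_string + "/a"),
                 callback="parse_object", follow=True),
        )
        super().__init__(*args, **kwargs)

Then launch one process per site, e.g. scrapy crawl per_site -a start_url="http://URL1.com/foo".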
You can just override the parse method. This method receives a Scrapy Response object with the full HTML content, and you can run XPath on it. You can also retrieve the URL from the response object and, depending on the URL, run a custom XPath.
Please check out the docs here: http://doc.scrapy.org/en/latest/topics/request-response.html
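A hedged sketch of that idea (assumes Scrapy >= 1.4 for response.follow; the domain-to-XPath mapping and parse_object are illustrative, not from the question):

from urllib.parse import urlparse
import scrapy

class MultiSiteSpider(scrapy.Spider):
    name = "multi_site"
    start_urls = ["http://URL1.com/foo", "http://URL2.com/bar"]

    # Hypothetical mapping from domain to the region XPath for that site.
    XPATHS = {
        "URL1.com": '//div[@class="listing"]',
        "URL2.com": '//section[@id="offers"]',
    }

    def parse(self, response):
        region = self.XPATHS.get(urlparse(response.url).netloc)
        if region is None:
            return
        # Follow every link found inside the site-specific region.
        for href in response.xpath(region + "//a/@href").getall():
            yield response.follow(href, callback=self.parse_object)

    def parse_object(self, response):
        yield {"url": response.url}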
I've written a simple web parser using BeautifulSoup, but I want to know whether a CSS class's "position" attribute is fixed or absolute.
All the CSS attributes are defined in stylesheets linked from the HTML header, with class rules in the CSS file like this:
.slideABox , .slideBBox{
max-width:320px; position:relative; margin: 0 auto;
}
So how can I check the attribute of a CSS class in Python, just like in JavaScript?
BeautifulSoup is a poor choice for what you want to accomplish, because it does not come with a rendering engine (therefore, CSS rules are not even taken into consideration when parsing the page).
Obviously you could parse the CSS rules manually using an appropriate tool (e.g. http://pythonhosted.org/tinycss/), but I wouldn't recommend that, because CSS properties can also be inherited, and you'll end up with false results unless your HTML page is a very simple one.
Instead, I suggest taking a look at Selenium, which is essentially a browser wrapper and has Python bindings. Its element object has a value_of_css_property method, which will probably suit your needs (see the APIs here).
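A minimal sketch of that method (placeholder URL; assumes Selenium 4, a local chromedriver, and an element with the slideABox class from your CSS on the page):

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("http://example.com")  # placeholder URL

# value_of_css_property returns the *computed* style, so inherited and
# cascaded rules are already resolved for you.
box = driver.find_element(By.CSS_SELECTOR, ".slideABox")
print(box.value_of_css_property("position"))  # e.g. "relative"

driver.quit()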