I have the following HTML structure
I want to extract all the links with the class dev-link
<a class="dev-link" href="mailto:info@jourist.com" rel="nofollow" title="Photoshoot"></a>
I am using the below code to extract the link in scrapy
response.css('.dev-link::attr(href)').extract()
I am getting the correct output, but is this the right way to use CSS selectors?
As you can see in the Scrapy documentation, there are two ways to extract data: CSS selectors and XPath selectors. Both work correctly, but XPath takes some practice to master. In my opinion XPath is more powerful; in special cases it lets you extract data more easily than a CSS selector (though of course you can usually get the same data with a CSS selector too).
What you did is correct:
link = response.css('.dev-link::attr(href)').extract_first()
You can also get it with XPath:
link = response.xpath('//a[contains(@class, "dev-link")]/@href').extract_first()
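If you actually want every matching link rather than just the first, the same selectors work with extract() (or getall() in newer Scrapy versions); a minimal sketch:
links = response.css('.dev-link::attr(href)').extract()
# or, equivalently, with XPath
links = response.xpath('//a[contains(@class, "dev-link")]/@href').extract()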
I need to scrape https://libraries.io/search?order=desc&platforms=Maven&sort=rank and extract the links to the webpages within the site. When I run the code below I get way too many links from classes I don't need. (I just need the "project" class). How do I pass an argument to just get the links I need?
for link in soup.findAll("a"):
    print(link.get('href'))
Try using CSS selectors to focus on what you need. Something like:
for link in soup.select('div.project a[href]'):
    print(link['href'])
Output:
/maven/junit:junit
/maven/org.springframework:spring-context
/maven/org.springframework:spring-test
etc.
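For reference, here is a fuller sketch of the same idea, assuming requests and BeautifulSoup are installed; the selector is the one from the answer above:
import requests
from bs4 import BeautifulSoup

url = "https://libraries.io/search?order=desc&platforms=Maven&sort=rank"
soup = BeautifulSoup(requests.get(url).text, "html.parser")

# Only the links inside the project blocks, not every <a> on the page
for link in soup.select('div.project a[href]'):
    print(link['href'])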
I'm trying to scrape a website using Selenium.
I tried using XPath, but the problem is that the rows on the website change over time...
How can I scrape the website so that it gives me the output '21,73'?
<div class="weather_yesterday_mean">21,73</div>
You can just use a CSS selector lookup, which Selenium supports directly. I personally like CSS selectors way more than XPath:
elem = driver.find_element_by_css_selector('div.weather_yesterday_mean')
result = elem.text
If that suits you, please read a bit about CSS selectors, for example here: https://www.w3schools.com/cssref/css_selectors.asp
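Note that in Selenium 4 the find_element_by_* helpers are deprecated, so here is a minimal equivalent sketch using the By API; the URL is a placeholder:
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://example.com/weather")  # placeholder URL

# The class name stays stable even if the row positions change over time
elem = driver.find_element(By.CSS_SELECTOR, "div.weather_yesterday_mean")
print(elem.text)  # e.g. "21,73"

driver.quit()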
I'm working on a scraper project and one of the goals is to get every image link from HTML & CSS of a website. I was using BeautifulSoup & TinyCSS to do that but now I'd like to switch everything on Selenium as I can load the JS.
I can't find in the doc a way to target some CSS parameters without having to know the tag/id/class. I can get the images from the HTML easily but I need to target every "background-image" parameter from the CSS in order to get the URL from it.
ex: background-image: url("paper.gif");
Is there a way to do it or should I loop into each element and check the corresponding CSS (which would be time-consuming)?
You can grab all the style tags and parse them, searching for what you need.
You can also download the CSS files via their resource URLs and parse them.
You can also create an XPath/CSS rule to find the nodes that contain the property you're looking for.
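For example, a minimal sketch of the "grab the style tags and parse them" approach, assuming Selenium 4 and a driver that has already loaded the page; the regex and variable names are illustrative:
import re
from selenium.webdriver.common.by import By

url_pattern = re.compile(r'background-image\s*:\s*url\(["\']?(.*?)["\']?\)')

# Inline <style> blocks
css_text = "\n".join(
    s.get_attribute("innerHTML") for s in driver.find_elements(By.TAG_NAME, "style")
)
background_urls = url_pattern.findall(css_text)

# Linked stylesheets: collect their hrefs so they can be downloaded and parsed the same way
stylesheet_urls = [
    l.get_attribute("href")
    for l in driver.find_elements(By.CSS_SELECTOR, 'link[rel="stylesheet"]')
]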
The reviews are in the selector with multiple classes "row _3wYu6I _3BRC7L".
But when scraping, the response does not contain the above selector; instead it has the "row _3wYu6I _1KVtzT" selector, and that selector gives an empty list. In fact, all the classes with the value "_3BRC7L" on the Flipkart page are converted into classes with the value "_1KVtzT" in the response that I get through scraping. The list of elements that I get when using the XPath of the parent class is also empty. How should I resolve this issue?
The Flipkart page generates dynamic content through AJAX requests. That is the reason I could not get the correct class selectors. Now I have changed my code as per the instructions of the following answer: To retrieve data through ajax requests.
It is very helpful and simple for me, as I am new to scraping, and I did not need to use Scrapy or CasperJS.
Using XPath, you can get the div with a certain class that contains a paragraph whose id contains the value review.
This selector is a very good start, you can build from here any selector for the review.
//div[.//p[contains(@id, 'review')]][@class='col']
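For example, a minimal sketch of applying that selector with lxml, assuming html_text already holds the page source returned by the AJAX request; the variable names are illustrative:
from lxml import html

# html_text is assumed to contain the fetched page source
tree = html.fromstring(html_text)
for div in tree.xpath("//div[.//p[contains(@id, 'review')]][@class='col']"):
    # drill down to the review paragraph inside each matched div
    print(div.xpath(".//p[contains(@id, 'review')]//text()"))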
Conceptually simple question/idea.
Using Scrapy, how do I use a LinkExtractor that extracts and follows only the links matching a given CSS selector?
Seems trivial and like it should already be built in, but I don't see it? Is it?
It looks like I can use an XPath, but I'd prefer using CSS selectors. It seems like they are not supported?
Do I have to write a custom LinkExtractor to use CSS selectors?
From what I understand, you want something similar to restrict_xpaths, but provide a CSS selector instead of an XPath expression.
This is actually a built-in feature in Scrapy 1.0 (currently in a release candidate state); the argument is called restrict_css:
restrict_css
a CSS selector (or list of selectors) which defines regions inside the response where links should be extracted from. Has the same behaviour as restrict_xpaths.
The initial feature request:
CSS support in link extractors
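For completeness, a minimal sketch of restrict_css in a CrawlSpider rule, assuming Scrapy 1.0 or later; the spider name, start URL, CSS selector, and callback are illustrative placeholders:
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class ProjectsSpider(CrawlSpider):
    name = "projects"
    start_urls = ["https://example.com/"]  # placeholder

    rules = (
        # Only follow links found inside regions matching the CSS selector
        Rule(LinkExtractor(restrict_css="div.project"), callback="parse_project"),
    )

    def parse_project(self, response):
        yield {"url": response.url}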