How can I prevent Scrapy from crawling a website endlessly when only part of the URL, in particular a session ID or something similar, is altered and the content behind the URLs is the same?
Is there a way to detect that?
I've read Avoid Duplicate URL Crawling, Scrapy - how to identify already scraped urls and how to filter duplicate requests based on url in scrapy, but sadly that isn't enough to solve my problem.
There are a couple of ways to do this, both related to the questions you've linked to.
With one, you decide what URL parameters make a page unique, and tell your custom duplicate request filter to ignore the other portions of the URL. This is similar to the answer at https://stackoverflow.com/a/13605919 .
Example:
url: http://www.example.org/path/getArticle.do?art=42&sessionId=99&referrerArticle=88
important bits: protocol, host, path, query parameter "art"
implementation:
import urlparse
from urlparse import ParseResult

def url_fingerprint(self, url):
    pr = urlparse.urlparse(url)
    # Keep only the "art" query parameter; session IDs, referrers etc. are dropped
    queryparts = [prt for prt in pr.query.split('&') if prt.split('=')[0] == 'art']
    return urlparse.urlunparse(ParseResult(scheme=pr.scheme, netloc=pr.netloc,
                                           path=pr.path, params=pr.params,
                                           query='&'.join(queryparts),
                                           fragment=pr.fragment))
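A minimal sketch of how this fingerprint could be wired into Scrapy, assuming an older Scrapy release where RFPDupeFilter's request_fingerprint() can be overridden (newer releases move fingerprinting into a separate component, so check your version); the class name and module path are placeholders, and urllib.parse is the Python 3 home of the urlparse functions used above:

from urllib.parse import urlparse, urlunparse, ParseResult
from scrapy.dupefilters import RFPDupeFilter

class ArticleDupeFilter(RFPDupeFilter):
    """Deduplicate on scheme, host, path and the 'art' query parameter only."""

    def request_fingerprint(self, request):
        pr = urlparse(request.url)
        # Drop everything from the query string except the 'art' parameter
        queryparts = [p for p in pr.query.split('&') if p.split('=')[0] == 'art']
        # The fingerprint is only used as a set key, so the trimmed URL itself works
        return urlunparse(ParseResult(scheme=pr.scheme, netloc=pr.netloc,
                                      path=pr.path, params=pr.params,
                                      query='&'.join(queryparts),
                                      fragment=pr.fragment))

# settings.py -- point Scrapy at the custom filter (module path is hypothetical)
# DUPEFILTER_CLASS = 'myproject.dupefilters.ArticleDupeFilter'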
The other way is to determine what bit of information on the page makes it unique, and use either the IgnoreVisitedItems middleware (as per https://stackoverflow.com/a/4201553) or a dictionary/set in your spider's code. If you go the dictionary/set route, you'll have your spider extract that bit of information from the page and check the dictionary/set to see if you've seen that page before; if so, you can stop parsing and return.
What bit of information you'll need to extract depends on your target site. It could be the title of the article, an OpenGraph og:url meta tag, etc.
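A rough sketch of the dictionary/set route, assuming the og:url meta tag is the unique bit of information (the spider name, start URL and selectors are illustrative only):

import scrapy

class ArticleSpider(scrapy.Spider):
    name = 'articles'
    start_urls = ['http://www.example.org/path/getArticle.do?art=42&sessionId=99']

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.seen = set()  # unique identifiers of pages already parsed

    def parse(self, response):
        # The og:url meta tag is one possible unique bit; an article title or ID works too
        canonical = response.xpath('//meta[@property="og:url"]/@content').get()
        if canonical in self.seen:
            return  # same content seen before under a different URL
        self.seen.add(canonical)
        yield {'url': canonical, 'title': response.css('title::text').get()}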
Related
Using Python and Scrapy, I am attempting to search a specific domain URL to find the page(s) which match an inputted page title. The domain's page slug/naming structure is fairly straightforward:
domain.com/[some combination of letters and numbers (I have seen 8, 10 and 12 characters)]
Additional Details:
There does not appear to be any particular pattern to the sequence.
The pages are unique and typically only have external links embedded, but not internal links which can be crawled to help find the pages I am looking for on this domain.
There is not a visible sitemap.xml (domain.com/sitemap.xml) which I can use.
What I have tried:
Using Scrapy's generic Spider and CrawlSpider, I have passed in many start_urls generated from permutations of the possible page slugs, then examined each page title and ran an if/else to check whether it matched my target title: if yes, add it to a list; if no, pass. That works conceptually, but there are far too many permutations to ever get the full list of URLs I am seeking.
The question:
Is there a way for me to use only the domain's URL to return pages with titles matching the inputted criteria, without having to visit each one of those pages? Or, is there an alternative approach I should be taking to most effectively browse/crawl this URL for the page URLs I am seeking?
Thanks for your help.
I'm using Scrapy (in PyCharm v2020.1.3) to build a spider that crawls this webpage: "https://www.woolworths.com.au/shop/browse/drinks/cordials-juices-iced-teas/iced-teas". I want to extract the product names and the breadcrumb in a list format, and save the results in a CSV file.
I tried the following code, but it returns empty brackets []. After inspecting the HTML I discovered that the content is rendered by AngularJS.
If someone has a solution for that it would be great.
Thank you
import scrapy

class ProductsSpider(scrapy.Spider):
    name = 'products'
    start_urls = ['https://www.woolworths.com.au/shop/browse/drinks/cordials-juices-iced-teas/iced-teas']

    def parse(self, response):
        product = response.css('a.shelfProductTile-descriptionLink::text').extract()
        yield {'product_names': product}
You won't be able to get the desired products by parsing the HTML. The page is heavily JavaScript-oriented, so Scrapy won't see that content in the raw response.
The simplest way to get the product names (I'm not sure what you mean by breadcrumbs) is to re-engineer the HTTP requests. The Woolworths website generates the product details via an API. If we can mimic the request the browser makes to obtain that product information, we can get it in a nice, neat format.
First you have to set ROBOTSTXT_OBEY = False in settings.py. Be careful about protracted scrapes of this data, because your IP will probably get banned at some point.
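A minimal settings.py sketch for that; DOWNLOAD_DELAY and AUTOTHROTTLE_ENABLED are extra suggestions to lower the chance of a ban rather than part of the original advice:

# settings.py
ROBOTSTXT_OBEY = False       # needed so Scrapy will fetch URLs disallowed by robots.txt

# Optional but polite: slow the crawl down to reduce the risk of an IP ban
DOWNLOAD_DELAY = 1
AUTOTHROTTLE_ENABLED = True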
Code Example
import scrapy

class TestSpider(scrapy.Spider):
    name = 'test'
    allowed_domains = ['woolworths.com.au']
    data = {
        'excludeUnavailable': 'true',
        'source': 'RR-Best Sellers'}

    def start_requests(self):
        url = 'https://www.woolworths.com.au/apis/ui/products/58520,341057,305224,70660,208073,69391,69418,65416,305227,305084,305223,427068,201688,427069,341058,305195,201689,317793,714860,57624'
        yield scrapy.Request(url=url, meta=self.data, callback=self.parse)

    def parse(self, response):
        data = response.json()
        for a in data:
            yield {
                'name': a['Name'],
            }
Explanation
We start off with our defined URL in start_requests. This is the specific URL of the API Woolworths uses to obtain the information for iced teas. For any other page on Woolworths, the part of the URL after /products/ will be specific to that part of the website.
The reason we're doing this is that driving a browser is slow and brittle. Hitting the API directly is fast, and the information comes back highly structured, which is much better for scraping.
So how do we get that URL, you may be asking? You need to inspect the page and find the correct request. Open the network tab of the developer tools and reload the website; you'll see a bunch of requests. Usually the largest request is the one with all the data. Clicking it and then clicking Preview gives you a box on the right-hand side with all the details of the products.
The preview shows the product data. We can then grab the request URL and anything else we need from this request.
I will often copy the request as cURL (a Bash command) and enter it into curl.trillworks.com, which converts cURL to Python, giving you nicely formatted headers and any other data needed to mimic the request.
Putting this into Jupyter and playing about, it turns out you actually only need the params, NOT the headers, which is much better.
So back to the code. We make a request, using the meta argument to pass the data along (because the dict is defined outside the function, we have to refer to it as self.data), and specify parse as the callback.
We can use the response.json() method to convert the JSON response into Python dictionaries, one per product. YOU MUST have Scrapy v2.2+ to use this method. Otherwise you could use data = json.loads(response.text), but you'll have to put import json at the top of the script.
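A small sketch of that fallback for Scrapy versions older than 2.2 (trimmed down to the parse change; the product IDs are a subset of the ones above):

import json
import scrapy

class TestSpiderOldScrapy(scrapy.Spider):
    name = 'test_old'
    start_urls = ['https://www.woolworths.com.au/apis/ui/products/58520,341057,305224']

    def parse(self, response):
        # json.loads(response.text) is the pre-2.2 equivalent of response.json()
        data = json.loads(response.text)
        for a in data:
            yield {'name': a['Name']}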
From the preview, and from playing about with the JSON in requests, we can see these dictionaries actually sit inside a list, so we can use a for loop to go round each product, which is what we are doing here.
We then yield a dictionary with the data we want: a refers to each product, which is its own dictionary, and a['Name'] looks up the key 'Name' in that dictionary, giving us the product name. To get a better feel for this, I always use the requests package in Jupyter to figure out the correct way to get the data I want.
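For example, a quick exploration in Jupyter or a plain Python shell using the URL and params from above might look like this (what the API actually returns can change, so poke at the result interactively):

import requests

url = ('https://www.woolworths.com.au/apis/ui/products/'
       '58520,341057,305224,70660,208073')
params = {'excludeUnavailable': 'true', 'source': 'RR-Best Sellers'}

r = requests.get(url, params=params)
data = r.json()               # a list with one dictionary per product
print(type(data), len(data))
print(data[0]['Name'])        # the same 'Name' key the spider yields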
The only thing left to do is to use scrapy crawl test -o products.csv to output this to a CSV file.
I can't really help you more than this until you specify what other data you want from this page. Please remember that you're going against what the site wants you to scrape, and that for any other pages on that website you will need to find the specific API URL that returns their products. I have given you the way to do this; if you want to automate it, it's worth your while struggling with it yourself. We are here to help, but an attempt on your part is how you're going to learn coding.
Additional Information on Approaching Dynamic Content
There is a wealth of information on this topic. Here are some guidelines to think about when looking at JavaScript-oriented websites. The default is that you should try to re-engineer the requests the browser makes to load the page's information. That is what the JavaScript on this site, and many others, is doing: it makes an HTTP request so new information can be displayed without reloading the page. If we can mimic that request, we can get the information we want. This is the most efficient way to get dynamic content.
In order of preference
Re-engineering the HTTP requests
scrapy-splash
scrapy-selenium
Importing the selenium package directly into your scripts
Scrapy-splash is slightly better than the selenium package, as it pre-renders the page, giving you access to the selectors with the data already in place. Selenium is slow and prone to errors, but it will let you mimic browser activity.
There are multiple ways to include Selenium in your scripts; see below for an overview.
Recommended Reading/Research
Look at the scrapy documentation with regard to dynamic content here
This will give you an overview of the steps for handling dynamic content. Generally speaking, Selenium should be thought of as a last resort; it's pretty inefficient for larger-scale scraping.
If you are considering adding the selenium package directly into your script: this might be the lowest barrier to entry for getting your script working, but it's not necessarily that efficient. At the end of the day Scrapy is a framework, but there is a lot of flexibility for adding in third-party packages. The spider scripts are just Python classes using the Scrapy architecture in the background. As long as you're mindful of the response, and translate some of the Selenium calls to work with Scrapy, you should be able to drop Selenium into your scripts. I would say this solution is probably the least efficient, though.
Consider using scrapy-splash: Splash pre-renders the page and lets you add in JavaScript execution. Docs are here, and there's a good article from Scrapinghub here.
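A minimal scrapy-splash sketch, assuming a Splash instance is running locally and the scrapy-splash middlewares plus SPLASH_URL are configured in settings.py as described in its docs (the spider name and wait time are illustrative):

import scrapy
from scrapy_splash import SplashRequest

class IcedTeaSplashSpider(scrapy.Spider):
    name = 'iced_tea_splash'

    def start_requests(self):
        url = 'https://www.woolworths.com.au/shop/browse/drinks/cordials-juices-iced-teas/iced-teas'
        # 'wait' gives the page's JavaScript a couple of seconds to render before the HTML comes back
        yield SplashRequest(url, callback=self.parse, args={'wait': 2})

    def parse(self, response):
        # With the page pre-rendered, the original CSS selector now has data to match
        for name in response.css('a.shelfProductTile-descriptionLink::text').getall():
            yield {'name': name.strip()}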
Scrapy-selenium is a package with a custom Scrapy downloader middleware that lets you perform Selenium actions and execute JavaScript. Docs here. You'll need to have a play around to work out the procedure, as the docs don't have the same level of detail as the selenium package itself.
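And a rough scrapy-selenium sketch; SeleniumRequest and the settings keys in the comments follow the scrapy-selenium docs, while the driver choice, paths, wait time and selector to wait for are assumptions to adapt:

import scrapy
from scrapy_selenium import SeleniumRequest
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC

class IcedTeaSeleniumSpider(scrapy.Spider):
    name = 'iced_tea_selenium'

    # settings.py (per the scrapy-selenium docs):
    # SELENIUM_DRIVER_NAME = 'chrome'
    # SELENIUM_DRIVER_EXECUTABLE_PATH = '/path/to/chromedriver'
    # SELENIUM_DRIVER_ARGUMENTS = ['--headless']
    # DOWNLOADER_MIDDLEWARES = {'scrapy_selenium.SeleniumMiddleware': 800}

    def start_requests(self):
        url = 'https://www.woolworths.com.au/shop/browse/drinks/cordials-juices-iced-teas/iced-teas'
        # Wait until the product tiles are present so the AngularJS front end has rendered
        yield SeleniumRequest(
            url=url,
            callback=self.parse,
            wait_time=10,
            wait_until=EC.presence_of_element_located(
                (By.CSS_SELECTOR, 'a.shelfProductTile-descriptionLink')),
        )

    def parse(self, response):
        for name in response.css('a.shelfProductTile-descriptionLink::text').getall():
            yield {'name': name.strip()}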
I am new to Python and web crawling. I intend to scrape the links in the top stories section of a website. I was told to look at its Ajax requests and send similar ones. The problem is that all the requests for the links are the same: http://www.marketwatch.com/newsviewer/mktwheadlines
My question is how to extract links from an infinite-scrolling box like this. I am using Beautiful Soup, but I think it's not suitable for this task. I am also not familiar with Selenium or JavaScript. I do know how to scrape certain requests with Scrapy, though.
It is indeed an AJAX request. If you take a look at the network tab in your browser inspector, you can see that it's making a POST request to download the urls of the articles.
Every value is self-explanatory here except maybe docid and timestamp. docid seems to indicate which box to pull articles for (there are multiple boxes on the page), and it appears to be the id attached to the <li> element under which the article urls are stored.
Fortunately, in this case POST and GET are interchangeable, and the timestamp parameter doesn't seem to be required. So you can actually view the results in your browser by right-clicking the url in the inspector and selecting "copy location with parameters":
http://www.marketwatch.com/newsviewer/mktwheadlines?blogs=true&commentary=true&docId=1275261016&premium=true&pullCount=100&pulse=true&rtheadlines=true&topic=All%20Topics&topstories=true&video=true
This example has the timestamp parameter removed and pullCount increased to 100, so simply requesting it will return 100 article urls.
You can mess around more to reverse-engineer how the website does it and what every parameter is for, but this is a good start.
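As a small sketch of that first step with requests and BeautifulSoup (which the question already uses); the assumption is that the endpoint returns an HTML fragment containing the headline links, so adjust the parsing to whatever actually comes back:

import requests
from bs4 import BeautifulSoup

url = ('http://www.marketwatch.com/newsviewer/mktwheadlines'
       '?blogs=true&commentary=true&docId=1275261016&premium=true'
       '&pullCount=100&pulse=true&rtheadlines=true'
       '&topic=All%20Topics&topstories=true&video=true')

resp = requests.get(url)
soup = BeautifulSoup(resp.text, 'html.parser')

# Collect every article link found in the returned fragment
links = [a['href'] for a in soup.find_all('a', href=True)]
print(len(links), links[:5])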
The problem
I'm using the Wikipedia API to get page HTML which I parse. I use queries like this one to get the HTML for the first section of a page.
The MediaWiki API provides a handy parameter, redirects, which causes the API to automatically follow redirects to other pages. For example, if I search for 'Cats' with https://en.wikipedia.org/w/api.php?page=Cats&redirects, I will be shown the results for Cat, because Cats redirects to Cat.
I'd like a similar function for disambiguation pages such as this, by which if I arrive at a disambiguation page, I am automatically redirected to the first link. For example, if I make a request to a page like Mercury, I'd automatically be redirected to Mercury (element), as it is the first link listed in the page.
The Python HTML parser BeautifulSoup is fairly slow on large documents. By only requesting the first section of articles (that's all I need for my use), using section=0, I can parse it quickly. This is perfect for most articles. But for disambiguation pages, the first section does not include any of the links to specific pages, making it a poor solution. But if I request more than the first section, the HTML loading slows down, which is unnecessary for most articles. See this query for an example of a disambiguation page in which links are not included in the first section.
What I have so far
As of right now, I've gotten as far as detecting when a disambiguation page is reached. I use code like
bs4.BeautifulSoup(page_html, "html.parser").find("p", recursive=False).get_text().endswith(("refer to:", "refers to:"))
I also spent a while trying to write code that automatically followed a link, before I realized that the links were not included in the first section.
My constraints
I'd prefer to keep the number of requests made to a minimum. I also need to be parsing as little HTML as possible, because speed is essential for my application.
Possible solutions (which I need help executing)
I could envision several solutions to this problem:
A way within the MediaWiki API to automatically follow the first link from disambiguation pages
A method within the Mediawiki API that allows it to return different amounts of HTML content based on a condition (like presence of a disambiguation template)
A way to dramatically improve the speed of bs4 so that it doesn't matter if I end up having to parse the entire page HTML
As Tgr and everybody said, no, such a feature doesn't exist because it doesn't make sense. The first link in a disambiguation page doesn't have any special status or meaning.
As for the existing API, see https://www.mediawiki.org/wiki/Extension:Disambiguator#API_usage
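As a hedged sketch of what that extension exposes: on wikis running the Disambiguator extension (including English Wikipedia), disambiguation pages are flagged in the pageprops of a standard query, so you can detect them cheaply and then apply your own follow-the-first-link logic; the API does not do that step for you.

import requests

API = 'https://en.wikipedia.org/w/api.php'

def is_disambiguation(title):
    """Return True if the Disambiguator extension flags this page as a disambiguation page."""
    params = {
        'action': 'query',
        'titles': title,
        'prop': 'pageprops',
        'ppprop': 'disambiguation',
        'redirects': 1,
        'format': 'json',
    }
    data = requests.get(API, params=params).json()
    pages = data['query']['pages']
    return any('pageprops' in page and 'disambiguation' in page['pageprops']
               for page in pages.values())

print(is_disambiguation('Mercury'))            # True: a disambiguation page
print(is_disambiguation('Mercury (element)'))  # False: a regular article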
By the way, the "bot policy" you linked does not really apply to crawlers/scrapers; the only relevant policy/guideline is the User-Agent policy.
I'm trying to scrape a site that uses a ton of relative URLs. One archive page has links to many individual entries, but the URL is given like "../2011/category/example.html"
For each entry, I want to open the page and scrape it, but I'm not sure of the most efficient way to handle that. I'm thinking of splitting the starting URL on "/", popping off the last item and re-joining the rest to get the base URL.
That seems like such a cludge, though. Is there a cleaner way?
To construct an absolute URL from a relative URL, use urlparse.urljoin (docs here).
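A short sketch of urljoin in action, shown with Python 3's urllib.parse (on Python 2 the same function lives in the urlparse module the answer refers to; the base URL is just an example):

from urllib.parse import urljoin

base = 'http://example.com/2012/archive.html'
relative = '../2011/category/example.html'

print(urljoin(base, relative))
# -> http://example.com/2011/category/example.html

# Inside a Scrapy callback you can also use response.urljoin(relative),
# which resolves against the URL of the page that was just fetched.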
If you are using a browsing system like mechanize for crawling, however, you can simply fetch an absolute url initially and then feed the browser relative urls after that. The browser will keep track of state and fetch the URL from the same domain as the previous request automatically.