Confusion about Scrapy redirect behavior? - python

So I am trying to scrape articles from a news website that has an infinite-scroll layout, so the following happens:
example.com has the first page of articles
example.com/page/2/ has the second page
example.com/page/3/ has the third page
And so on: as you scroll down, the URL changes. To account for that, I wanted to scrape the first x pages of articles, so I did the following:
start_urls = ['http://example.com/']
for x in range(1, x):
    new_url = 'http://www.example.com/page/' + str(x) + '/'
    start_urls.append(new_url)
It seems to work fine for the first 9 pages and I get something like the following:
Redirecting (301) to <GET http://example.com/page/4/> from <GET http://www.example.com/page/4/>
Redirecting (301) to <GET http://example.com/page/5/> from <GET http://www.example.com/page/5/>
Redirecting (301) to <GET http://example.com/page/6/> from <GET http://www.example.com/page/6/>
Redirecting (301) to <GET http://example.com/page/7/> from <GET http://www.example.com/page/7/>
2017-09-08 17:36:23 [scrapy.extensions.logstats] INFO: Crawled 3 pages (at 3 pages/min), scraped 0 items (at 0 items/min)
Redirecting (301) to <GET http://example.com/page/8/> from <GET http://www.example.com/page/8/>
Redirecting (301) to <GET http://example.com/page/9/> from <GET http://www.example.com/page/9/>
Redirecting (301) to <GET http://www.example.com/> from <GET http://www.example.com/page/10/>
Redirecting (301) to <GET http://www.example.com/> from <GET http://www.example.com/page/11/>
Redirecting (301) to <GET http://www.example.com/> from <GET http://www.example.com/page/12/>
Redirecting (301) to <GET http://www.example.com/> from <GET http://www.example.com/page/13/>
Starting from page 10, it redirects to example.com/ instead of the original link, example.com/page/10/. What could be causing this behavior?
I looked into a couple of options like dont_redirect, but I just don't understand what is happening. What could be the reason for this redirect behavior, especially since no redirect happens when you type a link like example.com/page/10/ directly into the browser?
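For reference, the dont_redirect option I was looking at would be applied per request through the Request meta dict, roughly like this (just a sketch, not my actual spider; assumes scrapy is imported at module level):
def start_requests(self):
    for x in range(1, 20):
        yield scrapy.Request(
            'http://www.example.com/page/' + str(x) + '/',
            # let the raw 301 reach the callback instead of being followed
            meta={'dont_redirect': True, 'handle_httpstatus_list': [301, 302]},
        )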
Any help would be greatly appreciated, thanks!!
[EDIT]
class spider(CrawlSpider):
    start_urls = ['http://example.com/']
    for x in range(startPage, endPage):
        new_url = 'http://www.example.com/page/' + str(x) + '/'
        start_urls.append(new_url)

    custom_settings = {'DEPTH_PRIORITY': 1, 'DEPTH_LIMIT': 1}

    rules = (
        Rule(LinkExtractor(allow=('some regex here',), deny=('example\.com/page/.*', 'some other regex',)), callback='parse_article'),
    )

    def parse_article(self, response):
        # some parsing work here
        yield item
Is it because I include example\.com/page/.* in the LinkExtractor? Shouldn't that only apply to extracted links, not the start_urls, though?

It looks like this site uses some kind of security check based only on the User-Agent request header.
So you only need to add a common User-Agent in the settings.py file:
USER_AGENT = 'Mozilla/5.0'
Also, the spider doesn't necessarily need the start_urls attribute to get the starting sites; you can also use the start_requests method, so replace all of the start_urls construction with:
from scrapy import Request

class spider(CrawlSpider):
    ...
    def start_requests(self):
        for x in range(1, 20):
            yield Request('http://www.example.com/page/' + str(x) + '/')
    ...
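Putting the two suggestions together, a minimal version of the spider could look roughly like this (an untested sketch; the class name, page range and user agent are placeholders, and the original rules/callbacks would still go in):
import scrapy
from scrapy.spiders import CrawlSpider

class ArticleSpider(CrawlSpider):
    name = 'articles'
    # a common browser-like User-Agent, set per spider instead of in settings.py
    custom_settings = {'USER_AGENT': 'Mozilla/5.0'}

    # rules = (...)  # keep the original Rule/LinkExtractor setup here

    def start_requests(self):
        for x in range(1, 20):
            yield scrapy.Request('http://www.example.com/page/' + str(x) + '/')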

Related

Scrapy shell URL returns 404 for endless scroll

I am practicing using the scrapy shell in the command prompt, and here's the URL:
https://shopee.com.my/shop/145423/followers/?__classic__=1
In the Google Chrome developer tools (F12 pressed), under the Network section, I cleared everything, scrolled down the website, and got this link:
https://shopee.com.my/shop/145423/followers/?offset=60&limit=20&offset_of_offset=0&_=1610787400133
The link is supposed to return some data, but when trying
scrapy shell https://shopee.com.my/shop/145423/followers/?offset=60&limit=20&offset_of_offset=0&_=1610787400133
I got a 404 as the response.
I think there's a popup that needs the user to click on a language choice, and this is what causes the problem.
How can such a popup be dealt with or skipped?
Use a User-Agent. You can also set the User-Agent on the command line.
headers={'User-Agent': 'Mybot'}
>>> r = scrapy.Request(url, headers=headers)
>>> fetch(r)
2021-01-16 16:53:11 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://shopee.com.my/shop/145423/followers/?offset=60&limit=20&offset_of_offset=0&_=1610787400133&__classic__=1> from <GET https://shopee.com.my/shop/145423/followers/?offset=60&limit=20&offset_of_offset=0&_=1610787400133>
2021-01-16 16:53:11 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://shopee.com.my/shop/145423/followers/?offset=60&limit=20&offset_of_offset=0&_=1610787400133&__classic__=1> (referer: None)
>>> response.status
200
>>>
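For completeness, the User-Agent can also be passed when launching the shell, using the -s option; note the quotes around the URL, since the & characters would otherwise be interpreted by the shell:
scrapy shell -s USER_AGENT='Mozilla/5.0' 'https://shopee.com.my/shop/145423/followers/?offset=60&limit=20&offset_of_offset=0&_=1610787400133'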

Stop Scrapy crawler from external domains

I am new to Scrapy and trying to run a crawler on a few websites, where my allowed domains and start URLs look like this:
allowed_domains = ['www.siemens.com']
start_urls= ['https://www.siemens.com/']
The problem is that the website also contains links to different domains like
"siemens.fr" and "siemens.de",
and I don't want Scrapy to scrape those websites as well. Any suggestions on how to tell the spider not to crawl them?
I am trying to build a more general spider, so that it is applicable to other websites as well.
Update #2
As suggested by Felix Eklöf, I tried to adjust my code and changed some settings. This is what the code looks like now.
The spider:
class webSpider(scrapy.Spider):
    name = 'web'
    allowed_domains = ['eaton.com']
    start_urls = ['https://www.eaton.com/us/']
    # include_patterns = ['']
    exclude_patterns = ['.*\.(css|js|gif|jpg|jpeg|png)']
    # proxies = 'proxies.txt'
    response_type_whitelist = ['text/html']
    # response_type_blacklist = []

    rules = [Rule(LinkExtractor(allow = (allowed_domains)), callback='parse_item', follow=True)]
And the settings look like this:
SPIDER_MIDDLEWARES = {
    'smartspider.middlewares.SmartspiderSpiderMiddleware': 543,
    # 'scrapy_testmaster.TestMasterMiddleware': 950
}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
    'smartspider.middlewares.SmartspiderDownloaderMiddleware': 543,
    'smartspider.middlewares.FilterResponses': 543,
    'smartspider.middlewares.RandomProxyForReDirectedUrls': 650,
    "scrapy.spidermiddlewares.offsite.OffsiteMiddleware": 543,
}

ITEM_PIPELINES = {
    'smartspider.pipelines.SmartspiderPipeline': 300,
}
Please let me know if any of these settings are interfering with the spider sticking to internal links and staying within the given domain.
Update #3
As suggested by @Felix, I updated the spider, which now looks like this:
class WebSpider(CrawlSpider):
    name = 'web'
    allowed_domains = ['eaton.com']
    start_urls = ['https://www.eaton.com/us/']
    # include_patterns = ['']
    exclude_patterns = ['.*\.(css|js|gif|jpg|jpeg|png)']
    response_type_whitelist = ['text/html']

    rules = [Rule(LinkExtractor(), callback='parse_item', follow=True)]
The settings look like this:
#SPIDER_MIDDLEWARES = {
#    'smartspider.middlewares.SmartspiderSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
    # 'smartspider.middlewares.SmartspiderDownloaderMiddleware': 543,
    'smartspider.middlewares.FilterResponses': 543,
    'smartspider.middlewares.RandomProxyForReDirectedUrls': 650,
}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
#ITEM_PIPELINES = {
#    'smartspider.pipelines.SmartspiderPipeline': 300,
#}
But the spider is still crawling other domains.
The logs do show, however, that it is rejecting offsite requests when running against another website (thalia.de):
2021-01-04 19:46:42 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'www.rtbhouse.com': <GET https://www.rtbhouse.com/privacy-center/>
2021-01-04 19:46:42 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'www.quicklizard.com': <GET https://www.quicklizard.com/terms-of-service/>
2021-01-04 19:46:42 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.thalia.de/shop/hilfe-gutschein/show/> (referer: https://www.thalia.de/)
2021-01-04 19:46:43 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.thalia.de/shop/hilfe-kaufen/show/> (referer: https://www.thalia.de/)
2021-01-04 19:46:43 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.thalia.de/shop/home/login/login/?source=%2Fde.buch.shop%2Fshop%2F2%2Fhome%2Fkundenbewertung%2Fschreiben%3Fartikel%3D149426569&jumpId=2610518> (referer: https://www.thalia.de/shop/home/artikeldetails/ID149426569.html)
2021-01-04 19:46:43 [scrapy.extensions.logstats] INFO: Crawled 453 pages (at 223 pages/min), scraped 0 items (at 0 items/min)
2021-01-04 19:46:43 [scrapy.spidermiddlewares.depth] DEBUG: Ignoring link (depth > 2): https://www.thalia.de/shop/home/show/
Is the spider working as expected, or is the problem with a specific website?
Try removing "www." from allowed_domains.
According to the Scrapy docs you should do it like this:
Let's say your target url is https://www.example.com/1.html, then add 'example.com' to the list.
So, in your case:
allowed_domains = ['siemens.com']
start_urls= ['https://www.siemens.com/']
Please have a closer look at the other, country-specific domains, such as siemens.de, siemens.dk, siemens.fr, etc.
If you run a curl call against the German site, curl --head https://www.siemens.de, you will see a 301 status code.
The URL is redirected to https://new.siemens.com/de/de.html.
The same pattern is observed for the other countries: the ISO 3166-1 alpha-2 country code is embedded in the URL. If you need to filter, this is the place to tackle the problem.
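If you go down that route, one option is to deny those country-coded URLs in the link extractor; a rough sketch (the listed country codes are just examples):
rules = [
    Rule(LinkExtractor(allow_domains=['siemens.com'],
                       deny=(r'new\.siemens\.com/(de|fr|dk|it)/',)),
         callback='parse_item', follow=True),
]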
I took a closer look at your code and might have found the issue.
I believe the issue is on this line:
rules = [Rule(LinkExtractor(allow = (allowed_domains)), callback='parse_item', follow=True)]
The LinkExtractor class expects the allow argument to be a str or a list of strs; however, those strings are also interpreted as regular expressions. And since you have a . (dot) in the URL, the regular expression will treat it as "any character".
Instead you can just use the argument allow_domains, like this:
rules = [Rule(LinkExtractor(allow_domains = allowed_domains), callback='parse_item', follow=True)]
But these requests should still be filtered out by allowed_domains, so I'm not sure why that's not working; try this anyway.
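To see why the unescaped dot matters, here is a quick check with Python's re module (the URL is made up purely for illustration):
import re

# 'eaton.com' as a regex: the dot matches any character, so unrelated hosts slip through
print(bool(re.search(r'eaton.com', 'https://beaton-commerce.example/')))   # True
print(bool(re.search(r'eaton\.com', 'https://beaton-commerce.example/')))  # False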
I can't really figure out what's wrong, but I made a test project of my own. It's a totally clean project, only changed ROBOTSTXT_OBEY = False in settings.py.
I noticed that your spider class extends scrapy.Spider but uses rules; I believe that class variable is only used by the generic CrawlSpider.
Here's my test spider:
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule, CrawlSpider

class TestSpider(CrawlSpider):
    name = 'web'
    allowed_domains = ['stackoverflow.com']
    start_urls = ['https://www.stackoverflow.com/']

    rules = [Rule(link_extractor=LinkExtractor(), follow=True)]
And it seems to work fine:
2021-01-04 12:50:13 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2021-01-04 12:50:13 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://stackoverflow.com/> from <GET https://www.stackoverflow.com/>
2021-01-04 12:50:13 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://stackoverflow.com/> (referer: None)
2021-01-04 12:50:13 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'stackexchange.com': <GET https://stackexchange.com/sites>
2021-01-04 12:50:13 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'stackoverflow.blog': <GET https://stackoverflow.blog>
2021-01-04 12:50:13 [scrapy.dupefilters] DEBUG: Filtered duplicate request: <GET https://stackoverflow.com/#for-developers> - no more duplicates will be shown (see DUPEFILTER_DEBUG to show all duplicates)
2021-01-04 12:50:14 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'www.g2.com': <GET https://www.g2.com/products/stack-overflow-for-teams/>
2021-01-04 12:50:14 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'stackoverflowbusiness.com': <GET https://stackoverflowbusiness.com>
2021-01-04 12:50:14 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'serverfault.com': <GET https://serverfault.com>
So I would try using CrawlSpider. Also, if it doesn't work, you can post the whole code for the spider and I can debug it.

Scrapy getting all pages hrefs from an array of startUrls

The problem I have is the following: I am trying to scrape a website that has multiple categories of products, and each category has several pages with 24 products on each. I am able to get all the starting URLs, and by scraping every page I am able to get the URLs (endpoints, which I then turn into full URLs) of all the pages.
I should say that not every category has product pages, and not every starting URL is a category, so it might not have the structure I am looking for. But most of them do.
My intent is: from all pages of all categories, I want to extract the href of every product displayed on the page. The code I have been using is the following:
import scrapy

class MySpider(scrapy.spiders.CrawlSpider):
    name = 'myProj'

    with open('resultt.txt', 'r') as f:
        endurls = f.read()

    endurls = endurls.split(sep=' ')
    endurls = ['https://www.someurl.com' + url for url in endurls]
    start_urls = endurls

    def parse(self, response):
        with open('allpages.txt', 'a') as f:
            pages_in_category = response.xpath('//option/@value').getall()
            length = len(pages_in_category)
            pages_in_category = ['https://www.someurl.com' + page for page in pages_in_category]
            if length == 0:
                f.write(str(response.url))
            else:
                for page in pages_in_category:
                    f.write(page)
Through the scrapy shell I am able to make it work, though not iteratively. The command I then run in the terminal is
scrapy runspider ScrapyCarr.py -s USER_AGENT='my-cool-project (http://example.com)'
(I have not initialized a proper Scrapy project structure, since I don't need it; it is a simple project for uni and I don't care much about the structure.) Unfortunately, the file to which I am trying to append my product URLs remains empty, even though when I go through the same steps in the scrapy shell I see it working.
The output I am currently getting is the following
2020-10-15 12:51:56 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.someurl.com/market/fish/typefish/N-4minn0/c> (referer: None)
2020-10-15 12:51:57 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.someurl.com/market/medicines/typemed/N-i50owa/c> (referer: None)
2020-10-15 12:51:57 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.someurl.com/market/medicines/typemed/N-1l0cnr6/c> (referer: None)
2020-10-15 12:51:57 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.someurl.com/market/medicines/typemed/N-18isujc/c> (referer: None)
The problem was that I was basing my class MySpider on scrapy.spiders.CrawlSpider. The code works when using scrapy.Spider instead.
SOLVED
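For anyone hitting the same thing: the usual explanation is that CrawlSpider reserves the parse method for its own rule-handling logic, so parse should not be overridden on a CrawlSpider. A trimmed sketch of the same spider as a plain scrapy.Spider (URLs and file names as in the question):
import scrapy

class MySpider(scrapy.Spider):
    name = 'myProj'

    with open('resultt.txt', 'r') as f:
        endurls = f.read()
    start_urls = ['https://www.someurl.com' + url for url in endurls.split(sep=' ')]

    def parse(self, response):
        # collect the per-category page URLs, falling back to the category URL itself
        pages = ['https://www.someurl.com' + p
                 for p in response.xpath('//option/@value').getall()]
        with open('allpages.txt', 'a') as f:
            for page in pages or [str(response.url)]:
                f.write(page + '\n')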

Scrapy - unexpected suffix "%0A" in links

I'm scraping websites to download email addresses.
I have a simple Scrapy crawler, which takes a .txt file with domains and then scrapes them to find email addresses.
Unfortunately, Scrapy is adding the suffix "%0A" to the links. You can see it in the logs below.
Here is my code:
class EmailsearcherSpider(scrapy.Spider):
    name = 'emailsearcher'
    allowed_domains = []
    start_urls = []
    unique_data = set()

    def __init__(self):
        for line in open('/home/*****/domains', 'r').readlines():
            self.allowed_domains.append(line)
            self.start_urls.append('http://{}'.format(line))

    def parse(self, response):
        emails = response.xpath('//body').re('([a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+)')
        for email in emails:
            print(email)
            print('\n')
            if email and (email not in self.unique_data):
                self.unique_data.add(email)
                yield {'emails': email}
domains.txt:
link4.pl/kontakt
danone.pl/Kontakt
axadirect.pl/kontakt/dane-axa-direct.html
andrzejtucholski.pl/kontakt
premier.gov.pl/kontakt.html
Here are the logs from the console:
2017-09-26 22:27:02 [scrapy.core.engine] INFO: Spider opened
2017-09-26 22:27:02 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-09-26 22:27:02 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6026
2017-09-26 22:27:03 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET http://www.premier.gov.pl/kontakt.html> from <GET http://premier.gov.pl/kontakt.html>
2017-09-26 22:27:03 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://andrzejtucholski.pl/kontakt> from <GET http://andrzejtucholski.pl/kontakt%0A>
2017-09-26 22:27:05 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://axadirect.pl/kontakt/dane-axa-direct.html%0A> from <GET http://axadirect.pl/kontakt/dane-axa-direct.html%0A>
2017-09-26 22:27:05 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.link4.pl/kontakt> from <GET http://link4.pl/kontakt%0A>
2017-09-26 22:27:05 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://danone.pl/Kontakt%0a> from <GET http://danone.pl/Kontakt%0A>
The %0A is the newline character. Reading the lines keeps the newline characters intact. To get rid of them, you may use string.strip function, like this:
self.start_urls.append('http://{}'.format(string.strip(line)))
I found the right solution: I had to use the rstrip function.
self.start_urls.append('http://{}'.format(line.rstrip()))
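For reference, the same stripping is worth applying to allowed_domains as well, since the raw line (newline included) goes into both lists; a sketch of the adjusted __init__:
def __init__(self):
    for line in open('/home/*****/domains', 'r').readlines():
        domain = line.rstrip()  # drop the trailing newline that turns into %0A
        if domain:
            self.allowed_domains.append(domain)
            self.start_urls.append('http://{}'.format(domain))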

Portia Spider logs showing ['Partial'] during crawling

I have created a spider using the Portia web scraper, and the start URL is
https://www1.apply2jobs.com/EdwardJonesCareers/ProfExt/index.cfm?fuseaction=mExternal.searchJobs
While scheduling this spider in scrapyd I am getting:
DEBUG: Crawled (200) <GET https://www1.apply2jobs.com/EdwardJonesCareers/ProfExt/index.cfm?fuseaction=mExternal.searchJobs> (referer: None) ['partial']
DEBUG: Crawled (200) <GET https://www1.apply2jobs.com/EdwardJonesCareers/ProfExt/index.cfm?fuseaction=mExternal.returnToResults&CurrentPage=2> (referer: https://www1.apply2jobs.com/EdwardJonesCareers/ProfExt/index.cfm?fuseaction=mExternal.searchJobs) ['partial']
DEBUG: Crawled (200) <GET https://www1.apply2jobs.com/EdwardJonesCareers/ProfExt/index.cfm?fuseaction=mExternal.showJob&RID=21805&CurrentPage=1> (referer: https://www1.apply2jobs.com/EdwardJonesCareers/ProfExt/index.cfm?fuseaction=mExternal.searchJobs) ['partial']
What does ['partial'] mean, and why is the content from the page not scraped by the spider?
Late answer, but hopefully not useless, since this behavior by scrapy doesn't seem well-documented. Looking at this line of code from the scrapy source, the partial flag is set when the request encounters a Twisted PotentialDataLoss error. According to the corresponding Twisted documentation:
This only occurs when making requests to HTTP servers which do not set Content-Length or a Transfer-Encoding in the response
Possible causes include:
The server is misconfigured
There's a proxy involved that's blocking some headers
You get a response that doesn't normally have Content-Length, e.g. redirects (301, 302, 303), but you've set handle_httpstatus_list or handle_httpstatus_all such that the response doesn't get filtered out by HttpErrorMiddleware or fetched by RedirectMiddleware
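If you just want to spot such responses in a spider, the flag is exposed on the response object; a minimal sketch:
def parse(self, response):
    if 'partial' in response.flags:
        self.logger.warning('Possibly truncated response: %s', response.url)
    # ... normal parsing continues here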
