Stop Scrapy crawler from external domains - python

I am new to Scrapy and am trying to run a crawler on a few websites, where my allowed domains and start URLs look like this:
allowed_domains = ['www.siemens.com']
start_urls= ['https://www.siemens.com/']
The problem is that the website also contains links to different domains like
"siemens.fr" and "siemens.de"
and I don't want Scrapy to scrape these websites as well. Any suggestions on how to tell the spider not to crawl them?
I am trying to build a more general spider so that it is applicable to other websites as well.
Update #2
As suggested by Felix Eklöf, I tried to adjust my code and change some settings. This is what the code looks like now
The spider
class webSpider(scrapy.Spider):
    name = 'web'
    allowed_domains = ['eaton.com']
    start_urls = ['https://www.eaton.com/us/']
    # include_patterns = ['']
    exclude_patterns = ['.*\.(css|js|gif|jpg|jpeg|png)']
    # proxies = 'proxies.txt'
    response_type_whitelist = ['text/html']
    # response_type_blacklist = []

    rules = [Rule(LinkExtractor(allow=(allowed_domains)), callback='parse_item', follow=True)]
And the settings look like this:
SPIDER_MIDDLEWARES = {
    'smartspider.middlewares.SmartspiderSpiderMiddleware': 543,
    # 'scrapy_testmaster.TestMasterMiddleware': 950
}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
    'smartspider.middlewares.SmartspiderDownloaderMiddleware': 543,
    'smartspider.middlewares.FilterResponses': 543,
    'smartspider.middlewares.RandomProxyForReDirectedUrls': 650,
    'scrapy.spidermiddlewares.offsite.OffsiteMiddleware': 543,
}

ITEM_PIPELINES = {
    'smartspider.pipelines.SmartspiderPipeline': 300,
}
Please let me know if any of these settings are interfering with the spider accessing only internal links and staying within the given domain.
Update #3
As suggested by @Felix, I updated the spider, which now looks like this:
class WebSpider(CrawlSpider):
    name = 'web'
    allowed_domains = ['eaton.com']
    start_urls = ['https://www.eaton.com/us/']
    # include_patterns = ['']
    exclude_patterns = ['.*\.(css|js|gif|jpg|jpeg|png)']
    response_type_whitelist = ['text/html']

    rules = [Rule(LinkExtractor(), callback='parse_item', follow=True)]
The settings look like this:
#SPIDER_MIDDLEWARES = {
#    'smartspider.middlewares.SmartspiderSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
    # 'smartspider.middlewares.SmartspiderDownloaderMiddleware': 543,
    'smartspider.middlewares.FilterResponses': 543,
    'smartspider.middlewares.RandomProxyForReDirectedUrls': 650,
}
# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
# 'scrapy.extensions.telnet.TelnetConsole': None,
#}
# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
#ITEM_PIPELINES = {
# 'smartspider.pipelines.SmartspiderPipeline': 300,
#}
But the spider is still crawling different domains.
The logs, however, show that it is rejecting offsite requests for another website (thalia.de):
2021-01-04 19:46:42 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'www.rtbhouse.com': <GET https://www.rtbhouse.com/privacy-center/>
2021-01-04 19:46:42 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'www.quicklizard.com': <GET https://www.quicklizard.com/terms-of-service/>
2021-01-04 19:46:42 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.thalia.de/shop/hilfe-gutschein/show/> (referer: https://www.thalia.de/)
2021-01-04 19:46:43 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.thalia.de/shop/hilfe-kaufen/show/> (referer: https://www.thalia.de/)
2021-01-04 19:46:43 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.thalia.de/shop/home/login/login/?source=%2Fde.buch.shop%2Fshop%2F2%2Fhome%2Fkundenbewertung%2Fschreiben%3Fartikel%3D149426569&jumpId=2610518> (referer: https://www.thalia.de/shop/home/artikeldetails/ID149426569.html)
2021-01-04 19:46:43 [scrapy.extensions.logstats] INFO: Crawled 453 pages (at 223 pages/min), scraped 0 items (at 0 items/min)
2021-01-04 19:46:43 [scrapy.spidermiddlewares.depth] DEBUG: Ignoring link (depth > 2): https://www.thalia.de/shop/home/show/
Is the spider working as expected, or is the problem with a specific website?

Try removing "www." from allowed_domains.
According to the Scrapy docs, you should do it like this:
Let’s say your target url is https://www.example.com/1.html, then
add 'example.com' to the list.
So, in your case:
allowed_domains = ['siemens.com']
start_urls= ['https://www.siemens.com/']

Please have a closer look at the other, country-specific domains, such as siemens.de, siemens.dk, siemens.fr, etc.
If you run a curl call against the German site, curl --head https://www.siemens.de, you will see a 301 status code.
The URL is redirected to https://new.siemens.com/de/de.html.
The same pattern is observed for the other countries: the ISO 3166-1 alpha-2 code is embedded in the URL. If you need to filter, this is the place to tackle the problem.
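If you do want to filter on that embedded country code, a small helper along these lines could work. This is a sketch; the new.siemens.com path layout is an assumption based only on the redirect observed above:

```python
import re

# Hypothetical filter for redirected country pages such as
# https://new.siemens.com/de/de.html; the two-letter path segment
# is the ISO 3166-1 alpha-2 code mentioned above.
COUNTRY_PATH = re.compile(r'^https?://new\.siemens\.com/([a-z]{2})/')

def country_code(url):
    """Return the alpha-2 code if the URL is a redirected country
    page, else None."""
    m = COUNTRY_PATH.match(url)
    return m.group(1) if m else None

print(country_code('https://new.siemens.com/de/de.html'))    # de
print(country_code('https://www.siemens.com/global/en.html'))  # None
```

A result of None means the URL can be kept; anything else can be dropped (or restricted to the one code you care about) before yielding the request.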

I took a closer look at your code and might have found the issue.
I believe the issue is on this line:
rules = [Rule(LinkExtractor(allow = (allowed_domains)), callback='parse_item', follow=True)]
The LinkExtractor class expects the allow argument to be a str or a list of strs; however, those strings are interpreted as regular expressions. Since you have a . (dot) in the URL, the regular expression will treat it as matching any character.
Instead, you can just use the argument allow_domains, like this:
rules = [Rule(LinkExtractor(allow_domains = allowed_domains), callback='parse_item', follow=True)]
But, these requests should still be filtered out by the allowed_domains. So I'm not sure why that's not working, but try this.
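To see why allow=allowed_domains is too permissive, here is a quick illustration of the regex behavior (the second hostname is made up for the example):

```python
import re

# In LinkExtractor(allow=...), each string is treated as a regular
# expression, so the unescaped dot in 'eaton.com' matches ANY character:
print(bool(re.search('eaton.com', 'https://www.eaton.com/us/')))       # True
print(bool(re.search('eaton.com', 'https://www.eatonxcom.example/')))  # True, not intended

# Escaping the dot restricts it to a literal '.':
print(bool(re.search(r'eaton\.com', 'https://www.eatonxcom.example/')))  # False
```

allow_domains sidesteps this entirely, since it compares domain names rather than regex patterns.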

I can't really figure out what's wrong, but I made a test project of my own. It's a totally clean project; I only changed ROBOTSTXT_OBEY = False in settings.py.
I noticed that your spider class extends scrapy.Spider but uses rules. I believe that class variable is only used by the generic spider CrawlSpider.
Here's my test spider:
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule, CrawlSpider

class TestSpider(CrawlSpider):
    name = 'web'
    allowed_domains = ['stackoverflow.com']
    start_urls = ['https://www.stackoverflow.com/']

    rules = [Rule(link_extractor=LinkExtractor(), follow=True)]
And it seems to work fine
2021-01-04 12:50:13 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2021-01-04 12:50:13 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://stackoverflow.com/> from <GET https://www.stackoverflow.com/>
2021-01-04 12:50:13 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://stackoverflow.com/> (referer: None)
2021-01-04 12:50:13 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'stackexchange.com': <GET https://stackexchange.com/sites>
2021-01-04 12:50:13 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'stackoverflow.blog': <GET https://stackoverflow.blog>
2021-01-04 12:50:13 [scrapy.dupefilters] DEBUG: Filtered duplicate request: <GET https://stackoverflow.com/#for-developers> - no more
duplicates will be shown (see DUPEFILTER_DEBUG to show all duplicates)
2021-01-04 12:50:14 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'www.g2.com': <GET https://www.g2.com/products/stack-overflow-for-teams/>
2021-01-04 12:50:14 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'stackoverflowbusiness.com': <GET https://stackoverflowbusiness.com>
2021-01-04 12:50:14 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'serverfault.com': <GET https://serverfault.com>
So I would try using CrawlSpider. Also, if it doesn't work, you can post the whole code for the spider and I can debug it.

Related

SCRAPY FORM REQUEST doesn't return any data

I was making a form request to a website. The request is made successfully but it's not returning any data.
LOGS:
2020-09-05 22:37:57 [scrapy.core.engine] DEBUG: Crawled (200) <POST https://safer.fmcsa.dot.gov/query.asp> (referer: https://safer.fmcsa.dot.gov/)
2020-09-05 22:37:57 [scrapy.core.engine] DEBUG: Crawled (200) <POST https://safer.fmcsa.dot.gov/query.asp> (referer: https://safer.fmcsa.dot.gov/)
2020-09-05 22:37:59 [scrapy.core.engine] DEBUG: Crawled (200) <POST https://safer.fmcsa.dot.gov/query.asp> (referer: https://safer.fmcsa.dot.gov/)
2020-09-05 22:37:59 [scrapy.core.engine] INFO: Closing spider (finished)
2020-09-05 22:37:59 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
MY CODE:
# -*- coding: utf-8 -*-
import scrapy

codes = open('codes.txt').read().split('\n')

class MainSpider(scrapy.Spider):
    name = 'main'
    form_url = 'https://safer.fmcsa.dot.gov/query.asp'
    start_urls = ['https://safer.fmcsa.dot.gov/CompanySnapshot.aspx']

    def parse(self, response):
        for code in codes:
            data = {
                'searchtype': 'ANY',
                'query_type': 'queryCarrierSnapshot',
                'query_param': 'USDOT',
                'query_string': code,
            }
            yield scrapy.FormRequest(url=self.form_url, formdata=data, callback=self.parse_form)

    def parse_form(self, response):
        cargo = response.xpath('(//table[@summary="Cargo Carried"]/tbody/tr)[2]')
        for each in cargo:
            each_x = each.xpath('.//td[contains(text(), "X")]/following-sibling::td/font/text()').get()
            yield {
                "X Values": each_x if each_x else "N/A",
            }
The following are a few sample codes that I am using for the POST request:
2146709
273286
120670
2036998
690147
I believe all you need is to remove tbody from your XPath here:
cargo = response.xpath('(//table[@summary="Cargo Carried"]/tbody/tr)[2]')
Use it like this:
cargo = response.xpath('//table[@summary="Cargo Carried"]/tr[2]')
# I also removed the () inside the path because you don't need them, but that didn't cause the problem.
The reason for this is that Scrapy will parse the original code from the page, while your browser may render tbody in case it isn't in the source. Further info here.
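A quick way to convince yourself: feed a minimal table (hypothetical markup here) through the standard-library HTML parser and note that no tbody tag shows up unless the source actually contains one:

```python
from html.parser import HTMLParser

# Hypothetical raw markup as a server might serve it: no <tbody>.
# Browsers insert <tbody> while rendering, which is why an XPath
# copied from devtools can fail against the raw response.
raw_html = '<table summary="Cargo Carried"><tr><td>X</td></tr></table>'

class TagCollector(HTMLParser):
    """Collect every start tag seen in the document."""
    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

p = TagCollector()
p.feed(raw_html)
print('tbody' in p.tags)  # False: the source markup has no tbody
```

Scrapy parses this raw markup, so an XPath containing /tbody/ selects nothing here.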

DEBUG: Crawled (404)

This is my code:
# -*- coding: utf-8 -*-
import scrapy

class SinasharesSpider(scrapy.Spider):
    name = 'SinaShares'
    allowed_domains = ['money.finance.sina.com.cn/mkt/']
    start_urls = ['http://money.finance.sina.com.cn/mkt//']

    def parse(self, response):
        contents = response.xpath('//*[@id="list_amount_ctrl"]/a[2]/@class').extract()
        print(contents)
And I have set a user agent in settings.py.
Then I get an error:
2020-04-27 10:54:50 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://money.finance.sina.com.cn/robots.txt> (referer: None)
2020-04-27 10:54:50 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://money.finance.sina.com.cn/mkt//> (referer: None)
So How can I eliminate this error?
Maybe your IP is banned by the website; you may also need to add some cookies to crawl the data you need.
The HTTP status code 404 is received because Scrapy checks /robots.txt by default. In your case this file does not exist, so a 404 is received, but that does not have any impact. In case you want to avoid checking robots.txt, you can set ROBOTSTXT_OBEY = False in settings.py.
Then the website is accessed successfully (http-statuscode 200). No content is printed because based on your xpath-selection nothing is selected. You have to fix your xpath-selection.
If you want to test different xpath- or css-selections in order to figure how to get your desired content, you might want to use the interactive scrapy shell:
scrapy shell "http://money.finance.sina.com.cn/mkt/"
You can find an example of a scrapy shell session in the official Scrapy documentation here.

Get all URLs in an entire site using Scrapy

Folks!
I'm trying to get all internal URLs in an entire site for SEO purposes, and I recently discovered Scrapy to help me in this task. But my code always returns an error:
2017-10-11 10:32:00 [scrapy.core.engine] INFO: Spider opened
2017-10-11 10:32:00 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min
)
2017-10-11 10:32:00 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-10-11 10:32:01 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://www.**test**.com/> from
<GET http://www.**test**.com/robots.txt>
2017-10-11 10:32:02 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.**test**.com/> (referer: None)
2017-10-11 10:32:03 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://www.**test**.com/> from
<GET http://www.**test**.com>
2017-10-11 10:32:03 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.**test**.com/> (referer: None)
2017-10-11 10:32:03 [scrapy.core.scraper] ERROR: Spider error processing <GET https://www.**test**.com/> (referer: None)
Traceback (most recent call last):
File "c:\python27\lib\site-packages\twisted\internet\defer.py", line 653, in _runCallbacks
current.result = callback(current.result, *args, **kw)
File "c:\python27\lib\site-packages\scrapy\spiders\__init__.py", line 90, in parse
raise NotImplementedError
NotImplementedError
I changed the original URL.
Here's the code I'm running:
# -*- coding: utf-8 -*-
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class TestSpider(scrapy.Spider):
    name = "test"
    allowed_domains = ["http://www.test.com"]
    start_urls = ["http://www.test.com"]
    rules = [Rule(LinkExtractor(allow=['.*']))]
Thanks!
EDIT:
This worked for me:
rules = (
    Rule(LinkExtractor(), callback='parse_item', follow=True),
)

def parse_item(self, response):
    filename = response.url
    arquivo = open("file.txt", "a")
    string = str(filename)
    arquivo.write(string + '\n')
    arquivo.close()
=D
The error you are getting is caused by the fact that you haven't defined a parse method in your spider, which is mandatory if you base your spider on the scrapy.Spider class.
For your purpose (i.e. crawling a whole website) it's best to base your spider on the scrapy.CrawlSpider class. Also, in Rule, you have to define the callback attribute as the method that will parse every page you visit. One last cosmetic change: in LinkExtractor, if you want to visit every page, you can leave out allow, as its default value is an empty tuple, which means it will match all links found.
Consult a CrawlSpider example for concrete code.
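For context, the NotImplementedError in the traceback comes from the base class: scrapy.Spider ships a parse stub that raises if you never override it. A simplified sketch of that mechanism (not the actual Scrapy source):

```python
# Simplified sketch of why the traceback ends in NotImplementedError:
# the base Spider defines parse as a stub that subclasses must override.
class Spider:
    def parse(self, response):
        raise NotImplementedError(
            '{}.parse callback is not defined'.format(self.__class__.__name__))

class TestSpider(Spider):
    # rules alone are ignored by a plain Spider; with no parse()
    # override, every downloaded response falls through to the stub
    pass

try:
    TestSpider().parse(response=None)
except NotImplementedError as exc:
    print(type(exc).__name__)  # NotImplementedError
```

CrawlSpider avoids this because it supplies its own internal parse that dispatches responses to the rules' callbacks.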

Confusion on Scrapy re-direct behavior?

So I am trying to scrape articles from news website that has an infinite scroll type layout so the following is what happens:
example.com has first page of articles
example.com/page/2/ has second page
example.com/page/3/ has third page
And so on. As you scroll down, the url changes. To account for that, I wanted to scrape the first x number of articles and did the following:
start_urls = ['http://example.com/']
for x in range(1, x):
    new_url = 'http://www.example.com/page/' + str(x) + '/'
    start_urls.append(new_url)
It seems to work fine for the first 9 pages and I get something like the following:
Redirecting (301) to <GET http://example.com/page/4/> from <GET http://www.example.com/page/4/>
Redirecting (301) to <GET http://example.com/page/5/> from <GET http://www.example.com/page/5/>
Redirecting (301) to <GET http://example.com/page/6/> from <GET http://www.example.com/page/6/>
Redirecting (301) to <GET http://example.com/page/7/> from <GET http://www.example.com/page/7/>
2017-09-08 17:36:23 [scrapy.extensions.logstats] INFO: Crawled 3 pages (at 3 pages/min), scraped 0 items (at 0 items/min)
Redirecting (301) to <GET http://example.com/page/8/> from <GET http://www.example.com/page/8/>
Redirecting (301) to <GET http://example.com/page/9/> from <GET http://www.example.com/page/9/>
Redirecting (301) to <GET http://www.example.com/> from <GET http://www.example.com/page/10/>
Redirecting (301) to <GET http://www.example.com/> from <GET http://www.example.com/page/11/>
Redirecting (301) to <GET http://www.example.com/> from <GET http://www.example.com/page/12/>
Redirecting (301) to <GET http://www.example.com/> from <GET http://www.example.com/page/13/>
Starting from page 10, it redirects to a page like example.com/ from example.com/page/10/ instead of the original link, example.com/page/10/. What could be causing this behavior?
I looked into a couple of options like dont_redirect, but I just don't understand what is happening. What could be the reason for this redirect behavior? Especially since no redirect happens when you directly type in the link for the website, like example.com/page/10?
Any help would be greatly appreciated, thanks!!
[EDIT]
class spider(CrawlSpider):
    start_urls = ['http://example.com/']
    for x in range(startPage, endPage):
        new_url = 'http://www.example.com/page/' + str(x) + '/'
        start_urls.append(new_url)

    custom_settings = {'DEPTH_PRIORITY': 1, 'DEPTH_LIMIT': 1}

    rules = (
        Rule(LinkExtractor(allow=('some regex here,'), deny=('example\.com/page/.*', 'some other regex',)), callback='parse_article'),
    )

    def parse_article(self, response):
        # some parsing work here
        yield item
Is it because I include example\.com/page/.* in the LinkExtractor? Shouldn't that only apply to links that are not the start_url however?
It looks like this site uses some kind of security check based only on the User-Agent request header.
So you only need to add a common User-Agent in the settings.py file:
USER_AGENT = 'Mozilla/5.0'
Also, the spider doesn't necessarily need the start_urls attribute to get the starting sites, you can also use the start_requests method, so replace all the creating of start_urls with:
class spider(CrawlSpider):
    ...
    def start_requests(self):
        for x in range(1, 20):
            yield Request('http://www.example.com/page/' + str(x) + '/')
    ...

Scrapy - unexpected suffix "%0A" in links

I'm scraping sites to download email addresses from websites.
I have a simple Scrapy crawler which takes a .txt file with domains and then scrapes them to find email addresses.
Unfortunately, Scrapy is adding the suffix "%0A" to links. You can see it in the log below.
Here is my code:
class EmailsearcherSpider(scrapy.Spider):
    name = 'emailsearcher'
    allowed_domains = []
    start_urls = []
    unique_data = set()

    def __init__(self):
        for line in open('/home/*****/domains', 'r').readlines():
            self.allowed_domains.append(line)
            self.start_urls.append('http://{}'.format(line))

    def parse(self, response):
        emails = response.xpath('//body').re('([a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+)')
        for email in emails:
            print(email)
            print('\n')
            if email and (email not in self.unique_data):
                self.unique_data.add(email)
                yield {'emails': email}
domains.txt:
link4.pl/kontakt
danone.pl/Kontakt
axadirect.pl/kontakt/dane-axa-direct.html
andrzejtucholski.pl/kontakt
premier.gov.pl/kontakt.html
Here are logs from console:
2017-09-26 22:27:02 [scrapy.core.engine] INFO: Spider opened
2017-09-26 22:27:02 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-09-26 22:27:02 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6026
2017-09-26 22:27:03 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET http://www.premier.gov.pl/kontakt.html> from <GET http://premier.gov.pl/kontakt.html>
2017-09-26 22:27:03 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://andrzejtucholski.pl/kontakt> from <GET http://andrzejtucholski.pl/kontakt%0A>
2017-09-26 22:27:05 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://axadirect.pl/kontakt/dane-axa-direct.html%0A> from <GET http://axadirect.pl/kontakt/dane-axa-direct.html%0A>
2017-09-26 22:27:05 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.link4.pl/kontakt> from <GET http://link4.pl/kontakt%0A>
2017-09-26 22:27:05 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://danone.pl/Kontakt%0a> from <GET http://danone.pl/Kontakt%0A>
The %0A is the newline character. Reading the lines keeps the newline characters intact. To get rid of them, you can use the str.strip method, like this:
self.start_urls.append('http://{}'.format(line.strip()))
I found the right solution. I had to use the rstrip function:
self.start_urls.append('http://{}'.format(line.rstrip()))
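The effect is easy to reproduce with just the standard library: the trailing newline survives into the URL and gets percent-encoded as %0A.

```python
from urllib.parse import quote

line = 'link4.pl/kontakt\n'   # a line as read from domains.txt, newline intact
url = 'http://{}'.format(line)

# Percent-encoding the URL turns the trailing '\n' into '%0A':
print(quote(url, safe=':/'))              # http://link4.pl/kontakt%0A

# Stripping the line first produces the intended URL:
print('http://{}'.format(line.rstrip()))  # http://link4.pl/kontakt
```

Either rstrip() or strip() works here; rstrip() only touches trailing whitespace, which is the safer choice if a domain could ever begin with meaningful characters.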
