Crawling redirected urls with scrapy - python

I'm trying to use Scrapy to crawl www.mywebsite.com.
www.mywebsite.com is hosted on a free host under the URL www.mywebsite.freehost.com, and I redirect the free host to my paid domain.
The problem is that Scrapy ignores the redirect, and the end result is that 0 pages are scraped.
How do I tell Scrapy to crawl the redirected URL? I only need it to crawl the redirected URL, not other URLs that lead out of the website (Facebook pages, etc.).
2016-11-27 14:48:42 [scrapy] INFO: Spider opened
2016-11-27 14:48:42 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-11-27 14:48:42 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-11-27 14:48:44 [scrapy] DEBUG: Crawled (200) <GET http://www.mywebsite.com/> (referer: None)
2016-11-27 14:48:44 [scrapy] DEBUG: Filtered offsite request to 'www.mywebsite.freehost.net': <GET www.mywebsite.freehost.net>
2016-11-27 14:48:44 [scrapy] INFO: Closing spider (finished)
2016-11-27 14:48:44 [scrapy] INFO: Dumping Scrapy stats:

The logs show that your request is being filtered:
DEBUG: Filtered offsite request to 'www.mywebsite.freehost.net': <GET www.mywebsite.freehost.net>
Add the domain freehost.net to your allowed_domains list, or remove allowed_domains from your spider entirely to allow every domain.
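For reference, a minimal sketch of a spider with both domains allowed (the domain names are placeholders from the question):

import scrapy

class MySpider(scrapy.Spider):
    name = 'mywebsite'
    # List both the paid domain and the free host, so requests produced
    # by the redirect are not filtered by the OffsiteMiddleware.
    allowed_domains = ['mywebsite.com', 'mywebsite.freehost.net']
    start_urls = ['http://www.mywebsite.com/']

    def parse(self, response):
        # ... extract data and follow in-site links as before ...
        pass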

Related

DEBUG: Crawled (404) when crawling table with Scrapy

I am quite new to Scrapy and I'm trying to get table data from every page of this website.
But first, I just want to get the table data from page 1.
This is my code:
import scrapy

class UAESpider(scrapy.Spider):
    name = 'uae_free'
    allowed_domains = ['https://www.uaeonlinedirectory.com']
    start_urls = [
        'https://www.uaeonlinedirectory.com/UFZOnlineDirectory.aspx?item=A'
    ]

    def parse(self, response):
        zones = response.xpath('//table[@class="GridViewStyle"]/tbody/tr')
        for zone in zones[1:]:
            yield {
                'company_name': zone.xpath('.//td[1]//text()').get(),
                'zone': zone.xpath('.//td[2]//text()').get(),
                'category': zone.xpath('.//td[4]//text()').get()
            }
In the terminal, I get this message:
2020-07-01 08:41:07 [scrapy.core.engine] INFO: Spider opened
2020-07-01 08:41:07 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-07-01 08:41:07 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-07-01 08:41:09 [scrapy.core.engine] DEBUG: Crawled (404) <GET https://www.uaeonlinedirectory.com/robots.txt> (referer: None)
2020-07-01 08:41:09 [protego] DEBUG: Rule at line 1 without any user agent to enforce it on.
2020-07-01 08:41:09 [protego] DEBUG: Rule at line 2 without any user agent to enforce it on.
2020-07-01 08:41:09 [protego] DEBUG: Rule at line 8 without any user agent to enforce it on.
2020-07-01 08:41:09 [protego] DEBUG: Rule at line 9 without any user agent to enforce it on.
2020-07-01 08:41:09 [protego] DEBUG: Rule at line 10 without any user agent to enforce it on.
2020-07-01 08:41:09 [protego] DEBUG: Rule at line 11 without any user agent to enforce it on.
2020-07-01 08:41:09 [protego] DEBUG: Rule at line 12 without any user agent to enforce it on.
2020-07-01 08:41:14 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.uaeonlinedirectory.com/UFZOnlineDirectory.aspx?item=A> (referer: None)
2020-07-01 08:41:14 [scrapy.core.engine] INFO: Closing spider (finished)
Do you guys know what this message is about and what's wrong with my code?
Update:
I found this answer, and after setting ROBOTSTXT_OBEY = False I no longer get the message above. But I still cannot get the data.
The terminal message after I set ROBOTSTXT_OBEY = False:
2020-07-01 08:56:03 [scrapy.core.engine] INFO: Spider opened
2020-07-01 08:56:03 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-07-01 08:56:03 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-07-01 08:56:07 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.uaeonlinedirectory.com/UFZOnlineDirectory.aspx?item=A> (referer: None)
2020-07-01 08:56:07 [scrapy.core.engine] INFO: Closing spider (finished)
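(For reference, that setting lives in the project's settings.py:)

# settings.py
ROBOTSTXT_OBEY = False  # do not fetch or enforce robots.txt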
Update 2:
I opened a terminal and ran scrapy shell https://www.uaeonlinedirectory.com/UFZOnlineDirectory.aspx?item=A to check my XPath:
>>> response.xpath('//table[@class="GridViewStyle"]')
[<Selector xpath='//table[@class="GridViewStyle"]' data='<table class="GridViewStyle" cellspac...'>]
>>> response.xpath('//table[@class="GridViewStyle"]/tbody')
[]
So is my XPath wrong?
Not sure why, but for some reason your XPath doesn't find the table body. I changed it to this and it seems to work now:
//table[@class="GridViewStyle"]//tr
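The likely reason: browsers insert a <tbody> element into the DOM when rendering, but the raw HTML that Scrapy downloads often contains none, so any XPath that goes through /tbody matches nothing. A sketch of the corrected parse method, with the same fields as in the question:

def parse(self, response):
    # Select the rows directly; the downloaded HTML has no <tbody>,
    # even though the browser's inspector shows one.
    rows = response.xpath('//table[@class="GridViewStyle"]//tr')
    for row in rows[1:]:  # skip the header row
        yield {
            'company_name': row.xpath('.//td[1]//text()').get(),
            'zone': row.xpath('.//td[2]//text()').get(),
            'category': row.xpath('.//td[4]//text()').get()
        }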

Can't make my first spider run, any advice?

This is my first time using Scrapy and maybe my third time using Python, so I'm a noob.
The problem with this code is that it doesn't even enter the page.
I have tried to use:
scrapy shell 'https://www.zooplus.es/shop/tienda_perros/pienso_perros/pienso_hipoalergenico'
This works and then using...
response.xpath('//*[@class="product__varianttitle ui-text--small"]')
... I can retrieve information.
My code:
import scrapy

class ZooplusSpider(scrapy.Spider):
    name = 'Zooplus'
    allowed_domains = ['zooplus.es']
    start_urls = ['https://www.zooplus.es/shop/tienda_perros/pienso_perros/pienso_hipoalergenico']

    def parse(self, response):
        item = scrapy.Item()
        item['nombre'] = response.xpath('//*[@class="product__varianttitle ui-text--small"]')
        item['preciooriginal'] = response.xpath('//*[@class="product__prices_col prices"]')
        item['preciorebaja'] = response.xpath('//*[@class="product__specialprice__text"]')
        return item
The error message says:
2019-08-30 21:16:57 [scrapy.core.engine] INFO: Spider opened
2019-08-30 21:16:57 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-08-30 21:16:57 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2019-08-30 21:16:57 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.zooplus.es/robots.txt> (referer: None)
2019-08-30 21:16:57 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.zooplus.es/shop/tienda_perros/pienso_perros/pienso_hipoalergenico> from <GET https://www.zooplus.es/shop/tienda_perros/pienso_perros/pienso_hipoalergenico/>
2019-08-30 21:16:58 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.zooplus.es/shop/tienda_perros/pienso_perros/pienso_hipoalergenico> (referer: None)
2019-08-30 21:16:58 [scrapy.core.scraper] ERROR: Spider error processing <GET https://www.zooplus.es/shop/tienda_perros/pienso_perros/pienso_hipoalergenico> (referer: None)
I think you haven't defined the fields in your items.py; the error is coming from item['nombre'].
Either define the fields in items.py, or simply replace
item = scrapy.Item()
with
item = dict()
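A minimal sketch of the first option, with field names taken from the spider above:

# items.py
import scrapy

class ZooplusItem(scrapy.Item):
    # scrapy.Item raises KeyError for keys that were never declared,
    # so declare one Field per key the spider assigns.
    nombre = scrapy.Field()
    preciooriginal = scrapy.Field()
    preciorebaja = scrapy.Field()

The spider would then instantiate item = ZooplusItem() instead of the bare scrapy.Item().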

Scrapy - unexpected suffix "%0A" in links

I'm scraping websites to collect email addresses.
I have a simple Scrapy crawler that takes a .txt file with domains and then scrapes them to find email addresses.
Unfortunately, Scrapy is adding the suffix "%0A" to links. You can see it in the log below.
Here is my code:
class EmailsearcherSpider(scrapy.Spider):
    name = 'emailsearcher'
    allowed_domains = []
    start_urls = []
    unique_data = set()

    def __init__(self):
        for line in open('/home/*****/domains', 'r').readlines():
            self.allowed_domains.append(line)
            self.start_urls.append('http://{}'.format(line))

    def parse(self, response):
        emails = response.xpath('//body').re(r'([a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+)')
        for email in emails:
            print(email)
            print('\n')
            if email and (email not in self.unique_data):
                self.unique_data.add(email)
                yield {'emails': email}
domains.txt:
link4.pl/kontakt
danone.pl/Kontakt
axadirect.pl/kontakt/dane-axa-direct.html
andrzejtucholski.pl/kontakt
premier.gov.pl/kontakt.html
Here are the logs from the console:
2017-09-26 22:27:02 [scrapy.core.engine] INFO: Spider opened
2017-09-26 22:27:02 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-09-26 22:27:02 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6026
2017-09-26 22:27:03 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET http://www.premier.gov.pl/kontakt.html> from <GET http://premier.gov.pl/kontakt.html>
2017-09-26 22:27:03 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://andrzejtucholski.pl/kontakt> from <GET http://andrzejtucholski.pl/kontakt%0A>
2017-09-26 22:27:05 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://axadirect.pl/kontakt/dane-axa-direct.html%0A> from <GET http://axadirect.pl/kontakt/dane-axa-direct.html%0A>
2017-09-26 22:27:05 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.link4.pl/kontakt> from <GET http://link4.pl/kontakt%0A>
2017-09-26 22:27:05 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://danone.pl/Kontakt%0a> from <GET http://danone.pl/Kontakt%0A>
The %0A is the URL-encoded newline character. readlines() keeps the trailing newline of each line intact. To get rid of it, you may use the string.strip function (from the Python 2 string module), like this:
self.start_urls.append('http://{}'.format(string.strip(line)))
I found the right solution: I had to use the rstrip method.
self.start_urls.append('http://{}'.format(line.rstrip()))
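Putting it together, a sketch of the fixed __init__ (the elided path is left as in the question); note that allowed_domains needs the same treatment, since it is filled from the same lines:

def __init__(self):
    with open('/home/*****/domains', 'r') as f:
        for line in f:
            domain = line.rstrip()  # drop the trailing '\n' that becomes %0A
            if domain:
                self.allowed_domains.append(domain)
                self.start_urls.append('http://{}'.format(domain))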

Scrapy crawl 301 redirect pages but doesn't scrape data from them

I can't figure out how to allow scrapy to scrape 301 redirected pages.
When I add
handle_httpstatus_list = [301,302]
the log stops telling me
2015-09-29 09:45:06 [scrapy] DEBUG: Crawled (301) <GET http://www.example.com/conditions-generales/> (referer: http://www.example.com/)
2015-09-29 09:45:07 [scrapy] DEBUG: Ignoring response <301 http://www.example.com/conditions-generales/>: HTTP status code is not handled or not allowed
but it only crawls the 301-redirected pages and never scrapes data from them (while it does for pages with HTTP status code 200).
I then get:
2015-09-29 09:55:39 [scrapy] DEBUG: Crawled (301) <GET http://www.example.com/espace-annonceurs/> (referer: http://www.example.com/)
But never:
2015-09-29 09:55:39 [scrapy] DEBUG: Scraped from <301 http://www.example.com/espace-annonceurs/>
I would like to scrape http://www.example.com/espace-annonceurs/ just the way I would if it were a 200 HTTP status code.
I suppose I have to use a middleware, but I don't know how to do this.
Thank you for your help.
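For context: by default, the RedirectMiddleware follows 301/302 responses transparently, and the spider callback only ever sees the final 200 page. Adding those codes to handle_httpstatus_list hands the raw 301 response, whose body is essentially empty, to the callback instead, so there is nothing to scrape from it. If you really need to receive the 301 yourself, here is a minimal sketch of following the Location header manually (a hypothetical spider, not code from the question):

import scrapy

class RedirectAwareSpider(scrapy.Spider):
    name = 'redirect_aware'
    handle_httpstatus_list = [301, 302]
    start_urls = ['http://www.example.com/']

    def parse(self, response):
        if response.status in (301, 302):
            # The redirect response has no useful body; follow its
            # Location header manually to fetch the real page.
            location = response.headers.get('Location')
            if location:
                yield scrapy.Request(response.urljoin(location.decode()),
                                     callback=self.parse)
            return
        # ... scrape the 200 page here ...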

Scrapy gets stuck with IIS 5.1 page

I'm writing spiders with Scrapy to get some data from a couple of applications that use ASP. Both web pages are almost identical and require logging in before scraping can start, but I only managed to scrape one of them. On the other one, Scrapy waits forever and never gets past the login done with the FormRequest method.
The code of both spiders (they are almost identical but with different IPs) is as following:
from scrapy.spider import BaseSpider
from scrapy.http import FormRequest
from scrapy.shell import inspect_response

class MySpider(BaseSpider):
    name = "my_very_nice_spider"
    allowed_domains = ["xxx.xxx.xxx.xxx"]
    start_urls = ['http://xxx.xxx.xxx.xxx/reporting/']

    def parse(self, response):
        # Simulate user login on (http://xxx.xxx.xxx.xxx/reporting/)
        return [FormRequest.from_response(response,
                                          formdata={'user': 'the_username',
                                                    'password': 'my_nice_password'},
                                          callback=self.after_login)]

    def after_login(self, response):
        inspect_response(response, self)  # Spider never gets here on one site
        if "Bad login" in response.body:
            print "Login failed"
            return
        # Scraping code begins...
Wondering what could be different between them, I used Firefox Live HTTP Headers to inspect the headers and found only one difference: the webpage that works runs IIS 6.0 and the one that doesn't runs IIS 5.1.
As this alone couldn't explain why one works and the other doesn't, I used Wireshark to capture the network traffic and found this:
Interaction using scrapy with working webpage (IIS 6.0)
scrapy --> webpage GET /reporting/ HTTP/1.1
scrapy <-- webpage HTTP/1.1 200 OK
scrapy --> webpage POST /reporting/ HTTP/1.1 (application/x-www-form-urlencoded)
scrapy <-- webpage HTTP/1.1 302 Object moved
scrapy --> webpage GET /reporting/htm/webpage.asp
scrapy <-- webpage HTTP/1.1 200 OK
scrapy --> webpage POST /reporting/asp/report1.asp
...Scrapping begins
Interaction using scrapy with not working webpage (IIS 5.1)
scrapy --> webpage GET /reporting/ HTTP/1.1
scrapy <-- webpage HTTP/1.1 200 OK
scrapy --> webpage POST /reporting/ HTTP/1.1 (application/x-www-form-urlencoded)
scrapy <-- webpage HTTP/1.1 100 Continue # What the f...?
scrapy <-- webpage HTTP/1.1 302 Object moved
...Scrapy waits forever...
I googled a little bit and found that IIS 5.1 indeed has a nice kind of "feature" that makes it return HTTP 100 whenever someone makes a POST to it, as shown here.
Knowing where the root of all evil is, but having to scrape that site anyway... how can I make Scrapy work in this situation? Or am I doing something wrong?
Thank you!
Edit - console log with the non-working site:
2014-01-17 09:09:50-0300 [scrapy] INFO: Scrapy 0.20.2 started (bot: mybot)
2014-01-17 09:09:50-0300 [scrapy] DEBUG: Optional features available: ssl, http11
2014-01-17 09:09:50-0300 [scrapy] DEBUG: Overridden settings: {'NEWSPIDER_MODULE': 'bot.spiders', 'SPIDER_MODULES': ['bot.spiders'], 'BOT_NAME': 'bot'}
2014-01-17 09:09:51-0300 [scrapy] DEBUG: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2014-01-17 09:09:51-0300 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2014-01-17 09:09:51-0300 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2014-01-17 09:09:51-0300 [scrapy] DEBUG: Enabled item pipelines:
2014-01-17 09:09:51-0300 [my_very_nice_spider] INFO: Spider opened
2014-01-17 09:09:51-0300 [my_very_nice_spider] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2014-01-17 09:09:51-0300 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2014-01-17 09:09:51-0300 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2014-01-17 09:09:54-0300 [my_very_nice_spider] DEBUG: Crawled (200) <GET http://xxx.xxx.xxx.xxx/reporting/> (referer: None)
2014-01-17 09:10:51-0300 [my_very_nice_spider] INFO: Crawled 1 pages (at 1 pages/min), scraped 0 items (at 0 items/min)
2014-01-17 09:11:51-0300 [my_very_nice_spider] INFO: Crawled 1 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2014-01-17 09:12:51-0300 [my_very_nice_spider] INFO: Crawled 1 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2014-01-17 09:12:54-0300 [my_very_nice_spider] DEBUG: Retrying <POST http://xxx.xxx.xxx.xxx/reporting/> (failed 1 times): User timeout caused connection failure: Getting http://xxx.xxx.xxx.xxx/reporting/ took longer than 180 seconds..
2014-01-17 09:13:51-0300 [my_very_nice_spider] INFO: Crawled 1 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2014-01-17 09:14:51-0300 [my_very_nice_spider] INFO: Crawled 1 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2014-01-17 09:15:51-0300 [my_very_nice_spider] INFO: Crawled 1 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2014-01-17 09:15:54-0300 [my_very_nice_spider] DEBUG: Retrying <POST http://xxx.xxx.xxx.xxx/reporting/> (failed 2 times): User timeout caused connection failure: Getting http://xxx.xxx.xxx.xxx/reporting/ took longer than 180 seconds..
...
Try using the HTTP 1.0 downloader, which side-steps the interim "100 Continue" response that the default HTTP 1.1 handler stalls on:

# settings.py
DOWNLOAD_HANDLERS = {
    'http': 'scrapy.core.downloader.handlers.http10.HTTP10DownloadHandler',
    'https': 'scrapy.core.downloader.handlers.http10.HTTP10DownloadHandler',
}
