Scrapy ignores deny rule - python

As a newbie to Scrapy and Python, I'm struggling with the deny rules of my CrawlSpider. I want to filter out all URLs on my target page which contain the word "versicherung" or the double-? structure anywhere in the URL. However, Scrapy ignores my rules. Can anyone tell me what's wrong with the syntax? (I've already tried without the \ before the *, but that doesn't work either.)
Rule:
rules = [Rule(LinkExtractor(deny=[r'\*versicher\*', r'\*\?\*\?\*']),
              callback='parse_norisbank', follow=True)]
Log:
DEBUG: Crawled (200) <GET https://www.norisbank.de/produkte/versicherungen/rechtsschutzversicherung.html> (referer: https://www.norisbank.de)
DEBUG: Crawled (200) <GET https://www.norisbank.de/produkte/versicherungen/haftpflichtversicherung.html> (referer: https://www.norisbank.de)
DEBUG: Crawled (200) <GET https://www.norisbank.de/produkte/versicherungen/hausratversicherung.html> (referer: https://www.norisbank.de)
DEBUG: Crawled (200) <GET https://www.norisbank.de/produkte/versicherungen/versicherungsmanager.html> (referer: https://www.norisbank.de)
DEBUG: Saved file nbtest-versicherungen.html

The rules must be regular expressions and (even if I correct your syntax) you are not using * correctly.
r'\*versicher\*' should be r'.*versicher.*'. EDIT: looking at the Scrapy docs, it looks like r'versicher' is sufficient.
I don't understand what you mean by "double ? structure", but your URLs don't seem to have it.
I expect r'.*\?\?.*' is what you want (or r'\?\?')
In regular expressions
. means any character
* means 0 or more of the preceding (so .* matches anything)
\ is how you escape a special character. You don't want to escape the *, since you want it to keep its special meaning.
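Putting that together, a corrected rule might look like the sketch below (assuming "double ? structure" means any URL containing two ? characters; deny takes a list of regexes and matches them anywhere in the URL, so no wildcards are needed):

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class NorisbankSpider(CrawlSpider):
    name = 'norisbank'                        # hypothetical spider name
    start_urls = ['https://www.norisbank.de']

    rules = [
        # skip any URL containing "versicher" or two ? characters
        Rule(LinkExtractor(deny=[r'versicher', r'\?.*\?']),
             callback='parse_norisbank', follow=True),
    ]

    def parse_norisbank(self, response):
        yield {'url': response.url}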

Related

Response.url and referer url scrapy

2020-11-09 12:13:08 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://example.com/books/adventure/book1/index.html> (referer: example.com/books/adventure/index.html)
If anyone is familiar with Scrapy, you know that https://example.com/books/adventure/book1/index.html is called response.url. However, I want to get the referring link, example.com/books/adventure/index.html; does anyone know what it's called?
You need to create the referer in your header.
Ideally it is created by you, i.e. you will already have it and don't need to get it from the response.
e.g.
headers={'Referer':'example.com/books/adventure/index.html'}
Hope that helps?
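A short sketch of both directions, setting the header yourself on the next request and reading it back later (the spider name and selector are made up for illustration):

import scrapy


class BooksSpider(scrapy.Spider):
    name = 'books'                            # hypothetical
    start_urls = ['https://example.com/books/adventure/index.html']

    def parse(self, response):
        for href in response.css('a::attr(href)').getall():
            # pass the current page along explicitly as the Referer of the next request
            yield response.follow(href, callback=self.parse_book,
                                  headers={'Referer': response.url})

    def parse_book(self, response):
        # the header can be read back from the request that produced this response
        referer = response.request.headers.get('Referer')
        yield {'url': response.url,
               'referer': referer.decode() if referer else None}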

Scrapy getting all pages hrefs from an array of startUrls

The problem I have is the following: I am trying to scrape a website that has multiple categories of products, and each category has several pages with 24 products on each. I am able to get all the starting URLs, and by scraping every page I am able to get the URLs (endpoints, which I then turn into full URLs) of all pages.
I should say that not every category has product pages, and not every starting URL is a category, so it might not have the structure I am looking for. But most of them do.
My intent is: from all pages of all categories I want to extract the href of every product displayed on the page. The code I have been using is the following:
import scrapy

class MySpider(scrapy.spiders.CrawlSpider):
    name = 'myProj'

    with open('resultt.txt', 'r') as f:
        endurls = f.read()
    f.close()
    endurls = endurls.split(sep=' ')
    endurls = ['https://www.someurl.com' + url for url in endurls]
    start_urls = endurls

    def parse(self, response):
        with open('allpages.txt', 'a') as f:
            pages_in_category = response.xpath('//option/@value').getall()
            length = len(pages_in_category)
            pages_in_category = ['https://www.someurl.com' + page for page in pages_in_category]
            if length == 0:
                f.write(str(response.url))
            else:
                for page in pages_in_category:
                    f.write(page)
        f.close()
Through the scrapy shell I am able to make it work, though not iteratively. The command I then run in the terminal is
scrapy runspider ScrapyCarr.py -s USER_AGENT='my-cool-project (http://example.com)'
since I have not initialized a proper Scrapy project structure (I don't need one; it is a simple project for uni and I do not care much about the structure). Unfortunately, the file to which I am trying to append my product URLs remains empty, even though when I run the code through the scrapy shell I see it working.
The output I am currently getting is the following
2020-10-15 12:51:56 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.someurl.com/market/fish/typefish/N-4minn0/c> (referer: None)
2020-10-15 12:51:57 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.someurl.com/market/medicines/typemed/N-i50owa/c> (referer: None)
2020-10-15 12:51:57 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.someurl.com/market/medicines/typemed/N-1l0cnr6/c> (referer: None)
2020-10-15 12:51:57 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.someurl.com/market/medicines/typemed/N-18isujc/c> (referer: None)
The problem was that I was defining my class MySpider as a scrapy.spiders.CrawlSpider. The code works when using scrapy.Spider instead: CrawlSpider uses the parse method internally to implement its rule logic, so parse should not be overridden in a CrawlSpider.
SOLVED
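For reference, a minimal sketch of the working variant (file names and XPath taken from the question; the trailing newlines are an addition so the entries stay separated):

import scrapy


class MySpider(scrapy.Spider):  # plain Spider, so parse() is ours to override
    name = 'myProj'

    with open('resultt.txt') as f:
        endurls = f.read().split(sep=' ')
    start_urls = ['https://www.someurl.com' + url for url in endurls]

    def parse(self, response):
        pages_in_category = response.xpath('//option/@value').getall()
        with open('allpages.txt', 'a') as f:
            if not pages_in_category:
                f.write(response.url + '\n')
            else:
                for page in pages_in_category:
                    f.write('https://www.someurl.com' + page + '\n')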

Confusion on Scrapy re-direct behavior?

I am trying to scrape articles from a news website that has an infinite-scroll layout, so the following is what happens:
example.com has first page of articles
example.com/page/2/ has second page
example.com/page/3/ has third page
And so on. As you scroll down, the URL changes. To account for that, I wanted to scrape the first x pages of articles and did the following:
start_urls = ['http://example.com/']

for x in range(1, x):
    new_url = 'http://www.example.com/page/' + str(x) + '/'
    start_urls.append(new_url)
It seems to work fine for the first 9 pages and I get something like the following:
Redirecting (301) to <GET http://example.com/page/4/> from <GET http://www.example.com/page/4/>
Redirecting (301) to <GET http://example.com/page/5/> from <GET http://www.example.com/page/5/>
Redirecting (301) to <GET http://example.com/page/6/> from <GET http://www.example.com/page/6/>
Redirecting (301) to <GET http://example.com/page/7/> from <GET http://www.example.com/page/7/>
2017-09-08 17:36:23 [scrapy.extensions.logstats] INFO: Crawled 3 pages (at 3 pages/min), scraped 0 items (at 0 items/min)
Redirecting (301) to <GET http://example.com/page/8/> from <GET http://www.example.com/page/8/>
Redirecting (301) to <GET http://example.com/page/9/> from <GET http://www.example.com/page/9/>
Redirecting (301) to <GET http://www.example.com/> from <GET http://www.example.com/page/10/>
Redirecting (301) to <GET http://www.example.com/> from <GET http://www.example.com/page/11/>
Redirecting (301) to <GET http://www.example.com/> from <GET http://www.example.com/page/12/>
Redirecting (301) to <GET http://www.example.com/> from <GET http://www.example.com/page/13/>
Starting from page 10, it redirects from example.com/page/10/ to a page like example.com/ instead of fetching the original link. What could be causing this behavior?
I looked into a couple of options like dont_redirect, but I just don't understand what is happening. What could be the reason for this redirect behavior, especially since no redirect happens when you type a link like example.com/page/10 directly into the browser?
Any help would be greatly appreciated, thanks!!
[EDIT]
class spider(CrawlSpider):
    start_urls = ['http://example.com/']

    for x in range(startPage, endPage):
        new_url = 'http://www.example.com/page/' + str(x) + '/'
        start_urls.append(new_url)

    custom_settings = {'DEPTH_PRIORITY': 1, 'DEPTH_LIMIT': 1}

    rules = (
        Rule(LinkExtractor(allow=('some regex here',),
                           deny=('example\.com/page/.*', 'some other regex',)),
             callback='parse_article'),
    )

    def parse_article(self, response):
        # some parsing work here
        yield item
Is it because I include example\.com/page/.* in the LinkExtractor? Shouldn't that only apply to links that are not in start_urls, though?
It looks like this site uses some kind of security that only checks the User-Agent in the request headers,
so you only need to add a common User-Agent in the settings.py file:
USER_AGENT = 'Mozilla/5.0'
Also, the spider doesn't necessarily need the start_urls attribute to get the starting sites; you can also use the start_requests method, so replace all the creation of start_urls with:
from scrapy import Request

class spider(CrawlSpider):
    ...

    def start_requests(self):
        for x in range(1, 20):
            yield Request('http://www.example.com/page/' + str(x) + '/')

    ...
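If you prefer to keep everything in the spider file instead of touching settings.py, the same User-Agent can also be set per spider via custom_settings; a sketch (spider and callback names are illustrative):

import scrapy


class PagesSpider(scrapy.Spider):
    name = 'pages'
    # same effect as USER_AGENT in settings.py, but scoped to this spider
    custom_settings = {'USER_AGENT': 'Mozilla/5.0'}

    def start_requests(self):
        for x in range(1, 20):
            yield scrapy.Request('http://www.example.com/page/' + str(x) + '/',
                                 callback=self.parse_article)

    def parse_article(self, response):
        # some parsing work here
        yield {'url': response.url}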

Scrapy. Unexpected symbols in LinkExtractor

I am studying the Scrapy library and trying to make a little crawler.
Here are the crawler's rules:
rules = (
    Rule(LinkExtractor(restrict_xpaths='//div[@class="wrapper"]/div[last()]/a[@class="pagenav"][last()]')),
    # Rule(LinkExtractor(restrict_xpaths='//span[@class="update_title"]/a'), callback='parse_item'),
)
But I get this error message:
DEBUG: Crawled (200) <GET http://web/category.php?id=4&> (referer: None)
DEBUG: Crawled (404) <GET http://web/%0D%0Acategory.php?id=4&page=2&s=d> (referer: http://web/category.php?id=4&)
DEBUG: Ignoring response <404 http://web/%0D%0Acategory.php?id=4&page=2&s=d>: HTTP status code is not handled or not allowed
Here's what the HTML looks like:
<a class="pagenav" href=" category.php?id=4&page=8&s=d& ">8</a>
|
<a class="pagenav" href=" category.php?id=4&page=9&s=d& ">9</a>
|
<a class="pagenav" href=" category.php?id=4&page=10&s=d& ">10</a>
|
<a class="pagenav" href=" category.php?id=4&page=2&s=d& ">Next ></a>
Can someone explain where this %0D%0A comes from?
Kind regards, Maxim.
UPD:
I made a simple function
def process_value(value):
    value = value.strip()
    print(value)
    return value

and changed rules to

rules = (
    Rule(LinkExtractor(restrict_xpaths='//div[@class="wrapper"]/div[last()]/a[@class="pagenav"][last()]',
                       process_value=process_value)),
    # Rule(LinkExtractor(restrict_xpaths='//span[@class="update_title"]/a'), callback='parse_item'),
)
print command prints this:
Crawled (200) <GET http://web/category.php?id=4&>(referer: None)
http://web/
category.php?id=4&page=2&s=d&
Crawled (404) <GET http://web/%0D%0Acategory.php?%0D=&id=4&page=2&s=d>(referer: http://web/category.php?id=4&)
%0D and %0A are the CR and LF characters in URL (percent) encoding.
The author of the website you are parsing put those characters into the HTML document, probably by accident, since they aren't visible in an IDE or browser.
An explanation of what these invisible characters mean:
https://en.wikipedia.org/wiki/Carriage_return
https://en.wikipedia.org/wiki/Newline
And more about the encoding: http://www.w3schools.com/tags/ref_urlencode.asp
I suggest you strip all the links you need to fetch, like this:
href = href.strip()
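A sketch of how that cleanup can be wired directly into the rule via process_value (the XPath is taken from the question; replace() is used because strip() only removes leading and trailing whitespace):

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


def clean_link(value):
    # drop embedded CR/LF (which otherwise end up encoded as %0D%0A) and outer spaces
    return value.replace('\r', '').replace('\n', '').strip()


class CategorySpider(CrawlSpider):
    name = 'category'                          # hypothetical
    start_urls = ['http://web/category.php?id=4&']

    rules = (
        Rule(LinkExtractor(
            restrict_xpaths='//div[@class="wrapper"]/div[last()]/a[@class="pagenav"][last()]',
            process_value=clean_link)),
    )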

Portia Spider logs showing ['Partial'] during crawling

I have created a spider using Portia web scraper and the start URL is
https://www1.apply2jobs.com/EdwardJonesCareers/ProfExt/index.cfm?fuseaction=mExternal.searchJobs
While scheduling this spider in scrapyd I am getting
DEBUG: Crawled (200) <GET https://www1.apply2jobs.com/EdwardJonesCareers/ProfExt/index.cfm?fuseaction=mExternal.searchJobs> (referer: None) ['partial']
DEBUG: Crawled (200) <GET https://www1.apply2jobs.com/EdwardJonesCareers/ProfExt/index.cfm?fuseaction=mExternal.returnToResults&CurrentPage=2> (referer: https://www1.apply2jobs.com/EdwardJonesCareers/ProfExt/index.cfm?fuseaction=mExternal.searchJobs) ['partial']
DEBUG: Crawled (200) <GET https://www1.apply2jobs.com/EdwardJonesCareers/ProfExt/index.cfm?fuseaction=mExternal.showJob&RID=21805&CurrentPage=1> (referer: https://www1.apply2jobs.com/EdwardJonesCareers/ProfExt/index.cfm?fuseaction=mExternal.searchJobs) ['partial']
What does the ['partial'] mean, and why is the content from the page not scraped by the spider?
Late answer, but hopefully not useless, since this behavior by scrapy doesn't seem well-documented. Looking at this line of code from the scrapy source, the partial flag is set when the request encounters a Twisted PotentialDataLoss error. According to the corresponding Twisted documentation:
This only occurs when making requests to HTTP servers which do not set Content-Length or a Transfer-Encoding in the response
Possible causes include:
The server is misconfigured
There's a proxy involved that's blocking some headers
You get a response that doesn't normally have Content-Length, e.g. redirects (301, 302, 303), but you've set handle_httpstatus_list or handle_httpstatus_all such that the response doesn't get filtered out by HttpErrorMiddleware or fetched by RedirectMiddleware
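For reference, the flag ends up in response.flags, so a callback can at least detect and log potentially truncated responses; a minimal sketch using the start URL from the question:

import scrapy


class JobsSpider(scrapy.Spider):
    name = 'jobs'                              # hypothetical
    start_urls = [
        'https://www1.apply2jobs.com/EdwardJonesCareers/ProfExt/index.cfm'
        '?fuseaction=mExternal.searchJobs',
    ]

    def parse(self, response):
        if 'partial' in response.flags:
            # the body may be truncated; decide whether to retry, accept, or drop it
            self.logger.warning('Possibly truncated response: %s', response.url)
        yield {'url': response.url, 'flags': response.flags}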
