Scrapy. Unexpected symbols in LinkExtractor - python

I am studying Scrapy library and trying to make a little crawler.
Here's the crawler's rules:
rules = (
    Rule(LinkExtractor(restrict_xpaths='//div[@class="wrapper"]/div[last()]/a[@class="pagenav"][last()]')),
    # Rule(LinkExtractor(restrict_xpaths='//span[@class="update_title"]/a'), callback='parse_item'),
)
But I get this error message:
DEBUG: Crawled (200) <GET http://web/category.php?id=4&> (referer: None)
DEBUG: Crawled (404) <GET http://web/%0D%0Acategory.php?id=4&page=2&s=d> (referer: http://web/category.php?id=4&)
DEBUG: Ignoring response <404 http://web/%0D%0Acategory.php?id=4&page=2&s=d>: HTTP status code is not handled or not allowed
Here's what the HTML looks like:
<a class="pagenav" href=" category.php?id=4&page=8&s=d& ">8</a>
|
<a class="pagenav" href=" category.php?id=4&page=9&s=d& ">9</a>
|
<a class="pagenav" href=" category.php?id=4&page=10&s=d& ">10</a>
|
<a class="pagenav" href=" category.php?id=4&page=2&s=d& ">Next ></a>
Can someone explain where this %0D%0A comes from?
Kind regards, Maxim.
UPD:
I made a simple function
def process_value(value):
    value = value.strip()
    print value
    return value
and changed rules to
rules = (
    Rule(LinkExtractor(restrict_xpaths='//div[@class="wrapper"]/div[last()]/a[@class="pagenav"][last()]', process_value=process_value)),
    # Rule(LinkExtractor(restrict_xpaths='//span[@class="update_title"]/a'), callback='parse_item'),
)
print command prints this:
Crawled (200) <GET http://web/category.php?id=4&>(referer: None)
http://web/
category.php?id=4&page=2&s=d&
Crawled (404) <GET http://web/%0D%0Acategory.php?%0D=&id=4&page=2&s=d>(referer: http://web/category.php?id=4&)

%0D and %0A are the CR and LF characters in URL (percent) encoding.
The author of the website you are parsing put those characters into the HTML document, probably by accident, since they aren't visible in an IDE or browser.
Explanations of what these invisible characters mean:
https://en.wikipedia.org/wiki/Carriage_return
https://en.wikipedia.org/wiki/Newline
And more about the encoding: http://www.w3schools.com/tags/ref_urlencode.asp
I suggest stripping every link you need to fetch, like this:
href = href.strip()
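For example, a minimal sketch of wiring that strip into the rule through LinkExtractor's process_value argument (the spider name and start URL are placeholders; the XPath is the one from the question):

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

def clean_value(value):
    # strip the stray CR/LF and spaces the site leaves inside href attributes
    return value.strip()

class PagenavSpider(CrawlSpider):  # hypothetical spider name
    name = 'pagenav'
    start_urls = ['http://web/category.php?id=4&']

    rules = (
        Rule(LinkExtractor(
            restrict_xpaths='//div[@class="wrapper"]/div[last()]/a[@class="pagenav"][last()]',
            process_value=clean_value)),
    )

process_value runs on every extracted URL before the request is built, so the whitespace never reaches the scheduler.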

Related

Scrapy getting all pages hrefs from an array of startUrls

The problem I have is the following: I am trying to scrape a website that has multiple categories of products, and for each category of products, it has several pages with 24 products in each. I am able to get all starting urls, and scraping every page I am able to get the urls (endpoints, which I then make into full urls) of all pages.
I should say that not every category has product pages, and not every starting URL is a category, so it might not have the structure I am looking for. But most of them do.
My intent is: from all pages of all categories I want to extract the href of every product displayed in the page. And the code I have been using is the following one:
import scrapy

class MySpider(scrapy.spiders.CrawlSpider):
    name = 'myProj'

    with open('resultt.txt', 'r') as f:
        endurls = f.read()
    endurls = endurls.split(sep=' ')
    endurls = ['https://www.someurl.com' + url for url in endurls]
    start_urls = endurls

    def parse(self, response):
        with open('allpages.txt', 'a') as f:
            pages_in_category = response.xpath('//option/@value').getall()
            length = len(pages_in_category)
            pages_in_category = ['https://www.someurl.com' + page for page in pages_in_category]
            if length == 0:
                f.write(str(response.url))
            else:
                for page in pages_in_category:
                    f.write(page)
Through scrapy shell I am able to make it work, though not iteratively. The command I run in the terminal is then
scrapy runspider ScrapyCarr.py -s USER_AGENT='my-cool-project (http://example.com)'
since I have not initialized a proper Scrapy project structure (I don't need that; it is a simple project for uni and I do not care much about the structure). Unfortunately, the file to which I am trying to append my product URLs remains empty, even though when I run the same code through scrapy shell I see it working.
The output I am currently getting is the following
2020-10-15 12:51:56 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.someurl.com/market/fish/typefish/N-4minn0/c> (referer: None)
2020-10-15 12:51:57 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.someurl.com/market/medicines/typemed/N-i50owa/c> (referer: None)
2020-10-15 12:51:57 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.someurl.com/market/medicines/typemed/N-1l0cnr6/c> (referer: None)
2020-10-15 12:51:57 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.someurl.com/market/medicines/typemed/N-18isujc/c> (referer: None)
The problem was that I was basing my class MySpider on scrapy.spiders.CrawlSpider. The code works when using scrapy.Spider instead.
SOLVED
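For reference, a rough sketch of the working variant with scrapy.Spider as the base class (file names and the URL prefix are the ones from the question; trailing newlines are added when writing only to keep the output file readable). CrawlSpider reserves parse() for its own link-following logic, which is why overriding it there breaks the crawl.

import scrapy

class MySpider(scrapy.Spider):  # plain Spider: parse() is ours to override
    name = 'myProj'

    with open('resultt.txt', 'r') as f:
        endurls = f.read().split(sep=' ')
    start_urls = ['https://www.someurl.com' + url for url in endurls]

    def parse(self, response):
        pages_in_category = response.xpath('//option/@value').getall()
        with open('allpages.txt', 'a') as f:
            if not pages_in_category:
                f.write(response.url + '\n')
            else:
                for page in pages_in_category:
                    f.write('https://www.someurl.com' + page + '\n')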

Scrapy ignores deny rule

As a newbie in scrapy and python, I'm struggling with the deny rules of my CrawlSpider. I want to filter out all URLs on my target page which contain the word "versicherung" and the double-? structure in any part of the URL. However, scrapy ignores my rule. Can anyone tell me what's wrong with the syntax? (I've already tried without the "\" before the *, but that doesn't work either.)
Rule:
rules = [Rule(LinkExtractor(deny=(r'\*versicher\*', r'\*\?\*\?\*')),
              callback='parse_norisbank', follow=True)]
Log:
DEBUG: Crawled (200) <GET https://www.norisbank.de/produkte/versicherungen/rechtsschutzversicherung.html> (referer: https://www.norisbank.de)
DEBUG: Crawled (200) <GET https://www.norisbank.de/produkte/versicherungen/haftpflichtversicherung.html> (referer: https://www.norisbank.de)
DEBUG: Crawled (200) <GET https://www.norisbank.de/produkte/versicherungen/hausratversicherung.html> (referer: https://www.norisbank.de)
DEBUG: Crawled (200) <GET https://www.norisbank.de/produkte/versicherungen/versicherungsmanager.html> (referer: https://www.norisbank.de)
DEBUG: Saved file nbtest-versicherungen.html
The rules must be regular expressions and (even if I correct your syntax) you are not using * correctly.
r'\*versicher\*' should be r'.*versicher.*'. EDIT: looking at the scrapy docs, it looks like r'versicher' is sufficient.
I don't understand what you mean by "double ? structure", but your URLs don't seem to have it.
I expect r'.*\?\?.*' is what you want (or r'\?\?')
In regular expressions
. means any character
* means 0 or more of the preceding (so .* matches anything)
\ is how you escape a special character. You don't want to escape the * since you want it to act in its special way.
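Put back into the rule, that might look like the sketch below (the spider name, allowed_domains and start_urls are assumptions based on the log; the callback name is from the question, and r'\?\?' is the literal double-question-mark reading):

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class NorisbankSpider(CrawlSpider):  # hypothetical name
    name = 'norisbank'
    allowed_domains = ['norisbank.de']
    start_urls = ['https://www.norisbank.de']

    rules = [
        Rule(
            # no asterisks: 'versicher' matches anywhere in the URL,
            # r'\?\?' matches a literal '??'
            LinkExtractor(deny=(r'versicher', r'\?\?')),
            callback='parse_norisbank',
            follow=True,
        ),
    ]

    def parse_norisbank(self, response):
        yield {'url': response.url}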

DEBUG: Crawled (404)

This is my code:
# -*- coding: utf-8 -*-
import scrapy

class SinasharesSpider(scrapy.Spider):
    name = 'SinaShares'
    allowed_domains = ['money.finance.sina.com.cn/mkt/']
    start_urls = ['http://money.finance.sina.com.cn/mkt//']

    def parse(self, response):
        contents = response.xpath('//*[@id="list_amount_ctrl"]/a[2]/@class').extract()
        print(contents)
And I have set a user agent in settings.py.
Then I get an error:
2020-04-27 10:54:50 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://money.finance.sina.com.cn/robots.txt> (referer: None)
2020-04-27 10:54:50 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://money.finance.sina.com.cn/mkt//> (referer: None)
So how can I eliminate this error?
Maybe your IP is banned by the website; you may also need to add some cookies to crawl the data you need.
The HTTP status code 404 is received because Scrapy checks /robots.txt by default. In your case this file does not exist, so a 404 is received, but that has no impact. If you want to avoid checking robots.txt, you can set ROBOTSTXT_OBEY = False in settings.py.
The website itself is then accessed successfully (HTTP status code 200). No content is printed because your XPath selection matches nothing; you have to fix the XPath selection.
If you want to test different xpath- or css-selections in order to figure how to get your desired content, you might want to use the interactive scrapy shell:
scrapy shell "http://money.finance.sina.com.cn/mkt/"
You can find an example of a scrapy shell session in the official Scrapy documentation here.
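If you would rather silence that 404 than just ignore it, here is a minimal sketch with ROBOTSTXT_OBEY turned off via custom_settings (the XPath below is only a placeholder; the real selection still needs to be worked out in the shell):

import scrapy

class SinasharesSpider(scrapy.Spider):
    name = 'SinaShares'
    allowed_domains = ['money.finance.sina.com.cn']  # domain only, no path
    start_urls = ['http://money.finance.sina.com.cn/mkt/']
    custom_settings = {'ROBOTSTXT_OBEY': False}  # skip the /robots.txt request entirely

    def parse(self, response):
        # placeholder selection; replace with whatever you settle on in scrapy shell
        print(response.xpath('//title/text()').get())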

Confusion on Scrapy re-direct behavior?

So I am trying to scrape articles from news website that has an infinite scroll type layout so the following is what happens:
example.com has first page of articles
example.com/page/2/ has second page
example.com/page/3/ has third page
And so on. As you scroll down, the url changes. To account for that, I wanted to scrape the first x number of articles and did the following:
start_urls = ['http://example.com/']
for x in range(1, x):
    new_url = 'http://www.example.com/page/' + str(x) + '/'
    start_urls.append(new_url)
It seems to work fine for the first 9 pages and I get something like the following:
Redirecting (301) to <GET http://example.com/page/4/> from <GET http://www.example.com/page/4/>
Redirecting (301) to <GET http://example.com/page/5/> from <GET http://www.example.com/page/5/>
Redirecting (301) to <GET http://example.com/page/6/> from <GET http://www.example.com/page/6/>
Redirecting (301) to <GET http://example.com/page/7/> from <GET http://www.example.com/page/7/>
2017-09-08 17:36:23 [scrapy.extensions.logstats] INFO: Crawled 3 pages (at 3 pages/min), scraped 0 items (at 0 items/min)
Redirecting (301) to <GET http://example.com/page/8/> from <GET http://www.example.com/page/8/>
Redirecting (301) to <GET http://example.com/page/9/> from <GET http://www.example.com/page/9/>
Redirecting (301) to <GET http://www.example.com/> from <GET http://www.example.com/page/10/>
Redirecting (301) to <GET http://www.example.com/> from <GET http://www.example.com/page/11/>
Redirecting (301) to <GET http://www.example.com/> from <GET http://www.example.com/page/12/>
Redirecting (301) to <GET http://www.example.com/> from <GET http://www.example.com/page/13/>
Starting from page 10, it redirects to a page like example.com/ from example.com/page/10/ instead of the original link, example.com/page/10. What could be causing this behavior?
I looked into a couple options like dont_redirect, but I just don't understand what is happening. What can be the reason for this re-direction behavior? Especially since no re-direction happens when you directly type in the link for the website like example.com/page/10?
Any help would be greatly appreciated, thanks!!
[EDIT]
class spider(CrawlSpider):
    start_urls = ['http://example.com/']
    for x in range(startPage, endPage):
        new_url = 'http://www.example.com/page/' + str(x) + '/'
        start_urls.append(new_url)

    custom_settings = {'DEPTH_PRIORITY': 1, 'DEPTH_LIMIT': 1}

    rules = (
        Rule(LinkExtractor(allow=('some regex here',), deny=(r'example\.com/page/.*', 'some other regex',)),
             callback='parse_article'),
    )

    def parse_article(self, response):
        # some parsing work here
        yield item
Is it because I include example\.com/page/.* in the LinkExtractor? Shouldn't that only apply to links that are not the start_url however?
It looks like this site uses some kind of security that only checks the User-Agent in the request headers.
So you only need to add a common User-Agent in the settings.py file:
USER_AGENT = 'Mozilla/5.0'
Also, the spider doesn't necessarily need the start_urls attribute to get the starting sites; you can use the start_requests method instead, so replace all the creation of start_urls with:
from scrapy import Request

class spider(CrawlSpider):
    ...
    def start_requests(self):
        for x in range(1, 20):
            yield Request('http://www.example.com/page/' + str(x) + '/')
    ...

Portia Spider logs showing ['Partial'] during crawling

I have created a spider using Portia web scraper and the start URL is
https://www1.apply2jobs.com/EdwardJonesCareers/ProfExt/index.cfm?fuseaction=mExternal.searchJobs
While scheduling this spider in scrapyd I am getting
DEBUG: Crawled (200) <GET https://www1.apply2jobs.com/EdwardJonesCareers/ProfExt/index.cfm?fuseaction=mExternal.searchJobs> (referer: None) ['partial']
DEBUG: Crawled (200) <GET https://www1.apply2jobs.com/EdwardJonesCareers/ProfExt/index.cfm?fuseaction=mExternal.returnToResults&CurrentPage=2> (referer: https://www1.apply2jobs.com/EdwardJonesCareers/ProfExt/index.cfm?fuseaction=mExternal.searchJobs) ['partial']
DEBUG: Crawled (200) <GET https://www1.apply2jobs.com/EdwardJonesCareers/ProfExt/index.cfm?fuseaction=mExternal.showJob&RID=21805&CurrentPage=1> (referer: https://www1.apply2jobs.com/EdwardJonesCareers/ProfExt/index.cfm?fuseaction=mExternal.searchJobs) ['partial']
What does the ['partial'] mean, and why is the content from the page not scraped by the spider?
Late answer, but hopefully not useless, since this behavior by scrapy doesn't seem well-documented. Looking at this line of code from the scrapy source, the partial flag is set when the request encounters a Twisted PotentialDataLoss error. According to the corresponding Twisted documentation:
This only occurs when making requests to HTTP servers which do not set Content-Length or a Transfer-Encoding in the response
Possible causes include:
The server is misconfigured
There's a proxy involved that's blocking some headers
You get a response that doesn't normally have Content-Length, e.g. redirects (301, 302, 303), but you've set handle_httpstatus_list or handle_httpstatus_all such that the response doesn't get filtered out by HttpErrorMiddleware or fetched by RedirectMiddleware
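For that last case, the relevant spider attribute looks like this (a sketch; the start URL is the one from the question, and whether the flag actually shows up depends on how the server writes the redirect response):

import scrapy

class JobsSpider(scrapy.Spider):  # hypothetical name
    name = 'jobs'
    start_urls = ['https://www1.apply2jobs.com/EdwardJonesCareers/ProfExt/index.cfm?fuseaction=mExternal.searchJobs']
    # let 3xx responses reach the spider instead of being consumed by RedirectMiddleware;
    # such responses often carry no Content-Length, which is what sets the 'partial' flag
    handle_httpstatus_list = [301, 302, 303]

    def parse(self, response):
        self.logger.info('status=%s flags=%s', response.status, response.flags)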
