Portia Spider logs showing ['Partial'] during crawling - python

I have created a spider using the Portia web scraper, and the start URL is
https://www1.apply2jobs.com/EdwardJonesCareers/ProfExt/index.cfm?fuseaction=mExternal.searchJobs
While scheduling this spider in scrapyd, I am getting:
DEBUG: Crawled (200) <GET https://www1.apply2jobs.com/EdwardJonesCareers/ProfExt/index.cfm?fuseaction=mExternal.searchJobs> (referer: None) ['partial']
DEBUG: Crawled (200) <GET https://www1.apply2jobs.com/EdwardJonesCareers/ProfExt/index.cfm?fuseaction=mExternal.returnToResults&CurrentPage=2> (referer: https://www1.apply2jobs.com/EdwardJonesCareers/ProfExt/index.cfm?fuseaction=mExternal.searchJobs) ['partial']
DEBUG: Crawled (200) <GET https://www1.apply2jobs.com/EdwardJonesCareers/ProfExt/index.cfm?fuseaction=mExternal.showJob&RID=21805&CurrentPage=1> (referer: https://www1.apply2jobs.com/EdwardJonesCareers/ProfExt/index.cfm?fuseaction=mExternal.searchJobs) ['partial']
What does ['partial'] mean, and why is the content from the page not scraped by the spider?

Late answer, but hopefully not useless, since this behavior of Scrapy doesn't seem well documented. Looking at this line of code from the Scrapy source, the partial flag is set when the request encounters a Twisted PotentialDataLoss error. According to the corresponding Twisted documentation:
This only occurs when making requests to HTTP servers which do not set Content-Length or a Transfer-Encoding in the response
Possible causes include:
The server is misconfigured
There's a proxy involved that's blocking some headers
You get a response that doesn't normally have a Content-Length, e.g. a redirect (301, 302, 303), but you've set handle_httpstatus_list or handle_httpstatus_all so that the response doesn't get filtered out by HttpErrorMiddleware or handled by RedirectMiddleware
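If you just want to detect these responses in your own callbacks, the flag printed in the log is exposed on the response object as response.flags. A minimal sketch (the spider name and the handling here are illustrative, not taken from the original Portia project):

import scrapy

class CheckPartialSpider(scrapy.Spider):
    name = "check_partial"
    start_urls = [
        "https://www1.apply2jobs.com/EdwardJonesCareers/ProfExt/index.cfm?fuseaction=mExternal.searchJobs"
    ]

    def parse(self, response):
        # 'partial' is the same flag printed as ['partial'] in the crawl log
        if "partial" in response.flags:
            self.logger.warning("Possible data loss for %s; the body may be truncated", response.url)
        # whatever was received is still available for parsing
        yield {"url": response.url, "partial": "partial" in response.flags}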

Related

Why doesn't a callback get executed immediately upon calling yield in Scrapy?

I am building a web scraper to scrape remote jobs. The spider behaves in a way that I don't understand and I'd appreciate it if someone could explain why.
Here's the code for the spider:
import scrapy
import time


class JobsSpider(scrapy.Spider):
    name = "jobs"
    start_urls = [
        "https://stackoverflow.com/jobs/remote-developer-jobs"
    ]
    already_visited_links = []

    def parse(self, response):
        jobs = response.xpath("//div[contains(@class, 'job')]")
        links_to_next_pages = response.xpath("//a[contains(@class, 's-pagination--item')]").css("a::attr(href)").getall()

        # visit each job page (as I do in the browser) and scrape the relevant information (Job title etc.)
        for job in jobs:
            job_id = int(job.xpath('@data-jobid').extract_first())  # there will always be one element
            # now visit the link with the job_id and get the info
            job_link_to_visit = "https://stackoverflow.com/jobs?id=" + str(job_id)
            request = scrapy.Request(job_link_to_visit,
                                     callback=self.parse_job)
            yield request

        # sleep for 10 seconds before requesting the next page
        print("Sleeping for 10 seconds...")
        time.sleep(10)

        # go to the next job listings page (if you haven't already been there)
        # not sure if this solution is the best since it has a loop which has a recursion in it
        for link_to_next_page in links_to_next_pages:
            if link_to_next_page not in self.already_visited_links:
                self.already_visited_links.append(link_to_next_page)
                yield response.follow(link_to_next_page, callback=self.parse)

        print("End of parse method")

    def parse_job(self, response):
        print(response.body)
        print("Sleeping for 10 seconds...")
        time.sleep(10)
        pass
Here's the output (the relevant parts):
Sleeping for 10 seconds...
End of parse method
2021-04-29 20:49:55 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://stackoverflow.com/jobs?id=525754> (referer: https://stackoverflow.com/jobs/remote-developer-jobs)
2021-04-29 20:49:55 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://stackoverflow.com/jobs?id=525748> (referer: https://stackoverflow.com/jobs/remote-developer-jobs)
2021-04-29 20:49:55 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://stackoverflow.com/jobs?id=497114> (referer: https://stackoverflow.com/jobs/remote-developer-jobs)
2021-04-29 20:49:55 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://stackoverflow.com/jobs?id=523136> (referer: https://stackoverflow.com/jobs/remote-developer-jobs)
2021-04-29 20:49:55 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://stackoverflow.com/jobs?id=525730> (referer: https://stackoverflow.com/jobs/remote-developer-jobs)
In parse_job
2021-04-29 20:50:05 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://stackoverflow.com/jobs/remote-developer-jobs?so_source=JobSearch&so_medium=Internal> (referer: https://stackoverflow.com/jobs/remote-developer-jobs)
2021-04-29 20:50:05 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://stackoverflow.com/jobs?id=523319> (referer: https://stackoverflow.com/jobs/remote-developer-jobs)
2021-04-29 20:50:05 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://stackoverflow.com/jobs?id=522480> (referer: https://stackoverflow.com/jobs/remote-developer-jobs)
2021-04-29 20:50:05 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://stackoverflow.com/jobs?id=511761> (referer: https://stackoverflow.com/jobs/remote-developer-jobs)
2021-04-29 20:50:06 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://stackoverflow.com/jobs?id=522483> (referer: https://stackoverflow.com/jobs/remote-developer-jobs)
2021-04-29 20:50:06 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://stackoverflow.com/jobs?id=249610> (referer: https://stackoverflow.com/jobs/remote-developer-jobs)
2021-04-29 20:50:06 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://stackoverflow.com/jobs?id=522481> (referer: https://stackoverflow.com/jobs/remote-developer-jobs)
In parse_job
In parse_job
In parse_job
In parse_job
...
I don't understand why the parse method gets executed fully before the parse_job method gets called. From my understanding, as soon as I yield a job from jobs, the parse_job method should get called. The spider should go over each page of job listings and visit the details of each individual job on that page. However, the description I gave in the previous sentence doesn't match the output. I also don't understand why there are multiple GET requests between each call to the parse_job method.
Can someone explain what is going on here?
Scrapy is event-driven. Requests are first queued by the Scheduler, and queued requests are then passed to the Downloader. When a response has been downloaded and is ready, the callback function is called with that response as its first argument.
You are blocking the callbacks by using time.sleep(). In the logs shown, after the first callback call the process was blocked for 10 seconds inside parse_job(), but at the same time the Downloader kept working and getting responses ready for the callbacks, as the successive DEBUG: Crawled (200) lines after the first parse_job() call show. So while the callback was blocked, the Downloader finished its job and the responses queued up waiting to be fed to the callback; as the last part of the logs shows, passing responses to the callback became the bottleneck.
If you want to put a delay between requests, it's better to use the DOWNLOAD_DELAY setting instead of time.sleep().
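For example, a minimal sketch of the two usual places to set it (the 10-second value simply mirrors the sleep in the question):

# settings.py
DOWNLOAD_DELAY = 10

# or per spider, overriding the project settings:
import scrapy

class JobsSpider(scrapy.Spider):
    name = "jobs"
    custom_settings = {"DOWNLOAD_DELAY": 10}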
Take a look at this for more details about Scrapy architecture.

Scrapy shell URL returns 404 for endless scroll

I am practicing how to use the scrapy shell in the command prompt, and here's the URL:
https://shopee.com.my/shop/145423/followers/?__classic__=1
In the Google Chrome developer tools (F12), under the Network section, I cleared everything, scrolled down the website, and got this link:
https://shopee.com.my/shop/145423/followers/?offset=60&limit=20&offset_of_offset=0&_=1610787400133
The link is supposed to return some data but when trying
scrapy shell https://shopee.com.my/shop/145423/followers/?offset=60&limit=20&offset_of_offset=0&_=1610787400133
I got 404 as a response.
I think there's a popup that requires the user to click on the language, and this is what causes the problem.
How can such a popup be dealt with or skipped?
Use a User-Agent header. You can also set the User-Agent from the command line, as shown after the shell session below.
>>> headers = {'User-Agent': 'Mybot'}
>>> r = scrapy.Request(url, headers=headers)
>>> fetch(r)
2021-01-16 16:53:11 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://shopee.com.my/shop/145423/followers/?offset=60&limit=20&offset_of_offset=0&_=1610787400133&__classic__=1> from <GET https://shopee.com.my/shop/145423/followers/?offset=60&limit=20&offset_of_offset=0&_=1610787400133>
2021-01-16 16:53:11 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://shopee.com.my/shop/145423/followers/?offset=60&limit=20&offset_of_offset=0&_=1610787400133&__classic__=1> (referer: None)
>>> response.status
200
>>>
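The same idea as a minimal command-line sketch (using the -s option to override a setting for this run; the URL is quoted so the shell does not split it at the & characters):

scrapy shell -s USER_AGENT='Mybot' "https://shopee.com.my/shop/145423/followers/?offset=60&limit=20&offset_of_offset=0&_=1610787400133"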

DEBUG: Crawled (404)

This is my code:
# -*- coding: utf-8 -*-
import scrapy


class SinasharesSpider(scrapy.Spider):
    name = 'SinaShares'
    allowed_domains = ['money.finance.sina.com.cn/mkt/']
    start_urls = ['http://money.finance.sina.com.cn/mkt//']

    def parse(self, response):
        contents = response.xpath('//*[@id="list_amount_ctrl"]/a[2]/@class').extract()
        print(contents)
And I have set a user-agent in settings.py.
Then I get an error:
2020-04-27 10:54:50 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://money.finance.sina.com.cn/robots.txt> (referer: None)
2020-04-27 10:54:50 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://money.finance.sina.com.cn/mkt//> (referer: None)
So how can I eliminate this error?
Maybe your IP is banned by the website; you may also need to add some cookies to crawl the data you need.
The HTTP status code 404 is received because Scrapy checks /robots.txt by default. In your case this file does not exist, so a 404 is received, but that does not have any impact. If you want to avoid checking robots.txt, you can set ROBOTSTXT_OBEY = False in settings.py.
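For reference, a minimal sketch of that change in the project's settings.py:

# settings.py
ROBOTSTXT_OBEY = False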
Then the website itself is accessed successfully (HTTP status code 200). No content is printed because, based on your XPath selection, nothing is selected. You have to fix your XPath selection.
If you want to test different XPath or CSS selections in order to figure out how to get your desired content, you might want to use the interactive scrapy shell:
scrapy shell "http://money.finance.sina.com.cn/mkt/"
You can find an example of a scrapy shell session in the official Scrapy documentation here.
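A minimal sketch of such a session (the selectors are only illustrative; the actual output depends on the page's HTML):

>>> response.status
200
>>> response.css('title::text').get()            # quick sanity check that a body was received
>>> response.xpath('//*[@id="list_amount_ctrl"]/a[2]/@class').get()   # returns None when nothing matches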

Confusion on Scrapy re-direct behavior?

So I am trying to scrape articles from a news website that has an infinite-scroll type layout, so the following is what happens:
example.com has first page of articles
example.com/page/2/ has second page
example.com/page/3/ has third page
And so on. As you scroll down, the url changes. To account for that, I wanted to scrape the first x number of articles and did the following:
start_urls = ['http://example.com/']
for x in range(1, x):
    new_url = 'http://www.example.com/page/' + str(x) + '/'
    start_urls.append(new_url)
It seems to work fine for the first 9 pages and I get something like the following:
Redirecting (301) to <GET http://example.com/page/4/> from <GET http://www.example.com/page/4/>
Redirecting (301) to <GET http://example.com/page/5/> from <GET http://www.example.com/page/5/>
Redirecting (301) to <GET http://example.com/page/6/> from <GET http://www.example.com/page/6/>
Redirecting (301) to <GET http://example.com/page/7/> from <GET http://www.example.com/page/7/>
2017-09-08 17:36:23 [scrapy.extensions.logstats] INFO: Crawled 3 pages (at 3 pages/min), scraped 0 items (at 0 items/min)
Redirecting (301) to <GET http://example.com/page/8/> from <GET http://www.example.com/page/8/>
Redirecting (301) to <GET http://example.com/page/9/> from <GET http://www.example.com/page/9/>
Redirecting (301) to <GET http://www.example.com/> from <GET http://www.example.com/page/10/>
Redirecting (301) to <GET http://www.example.com/> from <GET http://www.example.com/page/11/>
Redirecting (301) to <GET http://www.example.com/> from <GET http://www.example.com/page/12/>
Redirecting (301) to <GET http://www.example.com/> from <GET http://www.example.com/page/13/>
Starting from page 10, it redirects to a page like example.com/ from example.com/page/10/ instead of the original link, example.com/page/10/. What could be causing this behavior?
I looked into a couple of options like dont_redirect, but I just don't understand what is happening. What could be the reason for this redirection behavior, especially since no redirection happens when you directly type in the link for the website, like example.com/page/10?
Any help would be greatly appreciated, thanks!!
[EDIT]
class spider(CrawlSpider):
    start_urls = ['http://example.com/']
    for x in range(startPage, endPage):
        new_url = 'http://www.example.com/page/' + str(x) + '/'
        start_urls.append(new_url)

    custom_settings = {'DEPTH_PRIORITY': 1, 'DEPTH_LIMIT': 1}

    rules = (
        Rule(LinkExtractor(allow=('some regex here',), deny=('example\.com/page/.*', 'some other regex',)), callback='parse_article'),
    )

    def parse_article(self, response):
        # some parsing work here
        yield item
Is it because I include example\.com/page/.* in the LinkExtractor's deny list? Shouldn't that only apply to links that are not start URLs, though?
It looks like this site uses some kind of security check that only looks at the User-Agent in the request headers.
So you only need to add a common User-Agent in the settings.py file:
USER_AGENT = 'Mozilla/5.0'
Also, the spider doesn't necessarily need the start_urls attribute to get the starting sites; you can also use the start_requests method. So replace all the code that builds start_urls with:
class spider(CrawlSpider):
    ...

    def start_requests(self):
        for x in range(1, 20):
            yield Request('http://www.example.com/page/' + str(x) + '/')
    ...
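Note that Request here refers to Scrapy's request class, so the module is assumed to have from scrapy import Request at the top (or you can write scrapy.Request instead).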

Scrapy crawl 301 redirect pages but doesn't scrape data from them

I can't figure out how to allow scrapy to scrape 301 redirected pages.
When I add
handle_httpstatus_list = [301,302]
the log stops telling me
2015-09-29 09:45:06 [scrapy] DEBUG: Crawled (301) <GET http://www.example.com/conditions-generales/> (referer: http://www.example.com/)
2015-09-29 09:45:07 [scrapy] DEBUG: Ignoring response <301 http://www.example.com/conditions-generales/>: HTTP status code is not handled or not allowed
but Scrapy then only crawls the 301-redirected pages and never scrapes data from them (while it does scrape pages with a 200 HTTP status code).
I then get :
2015-09-29 09:55:39 [scrapy] DEBUG: Crawled (301) <GET http://www.example.com/espace-annonceurs/> (referer: http://www.example.com/)
But never :
2015-09-29 09:55:39 [scrapy] DEBUG: Scraped from <301 http://www.example.com/espace-annonceurs/>
I would like to scrape http://www.example.com/espace-annonceurs/ just the way I would if it returned a 200 HTTP status code.
I suppose I have to use a middleware, but I don't know how to do this.
Thank you for your help.
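For what it's worth, a minimal sketch of one possible direction (assuming a Scrapy version where response.follow is available; whether it fits this site is untested): when 301/302 are in handle_httpstatus_list, the redirect response reaches the callback with an essentially empty body, so the callback has to read the Location header and follow it itself.

def parse(self, response):
    if response.status in (301, 302):
        # a redirect response has no useful body; the target URL is in the Location header
        location = response.headers.get('Location')
        if location:
            yield response.follow(location.decode(), callback=self.parse)
        return
    # normal parsing of 200 pages goes here
    yield {'url': response.url, 'title': response.css('title::text').get()}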
