I am trying to scrape some info from offerup.com and on the scrapy shell, nothing comes up.
I will type:
scrapy shell https://offerup.com/
It will go there, but then if I simply try to get the text of the whole webpage with:
response.xpath('//text()').extract()
it comes back with:
['Request unsuccessful. Incapsula incident ID: 623000250007296502-10946686267359632']
It comes back with nothing for any other info I try to get from the response, such as the title.
Do you know why this happens? Any help is hugely appreciated.
Take care to read the response you get when visiting offerup.
[s] Available Scrapy objects:
[s]   scrapy     scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s]   crawler    <scrapy.crawler.Crawler object at 0x...>
[s]   item       {}
[s]   request    <GET https://offerup.com>
[s]   response   <403 https://offerup.com>
You get a 403, a Forbidden error: the server is refusing to serve the page to your client, so there is nothing on Scrapy's side to bypass it.
If you try a different site, such as http://buffalo.craigslist.org, an OK response of 200 is given. Using the same command will show the desired page, and using response.xpath('//text()').extract() will print all of the text elements from root.
Some sites may have anti-scraping measures set up to prevent robots from hogging their resources. Offerup is apparently such a site.
To directly answer your question, your code is functional, but the target site prevents you from using it.
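If you want to confirm this from a spider rather than the shell, here is a minimal sketch (the spider name and logging are illustrative, not from the question) that lets the 403 response reach your callback so you can read the Incapsula message:

import scrapy

class OfferupCheckSpider(scrapy.Spider):
    # Sketch only: the spider name and logging are illustrative.
    name = 'offerup_check'
    start_urls = ['https://offerup.com/']
    # By default Scrapy drops 403 responses before they reach parse();
    # listing the status here lets the blocked response reach the callback.
    handle_httpstatus_list = [403]

    def parse(self, response):
        self.logger.info('status: %s', response.status)
        # On a blocked request this logs the Incapsula incident message
        # instead of real page content.
        self.logger.info(response.text[:200])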
Related
I'm working on a project where the data of the following url needs to be scraped: https://www.funda.nl/objectinsights/getdata/5628496/
The last part of the url represents the ID of an object. Opening the link in the browser does work, but sometimes it returns a 404 error. The same happens when using scrapy shell in Python: sometimes I can scrape the url, sometimes not.
When I managed to open the url (without a 404 error), I went to inspect > network. But I'm not experienced enough to understand this information. Does someone know the fix? Or have additional information on this topic?
Extra urls you can try:
https://www.funda.nl/objectinsights/getdata/5819260/
https://www.funda.nl/objectinsights/getdata/5819578/
https://www.funda.nl/objectinsights/getdata/5819237/
https://www.funda.nl/objectinsights/getdata/5819359/
https://www.funda.nl/objectinsights/getdata/5819371/
https://www.funda.nl/objectinsights/getdata/5819386/
I tested these in scrapy shell and got response 200 each time.
This is not a Scrapy issue if you are getting intermittent 404 responses even from a browser.
They may well be limiting you to a small number of requests per IP address or per minute.
Try writing some code with a delay between requests, or use a rotating proxy (free trials are out there if you don't want to sign up for one).
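For example, a minimal sketch of a spider with a delay and a retry on 404 (the spider name, the ID list, and the setting values are assumptions, not from your project; the rotating-proxy part is left out because it depends on the provider):

import scrapy

class FundaInsightsSpider(scrapy.Spider):
    # Sketch only: spider name, start_urls and setting values are assumptions.
    name = 'funda_insights'
    start_urls = [
        'https://www.funda.nl/objectinsights/getdata/5628496/',
        'https://www.funda.nl/objectinsights/getdata/5819260/',
    ]
    custom_settings = {
        'DOWNLOAD_DELAY': 2,                  # wait ~2 seconds between requests
        'RANDOMIZE_DOWNLOAD_DELAY': True,     # add a little jitter to the delay
        'CONCURRENT_REQUESTS_PER_DOMAIN': 1,
        'RETRY_HTTP_CODES': [404],            # retry the intermittent 404s as well
        'RETRY_TIMES': 3,
    }

    def parse(self, response):
        yield {'url': response.url, 'body': response.text}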
I'm new to Scrapy and I have to crawl a webpage for a test. So I used the code below in a terminal, but it returns an empty list and I don't understand why. When I use the same command on another website, like Amazon, with the right selector, it works. Can someone shed light on it? Thank you so much.
scrapy shell "'https://www.woolworths.com.au/shop/browse/drinks/cordials-juices-iced-teas/iced-teas"
response.css('.tileList-title').extract()
First of all, when I consulted the source code of the page, the title Iced Teas you seem interested in scraping is in an <h1> header tag. Am I right?
Second, I tried scrapy shell sessions to understand the issue. It seems to be a matter of setting the User-Agent request header. Look at the shell sessions below:
Without user-agent set
scrapy shell https://www.woolworths.com.au/shop/browse/drinks/cordials-juices-iced-teas/iced-teas
In [1]: response.css('.tileList-title').extract()
Out[1]: []
view(response) #open the given response in your local web browser, for inspection.
With user agent set
scrapy shell https://www.woolworths.com.au/shop/browse/drinks/cordials-juices-iced-teas/iced-teas -s USER_AGENT='Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'
In [1]: response.css('.tileList-title').extract()
Out[1]: ['<h1 class="tileList-title" ng-if="$ctrl.listTitle" tabindex="-1">Iced Teas</h1>']
#now as you can see it does not return an empty list.
view(response)
So to improve your future practice, know that you can use -s KEYWORDSETTING=value in your scrapy shell sessions. Here are the settings keywords for Scrapy.
And check with view(response) to see whether the request returned the expected content, even if it came back with a 200. In my experience, with view(response) you can see that the page content, and sometimes even the source code, is a little different when you fetch it in a scrapy shell than when you use a normal browser. So it's good practice to check with this shortcut. Here are the shortcuts for Scrapy. They are also listed at the start of each scrapy shell session.
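If you move from the shell to a spider, the same User-Agent can be set once instead of on every command line. A minimal sketch, assuming the selector from the session above (the spider name is illustrative):

import scrapy

class IcedTeasSpider(scrapy.Spider):
    # Sketch only: the spider name is illustrative; the selector comes from the session above.
    name = 'iced_teas'
    start_urls = [
        'https://www.woolworths.com.au/shop/browse/drinks/cordials-juices-iced-teas/iced-teas',
    ]
    # Same effect as -s USER_AGENT=... in the shell, but persistent for this spider.
    custom_settings = {
        'USER_AGENT': 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)',
    }

    def parse(self, response):
        yield {'title': response.css('.tileList-title::text').extract_first()}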
I am trying to crawl pages like this http://cmispub.cicpa.org.cn/cicpa2_web/07/0000010F849E5F5C9F672D8232D275F4.shtml. Each of these pages contains certain information about an individual person.
There are two ways to get to these pages.
One is to coin their urls, which is what I used in my scrapy code. I had my scrapy POST request bodies like ascGuid=&isStock=00&method=indexQuery&offName=&pageNum=&pageSize=&perCode=110001670258&perName=&queryType=2 sent to http://cmispub.cicpa.org.cn/cicpa2_web/public/query0/2/00.shtml.
These POSTs would return responses where I can use xpath and regex to find strings like '0000010F849E5F5C9F672D8232D275F4' to coin the urls I really wanted:
next_url_part1 = 'http://cmispub.cicpa.org.cn/cicpa2_web/07/'
next_url_part2 = ...  # ids pulled out with some xpath and regex
next_url_part3 = '.shtml'
for i in next_url_part2:
    next_url_list.append(''.join([next_url_part1, i, next_url_part3]))
Finally, scrapy sent GET requests to these coined links and downloaded and crawled the information I wanted.
Since the pages I wanted are information about different individuals, I can change the perCode= part in those POST request bodies to coin the corresponding urls for different persons.
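For reference, here is a minimal sketch of that first way as a single spider. The form field values come from the POST body quoted above; the regex for pulling out the 32-character ids and the spider name are assumptions:

import re
import scrapy

class CicpaSpider(scrapy.Spider):
    # Sketch only: form fields come from the POST body quoted above;
    # the id regex and the spider name are assumptions.
    name = 'cicpa'
    person_codes = ['110001670258']  # example code from the question

    def start_requests(self):
        for code in self.person_codes:
            yield scrapy.FormRequest(
                'http://cmispub.cicpa.org.cn/cicpa2_web/public/query0/2/00.shtml',
                formdata={'ascGuid': '', 'isStock': '00', 'method': 'indexQuery',
                          'offName': '', 'pageNum': '', 'pageSize': '',
                          'perCode': code, 'perName': '', 'queryType': '2'},
                callback=self.parse_query,
            )

    def parse_query(self, response):
        # Pull the 32-character hex ids out of the result page and coin the detail urls.
        for person_id in set(re.findall(r'[0-9A-F]{32}', response.text)):
            yield scrapy.Request(
                'http://cmispub.cicpa.org.cn/cicpa2_web/07/' + person_id + '.shtml',
                callback=self.parse_person,
            )

    def parse_person(self, response):
        yield {'url': response.url}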
But this way sometimes doesn't work out. I have sent GET requests to about 100,000 urls I coined and got 5 404 responses. To figure out what's going on and get the information I want, I first pasted these failed urls into a browser, and not to my surprise I still got 404. So I tried the other way on these 404 urls.
The other way is to manually access these pages in a browser like a real person. Since the pages I wanted are information about different individuals, I can enter their personal codes in the lower-left blank on this page http://cmispub.cicpa.org.cn/cicpa2_web/public/query0/2/00.shtml (it only works properly under IE) and click the orange search button at the lower right (which I think is exactly like scrapy sending POST requests). A table then appears on screen, and by clicking the right-most blue words (which are the person's name), I can finally access these pages.
What confuses me is that after I tried the 2nd way on those failed urls and got what I wanted, those previously 404 urls return 200 when I retry them with the 1st way (to avoid the influence of cookies, I retried them with both scrapy shell and the browser's InPrivate mode). I then compared the GET request headers of the 200 and 404 responses, and they look the same. I don't understand what's happening here. Could you please help me?
Here are the remaining failed urls that I haven't tried the 2nd way on, so they still return 404 (if you get 200, maybe someone else has tried the url the 2nd way):
http://cmispub.cicpa.org.cn/cicpa2_web/07/7694866B620EB530144034FC5FE04783.shtml
and the personal code of this person is 110001670258
http://cmispub.cicpa.org.cn/cicpa2_web/07/C003D8B431A5D6D353D8E7E231843868.shtml
and the personal code of this person is 110101301633
http://cmispub.cicpa.org.cn/cicpa2_web/07/B8960E3C85AFCF79BF0823A9D8BCABCC.shtml
and the personal code of this person is 110101480523
http://cmispub.cicpa.org.cn/cicpa2_web/07/8B51A9A73684ADF200A38A5D492A1FEA.shtml
and the personal code of this person is 110101500315
I am trying to use Scrapy to crawl an AJAX website http://play.google.com/store/apps/category/GAME/collection/topselling_new_free
I want to get all the links directing to each game.
I inspected the elements of the page, and it looks like this:
[screenshot: how the page looks]
so I want to extract all links with the pattern /store/apps/details?id=
but when I ran the commands in the shell, it returned nothing:
[screenshot: shell command]
I've also tried //a/@href. That didn't work either, but I don't know what is going wrong.
Now I can crawl the first 120 links with the start url modified and 'formdata' added, as someone told me, but no more links after that.
Can someone help me with this?
It's actually an AJAX POST request which populates the data on that page. In the scrapy shell you won't get this; instead of inspecting the element, check the Network tab, where you will find the request.
Make a POST request to the https://play.google.com/store/apps/category/GAME/collection/topselling_new_free?authuser=0 url with
formdata={'start':'0','num':'60','numChildren':'0','ipf':'1','xhr':'1'}
Increment start by 60 on each request to get the paginated result.
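A minimal sketch of that loop (the CSS selector and the stop condition are assumptions, not tested against the live page):

import scrapy

class TopSellingSpider(scrapy.Spider):
    # Sketch only: the CSS selector and the stop condition are assumptions.
    name = 'topselling_new_free'
    collection_url = 'https://play.google.com/store/apps/category/GAME/collection/topselling_new_free?authuser=0'

    def start_requests(self):
        yield self.page_request(start=0)

    def page_request(self, start):
        return scrapy.FormRequest(
            self.collection_url,
            formdata={'start': str(start), 'num': '60', 'numChildren': '0',
                      'ipf': '1', 'xhr': '1'},
            callback=self.parse,
            meta={'start': start},
        )

    def parse(self, response):
        links = response.css('a[href*="/store/apps/details?id="]::attr(href)').extract()
        for href in links:
            yield {'link': response.urljoin(href)}
        # Keep paging in steps of 60 until a page comes back with no links.
        if links:
            yield self.page_request(response.meta['start'] + 60)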
I'm in the middle of a scraping project using Scrapy.
I realized that Scrapy strips the URL from the hash character to the end.
Here's the output from the shell:
[s] request <GET http://www.domain.com/b?ie=UTF8&node=3006339011&ref_=pe_112320_20310580%5C#/ref=sr_nr_p_8_0?rh=n%3A165796011%2Cn%3A%212334086011%2Cn%3A%212334148011%2Cn%3A3006339011%2Cp_8%3A2229010011&bbn=3006339011&ie=UTF8&qid=1309631658&rnid=598357011>
[s] response <200 http://www.domain.com/b?ie=UTF8&node=3006339011&ref_=pe_112320_20310580%5C>
This really affects my scraping because, after a couple of hours trying to find out why some item was not being selected, I realized that the HTML provided by the long URL differs from the one provided by the short one. Besides, after some observation, the content changes in some critical parts.
Is there a way to modify this behavior so Scrapy keeps the whole URL?
Thanks for your feedback and suggestions.
This isn't something Scrapy itself can change: the portion following the hash in the url is the fragment identifier, which is used by the client (Scrapy here, usually a browser) and is never sent to the server.
What probably happens when you fetch the page in a browser is that the page includes some JavaScript that looks at the fragment identifier and loads some additional data via AJAX and updates the page. You'll need to look at what the browser does and see if you can emulate it--developer tools like Firebug or the Chrome or Safari inspector make this easy.
For example, if you navigate to http://twitter.com/also, you are redirected to http://twitter.com/#!/also. The actual URL loaded by the browser here is just http://twitter.com/, but that page then loads data (http://twitter.com/users/show_for_profile.json?screen_name=also) which is used to generate the page, and is, in this case, just JSON data you could parse yourself. You can see this happen using the Network Inspector in Chrome.
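So instead of requesting the fragment URL, request the endpoint the Network inspector shows. A minimal sketch using the Twitter example above (the endpoint is the one quoted and may no longer exist; for your own site, substitute whatever URL the inspector reveals):

import json
import scrapy

class ProfileSpider(scrapy.Spider):
    # Sketch only: the endpoint is the one quoted above and may no longer exist;
    # for another site, use whatever URL the Network inspector shows.
    name = 'fragment_ajax'
    start_urls = ['http://twitter.com/users/show_for_profile.json?screen_name=also']

    def parse(self, response):
        # The endpoint returns JSON, so parse it directly instead of using XPath.
        yield json.loads(response.text)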
Looks like it's not possible. The problem is not the response, it's in the request, which chops the url.
It is retrievable from JavaScript as window.location.hash. From there you could send it to the server with Ajax, for example, or encode it and put it into URLs which can then be passed through to the server-side.
Can I read the hash portion of the URL on my server-side application (PHP, Ruby, Python, etc.)?
Why do you need this part, which is stripped, if the server doesn't receive it from the browser?
If you are working with Amazon, I haven't seen any problems with such urls.
Actually, when entering that URL in a web browser, it will also only send the part before the hash to the web server. If the content is different, it's probably because there is some JavaScript on the page that, based on the content of the hash part, changes the content of the page after it has been loaded (most likely an XmlHttpRequest is made that loads additional content).