Scrapy Redirect in Python

I am fairly new to the Scrapy framework. I want to scrape a web page that takes me to the results page through a redirect.
The search form is on my start URL.
Parse method: I take the response and use
FormRequest.from_response(response, formdata=values, callback=self.handle_redirect)
to generate a request that POSTs all the needed values.
This request goes to the 302 (Object Moved) page. Once there, I don't want to scrape any more data; I want to follow the redirect to the page that actually contains the search results.
How should I approach this?

Related

Unable to get all the links within a page

I am trying to scrape this page:
https://www.jny.com/collections/bottoms
It has a total of 55 products, but only 24 are listed once the page loads. However, the div contains the list of all 55 products. I am trying to scrape that using Scrapy like this:
def parse(self, response):
    print("in here")
    self.product_url = response.xpath('//div[@class = "collection-grid js-filter-grid"]//a/@href').getall()
    print(len(self.product_url))
    print(self.product_url)
It only gives me a list of length 25. How do I get the rest?
I would suggest scraping it through the API directly; the other option would be rendering the JavaScript with something like Splash or Selenium, which is really not ideal.
If you open the Network panel in the developer tools in Chrome/Firefox, filter down to only the XHR requests and reload the page, you should be able to see all of the requests being sent out. Some of those requests can help us figure out how the data is being loaded into the HTML.
Clicking on those requests gives us more detail on how they are made and how they are structured. At the end of the day, for your use case, you would probably want to send a request to https://www.jny.com/collections/bottoms/products.json?limit=250&page=1 and parse the body_html attribute for each product in the response (perhaps using scrapy.selector.Selector) and use that however you want. Good luck!
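As a sketch of that JSON approach: the payload shape below follows Shopify's public /products.json endpoint, and the handle-to-URL mapping is an assumption for illustration, not something stated in the answer.

```python
import json

def product_links(shop_json: dict) -> list:
    # Shopify's /products.json returns {"products": [...]}; each product
    # carries a "handle" that maps onto its /products/<handle> URL.
    return [
        "/collections/bottoms/products/" + p["handle"]
        for p in shop_json.get("products", [])
    ]

# Payload shaped like the endpoint's response (truncated for the demo):
sample = json.loads('{"products": [{"handle": "slim-pant"}, {"handle": "wide-leg"}]}')
print(product_links(sample))
```

From a spider you would yield a scrapy.Request for the .json URL and call json.loads(response.text) in the callback, no JavaScript rendering needed.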

Scraping next page XHR request

I want to scrape the second page of these user reviews.
However, the next button executes an XHR request, and while I can see it in the Chrome developer tools, I cannot replicate it.
It's not such an easy task. First of all you should install this extension.
It lets you test your own requests based on captured data, i.e. catch a request and then replay it with modified data.
As I see it, they send a token in this XHR request, so you need to get it from the HTML page body (it is stored in the source code, in the JS variable "taSecureToken").
Next you need to do four things:
1. Catch the POST request with the plugin
2. Replace the token with the one saved earlier
3. Set the limit and offset variables in the POST request data
4. Send a request with the resulting body
Note: for this request the server returns JSON data (not the HTML of the next page) containing info about the objects loaded on the next page.
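The steps above can be sketched as a request builder. Only the taSecureToken variable name comes from the answer; the endpoint field names ("token", "offset", "limit") and the token's exact embedding in the page are assumptions for illustration.

```python
import re

def build_review_request(page_html: str, offset: int, limit: int = 10) -> dict:
    # Step 2: pull the token the server embedded in the page source.
    match = re.search(r'taSecureToken["\']?\s*[:=]\s*["\']([^"\']+)', page_html)
    if not match:
        raise ValueError("taSecureToken not found in page source")
    token = match.group(1)
    # Steps 3-4: assemble the POST body the XHR sends for the next page.
    return {
        "token": token,   # assumed field name
        "offset": offset, # index of the first review to load
        "limit": limit,   # reviews per page
    }

html = '<script>var taSecureToken = "abc123";</script>'
print(build_review_request(html, offset=10))
```

The resulting dict would then be sent as the body of the captured POST request (e.g. via FormRequest in Scrapy), and the JSON response parsed instead of HTML.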

Is there anyway to access webpage content which has an 302 redirect?

Suppose we have a web page at localhost/helloworld.php, a normal PHP page with all the usual HTML in it, but the page has been set to redirect to another page with a response code of 302.
So my question is: is there any way, using Python, to access the HTML content of localhost/helloworld.php while still following the protocol, instead of just being redirected to the other page?
Note: the page outputs everything normally, but a header is set using PHP's header() function to redirect to another page temporarily.
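A self-contained sketch using only the standard library; the tiny HTTP server here stands in for the localhost/helloworld.php page. http.client performs a single request/response exchange and never follows redirects, so the body sent alongside the 302 stays readable (with requests you would get the same effect via allow_redirects=False).

```python
import http.client
import http.server
import threading

class RedirectHandler(http.server.BaseHTTPRequestHandler):
    """Mimics a PHP page that sets a Location header but still emits HTML."""
    def do_GET(self):
        body = b"<html>content served alongside the 302</html>"
        self.send_response(302)
        self.send_header("Location", "/other")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)
    def log_message(self, *args):
        pass  # silence request logging for the demo

def fetch_without_following(host, port, path):
    # One raw HTTP exchange; redirects are never followed at this level.
    conn = http.client.HTTPConnection(host, port)
    conn.request("GET", path)
    resp = conn.getresponse()
    status, body = resp.status, resp.read()
    conn.close()
    return status, body

server = http.server.HTTPServer(("127.0.0.1", 0), RedirectHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
status, body = fetch_without_following("127.0.0.1", server.server_address[1], "/helloworld")
print(status, body)
server.shutdown()
```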

How to simulate XHR request in Scrapy for dynamically loading web pages?

I am trying to crawl the olx.in site http://www.olx.in/newdelhi/bmw/ and have set this URL as start_url.
Now, to go to the next page: it is not normal HTML but dynamic, and in the network tab I saw that the next button creates an XHR request with the POST method. Now I have to simulate it in a request method (I guess...), but I can't figure out what its parameters will be.
I am new to Python and web scraping, so sorry if this is too general, but any help would be appreciated.
You should take a look at FormRequest, which enables you to send data via HTTP POST. As you can see, the next button creates a request to http://www.olx.in/ajax/newdelhi/search/list/ with some form data. Just populate the formdata parameter with the needed values from the current Response object. As you are building a pagination, you should check this page on how to do it properly.

Scrapy, hash tag on URLs

I'm in the middle of a scraping project using Scrapy.
I realized that Scrapy strips the URL from the hash mark to the end.
Here's the output from the shell:
[s] request <GET http://www.domain.com/b?ie=UTF8&node=3006339011&ref_=pe_112320_20310580%5C#/ref=sr_nr_p_8_0?rh=n%3A165796011%2Cn%3A%212334086011%2Cn%3A%212334148011%2Cn%3A3006339011%2Cp_8%3A2229010011&bbn=3006339011&ie=UTF8&qid=1309631658&rnid=598357011>
[s] response <200 http://www.domain.com/b?ie=UTF8&node=3006339011&ref_=pe_112320_20310580%5C>
This really affects my scraping: after a couple of hours trying to find out why some item was not being selected, I realized that the HTML served for the long URL differs from the one served for the short one. Moreover, after some observation, the content changes in some critical parts.
Is there a way to modify this behavior so Scrapy keeps the whole URL?
Thanks for your feedback and suggestions.
This isn't something Scrapy itself can change: the portion following the hash in the URL is the fragment identifier, which is used by the client (Scrapy here, usually a browser) instead of the server.
What probably happens when you fetch the page in a browser is that the page includes some JavaScript that looks at the fragment identifier and loads some additional data via AJAX and updates the page. You'll need to look at what the browser does and see if you can emulate it--developer tools like Firebug or the Chrome or Safari inspector make this easy.
For example, if you navigate to http://twitter.com/also, you are redirected to http://twitter.com/#!/also. The actual URL loaded by the browser here is just http://twitter.com/, but that page then loads data (http://twitter.com/users/show_for_profile.json?screen_name=also) which is used to generate the page, and is, in this case, just JSON data you could parse yourself. You can see this happen using the Network Inspector in Chrome.
Looks like it's not possible. The problem is not in the response; it's in the request, which chops the URL.
It is retrievable from JavaScript as window.location.hash. From there you could send it to the server with Ajax, for example, or encode it and put it into URLs which can then be passed through to the server side.
Can I read the hash portion of the URL on my server-side application (PHP, Ruby, Python, etc.)?
Why do you need this part which is stripped if the server doesn't receive it from browser?
If you are working with Amazon, I haven't seen any problems with such URLs.
Actually, when you enter that URL in a web browser, it will also only send the part before the hash mark to the web server. If the content is different, it's probably because there is some JavaScript on the page that, based on the content of the hash part, changes the content of the page after it has been loaded (most likely an XmlHttpRequest is made that loads additional content).
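The split can be seen with the standard library: urllib.parse keeps the fragment separate from everything that goes over the wire, and the part after # is exactly what client-side JavaScript reads via window.location.hash (the URL below is the one from the shell output above, shortened):

```python
from urllib.parse import urlsplit

url = ("http://www.domain.com/b?ie=UTF8&node=3006339011"
       "#/ref=sr_nr_p_8_0?rh=n%3A165796011&bbn=3006339011")

parts = urlsplit(url)
# Only scheme + host + path + query are sent to the server;
# the fragment never leaves the client.
server_side = parts.scheme + "://" + parts.netloc + parts.path + "?" + parts.query
print(server_side)
print(parts.fragment)  # what JavaScript sees as window.location.hash
```

So to scrape the "long URL" content, you have to find the AJAX request that the page's JavaScript issues for that fragment (as in the Twitter example above) and fetch that URL directly.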
