Scrapy, hash tag on URLs

Scrapy, hash tag on URLs - python

I'm on the middle of a scrapping project using Scrapy.
I realized that Scrapy strips the URL from a hash tag to the end.
Here's the output from the shell:
[s] request <GET http://www.domain.com/b?ie=UTF8&node=3006339011&ref_=pe_112320_20310580%5C#/ref=sr_nr_p_8_0?rh=n%3A165796011%2Cn%3A%212334086011%2Cn%3A%212334148011%2Cn%3A3006339011%2Cp_8%3A2229010011&bbn=3006339011&ie=UTF8&qid=1309631658&rnid=598357011>
[s] response <200 http://www.domain.com/b?ie=UTF8&node=3006339011&ref_=pe_112320_20310580%5C>
This really affects my scrapping because after a couple of hours trying to find out why some item was not being selected, I realized that the HTML provided by the long URL differs from the one provided by the short one. Besides, after some observation, the content changes in some critical parts.
Is there a way to modify this behavior so Scrapy keeps the whole URL?
Thanks for your feedback and suggestions.

This isn't something scrapy itself can change--the portion following the hash in the url is the fragment identifier which is used by the client (scrapy here, usually a browser) instead of the server.
What probably happens when you fetch the page in a browser is that the page includes some JavaScript that looks at the fragment identifier and loads some additional data via AJAX and updates the page. You'll need to look at what the browser does and see if you can emulate it--developer tools like Firebug or the Chrome or Safari inspector make this easy.
For example, if you navigate to http://twitter.com/also, you are redirected to http://twitter.com/#!/also. The actual URL loaded by the browser here is just http://twitter.com/, but that page then loads data (http://twitter.com/users/show_for_profile.json?screen_name=also) which is used to generate the page, and is, in this case, just JSON data you could parse yourself. You can see this happen using the Network Inspector in Chrome.

Looks like it's not possible. The problem is not the response, it's in the request, which chops the url.
It is retrievable from Javascript - as
window.location.hash. From there you
could send it to the server with Ajax
for example, or encode it and put it
into URLs which can then be passed
through to the server-side.
Can I read the hash portion of the URL on my server-side application (PHP, Ruby, Python, etc.)?
Why do you need this part which is stripped if the server doesn't receive it from browser?
If you are working with Amazon - i haven't seen any problems with such urls.

Actually, when entering that URL in a web browser, it will also only send the part before the hash tag to the web server. If the content is different, it's probably because there are some javascript on the page that - based on the content of the hash tag part - changes the content of the page after it has been loaded (most likely an XmlHttpRequest is made that loads additional content).

Related

Scrapy: webpage next button uses WebForm_DoPostBackWithOptions()

I am new to scrapy and trying to scrape https://www.sakan.co/result?srv=1&prov=&cty=&maintyp=1&typ=5&minpr=&maxpr=&bdrm=&blk=
This webpage is using a href with the following:
href="javascript:WebForm_DoPostBackWithOptions(new WebForm_PostBackOptions("ctl00$Content$rptPaging$ctl02$lbPaging", "", true, "", "", false, true))"
Data is getting loaded dynamically. I am trying to find the source (API call if any) for data that is getting loaded but could not find any. How can I navigate to next page and scrape data using Scrapy.

What this js effectively do is trigger a POST request, you can check the details of the request in the browsers developer tools, network tab. (F12 in Firefox - Open the tab and click the link)
Your Scrapy needs to reproduce that same POST request. All the information in the body is available in the page, just keep in mind that those fields that start with __, like __VIEWSTATE, are instance dependent, so you need to retrieve their values from the page your Scrapy loads, copy and paste will usually fail.
The easier way to do this is using the FormRequest.from_response() method. However, its important to check if the method is producing a request body that is the same your browser, quite often the method skips a required field or adds an extra one. (It relies on the page's <form>)
You can read more on scraping this kind of page in this link from Scrapy FAQ.
Finally one last tip: If your request body is the just like the browser, but the request still fails, you might need to reproduce the request headers as well.

Scrapy for dynamic content

Can we use Scrapy for getting content from a web page which is loaded by Javascript?
I'm trying to scrape usage examples from this page,
but since they are loaded using Javascript as a JSON object I'm not able to get them with Scrapy.
Could you suggest what is the best way to deal with such issues?

Open your browser's developer tools and look at the Network tab. If you hit the "next" button on that page enough, it'll send out a new request:
After removing the JSONP paramter, the URL is pretty straightforward:
https://corpus.vocabulary.com/api/1.0/examples.json?query=unalienable&maxResults=24&startOffset=24&filter=0
By making the minimal number of requests, your spider will be fast.
If you want to just emulate a full browser and execute the JavaScript, you can use something like Selenium or Scrapinghub's Splash (and its corresponding Scrapy plugin).

How to scrape value from page that loads dynamicaly?

The homepage of the website I'm trying to scrape displays four tabs, one of which reads "[Number] Available Jobs". I'm interested in scraping the [Number] value. When I inspect the page in Chrome, I can see the value enclosed within a <span> tag.
However, there is nothing enclosed in that <span> tag when I view the page source directly. I was planning on using the Python requests module to make an HTTP GET request and then use regex to capture the value from the returned content. This is obviously not possible if the content doesn't contain the number I need.
My questions are:
What is happening here? How can a value be dynamically loaded into a
page, displayed, and then not appear within the HTML source?
If the value doesn't appear in the page source, what can I do to
reach it?

If the content doesn't appear in the page source then it is probably generated using javascript. For example the site might have a REST API that lists jobs, and the Javascript code could request the jobs from the API and use it to create the node in the DOM and attach it to the available jobs. That's just one possibility.
One way to scrap this information is to figure out how that javascript works and make your python scraper do the same thing (for example, if there is a simple REST API it is using, you just need to make a request to that same URL). Often that is not so easy, so another alternative is to do your scraping using a javascript capable browser like selenium.
One final thing I want to mention is that regular expressions are a fragile way to parse HTML, you should generally prefer to use a library like BeautifulSoup.

1.A value can be loaded dynamically with ajax, ajax loads asynchronously that means that the rest of the site does not wait for ajax to be rendered, that's why when you get the DOM the elements loaded with ajax does not appear in it.
2.For scraping dynamic content you should use selenium, here a tutorial

for data that load dynamically you should look for an xhr request in the networks and if you can make that data productive for you than voila!!
you can you phantom js, it's a headless browser and it captures the html of that page with the dynamically loaded content.

Python Scrapy : response object different from source code in browser

I'm working on a project using Scrapy.
All wanted fields but one get scraped perfectly. The content of the missing field simply doesn't show up in the Scrapy response (as checked in the scrapy shell), while it does show up when i use my browser to visit the page. In the scrapy response, the expected tags are there, but not the text between the tags.
There's no JavaScript involved, but it is a variable that is provided by the server (it's the current number of visits to that particular page). No iframe involved either.
Already set the user agent (in the settings-file) to match my browser.
Already set the download delay (in the settings-file) to 5.
EDIT (addition):
The page : http://www.fincaraiz.com.co/apartamento-en-venta/bogota/salitre-det-1337688.aspx
Xpath to the wanted element : //*[#id="numAdvertVisits"]
What could be the cause of this mystery ?

It's an ajax/javascript loaded value.
What steps did you take to determine there is no JS involved? I loaded the page w/o javascript, and while that area of the page had the stub content ("Visitas"), the actual data was written there with an ajax request.
You can still load that data using scrapy, it'll just take an additional request to the URL endpoint normally accessed via on-page ajax. The server returns the number of visits in XML, via the script at http://www.fincaraiz.com.co/WebServices/Statistics.asmx/GetAdvertVisits?idAdvert=1337688&idASource=40&idType=1001 (try loading that script and you'll see the # of visits for the page you provided in the original email).
There is another ajax request that returns "True" for that page, but I'm not sure what the data's actual meaning is. Still, it may be useful:
http://www.fincaraiz.com.co/WebServices/Statistics.asmx/DetailAdvert?idAdvert=1337688&idType=1001&idASource=40&strCookie=13/11/2014:19-05419&idSession=10hx5wsfbqybyxsywezx0n1r&idOrigin=44

urllib2 not retrieving url with hashes on it

I'm trying to get some data from a webpage, but I found a problem. Whenever I want to go to the next page (i.e. page 2) to keep retrieving the data on it, I keep receiving the data from page 1. Apparently something goes wrong trying to switch to the next page.
The thing is, I haven't had problems with urls like this:
'http://www.webpage.com/index.php?page=' + str(pageno)
I can just start a while statement and I'll just jump to page 2 by adding 1 to "pageno"
My problem comes in when I try to open an url with this format:
'http://www.webpage.com/search/?show_all=1#sort_order=ASC&page=' + str(pageno)
As
urllib2.urlopen('http://www.webpage.com/search/?show_all=1#sort_order=ASC&page=4').read()
will retrieve the source code from http://www.webpage.com/search/?show_all=1
There is no other way to retrieve other pages without using the hash, as far as I'm concerned.
I guess it's just urllib2 ignoring the hash, as it is normally used to specify a starting point for a browser.

The fragment of the url after the hash (#) symbol is for client-side handling and isn't actually sent to the webserver. My guess is there is some javascript on the page that requests the correct data from the server using AJAX, and you need to figure out what URL is used for that.
If you use chrome you can watch the Network tab of the developer tools and see what URLs are requested when you click the link to go to page two in your browser.

that's because hash are not part of the url that is sent to the server, it's a fragment identifier that is used to identify elements inside the page. Some websites misused the hash fragment for JavaScript hook for identifying pages though. You'll either need to be able to execute the JavaScript on the page or you'll need to reverse engineer the JavaScript and emulate the true search request that is being made, presumably through ajax. Firebug's Net tab will be really useful for this.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Scrapy, hash tag on URLs - python

Related

Scrapy: webpage next button uses WebForm_DoPostBackWithOptions()

Scrapy for dynamic content

How to scrape value from page that loads dynamicaly?

Python Scrapy : response object different from source code in browser

urllib2 not retrieving url with hashes on it

Categories

Resources