Scrapy - Should I enable cookies while crawling - python

I'm scraping data from some Amazon URLs, but of course I sometimes get a captcha. I was wondering whether the enable/disable cookies option has anything to do with this. I rotate around 15 proxies while crawling. I guess the question is: should I enable or disable cookies in settings.py to get clean pages, or is it irrelevant?
My thought was that if I enable cookies, the website can track the history of what each IP does, eventually notice the pattern, and block it, so I should disable them. Or is that not how cookies work at all?

How are you accessing these URLs? Do you use the urllib library? If so, you might not have noticed that urllib has a default user-agent. The user-agent is part of the HTTP request (stored in the header) and identifies the type of software used to access the page. It allows websites to display their content correctly in different browsers, but it can also be used to determine whether you are running an automated program (sites don't like bots).
Now, the default urllib user-agent tells the website that you are using Python to access the page (usually a big no-no). You can spoof your user-agent quite easily to stop any nasty captchas from appearing:
import urllib2

headers = {'User-Agent': 'Mozilla/5.0'}
req = urllib2.Request('http://www.example.com', None, headers)  # note: urllib2 needs the URL scheme
html = urllib2.urlopen(req).read()
Because you're using Scrapy to crawl webpages, you'll need to make the change in your settings.py file so that the user-agent is set there.
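A minimal sketch of the relevant settings.py lines (the user-agent string is a placeholder; substitute your own browser's full string, and the cookies setting addresses the original question about linking requests together):

```python
# settings.py -- hypothetical excerpt
USER_AGENT = 'Mozilla/5.0'   # placeholder: paste your real browser's user-agent string here
COOKIES_ENABLED = False      # don't persist cookies, so requests are harder to link into a session
```

With COOKIES_ENABLED = False, Scrapy's cookies middleware neither stores nor sends cookies, which removes one signal a site could use to tie your rotating proxies to a single crawl.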
EDIT
Another reason captchas might be appearing all over the place is that you are moving too fast through the website. If you add a sleep call between URL requests, that might solve your captcha issue!
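In Scrapy specifically, a literal sleep call isn't needed; the built-in download-delay settings do the same job. A sketch, assuming the defaults otherwise:

```python
# settings.py -- hypothetical excerpt
DOWNLOAD_DELAY = 3               # wait about 3 seconds between requests to the same site
RANDOMIZE_DOWNLOAD_DELAY = True  # jitter each delay (0.5x to 1.5x) so the timing looks less robotic
```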
Other reasons for captchas appearing:
You are clicking on honeypot links (links that are within the HTML code but not displayed on the webpage), designed to catch crawlers.
You may need to change your crawling pattern, as the current one may be flagged as "non-human".
Check the website's robots.txt file, which shows what is and isn't allowed to be crawled.
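That last check can even be scripted with the standard library's robotparser. A small sketch against a made-up robots.txt:

```python
from urllib import robotparser

# parse a hypothetical robots.txt in memory rather than fetching a real one
rp = robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])
print(rp.can_fetch("*", "http://www.example.com/private/page"))  # False
print(rp.can_fetch("*", "http://www.example.com/public/page"))   # True
```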

Related

Parsing bot protected site

I am trying to parse the website "https://ih.advfn.com/stock-market/NYSE/gen-electric-GE/stock-price" and extract its most recent messages from its board. It is bot-protected with Cloudflare. I am using Python and the relevant libraries, and this is what I have so far:
from bs4 import BeautifulSoup as soup #parses/cuts the html
import cfscrape
import requests
url = 'https://ih.advfn.com/stock-market/NYSE/gen-electric-GE/stock-price'
r=requests.get(url)
html = soup(r.text, "html.parser")
containers = html.find("div",{"id":"bbPosts"})
print(containers.text.strip())
I am not able to use the HTML parser because the site then detects and blocks my script.
My questions are:
How can I parse the web pages to pull the table data?
Might I mention that this is for a security class I am taking. I am not using this for malicious reasons.
There are multiple ways of bypassing a site's protection. You have to see exactly how they are blocking you.
One common way of blocking requests is to look at the User-Agent header. The client (in your case, the requests library) informs the server about its identity through that header.
Generally speaking, a browser will say "I am a browser" and a library will say "I am a library", and the server can then allow browsers but not libraries to access its content.
However, for this particular case, you can simply lie to the server by sending your own User-Agent header.
You can see an example here. Try using your browser's user agent.
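As a sketch of the idea with the standard library (Python 3's urllib.request here; the requests library takes the same kind of headers dict, and the user-agent string is just a placeholder for your browser's real one):

```python
from urllib.request import Request

# attach a browser-like User-Agent instead of the library default
# ("Python-urllib/3.x"), which many sites refuse
headers = {"User-Agent": "Mozilla/5.0"}
req = Request("http://www.example.com", headers=headers)

# urllib normalizes header names, so look it up as "User-agent"
print(req.get_header("User-agent"))  # Mozilla/5.0
# urlopen(req) would now send the spoofed header with the request
```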
Other blocking techniques include IP ranges. One way to bypass this is via a VPN. This is one of the easiest VPNs to set up: just spin up a machine on Amazon and get the container running.
Something else that could happen: you might be accessing a single-page application that is not rendered server-side. In that case, what you receive from that GET request is a very small HTML file that essentially just references a JavaScript file. If so, what you need is an actual browser that you control programmatically. I would suggest you look at headless Google Chrome, though there are others. You can also use Selenium.
Web crawling is a beautiful but very deep subject; I think these pointers should set you in the right direction.
Also, as a quick mention, my advice is to avoid from bs4 import BeautifulSoup as soup. I would recommend html2text instead.

Scrapy to manage session cookies with full webkit javascript execution

I'm trying to use scrapy to scrape a site that uses javascript extensively to manipulate the document, cookies, etc (but nothing simple like JSON responses). For some reason I can't determine from the network traffic, the page I need comes up as an error when I scrape but not when viewed in the browser. So what I want to do is use webkit to render the page as it appears in the browser, and then scrape this. The scrapyjs project was made for this purpose.
To access the page I need, I had to have logged in previously, and saved some session cookies. My problem is that I cannot successfully provide the session cookie to webkit when it renders the page. There are two ways I could think to do this:
use scrapy page requests exclusively until I get to the page that needs webkit, and then pass along the requisite cookies.
use webkit within scrapy (via a modified version of scrapyjs), for the entire session from login until I get to the page I need, and allow it to preserve cookies as needed.
Unfortunately neither approach seems to be working.
Along the lines of approach 1, I tried the following:
In settings.py --
DOWNLOADER_MIDDLEWARES = {
    'scrapyjs.middleware.WebkitDownloader': 701,  # to run after CookiesMiddleware
}
I modified scrapyjs to send cookies: scrapyjs/middleware.py--
import gtk
import webkit
import jswebkit
from urlparse import urlparse
#from gi.repository import Soup  # conflicting static and dynamic includes!?
import ctypes
libsoup = ctypes.CDLL('/usr/lib/i386-linux-gnu/libsoup-2.4.so.1')
libwebkit = ctypes.CDLL('/usr/lib/libwebkitgtk-1.0.so.0')
def process_request(self, request, spider):
    if 'renderjs' in request.meta:
        cookies = request.headers.getlist('Cookie')
        if len(cookies) > 0:
            cookies = cookies[0].split('; ')
            cookiejar = libsoup.soup_cookie_jar_new()
            libsoup.soup_cookie_jar_set_accept_policy(cookiejar, 0)  # 0 == ALWAYS ACCEPT
            up = urlparse(request.url)
            for c in cookies:
                sp = c.find('=')  # find FIRST '=' as the split position
                cookiename = c[0:sp]
                cookieval = c[sp+1:]
                libsoup.soup_cookie_jar_add_cookie(cookiejar, libsoup.soup_cookie_new(cookiename, cookieval, up.hostname, 'None', -1))
            session = libwebkit.webkit_get_default_session()
            libsoup.soup_session_add_feature(session, cookiejar)
        webview = self._get_webview()
        webview.connect('load-finished', self.stop_gtk)
        webview.load_uri(request.url)
        ...
The code for setting the cookiejar is adapted from this response. The problem may be with how imports work; perhaps this is not the right webkit that I'm modifying -- I'm not too familiar with webkit and the python documentation is poor. (I can't use the second answer's approach with from gi.repository import Soup because it mixes static and dynamic libraries. I also can't find any get_default_session() in webkit as imported above).
The second approach fails because sessions aren't preserved across requests, and again I don't know enough about webkit to know how to make it persist in this framework.
Any help appreciated!
Actually, the first approach does work, but with one modification: the path for the cookies needs to be '/' (at least in my application), not 'None' as in the code above. That is, the line should be
libsoup.soup_cookie_jar_add_cookie(cookiejar, libsoup.soup_cookie_new(cookiename,cookieval,up.hostname,'/',-1))
Unfortunately this only pushes the question back a bit. Now the cookies are saved properly, but the full page (including the frames) is still not being loaded and rendered with webkit as I had expected, and so the DOM is not complete as I see it in within the browser. If I simply request the frame that I want, then I get the error page instead of the content that is shown in a real browser. I'd love to see how to use webkit to render the whole page, including frames. Or how to achieve the second approach, completing the entire session in webkit.
Without knowing the complete workflow of the application: you need to make sure that setting the cookie jar happens before webkit performs any other network activity (see http://webkitgtk.org/reference/webkitgtk/unstable/webkitgtk-Global-functions.html#webkit-get-default-session). In my experience, this practically means even before instantiating the web view.
Another thing to check is whether your frames are from the same domain; cookie policies will not allow cookies across different domains.
Lastly, you can probably inject the cookies. See http://webkitgtk.org/reference/webkitgtk/unstable/webkitgtk-webkitwebview.html#WebKitWebView-navigation-policy-decision-requested or resource-request-starting, and then set the cookies on the actual soup message.

urllib2 not retrieving url with hashes on it

I'm trying to get some data from a webpage, but I found a problem. Whenever I want to go to the next page (i.e. page 2) to keep retrieving the data on it, I keep receiving the data from page 1. Apparently something goes wrong trying to switch to the next page.
The thing is, I haven't had problems with urls like this:
'http://www.webpage.com/index.php?page=' + str(pageno)
I can just use a while loop and jump to page 2 by adding 1 to "pageno".
My problem comes in when I try to open an url with this format:
'http://www.webpage.com/search/?show_all=1#sort_order=ASC&page=' + str(pageno)
As
urllib2.urlopen('http://www.webpage.com/search/?show_all=1#sort_order=ASC&page=4').read()
will retrieve the source code from http://www.webpage.com/search/?show_all=1
As far as I can tell, there is no other way to retrieve further pages without using the hash.
I guess it's just urllib2 ignoring the hash, as it is normally used to specify a starting point for a browser.
The fragment of the url after the hash (#) symbol is for client-side handling and isn't actually sent to the webserver. My guess is there is some javascript on the page that requests the correct data from the server using AJAX, and you need to figure out what URL is used for that.
If you use chrome you can watch the Network tab of the developer tools and see what URLs are requested when you click the link to go to page two in your browser.
That's because the hash is not part of the URL that is sent to the server; it's a fragment identifier used to identify elements inside the page. Some websites do misuse the hash fragment as a JavaScript hook for identifying pages, though. You'll either need to execute the JavaScript on the page, or reverse-engineer the JavaScript and emulate the true search request being made, presumably through AJAX. Firebug's Net tab will be really useful for this.
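You can see the split for yourself with the standard library's urlsplit; everything after the # never leaves the client:

```python
from urllib.parse import urlsplit

url = "http://www.webpage.com/search/?show_all=1#sort_order=ASC&page=4"
parts = urlsplit(url)
print(parts.query)     # show_all=1             (sent to the server)
print(parts.fragment)  # sort_order=ASC&page=4  (kept by the browser)
```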

Web scraping using Python

I am trying to scrape the website http://www.nseindia.com using urllib2 and BeautifulSoup. Unfortunately, I keep getting 403 Forbidden when I try to access the page through Python. I thought it was a user agent issue, but changing that did not help. Then I thought it may have something to do with cookies, but apparently loading the page through links with cookies turned off works fine. What may be blocking requests through urllib?
http://www.nseindia.com/ seems to require an Accept header, for whatever reason. This should work:
import urllib2
r = urllib2.Request('http://www.nseindia.com/')
r.add_header('Accept', '*/*')
r.add_header('User-Agent', 'My scraping program <author#example.com>')
opener = urllib2.build_opener()
content = opener.open(r).read()
Refusing requests without Accept headers is incorrect; RFC 2616 clearly states
If no Accept header field is present, then it is assumed that the
client accepts all media types.

Scrapy, hash tag on URLs

I'm in the middle of a scraping project using Scrapy.
I realized that Scrapy strips the URL from the hash tag to the end.
Here's the output from the shell:
[s] request <GET http://www.domain.com/b?ie=UTF8&node=3006339011&ref_=pe_112320_20310580%5C#/ref=sr_nr_p_8_0?rh=n%3A165796011%2Cn%3A%212334086011%2Cn%3A%212334148011%2Cn%3A3006339011%2Cp_8%3A2229010011&bbn=3006339011&ie=UTF8&qid=1309631658&rnid=598357011>
[s] response <200 http://www.domain.com/b?ie=UTF8&node=3006339011&ref_=pe_112320_20310580%5C>
This really affects my scraping, because after a couple of hours trying to find out why some item was not being selected, I realized that the HTML provided by the long URL differs from the one provided by the short one. Moreover, the content changes in some critical parts.
Is there a way to modify this behavior so Scrapy keeps the whole URL?
Thanks for your feedback and suggestions.
This isn't something scrapy itself can change--the portion following the hash in the url is the fragment identifier which is used by the client (scrapy here, usually a browser) instead of the server.
What probably happens when you fetch the page in a browser is that the page includes some JavaScript that looks at the fragment identifier and loads some additional data via AJAX and updates the page. You'll need to look at what the browser does and see if you can emulate it--developer tools like Firebug or the Chrome or Safari inspector make this easy.
For example, if you navigate to http://twitter.com/also, you are redirected to http://twitter.com/#!/also. The actual URL loaded by the browser here is just http://twitter.com/, but that page then loads data (http://twitter.com/users/show_for_profile.json?screen_name=also) which is used to generate the page, and is, in this case, just JSON data you could parse yourself. You can see this happen using the Network Inspector in Chrome.
Looks like it's not possible. The problem is not in the response; it's in the request, which chops the URL.
It is retrievable from JavaScript as window.location.hash. From there you could send it to the server with Ajax, for example, or encode it and put it into URLs which can then be passed through to the server side.
Can I read the hash portion of the URL on my server-side application (PHP, Ruby, Python, etc.)?
Why do you need the part which is stripped, if the server doesn't receive it from the browser anyway?
If you are working with Amazon, I haven't seen any problems with such URLs.
Actually, when you enter that URL in a web browser, it also only sends the part before the hash tag to the web server. If the content is different, it's probably because there is some JavaScript on the page that, based on the content of the hash fragment, changes the content of the page after it has been loaded (most likely an XmlHttpRequest is made that loads additional content).
