Web scraping using Python

I am trying to scrape the website http://www.nseindia.com using urllib2 and BeautifulSoup. Unfortunately, I keep getting 403 Forbidden when I try to access the page through Python. I thought it was a user-agent issue, but changing that did not help. Then I thought it might have something to do with cookies, but apparently loading the page through Links with cookies turned off works fine. What might be blocking requests through urllib2?

http://www.nseindia.com/ seems to require an Accept header, for whatever reason. This should work:
import urllib2

r = urllib2.Request('http://www.nseindia.com/')
# The site rejects requests without an Accept header, so set one explicitly.
r.add_header('Accept', '*/*')
r.add_header('User-Agent', 'My scraping program <author#example.com>')

opener = urllib2.build_opener()
content = opener.open(r).read()
Refusing requests without an Accept header is incorrect; RFC 2616 clearly states: "If no Accept header field is present, then it is assumed that the client accepts all media types."
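If you prefer the requests library, roughly the same request would look like the sketch below (the headers simply mirror the ones above):
import requests

headers = {
    'Accept': '*/*',
    'User-Agent': 'My scraping program <author#example.com>',
}
response = requests.get('http://www.nseindia.com/', headers=headers)
content = response.text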

Related

Using python requests is it possible to bypass adblock using request headers?

I'm getting "turn off adblock" messages from a website when using requests.get() with the default request parameters.
Would anyone be able to explain whether there is any way to specify adblock rules in the request headers? For instance, adding a website exception as defined on https://adblockplus.org/filter-cheatsheet, something like: requests.get(url, headers={...adblock_rules = "##||example.com^"...})?

Scrapy - Should I enable cookies while crawling

I'm scraping data from some Amazon URLs, but of course I sometimes get a captcha. I was wondering whether the enable/disable cookies option has anything to do with this. I rotate around 15 proxies while crawling. I guess the question is: should I enable or disable cookies in settings.py to get clean pages, or is it irrelevant?
I thought that if I enable cookies, the website would know the history of what the IP does and, after some point, notice the pattern and block it (this is my guess), so I should disable them? Or is that not even how cookies work and what they are for?
How are you accessing these URLs? Do you use the urllib library? If so, you might not have noticed, but urllib has a default user-agent. The user-agent is part of the HTTP request (stored in the header) and identifies the type of software you used to access a page. This allows websites to display their content correctly on different browsers, but it can also be used to determine whether you are using an automated program (they don't like bots).
Now, the default urllib user-agent tells the website you are using Python to access the page (usually a big no-no). You can spoof your user-agent quite easily to stop any nasty captchas from appearing!
import urllib2

# Spoof the user-agent so the site sees a browser rather than urllib2's default.
headers = {'User-Agent': 'Mozilla/5.0'}
req = urllib2.Request('http://www.example.com', None, headers)
html = urllib2.urlopen(req).read()
Because you're using Scrapy to crawl webpages, you may need to make changes to your settings.py file so that you can change the user-agent there.
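As a rough sketch, the relevant settings.py entries might look like this (the values here are only placeholders, not recommendations):
# settings.py (Scrapy) -- example values only
USER_AGENT = 'Mozilla/5.0'     # replaces Scrapy's default user-agent
COOKIES_ENABLED = False        # turn cookies off if you suspect they reveal your crawl pattern
DOWNLOAD_DELAY = 2             # wait this many seconds between requests to the same site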
EDIT
Another reason why captchas might be appearing all over the place is that you are moving too fast through the website. If you add a sleep call in between URL requests, this might solve your captcha issue!
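For instance, a minimal sketch of throttling plain urllib2 requests (the URLs are made up):
import time
import urllib2

urls = ['http://www.example.com/page1', 'http://www.example.com/page2']
for url in urls:
    html = urllib2.urlopen(url).read()
    # ... process html ...
    time.sleep(2)  # pause between requests so the crawl looks less bot-like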
Other reasons for captchas appearing:
You are clicking on honeypot links (links that are present in the HTML code but not displayed on the webpage), designed to catch crawlers.
You may need to change your crawling pattern, as the current one may be flagged as "non-human".
Check the website's robots.txt file, which shows what is and isn't allowed to be crawled (see the sketch below).
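A minimal sketch of that robots.txt check with the standard library (the module is robotparser in Python 2 and urllib.robotparser in Python 3; the URLs are placeholders):
import robotparser  # Python 3: from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url('http://www.example.com/robots.txt')
rp.read()
print(rp.can_fetch('*', 'http://www.example.com/some/page'))  # True if crawling this page is allowed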

Python Requests status_code 401 UNAUTHORIZED

I'm trying to web scrape from WITHIN a secure network. Security is already tight and I have a username and password, but if you opened the site I'm trying to reach with my program, you wouldn't be prompted to log in (because you're inside the network). I'm having trouble with the authentication here...
import requests
url = "http://theinternalsiteimtryingtoaccess.com"
r = requests.get(url, auth=('myusername', 'mypass'))
print(r.status_code)
>>>401
I've tried HTTPBasicAuth, but that didn't work either. Is there any way with requests to get around this?
Just another note: the 'urlopen' command will open this site without any authentication being required... Please help! Thanks!
EDIT: After finding this question (How to scrape URL data from intranet site using python?), I tried the following:
import requests
from requests_ntlm import HttpNtlmAuth
r = requests.get("http://theinternalsiteimtryingtoaccess.aspx",auth=HttpNtlmAuth('NEED DOMAIN HERE\\usr','pass'))
print(r.status_code)
>>>401 #still >:/
RESOLVED: If you're having this problem and you're trying to access an internal site, make sure the code specifies your particular domain. I was trying to log in, but the computer didn't know which domain to log me into. You can find the domain you're on by going to Control Panel >> System; the domain should be listed there. Thank you!
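In other words, the working call looks roughly like the snippet below (MYDOMAIN, myusername and mypass are placeholders for your own values):
import requests
from requests_ntlm import HttpNtlmAuth

# Prefix the username with the Windows domain, e.g. 'MYDOMAIN\\myusername'.
r = requests.get("http://theinternalsiteimtryingtoaccess.aspx",
                 auth=HttpNtlmAuth('MYDOMAIN\\myusername', 'mypass'))
print(r.status_code)  # expect 200 rather than 401 once the domain is correct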
It's extremely unlikely we can give you the exact solution to your problem, but I would guess the intranet uses some sort of corporate proxy. If so, your requests need to be directed through that proxy rather than sent as if you were hitting an external public site.
For more information on this, check out the official docs:
http://docs.python-requests.org/en/master/user/advanced/#proxies
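A minimal sketch of routing requests through a proxy (the proxy address is a placeholder):
import requests

proxies = {
    'http': 'http://proxy.mycompany.example:8080',
    'https': 'http://proxy.mycompany.example:8080',
}
r = requests.get('http://theinternalsiteimtryingtoaccess.com', proxies=proxies)
print(r.status_code)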

BeautifulSoup crawling cookies

I've been tasked with creating a cookie audit tool that crawls the entire website and gathers data on all cookies on each page, categorizing them according to whether they track user data or not. I'm new to Python, but I think this will be a great project for me. Would BeautifulSoup be a suitable tool for the job? We have tons of sites and are currently migrating to Drupal, so it would have to be able to scan both Polopoly CMS and Drupal.
Urllib2 is for submitting HTTP requests; BeautifulSoup is for parsing HTML. You'll definitely need an HTTP request library, and you may need BeautifulSoup as well, depending on what exactly you want to do.
BeautifulSoup is extremely easy to use and parses broken HTML well, so it would be good for grabbing the links to any JavaScript on a page (even in cases where the HTML is malformed). You'll then need something else to parse the JavaScript and figure out whether it interacts with cookies.
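For example, a rough sketch of collecting a page's script tags (and their src URLs) with BeautifulSoup (the URL is a placeholder):
import requests
from bs4 import BeautifulSoup

html = requests.get('http://www.example.com').text
soup = BeautifulSoup(html, 'html.parser')

for script in soup.find_all('script'):
    # Inline scripts have no 'src'; external ones point to a JavaScript file.
    print(script.get('src') or '(inline script)')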
To see what cookie values end up on the client side, just look at the HTTP response headers (the Set-Cookie values) or use cookielib (although I've personally not used this library).
For HTTP requests, I recommend the requests library; looking at the response headers is as simple as:
import requests

response = requests.get(url)  # url being the page you are auditing
header = response.headers
I suspect requests also has a shortcut for just accessing the Set-Cookie values of the header, but you'll need to look into that.
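For what it's worth, requests does expose the cookies set by the server via response.cookies; a minimal sketch (placeholder URL):
import requests

response = requests.get('http://www.example.com')
for cookie in response.cookies:  # a jar of the cookies the server set on this response
    print(cookie.name, cookie.value, cookie.domain)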
I don't think you need BeautifulSoup for this. You could do it with urllib2 for the connections and cookielib for operations on cookies.
You don't need bs4 for this purpose, because you only require info from the cookies (use bs4 only if you ultimately need to extract something from the HTML code).
For the cookies stuff I would use python-requests and its support for HTTP sessions: http://docs.python-requests.org/en/latest/user/advanced/
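A minimal sketch of a requests session accumulating cookies across requests (placeholder URLs):
import requests

session = requests.Session()
session.get('http://www.example.com/')       # cookies from each response are stored on the session
session.get('http://www.example.com/other')  # and sent back automatically on later requests

for cookie in session.cookies:
    print(cookie.name, cookie.domain)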

How to extract text from a web page that requires logging in using python and beautiful soup?

I have to retrieve some text from a website called morningstar.com. To access that data I have to log in. Once I log in and provide the URL of the web page, I get the HTML of a normal (not logged in) user. As a result I am not able to access that information. Any solutions?
BeautifulSoup is for parsing HTML once you've already fetched it. You can fetch the HTML using any standard URL-fetching library; I prefer curl, but since you tagged your post python, the built-in urllib2 also works well.
If you're saying that after logging in the response HTML is the same as for those who are not logged in, I'm going to guess that your login is failing for some reason. If you are using urllib2, are you making sure to store the cookie properly after your first login and then pass that cookie to urllib2 when you send the request for the data?
It would help if you posted the code you are using to make the two requests (the initial login, and the attempt to fetch the data).
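In the meantime, here is a minimal sketch of that cookie handling with urllib2 and cookielib (the login URL and form fields are made up; substitute whatever the site actually uses):
import cookielib
import urllib
import urllib2

# One opener shared by both requests, so the session cookie set at login is reused.
cookie_jar = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookie_jar))

# 1) Log in (hypothetical form fields).
login_data = urllib.urlencode({'username': 'me@example.com', 'password': 'secret'})
opener.open('http://www.morningstar.com/login', login_data)

# 2) Fetch the protected page; the stored cookie is sent automatically.
html = opener.open('http://www.morningstar.com/some/protected/page').read()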
