I'm getting "turn off your adblocker" messages from a website when using requests.get() with default request parameters.
Would anyone be able to explain whether there is any way to specify adblock rules in the request headers? For instance, adding a website exception as defined on https://adblockplus.org/filter-cheatsheet, something like requests.get(url, headers={...adblock_rules = "##||example.com^"...})?
I need to fetch basic profile data (the complete HTML page) of a LinkedIn profile. I tried Python packages such as BeautifulSoup, but I get access denied.
I have generated the API tokens for LinkedIn, but I am not sure how to incorporate those into the code.
Basically, I want to automate the process of scraping by just providing the company name.
Please help. Thanks!
Beautiful Soup is an HTML parser used for web scraping. Typically, people use this library to parse data out of public websites, or websites that don't have APIs. For example, you could use it to scrape the top 10 Google search results.
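As a quick illustration, here's a minimal Beautiful Soup sketch that pulls the link text out of a small HTML snippet (the snippet is made up):

from bs4 import BeautifulSoup

html = '<html><body><a href="/jobs">Jobs</a> <a href="/people">People</a></body></html>'
soup = BeautifulSoup(html, 'html.parser')

# Print the text of every link in the document
for link in soup.find_all('a'):
    print(link.get_text())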
Unlike web scrapers, an API lets you retrieve data from behind non-public websites. Furthermore, it returns the data in an easily readable XML or JSON format, so you don't have to "scrape" an HTML file for the specific data you care about.
To make an API call to LinkedIn, you need to use a Python HTTP request library. See this Stack Overflow post for examples.
Take a look at Step 4 of the LinkedIn API documentation. It shows a sample HTTP GET call.
GET /v1/people/~ HTTP/1.1
Host: api.linkedin.com
Connection: Keep-Alive
Authorization: Bearer AQXdSP_W41_UPs5ioT_t8HESyODB4FqbkJ8LrV_5mff4gPODzOYR
Note that you also need to send an "Authorization" header along with the HTTP GET call. This is where your token goes. You're probably getting "access denied" right now because you didn't set this header in your request.
Here's an example of how you would add that header to a request with the requests library.
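A minimal sketch (the token below is the sample value from the docs above; substitute your own):

import requests

# Sample token from the LinkedIn docs above -- replace it with your own access token
headers = {'Authorization': 'Bearer AQXdSP_W41_UPs5ioT_t8HESyODB4FqbkJ8LrV_5mff4gPODzOYR'}

r = requests.get('https://api.linkedin.com/v1/people/~', headers=headers)
print(r.status_code)
print(r.text)  # the XML or JSON body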
And that should be it. When you make that request, it should return XML or JSON containing the data you want. You can then use an XML or JSON parser to pull out the specific fields you care about.
I'm scraping data from some Amazon URLs, but of course I sometimes get a captcha. I was wondering whether the enable/disable cookies option has anything to do with this. I rotate around 15 proxies while crawling. I guess the question is: should I enable or disable cookies in settings.py to get clean pages, or is it irrelevant?
My guess was that if I enable cookies, the website would know the history of what each IP does, eventually notice the pattern, and block it, so I should disable them. Or is that not how cookies actually work and what they're for?
How are you accessing these URLs? Do you use the urllib library? If so, you might not have noticed, but urllib has a default user-agent. The user-agent is part of the HTTP request (stored in the header) and identifies the type of software you used to access a page. This allows websites to display their content correctly in different browsers, but it can also be used to determine whether you are using an automated program (they don't like bots).
Now, the default urllib user-agent tells the website you are using Python to access the page (usually a big no-no). You can spoof your user-agent quite easily to stop any nasty captcha codes from appearing!
import urllib2

headers = {'User-Agent': 'Mozilla/5.0'}
req = urllib2.Request('http://www.example.com', None, headers)  # the URL needs a scheme
html = urllib2.urlopen(req).read()
Because you're using Scrapy to crawl webpages, you'll need to make the change in your settings.py file instead, which is where the user-agent is configured.
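For example, something along these lines in settings.py (the exact user-agent string is up to you):

# settings.py -- override Scrapy's default user-agent with a browser-like one
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'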
EDIT
Another reason captchas might be appearing all over the place is that you are moving too fast through the website. If you add a delay between URL requests, this might solve your captcha issue!
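In Scrapy, the usual way to do that is a download delay in settings.py rather than a literal sleep call; a sketch:

# settings.py -- wait between requests (Scrapy also randomises the delay slightly by default)
DOWNLOAD_DELAY = 2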
Other reasons for captchas appearing:
You are following honeypot links (links that are present in the HTML but not displayed on the page), which are designed to catch crawlers.
You may need to change your crawling pattern, as it may be getting flagged as "non-human".
Check the website's robots.txt file, which shows what is and isn't allowed to be crawled (see the sketch after this list).
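If you want to check robots.txt from code, Python's robotparser module (urllib.robotparser in Python 3) can do it; a rough sketch against a made-up site:

import robotparser  # urllib.robotparser in Python 3

rp = robotparser.RobotFileParser()
rp.set_url('http://www.example.com/robots.txt')
rp.read()

# True if this user-agent is allowed to fetch the given path
print(rp.can_fetch('Mozilla/5.0', 'http://www.example.com/some/page'))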
I'm trying to web scrape from WITHIN a secure network. Security is already tight and I have a username and password, but if you open the site I'm trying to access with my program, you wouldn't be prompted to log in (because you're already inside the network). I'm having trouble with the authentication here...
import requests
url = "http://theinternalsiteimtryingtoaccess.com"
r = requests.get(url, auth=('myusername', 'mypass'))
print(r.status_code)
>>>401
I've tried HTTPBasicAuth, but that didn't work either. Are there any ways with requests to get around this?
Just another note: the urlopen command will open this site without requiring any authentication... Please help! Thanks!
EDIT: After finding this question- (How to scrape URL data from intranet site using python?), I tried the following:
import requests
from requests_ntlm import HttpNtlmAuth
r = requests.get("http://theinternalsiteimtryingtoaccess.aspx",auth=HttpNtlmAuth('NEED DOMAIN HERE\\usr','pass'))
print(r.status_code)
>>>401 #still >:/
RESOLVED: If you're having this problem and you're trying to access an internal site, make sure you specify your particular domain in the code. I was trying to log in, but the computer didn't know which domain to log me into. You can find the domain you're on under Control Panel > System, where the domain should be listed. Thank you!
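In other words, something like this (MYDOMAIN is a placeholder for whatever Control Panel > System shows):

import requests
from requests_ntlm import HttpNtlmAuth

# 'MYDOMAIN' is a placeholder -- use the domain listed under Control Panel > System
r = requests.get("http://theinternalsiteimtryingtoaccess.aspx",
                 auth=HttpNtlmAuth('MYDOMAIN\\myusername', 'mypass'))
print(r.status_code)  # should no longer be 401 once the domain is correct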
It's extremely unlikely we can give you the exact solution to your problem, but I would guess the intranet uses some sort of corporate proxy. If so, your requests probably need to be directed through that proxy rather than sent as if they were hitting an external public site.
For more information on this check out the official docs.
http://docs.python-requests.org/en/master/user/advanced/#proxies
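If it does turn out to be a proxy, the request would look something like this (the proxy address here is made up; your IT department will have the real one):

import requests

# Hypothetical corporate proxy address
proxies = {
    'http': 'http://proxy.mycompany.example:8080',
    'https': 'http://proxy.mycompany.example:8080',
}

r = requests.get('http://theinternalsiteimtryingtoaccess.com', proxies=proxies)
print(r.status_code)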
I want to unshorten URLs to get the real address. In some cases there is more than one redirection. I have tried using urllib2, but it seems to be making GET requests, which consumes time and bandwidth. I want to get only the headers, so that I end up with the final URL without fetching the whole body of the page.
Thanks
You need to execute an HTTP HEAD request to get just the headers.
The second answer shows how to perform a HEAD request using urllib.
How do you send a HEAD HTTP request in Python 2?
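If you're open to using the requests library instead of urllib2, here's a minimal sketch that follows the whole redirect chain with HEAD requests only (the short URL is hypothetical):

import requests

short_url = 'http://bit.ly/example'  # hypothetical shortened URL

# HEAD fetches only the headers; allow_redirects follows the full redirect chain
r = requests.head(short_url, allow_redirects=True)
print(r.url)      # the final, unshortened URL
print(r.history)  # the intermediate redirect responses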
I am trying to scrape the website http://www.nseindia.com using urllib2 and BeautifulSoup. Unfortunately, I keep getting 403 Forbidden when I try to access the page through Python. I thought it was a user agent issue, but changing that did not help. Then I thought it may have something to do with cookies, but apparently loading the page through links with cookies turned off works fine. What may be blocking requests through urllib?
http://www.nseindia.com/ seems to require an Accept header, for whatever reason. This should work:
import urllib2
r = urllib2.Request('http://www.nseindia.com/')
r.add_header('Accept', '*/*')
r.add_header('User-Agent', 'My scraping program <author@example.com>')
opener = urllib2.build_opener()
content = opener.open(r).read()
Refusing requests without Accept headers is incorrect; RFC 2616 clearly states: "If no Accept header field is present, then it is assumed that the client accepts all media types."