I'm trying to web scrape from WITHIN a secure network. Security is already tight and I have a username and password, but if you open the site I'm trying to reach in a browser, you aren't prompted to log in (because you're inside the network). I'm having trouble with the authentication here...
import requests
url = "http://theinternalsiteimtryingtoaccess.com"
r = requests.get(url, auth=('myusername', 'mypass'))
print(r.status_code)
>>>401
I've tried HTTPBasicAuth, but that didn't work either. Are there any ways with requests to get around this?
Just another note: urllib's 'urlopen' command will open this site without requiring any authentication... Please help! Thanks!
EDIT: After finding this question- (How to scrape URL data from intranet site using python?), I tried the following:
import requests
from requests_ntlm import HttpNtlmAuth
r = requests.get("http://theinternalsiteimtryingtoaccess.aspx",auth=HttpNtlmAuth('NEED DOMAIN HERE\\usr','pass'))
print(r.status_code)
>>>401 #still >:/
RESOLVED: If you're having this problem and you're trying to access an internal site, make sure that the code specifies your particular domain. I was trying to log in, but the computer didn't know which domain to log me into. You can find the domain you're on by going to Control Panel >> System; the domain should be listed there. Thank you!
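For anyone hitting the same wall, a minimal sketch of the working call; MYDOMAIN, myusername, and mypass are placeholders for your own values:
import requests
from requests_ntlm import HttpNtlmAuth

# 'MYDOMAIN' is a placeholder -- use the domain shown under
# Control Panel >> System on your machine.
r = requests.get(
    "http://theinternalsiteimtryingtoaccess.com",
    auth=HttpNtlmAuth("MYDOMAIN\\myusername", "mypass"),
)
print(r.status_code)  # expect 200 now rather than 401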
It's extremely unlikely we can give you the exact solution to your problem, but my guess is that the intranet sits behind some sort of corporate proxy. If so, your requests need to be directed through the proxy rather than sent as if you were hitting an external public site.
For more information on this check out the official docs.
http://docs.python-requests.org/en/master/user/advanced/#proxies
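A rough sketch of what that looks like (the proxy host and port here are invented; substitute whatever your network actually uses):
import requests

# Hypothetical proxy address -- replace with your corporate proxy.
proxies = {
    "http": "http://proxy.mycompany.com:8080",
    "https": "http://proxy.mycompany.com:8080",
}
r = requests.get("http://theinternalsiteimtryingtoaccess.com", proxies=proxies)
print(r.status_code)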
Related
I'm trying to access and get data from www.cclonline.com website using python script.
this is the code.
import requests
from requests_html import HTML
source = requests.get('https://www.cclonline.com/category/409/PC-Components/Graphics-Cards/')
html = HTML(html=source.text)
print(source.status_code)
print(html.text)
these are the errors I get:
403
Access denied | www.cclonline.com used Cloudflare to restrict access
Please enable cookies.
Error 1020
Ray ID: 64c0c2f1ccb5d781 • 2021-05-08 06:51:46 UTC
Access denied
What happened?
This website is using a security service to protect itself from online attacks.
How can I solve this problem? Thanks.
So the site's robots.txt does not explicitly say that bots are disallowed. But you need to make your request look like it's coming from an actual browser.
Now to solve the issue at hand: the response says you need to have cookies enabled. That can be solved by driving a real browser with Selenium. Selenium gives you everything a browser has to offer (it basically uses Google Chrome, or a browser of your choice, as a driver). It will make the server think the request is coming from an actual browser, and the server will return a response.
Learn more about how to use selenium for scraping here.
Also remember to adjust your crawl rate accordingly: make pauses after each request and swap user agents often.
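A minimal sketch of that approach (assuming Chrome and a matching chromedriver are installed; the user-agent string is just an example, and the URL is the one from the question):
from selenium import webdriver

options = webdriver.ChromeOptions()
# A browser-like user agent; rotate this between runs.
options.add_argument(
    "user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
    "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36"
)

driver = webdriver.Chrome(options=options)
driver.get("https://www.cclonline.com/category/409/PC-Components/Graphics-Cards/")
print(driver.page_source[:500])  # first 500 characters of the rendered page
driver.quit()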
There's no silver bullet for solving Cloudflare challenges. In my projects I've tried the solutions proposed on this website, using Playwright with different options: https://substack.thewebscraping.club/p/cloudflare-how-to-scrape
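For reference, a bare-bones Playwright attempt looks roughly like this (sync API; whether it actually clears the challenge depends on the site's Cloudflare settings):
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    # Running headed (headless=False) tends to trip fewer bot checks.
    browser = p.chromium.launch(headless=False)
    page = browser.new_page()
    page.goto("https://www.cclonline.com/")
    page.wait_for_load_state("networkidle")
    print(page.title())
    browser.close()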
I want to scrape data from a website which has an initial log-on (where I have working credentials). It is not possible to inspect the page source for this, as it is a log-on prompt that pops up before you reach the site. I tried searching around, but did not find any answer; perhaps I do not know what to search for.
This is what you get when going to the site:
[Screenshot: a browser pop-up titled "Log on" asking for credentials]
Any help is appreciated :-)
The solution is to use the public REST API for the site.
If the web site does not provide a REST API for interacting with it you should not be surprised that your attempt at simulating a human is difficult. Web scraping is generally only possible for pages that do not require authentication or utilize the standard HTTP 401 status response to tell the client that it should prompt the user to respond with the correct credentials. If the site is using a different mechanism, most likely based on AJAX, then the solution is going to be specific to that web site or other sites using the same mechanism. Which means that no one can answer your question since you did not tell us which web site you are interacting with.
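If it does turn out to be an ordinary login form rather than HTTP authentication, the usual pattern is to post the credentials with a session so the cookies stick. A rough sketch, where the /login endpoint and field names are invented and must be read off the real form (e.g. from the browser's network tab):
import requests

session = requests.Session()
# Hypothetical endpoint and field names -- inspect the site's actual
# login form to find the real ones.
payload = {"username": "myuser", "password": "mypass"}
session.post("https://example.com/login", data=payload)

# The session now carries any cookies the login set.
r = session.get("https://example.com/protected-page")
print(r.status_code)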
Based on your screenshot this looks like it is just using Basic Auth.
Using the library "requests":
import requests

url = 'https://webaddress.com/'  # placeholder: the page behind the log-on
session = requests.Session()
# HTTPBasicAuth matches the Basic Auth diagnosis above.
r = session.get(url, auth=requests.auth.HTTPBasicAuth('user', 'pass'))
Should get you there.
I couldn't get Tom's answer to work, but I found a workaround:
from selenium import webdriver
driver = webdriver.Chrome('path to chromedriver')
driver.get('https://user:password@webaddress.com/')  # credentials embedded in the URL
This worked :)
I am trying to parse the website "https://ih.advfn.com/stock-market/NYSE/gen-electric-GE/stock-price" and extract the most recent messages from its board. It is bot-protected with Cloudflare. I am using Python and its related libraries, and this is what I have so far:
from bs4 import BeautifulSoup as soup #parses/cuts the html
import cfscrape
import requests
url = 'https://ih.advfn.com/stock-market/NYSE/gen-electric-GE/stock-price'
r=requests.get(url)
html = soup(r.text, "html.parser")
containers = html.find("div",{"id":"bbPosts"})
print(containers.text.strip())
I am not able to use the html parser because the site detects my script and blocks it.
My questions are:
How can I parse the web pages to pull the table data?
Might I mention that this is for a security class I am taking. I am not using this for malicious reasons.
There are multiple ways of bypassing the site protection. You have to see exactly how they are blocking you.
One common way of blocking requests is to look at the User-Agent header. The client (in your case, the requests library) will inform the server about its identity.
Generally speaking, a browser will say I am a browser and a library will say I am a library. The server can then say I allow browsers but not libraries to access my content.
However, for this particular case, you can simply lie to the server by sending your own User Agent header.
You can see an example here. Try to use your browser's user agent.
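A quick sketch of that (the UA string below is just an example; copy the one your own browser sends):
import requests

# Pretend to be a regular desktop browser instead of python-requests.
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/90.0.4430.93 Safari/537.36"
}
r = requests.get("https://ih.advfn.com/stock-market/NYSE/gen-electric-GE/stock-price",
                 headers=headers)
print(r.status_code)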
Other blocking techniques include IP ranges. One way to bypass this is via a VPN. This is one of the easiest VPNs to set up: just spin up a machine on Amazon and get this container running.
What else could happen: you might be trying to access a single-page application that is not rendered server-side. In that case, what you receive from that GET request is a very small HTML file that essentially references a JavaScript file. If this is the case, what you need is an actual browser that you control programmatically. I would suggest you look at Google Chrome Headless, but there are others. You can also use Selenium.
Web crawling is a beautiful but very deep subject. I think these pointers should set you on the right direction.
Also, as a quick mention, my advice is to avoid 'from bs4 import BeautifulSoup as soup'. I would recommend html2text.
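For reference, html2text usage is roughly this (feeding it whatever HTML string you fetched):
import html2text

converter = html2text.HTML2Text()
converter.ignore_links = True  # drop hyperlinks, keep the readable text
text = converter.handle("<p>Hello, <b>world</b>!</p>")
print(text)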
Sorry for asking a very basic question. This is the first time I am using a REST API. I tried to find an answer to this question on Google, but I was unable to.
I need to log into a website using password-based authentication with a REST API and Python. I don't want to use Selenium or other such tools for this purpose, due to company policy.
Website Name : https://www.test.com
Username : admin
Password : test#123
Any clue/idea on proceding further please?
Try using the Google Chrome REST client extension or Postman.
If you need to issue requests and authenticate yourself using HTTP Basic Authentication, this example from the docs should help:
Use of Basic HTTP Authentication:
import urllib.request

# Create an OpenerDirector with support for Basic HTTP Authentication...
auth_handler = urllib.request.HTTPBasicAuthHandler()
auth_handler.add_password(realm='PDQ Application',
                          uri='https://mahler:8092/site-updates.py',
                          user='klem',
                          passwd='kadidd!ehopper')
opener = urllib.request.build_opener(auth_handler)
# ...and install it globally so it can be used with urlopen.
urllib.request.install_opener(opener)
urllib.request.urlopen('http://www.example.com/login.html')
For more complex usage scenarios I suggest using httplib2. A great introduction to it is here.
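For what it's worth, the httplib2 equivalent of the Basic Auth flow above is roughly this (using the placeholder site and credentials from the question):
import httplib2

h = httplib2.Http()
# Credentials are replayed when the server answers with a 401 challenge.
h.add_credentials('admin', 'test#123')
response, content = h.request('https://www.test.com/', 'GET')
print(response.status)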
I am trying to scrape the website http://www.nseindia.com using urllib2 and BeautifulSoup. Unfortunately, I keep getting 403 Forbidden when I try to access the page through Python. I thought it was a user agent issue, but changing that did not help. Then I thought it may have something to do with cookies, but apparently loading the page through links with cookies turned off works fine. What may be blocking requests through urllib?
http://www.nseindia.com/ seems to require an Accept header, for whatever reason. This should work:
import urllib2
r = urllib2.Request('http://www.nseindia.com/')
r.add_header('Accept', '*/*')
r.add_header('User-Agent', 'My scraping program <author@example.com>')
opener = urllib2.build_opener()
content = opener.open(r).read()
Refusing requests without Accept headers is incorrect; RFC 2616 clearly states:
    If no Accept header field is present, then it is assumed that the client accepts all media types.