403 Forbidden using urllib2 [Python]

import urllib
import urllib2

url = 'https://www.instagram.com/accounts/login/ajax/'
values = {'username' : 'User',
          'password' : 'Pass'}
data = urllib.urlencode(values)
req = urllib2.Request(url, data, headers={'User-Agent': "Mozilla/5.0"})
con = urllib2.urlopen(req)
the_page = con.read()
Does anyone have any ideas about this? I keep getting a "403 Forbidden" error.
It's possible Instagram has something that won't let me connect via Python (I don't want to connect via their API). What on earth is going on here?
Thanks!
EDIT: Adding more info.
The error I was getting was this:
This page could not be loaded. If you have cookies disabled in your browser, or you are browsing in Private Mode, please try enabling cookies or turning off Private Mode, and then retrying your action.
I edited my code but am still getting that error.
import urllib
import urllib2
import cookielib

jar = cookielib.FileCookieJar("cookies")
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(jar))
print len(jar)  # prints 0
opener.addheaders = [('User-agent', 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.111 Safari/537.36')]
result = opener.open('https://www.instagram.com')
print result.getcode(), len(jar)  # prints 200 and 2

url = 'https://www.instagram.com/accounts/login/ajax/'
values = {'username': 'username',
          'password': 'password'}
data = urllib.urlencode(values)
response = opener.open(url, data)
print response.getcode()

Two important things, for starters:
Make sure you stay on the legal side. According to Instagram's Terms of Use:
We prohibit crawling, scraping, caching or otherwise accessing any content on the Service via automated means, including but not limited to, user profiles and photos (except as may be the result of standard search engine protocols or technologies used by a search engine with Instagram's express consent).
You must not create accounts with the Service through unauthorized means, including but not limited to, by using an automated device, script, bot, spider, crawler or scraper.
There is an Instagram API that would help you stay on the legal side and make life easier. There is a Python client: python-instagram.
Aside from that, Instagram itself is JavaScript-heavy and you may find it difficult to work with using just urllib2 or requests. If, for some reason, you cannot use the API, you could look into browser automation via selenium. Note that you can also automate a headless browser like PhantomJS. Here is sample code to log in:
from selenium import webdriver

USERNAME = "username"
PASSWORD = "password"

driver = webdriver.PhantomJS()
driver.get("https://www.instagram.com")

# fill in the login form and submit it
driver.find_element_by_name("username").send_keys(USERNAME)
driver.find_element_by_name("password").send_keys(PASSWORD)
driver.find_element_by_xpath("//button[. = 'Log in']").click()
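As a hedged follow-up (not part of the original answer), you could pause briefly and check that the login went through before scraping; the sleep and the checks below are illustrative rather than a definitive recipe:
import time

time.sleep(5)  # crude wait; an explicit WebDriverWait on a known element would be more robust
print(driver.current_url)        # should no longer point at the login form if the login worked
print(len(driver.page_source))   # rendered HTML of the logged-in page is available here
driver.quit()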

Related

Python requests: Url will show table in browser but not when I use requests

I am trying to scrape a table in a webpage, or even download the .xlsx file of this table, using the requests library.
Normal workflow:
I log into the site, go to my reporting page, choose a report, and click a button that says "Test"; a second window opens with my table and gives me the option to download the .xlsx file.
I can copy and paste this URL into any Chrome browser where I am currently logged in and it loads fine. When I try with requests, even when passing auth into my get(), I get a 200 response, but it is a simple page with one line of text telling me to "contact my tech staff to receive the proper url to enter your username and password". This is the same as when I paste the URL into a browser where I am not logged into the site, except that in the browser I am redirected to a new URL showing the same sentence.
So I imagine there is a slug for the organization that is not passed in the URL but somewhere in the headers or cookies when I access this site in my browser. How do I identify this parameter in the HTTP headers? And how do I send it with requests so I can get my table and move on to automating the .xlsx download?
import requests
url = 'myorganization.com/adhocHTML.xsl?x=adhoc.AdHocFilter-listAdhocData&filterID=45678&source=live'
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36'}
data = requests.get(url, headers=headers, auth=('username', 'Password'))
Any help would be greatly appreciated as I am new to the requests library and just trying to automate some data flow before it ever gets to analyzing it.
You need to log in with requests. You can do this by creating a Session and making your other requests through that session (it will keep all cookies and other state for you).
Before writing the code you should do a few steps:
Make sure you are logged out. Open the browser inspector on the login page and go to the Network tab. Log in and find the POST request in the Network tab that corresponds to your login. At the bottom of that request you will find the parameters used for the login. Put those parameters into a dictionary (login_data) in your code and proceed as below:
import requests

session = requests.Session()
session.post('url_to_login_page', data=login_data)
data = session.get(url, headers=headers)
Login data differ from one website to another, so I can't give you a specific example. You should be able to find them as described above. If you have trouble with that, let me know.
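As a hedged end-to-end illustration (the login URL and the field names below are placeholders, not taken from the site in the question), the pieces fit together roughly like this:
import requests

# placeholder values: copy the real field names from the login POST request in the Network tab
login_data = {
    'username': 'my_user',
    'password': 'my_password',
}
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36'}

session = requests.Session()
session.post('https://myorganization.com/login', data=login_data)  # placeholder login URL; sets the session cookies
data = session.get('https://myorganization.com/adhocHTML.xsl?x=adhoc.AdHocFilter-listAdhocData&filterID=45678&source=live',
                   headers=headers)
print(data.status_code)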

How to requests.Session().get if website does not keep me logged in?

I am trying to complete a web scrape of a page that requires a log-in first. I am fairly certain that I have my code and input names ('login' and 'password') correct, yet it still gives me a 'Login Failed' page. Here is my code:
import requests

payload = {'login': 'MY_USERNAME', 'password': 'MY_PASSWORD'}
login_url = "https://www.spatialgroup.com.au/property_daily/"

with requests.Session() as session:
    session.post(login_url, data=payload)
    response = session.get("https://www.spatialgroup.com.au/cgi-bin/login.cgi")
    html = response.text
    print(html)
I've done some snooping around and have figured out that the session doesn't stay logged in when I run my session.get("LOGGEDIN_PAGE"). For example, if I complete the log in process and then enter a URL into the address bar that I know for a fact is a page only accessible once logged in, it returns me to the 'Login Failed' page. How would I get around this if my login session is not maintained?
As others have mentioned, it's hard to help here without knowing the actual site you are attempting to log in to.
I'd point out that you aren't setting any HTTP headers at all, which is a common validation check for logins on web pages. If you're sure that you are POSTing the data in the right format (form-encoded versus JSON-encoded), then I would open up the Chrome inspector and copy the User-Agent from your browser.
import requests

s = requests.Session()
s.headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36',
    'Accept': '*/*'
}
Also, it's good practice to check the response status code of each web request you make using a try/except pattern. This will help you catch errors as you write and test requests, instead of blindly guessing which requests are erroneous.
r = requests.get('http://mypage.com')
try:
    r.raise_for_status()
except requests.exceptions.HTTPError:
    print('oops bad status code {} on request!'.format(r.status_code))
Edit: Now that you've given us the site, inspecting a login attempt reveals that the form data isn't actually being POSTed to that website, but rather it's being sent to a CGI script URL.
To find this, open up Chrome Inspector and watch the "Network" tab as you try to login. You'll see that the login is actually being sent to https://www.spatialgroup.com.au/cgi-bin/login.cgi, not the actual login page. When you submit to this login page, it executes a 302 redirect after logging in. We can check the location after performing the request to see if the login was successful.
Knowing this I would send a request like this:
s = requests.Session()

# try to login
r = s.post(
    url='https://www.spatialgroup.com.au/cgi-bin/login.cgi',
    headers={
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3'
    },
    data={
        'login': USERNAME,
        'password': PASSWORD
    }
)

# now let's check to make sure we didn't get 4XX or 5XX errors
try:
    r.raise_for_status()
except requests.exceptions.HTTPError:
    print('oops bad status code {} on request!'.format(r.status_code))
else:
    print('our login redirected to: {}'.format(r.url))

# subsequently, if the login was successful, you can now make a request to the login-protected page at this point
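As a hedged continuation (the protected URL below is a placeholder, not taken from the original post), a follow-up request with the same session would then look like:
# placeholder for whatever page sits behind the login
protected = s.get('https://www.spatialgroup.com.au/property_daily/')
print('protected page status: {}'.format(protected.status_code))
print(protected.text[:200])  # quick sanity check that this is not the 'Login Failed' page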
It's very difficult to help you without having the actual website you are working with. That being said, I would recommend changing this line:
session.post(login_url, data=payload)
to this one:
session.post(login_url, json=payload)
hope this helps

Scraping Data from website with a login page

I am trying to log in to my university website using Python and the requests library with the following code, but I am not able to.
import requests

payloads = {"User_ID": <username>,
            "Password": <password>,
            "option": "credential",
            "Log in": "Log in"
            }

with requests.Session() as session:
    session.post('', data=payloads)
    get = session.get("")
    print(get.text)
Does anyone have any idea on what I am doing wrong?
In order to log in you will need to post all the information requested by the <input> tags. In your case you will also have to provide the hidden inputs. You can do this by scraping these values and then posting them. You might also need to send some headers to simulate browser behaviour.
from lxml import html
import requests

s = requests.Session()
login_url = "https://intranet.cardiff.ac.uk/students/applications"
session_url = "https://login.cardiff.ac.uk/nidp/idff/sso?sid=1&sid=1"

# fetch the login page and collect the hidden form inputs
to_get = s.get(login_url)
tree = html.fromstring(to_get.text)
hidden_inputs = tree.xpath(r'//form//input[@type="hidden"]')
payloads = {x.attrib["name"]: x.attrib["value"] for x in hidden_inputs}

# add the visible credentials
payloads["Ecom_User_ID"] = "<username>"
payloads["Ecom_Password"] = "<password>"

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'}
result = s.post(session_url, data=payloads, headers=headers)
Hope this works
In order to log in to a website with Python, you will have to use a more involved method than the requests library alone, because you have to simulate the browser in your code and have it make the requests that log in to the school's servers. The school's server needs to think it is getting the request from a browser; it then returns the contents of the resulting page, and those contents have to be rendered before you can scrape them. A great way to do this is with the selenium module in Python.
I would recommend googling around to learn more about selenium. This blog post is a good example of using selenium to log in to a web page, with detailed explanations of what each line of code is doing. This SO answer on using selenium to log in to a website is also a good entry point.
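As a hedged sketch (the URL and element names below are placeholders, not the university's actual login page), a selenium login typically looks something like this:
from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://intranet.example.edu/login")  # placeholder login URL

# placeholder field names; use the names from the real login form
driver.find_element_by_name("User_ID").send_keys("<username>")
driver.find_element_by_name("Password").send_keys("<password>")
driver.find_element_by_xpath("//input[@type='submit']").click()

# the rendered, logged-in page is now available to scrape
page_html = driver.page_source
driver.quit()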

Can't automate login using python mechanize (must "activate" specific browser)

I seem to have difficulty logging into a website which requires browser authentication.
What happens is when you first log on, the website redirects you to a page saying "We have sent an email to your email, click on the link to authenticate this browser."
I'm using the mechanize module for Python. The page does log in; however, the website never recognizes the "browser", hence the many "Please register this browser" emails! I tried giving custom headers as well as adding a cookie handler as per other examples... no luck. The website thinks the script is a new (unauthorized) browser each time I visit.
Init code looks like this:
self.br = mechanize.Browser(factory=mechanize.RobustFactory())
self.br.add_handler(PrettifyHandler())
cj = cookielib.LWPCookieJar()
self.br.set_cookiejar(cj)
self.br.addheaders = [('Accept', 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'),
                      ('User-agent', 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.17 (KHTML, like Gecko) Ubuntu Chromium/24.0.1312.56 Chrome/24.0.1312.56 Safari/537.17'),
                      ('Referer', 'https://www.temp.com/logout'),
                      ('Accept-Encoding', 'gzip,deflate,sdch'),
                      ('Accept-Language', 'en-GB,en-US;q=0.8,en;q=0.6'),
                      ('Accept-Charset', 'ISO-8859-1,utf-8;q=0.7,*;q=0.3'),
                      ]
And my login code looks like this. It fills in a simple HTML form and submits it.
self.br.open('https://www.temp.com/login')
# Select the first (index zero) form
self.br.select_form(nr=0)
# User credentials
self.br.form['username'] = 'temp'
self.br.form['password'] = 'temp'
# Login
self.br.submit()
# Inventory
body = self.br.response().read().split('\n')
And yet every time I get this email: "To activate your browser, please click on the following link..." even after I follow the link and activate/authenticate the browser.
If you want to keep the session, try saving and loading the cookies with the cookie jar's save/load functions. Example:
import cookielib

cj = cookielib.LWPCookieJar()
# after a successful (activated) session, persist the cookies to disk
cj.save('cookies.txt', ignore_discard=False, ignore_expires=False)
...
# on a later run, load them back before making requests
cj.load('cookies.txt', ignore_discard=False, ignore_expires=False)
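A hedged sketch of how the saved jar could be wired back into the mechanize Browser from the question (the file name and URL are illustrative, not from the original post):
import os
import mechanize
import cookielib

cj = cookielib.LWPCookieJar()
if os.path.exists('cookies.txt'):
    # reuse cookies from the previously activated session
    cj.load('cookies.txt', ignore_discard=False, ignore_expires=False)

br = mechanize.Browser()
br.set_cookiejar(cj)
br.open('https://www.temp.com/login')
# ... log in / browse as before ...

# persist whatever cookies the site set during this run
cj.save('cookies.txt', ignore_discard=False, ignore_expires=False)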

Python script to fetch URL protected by DES/kerberos

I have a Python script that does an automatic download from a URL once a day.
Recently the authentication protecting the URL was changed. To get it to work with Internet Explorer I had to enable DES for Kerberos by adding SupportedEncryptionTypes "0x7FFFFFFF" in a registry entry somewhere. Then IE prompts me for my domain/user/password when I browse to the site.
My python code that was working before is:
def __build_ntlm_opener(self):
    passman = HTTPPasswordMgrWithDefaultRealm()
    passman.add_password(None, self.answers_url, self.ntlm_username, self.ntlm_password)
    ntlm_handler = HTTPNtlmAuthHandler(passman)
    opener = urllib.request.build_opener(ntlm_handler)
    opener.addheaders = [
        #('User-agent', 'Mozilla/5.0 (Windows NT 6.0; rv:5.0) Gecko/20100101 Firefox/5.0')
        ('User-agent', 'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0)')
    ]
    return opener
Now the code is failing with a simple 401 when using the opener:
urllib.error.HTTPError: HTTP Error 401: Unauthorized
I don't know much about Kerberos or DES but from what I see so far I can't figure out if urllib supports using these.
Is there any 3rd party library or trick I can use to get this working again?
You could try using selenium's webdriver to drive a browser directly. I do that sometimes when I want to scrape sites that are dynamically generated. Here's a code example for opening a page and entering a password:
from selenium import webdriver
b = webdriver.Chrome()
b.get('http://www.example.com')
username_field = b.find_element_by_id('username')
username_field.send_keys('my_username')
password_field = b.find_element_by_id('password')
password_field.send_keys('secret')
login_button = b.find_element_by_link_text('login').click()
That would get you past a typical login screen of a web site. Then
b.page_source
will give you the source code for the page, even if it was mainly generated with JavaScript.
The source code is very simple to parse: http://code.google.com/p/selenium/source/browse/trunk/py/selenium/webdriver/remote/webelement.py
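If you then need to pull data out of the rendered HTML, a hedged sketch using lxml (the XPath below is a placeholder, not tied to any particular site) could look like:
from lxml import html

tree = html.fromstring(b.page_source)   # parse the rendered page selenium produced
links = tree.xpath('//a/@href')         # placeholder XPath; adapt to the element you actually need
print(links[:10])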
