Trying to Download File via https - python

I am attempting to download a zip file from a website that is served over an https:// link. I have tried the following, but I can't seem to get any output. Could anyone suggest what I might be doing wrong?
URL = www.somewebsite.com
Download zip file = www.somewebsite.com/output/revisionId=40687821$$Xiiy75&action_id=
import requests
url = 'http://somewebsite.org'
user, password = 'bob', 'I love cats'
resp = requests.get(url, auth=(user, password))

To download a file from a non-protected URL, do something like:
import requests
url = 'http://somewebsite.org'
user, password = 'bob', 'I love cats'
resp = requests.get(url, auth=(user, password))
with open("result.zip", "wb") as fout:
fout.write(resp.content)
Of course you should check whether you got a valid response before writing the zip file.
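For example, a minimal check before writing the file might look like this (same URL and credentials as above; the Content-Type test is just a heuristic):

import requests

url = 'http://somewebsite.org'
user, password = 'bob', 'I love cats'
resp = requests.get(url, auth=(user, password))
resp.raise_for_status()  # raises requests.HTTPError on 4xx/5xx responses
# A login or error page would typically come back as HTML instead of a zip:
if 'html' in resp.headers.get('Content-Type', ''):
    raise ValueError("Got an HTML page instead of a zip file")
with open("result.zip", "wb") as fout:
    fout.write(resp.content)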
For a considerable number of websites with a login, the following recipe will work.
However, if asite.com uses too much JavaScript, this might not necessarily work.
Use a requests session in order to store any session cookies, and perform the following three steps:
1. GET the login URL. This will obtain any session cookies or CSRF protection cookies.
2. POST to the login URL with the username and password. The names of the form fields to be posted depend on the page. Use your web browser in debug mode to learn the right values to post; this can be more parameters than just username and password.
3. GET the document URL and save the result to a file.
On Firefox, for example, you go to the website you want to log in to, press F12 (for debug mode), click on the network tab and then on reload. Then fill in the login form, submit it, and look in the debug panel for a POST request.
The generic Python code would look like this:
import requests

def login_and_download():
    ses = requests.session()

    # Step 1: get the login page.
    rslt = ses.get("https://www.asite.com/login-home")
    # Now any potentially required cookie will be set.
    if rslt.status_code != 200:
        print("failed getting login page")
        return False

    # For simple pages you can proceed to log in right away.
    # For slightly more complicated pages you might have to parse the HTML.
    # For really annoying pages that use loads of javascript it might be
    # even more complicated.

    # Step 2: perform a post request to log in.
    # The URL and form fields depend on the site you want to connect to;
    # you have to analyze its login procedure.
    login_post_url = "https://www.asite.com/login"
    rslt = ses.post(login_post_url,
                    data={"username": "bob", "password": "I love cats"})
    if rslt.status_code != 200:
        print("failed logging in")
        return False

    # Step 3: download the url that you want to get.
    rslt = ses.get(url_of_your_document)  # placeholder: the URL of the file to fetch
    if rslt.status_code != 200:
        print("failed fetching the file")
        return False
    with open("result.zip", "wb") as fout:
        fout.write(rslt.content)
    return True
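If the login page embeds hidden form fields such as a CSRF token (the "parse the HTML" case mentioned above), that step could be sketched roughly as follows; this assumes BeautifulSoup is installed, and the field handling is generic rather than site-specific:

from bs4 import BeautifulSoup

# Sketch: collect every hidden <input> from the login page so that
# CSRF-style tokens are posted back along with the credentials.
soup = BeautifulSoup(rslt.text, "html.parser")
payload = {inp["name"]: inp.get("value", "")
           for inp in soup.find_all("input", type="hidden")
           if inp.get("name")}
payload.update({"username": "bob", "password": "I love cats"})
rslt = ses.post(login_post_url, data=payload)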

Related

How to download a file with authentication?

I'm working with the website 'musescore.com', which has many files in the '.mxl' format that I need to download automatically with Python.
Each file on the website has a unique ID number. Here's a link to an example file:
https://musescore.com/user/43726/scores/76643
The last number in the URL is the ID number for this file. I have no idea where on the website the mxl file for this score is located, but I know that to download the file, one must visit this URL:
https://musescore.com/score/76643/download/mxl
This link is the same for every file, but with that file's particular ID number in it. As I understand it, this URL executes code that downloads the file rather than being an actual path to the file.
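In other words, the download URL can be derived from the ID alone; as a trivial sketch:

def download_url(score_id):
    # The endpoint is identical for every score; only the ID changes.
    return 'https://musescore.com/score/%d/download/mxl' % score_id

print(download_url(76643))  # https://musescore.com/score/76643/download/mxl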
Here's my code:
import requests
url = 'https://musescore.com/score/76643/download/mxl'
user = 'myusername'
password = 'mypassword'
r = requests.get(url, auth=(user, password), stream=True)
with open('file.mxl', 'wb') as f:
    for chunk in r.iter_content(chunk_size=1024):
        f.write(chunk)
This code downloads a webpage saying I need to sign in to download the file. It is supposed to download the mxl file for this score. This must mean I am not authenticating with the website properly. How can I fix this?
By passing an auth parameter to get, you're attempting to use HTTP Basic Authentication, which is not what this particular site uses. You'll need to use an instance of requests.Session to post to their login endpoint and maintain the cookie(s) that result from that process.
Additionally, this site uses a CSRF token that you must first extract from the login page and include with your post to the login endpoint.
Here is a working example; obviously you will need to change the username and password to your own:
import requests
from bs4 import BeautifulSoup
s = requests.Session()
r = s.get('https://musescore.com/user/login')
soup = BeautifulSoup(r.content, 'html.parser')
csrf = soup.find('input', {'name': '_csrf'})['value']
s.post('https://musescore.com/user/auth/login/process', data={
    'username': 'herp@derp.biz',
    'password': 'secret',
    '_csrf': csrf,
    'op': 'Log in'
})
r = s.get('https://musescore.com/score/76643/download/mxl')
print(f"Status: {r.status_code}")
print(f"Content-Type: {r.headers['content-type']}")
Result, with content type showing it is successfully downloading the file:
Status: 200
Content-Type: application/vnd.recordare.musicxml
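To actually save the download rather than just inspect the headers, you could stream it to disk from the same session (a sketch along the lines of the question's own loop):

# Stream the authenticated download to disk in chunks.
with s.get('https://musescore.com/score/76643/download/mxl', stream=True) as r:
    r.raise_for_status()
    with open('file.mxl', 'wb') as f:
        for chunk in r.iter_content(chunk_size=8192):
            f.write(chunk)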

Python log into Voobly

I've read through dozens of pages on how to log into a web page using Python, but I can't seem to make my code work. I'm trying to log into a site called "Voobly", and I'm wondering if there might be something specific to Voobly that is making this more difficult. Here is my code:
import requests
loginURL = "https://www.voobly.com/login"
matchUrl = "https://www.voobly.com/profile/view/124993231/Matches"
s = requests.session()
loginInfo = {"username":"myUsername", "password":"myPassword"}
firstGetRequest = s.get(loginURL) # Get the login page using our session so we save the cookies
postRequest = s.post(loginURL,data=loginInfo) # Post data to the login page, the data being my login information
getRequest = s.get(matchUrl) # Get content from a login - restricted page
response = getRequest.content.decode() # Get the actual html text from restricted page
if "Page Access Failed" in response: # True if I'm blocked
print("Failed")
else: # If I'm not blocked, I have the result I want
print("Worked!") # I can't achieve this
As mentioned in the comments, the login form is submitted to /login/auth, but the cookie is generated from the /login URL.
Use the following code:
import requests

form = {'username': USERNAME, 'password': PASSWORD}
with requests.Session() as s:
    # Get the cookie
    s.get('https://www.voobly.com/login')
    # Post the login form data
    s.post('https://www.voobly.com/login/auth', data=form)
    # Go to home page
    r = s.get('https://www.voobly.com/welcome')
    # Check if username is in response.text
    print(USERNAME in r.text)
    # True
    r2 = s.get('https://www.voobly.com/profile/view/124993231/Matches')
    if "Page Access Failed" in r2.text:
        print("Failed")
    else:
        print("Worked!")
        # Worked!
Note: the "Go to home page" part is not needed for the login at all. It's used just to show that the login was successful.

Scrape website that uses javascript with python

I am attempting to scrape a website using the following code:
import re
import requests

def get_csrf(page):
    matchme = r'name="csrfToken" value="(.*)" /'
    csrf = re.search(matchme, str(page))
    csrf = csrf.group(1)
    return csrf

def login():
    login_url = 'https://www.edline.net/InterstitialLogin.page'
    with requests.Session() as s:
        login_page = s.get(login_url)
        csrf = get_csrf(login_page.text)
        username = 'USER'
        password = 'PASS'
        login = {'screenName': username,
                 'kclq': password,
                 'csrfToken': csrf,
                 'TCNK': 'authenticationEntryComponent',
                 'submitEvent': '1',
                 'enterClicked': 'true',
                 'ajaxSupported': 'yes'}
        page = s.post(login_url, data=login)
        r = s.get("https://www.edline.net/UserDocList.page?")
        print(r.text)

login()
This logs into https://www.edline.net/InterstitialLogin.page successfully, but the problem I have is when I try to do
r = s.get("https://www.edline.net/UserDocList.page?")
print(r.text)
It doesn't print the expected page; instead it throws an error. Upon further testing I discovered that it throws this error even if you try to go directly to the page from a browser. So when I investigated the page source, I found that the button used to link to the page I'm trying to scrape uses the following code:
Private Reports
So essentially I am looking for a way to trigger the above JavaScript code in Python in order to scrape the resulting page.
It is impossible to answer this question without more context than this single link.
However, the first thing you want to check, in the case of JavaScript-driven content generation, is the requests made by your web page when clicking on that link.
To do this, take a look at the network panel in your browser's console. Record the requests being made, and look especially for XHR requests. Then you can try to replicate them, e.g. with the requests library.
content = requests.get('xhr-url')
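A slightly fuller sketch; the endpoint and headers here are hypothetical placeholders that you would copy from the network panel:

import requests

# Hypothetical XHR endpoint and headers taken from the browser's network panel.
xhr_url = 'https://www.edline.net/some/xhr/endpoint'
headers = {
    'X-Requested-With': 'XMLHttpRequest',  # many sites expect this on XHR calls
    'Referer': 'https://www.edline.net/UserDocList.page',
}
with requests.Session() as s:
    # Log in first (see the question's code), then replay the XHR request.
    resp = s.get(xhr_url, headers=headers)
    print(resp.status_code, resp.text[:200])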

Python: How to login to facebook before making a request using urllib

I am trying to generate an authorization code for using Facebook Ads API. Details for generating authorization code can be found here.
I need to generate this code often, so I am planning to do it programmatically. The requested URL opens a dialog box on a user's first ever request. On subsequent requests, the dialog box does not appear and the user is redirected to another page where the required code is present in the URL.
So, my question is: how can I log in to Facebook and get the URL of the page I am redirected to?
I have written the following method:
import getpass
import urllib

import requests
from requests.auth import HTTPBasicAuth

def facebook_login():
    username = raw_input("Username:")
    password = getpass.getpass()
    requests.get('https://facebook.com', auth=HTTPBasicAuth(username, password))

def generate_auth_code():
    url = ("").join([BASE_URL, "/dialog/oauth?client_id=", APP_ID,
                     "&redirect_uri=", APP_URL, "&scope=ads_management"])
    facebook_login()
    response = urllib.urlopen(url)
    print response.geturl()
But I get the following output:
https://www.facebook.com/login.php?skip_api_login=1&api_key=APP_ID&signed_next=1&next=https%3A%2F%2Fwww.facebook.com%2Fv2.1%2Fdialog%2Foauth%3Fredirect_uri%3Dhttp%253A%252F%252FMY_APP_URL%252F%26scope%3Dads_management%26client_id%CLIENT_ID%26ret%3Dlogin&cancel_uri=http%3A%2F%2FMY_APP_URL%2F%3Ferror%3Daccess_denied%26error_code%3D200%26error_description%3DPermissions%2Berror%26error_reason%3Duser_denied%23_%3D_&display=page
What is the correct way of doing this?

Using Python to login webpage (cookies GET and POST involved)

I am trying to use Python 2.7.6 to log in to a website. The login logic consists of 2 steps on 2 webpages:
1. Putting in the user ID and password on page A, which hands back a cookie;
2. This cookie is used in the header to authenticate the login on page B.
The login only succeeds once B has authenticated it.
There's a post here, HTTP POST and GET with cookies for authentication in python, asking a similar question. A solution is to use requests:
import requests
url_0 = "http://www.PAGE_A.com/" # http://webapp.pucrs.br/consulta/principal.jsp, in original example
url = "http://www.PAGE_B.com/" # https://webapp.pucrs.br/consulta/servlet/consulta.aluno.ValidaAluno, in original example
data = {"field1_username": "ABC", " field_password": "123"}
s = requests.session()
s.get(url_0)
r = s.post(url, data)
I tried using this in Python for my case, and it doesn't return an error message, so I guess it works fine.
But the question is: how do I know it's logged in?
I added the following to print the logged-in page, to see if it returned the right page.
import mechanize
br = mechanize.Browser()
open_page = br.open("http://www.PAGE_B.com/")
read_page = open_page.read()
print read_page
However, it still shows the contents from before the login. What went wrong?
How about just going with one of the two?
import mechanize
browser = mechanize.Browser()
browser.addheaders = [('...')]
browser.open(YOUR_URL)
browser.select_form(FORM_NAME)
browser.form['USERNAME_FIELD'] = 'abc'
browser.form['PASSWORD_FIELD'] = 'password'
browser.submit()
print browser.response().read()
print browser.geturl()
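Alternatively, if you stay with requests: the mechanize browser in the question keeps its own cookie jar, so it never sees the session cookie obtained by s.post. A sketch of checking the result with the same session (the "Logout" marker is an assumption; use any string that only appears after login):

# Reuse the requests session (s) and page B URL (url) from the question.
r = s.get(url)  # fetch page B again, now with the session's cookies
# Hypothetical marker: look for text that only appears when logged in.
if "Logout" in r.text:
    print "logged in"
else:
    print "still anonymous"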
