I am attempting to do a bulk download of a series of PDFs from a site that requires login authentication. I am able to log in successfully; however, when I attempt a GET request for '/transcripts/transcript.pdf?user_id=3007', the request returns the content for '/transcripts/transcript.pdf'.
Does anyone have any idea why the URL parameter is not being sent? Or why the request would be rerouted?
I have tried passing the parameter 'user_id' as data, as params, and hardcoded in the URL.
I have removed the actual domain from the strings below just for privacy.
import lxml.html
import requests

with requests.Session() as s:
    login = s.get('<domain>/login/canvas')
    # Print the returned HTML (or something more intelligent) to check that this is the login page.
    print(login.text)
    login_html = lxml.html.fromstring(login.text)
    # Collect the hidden form fields (CSRF token etc.) so they can be posted back.
    hidden_inputs = login_html.xpath(r'//form//input[@type="hidden"]')
    form = {x.attrib["name"]: x.attrib["value"] for x in hidden_inputs}
    print("form: ", form)
    form['pseudonym_session[unique_id]'] = username
    form['pseudonym_session[password]'] = password
    response = s.post('<domain>/login/canvas', data=form)
    print(response.url, response.status_code)  # gets <domain>?login_success=1 200
    # An authorised request.
    data = {'user_id': '3007'}
    r = s.get('<domain>/transcripts/transcript.pdf?user_id=3007', data=data)
    print(r.url)  # gets <domain>/transcripts/transcript.pdf
    print(r.status_code)  # gets 200
    with open('test.pdf', 'wb') as f:
        f.write(r.content)
GET response returns /transcripts/transcript.pdf and not /transcripts/transcript.pdf?user_id=3007
From the looks of it, you are trying to use Canvas. I'm pretty sure that Canvas lets you bulk download all test attachments.
If that's not the case, there are a few things to try:
After logging in, try typing the URL with user_id into a browser. Does that take you directly to the PDF file, or to a page that links to one?
If so, look at the URL in the address bar. It may simply not display the parameters; some websites do this, and it is nothing to worry about.
If not, a plain GET may not be enough; the site may rely on JavaScript, for example.
After looking through the '.history' of the response, I found a series of 302 redirects.
The first was to '/login?force_login=0&target_uri=%2Ftranscripts%2Ftranscript.pdf'
In a desperate attempt, I tried: s.get('/login?force_login=0&target_uri=%2Ftranscripts%2Ftranscript.pdf%3Fuser_id%3D3007') and this still rerouted me a few times but ultimately got me the file I wanted!
If anyone has a more elegant solution to this or any resources that I can read I would greatly appreciate it!
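The workaround above can be sketched with the standard library. This is only an illustration of the idea: the helper names build_login_redirect_url and describe_redirects are made up here, and the '/login?force_login=0&target_uri=...' shape is taken from the redirect observed above, not from any documented Canvas API.

```python
from urllib.parse import urlencode


def build_login_redirect_url(target_path, query):
    """Return a /login URL whose target_uri carries the path and its query string."""
    # Re-attach the query string to the path before it is percent-encoded,
    # so that it travels inside target_uri through the redirect chain.
    target_uri = f"{target_path}?{urlencode(query)}"
    return "/login?" + urlencode({"force_login": 0, "target_uri": target_uri})


def describe_redirects(response):
    """Return (status_code, Location header) for each hop in response.history."""
    return [(hop.status_code, hop.headers.get("Location")) for hop in response.history]


print(build_login_redirect_url("/transcripts/transcript.pdf", {"user_id": "3007"}))
```

Calling describe_redirects(r) on the response from the failing request shows each 302 hop and where it pointed, which is how the dropped query string can be spotted.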
I tried the following script, but unfortunately the output file is identical to the input file. I'm not sure what's wrong with it.
import requests

url_lines = open('banana1.txt').read().splitlines()
remove_from_urls = []
for url in url_lines:
    remove_url = requests.get(url)
    print(remove_url.status_code)
    if remove_url.status_code == 404:
        remove_from_urls.append(url)
        continue
url_lines = [url for url in url_lines if url not in remove_from_urls]
print(url_lines)
# Save urls example
with open('banana2.txt', 'w+') as file:
    for item in url_lines:
        file.write(item + '\n')
There seems to be no error in your code, but there are a few things that would make it more readable and consistent. The first step should be to make sure that at least one URL actually returns a 404 status code.
Edit: After providing the actual URL.
The 404 problem
In your case, the problem is that Twitter does not actually return a 404 error for your "Not found" URL. You can test it using curl:
$ curl -o /dev/null -w "%{http_code}" "https://twitter.com/davemeltzerWON/status/1321279214365016064"
200
Or using Python:
import requests
response = requests.get("https://twitter.com/davemeltzerWON/status/1321279214365016064")
print(response.status_code)
The output for both should be 200.
Since Twitter is a JavaScript application that loads its content after the page has been processed in the browser, you cannot find the information you are looking for in the HTML response. You would need to use something like Selenium to actually run the JavaScript for you, and then you would be able to look for actual text like "not found" on the web page.
Code review
Please make sure to close the file properly. Also, a file object iterates over its lines, so you can convert it to a collection very easily. Another trick to make the code more readable is to use a Python set. Strip each line as you read it, because every line still ends with a newline character, and note that a set does not preserve the original line order. So you may read the file like this:
with open("banana1.txt") as fid:
    url_lines = {line.strip() for line in fid if line.strip()}
Then you simply remove all the links that do not work:
not_working = set()
for url in url_lines:
    if requests.get(url).status_code == 404:
        not_working.add(url)
working = url_lines - not_working
with open("banana2.txt", "w") as fid:
    fid.write("\n".join(working))
Also, if some of the links point to the same server, you should make use of the requests.Session class:
from requests import Session
session = Session()
Then replace requests.get with session.get. You should get some performance boost, since the Session reuses the underlying connection (keep-alive), among other things.
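Putting the session advice together, a minimal sketch might look like the following. The function names find_not_found and keep_working are hypothetical, and the session argument is assumed to be a requests.Session (anything with a compatible get method works).

```python
def find_not_found(urls, session, timeout=10):
    """Return the set of URLs for which the server answers 404."""
    not_found = set()
    for url in urls:
        # session.get reuses pooled connections to hosts it has already contacted.
        if session.get(url, timeout=timeout).status_code == 404:
            not_found.add(url)
    return not_found


def keep_working(urls, session):
    """Return the URLs that did not answer 404, in their original order."""
    not_found = find_not_found(urls, session)
    return [url for url in urls if url not in not_found]
```

With requests installed, this could be used as `with requests.Session() as session: working = keep_working(url_lines, session)`. Keeping the list comprehension preserves the order of the input file, which a plain set difference would not.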
I am writing a script that downloads Sentinel 2 products (satellite imagery) using sentinelsat Python API.
A product's description is structured as JSON and contains the parameter quicklook_url.
Example:
https://apihub.copernicus.eu/apihub/odata/v1/Products('862619d6-9b82-4fe0-b2bf-4e1c78296990')/Products('Quicklook')/$value
Any Sentinel API call requires credentials, and so do retrieving a product and opening the link stored in quicklook_url. When I open the example in my browser, I am asked to enter a username and password in order to get the image with the name S2A_MSIL2A_20210625T065621_N0300_R063_T39NTJ_20210625T093748-ql.jpg.
Needless to say, I am just starting with the API, so I am probably missing something, but
requests.post(product_description['quicklook_url'], verify=False, auth=HTTPBasicAuth(username, password)).content
yields a damaged 0 KB file and
requests.get(product_description['quicklook_url']).content
yields a damaged 1 KB file.
I have looked into requests.Session
session = requests.Session()
session.auth = (username, password)
auth = session.post('URL_FOR_LOGING_IN')
img = session.get(product_description['quicklook_url']).content
The problem is that I am unable to find the URL to which I need to post my session authentication. I am fairly sure that the sentinelsat API does that, but my searches have not yielded any result.
I am currently looking into the SentinelAPI class. It has the download_quicklook() function, which I am using right now but I am still curious how to do this without the function.
I guess you don't need to send a POST request. Basic authentication works by sending a header along with each request. The following should work:
session = requests.Session()
session.auth = (username, password)
img = session.get(product_description['quicklook_url']).content
I think your first attempt failed because it used POST.
requests.get(product_description['quicklook_url'], verify=False, auth=HTTPBasicAuth(username, password)).content
should also work (with HTTPBasicAuth imported from requests.auth).
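To make the "header along with each request" point concrete, here is a small sketch of how a Basic authentication header is built, following RFC 7617. The function name basic_auth_header is made up for illustration, and the UTF-8 encoding is an assumption (requests itself may encode the credentials differently); with requests you would normally just pass auth=(username, password).

```python
import base64


def basic_auth_header(username, password):
    """Build the Authorization header that Basic authentication sends."""
    # The credentials are joined with a colon and base64-encoded.
    # This is an encoding, not encryption, so it should only travel over HTTPS.
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    return {"Authorization": f"Basic {token}"}


print(basic_auth_header("Aladdin", "open sesame"))
# {'Authorization': 'Basic QWxhZGRpbjpvcGVuIHNlc2FtZQ=='}
```

Because the header is attached to every request, no separate login URL is needed, which is why the session.post step can be dropped.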
I am trying to connect to the API as explained at http://api.instatfootball.com/. It is supposed to be something like the following: get /[lang]/data/[action].[format]?login=[login]&pass=[pass]. I know the [lang], [action] and [format] I need to use, and I also have a login and password, but I don't know how to access the information inside the API.
If I write the following code:
import requests
r = requests.get('http://api.instatfootball.com/en/data/stat_params_players.json', auth=('login', 'pass'))
r.text
with the actual login and pass, I get the following output:
{"status":"error"}
This API requires the credentials to be passed as query parameters over an insecure (plain HTTP) connection, so be aware that this is a serious weakness on the API's side.
import requests
username = 'login'
password = 'password'
base_url = 'http://api.instatfootball.com/en/data/{endpoint}.json'
r = requests.get(base_url.format(endpoint='stat_params_players'), params={'login': username, 'pass': password})
data = r.json()
print(r.status_code)
print(r.text)
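The difference from auth=('login', 'pass') is that params puts the credentials into the query string, which is what the documented URL pattern asks for. A rough standard-library sketch of the URL that ends up being requested is shown below; the helper name build_api_url is hypothetical, and the pattern is the one quoted in the question.

```python
from urllib.parse import urlencode

BASE = "http://api.instatfootball.com"


def build_api_url(lang, action, fmt, login, password):
    """Fill in /[lang]/data/[action].[format]?login=[login]&pass=[pass]."""
    # urlencode percent-escapes any special characters in the credentials.
    query = urlencode({"login": login, "pass": password})
    return f"{BASE}/{lang}/data/{action}.{fmt}?{query}"


print(build_api_url("en", "stat_params_players", "json", "login", "password"))
# http://api.instatfootball.com/en/data/stat_params_players.json?login=login&pass=password
```

Printing r.url after the requests.get call should show a URL of the same shape, which is a quick way to confirm the parameters were sent.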
You will need to make an HTTP request using the URL. This will return the requested data in the response body. Depending on the [format] parameter, you will need to decode the data from XML or JSON into a native Python object.
As rdas already commented, you can use the requests library for Python (https://requests.readthedocs.io/en/master/). You will also find some code samples there. It will also do proper decoding of JSON data.
If you want to play around with the API a bit, you can use a tool like Postman for testing and debugging your requests. (https://www.postman.com/)
Does anyone have any idea why this script, where I use requests.post to log in, returns 404 Not Found, while the same script using only requests.get returns 200 OK? What should I change?
import requests
URL = 'https://www.stratfor.com/login'
session = requests.Session()
page = session.post(URL)
print(page.status_code, page.reason)
Thank you.
It seems to work with a GET request. A POST should arguably have returned 405, but that depends on the server.
One good way to find the right login endpoint is to log the network calls in the browser.
After looking at the calls, a request is sent to
URL = 'https://www.stratfor.com/api/v3/user/login'
The API endpoint actually expects a payload like this:
payload = {'username': "YOUR_USER", 'password': "YOUR_PASS"}
Try something like this:
r = requests.post(URL, json=payload)
You might need to pass more headers, which you can find in the network call log. It also seems that the username and password are passed as raw strings here. If so, that's definitely not safe.
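As a rough sketch of the steps above, the login could be wrapped in a small function. The endpoint and the field names come from the network log described above; the helper names are made up, and the session argument is assumed to be a requests.Session. Whether the server needs extra headers is not known here.

```python
LOGIN_URL = "https://www.stratfor.com/api/v3/user/login"


def build_login_payload(username, password):
    """Return the JSON body the login endpoint appears to expect."""
    return {"username": username, "password": password}


def log_in(session, username, password, timeout=10):
    """POST the credentials as JSON and fail loudly on an HTTP error status."""
    response = session.post(
        LOGIN_URL,
        json=build_login_payload(username, password),
        timeout=timeout,
    )
    # raise_for_status turns 4xx/5xx answers into an exception.
    response.raise_for_status()
    return response
```

With requests installed, `log_in(requests.Session(), "YOUR_USER", "YOUR_PASS")` would send the request; reusing the same session afterwards keeps any cookies the server sets.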
Before downvoting/marking as duplicate, please note:
I have already tried this, this, this, this, this, and this (basically almost all the methods I could find in the Requests documentation) but could not find a solution.
Problem:
I want to make a POST request with a set of headers and form data.
There are no files to be uploaded. As per the request body in Postman, we set the parameters by selecting 'form-data' under the 'Body' section for the request.
Here is the code I have:
headers = {'authorization': token_string,
           'content-type': 'multipart/form-data; boundary=----WebKitFormBoundaryxxxxxXXXXX12345'}  # I get 'unsupported application/x-www-form-url-encoded' error if I remove this line
body = {
    'foo1': 'bar1',
    'foo2': 'bar2',
    # ... and other form data, NO FILE UPLOADED
}
# I have also tried the below approach
payload = dict()
payload['foo1'] = 'bar1'
payload['foo2'] = 'bar2'
page = ''
page = requests.post(url, proxies=proxies, headers=headers,
                     json=body, files=json.dump(body))  # also tried data=body, data=payload, files={} when giving data values
Error
{"errorCode":404,"message":"Required String parameter 'foo1' is not present"}
EDIT:
Adding a trace of the network console. I am defining it in the same way in the payload as mentioned on the request payload.
Isn't there any GUI at all? If there is, you could get the network data from Chrome.
Try this:
headers = {'authorization': token_string}
Perhaps more authorization headers are needed, or something else.
You shouldn't set Content-Type yourself, as requests will handle it for you.
Important: the content type shows a WebKitFormBoundary, so for the payload you must use the keys given by the "name" attribute of each form part.
Example:
(I know you won't upload any file; it's just an example.)
So in this case, my payload would look like this: payload = {'photo': 'myphoto'} (there would really be an open file and so on, but I am trying to keep it simple).
So your payload would be this (always use the names from the WebKit form parts):
payload = {'foo1': 'foo1data',
           'foo2': 'foo2data'}
session.post(url, data=payload, headers=headers, proxies=proxies)
Important! I can see you are calling the module-level functions of the requests library. It is usually better to create a session first, like this:
session = requests.session()
It will persist cookies and default headers and reuse connections, instead of starting from scratch with every plain requests.get/post call.
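If the server really insists on multipart/form-data even though no file is uploaded, one approach that may help is to pass the fields through the files argument with a filename of None; requests then encodes them as plain multipart fields and generates the Content-Type header, including the boundary, on its own. The sketch below assumes that this is what the server wants. The helper names are hypothetical, and session is assumed to be a requests.Session.

```python
def to_multipart_fields(form):
    """Map {'name': 'value'} to the {'name': (None, 'value')} shape."""
    # A filename of None makes requests emit an ordinary form field
    # rather than a file part.
    return {name: (None, str(value)) for name, value in form.items()}


def post_form_data(session, url, form, headers=None, proxies=None):
    """POST the form as multipart/form-data without uploading any file."""
    # Do not set Content-Type in headers; requests adds it with the boundary.
    return session.post(
        url,
        files=to_multipart_fields(form),
        headers=headers,
        proxies=proxies,
    )
```

A call could look like `post_form_data(session, url, {'foo1': 'bar1', 'foo2': 'bar2'}, headers={'authorization': token_string})`. Plain data=payload sends application/x-www-form-urlencoded instead, which matches the "unsupported" error mentioned in the question.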