I'm trying to create a script to download a test Excel (.xlsx) file. The script does output a test.xlsx file, but it contains the following error message instead of the actual file contents:
Bad Request
Your browser sent a request that this server could not understand.
Additionally, a 400 Bad Request error was encountered while trying to use an ErrorDocument to handle the request.
I did manage to download the test file using urllib; however, I would like to use requests instead, as I require authentication for my actual use case. Here's the code I used; please let me know what's wrong. Thanks!
import requests
dls = "http://www.excel-easy.com/examples/excel-files/fibonacci-sequence.xlsx"
resp = requests.get(dls)
output = open('test.xlsx', 'wb')
output.write(resp.content)
output.close()
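I cannot verify this against that particular server, but a common cause of a 400 Bad Request like this is the server rejecting the default client headers; the sketch below sends a browser-like User-Agent and shows where basic auth would go for the authenticated use case (the User-Agent string and credentials are placeholders):
import requests

dls = "http://www.excel-easy.com/examples/excel-files/fibonacci-sequence.xlsx"

# some servers reject requests that lack a browser-like User-Agent header
headers = {"User-Agent": "Mozilla/5.0"}

# add auth=('user', 'pass') for the authenticated case; these are placeholders
resp = requests.get(dls, headers=headers)
resp.raise_for_status()  # fail loudly instead of writing an error page to disk

with open('test.xlsx', 'wb') as output:
    output.write(resp.content)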
Related
I have a .html file downloaded and want to send a request to this file to grab its content.
However, if I do the following:
import requests
html_file = "/user/some_html.html"
r = requests.get(html_file)
it gives the following error:
Invalid URL 'some_html.html': No schema supplied.
If I add a schema I get the following error:
HTTPConnectionPool(host='some_html.html', port=80): Max retries exceeded with url:
I want to know how to specifically send a request to a html file when it's downloaded.
You are accessing an HTML file from a local directory. The get() method uses an HTTP connection (port 80) to fetch data from a website, not from a local directory. To access a local file through get(), you would need to serve it with something like XAMPP or WAMP.
To read a file from a local directory you can use open(); requests.get() is for fetching resources over an HTTP connection, in simple words from the internet, not from a local directory:
import requests  # not needed for reading a local file; kept from the question

html_file = "/user/some_html.html"
with open(html_file, "r") as t:
    for v in t.readlines():
        print(v)
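If you do want to fetch the file over HTTP instead of opening it, one alternative to XAMPP/WAMP (a sketch using Python's built-in http.server; the port and directory are just examples) is to serve the directory locally and then request it:
# in a terminal, serve the directory that contains the file:
#   python -m http.server 8000 --directory /user
import requests

r = requests.get("http://127.0.0.1:8000/some_html.html")
print(r.text)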
You don't "send a request to a html file". Instead, you can send a request to a HTTP server on the internet which will return a response with the contents of a html file.
The file itself knows nothing about "requests". If you have the file stored locally and want to do something with it, then you can open it just like any other file.
If you are interested in learning more about the request and response model, I suggest you try something like
response = requests.get("http://stackoverflow.com")
You should also read about HTTP and requests and responses to better understand how this works.
You can do it by setting up a local server to your html file.
If you use Visual Studio Code, you can install Live Server by Ritwick Dey.
Then you do as follows:
1 - Make the first request and save the html content into a .html file:
my_req.py
import requests

file_path = './'
file_name = 'my_file'
url = "https://www.qwant.com/"

response = requests.request("GET", url)

with open(file_path + file_name + '.html', 'w') as w:
    w.write(response.text)
2 - With Live Server installed on Visual Studio Code, click on my_file.html and then click on Go Live.
3 - Now you can make a request to your local http schema:
second request
import requests
url = "http://127.0.0.1:5500/my_file.html"
response = requests.request("GET", url)
print(response.text)
And, ta-da! Do what you need to do.
On a crawler job, I had a situation where the content displayed on the website differed from the content retrieved with response.text, so the XPaths from the website did not match. I needed to download the content to a local HTML file and work out new XPaths from it to get the information I needed.
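As an aside, if the goal is only to evaluate XPaths against the saved copy, you can also parse the local file directly instead of serving it (a sketch assuming lxml is installed; the file name and XPath expression are just examples):
from lxml import html

# parse the HTML file saved by the first request
with open('my_file.html', 'r', encoding='utf-8') as f:
    tree = html.fromstring(f.read())

# example XPath query; replace with the expression you actually need
print(tree.xpath('//title/text()'))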
You can try this:
from requests_html import HTML

with open("htmlfile.html") as htmlfile:
    sourcecode = htmlfile.read()

parsedHtml = HTML(html=sourcecode)
print(parsedHtml)
I tried the following script, but unfortunately the output file is identical to the input file. I'm not sure what's wrong with it.
import requests
url_lines = open('banana1.txt').read().splitlines()

remove_from_urls = []
for url in url_lines:
    remove_url = requests.get(url)
    print(remove_url.status_code)
    if remove_url.status_code == 404:
        remove_from_urls.append(url)
        continue

url_lines = [url for url in url_lines if url not in remove_from_urls]
print(url_lines)

# Save urls example
with open('banana2.txt', 'w+') as file:
    for item in url_lines:
        file.write(item + '\n')
There seems to be no error in your code, but there are a few things that would help make it more readable and consistent. The first course of action should be to make sure there is at least one URL in the input file that actually returns a 404 status code.
Edit: After providing the actual URL.
The 404 problem
In your case, the problem is that Twitter actually does not return a 404 error for your "Not found" URL. You can test it using curl:
$ curl -o /dev/null -w "%{http_code}" "https://twitter.com/davemeltzerWON/status/1321279214365016064"
200
Or using Python:
import requests
response = requests.get("https://twitter.com/davemeltzerWON/status/1321279214365016064")
print(response.status_code)
The output for both should be 200.
Since Twitter is a JavaScript application that loads its content after it has been processed in browser, you cannot find the information you are looking for in the HTML response. You would need to use something like Selenium to actually process the JavaScript for you and then you would be able to look for actual text like "not found" on the web page.
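A rough Selenium sketch of that idea (assuming Chrome and a matching chromedriver are installed; the fixed sleep and the "not found" text check are crude placeholders you would need to adapt):
import time
from selenium import webdriver

url = "https://twitter.com/davemeltzerWON/status/1321279214365016064"

driver = webdriver.Chrome()  # assumes chromedriver is on PATH
try:
    driver.get(url)
    time.sleep(5)  # crude wait for the JavaScript to render the page
    rendered = driver.page_source.lower()
    # heuristic check; the exact wording Twitter uses may differ, adjust as needed
    print("looks removed" if "this page doesn" in rendered else "looks alive")
finally:
    driver.quit()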
Code review
Please make sure to close the file properly. Also, a file object is an iterator over lines, so you can convert it to a list (or set) very easily. Another trick to make the code more readable is to make use of a Python set. So you may read the file like this:
with open("banana1.txt") as fid:
url_lines = set(fid)
Then you simply remove all the links that do not work:
not_working = set()
for url in url_lines:
    if requests.get(url).status_code == 404:
        not_working.add(url)

working = url_lines - not_working

with open("banana2.txt", "w") as fid:
    fid.write("\n".join(working))
Also, if some of the links point to the same server, you should make use of the requests.Session class:
from requests import Session
session = Session()
Then replace requests.get with session.get; you should get a performance boost, since the Session uses keep-alive connections among other things.
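Putting it together, the 404 check from above with a shared session would look like this:
from requests import Session

session = Session()  # reuses the underlying connections (keep-alive)

not_working = set()
for url in url_lines:
    if session.get(url).status_code == 404:
        not_working.add(url)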
I am communicating with an API using http.client in Python 3.6.2.
In order to upload a file, it requires a three-stage process.
I have managed to talk successfully using POST methods and the server returns data as I expect.
However, the stage that requires the actual file to be uploaded is a PUT method, and I cannot figure out the syntax for including a reference to the actual file on my storage; the file is an mp4 video file.
Here is a snippet of the code with my noob annotations :)
#define connection as HTTPS and define URL
uploadstep2 = http.client.HTTPSConnection("grabyo-prod.s3-accelerate.amazonaws.com")
#define headers
headers = {
    'accept': "application/json",
    'content-type': "application/x-www-form-urlencoded"
}
#define the structure of the request and send it.
#Here it is a PUT request to the unique URL as defined above with the correct file and headers.
uploadstep2.request("PUT", myUniqueUploadUrl, body="C:\Test.mp4", headers=headers)
#get the response from the server
uploadstep2response = uploadstep2.getresponse()
#read the data from the response and put to a usable variable
step2responsedata = uploadstep2response.read()
The response I am getting back at this stage is an
"Error 400 Bad Request - Could not obtain the file information."
I am certain this relates to the body="C:\Test.mp4" section of the code.
Can you please advise how I can correctly reference a file within the PUT method?
Thanks in advance
uploadstep2.request("PUT", myUniqueUploadUrl, body="C:\Test.mp4", headers=headers)
will put the actual string "C:\Test.mp4" in the body of your request, not the content of the file named "C:\Test.mp4" as you expect.
You need to open the file and read its content, then pass that as the body. Or stream it, but AFAIK http.client does not support that, and since your file seems to be a video, it is potentially huge and reading it all at once will use plenty of RAM for no good reason.
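For the read-it-all option with http.client, a rough sketch might look like the following; myUniqueUploadUrl and the local path are taken from the question, and the content-type header is only an assumption:
import http.client

conn = http.client.HTTPSConnection("grabyo-prod.s3-accelerate.amazonaws.com")

# read the file's bytes instead of passing the path string
with open(r"C:\Test.mp4", "rb") as f:
    file_data = f.read()  # loads the whole video into memory

# content type is a guess; the upload endpoint may expect something else
headers = {"content-type": "application/octet-stream"}

# myUniqueUploadUrl is the unique upload path from the question
conn.request("PUT", myUniqueUploadUrl, body=file_data, headers=headers)
response = conn.getresponse()
print(response.status, response.read())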
My suggestion would be to use requests, which is a much better lib for this kind of thing:
import requests
with open(r'C:\Test.mp4', 'rb') as finput:
response = requests.put('https://grabyo-prod.s3-accelerate.amazonaws.com/youruploadpath', data=finput)
print(response.json())
I do not know if it is useful for you, but you can try to send a POST request with the requests module:
import requests
url = ""
data = {'title':'metadata','timeDuration':120}
with open('/path/your_file.mp3', 'rb') as mp3_f:
    files = {'messageFile': mp3_f}
    # send the metadata as form fields via data=; passing json= together with
    # files= would silently drop the JSON body
    req = requests.post(url, files=files, data=data)

print(req.status_code)
print(req.content)
Hope it helps.
I use the NASA GSFC server to retrieve data from their archives.
I send a request and receive the response as simple text.
I discovered that they amended their page so that login is now required.
However, even after logging in I'm receiving an error.
I read the information provided in the thread "how do python capture 302 redirect url"
and tried to use the urllib2 and requests libraries, but I am still receiving an error.
Currently, the part of my code responsible for downloading data looks as follows:
def getSampleData():
    import urllib
    # I approved the application according to:
    # http://disc.sci.gsfc.nasa.gov/registration/authorizing-gesdisc-data-access-in-earthdata_login
    # Query: http://hydro1.sci.gsfc.nasa.gov/dods/_expr_{GLDAS_NOAH025SUBP_3H}{ave(rainf,time=00Z23Oct2016,time=00Z24Oct2016)}{17.00:25.25,48.75:54.50,1:1,00Z23Oct2016:00Z23Oct2016}.ascii?result
    sample_query = 'http://hydro1.sci.gsfc.nasa.gov/dods/_expr_%7BGLDAS_NOAH025SUBP_3H%7D%7Bave(rainf,time=00Z23Oct2016,time=00Z24Oct2016)%7D%7B17.00:25.25,48.75:54.50,1:1,00Z23Oct2016:00Z23Oct2016%7D.ascii?result'
    # I've also tried:
    # sock = urllib.urlopen(sample_query, urllib.urlencode({'username': 'MyUserName', 'password': 'MyPassword'}))
    # but I was still asked to provide credentials, so I simplified the line above to just:
    sock = urllib.urlopen(sample_query)
    print('\n\nCurrent url:\n')
    print(sock.geturl())
    print('\nIs it the same as the sample query?')
    print(sock.geturl() == sample_query)
    returnedData = sock.read()
    # returnedData always stores a simple page with 302. Why?
    # StackOverflow suggests that urllib and urllib2 handle redirection automatically.
    sock.close()
    with open("Output.html", "w") as text_file:
        text_file.write(returnedData)
Output.html content is as follows:
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>302 Found</title>
</head><body>
<h1>Found</h1>
<p>The document has moved here.</p>
</body></html>
If I copy-paste sample_query (the sample_query from the function defined above) into a browser, I have no problem receiving the data.
Thus, if there's no hope for a solution, I'm thinking about rewriting my code to use Selenium.
It seems that I figured out how to download the data:
How to authenticate on NASA gsfc server
However, I don't know how to process the dataset.
I would like to display (or write to a text file) the output as raw data (in exactly the same way as I see it in the browser).
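For what it's worth, a sketch with requests (the credentials are placeholders and the authentication details follow the linked answer; I have not verified this against the server): the raw ASCII payload is just response.text, which can be printed or written out unchanged.
import requests

# authenticate as described in the linked answer; a Session keeps cookies
# across the Earthdata redirects
session = requests.Session()
session.auth = ('MyUserName', 'MyPassword')  # placeholder credentials

response = session.get(sample_query)  # sample_query as defined above
response.raise_for_status()

raw_data = response.text  # the same plain text the browser shows
print(raw_data)

with open("Output.txt", "w") as text_file:
    text_file.write(raw_data)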
I am testing my webpage software by sending requests to it from Python. I am able to send requests, receive responses and parse the JSON. However, one option on the webpage is to download files. I send the download request and can confirm that the response headers contain what I expect (application/octet-stream and the appropriate filename), but the Content-Length is 0. If the length is 0, I assume the file was not actually sent. I am able to download files by other means, so I know my software works, but I am having trouble getting it to work with Python.
I build up the request then do:
f = urllib.request.urlopen(request)
f.body = f.read()
I expect data to be in f.body but it is empty (I see "b''")
Is there a different way to access the file contents from an attachment in python?
This is with python-requests instead of urllib, since I'm more familiar with that.
import requests
url = "http://example.com/foobar.jpg"
#make request
r = requests.get(url)
attachment_data = r.content
#save to file
with open(r"C:/pictures/foobar.jpg", 'wb') as f:
f.write(attachment_data)
Turns out I needed to throw some data into the file in order to have something in the body. I should've noticed this much sooner.