How to send requests to a downloaded html file - python

I have downloaded a .html file and want to send a request to this file to grab its content.
However, if I do the following:
import requests
html_file = "/user/some_html.html"
r = requests.get(html_file)
it gives the following error:
Invalid URL 'some_html.html': No schema supplied.
If I add a schema I get the following error:
HTTPConnectionPool(host='some_html.html', port=80): Max retries exceeded with url:
I want to know how to send a request specifically to an HTML file that has been downloaded.

You are accessing an HTML file from a local directory. The get() method uses an HTTP connection (port 80 by default) to fetch data from a web server, not from a local directory. If you really want to access a local file through get(), you have to serve it with a local web server such as XAMPP or WAMP.
To access a file from a local directory you can use open(); requests.get() is for fetching resources over an HTTP connection (in simple words, from the internet), not from a local directory.
html_file = "/user/some_html.html"

# open the local file directly instead of making an HTTP request
with open(html_file, "r") as t:
    for v in t:
        print(v)

You don't "send a request to an HTML file". Instead, you send a request to an HTTP server on the internet, which returns a response containing the contents of an HTML file.
The file itself knows nothing about "requests". If you have the file stored locally and want to do something with it, then you can open it just like any other file.
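For example, a minimal local read (the path here is just an assumption):
with open("/user/some_html.html") as f:
    html_content = f.read()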
If you are interested in learning more about the request and response model, I suggest you try something like
response = requests.get("http://stackoverflow.com")
You should also read about HTTP and requests and responses to better understand how this works.

You can do it by setting up a local server for your HTML file.
If you use Visual Studio Code, you can install Live Server by Ritwick Dey.
Then you do as follows:
1 - Make the first request and save the HTML content into a .html file:
my_req.py
import requests

file_path = './'
file_name = 'my_file'
url = "https://www.qwant.com/"

response = requests.request("GET", url)

# write the response body to a local .html file and close it properly
with open(file_path + file_name + '.html', 'w') as w:
    w.write(response.text)
2 - With Live Server installed on Visual Studio Code, click on my_file.html and then click on Go Live.
3 - Now you can make a request to your local HTTP server:
second request
import requests
url = "http://127.0.0.1:5500/my_file.html"
response = requests.request("GET", url)
print(response.text)
And, ta-da! Do what you need to do.
On a crawling job, I ran into a situation where the content displayed on the website differed from the content retrieved with response.text, so the XPaths were not the same as on the live site. I needed to download the content to a local HTML file and work out new XPaths against it to get the info I needed.
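As an illustration of that workflow, a minimal sketch with lxml (the file name and the XPath are assumptions):
from lxml import html

# parse the locally saved copy and query it with fresh XPaths
with open('my_file.html') as f:
    tree = html.fromstring(f.read())

print(tree.xpath('//title/text()'))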

You can try this:
from requests_html import HTML

with open("htmlfile.html") as htmlfile:
    sourcecode = htmlfile.read()

parsedHtml = HTML(html=sourcecode)
print(parsedHtml)
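Once parsed, you can query the document, for example with a CSS selector (the selector here is just an assumption):
print(parsedHtml.find('title', first=True).text)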

Related

How to make Python go through URLs in a text file, check their status codes, and exclude all ones with 404 error?

I tried the following script, but unfortunately the output file is identical to the input file. I'm not sure what's wrong with it.
import requests

url_lines = open('banana1.txt').read().splitlines()
remove_from_urls = []

for url in url_lines:
    remove_url = requests.get(url)
    print(remove_url.status_code)
    if remove_url.status_code == 404:
        remove_from_urls.append(url)
        continue

url_lines = [url for url in url_lines if url not in remove_from_urls]
print(url_lines)

# Save urls example
with open('banana2.txt', 'w+') as file:
    for item in url_lines:
        file.write(item + '\n')
There seems to be no error in your code, but there are a few things that would help make it more readable and consistent. The first course of action should be to make sure there is at least one URL that actually returns a 404 status code.
Edit: after the actual URL was provided.
The 404 problem
In your case, the problem is that Twitter does not actually return a 404 error for your "Not found" URL. You can test it using curl:
$ curl -o /dev/null -w "%{http_code}" "https://twitter.com/davemeltzerWON/status/1321279214365016064"
200
Or using Python:
import requests
response = requests.get("https://twitter.com/davemeltzerWON/status/1321279214365016064")
print(response.status_code)
The output for both should be 200.
Since Twitter is a JavaScript application that loads its content after the page has been processed in the browser, you cannot find the information you are looking for in the HTML response. You would need to use something like Selenium to actually process the JavaScript for you; then you would be able to look for the actual text, like "not found", on the web page.
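A minimal sketch of that approach, assuming Firefox and geckodriver are available:
from selenium import webdriver

driver = webdriver.Firefox()
driver.get("https://twitter.com/davemeltzerWON/status/1321279214365016064")

# page_source reflects the DOM after JavaScript has run
print("not found" in driver.page_source.lower())

driver.quit()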
Code review
Please make sure to close the file properly. Also, a file object is an iterator over its lines, so you can convert it to a list or set very easily. Another trick to make the code more readable is to make use of Python sets. So you may read the file like this:
with open("banana1.txt") as fid:
url_lines = set(fid)
Then you simply remove all the links that do not work:
not_working = set()
for url in url_lines:
    if requests.get(url).status_code == 404:
        not_working.add(url)

working = url_lines - not_working
with open("banana2.txt", "w") as fid:
    fid.write("\n".join(working))
Also, if some of the links point to the same server, you should make use of the requests.Session class:
from requests import Session
session = Session()
Then replace requests.get with session.get; you should get a performance boost, since a Session reuses keep-alive connections, among other things.
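For example, the 404 loop above becomes:
not_working = set()
for url in url_lines:
    # the keep-alive connection is reused across these requests
    if session.get(url).status_code == 404:
        not_working.add(url)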

Uploading an uploaded file without saving with python requests

I have a web server that acts as a proxy between users and a file server. Users upload their files to my web server, and I upload them to the file server. I want to be able to do this without saving the uploaded file to a temporary location, but every time I get an "unexpected end of file" error from the file server. This is my code (I'm using Django REST framework for my APIs):
headers = {"content-type":"multipart/form; boundary={}".format(uuid.uuid4().hex)}
files = []
for f in request.FILES.getlist('file'):
files.append((f.name, open(f.file.name,'rb'), f.content_type))
files_dict = {'file': files}
r = requests.post(url, files=files, headers=headers)
You are misusing the content-type header in your request. There is no need to manually set multipart/form and a boundary if you're using the files argument of requests; the boundary you declare never matches the one requests generates for the body, which is why you're getting the "unexpected end of file" error. Try sending your request without that header:
r = requests.post(url, files=files)
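For reference, a minimal sketch of forwarding the uploads without touching disk; the field name 'file' and url are assumptions, and Django's UploadedFile objects are file-like, so they can be passed straight through:
import requests

files = [
    # (field_name, (filename, file_object, content_type))
    ('file', (f.name, f, f.content_type))
    for f in request.FILES.getlist('file')
]
r = requests.post(url, files=files)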

How to upload a binary/video file using Python http.client PUT method?

I am communicating with an API using http.client in Python 3.6.2.
In order to upload a file it requires a three-stage process.
I have managed to talk successfully using POST methods, and the server returns data as I expect.
However, the stage that requires the actual file to be uploaded is a PUT method, and I cannot figure out how to write the code so that it references the actual file on my storage; the file is an mp4 video file.
Here is a snippet of the code with my noob annotations :)
import http.client

# define connection as HTTPS and define URL
uploadstep2 = http.client.HTTPSConnection("grabyo-prod.s3-accelerate.amazonaws.com")

# define headers
headers = {
    'accept': "application/json",
    'content-type': "application/x-www-form-urlencoded"
}

# define the structure of the request and send it.
# Here it is a PUT request to the unique URL as defined above with the correct file and headers.
uploadstep2.request("PUT", myUniqueUploadUrl, body="C:\Test.mp4", headers=headers)

# get the response from the server
uploadstep2response = uploadstep2.getresponse()

# read the data from the response and put it into a usable variable
step2responsedata = uploadstep2response.read()
The response I am getting back at this stage is an
"Error 400 Bad Request - Could not obtain the file information."
I am certain this relates to the body="C:\Test.mp4" section of the code.
Can you please advise how I can correctly reference a file within the PUT method?
Thanks in advance
uploadstep2.request("PUT", myUniqueUploadUrl, body="C:\Test.mp4", headers=headers)
will put the actual string "C:\Test.mp4" in the body of your request, not the content of the file named "C:\Test.mp4" as you expect.
You need to open the file and read its content, then pass that as the body. Or stream it, but AFAIK http.client does not support that, and since your file seems to be a video, it is potentially huge and would use plenty of RAM for no good reason.
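If you want to stay with http.client, a minimal sketch of that fix, reusing myUniqueUploadUrl and headers from your snippet:
with open(r'C:\Test.mp4', 'rb') as f:
    body = f.read()  # the whole file ends up in RAM, as noted above

uploadstep2.request("PUT", myUniqueUploadUrl, body=body, headers=headers)
uploadstep2response = uploadstep2.getresponse()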
My suggestion would be to use requests, which is a much better lib for this kind of thing:
import requests
with open(r'C:\Test.mp4', 'rb') as finput:
    response = requests.put('https://grabyo-prod.s3-accelerate.amazonaws.com/youruploadpath', data=finput)

print(response.json())
I do not know if it is useful for you, but you can try to send a POST request with the requests module:
import requests

url = ""
data = {'title': 'metadata', 'timeDuration': 120}

# open the file in binary mode; requests builds the multipart body itself
with open('/path/your_file.mp3', 'rb') as mp3_f:
    files = {'messageFile': mp3_f}
    # note: pass the metadata via data=, not json=; requests silently
    # ignores the json argument when files is supplied
    req = requests.post(url, files=files, data=data)

print(req.status_code)
print(req.content)
Hope it helps.

ServiceNow - How to use SOAP to download reports

I need to automate downloading reports from ServiceNow.
I've been able to automate it using Python, Selenium, and win32com with the following method:
https://test.service-now.com/sys_report_template.do?CSV&jvar_report_id=92a....7aa
using Selenium to access ServiceNow and to modify Firefox's default download option so the file is dumped to a folder on the Windows machine.
However, since all of this may be ported to a Linux server, we would like to port it to SOAP or cURL.
I came across ServiceNow libraries for Python here.
I tried it out, and the following code works if I set the login, password, and instance name as listed at the site, using the following from ServiceNow.py:
class Change(Base):
    __table__ = 'change_request.do'
and the following within the client-side script as listed on the site:
# Fetch changes updated in the last 5 minutes
changes = chg.last_updated(minutes=5)

# print the changes
for eachline in changes:
    print(eachline)
However, when I replace the URL with sys_report_template.do, I get the following error:
Traceback (most recent call last):
File "C:\Python27\lib\site-packages\SOAPpy\Parser.py", line 1080, in _parseSOAP
parser.parse(inpsrc)
File "C:\Python27\Lib\xml\sax\expatreader.py", line 107, in parse
xmlreader.IncrementalParser.parse(self, source)
File "C:\Python27\Lib\xml\sax\xmlreader.py", line 125, in parse
self.close()
File "C:\Python27\Lib\xml\sax\expatreader.py", line 220, in close
self.feed("", isFinal = 1)
File "C:\Python27\Lib\xml\sax\expatreader.py", line 214, in feed
self._err_handler.fatalError(exc)
File "C:\Python27\Lib\xml\sax\handler.py", line 38, in fatalError
raise exception
SAXParseException: <unknown>:1:0: no element found
Here is the relevant code:
from servicenow import ServiceNow
from servicenow import Connection
from servicenow.drivers import SOAP

# For SOAP connection
conn = SOAP.Auth(username='abc', password='def', instance='test')
rpt = ServiceNow.Base(conn)
rpt.__table__ = "sys_report_template.do?CSV"

# jvar_report_id replaced with .... to protect confidentiality
report = rpt.fetch_one({'jvar_report_id': '92a6760a......aas'})
for eachline in report:
    print(eachline)
So, my question is: what can be done to make this work?
I looked on the web for resources and help but didn't find any.
Any help is appreciated.
After much research, I was able to use the following method to get a report in CSV format from ServiceNow. I thought I'd post it here in case anyone else runs into a similar issue.
import requests

# Set the request parameters
url = 'https://myinstance.service-now.com/sys_report_template.do?CSV&jvar_report_id=929xxxxxxxxxxxxxxxxxxxx0c755'
user = 'my_username'
pwd = 'my_password'

# Set proper headers
headers = {"Accept": "application/json"}

# Do the HTTP request
response = requests.get(url, auth=(user, pwd), headers=headers)
response.raise_for_status()
print(response.text)
response.text now holds the report in CSV format.
Next, I need to figure out how to parse the response to extract the CSV data in the correct format.
Once done, I will post it here, but for now this answers my question.
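For that parsing step, a minimal sketch with the standard csv module (untested against an actual report):
import csv

reader = csv.reader(response.text.splitlines())
for row in reader:
    print(row)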
I tried this and it's working as expected.
import requests

url = 'https://myinstance.service-now.com/sys_report_template.do?CSV&jvar_report_id=929xxxxxxxxxxxxxxxxxxxx0c755'
user = 'my_username'
pwd = 'my_password'

response = requests.get(url, auth=(user, pwd))

# write the CSV report straight to a file
file_name = "abc.csv"
with open(file_name, 'wb') as out_file:
    out_file.write(response.content)

del response

Receive attachment with urllib - Python

I am testing my web page software by sending requests to it from Python. I can send requests, receive responses, and parse the JSON. However, one option on the web page is to download files. I send the download request and can confirm that the response headers contain what I expect (application/octet-stream and the appropriate filename), but the Content-Length is 0. If the length is 0, I assume the file was not actually sent. I am able to download files by other means, so I know my software works, but I am having trouble getting it to work with Python.
I build up the request then do:
f = urllib.request.urlopen(request)
f.body = f.read()
I expect the data to be in f.body, but it is empty (I see b'').
Is there a different way to access the file contents from an attachment in python?
Is there a different way to access the file contents from an attachment in python?
This is in python-requests instead of urllib, since I'm more familiar with that.
import requests

url = "http://example.com/foobar.jpg"

# make request
r = requests.get(url)
attachment_data = r.content

# save to file
with open(r"C:/pictures/foobar.jpg", 'wb') as f:
    f.write(attachment_data)
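For large attachments, a streamed variant avoids holding the whole body in memory (a sketch, using the same url and path):
r = requests.get(url, stream=True)
with open(r"C:/pictures/foobar.jpg", 'wb') as f:
    for chunk in r.iter_content(chunk_size=8192):
        f.write(chunk)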
Turns out I needed to throw some data into the file in order to have something in the body. I should've noticed this much sooner.
