Reading Local HTML File in Python - python

I've been reviewing examples of how to read in HTML from websites using XPath and lxml. For some reason, when I try with a local file, I keep running into this error:
AttributeError: 'str' object has no attribute 'content'
This is the code
with open(r'H:\Python\Project\File','r') as f:
    file = f.read()
    f.close()
tree = html.fromstring(file.content)

You have a few problems with your code. It looks like you are adapting code that parses HTML from an HTTP/HTTPS request. In that case, .content gives you the bytes of the response object.
However, when reading from a file, you already have the contents once you call f.read() inside the with block, and the string it returns has no .content attribute. You also don't need to call .close(); the context manager takes care of that for you.
Try this:
from lxml import html

with open(r'H:\Python\Project\File', 'r') as f:
    tree = html.fromstring(f.read())
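The AttributeError comes down to the type that f.read() returns. Here is a minimal sketch of that, using a throwaway temporary file; the file name and its contents are made up for illustration and lxml is not needed to see the effect:

```python
import os
import tempfile

# Create a small HTML file to stand in for the local file from the question.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "page.html")
    with open(path, "w", encoding="utf-8") as f:
        f.write("<html><body><p>hello</p></body></html>")

    # f.read() on a text-mode file returns a plain str.
    with open(path, "r", encoding="utf-8") as f:
        text = f.read()

# A str has no .content attribute; that attribute belongs to response
# objects such as the one returned by requests.get().
print(type(text).__name__)
print(hasattr(text, "content"))
```

So the string can be handed to the parser directly, with no .content in between.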

Try opening the file with encoding='utf-8':
f1 = open(new_file + '.html', 'r', encoding="utf-8")
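Passing the encoding explicitly matters because, without it, open() generally falls back to a platform-dependent default. A small sketch of the difference, assuming a made-up file name and sample text:

```python
import os
import tempfile

sample = "<p>caf\u00e9</p>"  # illustrative content with one non-ASCII character

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.html")
    with open(path, "w", encoding="utf-8") as f:
        f.write(sample)

    # Reading with the same encoding round-trips the text exactly.
    with open(path, "r", encoding="utf-8") as f:
        as_utf8 = f.read()

    # Reading with a different single-byte encoding does not raise,
    # but silently produces the wrong characters.
    with open(path, "r", encoding="latin-1") as f:
        as_latin1 = f.read()

print(as_utf8 == sample, as_latin1 == sample)
```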

Related

Passing Binary file over HTTP POST

I have a local python file that decodes binary files. This python file first reads from the file, opens it as binary and then saves it in a buffer and interprets it. Reading it is simply:
with open(filepath, 'rb') as f:
    buff = f.read()
read_all(buff)
This works fine locally. Now I'd like to set up an Azure Python job where I can send the file, approx. 100 KB, over an HTTP POST and then read the interpreted metadata, which my original Python script does well.
I've first removed the read function so that I'll now work with the buffer only.
In my Azure Python job I have the following, triggered by an HttpRequest:
my_data = reader.read_file(req.get_body())
To test my sending I've tried the following in python
import requests
url = 'http://localhost:7071/api/HttpTrigger'
files = {'file': open('test', 'rb')}
with open('test', 'rb') as f:
    buff = f.read()
r = requests.post(url, files=files)  # Try using files
r = requests.post(url, data=buff)  # Try using data
I've also tried in Postman adding the file to the body as a binary and setting the headers to application/octet-stream
All this doesn't send the binary file the same way as the original f.read() did. So I'm getting a wrong interpretation of the binary file.
What is file.read doing differently to how I'm sending it over as a HTTP Body message?
Printing out the first line from the local python read file gives.
b'\n\n\xfe\xfe\x00\x00\x00\x00\\\x18,A\x18\x00\x00\x00(\x00\x00\x00\x1f\x00\x00\
Whereas printing it out at the req.get_body() shows me
b'\n\n\xef\xbf\xbd\xef\xbf\xbd\x00\x00\x00\x00\\\x18,A\x18\x00\x00\x00(\x00\x00\x00\x1f\x00\
So something is clearly wrong. Any help why this could be different?
Thanks
EDIT:
I've implemented a similar function in Flask and it works well.
The code in flask is simply grabbing the file from a POST. No encoding/decoding.
if request.method == 'POST':
    f = request.files['file']
    #f.save(secure_filename(f.filename))
    my_data = reader.read_file(f.read())
Why is the Azure Function different?
You can try decoding the bytes as UTF-16 and then carry on with the rest of your processing.
Here is the code for that:
with open(path_to_file, 'rb') as f:
    contents = f.read()
contents = contents.rstrip(b"\n").decode("utf-16")
Basically, after calling req.get_body(), perform the same operation on the result:
contents = contents.rstrip(b"\n").decode("utf-16")
See if it gives you the same output as you receive from the local Python file.
Hope it helps.
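It may also help to look at what actually changed between the two byte dumps in the question: each \xfe became \xef\xbf\xbd, which is the UTF-8 encoding of U+FFFD, the Unicode replacement character. The sketch below reproduces that substitution locally. It only shows one way such bytes can arise and makes no claim about where in the Azure pipeline it happens.

```python
# First bytes of the local file, as printed in the question.
original = b"\n\n\xfe\xfe\x00\x00"

# Decoding as UTF-8 with errors="replace" turns each invalid byte into U+FFFD,
# and re-encoding writes U+FFFD as the three bytes EF BF BD.
mangled = original.decode("utf-8", errors="replace").encode("utf-8")

print(mangled)
```

If that is what is going on, the body was treated as text somewhere between the client and req.get_body(). Since every invalid byte collapses to the same replacement character, the original bytes cannot be recovered from the mangled ones by decoding them differently.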

Open an XML file through URL and save it

With Python 3, I want to read an XML web page and save it to my local drive.
Also, if the file already exists, it must be overwritten.
I tested a script like this:
import urllib.request
xml = urllib.request.urlopen('URL')
data = xml.read()
file = open("file.xml","wb")
file.writelines(data)
file.close()
But I get an error:
TypeError: a bytes-like object is required, not 'int'
First suggestion: do what even the official urllib docs say and use requests instead of urllib.
Your problem is that you call .writelines(), which expects an iterable of lines, not a single bytes object. Iterating over a bytes object yields integers, which is where the 'int' in the message comes from (for once in Python, the error message is not very helpful). Use .write() instead:
import requests
resp = requests.get('URL')
with open('file.xml', 'wb') as foutput:
    foutput.write(resp.content)
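The difference between the two methods can be checked without any network access. In this sketch the bytes literal stands in for whatever xml.read() returned, and the file lives in a temporary directory:

```python
import os
import tempfile

data = b"<root/>"  # stand-in for the downloaded bytes

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "file.xml")

    # writelines() iterates over its argument; iterating bytes yields ints,
    # so a binary file rejects the items one by one.
    error = None
    try:
        with open(path, "wb") as f:
            f.writelines(data)
    except TypeError as exc:
        error = exc

    # write() takes the bytes object as a whole.
    with open(path, "wb") as f:
        f.write(data)
    with open(path, "rb") as f:
        saved = f.read()

print(type(error).__name__, saved)
```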
I found a solution:
from urllib.request import urlopen
# "w" creates the file if it is missing and overwrites it otherwise
xml = open("import.xml", "w", encoding="utf-8")
xml.write(urlopen('URL').read().decode('utf-8'))
xml.close()
Thanks for your help.
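If you would rather stay with the standard library and skip the decode step, a sketch along these lines should also meet the overwrite requirement. Note that save_url is a name made up here, not something from the posts above:

```python
import shutil
import urllib.request


def save_url(url, dest):
    """Copy the raw bytes at url into dest, replacing any existing file."""
    # "wb" creates the file or truncates an existing one, and copying bytes
    # avoids any decode/encode round trip.
    with urllib.request.urlopen(url) as resp, open(dest, "wb") as out:
        shutil.copyfileobj(resp, out)
```

Calling save_url('URL', 'file.xml') a second time simply replaces the previous copy.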

Can anyone tell me what the error message "line 1182 in parse" means when I'm trying to parse an XML file in Python?

This is the code that results in an error message:
import urllib
import xml.etree.ElementTree as ET
url = raw_input('Enter URL:')
urlhandle = urllib.urlopen(url)
data = urlhandle.read()
tree = ET.parse(data)
The error:
I'm new to python. I did read documentation and a couple of tutorials, but clearly I still have done something wrong. I don't believe it is the xml file itself because it does this to two different xml files.
Consider using ElementTree's fromstring():
import urllib
import xml.etree.ElementTree as ET
url = raw_input('Enter URL:')
# http://feeds.bbci.co.uk/news/rss.xml?edition=int
urlhandle = urllib.urlopen(url)
data = urlhandle.read()
tree = ET.fromstring(data)
print ET.tostring(tree, encoding='utf8', method='xml')
data holds the XML content as a string, but the parse() function expects a filename or file object as its argument. That's why there is an error.
urlhandle is a file object, so tree = ET.parse(urlhandle) should work for you.
The error message indicates that your code is trying to open a file whose name is stored in the variable source.
It's failing to open that file (IOError) because the variable source contains a bunch of XML, not a file name.
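The two entry points can be compared side by side without fetching anything. The code in this thread is Python 2; the sketch below uses Python 3 syntax, and the XML string is a made-up sample:

```python
import io
import xml.etree.ElementTree as ET

xml_text = "<rss><channel><title>Example</title></channel></rss>"

# fromstring() parses XML held in a string and returns the root Element.
root_from_string = ET.fromstring(xml_text)

# parse() wants a filename or a file-like object and returns an ElementTree,
# from which getroot() gives the root Element.
tree = ET.parse(io.StringIO(xml_text))
root_from_file = tree.getroot()

print(root_from_string.tag, root_from_file.find("channel/title").text)
```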

How to print '<!DOCTYPE html>'?

My client requests a page from a server written in Python 3.
The server returns an HTML page that is presented by the client.
Therefore, I made a dummy.html page, and when the client asks for it, my Python code reads it and returns it to the client:
filename = "dummy.html"
fh = open(filename, 'rt')
line = fh.readline()
while line:
    print(line)
    line = fh.readline()
fh.close()
However, this code does not read the <!DOCTYPE html> that is placed at the top of my dummy.html file (and thus, things like Bootstrap don't work for me...).
I also tried printing it manually with print('<!DOCTYPE html>'), but that does not work either.
print('<!DOCTYPE html>')  # <---- it is printed to stdout, but the page the client receives does not have this line
filename = CURRENTPATH+"\\..\\su.html"
fh = open(filename, 'rt')
line = fh.readline()
print('hello')
print('<'+'!'+'DOCTYPE html>')
while line:
    print(line)
    line = fh.readline()
fh.close()
How can I fix it?
It looks like you're trying to reimplement a web server in Python. Please consider using an existing web framework, such as Django (https://www.djangoproject.com/), Flask (http://flask.pocoo.org/) or Pyramid (http://www.pylonsproject.org/), which will do most of the work for you (including built-in support for a wide variety of HTML templating libraries, and actual performance).
As for your actual question, a bare print statement prints to stdout, as expected. You need, instead, to write to the file-like object whose contents will be sent to the client (is it a socket? a file? who knows? stop reinventing the wheel).
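To make the second point concrete, here is a minimal sketch of copying a file into an output stream instead of printing it. The question does not say what that stream is, so out is a stand-in for whatever binary file-like object reaches the client, and send_file is a made-up name:

```python
import shutil


def send_file(path, out):
    """Copy the file at path, byte for byte, into the binary stream out."""
    # Binary mode keeps the first line (the doctype) and the line endings
    # exactly as they are on disk; nothing is added between lines.
    with open(path, "rb") as fh:
        shutil.copyfileobj(fh, out)
```

For a quick check you can pass an io.BytesIO() as out and inspect its getvalue(). Also note that print(line) in the question adds a second newline after every line, since each line read from the file already ends with one.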

Read file using urllib and write adding extra characters

I have a script that regularly reads a text file on a server and overwrites a local copy of that file with its text. The problem is that the process adds extra carriage returns and an extra invisible character after the last character. How do I make an identical copy of the server file?
I use the following to read the file
for link in links:
    try:
        f = urllib.urlopen(link)
        myfile = f.read()
    except IOError:
        pass
and to write it to the local file
f = open("C:\\localfile.txt", "w")
try:
    f.write(myfile)
except NameError:
    pass
finally:
    f.close()
This is how the file looks on the server
!http://i.imgur.com/rAnUqmJ.jpg
and this is how the file looks locally. There is also an additional invisible character after the last 75.
!http://i.imgur.com/xfs3E8D.jpg
I have seen quite a few similar questions, but I'm not sure how to get urllib to read in binary.
Any solution, please?
If you want to copy a remote file denoted by a URL to a local file, I would use urllib.urlretrieve:
import urllib
urllib.urlretrieve("http://anysite.co/foo.gz", "foo.gz")
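I think urllib is already reading the data as binary. The extra carriage returns more likely come from writing in text mode, which on Windows turns every "\n" into "\r\n".
Try changing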
f = open("C:\\localfile.txt", "w")
to
f = open("C:\\localfile.txt", "wb")
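The effect of the mode can be reproduced on any platform. The sketch below is Python 3 (the thread's code is Python 2) and uses made-up content; newline="\r\n" is passed only to force the translation that text mode applies by default on Windows:

```python
import os
import tempfile

server_text = "line 1\r\nline 2\r\n"  # made-up content with Windows line endings

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "localfile.txt")

    # Text mode with "\n" -> "\r\n" translation: the "\r" that is already
    # there is kept, so every line ending gains an extra carriage return.
    with open(path, "w", newline="\r\n") as f:
        f.write(server_text)
    with open(path, "rb") as f:
        text_mode_bytes = f.read()

    # Binary mode writes the bytes untouched.
    with open(path, "wb") as f:
        f.write(server_text.encode("ascii"))
    with open(path, "rb") as f:
        binary_mode_bytes = f.read()

print(text_mode_bytes)
print(binary_mode_bytes)
```

This accounts for the extra carriage returns; whether it also explains the invisible trailing character cannot be told from the screenshots alone.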
