Write to an HTML file with Python


I have a couple of graphs I need to display in my browser offline. mpld3 outputs the HTML as a string, and I need to write that string to an HTML file. What I'm doing right now is:
tohtml = mpld3.fig_to_html(fig, mpld3_url='/home/pi/webpage/mpld3.js',
                           d3_url='/home/pi/webpage/d3.js')
print(tohtml)
Html_file = open("graph.html","w")
Html_file.write(tohtml)
Html_file.close()
tohtml is the variable where the HTML string is stored. I've printed this string to the terminal and then pasted it into an empty HTML file and I get my desired result. However, when I run my code, I get an empty file named graph.html

It seems like you may be reinventing the wheel here. Have you tried something like,
mpld3_url='/home/pi/webpage/mpld3.js'
d3_url='/home/pi/webpage/d3.js'
with open('graph.html', 'w') as fileobj:
    mpld3.save_html(fig, fileobj, d3_url=d3_url, mpld3_url=mpld3_url)
Note: this is untested. I'm just going off the mpld3.save_html documentation and prior knowledge of Python IO streams.
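If you would rather keep the string returned by fig_to_html, a with block is the usual way to make sure the data is flushed and the file is closed before anything else reads it. The following is only a minimal sketch: html_string is a placeholder standing in for the real mpld3 output, and the temporary directory is used purely so the example can run anywhere.

```python
import os
import tempfile

# Placeholder standing in for the string returned by mpld3.fig_to_html(...)
html_string = "<html><body><p>placeholder for the mpld3 output</p></body></html>"

# A temporary directory keeps the sketch self-contained; use your own path instead
out_path = os.path.join(tempfile.mkdtemp(), "graph.html")

# The with block closes (and therefore flushes) the file on exit
with open(out_path, "w", encoding="utf-8") as html_file:
    html_file.write(html_string)

# Read the file back to confirm the content reached the disk
with open(out_path, encoding="utf-8") as html_file:
    written = html_file.read()
```

If written still comes back empty in your own script, it is worth checking that the script is writing to the directory you are looking in, since a relative path like "graph.html" resolves against the current working directory.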

Related

Why is Python writing the Unicode code instead of the character to a file

I'm making a Python program that (after doing a lot of things haha) creates an HTML file with some of the generated info.
I open an HTML template and then replace some 'tokens' with the generated info.
This is how I open the template and replace the info:
def getPlantilla():
    with open('assets/plantillas/plantilla_proyecto3.html','r') as file:
        plantilla = file.read()
    return plantilla
def remplazarTokens(plantilla:str,PID,Pmap):
    tabla_html = tabulate(Pmap,headers="firstrow",tablefmt='html')
    return plantilla.format(PID=PID,TABLA=tabla_html)
But before replacing the tokens, I generate some HTML code from the generated info with this function:
def crearTrigger(uso,id):
    return f"{uso}"
And finally I create the file:
with open(filename,'w',encoding='UTF-8') as file:
    file.write(html)
The problem is that in the final .html, the code generated by crearTrigger() doesn't work because some characters are replaced with their Unicode codes.
Example:
Out: <a href="#heap">Heap</a>
How it should be: Heap
I think this is an encoding problem, but I tried encoding it with .encode("utf-8") and still have the same problem.
Hope someone can help me. Thanks
Update: While writing the question, I realized that the tabulate library, which I'm using to convert the info into an HTML table, is causing the problem (writing the Unicode code instead of the character): the outputs of crearTrigger() are saved in a list that tabulate later converts into an HTML table. But I still don't know how to solve it.
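What the update describes looks like HTML escaping rather than a file-encoding issue: a table generator that escapes cell text turns markup characters into entities so they display literally. The sketch below only reproduces that effect with the standard library's html module; it does not call tabulate, and the link string is taken from the example in the question.

```python
import html

# A link like the one crearTrigger() is meant to produce (from the question's example)
link = '<a href="#heap">Heap</a>'

# Escaping replaces <, >, & and quotes with entities, so a browser shows the tag as text
escaped = html.escape(link)
print(escaped)

# Unescaping turns the entities back into the original markup
restored = html.unescape(escaped)
print(restored)
```

Recent versions of tabulate also document an unsafehtml table format that skips this escaping; whether it is available depends on the installed version, so check the tabulate documentation before relying on it.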

requests - Python command line behavior differs from behavior when script is run

I'm trying to write a script that will input data I supply into a web form at a url I supply.
To start with, I'm testing it out by simply getting the html of the page and outputting it as a text file. (I'm using Windows, hence .txt.)
import sys
import requests
sys.stdout = open('html.txt', 'a')
content = requests.get('http://www.york.ac.uk/teaching/cws/wws/webpage1.html')
content.text
When I do this (i.e., the last two lines) on the python command line (>>>), I get what I expect. When I do it in this script and run it from the normal command line, the resulting html.txt is blank. If I add print(content) then html.txt contains only: <Response [200]>.
Can anyone elucidate what's going on here? Also, as you can probably tell, I'm a beginner, and I can't for the life of me find a beginner-level tutorial that explains how to use requests (or urllib[2] or selenium or whatever) to send data to webpages and retrieve the results. Thanks!

You want:
import sys
import requests
result = requests.get('http://www.york.ac.uk/teaching/cws/wws/webpage1.html')
if result.status_code == requests.codes.ok:
    with open('html.txt', 'a') as sys.stdout:
        print result.content
Requests returns an instance of type requests.Response. When you tried to print that, the __repr__ method was called, which looks like this:
def __repr__(self):
    return '<Response [%s]>' % (self.status_code)
That is where the <Response [200]> came from.
The requests.Response object has a content attribute, which is an instance of str (or bytes in Python 3) that contains your HTML.
The text attribute is type unicode which may or may not be what you want. You mention in the comments that you saw a UnicodeDecodeError when you tried to write it to a file. I was able to replace the print result.content above with print result.text and I did not get that error.
If you need help solving your unicode problems, I recommend reading this unicode presentation. It explains why and when to decode and encode unicode.

The interactive interpreter echoes the result of every expression that doesn't produce None. This doesn't happen in regular scripts.
Use print to explicitly echo values:
print response.content
I used the undecoded version here as you are redirecting stdout to a file with no further encoding information.
You'd be better off writing the output directly to a file, however:
with open('html.txt', 'ab') as outputfile:
    outputfile.write(response.content)
This writes the response body, undecoded, directly to the file.
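The difference between the interactive interpreter and a script can be reproduced without any network access. The sketch below uses FakeResponse, a hypothetical stand-in written for this example (it is not the real requests.Response class), and captures stdout so the output can be inspected.

```python
import contextlib
import io


class FakeResponse:
    """Hypothetical stand-in for requests.Response, for illustration only."""

    def __init__(self, status_code, text):
        self.status_code = status_code
        self.text = text

    def __repr__(self):
        return '<Response [%s]>' % self.status_code


resp = FakeResponse(200, '<html>hello</html>')

buf = io.StringIO()
with contextlib.redirect_stdout(buf):
    resp.text         # bare expression: evaluated, but nothing is written in a script
    print(resp)       # writes the repr of the object
    print(resp.text)  # writes the body

captured = buf.getvalue()
```

Only the two print calls contribute to captured; the bare resp.text line leaves no trace, which is why the original script produced a blank file.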

Why do different download methods result in a different display?

When I download the file from the web with Firefox,
http://quotes.money.163.com/service/lrb_000559.html
it looks fine in Excel.
When I download the file with my Python code,
from urllib.request import urlopen
url="http://quotes.money.163.com/service/lrb_000559.html"
html=urlopen(url)
outfile=open("g:\\000559.csv","w")
outfile.write(html.read().decode("gbk"))
outfile.close()
it looks strange when I open it in Excel: each line with proper content is followed by a blank line. You can try it on your PC.
Why do different download methods produce a different display?

My guess is that line endings are changed when decoding and writing the result in python. Try using a binary file instead. Off the top of my head, I think it would go something like this:
outfile=open("g:\\000559.csv","wb")
outfile.write(html.read())

Add a 'b' flag to the file open, i.e. change this:
outfile=open("g:\\000559.csv","w")
To this:
outfile=open("g:\\000559.csv","wb")
The original file uses \r\n line endings, and in text mode on Windows Python converts each \n to \r\n when writing, so every line ends up with an extra carriage return (\r\r\n).
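The doubled carriage return can be reproduced on any platform. This is a minimal sketch, assuming a short two-line sample in place of the downloaded data; passing newline="\r\n" to open() applies the same translation that Windows text mode applies by default.

```python
import os
import tempfile

# Sample text that already uses Windows line endings, like the downloaded file
data = "a,b\r\nc,d\r\n"

tmpdir = tempfile.mkdtemp()
text_path = os.path.join(tmpdir, "text.csv")
bin_path = os.path.join(tmpdir, "binary.csv")

# Text mode with newline="\r\n" turns every \n into \r\n on write
with open(text_path, "w", newline="\r\n") as f:
    f.write(data)

# Binary mode writes the bytes exactly as given
with open(bin_path, "wb") as f:
    f.write(data.encode("gbk"))

with open(text_path, "rb") as f:
    text_bytes = f.read()
with open(bin_path, "rb") as f:
    binary_bytes = f.read()
```

If the decoded text is still needed, opening the output file in text mode with newline="" also disables the translation.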

Creating pdfs in Python with Pisa / xhtml2pdf

I know there are a lot of questions based on pdf creation in Python but I haven't seen anything based on creating pdfs with Pisa or xhtml2pdf.
Here is my code.
pisa.pisaDocument(cStringIO.StringIO(a).encode('utf-8'),file('mypdf.pdf','wb'))
and then
pisa.startViewer('mypdf.pdf')
I assembled this over a couple different tutorials and examples but every single thing that I've tried always results in the pdf being corrupted and I get this message when trying to open the pdf.
"Adobe Reader could not open 'awesomer.pdf' because it is either not a supported file type or because the file has been damaged (for example, it was sent as an email attachment and wasn't correctly decoded)."
This message occurs even when I don't use the .encode('utf-8') on the string.
What am I doing wrong? Does the encoding on my Mac have to do with this?

I'd suggest closing the file manually; I had a similar problem. Try this:
f = file('mypdf.pdf', 'wb')
# Encode the string before wrapping it; the StringIO object itself has no encode() method
pisa.pisaDocument(cStringIO.StringIO(a.encode('utf-8')), f)
f.close()

I recommend doing the following:
pdf = pisa.pisaDocument(cStringIO.StringIO(a).encode('utf-8'),file('mypdf.pdf','wb'))
if pdf.err:
    print "*** %d ERRORS OCCURED" % pdf.err
And then see what the error output is.
I'm not sure what string you are encoding but this might also help:
pdf = pisa.pisaDocument(cStringIO.StringIO(html.encode(a)).encode('utf-8'),file('mypdf.pdf','wb'))
It depends on whether a needs to be HTML-encoded.

pyPdf unable to extract text from some pages in my PDF

I'm trying to use pyPdf to extract and print pages from a multipage PDF. Problem is, text is not extracted from some pages. I've put an example file here:
http://www.4shared.com/document/kmJF67E4/forms.html
If you run the following, the first 81 pages return no text, while the final 11 extract properly. Can anyone help?
from pyPdf import PdfFileReader
input1 = PdfFileReader(file("forms.pdf", "rb"))
for page in input1.pages:
    print page.extractText()

Note that extractText() still has problems extracting the text properly. From the documentation for extractText():
This works well for some PDF files, but poorly for others, depending on the generator used. This will be refined in the future. Do not rely on the order of text coming out of this function, as it will change if this function is made more sophisticated.
Since it is the text you want, you can use the Linux command pdftotext.
To invoke that using Python, you can do this:
>>> import subprocess
>>> subprocess.call(['pdftotext', 'forms.pdf', 'output'])
The text is extracted from forms.pdf and saved to output.
This works in the case of your PDF file and extracts the text you want.
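In a script, it can help to check that the tool is installed and to fail loudly if the conversion does not succeed. The following is a minimal sketch, assuming pdftotext is on the PATH; the helper names pdftotext_command and extract_text are made up for this example.

```python
import shutil
import subprocess


def pdftotext_command(pdf_path, txt_path):
    """Build the argument list; a list avoids shell quoting issues with paths."""
    return ["pdftotext", pdf_path, txt_path]


def extract_text(pdf_path, txt_path):
    """Run pdftotext if it is installed; return False when the tool is missing."""
    if shutil.which("pdftotext") is None:
        return False
    # check=True raises CalledProcessError if pdftotext exits with an error
    subprocess.run(pdftotext_command(pdf_path, txt_path), check=True)
    return True
```

Calling extract_text("forms.pdf", "output") would then mirror the command above.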

This isn't really an answer, but the problem with pyPdf is this: it doesn't yet support CMaps. PDF allows fonts to use CMaps to map character IDs (bytes in the PDF) to Unicode character codes. When a PDF contains non-ASCII characters, there's probably a CMap in use, and sometimes there is one even when there are no non-ASCII characters. When pyPdf encounters strings that are not in a standard Unicode encoding, it just sees a bunch of bytes; it can't convert those bytes to Unicode, so it just gives you empty strings. I actually had this same problem and I'm working on the source code at the moment. It's time-consuming, but I hope to send a patch to the maintainer some time around mid-2011.

You could also try the pdfminer library (also in python), and see if it's better at extracting the text. For splitting however, you will have to stick with pyPdf as pdfminer doesn't support that.

I sometimes find it useful to convert the file to PS (try both pdf2ps and pdftops for potential differences) and then back to PDF (ps2pdf). Then try your original script again.

I had a similar problem with some PDFs, and on Windows this works very well for me:
1.- Download Xpdf tools for Windows
2.- Copy pdftotext.exe from xpdf-tools-win-4.00\bin32 to C:\Windows\System32 and also to C:\Windows\SysWOW64
3.- Use subprocess to run the command from the console:
import subprocess
try:
    extInfo = subprocess.check_output('pdftotext.exe '+filePath + ' -',shell=True,stderr=subprocess.STDOUT).strip()
except Exception as e:
    print (e)

I'm starting to think I should adopt a messy two-part solution. There are two sections to the PDF: pp. 1-82, which have text page labels (pdftotext can extract them), and pp. 83-end, which have no page labels but which pyPdf can extract, and it explicitly knows the pages.
I think I need to combine the two. Clunky, but I don't see any way round it. Sadly I'm having to do this on a Windows machine.
