How to render umlauts with jinja2? - python

I am trying to render some basic umlauts with jinja2.
test.html
<!doctype html>
<link type="text/css" rel="stylesheet" href="style.css"/>
<meta charset="UTF-8">
<h3>Umlauts: ä ü ö</h3>
Result.html
<!doctype html>
<link type="text/css" rel="stylesheet" href="style.css"/>
<meta charset="UTF-8">
<h3>Umlauts: ä ü ö</h3>
My code
from jinja2 import Template
file = open("test.html")
data = file.read()
Template(data).stream().dump("index.html")
I don't understand how to get Jinja2 to process the umlauts correctly. How can I do this? I am using stream() because in my actual use case I provide some data to fill in and then dump the result to an HTML file to be displayed.
EDIT: Is what I want even possible? As I understand the documentation, it is not:
It is not possible to use Jinja2 to process non-Unicode data. The
reason for this is that Jinja2 uses Unicode already on the language
level. For example Jinja2 treats the non-breaking space as valid
whitespace inside expressions which requires knowledge of the encoding
or operating on an Unicode string.

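With Python 3 you can specify the encoding when calling open.
from jinja2 import Template
# Read the template as UTF-8 instead of relying on the platform default
with open("test.html", 'r', encoding='utf-8') as file:
    data = file.read()
# Write the rendered output as UTF-8 as well
Template(data).stream().dump('index.html', encoding='utf-8')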
For Python2 you can use the io module to specify encoding.
import io
file = io.open("test.html", 'r', encoding='utf-8')
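To see why the explicit encoding matters, here is a minimal standard-library sketch (no Jinja2 involved). It writes the umlauts as UTF-8 and reads them back twice: once as cp1252, which is assumed here to stand in for a typical Windows default encoding, and once as UTF-8. The file name and the choice of cp1252 are illustrative assumptions.

```python
import os
import tempfile

text = "Umlauts: ä ü ö"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "test.html")  # hypothetical file name

    # The file on disk contains UTF-8 bytes
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)

    # Decoding with the wrong codec (cp1252 is an assumed platform default)
    with open(path, "r", encoding="cp1252") as f:
        wrong = f.read()

    # Decoding with the codec the file was actually written in
    with open(path, "r", encoding="utf-8") as f:
        right = f.read()

print(wrong)  # each umlaut shows up as two garbled characters
print(right)  # Umlauts: ä ü ö
```

If the garbled output matches what ends up in index.html, the template was most likely read with the wrong encoding before Jinja2 ever saw it.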

Related

show binary photo with flask and python

I have an encoded binary picture in a database and I want to show the photo with Flask, so I need to decode it and then show it in the browser. I don't want to use an HTML file; I know it can be done without HTML.
I have this code, but it only serves files from a directory. I don't know how to decode the data and show it.
from os import getcwd
from flask import send_from_directory
PATH_FILES = getcwd() + "/files/"
@images.route("/file/<string:name_file>")
def get_image(name_file):
    return send_from_directory(PATH_FILES, path=name_file, as_attachment=False)
Imagine that I already have the encoded data:
encodeimage=b''
So how can I do it?
It's a base64-encoded image; you can use it as in the example below:
<!DOCTYPE html>
<html>
<head>
<title>Display base64 Image</title>
</head>
<body>
<img src='data:image/jpeg;base64, ' />
</body>
</html>
If you save the above as an HTML file and open it in a browser, you will see the actual image.
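As a rough sketch of the two options in Python: the stored base64 data can either be decoded back to raw image bytes (which a view could then return directly) or placed into a data URI like the one in the HTML above. The sample bytes, the helper name to_data_uri and the image/jpeg MIME type are assumptions for illustration; only the standard library is used.

```python
import base64


def to_data_uri(encoded, mime="image/jpeg"):
    """Build a data URI from base64-encoded bytes (hypothetical helper)."""
    return "data:{};base64,{}".format(mime, encoded.decode("ascii"))


# Stand-in for the value stored in the database (not a real image)
raw = b"\xff\xd8\xff\xe0 placeholder image bytes"
encodeimage = base64.b64encode(raw)

# Option 1: decode back to the raw bytes of the picture
decoded = base64.b64decode(encodeimage)

# Option 2: embed the still-encoded data in an <img> tag
uri = to_data_uri(encodeimage)
img_tag = "<img src='{}' />".format(uri)
```

In a Flask view, the decoded bytes could be returned without any HTML file, for example with flask.Response(decoded, mimetype="image/jpeg"), assuming the stored picture really is a JPEG.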

Windows adding a bunch of whitespace/newlines to an html file write in python using request

Using the following code, I end up with one or more extra newlines between every line in my file when running it on Windows (in a Jupyter notebook on Python 3), but NOT when running it on Mac or Linux.
I assume it's some kind of encoding issue, something to do with Windows' "\r\n" shenanigans? Doing a str(page.content) instead leaves me with a file full of \r\n as expected, but I'm not sure why it's chock-full of newlines to begin with.
Note: I have commented out a quick way to remove the whitespace, but it's a bit of a hack and not really what I'm after. I'm looking for why the whitespace is being added in the first place.
import requests
url = 'https://stackoverflow.com/questions/3030487/is-there-a-way-to-get-the-xpath-in-google-chrome'
page=requests.get(url)
newhtml = page.text
# import re
# newhtml = re.sub(r'\s\s+', ' ', page.text)
f = open('webpage.html', 'w', encoding='utf-8')
f.write(newhtml)
f.close()
Result Sample:
<html itemscope itemtype="http://schema.org/QAPage" class="html__responsive">
<head>
<title>Is there a way to get the xpath in google chrome? - Stack Overflow</title>
<link rel="shortcut icon" href="https://cdn.sstatic.net/Sites/stackoverflow/img/favicon.ico?v=4f32ecc8f43d">
<link rel="apple-touch-icon image_src" href="https://cdn.sstatic.net/Sites/stackoverflow/img/apple-touch-icon.png?v=c78bd457575a">
<link rel="search" type="application/opensearchdescription+xml" title="Stack Overflow" href="/opensearch.xml">
<meta name="viewport" content="width=device-width, height=device-height, initial-scale=1.0, minimum-scale=1.0">
<meta property="og:type" content= "website" />
<meta property="og:url" content="https://stackoverflow.com/questions/3030487/is-there-a-way-to-get-the-xpath-in-google-chrome"/>
<meta property="og:site_name" content="Stack Overflow" />
Looks like C14L nailed it. (how do I give you internet points as a comment, can only do that as an answer, right?)
I switched over to f = open('webpage.html', 'wb', encoding='utf-8') and it complained
ValueError: binary mode doesn't take an encoding argument
so made that f = open('webpage.html', 'wb') which complained
TypeError: a bytes-like object is required, not 'str'
so I switched newhtml = page.text to newhtml = page.content and voila, the output is as expected. Now to test and see that it doesn't break anything when running on Mac/Linux.
Final functional code:
import requests
url = 'https://stackoverflow.com/questions/3030487/is-there-a-way-to-get-the-xpath-in-google-chrome'
page=requests.get(url)
newhtml = page.content
f = open('webpage.html', 'wb')
f.write(newhtml)
f.close()
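The likely mechanism is that the downloaded text already contains \r\n line endings, and a file opened in text mode on Windows translates every \n it writes into \r\n, so each line ending becomes \r\r\n. The sketch below reproduces this on any platform by passing newline="\r\n" to open, which forces the same translation Windows applies by default; the sample HTML string and file names are made up for the demonstration.

```python
import os
import tempfile

# Stand-in for page.text: a document that already uses \r\n line endings
html = "<html>\r\n<head>\r\n</head>\r\n</html>\r\n"

with tempfile.TemporaryDirectory() as tmp:
    text_path = os.path.join(tmp, "text_mode.html")
    raw_path = os.path.join(tmp, "binary_mode.html")

    # Text mode with \n -> \r\n translation, as on Windows by default
    with open(text_path, "w", encoding="utf-8", newline="\r\n") as f:
        f.write(html)

    # Binary mode writes the bytes exactly as given
    with open(raw_path, "wb") as f:
        f.write(html.encode("utf-8"))

    with open(text_path, "rb") as f:
        text_bytes = f.read()
    with open(raw_path, "rb") as f:
        raw_bytes = f.read()

print(text_bytes)  # line endings have been doubled up to \r\r\n
print(raw_bytes)   # line endings are still \r\n
```

Besides writing bytes in 'wb' mode, opening the file with open('webpage.html', 'w', encoding='utf-8', newline='') also disables the translation while keeping text mode.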

Why does Python 3 urllib redirect to Yahoo?

I am using urlopen in urllib.request in Python 3.5.1 (64-bit version on Windows 10) to load content from www.wordreference.com for a French project. Somehow, whenever I request anything outside the domain itself, page content is instead loaded from yahoo.com.
Here, I print the first 350 characters from http://www.wordreference.com:
>>> from urllib import request
>>> page = request.urlopen("http://www.wordreference.com")
>>> content = page.read()
>>> print(content.decode()[:350])
<!DOCTYPE html>
<html lang="en">
<head>
<title>English to French, Italian, German & Spanish Dictionary -
WordReference.com</title>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<meta name="description" content="Free online dictionaries - Spanish, French,
Italian, German and more. Conjugations, audio pronunciations and
Next, I requested a specific document on the domain:
>>> page = request.urlopen("http://www.wordreference.com/enfr/test")
>>> content = page.read()
>>> print(content.decode()[:350])
<!DOCTYPE html>
<html id="atomic" lang="en-US" class="atomic my3columns l-out Pos-r https fp
fp-v2 rc1 fp-default mini-uh-on viewer-right ltr desktop Desktop bkt201">
<head>
<title>Yahoo</title><meta http-equiv="x-dns-prefetch-control" content="on"
<link rel="dns-prefetch" href="//s.yimg.com"><link rel="preconnect"
href="//s.yimg.com"><li
The last request takes about six seconds longer to read (which could be my slow internet) and the content comes straight from http://www.yahoo.com/. I can access the above URLs fine in a web browser.
Why is this happening? Is this something related to Windows 10? I have tried this on other domains and the problem does not occur.
I tried the following code, which uses requests instead of urllib, and it works.
import requests
page = requests.get("http://www.wordreference.com/enfr/test")
content = page.text
print(content.encode('utf-8')[:350])
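The thread does not establish why the Yahoo page is returned. One way to narrow it down is to check whether urllib silently followed a redirect: geturl() on the response reports the URL that was finally served. The sketch below demonstrates this against a hypothetical local server (RedirectingHandler and the /old and /new paths are invented for the demo), not against wordreference.com. It also bypasses any proxy settings from the environment, which is an assumption made so the demo stays local.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib import request


class RedirectingHandler(BaseHTTPRequestHandler):
    """Hypothetical stand-in server: /old redirects to /new."""

    def do_GET(self):
        if self.path == "/old":
            self.send_response(302)
            self.send_header("Location", "/new")
            self.send_header("Content-Length", "0")
            self.end_headers()
        else:
            body = b"final page"
            self.send_response(200)
            self.send_header("Content-Type", "text/plain; charset=utf-8")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo output quiet


server = HTTPServer(("127.0.0.1", 0), RedirectingHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
base = "http://127.0.0.1:{}".format(server.server_port)

# An opener with an empty ProxyHandler ignores proxy environment variables
opener = request.build_opener(request.ProxyHandler({}))

requested = base + "/old"
with opener.open(requested) as page:
    final_url = page.geturl()  # URL actually served after any redirects
    content = page.read()

server.shutdown()
server.server_close()

print(requested)
print(final_url)
```

If final_url differs from the requested URL for the real site, the response was redirected, and comparing the request headers sent by urllib and by the browser would be a reasonable next step.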

HTML encoding and lxml parsing

I'm trying to finally solve some encoding issues that pop up from trying to scrape HTML with lxml. Here are three sample HTML documents that I've encountered:
1.
<!DOCTYPE html>
<html lang='en'>
<head>
<title>Unicode Chars: 은 —’</title>
<meta charset='utf-8'>
</head>
<body></body>
</html>
2.
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="ko-KR" lang="ko-KR">
<head>
<title>Unicode Chars: 은 —’</title>
<meta http-equiv="content-type" content="text/html; charset=utf-8" />
</head>
<body></body>
</html>
3.
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html
PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<title>Unicode Chars: 은 —’</title>
</head>
<body></body>
</html>
My basic script:
from lxml.html import fromstring
...
doc = fromstring(raw_html)
title = doc.xpath('//title/text()')[0]
print title
The results are:
Unicode Chars: ì ââ
Unicode Chars: 은 —’
Unicode Chars: 은 —’
So, obviously an issue with sample 1 and the missing <meta http-equiv="Content-Type" content="text/html; charset=utf-8" /> tag. The solution from here will correctly recognize sample 1 as utf-8 and so it is functionally equivalent to my original code.
The lxml docs appear conflicted:
From here the example seems to suggest we should use UnicodeDammit to encode the markup as unicode.
from BeautifulSoup import UnicodeDammit
def decode_html(html_string):
    converted = UnicodeDammit(html_string, isHTML=True)
    if not converted.unicode:
        raise UnicodeDecodeError(
            "Failed to detect encoding, tried [%s]",
            ', '.join(converted.triedEncodings))
    # print converted.originalEncoding
    return converted.unicode
root = lxml.html.fromstring(decode_html(tag_soup))
However here it says:
[Y]ou will get errors when you try [to parse] HTML data in a unicode string that specifies a charset in a meta tag of the header. You should generally avoid converting XML/HTML data to unicode before passing it into the parsers. It is both slower and error prone.
If I try to follow the first suggestion in the lxml docs, my code is now:
from lxml.html import fromstring
from bs4 import UnicodeDammit
...
dammit = UnicodeDammit(raw_html)
doc = fromstring(dammit.unicode_markup)
title = doc.xpath('//title/text()')[0]
print title
I now get the following results:
Unicode Chars: 은 —’
Unicode Chars: 은 —’
ValueError: Unicode strings with encoding declaration are not supported.
Sample 1 now works correctly but sample 3 results in an error due to the <?xml version="1.0" encoding="utf-8"?> tag.
Is there a correct way to handle all of these cases? Is there a better solution than the following?
dammit = UnicodeDammit(raw_html)
try:
    doc = fromstring(dammit.unicode_markup)
except ValueError:
    doc = fromstring(raw_html)
lxml has several issues related to handling Unicode. It might be best to use bytes (for now) while specifying the character encoding explicitly:
#!/usr/bin/env python
import glob
from lxml import html
from bs4 import UnicodeDammit
for filename in glob.glob('*.html'):
    with open(filename, 'rb') as file:
        content = file.read()
    doc = UnicodeDammit(content, is_html=True)
    parser = html.HTMLParser(encoding=doc.original_encoding)
    root = html.document_fromstring(content, parser=parser)
    title = root.find('.//title').text_content()
    print(title)
Output
Unicode Chars: 은 —’
Unicode Chars: 은 —’
Unicode Chars: 은 —’
The problem probably stems from the fact that <meta charset> is a relatively new standard (HTML5 if I'm not mistaken, or it wasn't really used before it.)
Until such a time when the lxml.html library is updated to reflect it, you will need to handle that case specially.
If you only care about ISO-8859-* and UTF-8, and can afford to throw away non-ASCII compatible encodings (such as UTF-16 or the East Asian traditional charsets), you can do a regular expression substitution on the byte string, replacing the newer <meta charset> with the older http-equiv format.
Otherwise, if you need a proper solution, your best bet is to patch the library yourself (and contribute the fix while you're at it). You might want to ask the lxml developers whether they have any half-baked code lying around for this particular bug, or whether they are tracking it in their bug tracker in the first place.
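The regular-expression workaround mentioned above could look roughly like the following sketch. It operates on the raw byte string and rewrites a short <meta charset> tag into the older http-equiv form before the markup is handed to the parser. The pattern and the function name are assumptions for illustration; they cover only simple tags and, as noted above, only ASCII-compatible encodings.

```python
import re

# Matches e.g. <meta charset='utf-8'>, <meta charset="utf-8" /> (simple cases only)
META_CHARSET = re.compile(
    rb"""<meta\s+charset\s*=\s*["']?([\w-]+)["']?\s*/?>""",
    re.IGNORECASE,
)


def normalize_meta_charset(raw_html):
    """Rewrite <meta charset=...> to the http-equiv form (hypothetical helper)."""
    return META_CHARSET.sub(
        rb'<meta http-equiv="Content-Type" content="text/html; charset=\1" />',
        raw_html,
    )


sample = b"<head><title>t</title><meta charset='utf-8'></head>"
fixed = normalize_meta_charset(sample)
print(fixed)

# A tag that already uses http-equiv is left untouched
old_style = b'<meta http-equiv="content-type" content="text/html; charset=utf-8" />'
unchanged = normalize_meta_charset(old_style)
```

The result of normalize_meta_charset(raw_html) would then be passed to fromstring in place of raw_html.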

Django output word files(.doc),only show raw html in the contents

I am writing a web app using Django 1.4. I want one of my views to output Microsoft Word docs using the following code:
response = HttpResponse(view_data, content_type='application/vnd.ms-word')
response['Content-Disposition'] = 'attachment; filename=file.doc'
return response
Then I can download file.doc successfully, but when I open the .doc file I only find the raw HTML, like this
<h1>some contents</h1>
not a Heading 1 title.
I am new to Python and Django. I know this may be a problem with HTML escaping; can someone please help me with this?
Thank you! :)
Unless you have some method of converting your response (HTML, I assume) to a .doc file, all you will get is a text file containing your response with the extension .doc. If you are willing to go for .docx files, there is a wonderful Python library called python-docx that you should look into; it allows you to generate well-formed .docx files using the lxml library.
Alternatively, use a template such as:
<html>
<head>
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=UTF-8">
<meta name=ProgId content=Word.Document>
<meta name=Generator content="Microsoft Word 9">
<meta name=Originator content="Microsoft Word 9">
<style>
@page Section1 {size:595.45pt 841.7pt; margin:1.0in 1.25in 1.0in 1.25in;mso-header-margin:.5in;mso-footer-margin:.5in;mso-paper-source:0;}
div.Section1 {page:Section1;}
@page Section2 {size:841.7pt 595.45pt;mso-page-orientation:landscape;margin:1.25in 1.0in 1.25in 1.0in;mso-header-margin:.5in;mso-footer-margin:.5in;mso-paper-source:0;}
div.Section2 {page:Section2;}
</style>
</head>
<body>
<div class=Section2>
'Section1: Portrait, Section2: Landscape
[your text here]
</div>
</body>
</html>
According to this asp.net forum post, this should make a valid .doc file when returned with the MIME type application/msword and the UTF-8 charset (so make sure all strings are unicode).
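A minimal sketch of how the template approach might be assembled in plain Python, independent of Django: the view's HTML is wrapped in a shortened version of the template, encoded as UTF-8, and paired with the response headers. The names WORD_HTML_TEMPLATE and build_word_payload are hypothetical, and whether Word accepts the result rests on the forum post cited above rather than on anything verified here.

```python
# Shortened version of the template above; {body} marks where the content goes
WORD_HTML_TEMPLATE = """<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<meta name=ProgId content=Word.Document>
</head>
<body>
{body}
</body>
</html>"""


def build_word_payload(body_html, filename="file.doc"):
    """Return (payload_bytes, headers) for HTML served as a .doc download."""
    payload = WORD_HTML_TEMPLATE.format(body=body_html).encode("utf-8")
    headers = {
        "Content-Type": "application/msword; charset=utf-8",
        "Content-Disposition": "attachment; filename={}".format(filename),
    }
    return payload, headers


payload, headers = build_word_payload("<h1>some contents</h1>")
print(headers["Content-Type"])
```

In the Django view from the question, payload would take the place of view_data, with headers["Content-Type"] passed as content_type and the Content-Disposition header set on the response as before.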
