Compress (minify) HTML from Python - python

How can I compress (minify) HTML from Python? I know I can use a regex to strip spaces and other things, but I want a real minifier written in pure Python (so it can be used on Google App Engine).
I tested an online HTML compressor and it reduced the HTML size by 65%. I want that, but from Python.

You can use htmlmin to minify your html:
import htmlmin
html = """
<!DOCTYPE html>
<html lang="en">
<head>
<title>Bootstrap Case</title>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1">
<link rel="stylesheet" href="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.7/css/bootstrap.min.css">
<script src="https://ajax.googleapis.com/ajax/libs/jquery/3.1.1/jquery.min.js"></script>
<script src="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.7/js/bootstrap.min.js"></script>
</head>
<body>
<div class="container">
<h2>Well</h2>
<div class="well">Basic Well</div>
</div>
</body>
</html>
"""
# On Python 3, html is already a str, so no .decode("utf-8") call is needed.
minified = htmlmin.minify(html, remove_empty_space=True)
print(minified)

htmlmin and html_slimmer are two simple HTML minifying tools for Python. I have millions of HTML pages stored in my database, and running htmlmin reduces the page size by between 5% and 50%. Neither of them does a complete job of HTML minification (for example, the font color #000000 can be reduced to #000), but it's a good start. I run htmlmin inside a try/except block and fall back to html_slimmer if it fails, because htmlmin seems to provide better compression but does not support non-ASCII characters.
Example code:
import htmlmin
from slimmer import html_slimmer  # or xhtml_slimmer, css_slimmer
try:
    html = htmlmin.minify(html, remove_comments=True, remove_empty_space=True)
except Exception:
    # Fall back to html_slimmer when htmlmin fails on this page
    html = html_slimmer(html.strip().replace('\n', ' ').replace('\t', ' ').replace('\r', ' '))
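The color shortening mentioned above (#000000 to #000) is not something either tool does, but it could be added as a post-processing step. The following is only a minimal sketch using the standard library; the function name shorten_hex_colors is hypothetical, and the regex is meant for CSS text or style attributes, since it would also rewrite any other text that happens to look like a six-digit hex color.

```python
import re

# Matches six-digit hex colors in which each channel repeats one digit,
# for example #000000 or #ffcc00. The trailing \b avoids matching the
# first six digits of a longer hex string.
SIX_DIGIT = re.compile(r"#([0-9a-fA-F])\1([0-9a-fA-F])\2([0-9a-fA-F])\3\b")


def shorten_hex_colors(css_text):
    """Rewrite #aabbcc style colors to the equivalent #abc form."""
    return SIX_DIGIT.sub(r"#\1\2\3", css_text)


print(shorten_hex_colors("color:#000000; background:#123456;"))
# prints: color:#000; background:#123456;
```

Colors such as #123456 are left alone because they have no three-digit equivalent.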
Good Luck!

I suppose there is no real need to minify your HTML on GAE, as GAE already gzips it: Caching & GZip on GAE (Community Wiki)
I did not test it, but once both versions are gzipped, the minified HTML will probably be only about 1% smaller, since minification mostly removes whitespace.
If you want to save storage, for example when putting pages in memcached, you are better off gzipping them (even at a low compression level) than removing whitespace: the result will probably be smaller, and faster to produce, because gzip runs in C instead of pure Python.
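The comparison described above is easy to measure for your own pages. Here is a rough sketch using only the standard library; collapse_whitespace is a hypothetical stand-in for a real minifier, and the sample page is made up, so the numbers only illustrate the method.

```python
import gzip
import re


def collapse_whitespace(html):
    """Crude stand-in for a minifier: collapse whitespace runs to one space."""
    return re.sub(r"\s+", " ", html).strip()


def sizes(html):
    """Return byte sizes of the page raw, minified, and gzipped both ways."""
    raw = html.encode("utf-8")
    small = collapse_whitespace(html).encode("utf-8")
    return {
        "raw": len(raw),
        "minified": len(small),
        "raw_gzip": len(gzip.compress(raw)),
        "minified_gzip": len(gzip.compress(small)),
    }


# Made-up sample page with a lot of indentation and repeated markup.
sample = (
    "<html>\n  <body>\n"
    + '    <div class="well">Basic Well</div>\n' * 50
    + "  </body>\n</html>\n"
)
report = sizes(sample)
print(report)
```

Comparing raw_gzip with minified_gzip shows how much minification still saves once the response is gzipped anyway.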

import htmlmin
code='''<body>
Hello World
<div style='color:red;'>Hi</div>
</body>
'''
htmlmin.minify(code)
Output of the last line:
<body> Hello World <div style=color:red;>Hi</div> </body>
To also remove the remaining whitespace between tags, pass remove_empty_space=True:
htmlmin.minify(code,remove_empty_space=True)

I wrote a build script that copies my templates into another directory, and I use this trick to make my application pick the right template directory in development or in production:
DEV = os.environ['SERVER_SOFTWARE'].startswith('Development') and not PRODUCTION_MODE
TEMPLATE_DIR = 'templates/2012/head/' if DEV else 'templates/2012/output/'
Whether your web server gzips the output is not really the point: you should save every byte you can for performance reasons.
If you look at some of the biggest sites out there, they often strip markup that the parser does not strictly need in order to save bytes. For example, it is common to omit the double quotes around id attribute values. This is invalid in XHTML, but HTML5 allows unquoted attribute values as long as they are not empty and contain no whitespace or any of the characters " ' = < > `:
<!-- Unquoted: valid in HTML5, invalid in XHTML -->
<div id=mydiv> ... </div>
<!-- Quoted: valid everywhere -->
<div id="mydiv"> ... </div>
There are several more examples like this one, but that's beyond the scope of this thread, I guess.
Back to the question, I put together a little build script that minifies your HTML, CSS and JS. Caveat: It doesn't cover the case of the PRE tag.
import os
import re
from subprocess import call
HEAD_DIR = 'templates/2012/head/'
OUT_DIR = 'templates/2012/output/'
# Collapse every run of two or more whitespace characters into one space
REMOVE_WS = re.compile(r"\s{2,}").sub
YUI_COMPRESSOR = 'java -jar tools/yuicompressor-2.4.7.jar '
CLOSURE_COMPILER = 'java -jar tools/compiler.jar --compilation_level ADVANCED_OPTIMIZATIONS '
def ensure_dir(f):
    # Create the parent directory of f if it does not exist yet
    d = os.path.dirname(f)
    if not os.path.exists(d):
        os.makedirs(d)
def getTarget(fn):
    return fn.replace(HEAD_DIR, OUT_DIR)
def processHtml(fn, tg):
    with open(fn, 'r') as f:
        content = f.read()
    content = REMOVE_WS(" ", content)
    ensure_dir(tg)
    with open(tg, 'w') as d:
        d.write(content)
    return content
def processCSS(fn, tg):
    cmd = YUI_COMPRESSOR + fn + ' -o ' + tg
    call(cmd, shell=True)
def processJS(fn, tg):
    cmd = CLOSURE_COMPILER + fn + ' --js_output_file ' + tg
    call(cmd, shell=True)
# Script starts here.
ensure_dir(OUT_DIR)
for root, dirs, files in os.walk(os.getcwd()):
    for dir in dirs:
        print("Processing", os.path.join(root, dir))
    for file in files:
        fn = os.path.join(root, file)
        # Skip files that are already in the output directory
        if OUT_DIR in fn:
            continue
        tg = getTarget(fn)
        if file.endswith('.html'):
            processHtml(fn, tg)
        if file.endswith('.css'):
            processCSS(fn, tg)
        if file.endswith('.js'):
            processJS(fn, tg)
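As for the PRE caveat mentioned above, one possible way to handle it is to split the document on <pre> blocks and collapse whitespace only in the parts between them. This is a minimal sketch and not part of the build script; the name collapse_outside_pre is hypothetical, and a regex split like this assumes well-formed, non-nested <pre> elements.

```python
import re

# Captures whole <pre>...</pre> blocks, case-insensitively and across lines.
PRE_BLOCK = re.compile(r"(<pre\b.*?</pre>)", re.IGNORECASE | re.DOTALL)
MULTI_WS = re.compile(r"\s{2,}")


def collapse_outside_pre(html):
    """Collapse whitespace runs everywhere except inside <pre> blocks."""
    parts = PRE_BLOCK.split(html)
    # With one capturing group, re.split puts the text between blocks at
    # even indexes and the captured <pre> blocks at odd indexes.
    for i in range(0, len(parts), 2):
        parts[i] = MULTI_WS.sub(" ", parts[i])
    return "".join(parts)


sample = "<div>\n    <p>Hi</p>\n</div>\n<pre>\n  keep   this\n</pre>\n"
result = collapse_outside_pre(sample)
print(result)
```

The text inside the <pre> block keeps its original line breaks and spacing, while the indentation in the surrounding markup is collapsed.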

Related

(Python website) How to let a website get output from another Python script

This follows the discussion in "Get output of python script from within python script".
I created a web page (.html) that should show the result of running another .py script, but I get a module not found error.
The Python code looks logically fine to me, so it might be a configuration error or some other issue I haven't thought of.
printbob.py
#!/usr/bin/env python
import sys
def main(args):
    for arg in args:
        print(arg)
if __name__ == '__main__':
    main(sys.argv)
test_0109_003.html
<html>
<head>
<link rel="stylesheet" href="https://pyscript.net/latest/pyscript.css" />
<script defer src="https://pyscript.net/latest/pyscript.js"></script>
</head>
<body>
<b><p>title test 1.10-test_get_ print </p></b>
<br>
<py-script>
import printbob
printbob.main('arg1 arg2 arg3 arg4'.split(' '))
</py-script>
</body>
</html>
(pic 01) the resulting web page
(pic 02) I uploaded the .py and .html files to the online host with WinSCP
How can I solve this problem? I suspect the cause might be that the IP/port I use in WinSCP is private, so the public site cannot access the private files. I'm not sure whether this is the reason, and if so, how do I deal with it?

Python String Replace Truncating New String

I have the following HTML document
<!DOCTYPE HTML>
<html>
<head>
<link href="style.css" rel="stylesheet">
<script src="firstScript.js"></script>
<script src="secondScript.js"></script>
...
</head>
<body onload='function()' ..."></body>
</html>
This is great for development, but in the end I need to put all of those scripts and .css file directly into my html document rather than referencing them. To achieve this I wrote a little build script in python to replace each line containing a filename with the contents of that file wrapped in the appropriate html tags. Here's a little snippet to show what happens with javascript files.
import fileinput
FILES = [ "firstScript.js", "secondScript.js", ... ]
OUTPUT = "path/to/build.html"
for f in FILES:
    scriptFile = open(f, "r")
    scriptDAT = "<script>\n"+scriptFile.read()+"</script>"
    scriptFile.close()
    with fileinput.FileInput(OUTPUT, inplace=True) as file:
        for line in file:
            if line.find(f) >= 0: line = line.replace(line, scriptDAT)
            print(line)
This mostly works, but sometimes the line.replace will write everything in scriptDAT except for the </script> tag at the end. For example, if firstScript.js contains
function helloWorld() {
    console.log("Hello World!");
}
Then this script after replacing that first line might produce the html file
<!DOCTYPE HTML>
<html>
<head>
<link href="style.css" rel="stylesheet">
<script>
function helloWorld() {
console.log("Hello World!");
}
<script src="secondScript.js"></script>
...
</head>
<body onload='function()' ..."></body>
</html>
The line.replace(line, scriptDAT) call seems to ignore the closing tag at the end of the string. The really strange thing is that this behaviour only happens sometimes; when the Python script replaces secondScript.js, it might include the closing tag. Does anyone know why the replace method is behaving this way?

Dynamic plotly visualization (or image) in Flask Python

I've got a plotly visualization ('viz.html') in a html format. I've embedded it in my webpage using Flask's url_for() syntax, referencing the static folder:
<iframe id="igraph" scrolling="no" style="border:none;" seamless="seamless" src="{{ url_for('static', filename='viz.html') }}" height="400" width="120%"></iframe>
It deploys with no issues. However, my 'viz.html' will be rewritten frequently, updating itself with new data points. The updated visualization won't show with a control + R or webpage refresh. If I press control + F5 and do a cache refresh, then updated visualization will show.
I want users to be able to see the updated visualization without having to manually refresh the cache. So far I've tried:
Reloading flask app when file changes are detected in 'static' folder (Reload Flask app when template file changes)
import os
check_folder = './static'
check_files = []
for dirname, dirs, files in os.walk(check_folder):
    for filename in files:
        # Join with dirname so files in subfolders get the right path
        filename = os.path.join(dirname, filename)
        check_files.append(filename)
app.run(debug=True, extra_files=check_files)
Disabling browser caching on the HTML page (http://cristian.sulea.net/blog/disable-browser-caching-with-meta-html-tags)
<meta http-equiv="Cache-Control" content="no-cache, no-store, must-revalidate" />
<meta http-equiv="Pragma" content="no-cache" />
<meta http-equiv="Expires" content="0" />
Disabling caching through response headers in Flask
resp.headers["Cache-Control"] = "no-cache, no-store, must-revalidate"
resp.headers["Pragma"] = "no-cache"
resp.headers["Expires"] = "0"
I would like to know: is what I'm trying to do even possible?
If so, how can I go about it? Advice appreciated!
I am not a web developer, but I worked closely with a few for 3 years. As far as I know, there is no way to control what other people's browsers have saved in their cache. If my browser has a version saved in the cache, it will use that one. Usually, browsers check whether files have been updated, and eventually the page refreshes for everyone. I have actually heard developers being asked "Why didn't you fix what I told you?" and answering "I have, just hard refresh." So I don't think there is a way, though I don't know whether anything new is out there.
I figured it out!
Instead of writing the plotly figure to a viz.html file, generate the HTML as a string and wrap it in Markup inside the Flask view:
from flask import Markup, render_template  # in newer Flask versions, import Markup from markupsafe
from plotly.offline import plot
@app.route("/index/", methods=["POST", "GET"])
def foo():
    # include_plotlyjs=True bundles the plotly.js library into the output
    # output_type='div' returns the HTML as a string instead of writing a file
    viz = plot(fig, include_plotlyjs=True, output_type='div')
    # Markup marks the string as safe, so the template renders it unescaped
    viz = Markup(viz)
    return render_template('foo.html', viz=viz)
In foo.html, you can use viz directly like so:
<html>
<body>
{{ viz }}
</body>
</html>
Voilà!

HTML string in python file doesn't execute as expected

When I run the code below, it gives me a blank HTML page, even though I put an <h1> and an <a href> tag in it. Only the <title> tag has any effect. Does anyone know why and how to fix it?
Code:
my_variable = '''
<html>
<head>
<title>My HTML File</title>
</head>
<body>
<h1>Hello world!</h1>
Click me
</body>
</html>'''
my_html_file = open(r"\Users\hp\Desktop\Code\Python testing\CH\my_html_file.html", "w")
my_html_file.write(my_variable)
Thanks in advance!
As @Bill Bell said, it's probably because you haven't closed your file (so it hasn't flushed its buffer).
So, in your case:
my_html_file = open(r"\Users\hp\Desktop\Code\Python testing\CH\my_html_file.html", "w")
my_html_file.write(my_variable)
my_html_file.close()
But this is not the best way to do it. If an error occurs in the second line, for example, the file will never get closed. Instead, you can use the with statement to make sure that it always is (just as @Rawing said):
with open('my-file.txt', 'w') as my_file:
    my_file.write('hello world!')
This is roughly equivalent to:
my_file = open('my-file.txt', 'w')
try:
    my_file.write('hello world!')
finally:
    # this part is always executed, whatever happens in the try block
    # (a return, an exception)
    my_file.close()
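To see the buffering effect described in this answer for yourself, you can compare the size of the file on disk before and after closing it. This is a small sketch, assuming CPython with its default buffering; the temporary path and the variable names are made up for the demonstration.

```python
import os
import tempfile

# Write into a throwaway directory so the demo does not touch real files.
path = os.path.join(tempfile.mkdtemp(), "demo.html")

f = open(path, "w")
f.write("<h1>Hello world!</h1>")
# The text is still sitting in Python's write buffer at this point,
# so the file on disk is still empty.
size_before_close = os.path.getsize(path)

f.close()  # closing flushes the buffer to disk
size_after_close = os.path.getsize(path)

print(size_before_close, size_after_close)
```

With default buffering, the first number is 0 and the second is the full length of the written text, which is why a browser can show a blank page while the script still holds the file open.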

Using pybtex to convert from bibtex to formatted HTML bibliography in e.g. Harvard style

I'm using Django and am storing bibtex in my model and want to be able to pass my view the reference in the form of a formatted HTML string made to look like the Harvard reference style.
Using the method described in "Pybtex does not recogonize bibtex entry", I can convert a bibtex string into a pybtex BibliographyData object. Based on the docs at https://pythonhosted.org/pybtex/api/formatting.html, I believe it should be possible to get from this to an HTML format, but I just can't seem to get it working.
Pybtex seems to be set up to be used from the command line rather than python, and there are very few examples of it being used on the internet. Has anyone done anything like this? Perhaps it would be easier to pass the bibtex to my template and use a javascript library like https://github.com/pcooksey/bibtex-js to try and get an approximation of the Harvard style?
To do that, I adapted some code from here. I am not sure what this particular formatting style is called, but you can most probably change or edit it. This is how it looks:
import io
import six
import pybtex.database.input.bibtex
import pybtex.plugin
pybtex_style = pybtex.plugin.find_plugin('pybtex.style.formatting', 'plain')()
pybtex_html_backend = pybtex.plugin.find_plugin('pybtex.backends', 'html')()
pybtex_parser = pybtex.database.input.bibtex.Parser()
my_bibtex = '''
@Book{1985:lindley,
author = {D. Lindley},
title = {Making Decisions},
publisher = {Wiley},
year = {1985},
edition = {2nd},
}
'''
data = pybtex_parser.parse_stream(six.StringIO(my_bibtex))
data_formatted = pybtex_style.format_entries(six.itervalues(data.entries))
output = io.StringIO()
pybtex_html_backend.write_to_stream(data_formatted, output)
html = output.getvalue()
print(html)
This generates the following HTML formatted reference:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN">
<html>
<head><meta name="generator" content="Pybtex">
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<title>Bibliography</title>
</head>
<body>
<dl>
<dt>1</dt>
<dd>D. Lindley.
<em>Making Decisions</em>.
Wiley, 2nd edition, 1985.</dd>
</dl></body></html>
I've noticed that the command line pybtex-format tool produces decent HTML output:
$ pybtex-format myinput.bib myoutput.html
So I went to the source code at pybtex/database/format/__main__.py and found an incredibly simple solution that worked like a charm for me:
from pybtex.database.format import format_database
format_database('myinput.bib', 'myoutput.html', 'bibtex', 'html')
Here are my input and output files:
@inproceedings{Batista18b,
author = {Cassio Batista and Ana Larissa Dias and Nelson {Sampaio Neto}},
title = {Baseline Acoustic Models for Brazilian Portuguese Using Kaldi Tools},
year = {2018},
booktitle= {Proc. IberSPEECH 2018},
pages = {77--81},
doi = {10.21437/IberSPEECH.2018-17},
url = {http://dx.doi.org/10.21437/IberSPEECH.2018-17}
}
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN">
<html>
<head><meta name="generator" content="Pybtex">
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<title>Bibliography</title>
</head>
<body>
<dl>
<dt>1</dt>
<dd>Cassio Batista, Ana Larissa Dias, and Nelson <span class="bibtex-protected">Sampaio Neto</span>.
Baseline acoustic models for brazilian portuguese using kaldi tools.
In <em>Proc. IberSPEECH 2018</em>, 77–81. 2018.
URL: http://dx.doi.org/10.21437/IberSPEECH.2018-17, doi:10.21437/IberSPEECH.2018-17.</dd>
</dl></body></html>
