maintaining formatting of imported text with mako and rst2pdf - python

I've created a template which renders pdf files from csv input. However, when the csv input fields contain user formatting, with line breaks and indentations, it messes with the rst2pdf formatting engine. Is there a way to consistently deal with user input in a way that doesn't break the document flow, but also maintains the formatting of the input text? Example script below:
from mako.template import Template
from rst2pdf.createpdf import RstToPdf
mytext = """This is the first line
Then there is a second
Then a third
This one could be indented
I'd like it to maintain the formatting."""
template = """
My PDF Document
===============
It starts with a paragraph, but after this I'd like to insert `mytext`.
It should keep the formatting intact, though I don't know what formatting to expect.
${mytext}
"""
mytemplate = Template(template)
pdf = RstToPdf()
pdf.createPdf(text=mytemplate.render(mytext=mytext),output='foo.pdf')
I have tried adding the following function in the template to insert | at the start of each line, but that doesn't seem to work either.
<%!
def wrap(text):
    return text.replace("\\n", "\\n|")
%>
Then ${mytext} would become |${mytext | wrap}. This throws the error:
<string>:10: (WARNING/2) Inline substitution_reference start-string without end-string.

It turns out I was on the right track; I just needed a space between the | and the text. The following code works:
from mako.template import Template
from rst2pdf.createpdf import RstToPdf
mytext = """This is the first line
Then there is a second
Then a third
How about an indent?
I'd like it to maintain the formatting."""
template = """
<%!
def wrap(text):
    return text.replace("\\n", "\\n| ")
%>
My PDF Document
===============
It starts with a paragraph, but after this I'd like to insert `mytext`.
It should keep the formatting intact.
| ${mytext | wrap}
"""
mytemplate = Template(template)
pdf = RstToPdf()
#print mytemplate.render(mytext=mytext)
pdf.createPdf(text=mytemplate.render(mytext=mytext),output='foo.pdf')
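The wrap filter above can also be written as a plain Python function, which makes it easy to check without Mako or rst2pdf. The following is only a sketch: the name to_line_block is made up for illustration, and it assumes that emitting a bare | for empty input lines is the desired way to keep blank lines inside the reStructuredText line block.

```python
def to_line_block(text):
    """Prefix every line with '| ' so reStructuredText treats the text as a line block.

    Hypothetical helper, not part of Mako or rst2pdf.
    """
    lines = []
    for line in text.splitlines():
        # A bare '|' keeps an empty line inside the block instead of ending it.
        lines.append("| " + line if line.strip() else "|")
    return "\n".join(lines)


mytext = "This is the first line\n    This one is indented\n\nLast line"
print(to_line_block(mytext))
```

With this approach the text would be converted before rendering, for example mytemplate.render(mytext=to_line_block(mytext)), and the template would contain a plain ${mytext} on its own line.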

Related

Getting raw text from lxml

Trying to get the text from an HtmlElement in lxml. For example, I have the HTML read in by
thing = lxml.html.fromstring("<code>&lt;div&gt;</code>")
But when I call thing.text I get <div>, meaning that lxml is translating the escape sequences. Is there a way to get the raw text, i.e., &lt;div&gt;? It is part of the output of lxml.html.tostring(thing), but that also includes the opening and closing tags, which I don't want.
Tried calling tostring with a few different encoding options but no luck.
So I looked into it a bit closer:
cdef tostring(...) in src\lxml\etree.pyx - see https://github.com/lxml/lxml/blob/master/src/lxml/etree.pyx
cdef _tostring(...) in src\lxml\serializer.pxi - see https://github.com/lxml/lxml/blob/master/src/lxml/serializer.pxi
and I couldn't find anything that would suggest you could get the escaped string by configuring the parameters of the tostring() function. It seems like it will always return the unescaped string maybe due to security concerns ...
The way I see it, you would have to use another function such as html.escape to get the escaped string:
import lxml.html
from html import escape as html_escape
thing = lxml.html.fromstring("<code>&lt;div&gt;MY TEST DIV&lt;/div&gt;</code>")
raw_thing = lxml.html.tostring(thing, method="text", encoding="unicode")  # <div>MY TEST DIV</div>
escaped_thing = html_escape(raw_thing)  # &lt;div&gt;MY TEST DIV&lt;/div&gt;
print(escaped_thing)
Essentially, what you are looking for is lxml.html.tostring(root, method='xml', encoding="unicode"):
import lxml.html
thing = lxml.html.fromstring("<code>&lt;div&gt;MY TEST DIV&lt;/div&gt;</code>")
output = lxml.html.tostring(thing, method='xml', encoding="unicode")
print(output)  # <code>&lt;div&gt;MY TEST DIV&lt;/div&gt;</code>
The problem is that it cannot separate the root element from its child in <code>&lt;div&gt;MY TEST DIV&lt;/div&gt;</code>.
However with a different approach you can get the desired output:
import xml.etree.ElementTree as ET
thing = """
<code>&lt;div&gt;MY TEST DIV&lt;/div&gt;<div>&lt;div&gt;AAA&lt;/div&gt;</div><div>&lt;div&gt;BBB&lt;/div&gt;</div></code>
"""
root = ET.fromstring(thing)
root_text = ET._escape_attrib(root.text)
print(root_text)
for child in root:
    child_text = ET._escape_attrib(child.text)
    print(child_text)
The code above prints out:
&lt;div&gt;MY TEST DIV&lt;/div&gt;
&lt;div&gt;AAA&lt;/div&gt;
&lt;div&gt;BBB&lt;/div&gt;
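Since ET._escape_attrib is a private helper of xml.etree.ElementTree, a more conservative variant is to escape with the public html.escape function. This is a minimal sketch, assuming the input is well-formed XML and that only the text directly under each element is wanted. The helper name escaped_text is made up, and quote=False is chosen here so that quotation marks are left alone.

```python
import xml.etree.ElementTree as ET
from html import escape

source = "<code>&lt;div&gt;MY TEST DIV&lt;/div&gt;<div>&lt;div&gt;AAA&lt;/div&gt;</div></code>"
root = ET.fromstring(source)


def escaped_text(element):
    """Return the element's own text with &, < and > escaped again (hypothetical helper)."""
    # The parser has already unescaped the entities, so element.text holds "<div>...".
    return escape(element.text or "", quote=False)


print(escaped_text(root))
for child in root:
    print(escaped_text(child))
```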

Why does saving text containing HTML in a variable cause unexpected behavior in beautifulsoup4?

I am using beautifulsoup to automate posting products on a shopping platform. Unfortunately, their API is currently disabled, so the only option right now is to use beautifulsoup.
How is the program expected to work?
The program is expected to read a .csv file (I provide the file name) and store the product data in variables. It then goes through multiple steps (filling out the form), such as entering the name it gets from a variable. For example:
ime_artikla = driver.find_element(By.ID, 'kategorija_sug').send_keys(csvName) #Here it inputs the name
where csvName is passed to the function along with some other parameters:
def postAutomation(csvName, csvPrice, csvproductDescription):
This is how I read the file:
filename = open(naziv_fajla, 'r', encoding="utf-8")  # file name to open; utf-8 encoding was necessary
file = csv.DictReader(filename)
The lines above are inside a try: statement.
This is how I read the columns from the csv file:
for col in file:
    print("Reading column with following SKU: " + color.GREEN + color.BOLD + (col['SKU']) + color.END + "\n")
    csvSKU = (col['SKU'])
    csvName = (col['Name'])
    #csvCategory = (col['Categories'])
    csvPrice = (col['Price'])
    csvproductDescription = (col['Description'])
    print(csvName)
    #print(csvCategory)
    print(csvPrice)
    print(csvproductDescription)
    postAutomation(csvName, csvPrice, csvproductDescription)
    i+=1
    counterOfProducts = counterOfProducts + 1
This works as expected (the product is published to the online store successfully) until the product description contains HTML and/or inline CSS.
The problem:
As I've said, the problem happens when a column contains HTML.
I populate the product description field through Tools > Source Code; the site uses TinyMCE for editing the description text.
There are two scenarios in which the program does not behave as expected:
Case 1:
In the first case, the product is published successfully, but the <li> and \n are not treated as HTML for some reason. Here is an example of one product's description where this problem occurs:
<p style="text-align: center;">Some product description.\n<ul>\n <li>Product feature 1</li>\n <li>Prod Feature 2</li>\n<li>Prod Feature 3</li>\n<li>Prod Feature 3</li>\n<li>Prod feature 4</li>\n</ul>
What I get when I submit this code:
\n\nProduct feature 1\nProd Feature 2\nProd Feature 3\nProd Feature 3\nProd feature 4\n
Case 2:
In the second case, the program crashes. What happens is the following:
Somehow the product description taken from the csv file confuses the program (I think it's due to complex HTML): part of the product description (&nbsp..., <--- this) ends up in the price field, which is on the next page entirely (you have to click next at the end of the page where the product description goes, and then input the price). This seems weird to me.
The weird thing is that I have a template for the product description (HTML and CSS) that I save into string literals, as template1 = """ A LOT OF HTML AND INLINE CSS """ and end_of_template = """ A LOT OF HTML AND INLINE CSS """, and it renders perfectly after doing this:
final_description = template1 + csvproductDescription + end_of_template
But the HTML and inline CSS inside the csvproductDescription variable don't get treated as HTML and CSS.
How can I fix this?
It seems the problem was that I had whitespace inside the product description, so I solved it like this:
final_description = html_and_css
final_description = final_description + csvproductDescription
final_description = final_description + html_and_css2
final_description = " ".join(re.split(r"\s+", final_description, flags=re.UNICODE))
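The whitespace fix above can be isolated into a small function. This is a sketch only: collapse_whitespace is a hypothetical name, and it assumes the description contains real newline and tab characters rather than the two-character sequence backslash-n. Unlike the split/join line above, it also strips leading and trailing whitespace.

```python
import re


def collapse_whitespace(html_text):
    """Replace every run of whitespace (spaces, tabs, newlines) with one space."""
    return re.sub(r"\s+", " ", html_text).strip()


description = "<ul>\n    <li>Product feature 1</li>\n    <li>Prod Feature 2</li>\n</ul>\n"
print(collapse_whitespace(description))
```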

Parsing HTML with entity ref

I am trying to parse some HTML that contains, for example:
<solids>
&sub2;
</solids>
The html file is read in as a string. I need to insert the HTML from a file that sub2 defines into the appropriate part of the string before then processing the whole string as XML.
I have tried HTMLParser and using its handlers with
class MyHTMLParser(HTMLParser):
    def handle_entityref(self, name):
        # This gets called when the entity is referenced
        print("Entity reference : " + name)
        print("Current Section : " + self.get_starttag_text())
        print(self.getpos())
But getpos returns a line number and offset rather than a position in the string. (The insertion can be at any point in the file.)
I found a link suggesting lxml. I have looked at lxml but cannot see how it would solve the problem: its scanner does not seem to have an entity handler, and it seems to target XML rather than HTML.
OK, I found that lxml will handle the ENTITY references for me.
I just had to set up the parser with the option resolve_entities=True:
from lxml import etree
parser = etree.XMLParser(resolve_entities=True)
root = etree.parse(filename, parser=parser)
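For comparison, the manual substitution described in the question can be done on the string itself with a regular expression, without needing a position from getpos. This is a different technique from the lxml solution above, sketched under assumptions: expand_entities and the replacements mapping are hypothetical, and in practice the replacement text would be read from whatever file sub2 refers to.

```python
import re

# Matches references such as &sub2; and captures the entity name.
ENTITY_REF = re.compile(r"&([A-Za-z_][\w.-]*);")


def expand_entities(text, replacements):
    """Replace &name; with replacements[name]; leave unknown references untouched."""
    def substitute(match):
        return replacements.get(match.group(1), match.group(0))
    return ENTITY_REF.sub(substitute, text)


source = "<solids>\n&sub2;\n</solids>"
print(expand_entities(source, {"sub2": "<box name='b1'/>"}))
```

References that are not in the mapping, such as &amp;, pass through unchanged, so the result can still be handed to an XML parser afterwards.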

Python flask application not displaying generated html file for second time

I have a Python Flask application which takes input IDs and dynamically generates data into an HTML file. Below is my app.py file.
@app.route('/execute', methods=['GET', 'POST'])
def execute():
    if request.method == 'POST':
        id = request.form['item_ids']
        list = [id]
        script_output = subprocess.Popen(["python", "Search_Script.py"] + list)
        # script_output = subprocess.call("python Search_Script.py "+id, shell=True)
        # render_template('running.html')
        script_output.communicate()
        #driver = webdriver.Chrome()
        #driver.get("home.html")
        #driver.execute_script("document.getElementById('Executed').style.display = '';")
    return render_template('execute.html')
@app.route('/output')
def output():
    return render_template('output.html')
The output.html file has the code below at the bottom.
<div class="container" style="text-align: center;">
{% include 'itemSearchDetails.html' %}
</div>
itemSearchDetails.html is generated dynamically on every run, based on the input. I have checked it with different inputs and it is generated correctly. When I run the application for the first time with some input values (say 2), it runs perfectly and shows the output correctly. But when I run it again with different values (say 4), the file 'itemSearchDetails.html' is generated for those 4 values, yet the browser only shows the output for the first 2 values.
So no matter how many times I run it, the browser only shows the output for the values from the first run. I am not sure whether it is a browser cache issue, since I tried "disabling cache" in Chrome and it still didn't work. Please let me know if there is something I am missing.
Try the TEMPLATES_AUTO_RELOAD configuration parameter, which the Flask documentation describes as follows:
Whether to check for modifications of the template source and reload
it automatically. By default the value is None which means that Flask
checks original file only in debug mode.
Looks like Jinja is caching the included template.
If you don't need to interpret the HTML as a Jinja template, but instead just include its contents as-is, read the file first and pass the contents into the template:
with open('itemSearchDetails.html', 'r') as infp:
    data = infp.read()
return render_template('execute.html', data=data)
...
{{ data|safe }}
(If you do need to interpret the HTML page as Jinja (as include will), you can parse a Jinja Template out of data, then use the include tag with that dynamically compiled template.)
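The read-at-request-time idea can be checked outside Flask. The sketch below uses only the standard library; read_fragment is a hypothetical helper, and the temporary file stands in for itemSearchDetails.html. Because the file is opened on every call, a regenerated file is picked up immediately, and the returned string is what would be passed as data to render_template.

```python
import os
import tempfile


def read_fragment(path):
    """Read the generated HTML fragment from disk on every call (no caching)."""
    with open(path, "r", encoding="utf-8") as infp:
        return infp.read()


# Simulate two runs of the generator script writing the same file.
tmp_dir = tempfile.mkdtemp()
fragment_path = os.path.join(tmp_dir, "itemSearchDetails.html")

with open(fragment_path, "w", encoding="utf-8") as out:
    out.write("<p>2 items</p>")
first = read_fragment(fragment_path)

with open(fragment_path, "w", encoding="utf-8") as out:
    out.write("<p>4 items</p>")
second = read_fragment(fragment_path)

print(first, second)
```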

Using other languages with Pylatex

I'm trying to get Hebrew to print into a PDF using pylatex. In a sample Hebrew .tex file whose format I'm trying to emulate, the header looks like this:
%\title{Hebrew document in WriteLatex - מסמך בעברית}
\documentclass{article}
\usepackage[utf8x]{inputenc}
\usepackage[english,hebrew]{babel}
\selectlanguage{hebrew}
\usepackage[top=2cm,bottom=2cm,left=2.5cm,right=2cm]{geometry}
I was able to emulate this entire header except for the line \selectlanguage{hebrew}. I'm not sure how I should go about getting this in my .tex file using pylatex. The code for generating the rest of the file is:
doc = pylatex.Document('basic', inputenc = 'utf8x', lmodern = False, fontenc = None, textcomp = None)
packages = [Package('babel', options = ['english', 'hebrew']), Package('inputenc', options = 'utf8enc')]
doc.packages.append(Package('babel', options = ['english', 'hebrew']))
doc.append(text.decode('utf-8'))
doc.generate_pdf(clean_tex=False, compiler = "XeLaTeX ")
doc.generate_tex()
And the header of the .tex file generated is:
\documentclass{article}%
\usepackage[utf8x]{inputenc}%
\usepackage{lastpage}%
\usepackage[english,hebrew]{babel}%
How do you get the selectlanguage line there? I'm pretty new to latex so I apologize for not being so accurate with my terminology.
You want to use Command:
from pylatex import Command
To add it to your preamble,
doc.preamble.append(Command('selectlanguage', 'hebrew'))
or to another specific place in your document,
doc.append(Command('selectlanguage', 'hebrew'))
