Convert HTML to PDF using Python/Flask

I want to generate a PDF file from HTML using Python + Flask. To do this, I use xhtml2pdf. Here is my code:
def main():
    pdf = StringIO()
    pdf = create_pdf(render_template('cvTemplate.html', user=user))
    pdf_out = pdf.getvalue()
    response = make_response(pdf_out)
    return response

def create_pdf(pdf_data):
    pdf = StringIO()
    pisa.CreatePDF(StringIO(pdf_data.encode('utf-8')), pdf)
    return pdf
In this code the file is generated on the fly. BUT! xhtml2pdf doesn't support many CSS styles, which makes it very hard to lay out the page correctly. I found another tool (wkhtmltopdf). But when I wrote something like:
pdf = StringIO()
data = render_template('cvTemplate1.html', user=user)
WKhtmlToPdf(data.encode('utf-8'), pdf)
return pdf
this error was raised:
AttributeError: 'cStringIO.StringO' object has no attribute 'rfind'
And my question is: how do I convert HTML to PDF using wkhtmltopdf (generating the file on the fly) in Flask?
Thanks in advance for your answers.

The page needs to be rendered. You can use pdfkit:
https://pypi.python.org/pypi/pdfkit
https://github.com/JazzCore/python-pdfkit
Example from the documentation:
import pdfkit
pdfkit.from_url('http://google.com', 'out.pdf')
pdfkit.from_file('test.html', 'out.pdf')
pdfkit.from_string('Hello!', 'out.pdf') # Is your requirement?
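To generate the file on the fly in Flask, a minimal sketch along these lines should work (assuming the wkhtmltopdf binary is installed and on the PATH; the route name is mine, while the template and `user` come from the question). Passing False as the output path makes pdfkit return the PDF as bytes instead of writing a file:

import pdfkit
from flask import Flask, render_template, make_response

app = Flask(__name__)

@app.route('/cv.pdf')
def cv_pdf():
    html = render_template('cvTemplate1.html', user=user)  # `user` as in the question
    pdf = pdfkit.from_string(html, False)  # False -> return the PDF as bytes
    response = make_response(pdf)
    response.headers['Content-Type'] = 'application/pdf'
    response.headers['Content-Disposition'] = 'inline; filename=cv.pdf'
    return response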

Have you tried Flask-WeasyPrint, which uses WeasyPrint? There are good examples on their websites, so I won't replicate them here.
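For reference, the usual Flask-WeasyPrint pattern looks roughly like this (a sketch only, not their exact example; it assumes an existing Flask `app` plus the template and `user` from the question):

from flask import render_template
from flask_weasyprint import HTML, render_pdf

@app.route('/cv.pdf')
def cv_weasy_pdf():
    html = render_template('cvTemplate.html', user=user)
    return render_pdf(HTML(string=html))  # render_pdf wraps the PDF in a Flask response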

Not sure if this will assist anyone, but my issue was capturing Bootstrap 5 elements as a PDF. pdfkit did not manage it, and here's a workaround on Windows using html2image and PIL. This is limited and does not take a full-page screenshot.
from html2image import Html2Image
from PIL import Image

hti = Html2Image()
try:
    hti.screenshot(html_file=r'C:\yourfilepath\file.html', save_as='test.png')
finally:
    image1 = Image.open(r'C:\yourfilepath\test.png')
    im1 = image1.convert('RGB')  # drop the alpha channel so PIL can save as PDF
    im1.save(r'C:\yourfilepath\newpdf.pdf')
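If more of the page needs to fit into the capture, html2image accepts a size argument for the virtual browser window; a sketch (the dimensions here are arbitrary, not from the original post):

from html2image import Html2Image

hti = Html2Image(size=(1920, 3000))  # width x height of the capture window
hti.screenshot(html_file=r'C:\yourfilepath\file.html', save_as='test.png')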

Related

How can I get the HTML output of a Streamlit webpage?

I have developed an application with Streamlit, but I couldn't find a way to get the HTML of the whole page. How do you think I can solve this?
I tried to output the texts I created with Streamlit as a single HTML document, but couldn't find a solution.
I tried using the Streamlit HTML component; however, when I follow that path I cannot print out the entire page. If I could reach the whole page in any way, it would be easy for me to handle the conversion, but unfortunately I can't access all the items I created in Streamlit through a single variable.
from fpdf import FPDF
import base64
import streamlit as st

a = st.write("asdasd")
export_as_pdf = st.button("Export Report")

def create_download_link(val, filename):
    b64 = base64.b64encode(val)  # val looks like b'...'
    return f'<a href="data:application/octet-stream;base64,{b64.decode()}" download="{filename}.pdf">Download file</a>'

if export_as_pdf:
    pdf = FPDF()
    pdf.add_page()
    pdf.cell(0, 5, a)
    html = create_download_link(pdf.output(dest="S").encode("latin-1"), "test")
    st.markdown(html, unsafe_allow_html=True)

How to publish a .docx file with images to a WordPress site?

So I am able to post .docx files to WordPress via the WP REST API, using the mammoth docx package in Python.
I am able to upload an image to WordPress.
But when there are images in the .docx file, they are not uploaded to the WordPress media section.
Any input on this?
I am using Python for this.
Here is the code for the .docx-to-HTML conversion:
with open(file_path, "rb") as docx_file:
    # html = mammoth.extract_raw_text(docx_file)
    result = mammoth.convert_to_html(docx_file, convert_image=mammoth.images.img_element(convert_image))
    html = result.value  # The generated HTML
Kindly note that I am able to see the images in the actual published post, but they have a weird source URL and do not appear in the WordPress media section.
The image source URLs look like:
data:image/jpeg;base64,/9j/4AAQSkZJRgABAQAAAQABAAD/2wBDAAEBAQEBAQEBAQEBAQECAgMCAgICAgQDAwIDBQQFBQUEBAQFBgcGBQUHBgQEBgkGBwgICAgIBQYJCgkICgcICAj/2wBDAQEBAQICAgQCAgQIBQQFCAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAj/wAARCAUABQADASIAAhEBAxEB/8QAHwAAAQMFAQEBAAAAAAAAAAAAAAUGBwMECAkKAgsB/8QAhxAAAQIEBAMEBQYHCAUOFggXAQIDAAQFEQYHEiETMUEIIlFhCRQ & so on
Also, huge thanks to the contributors of the Python-to-WordPress repo.
The mammoth CLI has a function that extracts the images, saves them to a directory, and inserts the file names into the img tags in the HTML. If you don't want to use mammoth on the command line, you could use this code:
import os
import mammoth
from mammoth.cli import ImageWriter, _write_output

output_dir = './output'
filename = 'filename.docx'

with open(filename, "rb") as docx_fileobj:
    convert_image = mammoth.images.img_element(ImageWriter(output_dir))
    output_filename = "{0}.html".format(os.path.basename(filename).rpartition(".")[0])
    output_path = os.path.join(output_dir, output_filename)
    result = mammoth.convert(
        docx_fileobj,
        convert_image=convert_image,
        output_format='html',
    )
    _write_output(output_path, result.value)
Note that you would still need to change the img links, since you'll be uploading the images to WordPress, but this solves your mapping issue. You might also want to change the ImageWriter class to save the images as something other than TIFF.
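To sketch that remaining step: each extracted image can be posted to the WordPress media endpoint and the generated HTML rewritten to use the returned URLs. This is an illustration only; the site URL, credentials, and helper name are placeholders, not from the question:

import os
import requests

WP_SITE = 'https://example.com'          # placeholder site URL
AUTH = ('user', 'application-password')  # placeholder credentials

def upload_image(path):
    # POST one file to the WordPress media library and return its public URL
    with open(path, 'rb') as f:
        resp = requests.post(
            WP_SITE + '/wp-json/wp/v2/media',
            auth=AUTH,
            files={'file': (os.path.basename(path), f)},
        )
    resp.raise_for_status()
    return resp.json()['source_url']

# Rewrite each extracted image reference in the generated HTML
html = result.value
for name in os.listdir(output_dir):
    if not name.endswith('.html'):
        html = html.replace(name, upload_image(os.path.join(output_dir, name)))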

How to remove a watermark from a PDF file using Python's PyPDF2 lib

I have written code that extracts the text from a PDF file with Python and the PyPDF2 lib.
The code works well for most documents, but sometimes it returns strange characters. I think that's because the PDF has a watermark over the page, so the text isn't recognised:
import requests
from io import StringIO, BytesIO
import PyPDF2

def pdf_content_extraction(pdf_link):
    all_pdf_content = ''
    # sending requests
    response = requests.get(pdf_link)
    my_raw_data = response.content
    pdf_file_text = 'PDF File: ' + pdf_link + '\n\n'
    # extract text page by page
    with BytesIO(my_raw_data) as data:
        read_pdf = PyPDF2.PdfFileReader(data)
        # looping through each page
        for page in range(read_pdf.getNumPages()):
            page_content = read_pdf.getPage(page).extractText()
            page_content = page_content.replace("\n\n\n", "\n").strip()
            # store data into variable for each page
            pdf_file_text += page_content + '\n\nPAGE ' + str(page + 1) + '/' + str(read_pdf.getNumPages()) + '\n\n\n'
    all_pdf_content += pdf_file_text + "\n\n"
    return all_pdf_content

pdf_link = 'http://www.dielsdorf.ch/dl.php/de/5f867e8255980/2020.10.12.pdf'
print(pdf_content_extraction(pdf_link))
This is the result that I'm getting:
#$%˘˘
&'(˝˙˝˙)*+"*˜
˜*
,*˜*˜ˆ+-*˘!(
.˜($*%(#%*˜-/
"*
*˜˜0!0˘˘*˜˘˜ˆ
+˜(%
*
*(+%*˜+"*˜'
$*1˜ˆ
...
...
My question is: how can I fix this problem?
Is there a way to remove the watermark from the page, or something like that?
Maybe this problem can be fixed in some other way; maybe the problem isn't in that watermark/logo at all?
The garbled text issue that you're having has nothing to do with the watermark in the document. Your issue seems to be related to the encoding in the document. The German characters within your document should be extractable using PyPDF2, because it uses the latin-1 (iso-8859-1) encoding/decoding model. This encoding model isn't working with your PDF.
When I look at the underlying info of your PDF I note that it was created using these apps:
'Producer': 'GPL Ghostscript 9.10'
'Creator': 'PDFCreator Version 1.7.3'
When I look at one of the PDFs in this question also written in German, I note that it was created using different apps:
'/Creator': 'Acrobat PDFMaker 11 für Excel'
'/Producer': 'Adobe PDF Library 11.0'
I can read the second file perfectly with PyPDF2.
When I look at this file from your other question, I noted that it also cannot be read correctly by PyPDF2. That file was created with the same apps as the file from this bounty question:
'Producer': 'GPL Ghostscript 9.10'
'Creator': 'PDFCreator Version 1.7.3'
This is the same file that threw an error when attempting to extract the text using pdfreader.SimplePDFViewer.
I looked at the bugs for Ghostscript and noted that there are some font-related issues in Ghostscript 9.10, which was released in 2015. I also noted that some people mentioned that PDFCreator Version 1.7.3, released in 2018, had some font-embedding issues as well.
I have been trying to find the correct decoding/encoding sequence, but so far I haven't been able to extract the text correctly.
Here are some of the sequences:
page_content.encode('raw_unicode_escape').decode('ascii', 'xmlcharrefreplace')
# output
\u02d8
\u02c7\u02c6\u02d9\u02dd\u02d9\u02db\u02da\u02d9\u02dc
\u02d8\u02c6!"""\u02c6\u02d8\u02c6!
page_content.encode('ascii', 'xmlcharrefreplace').decode('raw_unicode_escape')
# output
# ˘
ˇˆ˙˝˙˛˚˙˜
˘ˆ!"""ˆ˘ˆ!
I will keep looking for the correct encoding/decoding sequence to use with PyPDF2. It is worth noting that PyPDF2 hasn't been updated since May 18, 2016, and encoding issues are a common problem with the module. Its maintenance is effectively dead, hence the ports PyPDF3 and PyPDF4.
I attempted to extract the text from your PDF using PyPDF2, PyPDF3 and PyPDF4. All three modules failed to extract the content from the PDF that you provided.
You can definitely extract the content from your document using other Python modules.
Tika
This example uses Tika and BeautifulSoup to extract the content in German from your source document.
import requests
from tika import parser
from io import BytesIO
from bs4 import BeautifulSoup

pdf_link = 'http://www.dielsdorf.ch/dl.php/de/5f867e8255980/2020.10.12.pdf'
response = requests.get(pdf_link)
with BytesIO(response.content) as data:
    parse_pdf = parser.from_buffer(data, xmlContent=True)

# Parse metadata from the PDF
metadata = parse_pdf['metadata']
# Parse the content from the PDF
content = parse_pdf['content']
# Convert double newlines into single newlines
content = content.replace('\n\n', '\n')

soup = BeautifulSoup(content, "lxml")
body = soup.find('body')
for p_tag in body.find_all('p'):
    print(p_tag.text.strip())
pdfminer
This example uses pdfminer to extract the content from your source document.
import requests
from io import BytesIO
from pdfminer.high_level import extract_text

pdf_link = 'http://www.dielsdorf.ch/dl.php/de/5f867e8255980/2020.10.12.pdf'
response = requests.get(pdf_link)
with BytesIO(response.content) as data:
    text = extract_text(data, password='', page_numbers=None, maxpages=0,
                        caching=True, codec='utf-8', laparams=None)

print(text.replace('\n\n', '\n').strip())
As for actually removing the watermark, here is an approach using PyPDF4 that blanks out the text-drawing (Tj) operations that match the watermark text:

from PyPDF4 import PdfFileReader, PdfFileWriter
from PyPDF4.pdf import ContentStream
from PyPDF4.generic import TextStringObject, NameObject
from PyPDF4.utils import b_

def remove_watermark(wm_text, inputFile, outputFile):
    with open(inputFile, "rb") as f:
        source = PdfFileReader(f, "rb")
        output = PdfFileWriter()
        for page in range(source.getNumPages()):
            page = source.getPage(page)
            content_object = page["/Contents"].getObject()
            content = ContentStream(content_object, source)
            # Blank out every Tj (show text) operand that starts with the watermark text
            for operands, operator in content.operations:
                if operator == b_("Tj"):
                    text = operands[0]
                    if isinstance(text, str) and text.startswith(wm_text):
                        operands[0] = TextStringObject('')
            page.__setitem__(NameObject('/Contents'), content)
            output.addPage(page)
        with open(outputFile, "wb") as outputStream:
            output.write(outputStream)

wm_text = 'wm_text'
inputFile = r'input.pdf'
outputFile = r"output.pdf"
remove_watermark(wm_text, inputFile, outputFile)
In contrast to my initial assumption in comments to the question, the issue is not some missing ToUnicode map. I didn't see the URL to the file immediately and, therefore, guessed. Instead, the issue is a very primitively implemented text extraction method.
The PageObject method extractText is documented as follows:
extractText()
Locate all text drawing commands, in the order they are provided in the content stream, and extract the text. This works well for some PDF files, but poorly for others, depending on the generator used. This will be refined in the future. Do not rely on the order of text coming out of this function, as it will change if this function is made more sophisticated.
Returns: a unicode string object.
(PyPDF2 1.26.0 documentation, visited 2021-03-15)
So this method extracts the string arguments of text-drawing instructions in the content stream, ignoring the encoding information in the respective current font object. Thus, only text drawn using a font with some ASCII'ish encoding is properly extracted.
As the text in question uses a custom ad-hoc encoding (generated while creating the page, containing the used characters in the order of their first occurrence), that extractText method is unable to extract the text.
Proper text extraction methods, on the other hand, can extract the text without issue as tested by Life is complex and documented in his answer.

Download images from multiple websites (URLs)

I am not a programmer, but I want to do this: download images from multiple URLs (urllist.txt). The URLs do not contain the image directly, so the script needs to recognise images larger than 400 KB, and there should be a 20-second delay between downloads so the site doesn't lock me out.
Thanks in advance.
Stack Overflow is for stuff like error handling, but I thought I could help. This worked for me:
downloader.py
import os
import requests
import random

def download_imgs(file):
    '''
    Downloads images based
    on the URLs given in `file`.
    '''
    with open(file, 'r') as url_file:
        data = url_file.read().strip().split('\n')  # Read the URLs in the file
        for url in data:
            img = requests.get(url.strip())  # Open the link
            with open(str(random.uniform(1, 10000)), 'wb') as write_img:
                # Random module to generate a random name for the image
                write_img.write(img.content)
                # Saved the image
    return True

download_imgs(os.path.expanduser('~/Desktop/urllist.txt'))
urllist.txt
https://lh3.googleusercontent.com/a-/AOh14GhJAxUW_Gcq2xzMqe3_tc3eLV6e9-sMTqDWuRY7=s88-c-k-c0x00ffffff-no-rj-mo
https://i.ytimg.com/vi/m4jmapVMaQA/hqdefault.jpg?sqp=-oaymwEZCPYBEIoBSFXyq4qpAwsIARUAAIhCGAFwAQ==&rs=AOn4CLBqRJKwS9ZzMwnUZvmkXrAw5EzH5w
Even though the saved images don't have file extensions (e.g. .png or .jpg), this program seems to work fine for me.
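The original request also asked to keep only images over 400 KB and to wait 20 seconds between downloads; a sketch of those two additions (the function and file-naming scheme are mine):

import os
import time
import requests

MIN_SIZE = 400 * 1024  # 400 KB threshold from the question
DELAY = 20             # seconds between downloads, from the question

def download_large_imgs(url_list_path):
    with open(url_list_path) as f:
        urls = [line.strip() for line in f if line.strip()]
    for i, url in enumerate(urls):
        img = requests.get(url)
        if len(img.content) > MIN_SIZE:  # skip anything 400 KB or smaller
            with open('image_{0}.jpg'.format(i), 'wb') as out:
                out.write(img.content)
        time.sleep(DELAY)  # pause so the site doesn't lock us out

download_large_imgs(os.path.expanduser('~/Desktop/urllist.txt'))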

In Python, is there a way I can download all/some of the image files (e.g. JPG/PNG) from a Google Images search result?

Is there a way I can download all/some of the image files (e.g. JPG/PNG) from a Google Images search result?
I can use the following code to download one image whose URL I already know:

import urllib.request

file = "Facts.jpg"  # file to be written to
url = "http://www.compassion.com/Images/Hunger-Facts.jpg"
response = urllib.request.urlopen(url)
fh = open(file, "wb")  # open the file for writing
fh.write(response.read())  # read from request while writing to file
To download multiple images, it has been suggested that I define a function and use it to repeat the task for each image URL that I would like to write to disk:

def image_request(url, file):
    response = urllib.request.urlopen(url)
    fh = open(file, "wb")  # open the file for writing
    fh.write(response.read())

And then loop over a list of URLs with:

for i, url in enumerate(urllist):
    image_request(url, str(i) + ".jpg")
However, what I really want is to download all/some of the image files (e.g. JPG/PNG) from my own Google Images search result, without necessarily having a list of the image URLs beforehand.
P.S.
Please note that I am a complete beginner and would favour an answer that breaks down the broad steps over one that bogs down in specific code. Thanks.
You can use the Google API like this, where BLUE and DOG are your search parameters:
https://ajax.googleapis.com/ajax/services/search/images?v=1.0&q=BLUE%20DOG
There is a developer guide about this here:
https://developers.google.com/image-search/v1/jsondevguide
You need to parse this JSON format before you can use the links directly.
Here's a start to your JSON parsing:
import json
j = json.loads('{"one" : "1", "two" : "2", "three" : "3"}')
print(j['two'])
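Extending that start into the full flow might look roughly like this. A sketch only: the responseData/results/unescapedUrl field names come from the old JSON developer guide linked above, and the API itself has since been deprecated, so treat it as illustrative:

import json
import urllib.request

# Build the search URL (BLUE DOG as the example query from above)
api_url = 'https://ajax.googleapis.com/ajax/services/search/images?v=1.0&q=BLUE%20DOG'

with urllib.request.urlopen(api_url) as response:
    results = json.loads(response.read().decode('utf-8'))

# Pull the direct image links out of the parsed JSON
urls = [r['unescapedUrl'] for r in results['responseData']['results']]
for i, url in enumerate(urls):
    image_request(url, str(i) + '.jpg')  # reuse the helper defined in the question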
