I have developed an application with Streamlit, but I couldn't find a way to get the HTML of the whole page. How do you think I can solve this?
I tried to output the texts I created with Streamlit as a whole in HTML, but couldn't find a solution.
I tried using the Streamlit HTML component; however, with that approach I cannot print out the entire page. If I could reach the whole page in any way, it would be easy for me to handle the conversion operations, but unfortunately I can't access all the items I created in Streamlit through a single variable.
from fpdf import FPDF
import base64
import streamlit as st

a = "asdasd"  # st.write() returns None, so keep the text in a variable
st.write(a)
export_as_pdf = st.button("Export Report")

def create_download_link(val, filename):
    b64 = base64.b64encode(val)  # val looks like b'...'
    return f'<a href="data:application/octet-stream;base64,{b64.decode()}" download="{filename}.pdf">Download file</a>'

if export_as_pdf:
    pdf = FPDF()
    pdf.add_page()
    pdf.cell(0, 5, a)
    html = create_download_link(pdf.output(dest="S").encode("latin-1"), "test")
    st.markdown(html, unsafe_allow_html=True)
So I am able to post .docx files to WordPress through the WP REST API, using the mammoth docx package in Python.
I am able to upload an image to WordPress.
But when there are images in the docx file, they are not uploaded to the WordPress media section.
Any input on this?
I am using Python for this.
Here is the code for Docx to HTML conversion
import mammoth

with open(file_path, "rb") as docx_file:
    # html = mammoth.extract_raw_text(docx_file)
    result = mammoth.convert_to_html(docx_file, convert_image=mammoth.images.img_element(convert_image))
    html = result.value  # The generated HTML
Kindly do note that I am able to see images in the actual published post, but they have a weird source image URL and do not appear in the WordPress media section.
Weird image source URL like
data:image/jpeg;base64,/9j/4AAQSkZJRgABAQAAAQABAAD/2wBDAAEBAQEBAQEBAQEBAQECAgMCAgICAgQDAwIDBQQFBQUEBAQFBgcGBQUHBgQEBgkGBwgICAgIBQYJCgkICgcICAj/2wBDAQEBAQICAgQCAgQIBQQFCAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAj/wAARCAUABQADASIAAhEBAxEB/8QAHwAAAQMFAQEBAAAAAAAAAAAAAAUGBwMECAkKAgsB/8QAhxAAAQIEBAMEBQYHCAUOFggXAQIDAAQFEQYHEiETMUEIIlFhCRQ & so on
Also, huge thanks to the contributors of the Python to WordPress repo.
The mammoth cli has a function that extracts images, saves them to a directory, and inserts the file names into the img tags in the HTML code. If you don't want to use mammoth on the command line, you could use this code:
import os
import mammoth
from mammoth.cli import ImageWriter, _write_output

output_dir = './output'
filename = 'filename.docx'

with open(filename, "rb") as docx_fileobj:
    convert_image = mammoth.images.img_element(ImageWriter(output_dir))
    output_filename = "{0}.html".format(os.path.basename(filename).rpartition(".")[0])
    output_path = os.path.join(output_dir, output_filename)
    result = mammoth.convert(
        docx_fileobj,
        convert_image=convert_image,
        output_format='html',
    )
    _write_output(output_path, result.value)
Note that you would still need to change the img links, as you'll be uploading the images to WordPress, but this solves your mapping issue. You might also want to change the ImageWriter class to save the images as something other than tiff. A sketch of the upload-and-rewrite step follows below.
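For that step, here is a minimal sketch that pushes each extracted image to the standard WordPress media endpoint and rewrites the src values in the generated HTML; the site URL, credentials, and file filtering are assumptions, not part of the original answer:

import os
import mimetypes
import requests

# Hypothetical endpoint and credentials (WordPress application password).
wp_media_endpoint = "https://example.com/wp-json/wp/v2/media"
auth = ("user", "app-password")

with open(output_path) as f:  # output_path from the snippet above
    html = f.read()

for image_name in os.listdir(output_dir):
    mime, _ = mimetypes.guess_type(image_name)
    if not (mime or "").startswith("image/"):
        continue
    with open(os.path.join(output_dir, image_name), "rb") as img:
        resp = requests.post(
            wp_media_endpoint,
            auth=auth,
            headers={
                "Content-Disposition": f'attachment; filename="{image_name}"',
                "Content-Type": mime,
            },
            data=img.read(),
        )
    resp.raise_for_status()
    # Point each img tag at the URL WordPress assigned to the upload.
    html = html.replace(image_name, resp.json()["source_url"])

with open(output_path, "w") as f:
    f.write(html)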
I have written code that extracts the text from a PDF file with Python and the PyPDF2 lib.
The code works well for most docs, but sometimes it returns strange characters. I think that's because the PDF has a watermark over the page, so it does not recognise the text:
import requests
from io import BytesIO
import PyPDF2

def pdf_content_extraction(pdf_link):
    all_pdf_content = ''
    # sending requests
    response = requests.get(pdf_link)
    my_raw_data = response.content
    pdf_file_text = 'PDF File: ' + pdf_link + '\n\n'
    # extract text page by page
    with BytesIO(my_raw_data) as data:
        read_pdf = PyPDF2.PdfFileReader(data)
        # looping through each page
        for page in range(read_pdf.getNumPages()):
            page_content = read_pdf.getPage(page).extractText()
            page_content = page_content.replace("\n\n\n", "\n").strip()
            # store data into variable for each page
            pdf_file_text += page_content + '\n\nPAGE ' + str(page + 1) + '/' + str(read_pdf.getNumPages()) + '\n\n\n'
    all_pdf_content += pdf_file_text + "\n\n"
    return all_pdf_content

pdf_link = 'http://www.dielsdorf.ch/dl.php/de/5f867e8255980/2020.10.12.pdf'
print(pdf_content_extraction(pdf_link))
This is the result that I'm getting:
#$%˘˘
&'(˝˙˝˙)*+"*˜
˜*
,*˜*˜ˆ+-*˘!(
.˜($*%(#%*˜-/
"*
*˜˜0!0˘˘*˜˘˜ˆ
+˜(%
*
*(+%*˜+"*˜'
$*1˜ˆ
...
...
My question is, how can I fix this problem?
Is there a way to remove watermark from page or something like that?
I mean, maybe this problem can be fixed in some other way; maybe the problem is not that watermark/logo at all?
The garbled text issue that you're having has nothing to do with the watermark in the document. Your issue seems to be related to the encoding in the document. The German characters within your document should be extractable with PyPDF2, because it uses the latin-1 (iso-8859-1) encoding/decoding model. This encoding model isn't working with your PDF.
When I look at the underlying info of your PDF I note that it was created using these apps:
'Producer': 'GPL Ghostscript 9.10'
'Creator': 'PDFCreator Version 1.7.3'
When I look at one of the PDFs in this question also written in German, I note that it was created using different apps:
'/Creator': 'Acrobat PDFMaker 11 für Excel'
'/Producer': 'Adobe PDF Library 11.0'
I can read the second file perfectly with PyPDF2.
When I look at this file from your other question, I noted that it also cannot be read correctly by PyPDF2. This file was created with the same apps as the file from this bounty question.
'Producer': 'GPL Ghostscript 9.10'
'Creator': 'PDFCreator Version 1.7.3'
This is the same file that throws an error when attempting to extract the text using pdfreader.SimplePDFViewer.
I looked at the bugs for Ghostscript and noted that there are some font-related issues for Ghostscript 9.10, which was released in 2015. I also noted that some people mentioned that PDFCreator Version 1.7.3, released in 2018, also had some font-embedding issues.
I have been trying to find the correct decoding/encoding sequence, but so far I haven't been able to extract the text correctly.
Here are some of the sequences:
page_content.encode('raw_unicode_escape').decode('ascii', 'xmlcharrefreplace')
# output
# \u02d8
# \u02c7\u02c6\u02d9\u02dd\u02d9\u02db\u02da\u02d9\u02dc
# \u02d8\u02c6!"""\u02c6\u02d8\u02c6!

page_content.encode('ascii', 'xmlcharrefreplace').decode('raw_unicode_escape')
# output
# ˘
# ˇˆ˙˝˙˛˚˙˜
# ˘ˆ!"""ˆ˘ˆ!
I will keep looking for the correct encoding/decoding sequence to use with PyPDF2. It is worth noting that PyPDF2 hasn't been updated since May 18, 2016, and that encoding issues are a common problem with the module. Maintenance of this module is effectively dead, hence the ports PyPDF3 and PyPDF4.
I attempted to extract the text from your PDF using PyPDF2, PyPDF3 and PyPDF4. All 3 modules failed to extract the content from the PDF that you provided.
You can definitely extract the content from your document using other Python modules.
Tika
This example uses Tika and BeautifulSoup to extract the content in German from your source document.
import requests
from tika import parser
from io import BytesIO
from bs4 import BeautifulSoup

pdf_link = 'http://www.dielsdorf.ch/dl.php/de/5f867e8255980/2020.10.12.pdf'
response = requests.get(pdf_link)

with BytesIO(response.content) as data:
    parse_pdf = parser.from_buffer(data, xmlContent=True)

    # Parse metadata from the PDF
    metadata = parse_pdf['metadata']

    # Parse the content from the PDF
    content = parse_pdf['content']

    # Convert double newlines into single newlines
    content = content.replace('\n\n', '\n')

    soup = BeautifulSoup(content, "lxml")
    body = soup.find('body')
    for p_tag in body.find_all('p'):
        print(p_tag.text.strip())
pdfminer
This example uses pdfminer to extract the content from your source document.
import requests
from io import BytesIO
from pdfminer.high_level import extract_text

pdf_link = 'http://www.dielsdorf.ch/dl.php/de/5f867e8255980/2020.10.12.pdf'
response = requests.get(pdf_link)

with BytesIO(response.content) as data:
    text = extract_text(data, password='', page_numbers=None, maxpages=0, caching=True,
                        codec='utf-8', laparams=None)
    print(text.replace('\n\n', '\n').strip())
If you want to try stripping the watermark text itself, the following PyPDF4-based snippet rewrites each page's content stream, blanking Tj text operands that start with the watermark string:

from PyPDF4 import PdfFileReader, PdfFileWriter
from PyPDF4.pdf import ContentStream
from PyPDF4.generic import TextStringObject, NameObject
from PyPDF4.utils import b_

def remove_watermark(wm_text, inputFile, outputFile):
    with open(inputFile, "rb") as f:
        source = PdfFileReader(f, "rb")
        output = PdfFileWriter()

        for page in range(source.getNumPages()):
            page = source.getPage(page)
            content_object = page["/Contents"].getObject()
            content = ContentStream(content_object, source)

            # Blank every text-showing (Tj) operand that starts with the watermark text
            for operands, operator in content.operations:
                if operator == b_("Tj"):
                    text = operands[0]
                    if isinstance(text, str) and text.startswith(wm_text):
                        operands[0] = TextStringObject('')

            page.__setitem__(NameObject('/Contents'), content)
            output.addPage(page)

        with open(outputFile, "wb") as outputStream:
            output.write(outputStream)

wm_text = 'wm_text'
inputFile = r'input.pdf'
outputFile = r"output.pdf"
remove_watermark(wm_text, inputFile, outputFile)
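Note that this only blanks the simple Tj operator; text drawn with the array form TJ is left untouched. A hypothetical extension for the same loop, assuming PyPDF4 mirrors PyPDF2's generic.ArrayObject:

from PyPDF4.generic import ArrayObject

# Inside the `for operands, operator in content.operations:` loop above:
if operator == b_("TJ"):
    # TJ operands are arrays mixing strings and kerning offsets;
    # drop the string items that start with the watermark text.
    operands[0] = ArrayObject(
        item for item in operands[0]
        if not (isinstance(item, str) and item.startswith(wm_text))
    )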
In contrast to my initial assumption in comments to the question, the issue is not some missing ToUnicode map. I didn't see the URL to the file immediately and, therefore, guessed. Instead, the issue is a very primitively implemented text extraction method.
The PageObject method extractText is documented as follows:
extractText()
Locate all text drawing commands, in the order they are provided in the content stream, and extract the text. This works well for some PDF files, but poorly for others, depending on the generator used. This will be refined in the future. Do not rely on the order of text coming out of this function, as it will change if this function is made more sophisticated.
Returns: a unicode string object.
(PyPDF2 1.26.0 documentation, visited 2021-03-15)
So this method extracts the string arguments of text drawing instructions in the content stream, ignoring the encoding information in the respectively current font object. Thus, only text drawn using a font with some ASCII'ish encoding is properly extracted.
As the text in question uses a custom ad-hoc encoding (generated while creating the page, containing the used characters in the order of their first occurrence), that extractText method is unable to extract the text.
Proper text extraction methods, on the other hand, can extract the text without issue as tested by Life is complex and documented in his answer.
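To see the encoding situation for yourself, here is a small sketch (assuming PyPDF2 1.26's dictionary-style access to page resources) that lists what encoding each font on the first page declares; custom ad-hoc encodings typically appear as a dictionary with a /Differences array:

from PyPDF2 import PdfFileReader

reader = PdfFileReader(open("2020.10.12.pdf", "rb"))
page = reader.getPage(0)
fonts = page["/Resources"]["/Font"]
for name, ref in fonts.items():
    font = ref.getObject()
    # A missing /ToUnicode map makes reliable extraction even harder.
    print(name, font.get("/BaseFont"), font.get("/Encoding"), "/ToUnicode" in font)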
This code is supposed to download a list of PDFs into a directory:
import os
import requests

# preTag and filePath are defined earlier in the asker's script
for pdf in preTag:
    pdfUrl = "https://the-eye.eu/public/Books/Programming/" + pdf.get("href")
    print("Downloading...%s" % pdfUrl)
    # downloading pdf from url
    page = requests.get(pdfUrl)
    page.raise_for_status()
    # saving pdf to new directory
    pdfFile = open(os.path.join(filePath, os.path.basename(pdfUrl)), "wb")
    for chunk in page.iter_content(1000000):
        pdfFile.write(chunk)
    pdfFile.close()
I used os.path.basename() just to make sure the files would actually download. However, I want to know how to change the file name from 3D%20Printing%20Blueprints%20%5BeBook%5D.pdf to something like "3D Printing Blueprints.pdf"
You can use the urllib2 unquote function:
import urllib2
print urllib2.unquote("3D%20Printing%20Blueprints%20%5BeBook%5D.pdf") #3D Printing Blueprints.pdf
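On Python 3, the same function lives in urllib.parse. A minimal sketch of applying it while saving, reusing pdfUrl, filePath, and page from the question (assumed to be defined):

import os
from urllib.parse import unquote

# Decode the percent-escapes before writing, so no rename is needed afterwards.
clean_name = unquote(os.path.basename(pdfUrl))  # '3D Printing Blueprints [eBook].pdf'
with open(os.path.join(filePath, clean_name), "wb") as pdfFile:
    for chunk in page.iter_content(1000000):
        pdfFile.write(chunk)

Stripping the ' [eBook]' suffix as well would be a plain string replace on clean_name.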
Alternatively, rename the file after downloading:
os.rename("3D%20Printing%20Blueprints%20%5BeBook%5D.pdf", "3D Printing Blueprints.pdf")
I am trying to create a script that scrapes a webpage and downloads any image files found.
My first function is a wget function that reads the webpage and assigns it to a variable.
My second function is a regex that searches for src= attributes in a webpage's HTML; below is the function:
import re

def find_image(text):
    '''Find .gif, .jpg and .bmp files'''
    documents = re.findall(r'\ssrc="([^"]+)"', text)
    count = len(documents)
    print "[+] Total number of files found: %s" % count
    return '\n'.join([str(x) for x in documents])
The output from this is something like this:
example.jpg
image.gif
http://www.webpage.com/example/file01.bmp
I am trying to write a third function that downloads these files using urllib.urlretrieve(url, filename), but I am not sure how to go about this, mainly because some of the output consists of absolute paths whereas other entries are relative. I am also unsure how to download them all at the same time, and how to do so without having to specify a name and location every time.
Path-agnostic fetching of resources (can handle absolute/relative paths):
from bs4 import BeautifulSoup as bs
import urlparse
from urllib2 import urlopen
from urllib import urlretrieve
import os

def fetch_url(url, out_folder="test/"):
    """Downloads all the images at 'url' to /test/"""
    soup = bs(urlopen(url))
    parsed = list(urlparse.urlparse(url))
    for image in soup.findAll("img"):
        print "Image: %(src)s" % image
        filename = image["src"].split("/")[-1]
        parsed[2] = image["src"]
        outpath = os.path.join(out_folder, filename)
        if image["src"].lower().startswith("http"):
            urlretrieve(image["src"], outpath)
        else:
            # Rebuild an absolute URL by swapping the path component of the page URL
            urlretrieve(urlparse.urlunparse(parsed), outpath)

fetch_url('http://www.w3schools.com/html/')
I can't write you the complete code, and I'm sure that's not what you want anyway, but here are some hints:
1) Do not parse random HTML pages with regex; there are quite a few parsers made for that. I suggest BeautifulSoup. You can filter all img elements and get their src values.
2) With the src values at hand, you download your files the way you are already doing. For the relative/absolute problem, use the urlparse module, as per this SO answer: the idea is to join the src of the image with the URL from which you downloaded the HTML. If the src is already absolute, it will remain that way (see the sketch after this list).
3) As for downloading them all, simply iterate over a list of the webpages you want to download images from and do steps 1 and 2 for each image on each page. When you say "at the same time", you probably mean downloading them asynchronously; in that case, consider downloading each webpage in its own thread.
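A minimal Python 3 sketch of the join idea from hint 2; the page URL and HTML snippet are placeholders:

from urllib.parse import urljoin
from bs4 import BeautifulSoup

page_url = "http://www.webpage.com/example/index.html"  # placeholder
html = '<img src="image.gif"><img src="http://www.webpage.com/example/file01.bmp">'

soup = BeautifulSoup(html, "html.parser")
for img in soup.find_all("img"):
    # urljoin resolves relative srcs against the page URL
    # and leaves absolute URLs untouched.
    print(urljoin(page_url, img["src"]))
# http://www.webpage.com/example/image.gif
# http://www.webpage.com/example/file01.bmp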
I want to generate a PDF file from HTML using Python + Flask. To do this, I use xhtml2pdf. Here is my code:
# imports implied by the snippet (Python 2, per the cStringIO error below)
from cStringIO import StringIO
from flask import render_template, make_response
from xhtml2pdf import pisa

def main():
    pdf = StringIO()
    pdf = create_pdf(render_template('cvTemplate.html', user=user))
    pdf_out = pdf.getvalue()
    response = make_response(pdf_out)
    return response

def create_pdf(pdf_data):
    pdf = StringIO()
    pisa.CreatePDF(StringIO(pdf_data.encode('utf-8')), pdf)
    return pdf
In this code the file is generated on the fly. BUT! xhtml2pdf doesn't support many CSS styles, which makes it a big problem to lay out the page correctly. I found another tool (wkhtmltopdf). But when I wrote something like:
pdf = StringIO()
data = render_template('cvTemplate1.html', user=user)
WKhtmlToPdf(data.encode('utf-8'), pdf)
return pdf
this error was raised:
AttributeError: 'cStringIO.StringO' object has no attribute 'rfind'
And my question is: how do I convert HTML to PDF using wkhtmltopdf (generating the file on the fly) in Flask?
Thanks in advance for your answers.
The page needs to be rendered; you can use pdfkit:
https://pypi.python.org/pypi/pdfkit
https://github.com/JazzCore/python-pdfkit
Examples from the documentation:
import pdfkit
pdfkit.from_url('http://google.com', 'out.pdf')
pdfkit.from_file('test.html', 'out.pdf')
pdfkit.from_string('Hello!', 'out.pdf') # Is your requirement?
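To generate the file on the fly in Flask, a minimal sketch: pdfkit.from_string returns the PDF bytes when output_path is False (the route, template name, and user variable are placeholders from the question):

import pdfkit
from flask import Flask, render_template, make_response

app = Flask(__name__)

@app.route('/cv.pdf')
def cv_pdf():
    html = render_template('cvTemplate1.html', user=user)  # user as in the question
    pdf = pdfkit.from_string(html, False)  # False -> return the PDF instead of writing a file
    response = make_response(pdf)
    response.headers['Content-Type'] = 'application/pdf'
    return response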
Have you tried Flask-WeasyPrint, which uses WeasyPrint? There are good examples on their web sites, so I won't replicate them here.
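For orientation only, a minimal sketch along the lines of the Flask-WeasyPrint quickstart (route and template names are placeholders):

from flask import Flask, render_template
from flask_weasyprint import HTML, render_pdf

app = Flask(__name__)

@app.route('/cv.pdf')
def cv_pdf():
    html = render_template('cvTemplate.html', user=user)  # user as in the question
    return render_pdf(HTML(string=html))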
Not sure if this would assist anyone, but my issue was capturing Bootstrap 5 elements as a PDF. pdfkit did not do so, and here's a workaround on Windows using html2image and PIL. This is limited and does not take a full-page screenshot.
from html2image import Html2Image
from PIL import Image

hti = Html2Image(output_path=r'C:\yourfilepath')  # missing from the original snippet

try:
    hti.screenshot(html_file=r'C:\yourfilepath\file.html', save_as="test.png")
finally:
    image1 = Image.open(r'C:\yourfilepath\test.png')
    im1 = image1.convert('RGB')
    im1.save(r'C:\yourfilepath\newpdf.pdf')
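If the capture is getting cut off, html2image's screenshot accepts a size argument that can enlarge the viewport; the dimensions below are a guess, not tested against the asker's page:

# Hypothetical: a taller viewport so more of the Bootstrap layout fits.
hti.screenshot(html_file=r'C:\yourfilepath\file.html',
               save_as='test.png',
               size=(1920, 3000))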