I am trying to display a PDF file in an HTML page without downloading it. The code below works fine for images. Instead of images I want to display a PDF. Is that possible?
db.define_table('mytable',
    Field('image', type='upload'))
Controller
def tables():
    return dict(tables=db().select(db.mytable.ALL))
View
{{for table in tables:}}
<img src="{{=URL('default', 'download', args=table.image)}}" /> <br />
{{pass}}
You could create a PNG from the PDF. A browser does not render a PDF through an <img> tag the way it renders an image (Chrome and Firefox can open a PDF, but full screen in their own viewer, at its own URL). This could be out of scope, but in some projects I use a cron job that converts uploaded PDFs to JPG to provide a preview:
# Render all PDFs to a thumbnail when no thumbnail is found
# convert -thumbnail x600 -background white "1.pdf"[0] "1.jpg"
import glob
import os

OVERWRITE_EXISTING = False
PDFDIR = 'uploads'
THUMBDIR = os.path.join('static', 'thumbs')

def insensitive_glob(pattern):
    """Glob that matches letters in either case, so *.pdf also finds *.PDF."""
    def either(c):
        return '[%s%s]' % (c.lower(), c.upper()) if c.isalpha() else c
    return glob.glob(''.join(map(either, pattern)))

# get all the pdf file names in the upload folder
os.chdir(PDFDIR)
files = insensitive_glob("*.pdf")
# and convert each file if needed
for f in files:
    convert_needed = True
    newFile = os.path.join('..', THUMBDIR, '%s.jpg' % f[:-4])
    if not OVERWRITE_EXISTING:
        if os.path.exists(newFile):
            # print('Warning: .jpg already exists; skipping %s' % f)
            convert_needed = False
    if convert_needed:
        # [0] selects the first page of the PDF
        cmd = 'convert -density 144 %s[0] %s' % (f, newFile)
        print("Converting with: %s" % cmd)
        os.system(cmd)
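The os.system call above builds a shell string, so a PDF name containing spaces or shell metacharacters would probably break the command. A minimal sketch of the same ImageMagick invocation using an argument list instead; build_convert_cmd and convert_first_page are hypothetical helper names, and it assumes the convert binary is on the PATH:

```python
import subprocess


def build_convert_cmd(pdf_path, jpg_path, density=144):
    """Return the ImageMagick argument list that renders page 1 of pdf_path."""
    # "[0]" selects the first page, as in the cron script above.
    return ["convert", "-density", str(density), pdf_path + "[0]", jpg_path]


def convert_first_page(pdf_path, jpg_path):
    """Run the conversion; raises CalledProcessError if convert fails."""
    # Passing a list (no shell) avoids quoting problems with odd file names.
    subprocess.run(build_convert_cmd(pdf_path, jpg_path), check=True)


print(build_convert_cmd("my report.pdf", "thumbs/my report.jpg"))
```

Only build_convert_cmd runs here; convert_first_page would replace the os.system(cmd) line in the loop.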
Related
I want to get the PDF pages as JPG images (if the PDF contains 3 pages, the output should be 3 images with the .jpg extension).
I tried 2-3 different ways but did not get that result.
Below is the script I wrote, but it only produces a picture of the heading on the first page, and the image it creates is not stored in the intended folder.
`import os
from PyPDF2 import PdfReader
from wand.image import Image as WImage
pdf_file = r"C:\Users\saura\Aidetic\image_processing_data\sample_file.pdf"
def pdf_to_img(pdf_file):
    # Open the PDF file
    # pdf = open("pdf_file", "rb")
    pdf = PdfReader(pdf_file)
    # Create a folder to store the images
    folder_name = pdf_file[:-4]
    if not os.path.exists(folder_name):
        os.makedirs(folder_name)
    page = pdf.pages[0]
    count = 0
    # Iterate through each page of the PDF
    # for i in pdf.pages:
    # Get the current page
    for image in page.images:
        # page = pdf.pages[i]
        # Convert the page to an image
        with open(str(count) + image.name, "wb") as fp:
            fp.write(image.data)
        count += 1
    # Save the image to the folder
    # img.save(filename=os.path.join(folder_name, str(i) + ".jpg"))
# Get the list of PDF files in the current directory
pdf_files = [f for f in os.listdir() if f.endswith(".pdf")]
# Iterate through each PDF file
for pdf_file in pdf_files:
    pdf_to_img(pdf_file)
`
As far as I know, PyPDF2 is not capable of rendering a page of a document as an image.
In contrast, PyMuPDF can do this. Snippet here:
import fitz  # import PyMuPDF
doc = fitz.open("your.pdf")
# NOTE: apart from PDF, you can do the exact same thing for EPUB,
# MOBI, XPS, FB2, CBZ documents - no code change required.
for page in doc:  # iterate over document pages
    pix = page.get_pixmap(dpi=150)  # render full page with desired DPI
    pix.save("page-%04i.png" % page.number)  # PNG directly supported
    # if JPEG desired, use a variant that employs Pillow:
    # pix.pil_save("page-%04i.jpg" % page.number)
In the next version, JPEG will be supported directly.
Method get_pixmap() has several parameters to choose the colorspace (e.g. gray), include transparency channel, or restrict the page area to be rendered.
Here is a version that iterates over a list of files (PDF or other document types) and saves all their pages in a folder "images". This time we use a Python context manager.
import os
import fitz
filenames = []  # list of filenames (PDFs, XPS, EPUB, MOBI, ...)
os.makedirs("images", exist_ok=True)  # the output folder must exist before saving
for fname in filenames:
    basename = os.path.basename(fname)
    first = os.path.splitext(basename)[0]
    with fitz.open(fname) as doc:
        for page in doc:
            pix = page.get_pixmap(dpi=150)
            pix.save(os.path.join("images", "%s_%04i.png" % (first, page.number)))
I am able to post DOCX files to WordPress through the WP REST API, using the mammoth package in Python to convert them.
I am also able to upload an image to WordPress.
But when the DOCX file contains images, they are not uploaded to the WordPress media section.
Any input on this?
I am using Python for this.
Here is the code for the DOCX to HTML conversion:
with open(file_path, "rb") as docx_file:
    # html = mammoth.extract_raw_text(docx_file)
    result = mammoth.convert_to_html(docx_file, convert_image=mammoth.images.img_element(convert_image))
    html = result.value  # The generated HTML
Note that I can see the images in the published post, but they have a strange source URL and do not appear in the WordPress media section.
The image source URL looks like this:
data:image/jpeg;base64,/9j/4AAQSkZJRgABAQAAAQABAAD/2wBDAAEBAQEBAQEBAQEBAQECAgMCAgICAgQDAwIDBQQFBQUEBAQFBgcGBQUHBgQEBgkGBwgICAgIBQYJCgkICgcICAj/2wBDAQEBAQICAgQCAgQIBQQFCAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAj/wAARCAUABQADASIAAhEBAxEB/8QAHwAAAQMFAQEBAAAAAAAAAAAAAAUGBwMECAkKAgsB/8QAhxAAAQIEBAMEBQYHCAUOFggXAQIDAAQFEQYHEiETMUEIIlFhCRQ & so on
Also, huge thanks to the contributors of the Python to WordPress repo.
The mammoth CLI has a function that extracts images, saves them to a directory and inserts the file names into the img tags in the HTML. If you don't want to run mammoth from the command line, you could use this code:
import os
import mammoth
from mammoth.cli import ImageWriter, _write_output
output_dir = './output'
filename = 'filename.docx'
with open(filename, "rb") as docx_fileobj:
    # ImageWriter saves each image into output_dir and uses its file name as the src
    convert_image = mammoth.images.img_element(ImageWriter(output_dir))
    output_filename = "{0}.html".format(os.path.basename(filename).rpartition(".")[0])
    output_path = os.path.join(output_dir, output_filename)
    result = mammoth.convert(
        docx_fileobj,
        convert_image=convert_image,
        output_format='html',
    )
    _write_output(output_path, result.value)
Note that you would still need to change the img links, since you'll be uploading the images to WordPress, but this solves your mapping issue. You might also want to change the ImageWriter class to save the images in a format other than TIFF.
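Changing the img links could look roughly like the sketch below. It assumes the generated HTML uses double-quoted src attributes and that you have already built a mapping from each local image file name to the URL WordPress returned for the upload; rewrite_img_srcs and the example URL are hypothetical.

```python
import re


def rewrite_img_srcs(html, url_for_file):
    """Replace each img src found in url_for_file with its uploaded URL."""
    def swap(match):
        src = match.group(2)
        # Leave any src we have no uploaded URL for untouched.
        return match.group(1) + url_for_file.get(src, src) + match.group(3)

    return re.sub(r'(<img\b[^>]*?\bsrc=")([^"]*)(")', swap, html)


# Hypothetical mapping: local file name -> URL returned by the media upload.
uploaded = {"1.png": "https://example.com/wp-content/uploads/1.png"}
print(rewrite_img_srcs('<p><img alt="x" src="1.png" /></p>', uploaded))
```

A regex is tolerable here only because the HTML comes from a single generator with predictable output; for arbitrary pages an HTML parser is the safer choice.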
I am trying to create a script that scrapes a web page and downloads any image files it finds.
My first function is a wget function that reads the web page and assigns it to a variable.
My second function uses a regex to search for src= attributes in the page's HTML. Here is the function:
def find_image(text):
    '''Find .gif, .jpg and .bmp files'''
    documents = re.findall(r'\ssrc="([^"]+)"', text)
    count = len(documents)
    print "[+] Total number of file's found: %s" % count
    return '\n'.join([str(x) for x in documents])
The output from this is something like this:
example.jpg
image.gif
http://www.webpage.com/example/file01.bmp
I am trying to write a third function that downloads these files using urllib.urlretrieve(url, filename), but I am not sure how to go about it, mainly because some of the output is absolute paths whereas the rest is relative. I am also unsure how to download them all at the same time, without having to specify a name and location every time.
Path-agnostic fetching of resources (handles both absolute and relative paths):
from bs4 import BeautifulSoup as bs
import urlparse
from urllib2 import urlopen
from urllib import urlretrieve
import os
def fetch_url(url, out_folder="test/"):
    """Downloads all the images at 'url' to /test/"""
    soup = bs(urlopen(url))
    parsed = list(urlparse.urlparse(url))
    for image in soup.findAll("img"):
        print "Image: %(src)s" % image
        filename = image["src"].split("/")[-1]
        # swap the page path for the image path, keeping scheme and host
        parsed[2] = image["src"]
        outpath = os.path.join(out_folder, filename)
        if image["src"].lower().startswith("http"):
            urlretrieve(image["src"], outpath)
        else:
            urlretrieve(urlparse.urlunparse(parsed), outpath)
fetch_url('http://www.w3schools.com/html/')
I can't write you the complete code and I'm sure that's not what you would want as well, but here are some hints:
1) Do not parse random HTML pages with regex, there are quite a few parsers made for that. I suggest BeautifulSoup. You will filter all img elements and get their src values.
2) With the src values at hand, you download your files the way you are already doing. About the relative/absolute problem, use the urlparse module, as per this SO answer. The idea is to join the src of the image with the URL from which you downloaded the HTML. If the src is already absolute, it will remain that way.
3) As for downloading them all, simply iterate over a list of the webpages you want to download images from and do steps 1 and 2 for each image in each page. When you say "at the same time", you probably mean to download them asynchronously. In that case, I suggest downloading each webpage in one thread.
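Hint 2 can be sketched as follows. This is written for Python 3, where the urlparse module lives in urllib.parse; resolve_image is a hypothetical helper name, and the page URL is made up for illustration.

```python
import posixpath
from urllib.parse import urljoin, urlparse


def resolve_image(page_url, src):
    """Return (absolute_url, local_filename) for an img src found on page_url."""
    # urljoin leaves absolute URLs alone and resolves relative ones.
    absolute = urljoin(page_url, src)
    # Name the local file after the last path segment, ignoring any query string.
    filename = posixpath.basename(urlparse(absolute).path)
    return absolute, filename


page = "http://www.webpage.com/example/index.html"
for src in ["example.jpg", "/img/image.gif", "http://www.webpage.com/example/file01.bmp"]:
    print(resolve_image(page, src))
```

Each resulting pair can then be passed to urlretrieve(absolute_url, filename), so no name or location has to be typed by hand.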
I am trying to let a user upload an image, save the image to disk, and then have it display on a webpage, but I can't get the image to display properly. Here is my bin/app.py:
import web
urls = (
    '/hello', 'index'
)
app = web.application(urls, globals())
render = web.template.render('templates/', base="layout")
class index:
    def GET(self):
        return render.hello_form()
    def POST(self):
        form = web.input(greet="Hello", name="Nobody", datafile={})
        greeting = "%s, %s" % (form.greet, form.name)
        filedir = 'absolute/path/to/directory'
        filename = None
        if form.datafile:
            # replaces the windows-style slashes with linux ones.
            filepath = form.datafile.filename.replace('\\','/')
            # splits the path and chooses the last part (the filename with extension)
            filename = filepath.split('/')[-1]
            # creates the file where the uploaded file should be stored
            fout = open(filedir +'/'+ filename,'w')
            # writes the uploaded file to the newly created file.
            fout.write(form.datafile.file.read())
            # closes the file, upload complete.
            fout.close()
            filename = filedir + "/" + filename
        return render.index(greeting, filename)
if __name__ == "__main__":
    app.run()
and here is templates/index.html:
$def with (greeting, datafile)
$if greeting:
    I just wanted to say <em style="color: green; font-size: 2em;">$greeting</em>
$else:
    <em>Hello</em>, world!
<br>
$if datafile:
    <img src=$datafile alt="your picture">
<br>
Go Back
When I do this, I get a broken link for the image. How do I get the image to display properly? Ideally, I wouldn't have to read from disk to display it, although I'm not sure if that's possible. Also, is there a way to write the file to the relative path, instead of the absolute path?
You can also serve every image in a folder by adding an entry to your URL mapping.
urls = ('/hello', 'index',
        '/hello/image/(.*)', 'ImageDisplay'
)
...
class ImageDisplay(object):
    def GET(self, fileName):
        imageBinary = open("{/relative/path/from/YourApp}" + fileName, 'rb').read()
        return imageBinary
Note the ../YourApp, not ./YourApp. It looks up one directory from where your program is. Now, in the HTML, you can use
<img src="/hello/image/$datafile" alt="your picture">
I would recommend using with or try around the "imageBinary = open("{..." line.
Let me know if more info is needed. This is my first response.
Sorry to ask a question in a response, but is there a way to use a regular expression like (.*jpg) in place of the (.*) I have in the URL definition?
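On the regular-expression question: web.py URL patterns are regular expressions, so a narrower capture group should work in place of the catch-all. The sketch below only demonstrates the pattern with the standard re module, assuming the pattern is matched against the whole request path; the pattern and paths are hypothetical.

```python
import re

# Hypothetical pattern: only paths ending in .jpg are captured.
pattern = r"/hello/image/(.*\.jpg)"

for path in ["/hello/image/cat.jpg", "/hello/image/notes.txt"]:
    match = re.fullmatch(pattern, path)
    print(path, "->", match.group(1) if match else None)
```

The captured group is what would be passed to GET as fileName; paths that do not match would not reach the handler.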
web.py doesn't automatically serve all of the files from the directory your application is running in; if it did, anyone would be able to read your application's source code. It does, however, have a directory it serves files out of: static.
To answer your other question: yes, there is a way to avoid using an absolute path: give it a relative path!
Here's how your code might look afterwards:
# needs "import os" and "import shutil" at the top of the module
filename = form.datafile.filename.replace('\\', '/').split('/')[-1]
# It might be a good idea to sanitize filename further.
# A with statement ensures that the file will be closed even if an exception is
# thrown.
with open(os.path.join('static', filename), 'wb') as f:
    # shutil.copyfileobj copies the file in chunks, so it will still work if the
    # file is too large to fit into memory
    shutil.copyfileobj(form.datafile.file, f)
Omit the filename = filedir + "/" + filename line. Your template does not need the absolute path, and in fact should not include it. It must include static/, no more and no less:
<img src="static/$datafile" alt="your picture">
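As for sanitizing the file name further, one possible approach is sketched below. sanitize_filename is a hypothetical helper, and the allowed character set is an assumption you may want to loosen or tighten.

```python
import re


def sanitize_filename(raw_name):
    """Reduce a client-supplied upload name to a safe bare file name."""
    # Drop any directory part, whether the client sent / or \ separators.
    name = raw_name.replace("\\", "/").split("/")[-1]
    # Keep letters, digits, dot, dash and underscore; replace everything else.
    name = re.sub(r"[^A-Za-z0-9._-]", "_", name)
    # Avoid hidden files and names made only of dots.
    name = name.lstrip(".")
    return name or "upload"


print(sanitize_filename("C:\\Users\\me\\my pic.png"))
```

The result can be passed to os.path.join('static', ...) in place of the raw filename.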
I had a problem before where it wouldn't show Chinese characters even when I specified @font-face to use a UTF-8 font. It turns out I cannot display images either, so it seems I am unable to get any files embedded into my PDF.
This is the code I use:
def render_to_pdf(template_src, context_dict):
    """Function to render html template into a pdf file"""
    template = get_template(template_src)
    context = Context(context_dict)
    html = template.render(context)
    result = StringIO.StringIO()
    pdf = pisa.pisaDocument(StringIO.StringIO(html.encode("UTF-8")),
                            dest=result,
                            encoding='UTF-8',
                            link_callback=fetch_resources)
    if not pdf.err:
        response = http.HttpResponse(result.getvalue(), mimetype='application/pdf')
        return response
    return HttpResponse('We had some errors<pre>%s</pre>' % escape(html))
def fetch_resources(uri, rel):
    import os.path
    from django.conf import settings
    path = os.path.join(
        settings.STATIC_ROOT,
        uri.replace(settings.STATIC_URL, ""))
    return path
html
<img src="/static/images/bc_logo_bw_pdf.png" />
and
@font-face {
    font-family: "Wingdings";
    src: url("/static/fonts/wingdings.ttf");
}
I looked at the other questions on SO but they were no help. No exceptions are raised in the two functions. In the fetch_resources function the returned path was the correct full path to the file, i.e. /home/<user>/project/static/images/bc_logo_bw_pdf.png and /home/<user>/project/static/fonts/wingdings.ttf, so I am at a loss as to what is wrong.
UPDATE
Every time I create a PDF, I get this message on the console:
No handlers could be found for logger "ho.pisa"
Could this be related?
UPDATE #2
The font works now; I made a dumb mistake. The font I was using did not contain the Chinese Unicode characters. But I still cannot embed any images in the PDF, be it JPEG, GIF or PNG.
I have finally solved the problem I was having. It turns out it doesn't work if I set the body's height with CSS. Once I removed that line, the image loaded perfectly.
For me (Django 1.4, Python 2.7, pisa==3.0.33), it works if I use the full path of the image instead of a relative one.
Try doing the same.
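One way to hand pisa a full path is to do the mapping in the link_callback. The sketch below is a plain function so it can be tried without Django; resolve_static_uri is a hypothetical name, and in a real callback static_url and static_root would come from settings.STATIC_URL and settings.STATIC_ROOT.

```python
import os


def resolve_static_uri(uri, static_url, static_root):
    """Map a template URI such as /static/images/x.png to an absolute path."""
    if uri.startswith(static_url):
        # Strip the URL prefix and any leading slash before joining;
        # os.path.join discards static_root if the second part is absolute.
        relative = uri[len(static_url):].lstrip("/")
        return os.path.abspath(os.path.join(static_root, relative))
    # Anything else (http URLs, already-absolute paths) is passed through.
    return uri


print(resolve_static_uri("/static/images/logo.png", "/static/", "/home/user/project/static"))
```

Checking os.path.isfile on the returned path before returning it is a cheap way to confirm the file really exists where pisa will look.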
Everything looks fine. Try once with a JPG image file. In my case PNG files were not working either.
<img src="/static/images/<name>.jpg" />
Without width and height attributes the image will not work. Add width and height attributes.
<img src="{% static 'images/logo.png' %}" alt="image" width="200" height="150" />
This fix works for me.
I had the same problem. Don't give up on xhtml2pdf (Pisa).
Pisa uses PIL to generate the PDF and uses the zip decoder library to insert images.
You should check that PIL is installed properly with the zip decoder, fonts and the other components.
I solved this problem by installing PIL with the zip decoder.
http://obroll.com/install-python-pil-python-image-library-on-ubuntu-11-10-oneiric/
If you need more detailed information, you can read my article here:
http://obroll.com/how-to-load-images-files-style-css-in-pdf-using-pisa-xhtml2pdf-on-django/