How to Publish Docx file with images to WordPress site? - python

So I am able to post Docx files to WordPress using WP REST-API using mammoth docx package in Python
I am able to upload an image to WordPress.
But when there are images in the docx file they are not uploading on the WordPress media section.
Any input on this?
I am using python for this.
Here is the code for Docx to HTML conversion
with open(file_path, "rb") as docx_file:
# html = mammoth.extract_raw_text(docx_file)
result = mammoth.convert_to_html(docx_file, convert_image=mammoth.images.img_element(convert_image))
html = result.value # The generated HTML
kindly do note that I am able to see images in the actual published post but they have a weird source image URL & are not appearing in the WordPress media section.
Weird image source URL like
data:image/jpeg;base64,/9j/4AAQSkZJRgABAQAAAQABAAD/2wBDAAEBAQEBAQEBAQEBAQECAgMCAgICAgQDAwIDBQQFBQUEBAQFBgcGBQUHBgQEBgkGBwgICAgIBQYJCgkICgcICAj/2wBDAQEBAQICAgQCAgQIBQQFCAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAj/wAARCAUABQADASIAAhEBAxEB/8QAHwAAAQMFAQEBAAAAAAAAAAAAAAUGBwMECAkKAgsB/8QAhxAAAQIEBAMEBQYHCAUOFggXAQIDAAQFEQYHEiETMUEIIlFhCRQ & so on
Also Huge thanks to Contributors for the Python to WordPress repo

The mammoth cli has a function that extracts images, saves them to a directory and inserts the file names in the img tags in the html code. If you don't want to use mammoth in command line you could use this code:
import os
from mammoth.cli import ImageWriter, _write_output
output_dir = './output'
filename = 'filename.docx'
with open(filename, "rb") as docx_fileobj:
convert_image = mammoth.images.img_element(ImageWriter(output_dir))
output_filename = "{0}.html".format(os.path.basename(filename).rpartition(".")[0])
output_path = os.path.join(output_dir, output_filename)
result = mammoth.convert(
docx_fileobj,
convert_image=convert_image,
output_format='html',
)
_write_output(output_path, result.value)
Note that you would still need to change the img links as you'll be uploading the images to Wordpress, but this solves your mapping issue. You might also want to change the ImageWriter class to save the images to something else than tiff.

Related

Download images from multiple websites (urls)

i am not a programmer but i want to do that. I want to download images from multiple urls ( urllist.txt ) the url not have the image in it so need to recognise the image > 400kb and have 20 sec delay beetween the download so the site not lock me out
thnx in advance
Stack Overflow is for stuff like error handling, but I thought I could help. This worked for me:
downloader.py
import requests
import random
def download_imgs(file):
'''
Downloads images based
on the URLs given in `file`.
'''
with open(file, 'r') as url_file:
data = url_file.read().strip().split('\n') # Read the URLs in the file
for url in data:
img = requests.get(url.strip()) # Open the link
with open(str(random.uniform(1, 10000)), 'wb') as write_img:
# Random module to generate a random name for the image
write_img.write(img.content)
# Saved the image
return True
download_imgs('~/Desktop/urllist.txt')
urllist.txt
https://lh3.googleusercontent.com/a-/AOh14GhJAxUW_Gcq2xzMqe3_tc3eLV6e9-sMTqDWuRY7=s88-c-k-c0x00ffffff-no-rj-mo
https://i.ytimg.com/vi/m4jmapVMaQA/hqdefault.jpg?sqp=-oaymwEZCPYBEIoBSFXyq4qpAwsIARUAAIhCGAFwAQ==&rs=AOn4CLBqRJKwS9ZzMwnUZvmkXrAw5EzH5w
Even though the images don't have file extensions (Ex: PNG or PING), this program seems to work fine for me.

How to translate url encoded string python

This code is supposed to download a list of pdfs into a directory
for pdf in preTag:
pdfUrl = "https://the-eye.eu/public/Books/Programming/" +
pdf.get("href")
print("Downloading...%s"% pdfUrl)
#downloading pdf from url
page = requests.get(pdfUrl)
page.raise_for_status()
#saving pdf to new directory
pdfFile = open(os.path.join(filePath, os.path.basename(pdfUrl)), "wb")
for chunk in page.iter_content(1000000):
pdfFile.write(chunk)
pdfFile.close()
I used os.path.basename() just to make sure the files would actually download. However, I want to know how to change the file name from 3D%20Printing%20Blueprints%20%5BeBook%5D.pdf to something like "3D Printing Blueprints.pdf"
You can use the urllib2 unquote function:
import urllib2
print urllib2.unquote("3D%20Printing%20Blueprints%20%5BeBook%5D.pdf") #3D Printing Blueprints.pdf
use this:
os.rename("3D%20Printing%20Blueprints%20%5BeBook%5D.pdf", "3D Printing Blueprints.pdf")
you can find more info here

Python urlretrieve downloading corrupted images

I am downloading a list of images (all .jpg) from the web using this python script:
__author__ = 'alessio'
import urllib.request
fname = "inputs/skyscraper_light.txt"
with open(fname) as f:
content = f.readlines()
for link in content:
try:
link_fname = link.split('/')[-1]
urllib.request.urlretrieve(link, "outputs_new/" + link_fname)
print("saved without errors " + link_fname)
except:
pass
In OSX preview I see the images just fine, but I can't open them with any image editor (for example Photoshop says "Could not complete your request because Photoshop does not recognize this type of file."), and when i try to attach them to a word document, the files are not even showed as picture files in the dialog for browsing for image.
What am i doing wrong?
As J.F. Sebastian suggested me in the comments, the issue was related to the newline in the filename.
To make my script work, you need to replace
link_fname = link.split('/')[-1]
with
link_fname = link.strip().split('/')[-1]

Convert html to pdf using Python/Flask

I want to generate pdf file from html using Python + Flask. To do this, I use xhtml2pdf. Here is my code:
def main():
pdf = StringIO()
pdf = create_pdf(render_template('cvTemplate.html', user=user))
pdf_out = pdf.getvalue()
response = make_response(pdf_out)
return response
def create_pdf(pdf_data):
pdf = StringIO()
pisa.CreatePDF(StringIO(pdf_data.encode('utf-8')), pdf)
return pdf
In this code file is generating on the fly. BUT! xhtml2pdf doesn't support many styles in CSS, because of this big problem to mark page correctly. I found another instrument(wkhtmltopdf). But when I wrote something like:
pdf = StringIO()
data = render_template('cvTemplate1.html', user=user)
WKhtmlToPdf(data.encode('utf-8'), pdf)
return pdf
Was raised error:
AttributeError: 'cStringIO.StringO' object has no attribute 'rfind'
And my question is how to convert html to pdf using wkhtmltopdf (with generating file on the fly) in Flask?
Thanks in advance for your answers.
The page need render, You can use pdfkit:
https://pypi.python.org/pypi/pdfkit
https://github.com/JazzCore/python-pdfkit
Example in document.
import pdfkit
pdfkit.from_url('http://google.com', 'out.pdf')
pdfkit.from_file('test.html', 'out.pdf')
pdfkit.from_string('Hello!', 'out.pdf') # Is your requirement?
Have you tried with Flask-WeasyPrint, which uses WeasyPrint? There are good examples in their web sites so I don't replicate them here.
Not sure if this would assist anyone but my issue was capturing Bootstrap5 elements as a pdf. pdfkit did not do so and heres a work around on windows using html2image and PIL. This is limited and does not take a full page screenshot.
from html2image import Html2Image
from PIL import Image
try:
hti.screenshot(html_file=C:\yourfilepath\file.html, save_as="test.png")
finally:
image1 = Image.open(r'C:\yourfilepath\test.png')
im1 = image1.convert('RGB')
im1.save(r'C:\yourfilepath\newpdf.pdf')

In Python, is there a way I can download all/some the image files (e.g. JPG/PNG) from a **Google Images** search result?

Is there a way I can download all/some the image files (e.g. JPG/PNG) from a Google Images search result?
I can use the following code to download one image that I already know its url:
import urllib.request
file = "Facts.jpg" # file to be written to
url = "http://www.compassion.com/Images/Hunger-Facts.jpg"
response = urllib.request.urlopen (url)
fh = open(file, "wb") #open the file for writing
fh.write(response.read()) # read from request while writing to file
To download multiple images, it has been suggested that I define a function and use that function to repeat the task for each image url that I would like to write to disk:
def image_request(url, file):
response = urllib.request.urlopen(url)
fh = open(file, "wb") #open the file for writing
fh.write(response.read())
And then loop over a list of urls with:
for i, url in enumerate(urllist):
image_request(url, str(i) + ".jpg")
However, what I really want to do is download all/some image files (e.g. JPG/PNG) from my own search result from Google Images without necessarily having a list of the image urls beforehand.
P.S.
Please I am a complete beginner and would favour an answer that breaks down the broad steps to achieve this over one that is bogs down on specific codes. Thanks.
You can use the Google API like this, where BLUE and DOG are your search parameters:
https://ajax.googleapis.com/ajax/services/search/images?v=1.0&q=BLUE%20DOG
There is a developer guide about this here:
https://developers.google.com/image-search/v1/jsondevguide
You need to parse this JSON format before you can use the links directly.
Here's a start to your JSON parsing:
import json
j = json.loads('{"one" : "1", "two" : "2", "three" : "3"}')
print(j['two'])

Categories