Add watermark to word document using Python - python

I want to add an image as a background/watermark to a new Word document using Python. I tried python-docx but couldn't find anything useful:
from docx import Document
from docx.shared import Inches
document = Document()
document.add_picture(r'D:\Python\Projects\raw_imgs\3b057d6199d95c4339ef532001cb20cd.jpg', width=Inches(6))
document.save('demo.docx')
The above code just inserts the image but I want to add it as the background image.

Aspose.Words Cloud SDK for Python can insert an image as a background into a DOC/DOCX file. Although it is a paid product, its free trial allows 150 free API calls per month.
# For complete examples and data files, please go to https://github.com/aspose-words-cloud/aspose-words-cloud-python
# Import module
import asposewordscloud
import asposewordscloud.models.requests
from shutil import copyfile
# Please get your Client ID and Secret from https://dashboard.aspose.cloud.
client_id='xxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxx'
client_secret='xxxxxxxxxxxxxxxxxxxxxxxxxxx'
words_api = asposewordscloud.WordsApi(client_id,client_secret)
words_api.api_client.configuration.host='https://api.aspose.cloud'
localFile = 'C:/Temp/Sections.docx'
imageFile= 'C:/Temp/Tulips.jpg'
outputFile= 'C:/Temp/Watermark.docx'
request = asposewordscloud.models.requests.InsertWatermarkImageOnlineRequest(document=open(localFile, 'rb'), image_file=open(imageFile, 'rb'))
result = words_api.insert_watermark_image_online(request)
copyfile(result.document, outputFile)

Related

Convert .pdf to .docx on Adobe pdf services API (using Python)

I'm trying to write a Python program that converts ".pdf" files to ".docx" files using the Adobe PDF Services API (free trial).
I've found documentation showing how to transform any ".pdf" file into a ".zip" file containing ".txt" files (the text data) and ".excel" files (the tabular data).
import logging
import os.path
from adobe.pdfservices.operation.auth.credentials import Credentials
from adobe.pdfservices.operation.exception.exceptions import ServiceApiException, ServiceUsageException, SdkException
from adobe.pdfservices.operation.pdfops.options.extractpdf.extract_pdf_options import ExtractPDFOptions
from adobe.pdfservices.operation.pdfops.options.extractpdf.extract_element_type import ExtractElementType
from adobe.pdfservices.operation.execution_context import ExecutionContext
from adobe.pdfservices.operation.io.file_ref import FileRef
from adobe.pdfservices.operation.pdfops.extract_pdf_operation import ExtractPDFOperation
logging.basicConfig(level=os.environ.get("LOGLEVEL", "INFO"))
try:
    # Get base path.
    base_path = os.path.dirname(os.path.dirname(os.path.dirname(os.path.abspath("C:/..link.../extractpdf/extract_txt_from_pdf.ipynb"))))
    # Initial setup, create credentials instance.
    credentials = Credentials.service_account_credentials_builder() \
        .from_file(base_path + "\\pdfservices-api-credentials.json") \
        .build()
    # Create an ExecutionContext using credentials and create a new operation instance.
    execution_context = ExecutionContext.create(credentials)
    extract_pdf_operation = ExtractPDFOperation.create_new()
    # Set operation input from a source file.
    source = FileRef.create_from_local_file(base_path + "/resources/trs_pdf_file.pdf")
    extract_pdf_operation.set_input(source)
    # Build ExtractPDF options and set them into the operation.
    extract_pdf_options: ExtractPDFOptions = ExtractPDFOptions.builder() \
        .with_element_to_extract(ExtractElementType.TEXT) \
        .with_element_to_extract(ExtractElementType.TABLES) \
        .build()
    extract_pdf_operation.set_options(extract_pdf_options)
    # Execute the operation.
    result: FileRef = extract_pdf_operation.execute(execution_context)
    # Save the result to the specified location.
    result.save_as(base_path + "/output/Extract_TextTableau_From_trs_pdf_file.zip")
except (ServiceApiException, ServiceUsageException, SdkException):
    logging.exception("Exception encountered while executing operation")
But I still can't get the conversion to a ".docx" file to work, even after renaming the extracted file to name.docx.
I read the documentation for adobe.pdfservices.operation.pdfops.options.extractpdf.extract_pdf_options.ExtractPDFOptions() but didn't find a way to tune the extraction and change the output from ".zip" to ".docx". What can I try next?
Unfortunately, right now the Python SDK only supports the Extract portion of our PDF services. As an alternative, you could use the services via the REST APIs (https://documentcloud.adobe.com/document-services/index.html#how-to-get-started-).

How to save a python docxTemplate as pdf quickly

At the moment I use a script to populate a template for each entry in our database and generate a docx file for each one. I then convert that docx file to a PDF and mail it to the user.
For this I use the following code:
from docxtpl import DocxTemplate
from docx2pdf import convert
pathToTemplate='template.docx'
outputPath='output.docx'
template = DocxTemplate(pathToTemplate)
context = person.get_context(short) # gets the context used to render the document
template.render(context)
template.save(outputPath)
pdfpath = outputPath[:-4]+'pdf'
convert(outputPath, pdfpath)
This part of the code is embedded in a loop. Generating the context from the database (in the person.get_context(short) function) and generating the docx file takes between 0.5s and 1.0s. Converting that docx to PDF takes between 5.0s and 7.0s.
Because the loop has to run over more than 1000 users, this difference adds up. Does anyone know how DocxTemplate can save to PDF directly (and how fast that is), or whether there is a faster way to generate the PDF files?
As far as I know, you can't do this with the docx library itself, but there is an alternative: convert the docx to PDF through Word's COM interface using the following code:
import os
import time
import pandas as pd
from docxtpl import DocxTemplate
from win32com import client
df = pd.read_excel("Data.xlsx")
# One context dict per spreadsheet row
contexts = df.to_dict(orient="records")
rod = os.path.dirname(os.path.abspath(__file__))
print(rod)
# Start Word once and reuse it for every conversion
word_app = client.Dispatch("Word.Application")
for context in contexts:
    tpl = DocxTemplate("Invoice_Template.docx")
    tpl.render(context)
    # Each iteration overwrites hello.docx/hello.pdf; vary the name per row to keep them all
    tpl.save(rod + "\\hello.docx")
    time.sleep(2)
    # Convert to PDF (FileFormat=17 is Word's PDF format)
    doc = word_app.Documents.Open(rod + "\\hello.docx")
    doc.SaveAs(rod + "\\hello.pdf", FileFormat=17)
    doc.Close()
word_app.Quit()
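If starting the converter once per file is the bottleneck, one option is to render all the documents first and convert them in a single batch. The sketch below assumes that docx2pdf's convert is given a folder, in which case it converts the .docx files inside it. The names docx_name, render_all, convert_folder and the file naming scheme are hypothetical, and the actual speed-up has not been measured here.

```python
from pathlib import Path


def docx_name(index: int) -> str:
    """Build a unique, sortable file name for the index-th document."""
    return f"entry_{index:04d}.docx"


def render_all(template_path, contexts, out_dir):
    """Render one .docx per context dict into out_dir and return their paths."""
    # Imported lazily so docx_name can be used without docxtpl installed.
    from docxtpl import DocxTemplate

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for index, context in enumerate(contexts):
        template = DocxTemplate(template_path)  # fresh template for each render
        template.render(context)
        path = out_dir / docx_name(index)
        template.save(str(path))
        paths.append(path)
    return paths


def convert_folder(out_dir):
    """Convert the .docx files in out_dir to PDF with a single docx2pdf call."""
    from docx2pdf import convert

    # Folder input: the PDFs are written next to the .docx files.
    convert(str(out_dir))
```

Calling render_all("template.docx", contexts, "out") followed by convert_folder("out") keeps the per-user loop limited to the fast rendering step.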

How to set thumbnail for mp3 using eyed3 python module?

I can't set image thumbnails for mp3 files using the eyed3 module in Python.
I tried the following script:
import eyed3
from eyed3.id3.frames import ImageFrame
th = 'url_to_my_pic'
file = 'to_mp3_pleer/file.mp3'
audiofile = eyed3.load(file)
audiofile.initTag()
audiofile.tag.frames = ImageFrame(image_url=th)
audiofile.tag.save()
But this does nothing to the thumbnail in my file.
I couldn't find any information on Google about setting thumbnails using eyed3. How can I set it?
After several hours of learning eyeD3, googling, and experimenting with file covers, I think I have a solution for you.
You need to follow these rules:
use ID3v2.3 (not v2.4 as by default in eyeD3);
add right description for cover image (word cover);
pass image as binary;
Here is an example that works fine on my Windows 10 machine (it should work on other platforms as well):
import eyed3
import urllib.request
audiofile = eyed3.load("D:\\tmp\\tmp_mp3\\track_example.mp3")
audiofile.initTag(version=(2, 3, 0)) # version is important
# Other data for demonstration purpose only (from docs)
audiofile.tag.artist = "Token Entry"
audiofile.tag.album = "Free For All Comp LP"
audiofile.tag.album_artist = "Various Artists"
audiofile.tag.title = "The Edge"
# Read image from local file (for demonstration and future readers)
with open("D:\\tmp\\tmp_covers\\cover_2021-03-13.jpg", "rb") as image_file:
    imagedata = image_file.read()
audiofile.tag.images.set(3, imagedata, "image/jpeg", u"cover")
audiofile.tag.save()
# Get image from the Internet
response = urllib.request.urlopen("https://example.com/your-picture-here.jpg")
imagedata = response.read()
audiofile.tag.images.set(3, imagedata, "image/jpeg", u"cover")
audiofile.tag.save()
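The MIME type passed to images.set should match the actual image format. A minimal sketch, assuming the format can be inferred from the file extension; guess_cover_mime and set_front_cover are hypothetical helper names, and 3 is the picture type used for the cover in the code above.

```python
import mimetypes


def guess_cover_mime(image_path: str) -> str:
    """Return the image MIME type implied by the file extension."""
    mime, _ = mimetypes.guess_type(image_path)
    if mime is None or not mime.startswith("image/"):
        raise ValueError(f"not a recognised image file: {image_path}")
    return mime


def set_front_cover(mp3_path: str, image_path: str) -> None:
    """Embed image_path as the front cover of mp3_path using an ID3v2.3 tag."""
    import eyed3  # imported lazily; requires the eyeD3 package

    audiofile = eyed3.load(mp3_path)
    if audiofile.tag is None:
        audiofile.initTag(version=(2, 3, 0))
    with open(image_path, "rb") as image_file:
        image_data = image_file.read()
    # 3 is the ID3 picture type for "front cover"
    audiofile.tag.images.set(3, image_data, guess_cover_mime(image_path), "cover")
    audiofile.tag.save(version=(2, 3, 0))
```

For example, set_front_cover("track.mp3", "cover.png") would store the picture with the MIME type image/png instead of a hard-coded image/jpeg.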

Extracting text from scanned PDF without saving the scan as a new file image

I would like to extract text from scanned PDFs.
My "test" code is as follows:
from pdf2image import convert_from_path
from pytesseract import image_to_string
from PIL import Image
converted_scan = convert_from_path('test.pdf', 500)
for i in converted_scan:
    i.save('scan_image.png', 'png')
text = image_to_string(Image.open('scan_image.png'))
with open('scan_text_output.txt', 'w') as outfile:
    outfile.write(text.replace('\n\n', '\n'))
I would like to know if there is a way to extract the content of the image directly from the object converted_scan, without saving the scan as a new "physical" image file on the disk?
Basically, I would like to skip this part:
for i in converted_scan:
    i.save('scan_image.png', 'png')
I have a few thousand scans to extract text from. Although the generated image files are not particularly large, the total is not negligible, and writing them all to disk seems like overkill.
EDIT
Here's a slightly different, more compact approach than Colonder's answer, based on this post. For .pdf files with many pages, it might be worth adding a progress bar to each loop using e.g. the tqdm module.
from wand.image import Image as w_img
from PIL import Image as p_img
import pyocr.builders
import regex, pyocr, io
infile = 'my_file.pdf'
# The tools are returned in the recommended order of usage
tools = pyocr.get_available_tools()
tool = tools[0]
req_image = []
txt = ''
# to convert pdf to img and extract text
with w_img(filename = infile, resolution = 200) as scan:
    image_png = scan.convert('png')
    for i in image_png.sequence:
        img_page = w_img(image = i)
        req_image.append(img_page.make_blob('png'))
for i in req_image:
    content = tool.image_to_string(
        p_img.open(io.BytesIO(i)),
        lang = tool.get_available_languages()[0],
        builder = pyocr.builders.TextBuilder()
    )
    txt += content
# to save the output as a .txt file
with open(infile[:-4] + '.txt', 'w') as outfile:
    full_txt = regex.sub(r'\n+', '\n', txt)
    outfile.write(full_txt)
UPDATE MAY 2021
I realized that although pdf2image simply calls a subprocess, you don't have to save the images to OCR them afterwards. You can simply do the following (pytesseract works as the OCR library as well; tool and lang are set up as in the script further down):
from pdf2image import convert_from_path
for img in convert_from_path("some_pdf.pdf", 300):
    txt = tool.image_to_string(img,
                               lang=lang,
                               builder=pyocr.builders.TextBuilder())
EDIT: you can also try the pdftotext library.
pdf2image is a simple wrapper around pdftoppm and pdftocairo. Internally it does nothing more than call a subprocess. This script should do what you want, but you need the wand library as well as pyocr (I think this is a matter of preference, so feel free to use any text-extraction library you want).
from PIL import Image as Pimage, ImageDraw
from wand.image import Image as Wimage
import sys
import numpy as np
from io import BytesIO
import pyocr
import pyocr.builders
def _convert_pdf2jpg(in_file_path: str, resolution: int=300) -> Pimage:
    """
    Convert PDF file to JPG
    :param in_file_path: path of pdf file to convert
    :param resolution: resolution with which to read the PDF file
    :return: PIL Image
    """
    with Wimage(filename=in_file_path, resolution=resolution).convert("jpg") as all_pages:
        for page in all_pages.sequence:
            with Wimage(page) as single_page_image:
                # transform wand image to bytes in order to transform it into PIL image
                yield Pimage.open(BytesIO(bytearray(single_page_image.make_blob(format="jpeg"))))
tools = pyocr.get_available_tools()
if len(tools) == 0:
    print("No OCR tool found")
    sys.exit(1)
# The tools are returned in the recommended order of usage
tool = tools[0]
print("Will use tool '%s'" % (tool.get_name()))
# Ex: Will use tool 'libtesseract'
langs = tool.get_available_languages()
print("Available languages: %s" % ", ".join(langs))
lang = langs[0]
print("Will use lang '%s'" % (lang))
# Ex: Will use lang 'fra'
# Note that languages are NOT sorted in any way. Please refer
# to the system locale settings for the default language
# to use.
for img in _convert_pdf2jpg("some_pdf.pdf"):
    txt = tool.image_to_string(img,
                               lang=lang,
                               builder=pyocr.builders.TextBuilder())
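Since convert_from_path already returns in-memory PIL images, they can be handed to the OCR call directly. Below is a sketch that uses pytesseract, as in the question. The names ocr_pdf and collapse_blank_lines are hypothetical, the 300 dpi default is an assumption, and both pdf2image and pytesseract need their external tools (poppler and tesseract) installed.

```python
import re


def collapse_blank_lines(text: str) -> str:
    """Replace every run of newlines with a single newline."""
    return re.sub(r"\n+", "\n", text)


def ocr_pdf(pdf_path: str, dpi: int = 300) -> str:
    """OCR each page of a scanned PDF without writing image files to disk."""
    # Imported lazily so collapse_blank_lines works without these packages.
    from pdf2image import convert_from_path
    from pytesseract import image_to_string

    pages = convert_from_path(pdf_path, dpi)  # list of in-memory PIL images
    page_texts = [image_to_string(page) for page in pages]
    return collapse_blank_lines("\n".join(page_texts))
```

The result of ocr_pdf("test.pdf") can then be written to a text file in one go, and no intermediate PNG is created.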

Append barcode images to docx file using pybarcode ImageWriter and docx module

How can I reduce the size of the image generated by pybarcode's ImageWriter, and how can I append multiple images to a docx file with proper alignment?
I read about the dpi option for ImageWriter but can't work out how to use it.
import barcode
from barcode.writer import ImageWriter
from docx import *
if __name__ == '__main__':
    ean = barcode.get_barcode('ean', '123456789102', writer=ImageWriter())
    ean.default_writer_options['module_height'] = 3.0
    ean.default_writer_options['module_width'] = 0.1
    filename = ean.save('bar_image')
    # Default set of relationships - these are the minimum components of a document
    relationships = relationshiplist()
    # Make a new document tree - this is the main part of a Word document
    document = newdocument()
    # This xpath location is where most interesting content lives
    docbody = document.xpath('/w:document/w:body', namespaces=nsprefixes)[0]
    # Add an image
    relationships, picpara = picture(relationships, filename, 'This is a test description')
    docbody.append(picpara)
    # Create our properties, contenttypes, and other support files
    coreprops = coreproperties(title='Python docx demo', subject='A practical example of making docx from Python', creator='Mike MacCana', keywords=['python', 'Office Open XML', 'Word'])
    appprops = appproperties()
    contenttypes = contenttypes()
    websettings = websettings()
    wordrelationships = wordrelationships(relationships)
    # Save our document
    savedocx(document, coreprops, appprops, contenttypes, websettings, wordrelationships, 'sample_barcode.docx')
Generally, barcode.writer doesn't provide a parameter for the size of the generated image, so you can use PIL to resize it. Precise alignment is hard to achieve in code, but you can try placing the images in a table to get them in the right position.
With PIL, you can resize the PNG image to (480,320) like this:
from PIL import Image
im = Image.open("barcode.png")
im.resize((480,320)).save("barcode_resized.png")
For the docx file, a table example is available here; decide what alignment you need and then write the code accordingly.
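Resizing to a fixed (480,320) can distort the bars if the original image has a different aspect ratio. A minimal sketch that scales to a target width and derives the height from it; scaled_size and resize_to_width are hypothetical helper names, not part of PIL or pybarcode.

```python
def scaled_size(size, target_width):
    """Return (width, height) scaled to target_width, keeping the aspect ratio."""
    width, height = size
    if width <= 0 or target_width <= 0:
        raise ValueError("widths must be positive")
    return target_width, max(1, round(height * target_width / width))


def resize_to_width(src_path, dst_path, target_width):
    """Resize the image at src_path to target_width and save it to dst_path."""
    from PIL import Image  # imported lazily; requires Pillow

    with Image.open(src_path) as im:
        im.resize(scaled_size(im.size, target_width)).save(dst_path)
```

For instance, resize_to_width("barcode.png", "barcode_resized.png", 480) halves a 960-pixel-wide barcode in both dimensions.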
