Use PyPDF2 to detect Embedded Subset fonts in PDF

I have modified the following script, which uses PyPDF2 to traverse a PDF and determine whether it contains unembedded fonts. It successfully lists all the fonts in the PDF and identifies which of them are embedded. However, some PDFs contain fonts for which only the subset of the font actually used is embedded (see https://blogs.mtu.edu/gradschool/2010/04/27/how-to-determine-if-fonts-are-embedded/). How do you determine whether a font in a PDF is embedded as a subset? Thank you!
from pprint import pprint
import sys

import PyPDF2
from PyPDF2 import PdfFileReader

# Font descriptor keys that point at an embedded font program
fontkeys = set(['/FontFile', '/FontFile2', '/FontFile3'])

def walk(obj, fnt, emb):
    # Collect every font name in fnt and every embedded font name in emb
    if '/BaseFont' in obj:
        fnt.add(obj['/BaseFont'])
    elif '/FontName' in obj and fontkeys.intersection(set(obj)):
        emb.add(obj['/FontName'])
    for k in obj:
        value = obj[k]
        if hasattr(value, 'keys'):
            walk(value, fnt, emb)
        elif isinstance(value, PyPDF2.generic.ArrayObject):  # You can also do ducktyping here
            for i in value:
                if hasattr(i, 'keys'):
                    walk(i, fnt, emb)
    return fnt, emb

if __name__ == '__main__':
    fname = sys.argv[1]
    pdf = PdfFileReader(fname)
    fonts = set()
    embedded = set()
    for page in pdf.pages:
        obj = page.getObject()
        f, e = walk(obj['/Resources'], fonts, embedded)
        fonts = fonts.union(f)
        embedded = embedded.union(e)
    unembedded = fonts - embedded
    print('Font List')
    pprint(sorted(list(fonts)))
    if unembedded:
        print('\nUnembedded Fonts')
        pprint(unembedded)

By convention, the PostScript name of a subset font in a PDF file begins with a tag of the form XXXXXX+, where each 'X' is an uppercase ASCII letter (six letters followed by a plus sign).
See Section 5.5.3 (Font Subsets) of the PDF Reference Manual (version 1.7).
Additionally, the presence of a CharSet or CIDSet entry in the font descriptor can indicate a subset font (both entries are optional).
However, all of these are only conventions: there is no guaranteed way to be sure that a font which follows none of them is not actually a subset font.
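As a rough illustration of the naming convention, here is a minimal sketch that checks font names for the six-letter tag. The helper names looks_like_subset and split_subsets are made up for this example and are not part of PyPDF2; the sample names are invented, and a real script would pass in the /BaseFont or /FontName values it collected. Since this is only a convention, a negative result does not prove the font is fully embedded.

```python
import re

# Subset tag: exactly six uppercase ASCII letters followed by '+'.
# PDF name objects usually carry a leading '/', so it is tolerated here.
SUBSET_TAG = re.compile(r'^/?[A-Z]{6}\+')

def looks_like_subset(font_name):
    """Return True if font_name starts with a subset tag such as 'ABCDEF+'."""
    return bool(SUBSET_TAG.match(str(font_name)))

def split_subsets(font_names):
    """Partition names into (subset, other) sorted lists, by naming convention only."""
    subset = sorted(n for n in font_names if looks_like_subset(n))
    other = sorted(n for n in font_names if not looks_like_subset(n))
    return subset, other

# Invented sample names, standing in for values gathered from a PDF
subset, other = split_subsets({'/ABCDEF+Calibri', '/Helvetica', '/XYZabc+Arial'})
print('Subset fonts:', subset)
print('Other fonts:', other)
```

The pattern is anchored at the start of the name so that a '+' appearing later in a font name is not mistaken for a subset tag.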

Related

Python how to delete first line of word file and font.bold not working

I am writing a Python program that opens a docx file and writes text into it using "aspose.words", and I have two problems:
First, when I open the file, it starts with the sentence
"Evaluation Only. Created with Aspose.Words. Copyright 2003-2021 Aspose Pty Ltd."
and I want to delete that line after creating the file (I can delete it manually, so it is deletable).
Second, "font.bold = True" works just fine on English text, but it doesn't work on text in another language.
Does anyone know how I can solve these two problems? (This is the first time I am using "aspose.words".)
Here is my code:
import aspose.words as aw

def main():
    doc = aw.Document()
    builder = aw.DocumentBuilder(doc)
    writeDest(1, builder)
    doc.save("out.docx")

def writeDest(designation, builder):
    font = builder.font
    font.size = 12
    font.bold = True
    font.name = "David"
    paragraphFormat = builder.paragraph_format
    paragraphFormat.alignment = aw.ParagraphAlignment.RIGHT
    label = 'ייעוד: ' + str(designation)
    builder.write(label)
    builder.write("\n")
    font.bold = False
    if designation == 1:
        file = open('destenationTextFiles/1', encoding="utf8")
        for word in file:
            builder.write(word)
        builder.write('\n')
        font.bold = True
        builder.write(':תיקון ')
        builder.write("\n")
        font.bold = False
        file.close()
        file = open("destenationTextFiles/fixed1", encoding="utf8")
        for word in file:
            builder.write(word)
        file.close()

if __name__ == "__main__":
    main()

This message indicates that you are using Aspose.Words in evaluation mode. Please see the Aspose.Words article on evaluation version limitations to learn more.
To test Aspose.Words for Python without these limitations, you can request a temporary 30-day license.
To format right-to-left text, you should use the bidi font properties. For example, see the following Python code:
import aspose.words as aw

def main():
    doc = aw.Document()
    builder = aw.DocumentBuilder(doc)
    # Define a set of font settings for left-to-right text.
    builder.font.name = "Courier New"
    builder.font.size = 16
    builder.font.italic = False
    builder.font.bold = False
    builder.font.locale_id = 1033
    # Define another set of font settings for right-to-left text.
    builder.font.name_bi = "David"
    builder.font.size_bi = 24
    builder.font.italic_bi = True
    builder.font.bold_bi = True
    builder.font.locale_id_bi = 1037
    # We can use the Bidi flag to indicate whether the text we are about to add
    # with the document builder is right-to-left. When we add text with this flag set to true,
    # it will be formatted using the right-to-left set of font settings.
    builder.font.bidi = True
    builder.write("ברוך הבא")
    # Set the flag to false, and then add left-to-right text.
    # The document builder will format these using the left-to-right set of font settings.
    builder.font.bidi = False
    builder.write(" Hello world!")
    doc.save("C:\\Temp\\Font.Bidi.docx")

if __name__ == "__main__":
    main()

underline text with odfpy

I'd like to generate an ODF file with odfpy, and I am stuck on underlining text.
Here is a minimal example inspired by the official documentation, in which I can't find any information about which attributes can be used and where.
Any suggestions?
from odf.opendocument import OpenDocumentText
from odf.style import Style, TextProperties
from odf.text import H, P, Span
textdoc = OpenDocumentText()
ustyle = Style(name="Underline", family="text")
#uprop = TextProperties(fontweight="bold") #uncommented, this works well
#uprop = TextProperties(attributes={"fontsize":"26pt"}) #this either
uprop = TextProperties(attributes={"underline":"solid"}) # bad guess, wont work !!
ustyle.addElement(uprop)
textdoc.automaticstyles.addElement(ustyle)
p = P(text="Hello world. ")
underlinedpart = Span(stylename=ustyle, text="This part would like to be underlined. ")
p.addElement(underlinedpart)
p.addText("This is after the style test.")
textdoc.text.addElement(p)
textdoc.save("myfirstdocument.odt")

Here is how I finally got it to work:
I created a sample document with underlining in LibreOffice and unzipped it. In the styles.xml part of the extracted files, I found the part that produces the underlining in the document:
<style:style style:name="Internet_20_link" style:display-name="Internet link" style:family="text">
<style:text-properties fo:color="#000080" fo:language="zxx" fo:country="none" style:text-underline-style="solid" style:text-underline-width="auto" style:text-underline-color="font-color" style:language-asian="zxx" style:country-asian="none" style:language-complex="zxx" style:country-complex="none"/>
</style:style>
The interesting style attributes are named text-underline-style, text-underline-width and text-underline-color.
To use them in odfpy, the '-' characters must be removed, and the attribute keys must be passed as str (in quotes), as in the following code. The correct style family (text in our case) must be specified in the Style constructor call.
from odf.opendocument import OpenDocumentText
from odf.style import Style, TextProperties
from odf.text import H, P, Span
textdoc = OpenDocumentText()
#underline style
ustyle = Style(name="Underline", family="text") #here style family
uprop = TextProperties(attributes={
    "textunderlinestyle":"solid",
    "textunderlinewidth":"auto",
    "textunderlinecolor":"font-color"
})
ustyle.addElement(uprop)
textdoc.automaticstyles.addElement(ustyle)
p = P(text="Hello world. ")
underlinedpart = Span(stylename=ustyle, text="This part would like to be underlined. ")
p.addElement(underlinedpart)
p.addText("This is after the style test.")
textdoc.text.addElement(p)
textdoc.save("myfirstdocument.odt")

How to update table of contents in docx-file with python on linux?

I have a problem updating the table of contents in a docx file generated by python-docx on Linux. Creating the TOC itself is not difficult (thanks to this answer https://stackoverflow.com/a/48622274/9472173 and this thread https://github.com/python-openxml/python-docx/issues/36):
from docx.oxml.ns import qn
from docx.oxml import OxmlElement
paragraph = self.document.add_paragraph()
run = paragraph.add_run()
fldChar = OxmlElement('w:fldChar') # creates a new element
fldChar.set(qn('w:fldCharType'), 'begin') # sets attribute on element
instrText = OxmlElement('w:instrText')
instrText.set(qn('xml:space'), 'preserve') # sets attribute on element
instrText.text = r'TOC \o "1-3" \h \z \u' # raw string keeps the backslashes; change 1-3 depending on heading levels you need
fldChar2 = OxmlElement('w:fldChar')
fldChar2.set(qn('w:fldCharType'), 'separate')
fldChar3 = OxmlElement('w:t')
fldChar3.text = "Right-click to update field."
fldChar2.append(fldChar3)
fldChar4 = OxmlElement('w:fldChar')
fldChar4.set(qn('w:fldCharType'), 'end')
r_element = run._r
r_element.append(fldChar)
r_element.append(instrText)
r_element.append(fldChar2)
r_element.append(fldChar4)
p_element = paragraph._p
But to make the TOC visible, the fields must then be updated. The solution above requires updating them manually (right-click on the TOC hint and choose 'update fields'). For automatic updating, I found the following solution, which automates the Word application (thanks to this answer https://stackoverflow.com/a/34818909/9472173):
import win32com.client
import inspect, os

def update_toc(docx_file):
    word = win32com.client.DispatchEx("Word.Application")
    doc = word.Documents.Open(docx_file)
    doc.TablesOfContents(1).Update()
    doc.Close(SaveChanges=True)
    word.Quit()

def main():
    script_dir = os.path.dirname(os.path.abspath(inspect.getfile(inspect.currentframe())))
    file_name = 'doc_with_toc.docx'
    file_path = os.path.join(script_dir, file_name)
    update_toc(file_path)

if __name__ == "__main__":
    main()
It works well on Windows, but obviously not on Linux. Does anyone have ideas about how to provide the same functionality on Linux? My only idea is to use local URLs (anchors) to every heading, but I am not sure whether that is possible with python-docx, and I'm not very familiar with these OpenXML features. I would greatly appreciate any help.

I found a solution in this GitHub issue. It works on Ubuntu.
import lxml.etree
from docx import Document

def set_updatefields_true(docx_path):
    namespace = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"
    doc = Document(docx_path)
    # add child to doc.settings element
    element_updatefields = lxml.etree.SubElement(
        doc.settings.element, f"{namespace}updateFields"
    )
    element_updatefields.set(f"{namespace}val", "true")
    doc.save(docx_path)

import docx.oxml.ns as ns
from docx.oxml import OxmlElement

def update_table_of_contents(doc):
    # Find the settings element in the document
    settings_element = doc.settings.element
    # Create an "updateFields" element and set its "val" attribute to "true"
    update_fields_element = OxmlElement('w:updateFields')
    update_fields_element.set(ns.qn('w:val'), 'true')
    # Add the "updateFields" element to the settings element
    settings_element.append(update_fields_element)
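To make it clearer what both snippets add to the document settings, here is a minimal sketch that builds the same element with only the standard library and prints the resulting XML. It does not touch a real docx file; the bare settings element stands in for the one python-docx exposes as doc.settings.element. As far as I understand, the flag does not regenerate the TOC by itself: it asks the word processor to refresh fields when the file is opened.

```python
import xml.etree.ElementTree as ET

# WordprocessingML namespace, the same URI used in the answer above
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
ET.register_namespace("w", W_NS)

# Stand-in for the settings part of a docx file
settings = ET.Element(f"{{{W_NS}}}settings")

# <w:updateFields w:val="true"/> as a child of the settings element
update_fields = ET.SubElement(settings, f"{{{W_NS}}}updateFields")
update_fields.set(f"{{{W_NS}}}val", "true")

xml_text = ET.tostring(settings, encoding="unicode")
print(xml_text)
```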

How do I add a page background image in pylatex?

I have written the following code and passed the LaTeX commands for drawing the background through NoEscape.
I have an image reportbg.png in the same directory as the program, and I want it to appear as the background on every page of the report.
import glob

import natsort
from pylatex import Document, PageStyle, Section, Subsection, Figure, NoEscape

types = ('../Faults/*.png', '../Faults/*.jpg')
imgnames = []
for files in types:
    imgnames.extend(natsort.natsorted(glob.glob(files)))
geometry_options = {"head": "30pt",
                    "margin": "0.3in",
                    "top": "0.2in",
                    "bottom": "0.4in",
                    "includeheadfoot": True}
doc = Document(geometry_options=geometry_options)
first_page = PageStyle("firstpage")
doc.preamble.append(first_page)
doc.change_document_style("firstpage")
new_comm1 = NoEscape(r'\usepackage{wallpaper}')
doc.append(new_comm1)
new_comm2 = NoEscape(r'\CenterWallPaper{reportbg.png}')
doc.append(new_comm2)
with doc.create(Section('Faults identified')):
    doc.append("Report")
    with doc.create(Subsection('Fault pictures')):
        for i, imgname in enumerate(imgnames):
            with doc.create(Figure(position='h!')) as f_pic:
                f_pic.add_image(imgname, width='220px')
                f_pic.add_caption('Height: '+str(56)+', Angle: '+str(20))
        doc.append('Some regular text')
However, I got the following error:
! LaTeX Error: Can be used only in preamble.
See the LaTeX manual or LaTeX Companion for explanation.
Type H <return> for immediate help.
...
l.23 \usepackage
{wallpaper}%
! Undefined control sequence.
l.24 \CenterWallPaper
{reportbg.png}%
<../Faults/1.jpg, id=1, 1927.2pt x 1084.05pt> <use ../Faults/1.jpg>
<../Faults/2.jpg, id=2, 1927.2pt x 1084.05pt> <use ../Faults/2.jpg>
<../Faults/3.jpg, id=3, 1927.2pt x 1084.05pt> <use ../Faults/3.jpg>
<../Faults/4.jpg, id=4, 1003.75pt x 1003.75pt> <use ../Faults/4.jpg>
LaTeX Warning: '!h' float specifier changed to '!ht'.

To put a background image on every page of the document, you can first generate the PDF with pylatex and then add the image as a watermark with PyPDF2. To do so, you need your 'reportbg.png' image in PDF format (reportbg.pdf).
Here's a modified example based on the pylatex documentation (https://jeltef.github.io/PyLaTeX/current/examples/basic.html):
from pylatex import Document, Section, Subsection, Command
from pylatex.utils import italic, NoEscape
import PyPDF2

class Document_Watermark():
    def __init__(self, doc):
        self.doc = doc
        self.fill_document()
        self.create_document()
        self.Watermark()

    def fill_document(self):
        """Add a section, a subsection and some text to the document.
        :param doc: the document
        :type doc: :class:`pylatex.document.Document` instance
        """
        with self.doc.create(Section('A section')):
            self.doc.append('Some regular text and some ')
            self.doc.append(italic('italic text. '))
            with self.doc.create(Subsection('A subsection')):
                self.doc.append('Also some crazy characters: $&#{}')

    def create_document(self):
        # Add stuff to the document
        with self.doc.create(Section('A second section')):
            self.doc.append('Some text.')
        self.doc.generate_pdf('basic_maketitle2', clean_tex=False, compiler='pdflatex')
        tex = self.doc.dumps()  # The document as string in LaTeX syntax

    def Watermark(self):
        Doc = open('basic_maketitle2.pdf', 'rb')
        pdfReader = PyPDF2.PdfFileReader(Doc)
        pdfWatermark = PyPDF2.PdfFileReader(open('watermark3.pdf', 'rb'))
        pdfWriter = PyPDF2.PdfFileWriter()
        for pageNum in range(0, pdfReader.numPages):
            pageObj = pdfReader.getPage(pageNum)
            pageObj.mergePage(pdfWatermark.getPage(0))
            pdfWriter.addPage(pageObj)
        resultPdfFile = open('PDF_Watermark.pdf', 'wb')
        pdfWriter.write(resultPdfFile)
        Doc.close()
        resultPdfFile.close()

# Basic document
doc = Document('basic')
# Document with `\maketitle` command activated
doc = Document()
doc.preamble.append(Command('title', 'Awesome Title'))
doc.preamble.append(Command('author', 'Anonymous author'))
doc.preamble.append(Command('date', NoEscape(r'\today')))
doc.append(NoEscape(r'\maketitle'))
Document_Watermark(doc)
The example watermark is this one: watermark3.pdf
The initial PDF document: basic_maketitle2.pdf
The final document: PDF_Watermark.pdf
PS: The watermark, the initially generated PDF and the .py file must be in the same directory. I couldn't upload the PDF files, because this is my first answer and I'm not sure how to, but I have shared some images. I hope this is helpful.
For more information, I suggest reading chapter 13 of the book "Automate the Boring Stuff with Python" by Al Sweigart.

How to color specific part of the text with some separator

I'm trying to color specific parts of the text. I have tried:
if word.strip().startswith(":"):
    self.setAttributesForRange(NSColor.greenColor(), None, highlightOffset, len(word))
When someone types the : sign, the text gets colored green. That is good, but it keeps coloring the words after it, like this:
:Hello Hello :Hello <---- all of this gets colored green, but I want something like:
:Hello Hello :Hello <---- where everything gets colored except the middle "Hello", because it doesn't start with the : sign. Please help me out.
from Foundation import *
from AppKit import *
import objc

class PyObjC_HighlightAppDelegate(NSObject):
    # The connection to our NSTextView in the UI
    highlightedText = objc.IBOutlet()
    # Default font size to use when highlighting
    fontSize = 12

    def applicationDidFinishLaunching_(self, sender):
        NSLog("Application did finish launching.")

    def textDidChange_(self, notification):
        """
        Delegate method called by the NSTextView whenever the contents of the
        text view have changed. This is called after the text has changed and
        been committed to the view. See the Cocoa reference documents:
        http://developer.apple.com/documentation/Cocoa/Reference/ApplicationKit/Classes/NSText_Class/Reference/Reference.html
        http://developer.apple.com/documentation/Cocoa/Reference/ApplicationKit/Classes/NSTextView_Class/Reference/Reference.html
        Specifically the sections on Delegate Methods for information on additional
        delegate methods relating to text control is NSTextView objects.
        """
        # Retrieve the current contents of the document and start highlighting
        content = self.highlightedText.string()
        self.highlightText(content)

    def setAttributesForRange(self, color, font, rangeStart, rangeLength):
        """
        Set the visual attributes for a range of characters in the NSTextView. If
        values for the color and font are None, defaults will be used.
        The rangeStart is an index into the contents of the NSTextView, and
        rangeLength is used in combination with this index to create an NSRange
        structure, which is passed to the NSTextView methods for setting
        text attributes. If either of these values are None, defaults will
        be provided.
        The "font" parameter is used as an key for the "fontMap", which contains
        the associated NSFont objects for each font style.
        """
        fontMap = {
            "normal" : NSFont.systemFontOfSize_(self.fontSize),
            "bold" : NSFont.boldSystemFontOfSize_(self.fontSize)
        }
        # Setup sane defaults for the color, font and range if no values
        # are provided
        if color is None:
            color = NSColor.blackColor()
        if font is None:
            font = "normal"
        if font not in fontMap:
            font = "normal"
        displayFont = fontMap[font]
        if rangeStart is None:
            rangeStart = 0
        if rangeLength is None:
            rangeLength = len(self.highlightedText.string()) - rangeStart
        # Set the attributes for the specified character range
        range = NSRange(rangeStart, rangeLength)
        self.highlightedText.setTextColor_range_(color, range)
        self.highlightedText.setFont_range_(displayFont, range)

    def highlightText(self, content):
        """
        Apply our customized highlighting to the provided content. It is assumed that
        this content was extracted from the NSTextView.
        """
        # Calling the setAttributesForRange with no values creates
        # a default that "resets" the formatting on all of the content
        self.setAttributesForRange(None, None, None, None)
        # We'll highlight the content by breaking it down into lines, and
        # processing each line one by one. By storing how many characters
        # have been processed we can maintain an "offset" into the overall
        # content that we use to specify the range of text that is currently
        # being highlighted.
        contentLines = content.split("\n")
        highlightOffset = 0
        for line in contentLines:
            if line.strip().startswith("#"):
                # Comment - we want to highlight the whole comment line
                self.setAttributesForRange(NSColor.greenColor(), None, highlightOffset, len(line))
            elif line.find(":") > -1:
                # Tag - we only want to highlight the tag, not the colon or the remainder of the line
                startOfLine = line[0: line.find(":")]
                yamlTag = startOfLine.strip("\t ")
                yamlTagStart = line.find(yamlTag)
                self.setAttributesForRange(NSColor.blueColor(), "bold", highlightOffset + yamlTagStart, len(yamlTag))
            elif line.strip().startswith("-"):
                # List item - we only want to highlight the dash
                listIndex = line.find("-")
                self.setAttributesForRange(NSColor.redColor(), None, highlightOffset + listIndex, 1)
            # Add the processed line to our offset, as well as the newline that terminated the line
            highlightOffset += len(line) + 1

It all depends on what word is.
In [6]: word = ':Hello Hello :Hello'
In [7]: word.strip().startswith(':')
Out[7]: True
In [8]: len(word)
Out[8]: 19
Compare:
In [1]: line = ':Hello Hello :Hello'.split()
In [2]: line
Out[2]: [':Hello', 'Hello', ':Hello']
In [3]: for word in line:
   ...:     print(word.strip().startswith(':'))
   ...:     print(len(word))
   ...:
True
6
False
5
True
6
Notice the difference in len(word), which I suspect is causing your problem.
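Building on that observation, one way to get a separate range per word is to scan the text word by word and record where each colon-prefixed word starts. This is only a sketch: colon_word_ranges is a made-up helper name, and it returns plain (start, length) pairs that could then be passed to setAttributesForRange as rangeStart and rangeLength, adding highlightOffset when working line by line.

```python
import re

def colon_word_ranges(text):
    """Return (start, length) for every whitespace-delimited word that starts with ':'."""
    ranges = []
    for match in re.finditer(r'\S+', text):
        word = match.group()
        if word.startswith(':'):
            ranges.append((match.start(), len(word)))
    return ranges

print(colon_word_ranges(':Hello Hello :Hello'))  # prints [(0, 6), (13, 6)]
```

Using match.start() rather than str.find avoids landing on the wrong occurrence when the same word appears more than once in the text.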
