I'm having trouble trying to create Table of Contents objects in a PDF file. I'm not sure whether I've understood the process from Apple's limited documentation.
I'm using python, but cogent examples in any language are welcome to explain how it's supposed to work. The code creates a new PDF document, but there's no outline item visible in Preview. I've tried just using myOutline as the root object, but that doesn't work either.
pdfURL = NSURL.fileURLWithPath_(infile)
myPDF = Quartz.PDFDocument.alloc().initWithURL_(pdfURL)
if myPDF:
# Create Destination
myPage = myPDF.pageAtIndex_(1)
pagePoint = Quartz.CGPointMake(0,0)
myDestination = Quartz.PDFDestination.alloc().initWithPage_atPoint_(myPage, pagePoint)
# Create Outline
myOutline = Quartz.PDFOutline.alloc().init()
myOutline.setLabel_("Interesting")
myOutline.setDestination_(myDestination)
# Create a root Outline and add the first outline as a child
rootOutline = Quartz.PDFOutline.alloc().init()
rootOutline.insertChild_atIndex_(myOutline, 0)
# Add the root outline to the document and save
myPDF.setOutlineRoot_(rootOutline)
myPDF.writeToFile_(outfile)
EDIT: Actually, the outline IS getting saved to the new file: I can read it programmatically, and it appears in Acrobat as a Bookmark; however, it doesn't show up in Preview's Table of Contents (yes, I checked for the "Hide" thing). If I add another Bookmark in Acrobat, then both show up in Preview.
So I guess that either I'm still doing something wrong which doesn't quite 'finish' the PDFOutline data properly, and Acrobat is being kind; or there's a massive bug in PDFKit that means you can't write PDFOutlines properly. I get the same behaviour on Mountain Lion, FWIW.
This does appear to be a bug in Preview. It will not list the Table of Contents if it contains ONLY ONE child entry.
If I add more Outlines with the code above, then all of them appear in Preview. If use other software to remove all but one entries in the Table of Contents, then Preview will not show any.
Related
I would like to relink a Photoshop Smart Object to a new file using Python.
Here's a screenshot of the button that's used in Photoshop to perform this action - "Relink to File":
I've found some solutions in other programming languages but couldn't make them work in Python, here's one for example: Photoshop Scripting: Relink Smart Object
Editing Contents of a Smart Object would also be a good option, but I can't seem to figure that one out either.
Here's a screenshot of the button to Edit Contents of a Smart Object:
So far I have this:
import win32com.client
psApp = win32com.client.Dispatch('Photoshop.Application')
psDoc = psApp.Application.ActiveDocument
for layer in psDoc.layers:
if layer.kind == 17: # layer kind 17 is Smart Object
print(layer.name)
# here it should either "Relink to File" or "Edit Contents" of a Smart Object
I have figured out a workaround! I simply ran JavaScript in Python.
This is the code to Relink to File.... You could do a similar thing for Edit Contents but I haven't tried it yet, as relinking works better for me.
Keep in mind the new_img_path must be a raw string as far as I'm aware, for example:
new_img_path = r"C:\\Users\\miha\\someEpicPic.jpg"
import photoshop.api as ps
def js_relink(new_img_path):
jscode = r"""
var desc = new ActionDescriptor();
desc.putPath(stringIDToTypeID('null'), new File("{}"));
executeAction(stringIDToTypeID('placedLayerRelinkToFile'), desc, DialogModes.NO);
""".format(new_img_path)
JavaScript(jscode)
def JavaScript(js_code):
app = ps.Application()
app.doJavaScript(js_code)
I am currently doing a project to extract the contents of a PDF. The code runs smoothly and I am able to extract the text but the extracted text are not in the right order. The code extracts the text in a weird way. The order of the text is all over the place. It does not go from top to bottom and is really confusing.
I looked up online but there was very little help on how to order the text extraction. Most tutorials came up with the same result. For reference, this is the PDF that I am currently testing it on (page 5): https://www.pidm.gov.my/PIDM/files/13/134b5c79-5319-4199-ac68-99f62aca6047.pdf
import PyPDF2
with open('pdftest2.pdf', 'rb') as pdfTest:
reader = PyPDF2.PdfFileReader(pdfTest)
page5 = reader.getPage(4)
text = page5.extractText()
print(text)
The extracted text would always start with the footer of the page and then go its way from bottom to top. I noticed in the next page it would start from top to bottom but only for a few certain sentences. Then it would extract text from a different position of the page instead of continuing from where it left off.
All of the text does get extracted but the order of which it is extracted is all over the place. Is there any solution for this problem?
I had to deal with a problem that was similar and it turned out that the module pdfplumber worked better than PyPDF. I guess it depends on the document itself, you should try.
Otherwise another answer to your problem would be to treat the PDFs as images with the pdf2image module and extract the text within them using pytesseract. However it might not be perfect method as the pdf2image method convert_from_path can take quite a long time to run.
I drop some code down here if you are interested.
First of all make sure you install all necessary depedencies as well as Tesseract and ImageMagik. You can find any information regarding install on the website. If you are working with windows there's a good Medium article here.
To convert PDFs to images using pdf2image:
Don't forget to add your poppler path if you are working on windows. It should look like something like that r'C:\<your_path>\poppler-21.02.0\Library\bin'
def pdftoimg(fic,output_folder, poppler_path):
# Store all the pages of the PDF in a variable
pages = convert_from_path(fic, dpi=500,output_folder=output_folder,thread_count=9, poppler_path=poppler_path)
image_counter = 0
# Iterate through all the pages stored above
for page in pages:
filename = "page_"+str(image_counter)+".jpg"
page.save(output_folder+filename, 'JPEG')
image_counter = image_counter + 1
for i in os.listdir(output_folder):
if i.endswith('.ppm'):
os.remove(output_folder+i)
To extract text from the image:
Your tesseract path is going to be something like that: r'C:\Program Files\Tesseract-OCR\tesseract.exe'
def imgtotext(img, tesseract_path):
# Recognize the text as string in image using pytesserct
pytesseract.pytesseract.tesseract_cmd = tesseract_path
text = str(((pytesseract.image_to_string(Image.open(img)))))
text = text.replace('-\n', '')
return text
I recently started using PyMuPDF. It’s licensing is a little confusing but some of their methods have ways to correctly sort the text as it naturally appears (left to right, top to bottom). Something like page.get_text(“words”, sort=True) is all it takes.
I have a vectorized floorplan image. I want to identify the objects in the image through the vector data in the SVG file of that image. The SVG code does not have any close points(z) in between them. So I am unable to understand when does the point moves to the other object? Can somebody help me, please?
I have very little knowledge about these SVG files and using them in Tkinter. So please somebody help me or suggest me what can I do?
This is the vector data of the image.
vector data of the image
use in conjunction with SO floorplan question.
Jump to z_final_floorplan.svg for final file.
A
Create 4 files:
w_original_floorplan.svg
x_rough_static_floorplan.svg
y_rough_live_floorplan.svg
z_final_floorplan.svg
w_original_floorplan.svg and x_rough_static_floorplan.svg are identical apart from filename.
y_rough_live_floorplan.svg and z_final_floorplan.svg are empty; to be populated.
Copy x_rough_static_floorplan.svg to y_rough_live_floorplan.svg.
Open y_rough_live_floorplan.svg on browser using server.
x_rough_static_floorplan.svg find all M and replace with two newlines / symbol M (case sensitive). shift + enter shift + enter /M
B
[this section takes the time]
Take away 1st '/' in path in y_rough_live_floorplan.svg [shows blackout_floorplan]
Label x_rough_static_floorplan.svg code section blackout_floorplan where code is.
(this file is used as rough-work, so being xml / svg valid is irrelevant)
In y_rough_live_floorplan.svg find next '/' and delete it [shows floorplan_top_left_whiteout]
Label x_rough_static_floorplan.svg code section floorplan_top_left_whiteout where code is.
Have x_rough_static_floorplan.svg and y_rough_live_floorplan.svg open in 2 windows, will be going back and forth to each of them. Keep repeating until at end.
(hint: find tool seems to be on switching from files in vscode, so you can use find / and next one cmd + g easily) Maybe handy to have a paper printout of original svg as reference and label the names of objects you create e.g.bath, sink, table, as you go along (don’t be fooled by this, one table is 'table'. Is 2nd chair chair2, chair_2, chair_two etc.?) etc..
C
Reorder the whole labels and corresponding code in path x_rough_static_floorplan.svg so the labels are ordered next to each other, but in the order they are found in the path:
e.g.
…
floorplan
bath
sink
table_chairs
sofa
…
Use the 'find' tool here. This process, itself will require a temp file to copy and paste to rather than reorder within the file working on. And rewrite temp to file working on. Might be good idea to create checklist of objects and cross-off as done.
E.g. floorplan, bath, table_chairs, sink…
D
Create path elements from your grouped objects, putting each id as id=“floorplan_main”, id=“bath”, id=“sink” etc.. etc..
Bear in mind, the data of how this is drawn is really, really bad. Really they should be drawn with rect elements for a rectangle when possible and a lot of the path data is very unnecessary, but that’s obviously how the application generates the svg.
From stackoverflow I learned how to set image properties in LibreOffice Writer with pyhton macros, via com.sun.star.text.WrapTextMode. Now I use that to set the text wrap to THOUGHT. Now I would like to set the image to background, like a watermark.
In LibreOffice Writer interactively I select an image, right-click on it and the context menu contains the "Wrap" commands, one is "Wrap Through" and the other one is "In Background".
In the python macro I have the following code (from Insert several images at once using macro scripting in LibreOffice and from the often quoted Andrew Pitonyak):
from com.sun.star.text.WrapTextMode import THROUGHT
and then to insert the image:
img = doc.createInstance('com.sun.star.text.TextGraphicObject')
element_url = 'file://' + file_name
img.GraphicURL = element_url
img.Surround = THROUGHT
text.insertTextContent(cursor, img, False)
So what is the code to put it "In Background"?
MRI shows that setting "In Background" causes the Opaque attribute to be false. So add this to the code:
img.Opaque = False
By the way, Surround is deprecated. Try setting TextWrap to THROUGHT instead.
I'm trying to automate the creation of .docx files (WordML) with the help of python-docx (https://github.com/mikemaccana/python-docx). My current script creates the ToC manually with following loop:
for chapter in myChapters:
body.append(paragraph(chapter.text, style='ListNumber'))
Does anyone know of a way to use the "word built-in" ToC-function, which adds the index automatically and also creates paragraph-links to the individual chapters?
Thanks a lot!
The key challenge is that a rendered ToC depends on pagination to know what page number to put for each heading. Pagination is a function provided by the layout engine, a very complex piece of software built into the Word client. Writing a page layout engine in Python is probably not a good idea, definitely not a project I'm planning to undertake anytime soon :)
The ToC is composed of two parts:
the element that specifies the ToC placement and things like which heading levels to include.
the actual visible ToC content, headings and page numbers with dotted lines connecting them.
Creating the element is pretty straightforward and relatively low-effort. Creating the actual visible content, at least if you want the page numbers included, requires the Word layout engine.
These are the options:
Just add the tag and a few other bits to signal Word the ToC needs to be updated. When the document is first opened, a dialog box appears saying links need to be refreshed. The user clicks Yes and Bob's your uncle. If the user clicks No, the ToC title appears with no content below it and the ToC can be updated manually.
Add the tag and then engage a Word client, by means of C# or Visual Basic against the Word Automation library, to open and save the file; all the fields (including the ToC field) get updated.
Do the same thing server-side if you have a SharePoint instance or whatever that can do it with Word Automation Services.
Create an AutoOpen macro in the document that automatically runs the field update when the document is opened. Probably won't pass a lot of virus checkers and won't work on locked-down Windows builds common in a corporate setting.
Here's a very nice set of screencasts by Eric White that explain all the hairy details
Sorry for adding comments to an old post, but I think it may be helpful.
This is not my solution, but it has been found there: https://github.com/python-openxml/python-docx/issues/36
Thanks to https://github.com/mustash and https://github.com/scanny
from docx.oxml.ns import qn
from docx.oxml import OxmlElement
paragraph = self.document.add_paragraph()
run = paragraph.add_run()
fldChar = OxmlElement('w:fldChar') # creates a new element
fldChar.set(qn('w:fldCharType'), 'begin') # sets attribute on element
instrText = OxmlElement('w:instrText')
instrText.set(qn('xml:space'), 'preserve') # sets attribute on element
instrText.text = 'TOC \\o "1-3" \\h \\z \\u' # change 1-3 depending on heading levels you need
fldChar2 = OxmlElement('w:fldChar')
fldChar2.set(qn('w:fldCharType'), 'separate')
fldChar3 = OxmlElement('w:t')
fldChar3.text = "Right-click to update field."
fldChar2.append(fldChar3)
fldChar4 = OxmlElement('w:fldChar')
fldChar4.set(qn('w:fldCharType'), 'end')
r_element = run._r
r_element.append(fldChar)
r_element.append(instrText)
r_element.append(fldChar2)
r_element.append(fldChar4)
p_element = paragraph._p
Please see explanations in the code comments.
# First set directory where you want to save the file
import os
os.chdir("D:/")
# Now import required packages
import docx
from docx import Document
from docx.oxml.ns import qn
from docx.oxml import OxmlElement
# Initialising document to make word file using python
document = Document()
# Code for making Table of Contents
paragraph = document.add_paragraph()
run = paragraph.add_run()
fldChar = OxmlElement('w:fldChar') # creates a new element
fldChar.set(qn('w:fldCharType'), 'begin') # sets attribute on element
instrText = OxmlElement('w:instrText')
instrText.set(qn('xml:space'), 'preserve') # sets attribute on element
instrText.text = 'TOC \\o "1-3" \\h \\z \\u' # change 1-3 depending on heading levels you need
fldChar2 = OxmlElement('w:fldChar')
fldChar2.set(qn('w:fldCharType'), 'separate')
fldChar3 = OxmlElement('w:t')
fldChar3.text = "Right-click to update field."
fldChar2.append(fldChar3)
fldChar4 = OxmlElement('w:fldChar')
fldChar4.set(qn('w:fldCharType'), 'end')
r_element = run._r
r_element.append(fldChar)
r_element.append(instrText)
r_element.append(fldChar2)
r_element.append(fldChar4)
p_element = paragraph._p
# Giving headings that need to be included in Table of contents
document.add_heading("Network Connectivity")
document.add_heading("Weather Stations")
# Saving the word file by giving name to the file
name = "mdh2"
document.save(name+".docx")
# Now check word file which got created
# Select "Right-click to update field text"
# Now right click and then select update field option
# and then click on update entire table
# Now,You will find Automatic Table of Contents
#Mawg // Updating ToC
Had the same issue to update the ToC and googled for it. Not my code, but it works:
word = win32com.client.DispatchEx("Word.Application")
doc = word.Documents.Open(input_file_name)
doc.TablesOfContents(1).Update()
doc.Close(SaveChanges=True)
word.Quit()