MS Word Manipulation using Python - python

I have an MS Word document and I want to apply the following settings automatically using Python:
1.Font type = Trebuchet MS
2.Font Size = 11
3.Table and Appendices font Size = 10
4.Line spacing = multiple of 1.15
5.Space before paragraph = 0
6.Space after paragraph = 0
7.Paragraph should be Justified
8.Number should be aligned to the bottom right corner of document
9.Text in the table should be Justified
10.Insert footer and header automatically
11.Tool should ensure that a document with a single page shall not have a page number.
12.Tool should ensure that for documents exceeding one page, page numbers shall be inserted at the right-hand side of the document.
13.Tool should ensure that the cover page shall not be assigned a page number.
14.Tool should ensure that Roman numerals (i, ii, iii, ...) are used only on the preliminary pages, including table of contents, preface, abbreviations, list of tables, executive summary
15.Tool should ensure that Arabic numerals (1, 2, 3, ...) are used only for the main text of the report and appendices.
16.Tool should ensure that a year range in any document is written in full for the preceding year and with the last two digits for the current year. For example, instead of writing 2020/2021, write 2020/21
17.Tool should ensure that only English (United Kingdom) spellings are used and NOT English (United States). Example: "analyse" (English UK) vs "analyze" (English US).
18.Tool should ensure that numbers presented in a paragraph that are less than 10 are written in words (one, two, three, ...).
19.Tool should ensure that numbers presented in a paragraph that are 10 and above are written in numerals.
20.Tool should ensure that Numbers within the table must be expressed in numerals even if they are less than 10.
This is my code so far
from docx import Document
from docx.shared import Pt
from docx.shared import Inches
from docx.enum.text import WD_ALIGN_PARAGRAPH
from docx.shared import Length
path = 'C:\\Users\\Gaston\\Documents\\Words\\test.docx'
doc = Document(path)
style = doc.styles['Normal']
font = style.font
font.name = 'Trebuchet MS'
font.size = Pt(11)
paragraph = doc.add_paragraph()
paragraph_format = paragraph.paragraph_format
paragraph_format.alignment = WD_ALIGN_PARAGRAPH.JUSTIFY
paragraph_format.right_indent = Inches(1)
paragraph_format.space_before = Pt(0)
paragraph_format.space_after = Pt(0)
paragraph_format.line_spacing = 1.15  # a float means a multiple of the line height; a Length would be read as a fixed height
doc.save(path)
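The text rules in the list (16, 18 and 19) can be prototyped separately from the Word handling. The following is only a minimal sketch using the standard library; the function names `shorten_year_ranges` and `spell_small_numbers` are made up for illustration, and the regular expressions assume that a "number in a paragraph" is a standalone digit that is not part of a decimal, a date or a longer number.

```python
import re

# Words for the numbers below 10 (rule 18)
NUMBER_WORDS = ["zero", "one", "two", "three", "four",
                "five", "six", "seven", "eight", "nine"]

def shorten_year_ranges(text):
    """Rewrite 2020/2021 as 2020/21 when the second year directly follows the first."""
    def repl(match):
        first, second = match.group(1), match.group(2)
        if int(second) == int(first) + 1:
            return f"{first}/{second[2:]}"
        return match.group(0)  # not consecutive years: leave unchanged
    return re.sub(r"\b(\d{4})/(\d{4})\b", repl, text)

def spell_small_numbers(text):
    """Spell out standalone single digits; numbers of 10 and above are not touched."""
    pattern = r"(?<![\w.,/])(\d)(?![\w/]|[.,]\d)"
    return re.sub(pattern, lambda m: NUMBER_WORDS[int(m.group(1))], text)

sample = "In 2020/2021 we tested 3 units and 12 spares."
cleaned = spell_small_numbers(shorten_year_ranges(sample))
print(cleaned)
```

With python-docx, such functions would be applied to the text of `doc.paragraphs` only, since that collection does not include paragraphs inside table cells (rule 20). Note that assigning to `paragraph.text` replaces the paragraph's runs, so character formatting inside the paragraph may be lost.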

Related

python-docx trailing whitespaces not showing correctly

Goal
I am trying to add a text to a table cell where the text is a combination of 2 strings and the space between the strings of variable size so that the final text has the same length and it appears as if the second string is right aligned.
I can either use format or ljust to combine the strings in python.
period = "from Monday to Friday"
item_text = "Some txt"
item_text2 = "Some other txt"
t1 = "t1: {:<30}{:0}".format(item_text,period)
t2 = "t2: {:<30}{:0}".format(item_text2,period)
t3 = f"t3: {item_text.ljust(30)}{period}"
t4 = f"t4: {item_text2.ljust(30)}{period}"
from pprint import pprint
pprint(t1)
pprint(t2)
pprint(t3)
pprint(t4)
Text in python with variable space length between strings
However, if I add this text to a docx table, the space between the strings changes.
from docx import Document
from docx.shared import Cm
doc = Document()
# Creating a table object
table = doc.add_table(rows=2, cols=2, style="Table Grid")
table.rows[0].cells[0].text = f"{item_text.ljust(30)}{period}"
table.rows[1].cells[0].text = f"{item_text2.ljust(30)}{period}"
def set_col_widths(table):
    # Column widths in centimetres: first column 15, second column 8
    widths = tuple(Cm(val) for val in [15, 8])
    for row in table.rows:
        for idx, width in enumerate(widths):
            row.cells[idx].width = width
set_col_widths(table)
doc.save("test_whitespace.docx")
Text in word. Space between strings changed.
Note
I am aware that I could add a table to the table cell and left adjust the left and right adjust the right but that seems like way more code to write.
Question
Why is the spacing changing in the word document and how can I create the text differently to get the desired goal?

How is the text from this pdf encoded?

I have some PDFs with data about machine parts and I am trying to extract sizes. I extracted the text from a PDF via pypdfium2.
import pypdfium2 as pdfium
pdf = pdfium.PdfDocument("myfile.pdf")
page=pdf[1]
textpage = page.get_textpage()
Most of the text is readable but for some reason the important data is not readable when extracted.
In the extracted string the relevant part is like this
Readable text \r\n\x13\x0c\x10 \x18\x0c\x18 \x0b\x10\x0e\x10\x15\x18\x0f\x10 \x15\x0c\x10 \x14\x0c\x10 \x14\x0c\x15 readable text
I also tried tika and PyMuPDF. They only give me the question mark character for those parts.
I know the mangled part (\r\n\x13\x0c\x10 \x18\x0c\x18 \x0b\x10\x0e\x10\x15\x18\x0f\x10 \x15\x0c\x10 \x14\x0c\x10 \x14\x0c\x15) should be 3,0 8,8 +0,058/0 5,0 4,0 4,5.
My current idea is to make my own encoding table, but I wanted to ask if there is a better method and if this looks familiar to someone.
I have about 52 files with around 200 occurrences each.
While the PDFs are not confidential, I don't want to post links because they are not my intellectual property.
Update------------------------------
I tried to find out more about the fonts.
from pdfreader import PDFDocument
fd = open("myfile", "rb")
doc = PDFDocument(fd)
page = next(doc.pages())
font_keys=sorted(page.Resources.Font.keys())
for font_key in font_keys:
    font = page.Resources.Font[font_key]
    print(f"{font_key}: {font.Subtype}, {font.BaseFont}, {font.Encoding}")
gives:
R13: Type0, UHIIUQ+MetaPlusBold-Roman-Identity-H, Identity-H
R17: Type0, EWGLNL+MetaPlusBold-Caps-Identity-H, Identity-H
R20: Type1, NRVKIY+Meta-LightLF, {'Type': 'Encoding', 'BaseEncoding': 'WinAnsiEncoding', 'Differences': [33, 'agrave', 'degree', 39, 'quoteright', 177, 'endash']}
R24: Type0, IKRCND+MetaPlusBold-Italic-Identity-H, Identity-H
-Edit------
I am not interested in help translating it manually. I can do that by myself. I am interested in a solution that works by script, for example a script that extracts fonts with their code maps from the PDF and then uses those to translate the unreadable parts.
This is a not uncommon CID CMAP substitution, shown here in Python notation. It is usually specific to a single font whose name carries a random six-letter ID prefix, e.g. UHIIUQ+Font name,
and is often found with subset fonts that contain only a limited range of characters.
should be 3,0 8,8 +0,058/0 5,0 4,0 4,5
\r\n = CR LF (Windows line ending, \x0d\x0a)
\x13 has been mapped to 3
\x0c has been mapped to ,
\x10 has been mapped to 0
(literal nbsp)
\x18 = 8
\x0c = ,
\x18 = 8
(literal nbsp)
\x0b has been mapped to +
\x10 = 0
\x0e has been mapped to , (very odd see \x0c)
\x10 = 0
\x15 = 5
\x18 = 8
\x0f has been mapped to /
\x10 = 0
(literal nbsp)
\x15 etc......................
\x0c
\x10
\x14
\x0c
\x10
\x14
\x0c
\x15
So the \x0# codes are low-order control codes standing in for punctuation,
and the \x1# codes are digits.
It is unknown whether \x2# codes are used for letters; the CMAP table should be queried for the full details.
\x0e has been mapped to a comma, which is very odd given that \x0c is already a comma.
Since it is a different code, I suspect it should instead be the decimal separator dot.
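Every mapping listed above (\x13 to 3, \x0c to a comma, \x0b to +, \x0f to /) is consistent with simply adding 0x20 to the code, and under that rule \x0e becomes a dot rather than a comma. The following is only a sketch of that observation; the function name `decode_shifted` is made up, the offset is inferred from this one sample, and the font's ToUnicode CMap remains the authoritative source.

```python
def decode_shifted(text, offset=0x20):
    """Shift control-range codes up by `offset`; keep CR, LF and normal characters."""
    keep = {"\r", "\n"}  # assumed to be real line breaks, not glyph codes
    return "".join(
        ch if ch in keep or ord(ch) >= 0x20 else chr(ord(ch) + offset)
        for ch in text
    )

mangled = ("\x13\x0c\x10 \x18\x0c\x18 \x0b\x10\x0e\x10\x15\x18\x0f\x10 "
           "\x15\x0c\x10 \x14\x0c\x10 \x14\x0c\x15")
decoded = decode_shifted(mangled)
print(decoded)
```

The decoded string matches the expected values except that the third group reads +0.058/0, which supports the suspicion that \x0e is a dot.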
Here is example code to get the source of a font's CMAP with PyMuPDF:
import fitz
doc = fitz.open("some.pdf")
# assume that we know a font's xref already
# extract the xref of its CMAP:
cmap_xref = doc.xref_get_key(xref, "ToUnicode")[1] # second string is 'nnn 0 R'
if cmap_xref.endswith("0 R"): # check if a CMAP exists at all
    cxref = int(cmap_xref.split()[0])
else:
    raise ValueError("no CMAP found")
print(doc.xref_stream(cxref).decode()) # convert bytes to string
/CIDInit /ProcSet findresource begin
12 dict begin
begincmap
/CMapType 2 def
/CMapName/R63 def
1 begincodespacerange
<00><ff>
endcodespacerange
12 beginbfrange
<20><20><0020>
<2e><2e><002e>
<30><31><0030>
<43><46><0043>
<49><49><0049>
<4c><4d><004c>
<4f><50><004f>
<61><61><0061>
<63><69><0063>
<6b><70><006b>
<72><76><0072>
<78><79><0078>
endbfrange
endcmap
CMapName currentdict /CMap defineresource pop
end end
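Once the CMAP source is available as text, the bfrange entries can be turned into a lookup table. The following is a minimal sketch using only the standard library; the function name `parse_bfranges` is made up, and it assumes single-code-point destinations, ignoring bfchar sections and the array form of bfrange.

```python
import re

HEX = r"<([0-9a-fA-F]+)>"

def parse_bfranges(cmap_text):
    """Build a {code: character} dict from the bfrange sections of a ToUnicode CMap."""
    mapping = {}
    for block in re.findall(r"beginbfrange(.*?)endbfrange", cmap_text, re.S):
        for lo, hi, dst in re.findall(HEX + r"\s*" + HEX + r"\s*" + HEX, block):
            start, end, target = int(lo, 16), int(hi, 16), int(dst, 16)
            # Each code in start..end maps to consecutive code points from target
            for step in range(end - start + 1):
                mapping[start + step] = chr(target + step)
    return mapping

sample = """
3 beginbfrange
<20><20><0020>
<30><31><0030>
<63><69><0063>
endbfrange
"""
table = parse_bfranges(sample)
print(table[0x31], table[0x69])
```

Raw character codes taken from the page content can then be translated with something like `"".join(table.get(code, "?") for code in codes)`.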

How to set tab spacing in shape/paragraph with python-pptx?

I am setting up a final slide with a table of contents containing all slide titles and the related slide numbers from the presentation.
In this case I work with python-pptx 0.6.18 and Python 3.7. So far I've managed to separate the title and page number with a tab character; however, I don't know where to look to set the tab spacing for those characters.
from pptx import Presentation
from pptx.util import Inches, Cm, Pt
path_to_presentation = 'your/path/to/file.pptx'
prs = Presentation(path_to_presentation)
list_of_titles = []
list_of_slide_pages = []
...
# some code populating both above mentioned lists
...
# create new slide
tslide_layout = prs.slide_layouts[1]
toc_slide = prs.slides.add_slide(tslide_layout)
# add content to TOC slide
toc_slide.shapes.title.text = 'Table of contents'
for numer, title in enumerate(list_of_titles):
    paragraph = toc_slide.shapes[1].text_frame.paragraphs[numer]
    paragraph.text = title+'\t'+str(list_of_slide_pages[numer])
    paragraph.level = 0
    paragraph.runs[0].font.size = Pt(18)
    toc_slide.shapes[1].text_frame.add_paragraph()
# save presentation
prs.save('your/path/to/file_with_TOC.pptx')
I am looking for parameters to set distance and alignment for tab stops in this shape/text_frame/paragraph or any other trick to elegantly bypass these parameters in a different way giving the desired final result. Any help or advice will be appreciated.

Python docx library text align

I am using the python-docx library to manipulate a Word document. However, I can't find how to center-align a line in that library's documentation, and I couldn't find it through Google either.
from docx import Document
document = Document()
p = document.add_paragraph('A plain paragraph having some ')
p.add_run('bold').bold = True
p.add_run(' and some ')
p.add_run('italic.').italic = True
How can I align the text in docx?
With the new version of python-docx 0.7 https://github.com/python-openxml/python-docx/commit/158f2121bcd2c58b258dec1b83f8fef15316de19
Add feature #51: Paragraph.alignment (read/write)
Now it is possible to align a paragraph as here: http://python-docx.readthedocs.org/en/latest/dev/analysis/features/par-alignment.html
paragraph = document.add_paragraph("This is a text")
paragraph.alignment = 0 # for left, 1 for center, 2 right, 3 justify ....
edit from comments
actually it is 0 for left, 1 for center, 2 for right
edit 2 from comments
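You shouldn't hard-code magic numbers like this. Use WD_ALIGN_PARAGRAPH.CENTER to get the correct value for centering, etc. To do this, use the following import and assign the constant to paragraph.alignment:
from docx.enum.text import WD_ALIGN_PARAGRAPH
paragraph.alignment = WD_ALIGN_PARAGRAPH.CENTER
p = document.add_paragraph('A plain paragraph having some ', style='BodyText', breakbefore=False, jc='left')  # jc: paragraph alignment; possible values: left, center, right, both (justified), ...
For reference, see def paragraph in the documentation. Note that the breakbefore and jc keywords appear to come from the paragraph() function of the older docx module; add_paragraph() in current python-docx accepts only text and style.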

How to add space between lines within a single paragraph with Reportlab

I have a block of text that is dynamically pulled from a database and is placed in a PDF before being served to a user. The text is being placed onto a lined background, much like notepad paper. I want to space the text so that only one line of text is between each background line.
I was able to use the following code to create a vertical spacing between paragraphs (used to generate another part of the PDF).
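from io import BytesIO
from itertools import chain
from reportlab.lib.enums import TA_JUSTIFY
from reportlab.lib.styles import getSampleStyleSheet
from reportlab.platypus import BaseDocTemplate, Frame, PageTemplate, Paragraph
style = getSampleStyleSheet()['Normal']
style.fontName = 'Helvetica'
style.spaceAfter = 15  # vertical gap after each paragraph, in points
style.alignment = TA_JUSTIFY
# context is the dict of database values supplied by the surrounding view
story = [Paragraph(choice.value,style) for choice in chain(context['question1'].values(),context['question2'].values())]
generated_file = BytesIO()
frame1 = Frame(50,100,245,240, showBoundary=0)
frame2 = Frame(320,100,245,240, showBoundary=0)
page_template = PageTemplate(frames=[frame1,frame2])
doc = BaseDocTemplate(generated_file,pageTemplates=[page_template])
doc.build(story)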
However, this won't work here because I have only a single, large paragraph.
Pretty sure what you want to change is the leading. From chapter 6 of the user manual:
To get double-spaced text, use a high leading. If you set autoLeading (default "off") to "min" (use observed leading even if smaller than specified) or "max" (use the larger of observed and specified) then an attempt is made to determine the leading on a line by line basis. This may be useful if the lines contain different font sizes etc.
Leading is defined earlier in chapter 2:
Interline spacing (Leading)
The vertical offset between the point at which one line starts and where the next starts is called the leading offset.
So try different values of leading, for example:
style = getSampleStyleSheet()['Normal']
style.leading = 24
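Because leading is the distance from the start of one line to the start of the next, matching a ruled background comes down to setting the leading equal to the distance between the background lines. The following is only an arithmetic sketch; the helper name `line_tops` is made up, the numbers come from the frame in the question (y = 100, height = 240), and it ignores the frame's padding.

```python
def line_tops(frame_top, frame_height, leading):
    """Return the y coordinate at which each text line starts, going down from frame_top."""
    count = int(frame_height // leading)  # whole lines that fit in the frame
    return [frame_top - i * leading for i in range(count)]

# Frame(50, 100, 245, 240) spans y = 100 to y = 340
tops = line_tops(frame_top=340, frame_height=240, leading=24)
print(len(tops), tops[:3])
```

If the background lines are 24 points apart, a leading of 24 keeps every text line on its own background line; any other value makes the text drift away from the lines further down the page.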
Add leading to ParagraphStyle
orden = ParagraphStyle('orden')
orden.leading = 14
orden.borderPadding = 10
orden.backColor=colors.gray
orden.fontSize = 14
Generate PDF
buffer = BytesIO()
p = canvas.Canvas(buffer, pagesize=letter)
text = Paragraph("TEXT Nro 0001", orden)
text.wrapOn(p,500,10)
text.drawOn(p, 45, 200)
p.showPage()
p.save()
pdf = buffer.getvalue()
buffer.close()
