Reading and writing PDF files with ligatures? - python

I am attempting to read text from a PDF file, and then later on, write that same text back to another PDF using Python. After the text is read in, the representation of the string when I print it to the console is:
Officially, it’s called
However, when I print the repr() of this text string, I see:
O\xef\xac\x83cially, it\xe2\x80\x99s called
This makes plenty of sense to me: these are ligatures from the PDF, i.e. \xef\xac\x83 is the UTF-8 encoding of U+FB03, the ligature for 'ffi'. The problem is that when I write this string to a PDF using the reportlab library, the output has black symbols in place of these characters.
This only happens with certain ligatures. I am wondering what I can do so that the string I write to the PDF does not contain these ligatures or if there is an efficient way to replace all of them.

It appears your input is correct, but to see the ffi character in your output, use a font that does have one.
The font you are using here is bog standard Arial, which does not contain it.
Some suggestions (mainly depending on your platform, but some of these are Open Source):
Arial Unicode MS
Lucida Grande
Calibri
Cambria
Corbel
Droid Sans/Droid Serif
Helvetica Neue
Ubuntu
If you don't want, or are not able, to change the font, replace the sequence \xef\xac\x83 with the plain characters ffi in your program before writing text to PDF. (And similar for those other certain ligatures you mentioned.)
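One way to do that replacement in bulk, rather than one ligature at a time, is Unicode compatibility normalization. This is only a sketch: the helper name fold_ligatures is made up for illustration, and NFKC also folds other compatibility characters (full-width forms, superscripts and so on), which may or may not be what you want.

```python
import unicodedata

def fold_ligatures(text: str) -> str:
    """Replace compatibility characters such as U+FB03 with plain letters.

    Assumption: folding every compatibility character (not only the
    f-ligatures) is acceptable for the text being written to the PDF.
    """
    return unicodedata.normalize("NFKC", text)

# The bytes from the question, decoded as UTF-8 first.
raw = b"O\xef\xac\x83cially, it\xe2\x80\x99s called"
text = raw.decode("utf-8")

clean = fold_ligatures(text)
print(ascii(clean))  # the ligature is gone, the curly apostrophe is kept
```

The curly apostrophe (U+2019) has no compatibility mapping, so it survives; the font still needs a glyph for it.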

What I ended up doing was copying the ligature characters out of my text file and calling .replace on them, i.e. str.replace('ﬀ', 'ff'). The two arguments may look the same, but the one on the left is the single ligature character and the one on the right is two f's. Also, don't forget # -*- coding: utf-8 -*- at the top of the file.

Related

PyQt5 incorrect label formatting with links

I have two issues with how PyQt is formatting my QLabels
Issue 1:
When hyperlinks are added it displays as if there were no newlines in the string.
For the input text:
https://www.google.co.uk/
https://www.google.co.uk/
https://www.google.co.uk/
the label shows the three links run together, without newlines.
Issue 2: Sometimes PyQt doesn't detect the 'a' tag at all. This happens when the string does not start with a hyperlink but is followed by newlines with hyperlinks, e.g. this input:
test
https://www.google.co.uk/
https://www.google.co.uk/
https://www.google.co.uk/
Here the newlines are shown properly, but PyQt no longer detects the hyperlinks.
From the text property documentation of QLabel:
The text will be interpreted either as plain text or as rich text, depending on the text format setting; see setTextFormat(). The default setting is Qt::AutoText; i.e. QLabel will try to auto-detect the format of the text set.
The AutoText flag can only make a guess using simple tag syntax checks (basic tags without arguments, such as <b>, or document type declaration headers, like <html>).
This is obviously done for performance reasons.
If you are sure that you're always setting rich text content, use the appropriate Qt.TextFormat enum:
label.setTextFormat(QtCore.Qt.RichText)
Rich text uses an HTML-like syntax and follows the same basic rule HTML has had since its birth, almost 30 years ago: line breaks between words (text or tags) are ignored, just as multiple spaces are always collapsed into one.
So, if you want to add line breaks, you have to use the appropriate <br> (or <br/> for xhtml) tag.
Also remember that the Qt rich text engine has limited support for HTML, as described in the documentation about the Supported HTML Subset.
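A minimal sketch of that advice, assuming the input is plain text with one item per line and that any line starting with http:// or https:// should become a link. The helper name to_rich_text is hypothetical, not part of PyQt.

```python
import html

def to_rich_text(plain: str) -> str:
    """Turn newline-separated text into rich text with explicit <br> tags."""
    parts = []
    for line in plain.splitlines():
        stripped = line.strip()
        if stripped.startswith(("http://", "https://")):
            # Escape the URL so it is safe inside the attribute and the body.
            url = html.escape(stripped, quote=True)
            parts.append(f'<a href="{url}">{url}</a>')
        else:
            parts.append(html.escape(line))
    # Newlines are ignored in rich text, so join the lines with <br>.
    return "<br>".join(parts)

print(to_rich_text("test\nhttps://www.google.co.uk/"))

# Usage with a QLabel (not executed here, needs a running QApplication):
#   label.setTextFormat(QtCore.Qt.RichText)
#   label.setOpenExternalLinks(True)
#   label.setText(to_rich_text(text))
```

Setting the format explicitly avoids the AutoText guess, so a leading plain-text line such as "test" no longer changes how the rest is interpreted.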

Regex behaves differently for the same input string

I am trying to get a pdf page with a particular string and the string is:
"statement of profit or loss"
and I'm trying to accomplish this using following regex:
re.search('statement of profit or loss', text, re.IGNORECASE)
But even though the page contained this string "statement of profit or loss" the regex returned None.
On further investigation, I found that the characters 'fi' in "profit", as written in the document, are more congested. When I copied the text from the document and pasted it into my code, it worked fine.
So if I copy "statement of profit or loss" from the document and paste it into re.search() in my code, it works. But if I type "statement of profit or loss" manually in my code, re.search() returns None.
How can I avoid this behavior?
The 'congested' characters copied from your PDF are actually a single character: the 'fi ligature', U+FB01: ﬁ.
Either it was entered as such in the source document, or the typesetting engine that was used to create the PDF replaced the combination f+i with ﬁ.
Combining two or more characters into a single glyph is a fairly usual operation for "nice typesetting", and is not limited to fi, fl, ff, and fj, although these are the most used combinations. (That is because in some fonts the long overhang of the f glyph jarringly touches or overlaps the next character.) Actually, a font can have any number of ligatures; some Adobe fonts use a single ligature for Th.
Usually this is not a problem when extracting text, because the PDF can specify that certain glyphs must be decoded as a string of characters, namely the original characters. So possibly your PDF does not contain such a definition, or the typesetting engine did not bother because the single character ﬁ is a valid Unicode character in its own right (although using it is strongly discouraged).
You can work around this by explicitly cleaning up your text strings before processing any further:
text = text.replace('\ufb01', 'fi')
Repeat this for the other problematic ligatures which have a Unicode code point: ﬂ (U+FB02), ﬀ (U+FB00), ﬃ (U+FB03), ﬄ (U+FB04). I possibly missed some more.
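The per-ligature replace calls can be collected into one translation table. This is a sketch covering the Latin ligatures in the U+FB00 to U+FB06 block; the names LIGATURES, expand_ligatures and page_text are made up for illustration, and the sample string stands in for text extracted from the PDF.

```python
import re

# Latin ligatures from the Alphabetic Presentation Forms block.
LIGATURES = str.maketrans({
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
    "\ufb05": "st",  # long s + t, written out with a plain s
    "\ufb06": "st",
})

def expand_ligatures(text: str) -> str:
    """Replace each ligature character with its plain letters."""
    return text.translate(LIGATURES)

# Stand-in for extracted page text containing the fi ligature.
page_text = "Consolidated Statement of Pro\ufb01t or Loss"

pattern = "statement of profit or loss"
before = re.search(pattern, page_text, re.IGNORECASE)
match = re.search(pattern, expand_ligatures(page_text), re.IGNORECASE)

print(before)          # no match on the raw text
print(match.group(0))  # matches once the ligature is expanded
```

Cleaning the extracted text once, right after extraction, keeps the patterns themselves readable and typeable.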

How to create a word docx using python docx in other than english?

I am building a program that creates printed output from Python code, and the final output contains text in another language (Sinhala). I want to use python-docx to save this output into a Word document. How do I write text in another language into Word?
My aim is a report-generating program for that language (Sinhala). I take all user input from widgets and have managed to print the resulting lines in Sinhala from Python.
Now I want to write these lines into a Word file in Sinhala.
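from docx import Document
# A plain string literal cannot span two lines; adjacent literals are concatenated.
a = ("කණ්ඩියේ උස මීටර් 5.0 ක් පළල මීටර් 2.0 හා දිග මීටර් 2.0 ක් පමණ වන කොටසක් "
     "අස්ථාවර වී")
document = Document()
document.add_heading("python word doc")
document.add_paragraph(a)
document.save('****\\report.docx')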
When I use English, the code does the job. But for Sinhala, I'm not sure how to do it.
I get the following error message for the Sinhala text:
ValueError: All strings must be XML compatible: Unicode or ASCII, no NULL bytes or control characters
The error you're seeing is not directly related to the language. The only thing Word knows about language is which spelling dictionary to use. Otherwise its text is just an arbitrary sequence of Unicode characters.
What I suspect is that the Unicode encoding of the Sinhala strings you're trying to write is not UTF-8. The other possibility is that the string contains some control characters (as mentioned in the error message), particularly the vertical-tab (VT, 0xB or decimal 11) which can arise in copy and paste from PowerPoint.
This latter one is easier to check for, so perhaps start there.
import re
def sanitize_str(s):
    # Strip C0 and C1 control characters (note: this also removes tab and newline).
    control_chars = "\x00-\x1f\x7f-\x9f"
    control_char_re = re.compile("[%s]" % control_chars)
    return control_char_re.sub("", s)
document.add_paragraph(sanitize_str(a))
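Before stripping anything, it can help to see whether control characters are actually present and where. The following is a sketch; find_control_chars and XML_SAFE are hypothetical names, and treating tab, newline and carriage return as harmless is an assumption based on what XML allows.

```python
import unicodedata

# Control characters that XML permits, so they are not reported.
XML_SAFE = {"\t", "\n", "\r"}

def find_control_chars(s: str):
    """Return (index, code point) for each suspicious control character."""
    return [
        (i, ord(ch))
        for i, ch in enumerate(s)
        if unicodedata.category(ch) == "Cc" and ch not in XML_SAFE
    ]

# A vertical tab (0x0B), as can sneak in when pasting from PowerPoint.
sample = "first line\x0bsecond line\n"
for index, code in find_control_chars(sample):
    print(f"control character 0x{code:02X} at index {index}")
```

If the list comes back empty for the Sinhala string, the control-character explanation is ruled out and the encoding of the string is the next thing to check.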

Working with Urdu and Arabic names in Python

I'm trying to work with Urdu text but am unable to get the right output.
name = '\xd9\x87\xd9\x84\xd9\x84\xd8\xa7 \xd8\xa7\xd9\x85\xd8\xa7\xd9\x86'
print name
OUTPUT
هللا امان
DESIRED OUTPUT
امان اللہ
please advise.
I see two main issues with your snippet.
The first is that Unicode has dedicated code points for certain entire Arabic words, and the word you are trying to print, اللہ, has one: ARABIC LIGATURE ALLAH ISOLATED FORM, which is U+FDF2 (UTF-8 bytes 0xEF 0xB7 0xB2).
If you write it as isolated, individual characters, you may not get the correct representation.
Second, the font in your terminal (or whatever application is being used to render the text) must support the glyph, and you should ensure that the text direction is switched to right-to-left.
Here is an example from the online Python shell:
>>> print(u"\uFDF2")
ﷲ
Since this shell is not configured for right to left you can see that it is printing it left to right.
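To see what the original byte string actually contains, it can be decoded and inspected one code point at a time. This is a sketch in Python 3 (the question uses Python 2 syntax); it only prints ASCII names, so it works even in a terminal that cannot render Arabic.

```python
import unicodedata

# The bytes from the question, decoded as UTF-8.
raw = b"\xd9\x87\xd9\x84\xd9\x84\xd8\xa7 \xd8\xa7\xd9\x85\xd8\xa7\xd9\x86"
text = raw.decode("utf-8")

# List each code point with its official Unicode name.
for ch in text:
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))

# The single-code-point ligature mentioned above.
ligature = "\ufdf2"
print(unicodedata.name(ligature))
print(ligature.encode("utf-8"))
```

The listing shows that the stored string starts with HEH, LAM, LAM, ALEF, so the letters of that word are stored in the order shown in the OUTPUT line of the question rather than in reading order.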

Can I change text in MS Word using python-docx, without losing characteristics?

I have an English document in MS Word and I want to change its text into Chinese using Python. I've been using Python 3.4 with python-docx installed. Here's my code:
from docx import Document
document = Document(*some MS Word file*)
# I only change the texts of the first two paragraphs
document.paragraphs[0].text = '带有消毒模式的地板清洁机'
document.paragraphs[1].text = '背景'
document.save(*save_file_path*)
The first two lines did turn into Chinese characters, but characteristics like font and bold are all gone.
Is there any way I could alter the text without losing the original characteristics?
It depends on how the characteristics are applied. There is a thing called the style hierarchy, and text characteristics can be applied anywhere from directly to a run of text, a style, or a document default, and levels in-between.
There are two main classes of characteristic: paragraph properties and run properties. Paragraph properties are things like justification, space before and after, etc. Everything having to do with character-level formatting, like size, typeface, color, subscript, italic, bold, etc. is a run property, also loosely known as a font.
So if you want to preserve the font of a run of text, you need to operate at the run level. An operation like this will preserve font formatting:
run.text = "New text"
An operation like this will preserve paragraph formatting, but remove any character level formatting not applied by the paragraph style:
paragraph.text = "New paragraph text"
You'll need to decide for your application whether you modify individual runs (which may be tricky to identify) or whether you work perhaps with distinct paragraphs and apply different styles to each. I recommend the latter. So in your example, "FLOOR CLEANING MACHINE ...", "BACKGROUND", and "[0001]..." would each become distinct paragraphs. In your screenshot they appear as separate runs in a single paragraph, separated by a line break.
You can get the style of the existing paragraphs and apply it to your new paragraphs - beware that the existing paragraphs might specify a font that does not support Chinese.
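For the run-level route, one possible sketch is to put the new text into the paragraph's first run and empty the remaining runs, so the first run's character formatting carries over. The helper name replace_text_keep_format is hypothetical; it relies only on paragraph.runs, run.text and paragraph.add_run() from python-docx, and it assumes the first run's formatting is the one worth keeping.

```python
def replace_text_keep_format(paragraph, new_text):
    """Replace a paragraph's text while keeping the first run's formatting.

    `paragraph` is expected to behave like a python-docx Paragraph:
    it has a `runs` list and an `add_run(text)` method.
    """
    runs = paragraph.runs
    if not runs:
        # Empty paragraph: nothing to preserve, just add the text.
        paragraph.add_run(new_text)
        return
    # Assigning run.text keeps that run's font, size, bold, etc.
    runs[0].text = new_text
    # Blank the other runs so no old text is left behind.
    for run in runs[1:]:
        run.text = ""

# Usage with python-docx (not executed here; the path is a placeholder):
#   from docx import Document
#   document = Document("input.docx")
#   replace_text_keep_format(document.paragraphs[1], "背景")
#   document.save("output.docx")
```

If a paragraph mixes several formats across its runs, this collapses it to the first run's format, which is the "tricky to identify" case mentioned above.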
