PyQt5 incorrect label formatting with links - python

I have two issues with how PyQt is formatting my QLabels.
Issue 1:
When hyperlinks are added it displays as if there were no newlines in the string.
For the input text:
https://www.google.co.uk/
https://www.google.co.uk/
https://www.google.co.uk/
It's shown without newlines.
Issue 2: Sometimes PyQt doesn't detect the 'a' tag at all. This happens when the start of the string is not a hyperlink but is followed by newlines with hyperlinks, e.g. this input:
test
https://www.google.co.uk/
https://www.google.co.uk/
https://www.google.co.uk/
Here the newlines are shown properly, but PyQt no longer detects the hyperlinks.

From the text property documentation of QLabel:
The text will be interpreted either as plain text or as rich text, depending on the text format setting; see setTextFormat(). The default setting is Qt::AutoText; i.e. QLabel will try to auto-detect the format of the text set.
The AutoText flag can only make a guess using simple tag syntax checks (basic tags without arguments, such as <b>, or document type declaration headers, like <html>).
This is presumably done for performance reasons.
If you are sure that you're always setting rich text content, use the appropriate Qt.TextFormat enum:
label.setTextFormat(QtCore.Qt.RichText)
Rich text uses an HTML-like syntax and follows the same basic rule HTML has had since its birth, almost 30 years ago: line breaks between words in the document (text or tags) are ignored, and multiple spaces are always collapsed into one.
So, if you want to add line breaks, you have to use the appropriate <br> (or <br/> for xhtml) tag.
Also remember that the Qt rich text engine has limited HTML support, as described in the documentation on the Supported HTML Subset.
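The two points above (explicit rich text, explicit <br> tags) can be combined in a small helper. This is only a sketch under assumptions: the function name to_rich_text and the URL pattern are made up for illustration, and the pattern is deliberately naive.

```python
import html
import re

# Naive URL pattern, for illustration only: a scheme followed by non-space characters.
URL_RE = re.compile(r'https?://\S+')


def to_rich_text(plain):
    """Convert plain text into a rich text string with links and explicit <br> tags."""
    rich_lines = []
    for line in plain.splitlines():
        # Escape &, <, > and quotes so ordinary text is not mistaken for markup.
        escaped = html.escape(line)
        # Wrap every URL in an anchor tag.
        linked = URL_RE.sub(
            lambda m: '<a href="{0}">{0}</a>'.format(m.group(0)), escaped
        )
        rich_lines.append(linked)
    # Newlines are ignored in rich text, so join the lines with <br> instead.
    return '<br>'.join(rich_lines)


print(to_rich_text('test\nhttps://www.google.co.uk/'))
```

The result would then be passed to label.setText() after calling label.setTextFormat(QtCore.Qt.RichText); label.setOpenExternalLinks(True) is one way to make the links clickable.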

Related

Regex behaves differently for the same input string

I am trying to get a pdf page with a particular string and the string is:
"statement of profit or loss"
and I'm trying to accomplish this using the following regex:
re.search('statement of profit or loss', text, re.IGNORECASE)
But even though the page contained this string "statement of profit or loss" the regex returned None.
On investigating the document further, I found that the characters 'fi' in "profit", as written in the document, are more tightly spaced than usual. When I copied the string from the document and pasted it into my code, it worked fine.
So, if I copy "statement of profit or loss" from the document and paste it into re.search() in my code, it works fine. But if I type "statement of profit or loss" manually in my code, re.search() returns None.
How can I avoid this behavior?
The 'congested' characters copied from your PDF are actually a single character: the 'fi ligature' U+FB01: ﬁ.
Either it was entered as such in the source document, or the typesetting engine that was used to create the PDF replaced the combination f+i with ﬁ.
Combining two or more characters into a single glyph is a fairly usual operation for "nice typesetting", and is not limited to fi, fl, ff, and fj, although these are the most used combinations. (That is because in some fonts the long overhang of the f glyph jarringly touches or overlaps the next character.) Actually, you can have any number of ligatures; some Adobe fonts use a single ligature for Th.
Usually this is not a problem with text extraction, because the PDF can specify that certain glyphs must be decoded as a string of characters, namely the original characters. So, possibly your PDF does not contain such a definition, or the typesetting engine did not bother because the single character ﬁ is a valid Unicode character in itself (although it is highly advised not to use it).
You can work around this by explicitly cleaning up your text strings before processing any further:
text = text.replace('\ufb01', 'fi')  # U+FB01 is the single-character fi ligature
Repeat this for other problematic ligatures which have a Unicode code point: ﬂ (U+FB02), ﬀ (U+FB00), ﬃ (U+FB03), ﬄ (U+FB04) (I possibly missed some more).
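As an alternative to listing each ligature by hand, Unicode compatibility normalization (NFKC) can fold them all in one step. This swaps in a different technique from the replace() calls above, and it is only a sketch: NFKC also rewrites other compatibility characters (full-width forms, superscripts and so on), which may or may not be acceptable for your text. The name fold_ligatures and the sample string are made up for illustration.

```python
import re
import unicodedata


def fold_ligatures(text):
    """Replace compatibility characters such as U+FB01 with their plain equivalents."""
    # NFKC applies compatibility decompositions, so the fi ligature becomes "f" + "i".
    return unicodedata.normalize('NFKC', text)


# Sample text as it might come out of a PDF, with the ligature in "profit".
page_text = 'Statement of pro\ufb01t or loss'

# The typed pattern does not match the raw text...
print(re.search('statement of profit or loss', page_text, re.IGNORECASE))

# ...but it does match once the ligature has been folded.
cleaned = fold_ligatures(page_text)
print(re.search('statement of profit or loss', cleaned, re.IGNORECASE))
```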

Can I change text in MS Word using python-docx, without losing characteristics?

I have an English document in MS Word and I want to change its text into Chinese using Python. I've been using Python 3.4 and have installed python-docx. Here's my code:
from docx import Document
document = Document(*some MS Word file*)
# I only change the texts of the first two paragraphs
document.paragraphs[0].text = '带有消毒模式的地板清洁机'
document.paragraphs[1].text = '背景'
document.save(*save_file_path*)
The first two lines did turn into Chinese characters, but characteristics like font and bold are all gone.
Is there any way I could alter the text without losing the original characteristics?
It depends on how the characteristics are applied. There is a thing called the style hierarchy, and text characteristics can be applied anywhere from directly to a run of text, a style, or a document default, and levels in-between.
There are two main classes of characteristic: paragraph properties and run properties. Paragraph properties are things like justification, space before and after, etc. Everything having to do with character-level formatting, like size, typeface, color, subscript, italic, bold, etc. is a run property, also loosely known as a font.
So if you want to preserve the font of a run of text, you need to operate at the run level. An operation like this will preserve font formatting:
run.text = "New text"
An operation like this will preserve paragraph formatting, but remove any character level formatting not applied by the paragraph style:
paragraph.text = "New paragraph text"
You'll need to decide for your application whether you modify individual runs (which may be tricky to identify) or whether you work perhaps with distinct paragraphs and apply different styles to each. I recommend the latter. So in your example, "FLOOR CLEANING MACHINE ...", "BACKGROUND", and "[0001]..." would each become distinct paragraphs. In your screenshot they appear as separate runs in a single paragraph, separated by a line break.
You can get the style of the existing paragraphs and apply it to your new paragraphs - beware that the existing paragraphs might specify a font that does not support Chinese.
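For the run-level route, a helper along the following lines might be a starting point. It is a sketch, not a definitive implementation: it relies only on the paragraph.runs and run.text attributes of python-docx, the function name is made up, and it assumes that keeping the first run's formatting for the whole paragraph is acceptable.

```python
def set_paragraph_text_keep_format(paragraph, new_text):
    """Replace a paragraph's text while keeping the first run's character formatting."""
    runs = paragraph.runs
    if not runs:
        # Nothing to inherit formatting from; fall back to paragraph-level assignment.
        paragraph.text = new_text
        return
    # Assigning to run.text keeps that run's font properties (bold, size, typeface).
    runs[0].text = new_text
    # Empty the remaining runs so the old text does not linger after the new text.
    for run in runs[1:]:
        run.text = ''
```

With python-docx this would be called as set_paragraph_text_keep_format(document.paragraphs[1], '背景'). If the original paragraph mixes several formats across its runs, only the first one survives, so this fits best when each paragraph is uniformly formatted.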

GTK3, Pango - rendering mixed case text as uppercase in TextView?

What would be the most efficient way to render text, which in the TextBuffer could be any case, as uppercase in a TextView?
It isn't for the entirety of the text, only specific styles within it - and the original capitalization of that section needs to be preserved in case the user changes the text style back to a non-capitalized style.
So if the relevant section of text could be tagged with a TextTag that would be ideal, but there isn't a tag to fully capitalize (there is a small_caps font variant, which for some reason doesn't seem to work in a textview) - can one create a custom TextTag property like "all_caps" and, if so, how would it be implemented?
Other thoughts would be overriding the textview draw function (sounds painful) or possibly creating a secondary TextBuffer and changing the text case on the fly?
UPDATE:
For this application, the best approach would likely be to intercept the string being passed to Pango from the TextBuffer (from TextView's do_draw, I think) and change it on the fly. Other text styles in this application would need additional characters inserted: it's a screenwriting application, so there is a 'Parenthetical' style which, unsurprisingly, is always enclosed in parentheses. These should be added as part of the style rather than relying on the user to add them.
So the updated question would be: How would one subclass / monkey code / something Pango / PangoCairo / Gtk+ 3 to intercept the string being passed to Pango (along with its TextTags) so as to alter / add to it according to its TextTag styles?

How to check for key value metadata in markdown

I need to check if my input, formatted using markdown, has key-value pair metadata at the beginning, and then insert text after the whole metadata block.
I look for a : in the first line and if found, split the input string at the first newline and add my stuff.
Now, if markdown_content.splitlines()[0].find(':') >= 0: obviously fails when there's no metadata at the beginning, but something else containing a : instead.
Examples
Input with metadata:
page title: fancypagetitle
something else: another value
# Heading
Text
Input without metadata, but with a :
This is a [link](http://www.stackoverflow.com)
# Heading
Text
My question is: How do I check if a metadata block is present and, in case it is, add something in between the metadata and the remaining markdown?
Definition of metadata
The keywords are case-insensitive and may consist of letters, numbers, underscores and dashes and must end with a colon. The values consist of anything following the colon on the line and may even be blank.
If a line is indented by 4 or more spaces, that line is assumed to be an additional line of the value for the previous keyword. A keyword may have as many lines as desired.
The first blank line ends all meta-data for the document. Therefore, the first line of a document must not be blank. All meta-data is stripped from the document prior to any further processing by Markdown.
Source: https://pythonhosted.org/Markdown/extensions/meta_data.html
Have you considered looking at the source code for the meta data extension to see how it's done?
The regex used is:
META_RE = re.compile(r'^[ ]{0,3}(?P<key>[A-Za-z0-9_-]+):\s*(?P<value>.*)')
Of course there is also the regex for secondary lines:
META_MORE_RE = re.compile(r'^[ ]{4,}(?P<value>.*)')
Note that those regular expressions are much more specific than yours and are much less likely to produce a false positive. The extension then splits the document into lines, loops through the lines comparing each with those regexes, and breaks out of the loop on the first line that does not match (which may or may not be a blank line).
If you look at that code, you will see a new feature that has been added and will be available in the next release: support for optional YAML-style delimiters. If you are comfortable using the latest (unreleased) development code, you could wrap your metadata in YAML delimiters, which might make it a little easier to find the end of the metadata.
For example, your example document above would then look like this (note I used the optional end-specific delimiter (...), which more clearly marks the end):
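The loop described above can be sketched roughly as follows. This is not the extension's actual code, only an approximation built from the two regexes quoted; the function name insert_after_metadata is made up. Note that under META_RE a key containing a space (such as "page title") does not match, so the sample below uses page_title instead.

```python
import re

META_RE = re.compile(r'^[ ]{0,3}(?P<key>[A-Za-z0-9_-]+):\s*(?P<value>.*)')
META_MORE_RE = re.compile(r'^[ ]{4,}(?P<value>.*)')


def insert_after_metadata(markdown_content, addition):
    """Insert `addition` after the leading metadata block, or at the top if there is none."""
    lines = markdown_content.splitlines()
    end = 0  # number of leading lines that belong to the metadata block
    for line in lines:
        is_key_line = META_RE.match(line)
        # A continuation line only counts if a key line came before it.
        is_more_line = end > 0 and META_MORE_RE.match(line)
        if is_key_line or is_more_line:
            end += 1
        else:
            # First non-matching line (blank or not) ends the metadata.
            break
    return '\n'.join(lines[:end] + [addition] + lines[end:])


doc = 'page_title: fancypagetitle\nsomething-else: another value\n\n# Heading\nText'
print(insert_after_metadata(doc, 'INSERTED'))
```

A first line such as "http://example.com" would still be treated as a key line by this check, so false positives remain possible, just much less likely than with a bare search for a colon.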
---
page title: fancypagetitle
something else: another value
...
# Heading
Text
That said, you would still need to be careful that you didn't get a false match (a <hr> for example). I suppose either way you would really need to re-implement everything that is in the meta data extension for your own needs. Of course, it is open source, so you can as long as you honor the license.
Sorry, but I can't give you a timeline on when the next release will happen for sure.
Oh, and it may also help to look at the description of this feature provided by MultiMarkdown which inspired the feature in Python-Markdown. That might give you a clearer picture of what might comprise meta-data.

Translating content in filesystem for a Plone product

I'm trying to get certain strings in a .py file translated, using the i18n machinery. Translating .pt files is not a problem, but whenever I try to translate using _('Something') in Python code on the filesystem, it always gives English text (which is the default) instead of the Norwegian text that should be there. So I can see output from python code in English, while other Page Templates bits are correctly translated.
Is there a how-to or something similar for this?
Is the domain name used for _('Something') the same as what you use in the Norwegian .po file that has the translation? They should be the same, so do not use 'plone' in one case and 'my.domain' in the other.
Also, the call to the underscore function does not in itself translate the string; it only creates a string that can be translated. If this string ends up on its own directly in a template, you should add i18n:translate="" to that tag, probably with a matching i18n:domain.
Otherwise you should manually call the translate method, as in http://readthedocs.org/docs/collective-docs/en/latest/i18n/localization.html#manually-translated-message-ids. Read the Plone 4 migration guide for some differences between Plone 3 and 4 that might bite you here.
If you are looking for how-tos, you should probably read these docs:
http://plone.org/documentation/kb/i18n-for-developers
http://readthedocs.org/docs/collective-docs/en/latest/i18n/localization.html
Bye,
Giacomo
Be aware that _() does not translate the text when called; it returns a Message object which will be translated when rendered in a template.
That means:
Do not concatenate Message objects. "text %s" % _('translation') will not work, and neither will "text" + _('translation').
If you do not send the text to the browser through a template, it may not be translated. For example, if you generate an email you need to translate the Message object manually.
