How to force Arabic characters to be separate?

I'm trying to draw a string of Arabic characters, without spaces, on an image using Pillow. The problem is that some Arabic characters look different when they sit next to each other than when they stand alone (e.g. س and ل become سل when put next to each other). I'd like to force my font settings to always render every character separately, without injecting any other characters. What should I do?
Here is a snippet of my code:
from PIL import Image, ImageDraw, ImageFont

# font_path points to the location of an Arabic font file.
font = ImageFont.truetype(
    font=font_path, size=size,
    layout_engine=ImageFont.LAYOUT_RAQM)
# getsize() returns (width, height), in that order
w, h = font.getsize(text, direction='rtl')
offset = font.getoffset(text)
W, H = int(1.5 * w), int(1.5 * h)
imgSize = W, H
img = Image.new(mode='1', size=imgSize, color=0)
draw = ImageDraw.Draw(img)
pos = ((W - w) / 2, (H - h) / 2)
draw.text(pos, text, fill=255, font=font,
          direction='rtl', align='center')

What you're describing might be possible with some fonts that support Arabic, specifically, those that encode the position-sensitive forms in the Arabic Presentation Forms-B Block of Unicode. You would need to map your input text character codes into the correct positional variant. So for the example characters seen and lam as you described, U+0633 س‎ and U+0644 ل‎, you want the initial form of U+0633, which is U+FEB3 ﺳ‎‎, and the final form of U+0644, which is U+FEDE ﻞ, putting those together (separated by a regular space): ﺳ‌ ﻞ‌.
There is a useful chart showing the positional forms at https://en.wikipedia.org/wiki/Arabic_script_in_Unicode#Contextual_forms.
But, important to understand:
not all fonts that contain Arabic have the Presentation Forms encoded (many fonts do not)
not all Arabic codes have an equivalent in the Presentation Forms range (most of the basic ones do, but there are some extended Arabic characters for other languages that do not have Presentation Forms).
you are responsible for processing your input text (in the U+06xx range) into the correct presentation form (U+FExx range) codes based on the word/group context, which can be tricky. That job normally falls to an OpenType Layout engine, but it also performs the joining. So you're basically overriding that logic.
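As a rough sketch of the mapping step, the lookup table can be derived from the standard library rather than typed by hand: each code point in Arabic Presentation Forms-B carries a compatibility decomposition such as `<initial> 0633`, which `unicodedata.decomposition()` exposes. The helper names `build_presentation_table` and `to_form` are made up for this illustration, and choosing which form each letter should take in context is still left to the caller.

```python
import unicodedata

POSITIONS = ("isolated", "initial", "medial", "final")

def build_presentation_table():
    """Map (base character, position) -> Presentation Forms-B character."""
    table = {}
    for cp in range(0xFE70, 0xFF00):
        parts = unicodedata.decomposition(chr(cp)).split()
        # Keep single-letter forms like '<initial> 0633'; skip ligatures
        # and unassigned code points (which have no decomposition).
        if len(parts) == 2 and parts[0].strip("<>") in POSITIONS:
            base = chr(int(parts[1], 16))
            table[(base, parts[0].strip("<>"))] = chr(cp)
    return table

FORMS = build_presentation_table()

def to_form(ch, position):
    """Return the positional variant of ch, or ch itself if none is encoded."""
    return FORMS.get((ch, position), ch)

seen_initial = to_form("\u0633", "initial")  # U+FEB3, as in the example above
lam_final = to_form("\u0644", "final")       # U+FEDE, as in the example above
```

To keep every letter unjoined, as the question asks, one option would be to call `to_form(ch, "isolated")` for each character; whether the chosen font actually contains those glyphs is a separate matter.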

Related

Pillow - draw a literal tag glyph?

The Unifont contains glyphs for Tags, Variation Selectors, and other non-printable characters.
For example at the end of https://unifoundry.com/pub/unifont/unifont-14.0.04/font-builds/unifont_upper-14.0.04.ttf are these tags (as shown in FontForge):
Each one has a glyph which should be printable:
I want to draw that glyph, using the Unifont, on an image with Pillow.
from PIL import Image, ImageDraw, ImageFont
text = chr(0x2A6B2) + " " + chr(0x0E0026)
font = ImageFont.truetype("unifont_upper-14.0.04.ttf", size=64)
image1 = Image.new("RGB", (256, 64), "white")
draw1 = ImageDraw.Draw(image1)
draw1.text( (0 , 0), text, font=font, fill="black")
image1.save("test aa.png")
The first character (a CJK ideograph) draws correctly. But the tag character is invisible.
Is there any way to get Pillow to draw the shape that I can see in FontForge?

It seems the short answer is, unfortunately, "no you can't".
Pillow generally uses libraqm to lay out text (i.e. to do things like map the Unicode string to the glyphs in the font), specifically the raqm_layout function.
That library in turn uses a library called harfbuzz to do the text shaping.
The tag characters you want, including U+E0026, have the Unicode default ignorable property. By default harfbuzz doesn't display characters with this property, replacing them with a blank glyph. But it is possible, with the use of flags, to modify this behaviour: specifically, calling hb_buffer_set_flags with HB_BUFFER_FLAG_PRESERVE_DEFAULT_IGNORABLES seems like it will achieve what you want, displaying these characters rather than blanking them out.
The trouble is, libraqm has no way of setting this flag when it calls harfbuzz - it does let you set some of the other flags, but not this one :(
To achieve what you want I guess you'd have to use a lower level library - there are apparently Python bindings for both FreeType and harfbuzz, though I've not used either so I can't comment on how much pain that might involve.
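To see which characters in a string are candidates for being blanked, a small stdlib-only check can help. This is only a sketch: `unicodedata` does not expose the Default_Ignorable_Code_Point property itself, so the general category (Cf, "format", for the tag characters) is used here as a rough stand-in, and `describe` is a hypothetical helper name.

```python
import unicodedata

def describe(text):
    """Return (code point, general category, name) for each character."""
    return [
        ("U+{:04X}".format(ord(ch)),
         unicodedata.category(ch),
         unicodedata.name(ch, "<unnamed>"))
        for ch in text
    ]

# Same string as in the question: a CJK ideograph, a space, and a tag character.
text = chr(0x2A6B2) + " " + chr(0x0E0026)
for row in describe(text):
    print(row)
```

The tag character reports category Cf, while the ideograph and the space do not. That matches the behaviour in the question, but a Cf category alone does not prove a character is default ignorable.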

From Section 23.9, Tag Characters in The Unicode Standard, Chapter 23, Special Areas and Format Characters:
Tag Characters: U+E0000–U+E007F
This block encodes a set of 95 special-use tag characters to enable
the spelling out of ASCII-based string tags using characters that can
be strictly separated from ordinary text content characters in
Unicode…
Display. Characters in the tag character block have no visible rendering in normal text and the language tags themselves are not
displayed.
And from the Unicode Frequently Asked Questions (emphasis mine):
Q: Which characters should be displayed as invisible, if not supported?
All default-ignorable characters should be rendered as completely invisible (and non advancing, i.e. "zero width"), if not explicitly
supported in rendering.
Q: Does that mean that a font can never display one of these characters?
No. Rendering systems may also support special modes such as “Display
Hidden”, which are intended to reveal characters that would not
otherwise display. Fonts can contain glyphs intended for visible
display of default ignorable code points that would otherwise be
rendered invisibly when not supported.
More resources (required reading, incomplete):
Default_Ignorable_Code_Point character property
Section 5.21, Ignoring Characters in Processing in Implementation Guidelines
🏴 Emoji Tag Sequence

How can I limit text box width of QLineEdit to display at most four characters?

I am working with a GUI based on PySide. I made a (one line) text box with QLineEdit and the input is just four characters long, a restriction I already successfully applied.
The problem is I have a wider than needed text box (i.e. there is a lot of unused space after the text). How can I shorten the length of the text box?
I know this is something that is easily fixed by designing the text box with Designer; however, this particular text box is not created in Designer.

If you want to set the width of your QLineEdit to a fixed value, use:
#setFixedWidth(int w)
MyLineEdit.setFixedWidth(120)

Looking at the source of QLineEdit.sizeHint(), one sees that a line edit is typically wide enough to display 17 Latin "x" characters. I tried to replicate this in Python and change it to display 4 characters, but I could not get the style-dependent margins of the line edit right due to limitations of the Python binding of Qt.
A simple:
e = QtGui.QLineEdit()
fm = e.fontMetrics()
m = e.textMargins()
c = e.contentsMargins()
w = 4*fm.width('x')+m.left()+m.right()+c.left()+c.right()
returns 24 in my case, which is not enough to display four characters like "abcd" in a QLineEdit. A better value would be about 32, which you can set, for example, like this:
e.setMaximumWidth(w+8) # mysterious additional factor required
This might still be okay on many systems even if the font is changed.

Python Reddit bot not correctly encoding special characters

I have a Reddit bot that tries to convert ASCII text to images. I'm running into issues encoding special characters, as per this issue.
I have a repo dedicated to this project, but for the sake of brevity, I'll post the relevant code. I tried switching to Python 3 (since I heard it handles Unicode more elegantly than Python 2), but that didn't solve the issue.
This function pulls comments from Reddit. As you can see, I'm encoding everything in utf-8 as soon as I pull it, which is why I'm confused.
def comments_by_keyword(r, keyword, subreddit='all', print_comments=False):
    """Fetches comments from a subreddit containing a given keyword or phrase
    Args:
        r: The praw.Reddit class, which is required to access the Reddit API
        keyword: Keep only the comments that contain the keyword or phrase
        subreddit: A string denoting the subreddit(s) to look through, default is 'all' for r/all
        limit: The maximum number of posts to fetch, increase for more thoroughness at the cost of increased redundancy/running time
        print_comments: (Debug option) If True, comments_by_keyword will print every comment it fetches, instead of just returning filtered ones
    Returns:
        An array of comment objects whose body text contains the given keyword or phrase
    """
    output = []
    comments = r.get_comments(subreddit, limit=1000)
    for comment in comments:
        # ignore the case of the keyword and comments being fetched
        # Example: for keyword='RIP mobile users', comments_by_keyword would keep 'rip Mobile Users', 'rip MOBILE USERS', etc.
        if keyword.lower() in comment.body.lower():
            print(comment.body.encode('utf-8'))
            print("=====\n")
            output.append(comment)
        elif print_comments:
            print(comment.body.encode('utf-8'))
            print("=====\n")
    return output
And then this converts it to an image:
def str_to_img(str, debug=False):
    """Converts a given string to a PNG image, and saves it to the return variable"""
    # use 12pt Courier New for ASCII art
    font = ImageFont.truetype("cour.ttf", 12)
    # do some string preprocessing
    str = str.replace("\n\n", "\n") # Reddit requires double newline for new line, don't let the bot do this
    str = html.unescape(str)
    img = Image.new('RGB', (1,1))
    d = ImageDraw.Draw(img)
    str_by_line = str.split("\n")
    num_of_lines = len(str_by_line)
    line_widths = []
    for i, line in enumerate(str_by_line):
        line_widths.append(d.textsize(str_by_line[i], font=font)[0])
    line_height = d.textsize(str, font=font)[1] # the height of a line of text should be unchanging
    img_width = max(line_widths) # the image width is the largest of the individual line widths
    img_height = num_of_lines * line_height # the image height is the # of lines * line height
    # creating the output image
    # add 5 pixels to account for lowercase letters that might otherwise get truncated
    img = Image.new('RGB', (img_width, img_height + 5), 'white')
    d = ImageDraw.Draw(img)
    for i, line in enumerate(str_by_line):
        d.text((0,i*line_height), line, font=font, fill='black')
    output = BytesIO()
    if (debug):
        img.save('test.png', 'PNG')
    else:
        img.save(output, 'PNG')
    return output
Like I said, I'm encoding everything in utf-8, but the special characters don't show up properly. I'm also using Courier New from the official .ttf file, which is supposed to support a wide range of characters and symbols, so I'm not sure what the issue is there either.
I feel like it's something obvious. Can anyone enlighten me? It's not ImageDraw, is it? To top it all off, it seems like text encoding as a whole is sort of ambiguous, so even after reading other StackOverflow posts (and blog posts about encoding), I'm hardly closer to a real solution.

I can't run any tests myself at the moment, and I can't leave a comment due to low rep, so I'm posting a partial answer that hopefully gives you some ideas to try. I'm also a bit rusty with Python 2, but let's try.
So two things. First:
I'm encoding everything in utf-8 as soon as I pull it
Are you?
print(comment.body.encode('utf-8'))
print("=====\n")
output.append(comment)
You're encoding the print output, but appending the original comment to the output list exactly as praw returned it. Does praw return unicode objects?
I ask because I would imagine unicode objects are what the ImageDraw module wants. Looking at its source code, it doesn't seem to have any clue about the encoding of the text you're trying to render, meaning Python 2 byte strings would probably be rendered as single-byte characters, leading to garbage in the output for UTF-8 encoded text.
http://pillow.readthedocs.org/en/latest/reference/ImageFont.html#PIL.ImageFont.truetype mentions an "encoding" parameter, which should default to unicode. It could be worth setting it explicitly just in case. Maybe it raises an error if the font is not unicode compatible.
Encodings in Python 2 aren't fun. But one thing I would still try is to make sure that a unicode object is passed to ImageDraw (try unicode(str) or str.decode("utf8")).
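The point about encoding only the printed copy can be illustrated with a short sketch. It is written in Python 3 notation, where text is `str` and encoded data is `bytes`; the sample string is invented for the demonstration.

```python
body = "caf\u00e9 \u2603"          # text, as an API wrapper would ideally hand it over

encoded = body.encode("utf-8")     # a new bytes object; body itself is unchanged
assert isinstance(body, str)
assert isinstance(encoded, bytes)

# What a renderer effectively sees if it treats each byte as one character:
mojibake = encoded.decode("latin-1")

# Decoding with the right codec gives the original text back:
restored = encoded.decode("utf-8")

print(len(body), len(encoded), len(mojibake))
```

Calling `.encode()` inside `print()` therefore has no effect on the object appended to the list, and drawing bytes one at a time produces more "characters" than the text contains.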

Python’s `str.format()`, fill characters, and ANSI colors

In Python 2, I’m using str.format() to align a bunch of columns of text I’m printing to a terminal. Basically, it’s a table, but I’m not printing any borders or anything—it’s simply rows of text, aligned into columns.
With no color-fiddling, everything prints as expected.
If I wrap an entire row (i.e., one print statement) with ANSI color codes, everything prints as expected.
However: If I try to make each column a different color within a row, the alignment is thrown off. Technically, the alignment is preserved; it’s the fill characters (spaces) that aren’t printing as desired; in fact, the fill characters seem to be completely removed.
I’ve verified the same issue with both colorama and xtermcolor. The results were the same. Therefore, I’m certain the issue has to do with str.format() not playing well with ANSI escape sequences in the middle of a string.
But I don’t know what to do about it! :( I would really like to know if there’s any kind of workaround for this problem.
Color and alignment are powerful tools for improving readability, and readability is an important part of software usability. It would mean a lot to me if this could be accomplished without manually aligning each column of text.
Little help? ☺

This is a very late answer, left as bread crumbs for anyone who finds this page while struggling to format text with built-in ANSI color codes.
byoungb's comment about making padding decisions on the length of pre-colorized text is exactly right. But if you already have colored text, here's a work-around:
See my ansiwrap module on PyPI. Its primary purpose is providing textwrap for ANSI-colored text, but it also exports ansilen() which tells you "how long would this string be if it didn't contain ANSI control codes?" It's quite useful in making formatting, column-width, and wrapping decisions on pre-colored text. Add width - ansilen(s) spaces to the end or beginning of s to left (or respectively, right) justify s in a column of your desired width. E.g.:
from ansiwrap import ansilen

def ansi_ljust(s, width):
    # pad based on the visible length, ignoring ANSI control codes
    needed = width - ansilen(s)
    if needed > 0:
        return s + ' ' * needed
    else:
        return s
Also, if you need to split, truncate, or combine colored text at some point, you will find that ANSI's stateful nature makes that a chore. You may find ansi_terminate_lines() helpful; it "patches up" a list of sub-strings so that each has independent, self-standing ANSI codes with the same effect as in the original string.
The latest versions of ansicolors also contain an equivalent implementation of ansilen().

Python doesn't distinguish between 'normal' characters and ANSI colour codes; the latter are also just characters, which the terminal then interprets.
In other words, printing '\x1b[92m' to a terminal may change the terminal text colour, but Python doesn't see it as anything but a sequence of 5 characters. If you use print repr(line) instead, Python will print the string literal form, using escape codes for non-printable characters (so the ESC ASCII code, 27, is displayed as \x1b), which lets you see how many characters have been added.
You'll need to adjust your column alignments manually to allow for those extra characters.
Without your actual code, that's hard for us to help you with though.
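A minimal sketch of the effect, and of the "pad first, colour afterwards" workaround, assuming the usual SGR codes for bright green and reset (the variable names are illustrative only):

```python
GREEN = "\x1b[92m"   # 5 characters as far as Python is concerned
RESET = "\x1b[0m"    # 4 more

plain = "ok"
colored = GREEN + plain + RESET

# str.format() pads by len(), which counts the escape characters too.
# len(colored) is 11, already wider than the 10-column field, so no fill is added.
misaligned = "{:<10}|".format(colored)

# Padding the plain text first and colouring afterwards keeps the column width.
aligned = GREEN + "{:<10}".format(plain) + RESET + "|"

print(repr(misaligned))
print(repr(aligned))
```

Here the trailing spaces end up inside the coloured span, which is harmless for foreground colours but would be visible with a background colour.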

Also late to the party. Had this same issue dealing with color and alignment. Here is a function I wrote which adds padding to a string that has characters that are 'invisible' by default, such as escape sequences.
def ljustcolor(text: str, padding: int, char=" ") -> str:
    import re
    pattern = r'(?:\x1B[#-_]|[\x80-\x9F])[0-?]*[ -/]*[#-~]'
    matches = re.findall(pattern, text)
    offset = sum(len(match) for match in matches)
    return text.ljust(padding + offset,char[0])
The pattern is meant to match ANSI escape sequences, including color codes. We then take the total length of all matches, which serves as the offset we add to the padding value passed to ljust.

Version of Canvas create_text() that supports word wrap?

Is there a create_text() mode or technique that supports word wrap? I'm stuck using create_text() vs. a Label or Text widget because I'm placing text on top of an image on my Canvas.
Also, is there a Tkinter API that truncates text that doesn't fit a certain width with an ellipsis like suffix, eg. Where very, very, very long text gets converted to something like Where very, very, ....

There is indeed a word wrap feature in create_text(). You'd call it like so:
canvas.create_text(x, y, width=80)
You can set the width parameter to whatever max length you want, or 0 if you want no word wrapping. See this article for all the options, arguments etc. for create_text().
I'm not sure about truncating text, but I did see this talking about a way to limit the length of input in an Entry widget...
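For the truncation part, I'm not aware of a built-in Canvas option, so one possibility is to shorten the string yourself before passing it to create_text(). The sketch below is only an illustration: `truncate_to_width` is a hypothetical helper, and `measure` is any callable that returns the width of a string. It defaults to `len` (character count) so the example runs without a display; with Tkinter you could presumably pass the `measure` method of a `tkinter.font.Font` object to work in pixels instead.

```python
def truncate_to_width(text, max_width, measure=len, suffix="..."):
    """Shorten text so that measure(result) <= max_width, appending suffix."""
    if measure(text) <= max_width:
        return text
    # Try progressively shorter prefixes until one fits with the suffix.
    for end in range(len(text), 0, -1):
        candidate = text[:end].rstrip() + suffix
        if measure(candidate) <= max_width:
            return candidate
    return suffix

short = truncate_to_width("Where very, very, very long text", 21)
print(short)
```

The linear search is simple rather than fast; for very long strings a binary search over `end` would be a reasonable refinement.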
