Unicode characters are boxes - python

Why is it that some characters show up normally, and some characters (for example, &#3676 - &#3712) show up as boxes? The website I'm using is http://www.tamasoft.co.jp/en/general-info/unicode-decimal.html, and even when I try to return the characters in python, they show up as boxes.
Note: The character codes end with semicolons

Some code points are not yet assigned to a character. Code point 3676, or U+0E5C as it's commonly written, is one of those.
As a consequence, you don't have to worry about these: they will not show up in any real text.
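You can verify this from Python itself; here is a minimal sketch using the standard unicodedata module (the second code point, U+0E01, is just an assigned neighbour added for contrast):
import unicodedata
for cp in (0x0E5C, 0x0E01):
    try:
        # name() raises ValueError for code points with no assigned name
        print(hex(cp), unicodedata.name(chr(cp)))
    except ValueError:
        print(hex(cp), "is not an assigned, named character")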

How can I ask `comment-dwim` to leave at least two spaces in python-mode?

comment-dwim (comment-do-what-I-mean, usually bound to M-;) works well. It:
appends spaces at the end of the line until at least some column is reached,
if the preceding line also had a comment tailing the code, aligns the comment character, and
inserts the comment character for the mode, followed by one space.
In Python, one feature is missing to comply with the stringent style rules (PEP 8):
at least two spaces must separate code from a trailing comment.
How can I configure comment-dwim to leave at least two spaces in python-mode?
Fiddling with comment-dwim may be holding the wrong end of the stick. For me, M-; automatically inserts two spaces, the '#' character, and a space, following which point is placed. If it's not doing that for you, then you may have reconfigured it somewhere else, possibly accidentally.
I suggest an alternative: enable a 'format-on-save' option, which will do this for you without fussing in your .emacs or with customize.
(use-package python-black
  :after python
  :hook (python-mode . python-black-on-save-mode)
  :bind ("C-c p C-f" . 'python-black-buffer))
I use black, but any alternative formatter should do the same. I never worry about how my file is formatted, because black fixes it for me automatically.
From the docs for black:
Black does not format comment contents, but it enforces two spaces
between code and a comment on the same line, and a space before the
comment text begins.
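As a concrete illustration of that rule, black would rewrite the hypothetical line
x = 1 # note
as
x = 1  # note
i.e. two spaces before the '#' and one space after it.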

Detecting Arabic characters in regex

I have a dataset of Arabic sentences, and I want to remove non-Arabic characters or special characters. I used this regex in python:
text = re.sub(r'[^ء-ي0-9]',' ',text)
It works perfectly, but in some sentences (4 cases from the whole dataset) the regex also removes the Arabic words!
I read the dataset using pandas (a Python package), like:
train = pd.read_excel('d.xlsx', encoding='utf-8')
Just to show you in a picture, I tested it on the Pythex site (the screenshot is not reproduced here).
What is the problem?
Edit: the sentences in the example are:
انا بحكي رجعو مبارك واعملو حفلة واحرقوها بالمعازيم ولما الاخوان يروحو
يعزو احرقو العزا -- احسنلكم والله #مصر
ﺷﻔﻴﻖ ﺃﺭﺩﻭﻏﺎﻥ ﻣﺼﺮ ..ﺃﺣﻨﺍ ﻧﺒﻘﻰ ﻣﻴﻦ ﻳﺎ ﺩﺍﺩﺍ؟ #ﻣﺴﺨﺮﺓ #ﻋﺒﺚ #EgyPresident #Egypt #ﻣﻘﺎﻃﻌﻮﻥ لا يا حبيبي ما حزرت: بشار غبي بوجود بعثة أنان حاب يفضح روحه انه مجرم من هيك نفذ المجزرة لترى البعثة اجرامه بحق السورين
Those incorrectly included characters are not in the common Unicode range for Arabic (U+0621..U+064A), but are "hardcoded" as their initial, medial, and final forms.
Comparable to capitalization in Latin-based languages, but stricter than that, Arabic writing indicates both the start and end of words with a special 'flourish' form. In addition, it allows an "isolated" form, used when the character is not part of a full word.
This is usually encoded in a file as 'an' Arabic character and the actual rendering in initial, medial, or final form is left to the text renderer, but since all forms also have Unicode codepoints of their own, it is also possible to "hardcode" the exact forms. That is what you encountered: a mix of these two systems.
Fortunately, the Unicode ranges for the hardcoded forms are also fixed values:
Arabic Presentation Forms-A is a Unicode block encoding contextual forms and ligatures of letter variants needed for Persian, Urdu, Sindhi and Central Asian languages. The presentation forms are present only for compatibility with older standards such as codepage 864 used in DOS, and are typically used in visual and not logical order.
(https://en.wikipedia.org/wiki/Arabic_Presentation_Forms-A)
and their ranges are U+FB50..U+FDFF (Presentation Forms A) and U+FE70..U+FEFC (Presentation Forms B). If you add these ranges to your exclusion set, the regex will no longer delete these texts:
[^ء-ي0-9ﭐ-﷿ﹰ-ﻼ]
Depending on your browser and/or editor, you may have problems with selecting this text to copy and paste it. It may be more clear to explicitly use a string specifying the exact characters:
[^0-9\u0621-\u064a\ufb50-\ufdff\ufe70-\ufefc]
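Applied in Python, the extended substitution might look like this (a minimal sketch; the sample string is illustrative):
import re
# Keep digits, the basic Arabic block, and both presentation-form blocks;
# replace everything else with a space.
pattern = re.compile(r'[^0-9\u0621-\u064a\ufb50-\ufdff\ufe70-\ufefc]')
text = 'ﺷﻔﻴﻖ ﺃﺭﺩﻭﻏﺎﻥ #Egypt 123'
print(pattern.sub(' ', text))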
I experimented on Pythex and, with help from 'Regular Expression Arabic characters and numbers only', I found this: [\u0621-\u064A0-9], which catches almost all non-Arabic characters. For an unknown reason it doesn't catch 'y', so you have to add it yourself: [\u0621-\u064A0-9y]
This can catch all non-Arabic characters. For special characters, I'm sorry, but I found nothing better than adding them to the class explicitly: [\u0621-\u064A0-9y#\!\?\,]

How to select an entire entity around a regex without splitting the string first?

My project (unrelated to this question, just context) is an ML classifier. I'm trying to improve it, and I found that when I stripped URLs from the text given to it, some of the URLs had been broken by spaces. For example:
https:// twitter.com/username/sta tus/ID
After I remove links that are not broken, I am left with things like www website com. I removed those with the following regular expression in Python:
tweet = re.sub('(www|http).*?(org |net |edu |com |be |tt |me |ms )','',tweet)
I've put a space after every one of them because this happens after the regular stripping and text processing (so I am only working with parts of a URL separated by spaces), and theoretically we should only pick up the remainders of a broken link... not something like
http website strangeTLD .... communication
It's not perfect, but it works. However, I thought I might try to preemptively remove Twitter URLs only, since I know that the spaces that break the regular URL strip will always be in the same places, hoping this improves my classifier's accuracy. This would get rid of the string of characters that occurs after a link... specifically pictures, which make up a lot of my data.
Specifically, is there a way to select the entity surrounding/after:
pic.twitter.com/
or, in reference to the example I gave earlier, select the entity after the username broken by the space in status (I'm just guessing at this regex)...
http.*?twitter.com/*?/sta tus/
Thank you in advance! And for the record, I was given this dataset to work with; I am not sure why the URLs are almost all broken by spaces.
Yes, what you are talking about is called Positive Lookbehind and works using (?<=...), where the ellipsis should be replaced by what you want to skip.
E.g. if you want to select whatever comes after username in https://twitter.com/username/status/ID, just use
(?<=https:\/\/twitter\.com\/username\/).*
and you will get status/ID, like you can see with this live demo.
In this case I escaped the slashes / with backslashes, as some regex tools require (Python's re itself does not); I also used the Kleene star operator, i.e. the asterisk, to match any number of occurrences of . (any character), just like you did.
What a positive lookbehind combination does is specifying some mandatory text before the current position of your cursor; in other words, it puts the cursor after the expression you feed it (if the said text exists).
Of course this is not enough in your case, since username won't be a fixed string but a variable one. That is an extra complication, because lookbehinds in Python's re must have a fixed length, so they do not work with variable-length patterns.
So you can just skip https://twitter.com/
(?<=https:\/\/twitter\.com\/).*
And then, via Python, create a substring
currentText = "username/status/ID"
result = currentText.split("/",1)[1] # returns status/ID
Test it in this demo (click "Execute"); a simple explanation of how this works is in the answer to this question (in short, you just split the string at the first slash character).
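Putting the two steps together in Python (a sketch; the sample tweet text and the 12345 ID are illustrative, and Python's re needs no escaped slashes):
import re
tweet = 'some text https://twitter.com/SomeUser/status/12345 more text'
# The fixed-length lookbehind skips the domain; the split drops the username.
match = re.search(r'(?<=https://twitter\.com/)\S*', tweet)
if match:
    print(match.group(0).split('/', 1)[1])  # status/12345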
As a sidenote, blanks/spaces aren't allowed in URLs and if necessary are usually encoded as %20 or + (see e.g. this answer). In other words, every URL you got can be safely stripped of spaces before processing, so... why didn't they do it?
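For instance, a trivial sketch using the broken example from the question:
broken = 'https:// twitter.com/username/sta tus/ID'
print(broken.replace(' ', ''))  # https://twitter.com/username/status/ID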

Extract all SVG between two unicode values fontforge

I have a font that I made a while back but didn't do much of anything with at the time. I lost a few of the original SVGs when I reinstalled the OS on my computer. The font is full Unicode with lots of characters, and all the icons saved in the font are outside the normal character range. I know it is possible to extract a bunch of SVGs from the font using this command
fontforge -lang=ff -c 'Open($1); SelectWorthOutputting(); foreach Export("svg"); endloop;' Typeface.ttf
and
fontforge -lang=ff -c 'Open($1); SelectAll(); foreach Export("svg"); endloop;' Typeface.ttf
However, the first misses the icons completely, and the second is no different. All the icons are between two points in the file, starting at U+E000 and going through to U+E17D. I want to know how I can extract all the icons between these two points, and, if possible, match them with a namelist.txt for naming.
How to extract several characters between two Unicode values from fontforge?
The conceptually cleanest way would be to iterate over the hexadecimal number range you specified and call font[char_name].export() on every single code point. However, this comes with the hassle of incrementing hexadecimal numbers, which, while feasible, is a bit more involved than what I'm going to propose.
A for loop paired with the function nameFromUnicode(position) should do the trick while staying (mostly) clear of hexadecimals. nameFromUnicode(position) takes the position of a Unicode character as an integer and returns the name of the character at that position (something like uni2D42). This name can then be passed to font[that_char_name].export() to export the glyph. To find the starting and ending positions of your range in decimal, simply interpret the character names as hexadecimal numbers and convert them to decimal: in your case e000 becomes 57344 and e17d becomes 57725, so the range in decimal runs from 57344 to 57725.
The following code snippet is a bare-bones loop extracting the characters in the specified range (although exporting to '.png' instead of '.svg'; an SVG adaptation follows after the snippet).
from fontforge import *
font = open("/path/to/your/.ttf/file")
# 57344..57725 is U+E000..U+E17D; range() excludes its upper bound.
for position in range(57344, 57726):
    glyph = nameFromUnicode(position)
    font[glyph].export(font[glyph].glyphname + ".png", 150)
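An SVG adaptation might look like the following sketch (untested against your font; it filters on each glyph's code point rather than its name, so glyphs that were renamed still get caught):
import fontforge
font = fontforge.open("Typeface.ttf")
for glyph in font.glyphs():
    # glyph.unicode is -1 for unencoded glyphs, so the range check skips them
    if 0xE000 <= glyph.unicode <= 0xE17D:
        glyph.export(glyph.glyphname + ".svg")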
I don't quite understand your second question about matching against a namelist.txt. If after 5 years you still crave an answer, you're welcome to elaborate and I'll see if I can come up with one.

Python’s `str.format()`, fill characters, and ANSI colors

In Python 2, I’m using str.format() to align a bunch of columns of text I’m printing to a terminal. Basically, it’s a table, but I’m not printing any borders or anything—it’s simply rows of text, aligned into columns.
With no color-fiddling, everything prints as expected.
If I wrap an entire row (i.e., one print statement) with ANSI color codes, everything prints as expected.
However: If I try to make each column a different color within a row, the alignment is thrown off. Technically, the alignment is preserved; it’s the fill characters (spaces) that aren’t printing as desired; in fact, the fill characters seem to be completely removed.
I’ve verified the same issue with both colorama and xtermcolor. The results were the same. Therefore, I’m certain the issue has to do with str.format() not playing well with ANSI escape sequences in the middle of a string.
But I don’t know what to do about it! :( I would really like to know if there’s any kind of workaround for this problem.
Color and alignment are powerful tools for improving readability, and readability is an important part of software usability. It would mean a lot to me if this could be accomplished without manually aligning each column of text.
Little help? ☺
This is a very late answer, left as bread crumbs for anyone who finds this page while struggling to format text with built-in ANSI color codes.
byoungb's comment about making padding decisions on the length of pre-colorized text is exactly right. But if you already have colored text, here's a work-around:
See my ansiwrap module on PyPI. Its primary purpose is providing textwrap for ANSI-colored text, but it also exports ansilen(), which tells you "how long would this string be if it didn't contain ANSI control codes?" It's quite useful for making formatting, column-width, and wrapping decisions on pre-colored text. Add width - ansilen(s) spaces to the end or beginning of s to left-justify (or, respectively, right-justify) s in a column of your desired width. E.g.:
from ansiwrap import ansilen

def ansi_ljust(s, width):
    needed = width - ansilen(s)
    if needed > 0:
        return s + ' ' * needed
    else:
        return s
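A quick illustration of why ansilen() matters (assuming ansiwrap is installed; the colored string is an ordinary red escape sequence):
from ansiwrap import ansilen
colored = '\x1b[31merror\x1b[0m'
print(len(colored))      # 14 -- counts the invisible escape codes too
print(ansilen(colored))  # 5  -- visible characters only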
Also, if you need to split, truncate, or combine colored text at some point, you will find that ANSI's stateful nature makes that a chore. You may find ansi_terminate_lines() helpful: it "patches up" a list of substrings so that each has independent, self-standing ANSI codes with the equivalent effect of the original string.
The latest versions of ansicolors also contain an equivalent implementation of ansilen().
Python doesn't distinguish between 'normal' characters and ANSI colour codes, which are also characters that the terminal interprets.
In other words, printing '\x1b[92m' to a terminal may change the terminal text colour, but Python sees it as nothing more than a set of 5 characters. If you use print repr(line) instead, Python will print the string literal form, using escape codes for non-printable characters (so the ESC character, ASCII code 27, is displayed as \x1b), letting you see how many extra characters have been added.
You'll need to adjust your column alignments manually to allow for those extra characters.
Without your actual code, that's hard for us to help you with though.
Also late to the party. I had this same issue dealing with color and alignment. Here is a function I wrote which adds padding to a string containing characters that are 'invisible' by default, such as escape sequences.
import re

def ljustcolor(text: str, padding: int, char: str = " ") -> str:
    pattern = r'(?:\x1B[#-_]|[\x80-\x9F])[0-?]*[ -/]*[#-~]'
    matches = re.findall(pattern, text)
    offset = sum(len(match) for match in matches)
    return text.ljust(padding + offset, char[0])
The pattern matches all ansi escape sequences, including color codes. We then get the total length of all matches which will serve as our offset when we add it to the padding value in ljust.
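Hypothetical usage, padding a colorized status cell to a fixed visible width (the color constants are ordinary ANSI sequences, not part of the original answer):
RED, GREEN, RESET = '\x1b[31m', '\x1b[32m', '\x1b[0m'
rows = [(RED + 'fail' + RESET, 'test_a'), (GREEN + 'ok' + RESET, 'test_b')]
for status, name in rows:
    # pads to 10 visible columns despite the escape codes
    print(ljustcolor(status, 10) + name)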
