Python Reddit bot not correctly encoding special characters

I have a Reddit bot that tries to convert ASCII text to images. I'm running into issues encoding special characters, as per this issue.
I have a repo dedicated to this project, but for the sake of brevity, I'll post the relevant code. I tried switching to Python 3 (since I heard it handles Unicode more elegantly than Python 2), but that didn't solve the issue.
This function pulls comments from Reddit. As you can see, I'm encoding everything in utf-8 as soon as I pull it, which is why I'm confused.
def comments_by_keyword(r, keyword, subreddit='all', print_comments=False):
    """Fetches comments from a subreddit containing a given keyword or phrase

    Args:
        r: The praw.Reddit class, which is required to access the Reddit API
        keyword: Keep only the comments that contain the keyword or phrase
        subreddit: A string denoting the subreddit(s) to look through, default is 'all' for r/all
        limit: The maximum number of posts to fetch, increase for more thoroughness at the cost of increased redundancy/running time
        print_comments: (Debug option) If True, comments_by_keyword will print every comment it fetches, instead of just returning filtered ones

    Returns:
        An array of comment objects whose body text contains the given keyword or phrase
    """
    output = []
    comments = r.get_comments(subreddit, limit=1000)
    for comment in comments:
        # ignore the case of the keyword and comments being fetched
        # Example: for keyword='RIP mobile users', comments_by_keyword would keep 'rip Mobile Users', 'rip MOBILE USERS', etc.
        if keyword.lower() in comment.body.lower():
            print(comment.body.encode('utf-8'))
            print("=====\n")
            output.append(comment)
        elif print_comments:
            print(comment.body.encode('utf-8'))
            print("=====\n")
    return output
And then this converts it to an image:
def str_to_img(str, debug=False):
    """Converts a given string to a PNG image, and saves it to the return variable"""
    # use 12pt Courier New for ASCII art
    font = ImageFont.truetype("cour.ttf", 12)
    # do some string preprocessing
    str = str.replace("\n\n", "\n")  # Reddit requires double newline for new line, don't let the bot do this
    str = html.unescape(str)
    img = Image.new('RGB', (1, 1))
    d = ImageDraw.Draw(img)
    str_by_line = str.split("\n")
    num_of_lines = len(str_by_line)
    line_widths = []
    for i, line in enumerate(str_by_line):
        line_widths.append(d.textsize(str_by_line[i], font=font)[0])
    line_height = d.textsize(str, font=font)[1]  # the height of a line of text should be unchanging
    img_width = max(line_widths)  # the image width is the largest of the individual line widths
    img_height = num_of_lines * line_height  # the image height is the # of lines * line height
    # creating the output image
    # add 5 pixels to account for lowercase letters that might otherwise get truncated
    img = Image.new('RGB', (img_width, img_height + 5), 'white')
    d = ImageDraw.Draw(img)
    for i, line in enumerate(str_by_line):
        d.text((0, i * line_height), line, font=font, fill='black')
    output = BytesIO()
    if (debug):
        img.save('test.png', 'PNG')
    else:
        img.save(output, 'PNG')
    return output
Like I said, I'm encoding everything in utf-8, but the special characters don't show up properly. I'm also using Courier New from the official .ttf file, which is supposed to support a wide range of characters and symbols, so I'm not sure what the issue is there either.
I feel like it's something obvious. Can anyone enlighten me? It's not ImageDraw, is it? To top it all off, it seems like text encoding as a whole is sort of ambiguous, so even after reading other StackOverflow posts (and blog posts about encoding), I'm hardly closer to a real solution.

I can't run any tests myself at the moment, and I can't leave a comment due to low rep, so I'm dropping a partial answer that hopefully gives some ideas about what to try. I'm also a bit rusty with Python 2, but let's try.
So two things. First:
I'm encoding everything in utf-8 as soon as I pull it
Are you?
print(comment.body.encode('utf-8'))
print("=====\n")
output.append(comment)
You're encoding the print output, but appending the original comment to the output list just as praw returned it. Does praw return unicode objects?
Because I would imagine unicode objects are what the ImageDraw module wants. Looking at its source code, it doesn't seem to have any clue about the encoding of the text you're trying to render, meaning Python 2 byte strings would probably be rendered as single-byte characters, producing garbage in the output for UTF-8-encoded text.
http://pillow.readthedocs.org/en/latest/reference/ImageFont.html#PIL.ImageFont.truetype mentions an "encoding" parameter, which should default to Unicode. It could be worth setting it just in case; maybe it raises an error if the font is not Unicode compatible.
Encodings in Python 2 aren't fun. But one thing I would still try is to make sure a unicode object is passed to ImageDraw (try unicode(str) or str.decode("utf8")).
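For illustration, here's a minimal sketch of that idea (assuming Python 2, Pillow, and the same cour.ttf from the question; the helper name is made up and this is untested):
# -*- coding: utf-8 -*-
from PIL import Image, ImageDraw, ImageFont

def render_unicode(text):
    # If we were handed an encoded str (Python 2 bytes), decode it back to a unicode object first
    if isinstance(text, str):
        text = text.decode('utf-8')
    font = ImageFont.truetype("cour.ttf", 12)
    img = Image.new('RGB', (400, 20), 'white')
    d = ImageDraw.Draw(img)
    # Pass the unicode object straight through; no .encode('utf-8') here
    d.text((0, 0), text, font=font, fill='black')
    return img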

Related

Python Curses, reading wide character's attribute from screen

The problem I'm trying to solve is to get a pair (ch, att) representing the character and the associated attribute currently displayed at some given position.
Now, when the displayed character is not a wide one (i.e. an ASCII character), the method .inch does the job, as long as the result is masked correctly. The issue comes when the displayed character is wide. More precisely, I know how to get the character itself through .instr, but this function does not return any information about the attribute.
Since, as far as I know, there is no specific function to get the attribute alone, my first attempt was to use .inch, drop the 8 least significant bits, and interpret the result as the attribute. This seemed to work to some extent, but double checking I realized that reading Greek letters (u"\u03b1" for instance) with no attribute this way returns att = 0b1100000000 instead of 0. Is there a better way to approach the problem?
EDIT: a minimal example for Python 3
import curses

def bin(x):
    out = ''
    while x > 0:
        out = str(x % 2) + out
        x = x // 2
    return out

def main(s):
    s.addstr(1, 1, u'\u03b1')
    s.refresh()
    chratt = s.inch(1, 1)
    att = chratt & 0xFF00
    s.addstr(2, 1, bin(att))
    s.refresh()
    while True:
        pass

curses.wrapper(main)
In curses, inch and instr are only for ASCII characters, as you suspected. "Complex" or "wide" characters, such as those from UTF-8, use another system, as explained here on Stack Overflow by one of the ncurses creators.
However, onto the bad news: they aren't implemented in Python's curses module (yet). A pull request was submitted here and is very close to merging (90%), so if you really need it, why not go contribute yourself?
And if that isn't an option, you could try storing every change you make to the screen in a variable and then pulling the wide characters back from there.
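A minimal sketch of that workaround, assuming Python 3 and that every write goes through your own helper (the shadow-buffer dict and the helper name are made up for illustration):
import curses

screen_shadow = {}  # (y, x) -> (text, attr), maintained by us

def addstr_tracked(win, y, x, text, attr=curses.A_NORMAL):
    # Write to the real window and record what was written, so wide
    # characters and their attributes can be read back later without inch()/instr()
    win.addstr(y, x, text, attr)
    screen_shadow[(y, x)] = (text, attr)

def main(stdscr):
    addstr_tracked(stdscr, 1, 1, u'\u03b1', curses.A_BOLD)
    stdscr.refresh()
    text, att = screen_shadow[(1, 1)]  # both the character and its attribute
    stdscr.addstr(2, 1, '{} {}'.format(text, att))
    stdscr.refresh()
    stdscr.getch()

curses.wrapper(main)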

How to force Arabic characters to be separate?

I'm trying to type a set of Arabic characters without spaces onto an image using Pillow. The problem I'm currently having is that some Arabic characters, when placed next to each other, appear differently than when they are separate (e.g. س and ل become سل when put next to each other). I'm trying to somehow force my font settings to always keep all characters separate, without injecting any other characters. What should I do?
Here is a snippet of my code:
# font is an Arabic font, and font_path is pointing to that location.
font = ImageFont.truetype(
    font=font_path, size=size,
    layout_engine=ImageFont.LAYOUT_RAQM)
h, w = font.getsize(text, direction='rtl')
offset = font.getoffset(text)
H, W = int(1.5 * h), int(1.5 * w)
imgSize = H, W
img = Image.new(mode='1', size=imgSize, color=0)
draw = ImageDraw.Draw(img)
pos = ((H - h) / 2, (W - w) / 2)
draw.text(pos, text, fill=255, font=font,
          direction='rtl', align='center')
What you're describing might be possible with some fonts that support Arabic, specifically those that encode the position-sensitive forms in Unicode's Arabic Presentation Forms-B block. You would need to map your input text's character codes to the correct positional variant. So for the example characters seen and lam that you described, U+0633 س‎ and U+0644 ل‎, you would want the initial form of U+0633, which is U+FEB3 ﺳ‎, and the final form of U+0644, which is U+FEDE ﻞ. Putting those together (separated by a regular space): ﺳ ﻞ.
There is a useful chart showing the positional forms at https://en.wikipedia.org/wiki/Arabic_script_in_Unicode#Contextual_forms.
But, important to understand:
- not all fonts that contain Arabic have the Presentation Forms encoded (many fonts do not)
- not all Arabic codes have an equivalent in the Presentation Forms range (most of the basic ones do, but some extended Arabic characters for other languages do not have Presentation Forms)
- you are responsible for processing your input text (in the U+06xx range) into the correct presentation form (U+FExx range) codes based on the word/group context, which can be tricky. That job normally falls to an OpenType layout engine, but the layout engine also performs the joining, so you're basically overriding that logic (a rough sketch of this substitution follows below).
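For a concrete, if simplistic, illustration of that substitution, here's a sketch that hard-codes only the two example letters; a real table would need an entry for every letter and positional variant you use:
# -*- coding: utf-8 -*-
# Map base Arabic codepoints to explicit Presentation Forms-B codepoints.
PRESENTATION_FORMS = {
    u'\u0633': u'\uFEB3',  # seen -> seen, initial form
    u'\u0644': u'\uFEDE',  # lam  -> lam, final form
}

def to_presentation_forms(text):
    # Substitute mapped characters; anything unmapped passes through unchanged
    return u''.join(PRESENTATION_FORMS.get(ch, ch) for ch in text)

text = to_presentation_forms(u'\u0633\u0644')  # u'\uFEB3\uFEDE'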

In Python, how do I most efficiently chunk a UTF-8 string for REST delivery?

I'll start out by saying I sort of understand what 'UTF-8' encoding is, that it is basically but not exactly unicode, and that ASCII is a smaller character set. I also understand that if I have:
se_body = "> Genesis 2:2 וַיְכַל אֱלֹהִים בַּיֹּום הַשְּׁבִיעִי מְלַאכְתֹּו אֲשֶׁר עָשָׂה וַיִּשְׁבֹּת בַּיֹּום הַשְּׁבִיעִי מִכָּל־מְלַאכְתֹּו אֲשֶׁר עָשָֽׂה׃ The word tr <excess removed ...> JV"
print len(se_body) #will return the number of characters in the string, in my case '1500'
print sys.getsizeof(se_body) #will return the number of bytes, which will be 3050
My code is leveraging a RESTful API that I do not control. Said RESTful API's job is to parse a passed parameter for bible references out of the text, and has an interesting quirk - it only accepts 2000 characters at a time. If more than 2000 characters are sent, my API call will return a 404. Again, to stress, I am leveraging someone else's API, so please don't tell me "fix the server side." I can't :)
My solution is to take the string and chunk it into bits that are less than 2000 characters, let the service scan each chunk, and then reassemble and tag as needed. I'd like to be kind to said service and pass as few chunks as possible, meaning that each chunk should be large.
My problem comes when I pass a string with Hebrew or Greek characters in it. (Yes, biblical answers often use Greek and Hebrew!) If I set the chunk size as low as 1000 characters, I can always safely pass it, but this just seems really small. In most cases, I should be able to chunk it larger.
My question is this: without resorting to too many heroics, what is the most efficient way I can chunk a UTF-8 string to the correct size?
Here's the code:
# -*- coding: utf-8 -*-
import requests
import json
biblia_apikey = '************'
refparser_url = "http://api.biblia.com/v1/bible/scan/?"
se_body = "> Genesis 2:2 וַיְכַל אֱלֹהִים בַּיֹּום הַשְּׁבִיעִי מְלַאכְתֹּו אֲשֶׁר עָשָׂה וַיִּשְׁבֹּת בַּיֹּום הַשְּׁבִיעִי מִכָּל־מְלַאכְתֹּו אֲשֶׁר עָשָֽׂה׃ The word translated as "rest" in English, is actually the conjugated word from which we get the English word `Sabbath`, which actually means to "cease doing". > וַיִּשְׁבֹּת or by its root: > שָׁבַת Here's BlueletterBible's concordance entry: [Strong's H7673][1] It is actually the same root word that is conjugated to mean "[to go on strike][2]" in modern Hebrew. In Genesis it is used to refer to the fact that the creation process ceased, not that God "rested" in the sense of relieving exhaustion, as we would normally understand the term in English. The word "rest" in that sense is > נוּחַ Which can be found in Genesis 8:9, for example (and is also where we get Noah's name). More here: [Strong's H5117][3] Jesus' words are in reference to the fact that God is always at work, as the psalmist says in Psalm 54:4, He is the sustainer, something that implies a constant intervention (a "work" that does not cease). The institution of the Sabbath was not merely just so the Israelites would "rest" from their work but as with everything God institutes in the Bible, it had important theological significance (especially as can be gleaned from its prominence as one of the 10 commandments). The point of the Sabbath was to teach man that he should not think he is self-reliant (cf. instances such as Judges 7) and that instead they should rely upon God, but more specifically His mercy. The driving message throughout the Old Testament as well as the New (and this would be best extrapolated in c.se) is that man cannot, by his own efforts ("works") reach God's standard: > Ephesians 2:8 For by grace you have been saved through faith, and that not of yourselves; it is the gift of God, 9 not of works, lest anyone should boast. The Sabbath (and the penalty associated with breaking it) was a way for the Israelites to weekly remember this. See Hebrews 4 for a more in depth explanation of this concept. So there is no contradiction, since God never stopped "working", being constantly active in sustaining His creation, and as Jesus also taught, the Sabbath was instituted for man, to rest, but also, to "stop doing" and remember that he is not self-reliant, whether for food, or for salvation. Hope that helps. [1]: http://www.blueletterbible.org/lang/lexicon/lexicon.cfm?Strongs=H7673&t=KJV [2]: http://www.morfix.co.il/%D7%A9%D7%91%D7%99%D7%AA%D7%94 [3]: http://www.blueletterbible.org/lang/lexicon/lexicon.cfm?strongs=H5117&t=KJV"
se_body = se_body.decode('utf-8')
nchunk_start=0
nchunk_size=1500
found_refs = []
while nchunk_start < len(se_body):
    body_chunk = se_body[nchunk_start:nchunk_size]
    if (len(body_chunk.strip()) < 4):
        break
    refparser_params = {'text': body_chunk, 'key': biblia_apikey}
    headers = {'content-type': 'text/plain; charset=utf-8', 'Accept-Encoding': 'gzip,deflate,sdch'}
    refparse = requests.get(refparser_url, params=refparser_params, headers=headers)
    if (refparse.status_code == 200):
        foundrefs = json.loads(refparse.text)
        for foundref in foundrefs['results']:
            foundref['textIndex'] += nchunk_start
            found_refs.append(foundref)
    else:
        print "Status Code {0}: Failed to retrieve valid parsing info at {1}".format(refparse.status_code, refparse.url)
        print "    returned text is: =>{0}<=".format(refparse.text)
    nchunk_start += (nchunk_size - 50)
    # Note: I'm purposely backing up, so that I don't accidentally split a reference across chunks

for ref in found_refs:
    print ref
    print se_body[ref['textIndex']:ref['textIndex'] + ref['textLength']]
I know how to slice a string (body_chunk = se_body[nchunk_start:nchunk_size]), but I'm not sure how I would go about slicing the same string according to its length in UTF-8 bytes.
When I'm done, I need to pull out the selected references (I'm actually going to add SPAN tags). This is what the output would look like for now though:
{u'textLength': 11, u'textIndex': 5, u'passage': u'Genesis 2:2'}
Genesis 2:2
{u'textLength': 11, u'textIndex': 841, u'passage': u'Genesis 8:9'}
Genesis 8:9
There could be several sizes:
Size in memory returned by sys.getsizeof() e.g.,
>>> import sys
>>> sys.getsizeof(b'a')
38
>>> sys.getsizeof(u'Α')
56
i.e., a bytestring that contains a single byte b'a' may require 38 bytes in memory.
You shouldn't care about it unless your local machine has memory problems
The number of bytes in the text encoded as utf-8:
>>> unicode_text = u'Α' # greek letter
>>> bytestring = unicode_text.encode('utf-8')
>>> len(bytestring)
2
The number of Unicode codepoints in the text:
>>> unicode_text = u'Α' # greek letter
>>> len(unicode_text)
1
In general, you might also be interested in number of grapheme clusters ("visual characters") in the text:
>>> unicode_text = u'ё' # cyrillic letter
>>> len(unicode_text) # number of Unicode codepoints
2
>>> import regex # $ pip install regex
>>> chars = regex.findall(u'\\X', unicode_text)
>>> chars
[u'\u0435\u0308']
>>> len(chars) # number of "user-perceived characters"
1
If the API limit is defined by the second measure above (the number of bytes in the UTF-8-encoded bytestring), then you could use the answers from the question linked by @Martijn Pieters: Truncating unicode so it fits a maximum size when encoded for wire transfer. The first answer should work:
truncated = unicode_text.encode('utf-8')[:2000].decode('utf-8', 'ignore')
There is also a possibility that the length is limited by the url length:
>>> import urllib
>>> urllib.quote(u'\u0435\u0308'.encode('utf-8'))
'%D0%B5%CC%88'
To truncate it:
import re
import urllib
urlencoded = urllib.quote(unicode_text.encode('utf-8'))[:2000]
# remove `%` or `%X` at the end
urlencoded = re.sub(r'%[0-9a-fA-F]?$', '', urlencoded)
truncated = urllib.unquote(urlencoded).decode('utf-8', 'ignore')
The issue with the URL length might be solved by using the 'X-HTTP-Method-Override' HTTP header, which allows a GET request to be converted into a POST request if the service supports it. Here's a code example that uses the Google Translate API.
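A rough sketch of that idea against the question's own API (assuming the service honours the override header; refparser_url, biblia_apikey and body_chunk come from the question's snippet, and this is untested):
import requests

headers = {'X-HTTP-Method-Override': 'GET'}  # ask the server to treat this POST as a GET
refparse = requests.post(refparser_url,
                         data={'text': body_chunk, 'key': biblia_apikey},
                         headers=headers)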
If it is allowed in your case, you could compress the HTML text by decoding HTML character references and using the NFC Unicode normalization form to combine some Unicode codepoints:
import unicodedata
from HTMLParser import HTMLParser
unicode_text = unicodedata.normalize('NFC', HTMLParser().unescape(unicode_text))
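If the limit really is a byte limit, a chunking loop built on the same encode/truncate/decode trick might look like this (a sketch only; the function name is made up, and chunk boundaries can still fall mid-word):
def chunk_by_utf8_bytes(unicode_text, max_bytes=2000):
    """Yield (start_index, chunk) pairs whose UTF-8 encoding fits in max_bytes,
    never splitting a multi-byte character."""
    start = 0
    while start < len(unicode_text):
        encoded = unicode_text[start:].encode('utf-8')[:max_bytes]
        chunk = encoded.decode('utf-8', 'ignore')  # drop any trailing partial character
        if not chunk:
            break
        yield start, chunk
        start += len(chunk)

# for offset, chunk in chunk_by_utf8_bytes(se_body, 2000):
#     send chunk to the API, then add offset to the returned textIndex values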

Python: Removing particular character (u"\u2610") from string

I have been wrestling with decoding and encoding in Python, and I can't quite figure out how to resolve my problem. I am looping over XML text files (sample) that are apparently encoded in UTF-8, using Beautiful Soup to parse each file, then looking to see if any sentence in the file contains one or more words from two different lists of words. Because the XML files are from the eighteenth century, I need to retain the em dashes that are in the XML. The code below does this just fine, but it also retains a pesky box character that I wish to remove. I believe the box character is this character.
(You can find an example of the character I wish to remove in line 3682 of the sample file above. On this webpage, the character looks like an 'or' pipe, but when I read the xml file in Komodo, it looks like a box. When I try to copy and paste the box into a search engine, it looks like an 'or' pipe. When I print to console, though, the character looks like an empty box.)
To sum up, the code below runs without errors, but it prints the empty box character that I would like to remove.
for work in glob.glob(pathtofiles):
    openfile = open(work)
    readfile = openfile.read()
    stringfile = str(readfile)
    decodefile = stringfile.decode('utf-8', 'strict')  # is this the dodgy line?
    soup = BeautifulSoup(decodefile)
    textwithtags = soup.findAll('text')
    textwithtagsasstring = str(textwithtags)
    # this method strips everything between anglebrackets as it should
    textwithouttags = stripTags(textwithtagsasstring)
    # clean text
    nonewlines = textwithouttags.replace("\n", " ")
    noextrawhitespace = re.sub(' +', ' ', nonewlines)
    print noextrawhitespace  # the boxes appear
I tried to remove the boxes by using
noboxes = noextrawhitespace.replace(u"\u2610", "")
But Python threw an error flag:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 280: ordinal not in range(128)
Does anyone know how I can remove the boxes from the xml files? I would be grateful for any help others can offer.
The problem is that you're mixing unicode and str. Whenever you do that, Python has to convert one to the other, which it does using sys.getdefaultencoding(), which is usually ASCII, which is almost never what you want.*
If the exception comes from this line:
noboxes = noextrawhitespace.replace(u"\u2610", "")
… the fix is simple… except that you have to know whether noextrawhitespace is supposed to be a unicode object or a UTF-8-encoded str object. If the former, it's this:
noboxes = noextrawhitespace.replace(u"\u2610", u"")
If the latter, it's this:
noboxes = noextrawhitespace.replace(u"\u2610".encode('utf-8'), "")
But really, you have to get all of the strings consistent in your code; mixing the two up is going to cause problems in more places than this one.
Since I don't have your XML files to test, I wrote my own:
<xml>
<text>abc☐def</text>
</xml>
Then, I added these two lines to the bottom of your code (and a bit to the top to just open my file instead of globbing for whatever):
noboxes = noextrawhitespace.replace(u"\u2610".encode('utf-8'), "")
print noboxes
The output is now:
[<text>abc☐def</text>]
[<text>abc☐def</text>]
[<text>abcdef</text>]
So, I think that's what you want here.
* Sure sometimes you want ASCII… but those aren't usually the times when you have unicode objects…
Give this a try:
noextrawhitespace.replace("\\u2610", "")
I think you are just missing that extra '\'
This might also work.
print(noextrawhitespace.decode('unicode_escape').encode('ascii','ignore'))
Reading your sample, the following are the non-ASCII characters in the document:
0x2223 DIVIDES
0x2022 BULLET
0x3009 RIGHT ANGLE BRACKET
0x25aa BLACK SMALL SQUARE
0x25ca LOZENGE
0x3008 LEFT ANGLE BRACKET
0x2014 EM DASH
0x2026 HORIZONTAL ELLIPSIS
\u2223 is the actual character in question in line 3682, and it is being used as a soft hyphen. The others are used in markup for tagging illegible characters, such as:
<GAP DESC="illegible" RESP="oxf" EXTENT="4+ letters" DISP="\u2022\u2022\u2022\u2022\u2026"/>
Here's some code to do what your code is attempting. Make sure to process in Unicode:
from bs4 import BeautifulSoup
import re

with open('k000039.000.xml') as f:
    soup = BeautifulSoup(f)  # BS figures out the encoding
text = u''.join(soup.strings)  # strings is a generator for just the text bits.
text = re.sub(ur'\s+', ur' ', text)  # Simplify all white space.
text = text.replace(u'\u2223', u'')  # Get rid of the DIVIDES character.
print text
Output:
[[truncated]] reckon my self a Bridegroom too. Buckle. I doubt Kickey won't find him such. [Aside.] Mrs. Sago. Well,—poor Keckky's bound to good Behaviour, or she had lost quite her Puddy's Favour. Shall I for this repine at Fortune?—No. I'm glad at Heart that I'm forgiven so. Some Neighbours Wives have but too lately shown, When Spouse had left 'em all their Friends were flown. Then all you Wives that wou'd avoid my Fate. Remain contented with your present State FINIS.

Python’s `str.format()`, fill characters, and ANSI colors

In Python 2, I’m using str.format() to align a bunch of columns of text I’m printing to a terminal. Basically, it’s a table, but I’m not printing any borders or anything—it’s simply rows of text, aligned into columns.
With no color-fiddling, everything prints as expected.
If I wrap an entire row (i.e., one print statement) with ANSI color codes, everything prints as expected.
However: If I try to make each column a different color within a row, the alignment is thrown off. Technically, the alignment is preserved; it’s the fill characters (spaces) that aren’t printing as desired; in fact, the fill characters seem to be completely removed.
I’ve verified the same issue with both colorama and xtermcolor. The results were the same. Therefore, I’m certain the issue has to do with str.format() not playing well with ANSI escape sequences in the middle of a string.
But I don’t know what to do about it! :( I would really like to know if there’s any kind of workaround for this problem.
Color and alignment are powerful tools for improving readability, and readability is an important part of software usability. It would mean a lot to me if this could be accomplished without manually aligning each column of text.
Little help? ☺
This is a very late answer, left as bread crumbs for anyone who finds this page while struggling to format text with built-in ANSI color codes.
byoungb's comment about making padding decisions on the length of pre-colorized text is exactly right. But if you already have colored text, here's a work-around:
See my ansiwrap module on PyPI. Its primary purpose is providing textwrap for ANSI-colored text, but it also exports ansilen() which tells you "how long would this string be if it didn't contain ANSI control codes?" It's quite useful in making formatting, column-width, and wrapping decisions on pre-colored text. Add width - ansilen(s) spaces to the end or beginning of s to left (or respectively, right) justify s in a column of your desired width. E.g.:
def ansi_ljust(s, width):
    needed = width - ansilen(s)
    if needed > 0:
        return s + ' ' * needed
    else:
        return s
Also, if you need to split, truncate, or combine colored text at some point, you will find that ANSI's stateful nature makes that a chore. You may find ansi_terminate_lines() helpful; it "patches up" a list of sub-strings so that each has independent, self-standing ANSI codes with the same effect as the original string.
The latest versions of ansicolors also contain an equivalent implementation of ansilen().
Python doesn't distinguish between 'normal' characters and ANSI colour codes, which are also characters that the terminal interprets.
In other words, printing '\x1b[92m' to a terminal may change the terminal text colour, but Python doesn't see that as anything other than a set of 5 characters. If you use print repr(line) instead, Python will print the string literal form, using escape codes for non-printable characters (so the ESC character, ASCII code 27, is displayed as \x1b), so you can see how many extra characters have been added.
You'll need to adjust your column alignments manually to allow for those extra characters.
Without your actual code, that's hard for us to help you with though.
Also late to the party. Had this same issue dealing with color and alignment. Here is a function I wrote which adds padding to a string that has characters that are 'invisible' by default, such as escape sequences.
def ljustcolor(text: str, padding: int, char=" ") -> str:
    import re
    pattern = r'(?:\x1B[#-_]|[\x80-\x9F])[0-?]*[ -/]*[#-~]'
    matches = re.findall(pattern, text)
    offset = sum(len(match) for match in matches)
    return text.ljust(padding + offset, char[0])
The pattern matches all ansi escape sequences, including color codes. We then get the total length of all matches which will serve as our offset when we add it to the padding value in ljust.
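For instance, a quick usage sketch (the escape codes here are the standard ANSI red/green/reset sequences; the rows are made up):
GREEN, RED, RESET = '\x1b[32m', '\x1b[31m', '\x1b[0m'

rows = [
    (GREEN + 'PASS' + RESET, 'test_login'),
    (RED + 'FAIL' + RESET, 'test_logout'),
]
for status, name in rows:
    # Without the offset, the invisible escape codes would eat into the 10-column field
    print(ljustcolor(status, 10) + name)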
