How do we remove all emoji values from strings in python 3?


I am trying to write a program that will get tweets and then insert them into a csv file but I get this error:
Traceback (most recent call last):
File "c:/Users/Fateh Aliyev/Desktop/Python/AI/Data Mining/data.py", line 30, in <module>
csv.writerow([text, 0])
File "C:\Python\lib\encodings\cp1252.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\U0001f44c' in position 41: character maps to <undefined>
I am sure that this is from the emojis that are in the strings. I tried this solution but I got the same error. Is this caused by python not being able to encode the string in the first place or something else? How do we get rid of the emojis?

You can remove the emoji by ignoring it when it cannot be encoded:
import codecs
codecs.charmap_encode('\U0001f44c', 'ignore')
# outputs: (b'', 1)
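The same 'ignore' error handler can be applied with the codec named in the traceback (cp1252), or the CSV file can be opened as UTF-8 so that nothing needs to be dropped. The following is a minimal sketch, assuming Python 3; the helper name strip_unencodable and the file name tweets.csv are hypothetical and not part of the original program.

```python
import csv
import os
import tempfile

def strip_unencodable(text, encoding='cp1252'):
    """Drop every character that the target encoding cannot represent."""
    return text.encode(encoding, errors='ignore').decode(encoding)

tweet = 'Nice one \U0001f44c!'
cleaned = strip_unencodable(tweet)  # the emoji is removed

# Alternative: keep the emoji by writing the file as UTF-8.
path = os.path.join(tempfile.mkdtemp(), 'tweets.csv')  # hypothetical file name
with open(path, 'w', encoding='utf-8', newline='') as f:
    csv.writer(f).writerow([tweet, 0])

# Read the row back to confirm the emoji survived.
with open(path, encoding='utf-8', newline='') as f:
    rows = list(csv.reader(f))
```

Stripping is lossy, so the UTF-8 route is usually preferable when whatever reads the CSV can handle it.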

Related

Python 2 Print emoji in a sentence as rectangles

I have this 😭👏🏻 (loudly crying face and clapping hand emoji character) in a string.txt file (encoded in utf-8).
I am trying to print it out into the default python IDLE, in a sentence.
with open('string.txt','r') as f:
    string = f.read()
The code:
>>> string
'\xf0\x9f\x98\xad\xf0\x9f\x91\x8f\xf0\x9f\x8f\xbb'
>>> print string
ðﾟﾘﾭðﾟﾑﾏðﾟﾏﾻ
>>> print string.decode('utf-8')
😭👏🏻 # <-- this is the output I want in a middle of sentence
That's the output I want (rectangles). The tricky part is that I want them in the middle of a sentence. So:
>>> print 'The string is: {}!'.format(string.decode('utf-8')) # will get error
Traceback (most recent call last):
File "<pyshell#81>", line 1, in <module>
print 'The string is: {}!'.format(string.decode('utf-8'))
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-5: ordinal not in range(128)
Got an error. But if I don't decode it, it works:
>>> print 'The string is: {}!'.format(string)
The string is: ðﾟﾘﾭðﾟﾑﾏðﾟﾏﾻ!
It did not raise any error, but I don't want this output. I want the rectangles.
How should I solve this issue so it will behave like this:
>>> print 'The string is: {}!'.format(magical_string)
The string is: 😭👏🏻!
Preferred to not use any 3rd party library.
EDIT:
My Operating System: Windows 7 (preferred solution for all Windows 7-10)
Python: 2.7

I think it's a setting of your IDE, and not really a python issue.
When I save the first line of your question into a txt file and read it:
Copied from terminal:
>>> open('test.txt').read()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Users\joost\Desktop\pythontests\venv\lib\encodings\cp1252.py", line 23, in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8f in position 19: character maps to <undefined>
>>> open('test.txt', encoding='utf-8').read()
'I have this 😭👏🏻 (loudly crying\n'
>>>
Perhaps specify your encoding when opening the file?
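In Python 2, the UnicodeEncodeError in the question arises because a unicode value is formatted into a byte-string template, which forces an implicit ASCII encode. A minimal sketch of one way around this is to decode on read and use a unicode template. It is written to run on both Python 2.7 and Python 3, and it creates its own copy of string.txt in a temporary directory so that it is self-contained.

```python
import io
import os
import tempfile

# Create a UTF-8 file like the question's string.txt (temporary location).
path = os.path.join(tempfile.mkdtemp(), 'string.txt')
with io.open(path, 'w', encoding='utf-8') as f:
    f.write(u'\U0001f62d\U0001f44f\U0001f3fb')

# io.open decodes while reading, so `text` is already a unicode string.
with io.open(path, 'r', encoding='utf-8') as f:
    text = f.read()

# A unicode template (u'...') keeps the result unicode, so no implicit
# ASCII encoding step is involved.
sentence = u'The string is: {}!'.format(text)
```

Whether the console or IDLE can then display the characters is a separate matter that depends on the output encoding and font.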

Syntax Error in ArcGIS Field Calculator

I got a tiny script in ArcGIS which creates a hyperlink.
My code:
def Befahrung(value1, value2):
    if value1 is '':
        return ''
    else:
        return "G:\\Example\\" + str(value1) + "\\File_" + str(value2) + ".pdf"
The error (only when !Bezeichnun! contains a special character):
ERROR 000539: Error running expression: Befahrung(u" ",u"1155Mönch1")
Traceback (most recent call last):
File "<expression>", line 1 in <module>
File "<string>", line 5 in Befahrung
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf6' in position 5: ordinal not in range(128)
!Bezeichnun! and !Auftrag! are both strings. It works very well until !Bezeichnun! contains a special character. I can't change the characters, I need to save them.
What do I have to change?

In Befahrung, you convert a Unicode string to a byte string, which implicitly encodes it as ASCII:
str(value1)
str(value2)
This cannot work if value1 or value2 contains non-ASCII characters. You want to use
unicode(value1)
or better, use string formatting:
return u"G:\\Example\\{}\\File_{}.pdf".format(value1, value2)
(works in Python 2.7 and above)
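Put together, the function could look like the sketch below. It assumes the field values arrive as Unicode strings, as the error message suggests, and it also compares with == rather than is, since identity comparison against a string literal is unreliable. The sample values passed in are hypothetical.

```python
def Befahrung(value1, value2):
    """Build the hyperlink path without converting the inputs to byte strings."""
    if value1 == u'':
        return u''
    # A unicode template keeps non-ASCII characters such as u'\xf6' intact.
    return u"G:\\Example\\{}\\File_{}.pdf".format(value1, value2)

# Hypothetical sample values containing a non-ASCII character.
link = Befahrung(u"1155M\xf6nch1", u"7")
empty = Befahrung(u"", u"7")
```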

I recommend reading the Python Unicode HOWTO. The error can be distilled to
>>> str(u"1155Mönch1")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf6' in position 5: ordinal not in range(128)
If you know what character encoding you need (e.g., UTF-8), you can encode it like
value1.encode('utf-8')

Another python unicode error

I'm getting errors such as
UnicodeEncodeError('ascii', u'\x01\xff \xfeJ a z z', 1, 2, 'ordinal not in range(128)'
I'm also getting sequences such as
u'\x17\x01\xff \xfeA r t B l a k e y'
I recognize \x01\xff\xfe as a BOM, but how do I transform these into the obvious output (Jazz and Art Blakey)?
These are coming from a program that reads music file tags.
I've tried various encodings, such as s.encode('utf8'), and various decodes followed by encodes, without success.
As requested:
from hsaudiotag import auto
inf = 'test.mp3'
song = auto.File(inf)
print song.album, song.artist, song.title, song.genre
Traceback (most recent call last):
File "audio2.py", line 4, in <module>
print song.album, song.artist, song.title, song.genre
File "C:\program files\python27\lib\encodings\cp437.py", line 12, in encode
return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode character u'\xfe' in position 4 : character maps to <undefined>
If I change the print statement to
with open('x', 'wb') as f:
    f.write(song.genre)
I get
Traceback (most recent call last):
File "audio2.py", line 6, in <module>
f.write(song.genre)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xff' in position 1: ordinal not in range(128)

For your actual question, you need to write bytes, not characters, to files. Call:
f.write(song.genre.encode('utf-8'))
and you won't get the error. Alternatively, use io.open in text mode to get a character stream that encodes automatically on write (binary mode 'wb' does not accept an encoding argument), i.e.:
import io
with io.open('x', 'w', encoding='utf-8') as f:
    f.write(song.genre)
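Both approaches can be compared side by side in a minimal sketch. The value of genre is a hypothetical stand-in for song.genre, and the file is written to a temporary directory. Text mode 'w' is used for io.open because binary mode does not accept an encoding argument.

```python
import io
import os
import tempfile

genre = u'Jazz \xfe'  # hypothetical stand-in for song.genre
path = os.path.join(tempfile.mkdtemp(), 'x')

# Option 1: binary mode, encode explicitly before writing.
with open(path, 'wb') as f:
    f.write(genre.encode('utf-8'))

with open(path, 'rb') as f:
    raw_bytes = f.read()

# Option 2: text mode; io.open encodes on write and decodes on read.
with io.open(path, 'w', encoding='utf-8') as f:
    f.write(genre)

with io.open(path, 'r', encoding='utf-8') as f:
    round_trip = f.read()
```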
Getting Unicode to the Console can be a matter of some difficulty (under Windows in particular)—see PrintFails.
However, as discussed in the comments, what you've got doesn't look like a working tag value. It looks more like a mangled ID3v2 frame data block, which might not be possible to recover. I don't know if this is a bug in your tag-reading library or you just have a file with rubbish tags.

Prevent encoding errors in Python

I have scripts which print out messages by the logging system or sometimes print commands. On the Windows console I get error messages like
Traceback (most recent call last):
File "C:\Python32\lib\logging\__init__.py", line 939, in emit
stream.write(msg)
File "C:\Python32\lib\encodings\cp850.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u2019' in position 4537:character maps to <undefined>
Is there a general way to make all encodings in the logging system, print commands, etc. fail-safe (ignore errors)?

The problem is that your terminal/shell (cmd, as you are on Windows) cannot print every Unicode character.
You can encode your strings fail-safe with the errors argument of the str.encode method. For example, you can replace unsupported characters with ? by setting errors='replace'.
>>> s = u'\u2019'
>>> print s
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Python27\lib\encodings\cp850.py", line 12, in encode
return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode character u'\u2019' in position 0: character maps to <undefined>
>>> print s.encode('cp850', errors='replace')
?
See the documentation for other options.
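For reference, the sketch below compares four of the standard error handlers on the same character. It is written for Python 3, whereas the session above is Python 2, and it collects the results in a dictionary named results purely for illustration.

```python
s = '\u2019'  # RIGHT SINGLE QUOTATION MARK, which cp850 cannot represent

# Encode the same character with several standard error handlers.
results = {
    handler: s.encode('cp850', errors=handler)
    for handler in ('replace', 'ignore', 'backslashreplace', 'xmlcharrefreplace')
}

for handler, encoded in results.items():
    print(handler, encoded)
```

'replace' and 'ignore' lose information, while 'backslashreplace' and 'xmlcharrefreplace' keep the code point recoverable in an escaped form.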
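Edit: If you want a general solution for logging, you can subclass StreamHandler and sanitize the formatted message (a LogRecord itself has no encode method):
class CustomStreamHandler(logging.StreamHandler):
    def format(self, record):
        msg = logging.StreamHandler.format(self, record)
        # Round-trip through cp850 so unsupported characters become '?'
        return msg.encode('cp850', errors='replace').decode('cp850')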
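In Python 3, another option is to make the output stream itself tolerant, so that logging and print both benefit without a custom handler. The sketch below simulates a cp850 console with an in-memory buffer; the logger name failsafe_demo is hypothetical. On a real console, Python 3.7 and later also offer sys.stdout.reconfigure(errors='replace') for the same effect.

```python
import io
import logging

# Simulate a cp850 console: a text stream over a byte buffer, with
# errors='replace' so unencodable characters become '?'.
raw = io.BytesIO()
console = io.TextIOWrapper(raw, encoding='cp850', errors='replace', newline='\n')

logger = logging.getLogger('failsafe_demo')  # hypothetical logger name
logger.setLevel(logging.INFO)
logger.propagate = False  # keep the demo isolated from root handlers
logger.addHandler(logging.StreamHandler(console))

# U+2019 is not in cp850; the stream replaces it instead of raising.
logger.info('It\u2019s done')
console.flush()
output = raw.getvalue()
```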

Python : UnicodeEncodeError: 'latin-1' codec can't encode character

I call an API and, based on the results, query the database for each record the API returns. The API returns strings, and when I make the database call for the returned items, I get the following error for some elements.
Traceback (most recent call last):
File "TopLevelCategories.py", line 267, in <module>
cursor.execute(categoryQuery, {'title': startCategory});
File "/opt/ts/python/2.7/lib/python2.7/site-packages/MySQLdb/cursors.py", line 158, in execute
query = query % db.literal(args)
File "/opt/ts/python/2.7/lib/python2.7/site-packages/MySQLdb/connections.py", line 265, in literal
return self.escape(o, self.encoders)
File "/opt/ts/python/2.7/lib/python2.7/site-packages/MySQLdb/connections.py", line 203, in unicode_literal
return db.literal(u.encode(unicode_literal.charset))
UnicodeEncodeError: 'latin-1' codec can't encode character u'\u2013' in position 3: ordinal not in range(256)
The segment of my code the above error is referring is:
...
for startCategory in value[0]:
    categoryResults = []
    try:
        categoryRow = ""
        baseCategoryTree[startCategory] = []
        #print categoryQuery % {'title': startCategory};
        cursor.execute(categoryQuery, {'title': startCategory}) #unicode issue
        done = False
        cont...
After some Google searching, I tried the following on my command line to understand what's going on...
>>> import sys
>>> u'\u2013'.encode('iso-8859-1')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'latin-1' codec can't encode character u'\u2013' in position 0: ordinal not in range(256)
>>> u'\u2013'.encode('cp1252')
'\x96'
>>> '\u2013'.encode('cp1252')
'\\u2013'
>>> u'\u2013'.encode('cp1252')
'\x96'
But I am not sure how to solve this issue. I also don't know the theory behind encode('cp1252'); it would be great to get some explanation of what I tried above.

If you need Latin-1 encoding, you have several options to get rid of the en-dash or other code points above 255 (characters not included in Latin-1):
>>> u = u'hello\u2013world'
>>> u.encode('latin-1', 'replace') # replace it with a question mark
'hello?world'
>>> u.encode('latin-1', 'ignore') # ignore it
'helloworld'
Or do your own custom replacements:
>>> u.replace(u'\u2013', '-').encode('latin-1')
'hello-world'
If you aren't required to output Latin-1, then UTF-8 is a common and preferred choice. It is recommended by the W3C and nicely encodes all Unicode code points:
>>> u.encode('utf-8')
'hello\xe2\x80\x93world'
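When Latin-1 output is required, the custom-replacement idea can be generalized with a translation table applied before encoding. This is a minimal Python 3 sketch; the PUNCTUATION mapping and the to_latin1 helper are hypothetical names, and the table only covers a few common typographic characters.

```python
# Hypothetical mapping from typographic punctuation to ASCII stand-ins.
PUNCTUATION = {
    0x2013: '-',   # en dash
    0x2014: '--',  # em dash
    0x2018: "'",   # left single quotation mark
    0x2019: "'",   # right single quotation mark
    0x201C: '"',   # left double quotation mark
    0x201D: '"',   # right double quotation mark
}

def to_latin1(text):
    """Substitute known punctuation, then replace anything still unencodable."""
    return text.translate(PUNCTUATION).encode('latin-1', errors='replace')

encoded = to_latin1('hello\u2013world \u201cquoted\u201d \u20ac')
```

Characters outside the table, such as the euro sign here, still fall back to a question mark.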

The Unicode character u'\u2013' is the "en dash". It is contained in the Windows-1252 (cp1252) character set (encoded as byte 0x96), but not in the Latin-1 (iso-8859-1) character set. Windows-1252 defines additional characters in the range 0x80-0x9F, among them the en dash.
The solution would be for you to choose a different target character set than Latin-1, such as Windows-1252 or UTF-8, or to replace the en dash with a simple "-".

u.encode('utf-8') converts it to bytes, which can then be written to stdout using sys.stdout.buffer.write(bytes) (Python 3).
Also check out the displayhook example in the sys module documentation:
https://docs.python.org/3/library/sys.html

We Keep Coding
