Python emoji to bytes write to txt file - python

I get this error
UnicodeEncodeError: 'charmap' codec can't encode character '\U0001f525' in position 0: character maps to
I would like to write for example "🔥" to a txt file and it should be \U0001f525 written in the txt file
Here's my code
test1 = f"{config['emoji']}"
with open('emoji.txt', 'w') as f:
f.write(test1)

test1 = "🔥"
with open('emoji.txt', 'w') as f:
transformed = (test1
.encode('utf-16', 'surrogatepass')\
.decode('utf-16')\
.encode("raw_unicode_escape")\
.decode("latin_1"))
f.write(transformed)
Adapted from this answer

Related

UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 295: character maps to <undefined>

I am trying to run this program from a book. I have created the file called 'alice_2.txt'
def count_words(filename):
"""Count the approximate number of words in a file."""
try:
with open(filename) as f_obj:
contents = f_obj.read()
except FileNotFoundError:
msg = "Sorry, the file " + filename + " does not exist."
print(msg)
else:
# Count approximate number of words in the file.
words = contents.split()
num_words = len(words)
print("The file " + filename + " has about " + str(num_words) +
" words.")
filename = 'alice_2.txt'
count_words(filename)
But I keep getting this error message
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 295: character maps to <undefined>
Can anyone explain what this means, and how to solve it?
You are trying to use an encoding which cannot store the character you have in file.
for example É› can't be opened in ascii since it doesn't have valid ascii code.
try to open the file using utf-8.
with open(filename, encoding='utf8') as f_obj:
pass
# DO your stuff

'charmap' codec can't decode byte 0x9d in position 4836: character maps to <undefined>

I am trying to figure out this error that pops up from this code:
filename = os.path.join(os.path.expanduser("~"), "data", "blogs",
"1005545.male.25.Engineering.Sagittarius.xml")
#filename = open('C:/Users/spenc/data/blogs/1005545.male.25.Engineering.Sagittarius.xml',
#encoding='utf-8', errors = 'ignore')
all_posts = []
allPosts = []
with open(filename) as inf:
postStart = False
post = []
for line in inf:
line = line.strip()
if line == "<post>":
postStart = True
elif line == "</post>":
postStart = False
allPosts.append("\n".join(post))
post =[]
elif postStart:
post.append(line)
print(allPosts[0])
print(len(allPosts))
filename.close()
and get this error:
File "D:\Anaconda-Python\lib\encodings\cp1252.py", line 23, in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 4836: character maps to <undefined> here
I am just trying to figure out the encoding error to make sure this works in finding the length of the posts and print the post itself, but it keeps getting caught up on the allposts.append line. Not really sure of anywork around or if there is a newer way of doing something of this sort. I was trying to follow a textbook on it, but cant continue on in the chapter until this has been worked out.

How can I replace 'æ' 'ø' and 'å' in a text without error: 'ascii' codec can't decode byte 0xc3 in position 0

I am making a program which is supposed to open a textfile, then replace letters 'æ, ø, and å' (Danish text) with 'ae, oe, aa'.
I need to open the program and run it through the mac terminal.
I tried using the replace() function, and tried writing:
# -*- coding: utf-8 -*-
#!/usr/bin/env python
in the beginning of the file.
But I keep getting error:
File "replace.py", line 20, in replace_nonascii
word = word.replace('Ã¥', 'aa')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)
any suggestions? have tried googling this for days, I have no clue how to fix it.
Here is my program:
filepath = input('insert path for text')
with codecs.open(filepath, 'r', encoding = 'utf8') as file_object:
filename_cont['text1'] = file_object.read()
def replace_nonascii(word):
word = word.lower()
word = word.replace('Ã¥', 'aa')
word = word.replace('æ', 'ae')
word = word.strip('/-.,?!')
print(word)
for text in filename_cont:
newtext = filename_cont[text]
for word in newtext.split():
replace_nonascii(word)

convert pdf to text file in python

My code works perfectly for some pdf, but some show error:
Traceback (most recent call last):
File "con.py", line 24, in <module>
print getPDFContent("abc.pdf")
File "con.py", line 17, in getPDFContent
f.write(a)
UnicodeEncodeError: 'ascii' codec can't encode character u'\u02dd' in position 64: ordinal not in range(128)
My code is
import pyPdf
def getPDFContent(path):
content = ""
pdf = pyPdf.PdfFileReader(file(path, "rb"))
for i in range(0, pdf.getNumPages()):
f=open("xxx.txt",'a')
content= pdf.getPage(i).extractText() + "\n"
import string
c=content.split()
for a in c:
f.write(" ")
f.write(a)
f.write('\n')
f.close()
return content
print getPDFContent("abc.pdf")
Your problem is that when you call f.write() with a string, it is trying to encode it using the ascii codec. Your pdf contains characters that can not be represented by the ascii codec. Try explicitly encoding your str, e.g.
a = a.encode('utf-8')
f.write(a)
Try
import sys
print getPDFContent("abc.pdf").encode(sys.getfilesystemencoding())

python: unicode problem

I am trying to decode a string I took from file:
file = open ("./Downloads/lamp-post.csv", 'r')
data = file.readlines()
data[0]
'\xff\xfeK\x00e\x00y\x00w\x00o\x00r\x00d\x00\t\x00C\x00o\x00m\x00p\x00e\x00t\x00i\x00t\x00i\x00o\x00n\x00\t\x00G\x00l\x00o\x00b\x00a\x00l\x00
\x00M\x00o\x00n\x00t\x00h\x00l\x00y\x00
\x00S\x00e\x00a\x00r\x00c\x00h\x00e\x00s\x00\t\x00D\x00e\x00c\x00
\x002\x000\x001\x000\x00\t\x00N\x00o\x00v\x00
\x002\x000\x001\x000\x00\t\x00O\x00c\x00t\x00
\x002\x000\x001\x000\x00\t\x00S\x00e\x00p\x00
\x002\x000\x001\x000\x00\t\x00A\x00u\x00g\x00
\x002\x000\x001\x000\x00\t\x00J\x00u\x00l\x00
\x002\x000\x001\x000\x00\t\x00J\x00u\x00n\x00
\x002\x000\x001\x000\x00\t\x00M\x00a\x00y\x00
\x002\x000\x001\x000\x00\t\x00A\x00p\x00r\x00
\x002\x000\x001\x000\x00\t\x00M\x00a\x00r\x00
\x002\x000\x001\x000\x00\t\x00F\x00e\x00b\x00
\x002\x000\x001\x000\x00\t\x00J\x00a\x00n\x00
\x002\x000\x001\x000\x00\t\x00A\x00d\x00
\x00s\x00h\x00a\x00r\x00e\x00\t\x00S\x00e\x00a\x00r\x00c\x00h\x00
\x00s\x00h\x00a\x00r\x00e\x00\t\x00E\x00s\x00t\x00i\x00m\x00a\x00t\x00e\x00d\x00
\x00A\x00v\x00g\x00.\x00
\x00C\x00P\x00C\x00\t\x00E\x00x\x00t\x00r\x00a\x00c\x00t\x00e\x00d\x00
\x00F\x00r\x00o\x00m\x00
\x00W\x00e\x00b\x00
\x00P\x00a\x00g\x00e\x00\t\x00L\x00o\x00c\x00a\x00l\x00
\x00M\x00o\x00n\x00t\x00h\x00l\x00y\x00
\x00S\x00e\x00a\x00r\x00c\x00h\x00e\x00s\x00\n'
Adding ignore do not really help...:
In [69]: data[2]
Out[69]: u'\u6700\u6100\u7200\u6400\u6500\u6e00\u2000\u6c00\u6100\u6d00\u7000\u2000\u7000\u6f00\u7300\u7400\u0900\u3000\u2e00\u3900\u3400\u0900\u3800\u3800\u3000\u0900\u2d00\u0900\u3300\u3200\u3000\u0900\u3300\u3900\u3000\u0900\u3300\u3900\u3000\u0900\u3400\u3800\u3000\u0900\u3500\u3900\u3000\u0900\u3500\u3900\u3000\u0900\u3700\u3200\u3000\u0900\u3700\u3200\u3000\u0900\u3300\u3900\u3000\u0900\u3300\u3200\u3000\u0900\u3200\u3600\u3000\u0900\u2d00\u0900\u2d00\u0900\ua300\u3200\u2e00\u3100\u3800\u0900\u2d00\u0900\u3400\u3800\u3000\u0a00'
In [70]: data[2].decode("utf-8",
"replace")
---------------------------------------------------------------------------
Traceback (most recent call last)
/Users/oleg/ in
()
/opt/local/lib/python2.5/encodings/utf_8.py
in decode(input, errors)
14
15 def decode(input, errors='strict'):
---> 16 return codecs.utf_8_decode(input, errors,
True)
17
18 class IncrementalEncoder(codecs.IncrementalEncoder):
:
'ascii' codec can't encode characters
in position 0-87: ordinal not in
range(128)
In [71]:
This looks like UTF-16 data. So try
data[0].rstrip("\n").decode("utf-16")
Edit (for your update): Try to decode the whole file at once, that is
data = open(...).read()
data.decode("utf-16")
The problem is that the line breaks in UTF-16 are "\n\x00", but using readlines() will split at the "\n", leaving the "\x00" character for the next line.
This file is a UTF-16-LE encoded file, with an initial BOM.
import codecs
fp= codecs.open("a", "r", "utf-16")
lines= fp.readlines()
EDIT
Since you posted 2.7 this is the 2.7 solution:
file = open("./Downloads/lamp-post.csv", "r")
data = [line.decode("utf-16", "replace") for line in file]
Ignoring undecodeable characters:
file = open("./Downloads/lamp-post.csv", "r")
data = [line.decode("utf-16", "ignore") for line in file]

Categories