UnicodeDecodeError - Add encoding to custom function input - python

Could you tell me where I'm going wrong with my current way of thinking? This is my function:
def replace_line(file_name, line_num, text):
    lines = open(f"realfolder/files/{item}.html", "r").readlines()
    lines[line_num] = text
    out = open(file_name, 'w')
    out.writelines(lines)
    out.close()
This is an example of it being called:
replace_line(f'./files/{item}.html', 9, f'text {item} wordswordswords' + '\n')
I need to encode the text input as utf-8. I'm not sure why I haven't been able to do this already. I also need to retain the f-string value.
I've been trying things like adding:
str.encode(text)
# or
text.encode(encoding='utf-8')
to the top of my replace_line function. This hasn't worked. I have tried dozens of different methods, but each one leaves me with this error:
UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 2982: character maps to <undefined>

You need to set the encoding to utf-8 both when opening the file to read from:
lines = open(f"realfolder/files/{item}.html", "r", encoding="utf-8").readlines()
and when opening the file to write to:
out = open(file_name, 'w', encoding="utf-8")
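Putting both changes together, here is a minimal sketch of the corrected function. It assumes the read path should come from the file_name parameter (the original hard-codes a path built from {item}, which isn't defined inside the function) and uses with blocks so the file handles are always closed:

def replace_line(file_name, line_num, text):
    # decode the existing file as UTF-8
    with open(file_name, "r", encoding="utf-8") as f:
        lines = f.readlines()
    # swap in the replacement line (the caller supplies the trailing '\n')
    lines[line_num] = text
    # encode everything back out as UTF-8
    with open(file_name, "w", encoding="utf-8") as f:
        f.writelines(lines)

The original call site then works unchanged, f-string and all:

replace_line(f'./files/{item}.html', 9, f'text {item} wordswordswords' + '\n')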

Related

Python can't read a txt file that contains "\"

Here is my code that simply reads a txt file into a list:
with open('test.txt', 'r') as f:
    account_list = f.readlines()
f.close()
and here is the sample of test.txt
...
teosjis232:23123/2!#
fdios2313:43242///2323#
...
When I run this code to read the txt file, it shows a Unicode error:
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 1632: character maps to <undefined>
I think the problem is the \ in the txt file. Can anyone tell me how to read a txt file that contains a lot of \?
Try this, using utf-8 encoding:
with open('test.txt', 'r', encoding='utf-8') as f:
    account_list = f.readlines()
Problem solved:
with open('test.txt', 'r', encoding='unicode_escape') as f:
    account_list = f.readlines()
The encoding type unicode_escape works for me.
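One caution about this approach: unicode_escape does not just tolerate backslashes, it actively interprets backslash sequences in the raw bytes as Python string escapes, which can silently alter data like the account strings above. A small sketch of the difference (the sample bytes are made up for illustration):

raw = b'caf\\xe9 and \\u00e9'        # file literally contains: caf\xe9 and \u00e9
print(raw.decode('unicode_escape'))  # café and é  (escapes were interpreted)
print(raw.decode('utf-8'))           # caf\xe9 and \u00e9  (backslashes preserved)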
You can use pathlib, which reads the whole file in one call and accepts the encoding directly:
import pathlib
data = pathlib.Path('test.txt').read_text(encoding='utf-8')

Emoji converter by using python - (\ud83d\udc40) to the actual emoji symbol 👀

I have a simple (but extremely hard) question.
I'm looking for a way to convert a text file which contains this type of emoji code (\ud83d\udc40) into one which contains the actual emoji symbol 👀.
E.G.
with open(OUTPUT, "r+") as infileInsight:
    insightData = infileInsight.read()\
        .replace('\ud83d\udc40', '👀')\
        ......
with open(OUTPUT, "w+") as outfileInsight:
    outfileInsight.write(insightData)
Regarding the suggestion that this is a duplicate: if I do it this way:
with open(OUTPUT, "r+") as infileInsight:
    insightData = infileInsight.read()\
        .replace('\ud83d\udc40', '👀')\
        ......
with open(OUTPUT, "w+") as outfileInsight:
    outfileInsight.write(insightData.decode('unicode-escape'))
I have an error:
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2600' in position 30: ordinal not in range(128)
You just need the ensure_ascii=False option in json.dump.
If you're creating this file in the first place, just pass that option.
If someone else gave you this JSON file and you want to change it to use Unicode characters directly in strings (as opposed to Unicode escapes as it is now), you can do something like this:
import json

with open('input.txt', 'r') as infile:
    with open('output.txt', 'w') as outfile:
        for line in infile:
            data = json.loads(line)
            json.dump(data, outfile, ensure_ascii=False)
            outfile.write('\n')
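To see why this works, here is a short interactive sketch: the JSON parser combines the \ud83d\udc40 surrogate-pair escape back into a single character, and ensure_ascii controls whether dumping re-escapes it:

>>> import json
>>> json.loads('"\\ud83d\\udc40"')       # the parser combines the surrogate pair
'👀'
>>> json.dumps('👀')                      # default ensure_ascii=True re-escapes it
'"\\ud83d\\udc40"'
>>> json.dumps('👀', ensure_ascii=False)  # keep the real character
'"👀"'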

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc1 in position 0: invalid start byte

This is my code.
stock_code = open('/home/ubuntu/trading/456.csv', 'r')
csvReader = csv.reader(stock_code)
for st in csvReader:
eventcode = st[1]
print(eventcode)
I want to see the contents of the file (a CSV exported from Excel), but I get a UnicodeDecodeError. How can I fix it?
The CSV docs say:
Since open() is used to open a CSV file for reading, the file will by default be decoded into unicode using the system default encoding...
The error message shows that your system default is UTF-8, but the file itself is not valid UTF-8.
Solutions:
Make sure the file is using the correct encoding. For example, open the file using Notepad++, select Encoding from the menu and select UTF-8, then resave the file.
Alternatively, specify the encoding of the file when calling open(), like this:
import csv

my_encoding = 'utf-8'  # or whatever the encoding of the file actually is
with open('/home/ubuntu/trading/456.csv', 'r', encoding=my_encoding) as stock_code:
    csvReader = csv.reader(stock_code)
    for st in csvReader:
        eventcode = st[1]
        print(eventcode)
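If you don't know the file's real encoding, one option is to guess it from a sample of the raw bytes. This sketch uses the third-party chardet package (pip install chardet) with the path from the question:

import chardet

# read a chunk of raw bytes and let chardet guess the encoding
with open('/home/ubuntu/trading/456.csv', 'rb') as f:
    guess = chardet.detect(f.read(100000))

print(guess)  # e.g. {'encoding': 'Windows-1252', 'confidence': 0.73, 'language': ''}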

UnicodeEncodeError: 'ascii' codec can't encode

I have the following data container which is constantly being updated:
data = []
for val, track_id in zip(values, list(track_ids)):
    # below threshold
    if val < threshold:
        # structure data as dictionary
        pre_data = {"artist": sp.track(track_id)['artists'][0]['name'], "track": sp.track(track_id)['name'], "feature": filter_name, "value": val}
        data.append(pre_data)

# write to file
with open('db/json/' + user + '_' + product + '_' + filter_name + '.json', 'w') as f:
    json.dump(data, f, ensure_ascii=False, indent=4, sort_keys=True)
but I am getting a lot of errors like this:
json.dump(data,f, ensure_ascii=False, indent=4, sort_keys=True)
File"/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/__init__.py", line 190, in dump
fp.write(chunk)
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019' in position 6: ordinal not in range(128)
Is there a way I can get rid of this encoding problem once and for all?
I was told that this would do it:
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
but many people do not recommend it.
I use Python 2.7.10. Any clues?
When you write to a file that was opened in text mode, Python encodes the string for you. The default encoding is ascii, which generates the error you see; there are a lot of characters that can't be encoded to ASCII.
The solution is to open the file in a different encoding. In Python 2 you must use the codecs module, in Python 3 you can add the encoding= parameter directly to open. utf-8 is a popular choice since it can handle all of the Unicode characters, and for JSON specifically it's the standard; see https://en.wikipedia.org/wiki/JSON#Data_portability_issues.
import codecs

with codecs.open('db/json/' + user + '_' + product + '_' + filter_name + '.json', 'w', encoding='utf-8') as f:
    json.dump(data, f, ensure_ascii=False, indent=4, sort_keys=True)
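For comparison, on Python 3 the codecs module isn't needed at all, since the built-in open() takes the encoding directly (same concatenated path as in the question):

import json

with open('db/json/' + user + '_' + product + '_' + filter_name + '.json', 'w', encoding='utf-8') as f:
    json.dump(data, f, ensure_ascii=False, indent=4, sort_keys=True)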
Your object has unicode strings, and Python 2.x's support for unicode can be a bit spotty. First, let's make a short example that demonstrates the problem:
>>> obj = {"artist":u"Björk"}
>>> import json
>>> with open('deleteme', 'w') as f:
...     json.dump(obj, f, ensure_ascii=False)
...
Traceback (most recent call last):
File "<stdin>", line 2, in <module>
File "/usr/lib/python2.7/json/__init__.py", line 190, in dump
fp.write(chunk)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf6' in position 3: ordinal not in range(128)
From the json.dump help text:
If ``ensure_ascii`` is true (the default), all non-ASCII characters in the
output are escaped with ``\uXXXX`` sequences, and the result is a ``str``
instance consisting of ASCII characters only. If ``ensure_ascii`` is
``False``, some chunks written to ``fp`` may be ``unicode`` instances.
This usually happens because the input contains unicode strings or the
``encoding`` parameter is used. Unless ``fp.write()`` explicitly
understands ``unicode`` (as in ``codecs.getwriter``) this is likely to
cause an error.
Ah! There is the solution. Either use the default ensure_ascii=True and get ASCII-escaped unicode characters, or use the codecs module to open the file with the encoding you want. This works:
>>> import codecs
>>> with codecs.open('deleteme', 'w', encoding='utf-8') as f:
...     json.dump(obj, f, ensure_ascii=False)
...
>>>
Why not encode the specific string instead? Try the .encode('utf-8') method on the string that is raising the exception.

reading/writing files with umlauts in python (html to txt)

I know this has been asked several times, but I think I'm doing everything right and it still doesn't work, so before I go clinically insane I'll make a post. This is the code (it's supposed to convert HTML files to txt files and leave out certain lines):
fid = codecs.open(htmlFile, "r", encoding="utf-8")
if not fid:
    return
htmlText = fid.read()
fid.close()
stripped = strip_tags(unicode(htmlText))  # strip html tags (this is not the prob)
lines = stripped.split('\n')
out = []
for line in lines:  # just some stuff i want to leave out of the output
    if len(line) < 6:
        continue
    if '*' in line or '(' in line or '#' in line or ':' in line:
        continue
    out.append(line)
result = '\n'.join(out)
base, ext = os.path.splitext(htmlFile)
outfile = base + '.txt'
fid = codecs.open(outfile, "w", encoding='utf-8')
fid.write(result)
fid.close()
Thanks!
Not sure, but by doing
'\n'.join(out)
with a plain byte-string separator rather than a unicode one, you may be falling back to some non-UTF-8 codec. Try:
u'\n'.join(out)
to make sure you're using unicode objects everywhere.
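For what it's worth, the failure mode with mixed types on Python 2 looks like this: joining unicode with a byte string that contains non-ASCII bytes forces an implicit ASCII decode (the strings here are made up for illustration):

>>> u'\n'.join([u'This is \xe4 test', 'caf\xc3\xa9'])  # byte string mixed in
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 3: ordinal not in range(128)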
You haven't specified the problem, so this is a complete guess.
What is being returned by your strip_tags() function? Is it returning a unicode object, or is it a byte string? If the latter, it would likely cause decoding issues when you attempt to write it to a file. For example, if strip_tags() is returning a utf-8 encoded byte string:
>>> s = u'This is \xe4 test\nHere is \xe4nother line.'
>>> print s
This is ä test
Here is änother line.
>>> s_utf8 = s.encode('utf-8')
>>> f=codecs.open('test', 'w', encoding='utf8')
>>> f.write(s) # no problem with this... s is unicode, but
>>> f.write(s_utf8)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib64/python2.7/codecs.py", line 691, in write
return self.writer.write(data)
File "/usr/lib64/python2.7/codecs.py", line 351, in write
data, consumed = self.encode(object, self.errors)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 8: ordinal not in range(128)
If this is what you are seeing you need to make sure that you pass unicode in fid.write(result), which probably means ensuring that unicode is returned by strip_tags().
Also, a couple of other things I noticed in passing:
codecs.open() will raise an IOError exception if it cannot open the file. It will not return None, so the if not fid: test will not help. You need to use try/except, ideally with a with statement:
try:
    with codecs.open(htmlFile, "r", encoding="utf-8") as fid:
        htmlText = fid.read()
except IOError, e:
    # handle error
    print e
And, data that you read from a file opened via codecs.open() will automatically be converted to unicode, therefore calling unicode(htmlText) achieves nothing.
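To illustrate that last point: on Python 2, a file opened through codecs.open() already hands back unicode objects (assuming a hypothetical UTF-8 file named test.html):

>>> import codecs
>>> fid = codecs.open('test.html', 'r', encoding='utf-8')  # hypothetical file
>>> text = fid.read()
>>> type(text)  # already unicode; unicode(text) would be a no-op
<type 'unicode'>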
