I have a simple (but surprisingly hard) question.
I'm looking for a way to take a text file that contains emoji escape codes like \ud83d\udc40 and replace them with the actual emoji symbol, e.g. 👀.
E.g.:
with open(OUTPUT, "r+") as infileInsight:
    insightData = infileInsight.read()\
        .replace('\ud83d\udc40', '👀')\
        ......
with open(OUTPUT, "w+") as outfileInsight:
    outfileInsight.write(insightData)
Regarding the suggestion that this is a duplicate:
If I do it this way:
with open(OUTPUT, "r+") as infileInsight:
    insightData = infileInsight.read()\
        .replace('\ud83d\udc40', '👀')\
        ......
with open(OUTPUT, "w+") as outfileInsight:
    outfileInsight.write(insightData.decode('unicode-escape'))
I get this error:
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2600' in position 30: ordinal not in range(128)
You just need the ensure_ascii=False option in json.dump.
If you're creating this file in the first place, just pass that option.
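For example, a minimal sketch (Python 3 assumed; the data dict and file name here are hypothetical):
import json

# With ensure_ascii=False the emoji is written as-is
# instead of as \uXXXX escape sequences.
data = {"reaction": '👀'}
with open('output.json', 'w', encoding='utf-8') as f:
    json.dump(data, f, ensure_ascii=False)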
If someone else gave you this JSON file and you want to change it to use Unicode characters directly in strings (as opposed to Unicode escapes as it is now), you can do something like this:
import json

with open('input.txt', 'r') as infile:
    with open('output.txt', 'w') as outfile:
        for line in infile:
            data = json.loads(line)
            json.dump(data, outfile, ensure_ascii=False)
            outfile.write('\n')
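Note that this assumes one JSON document per line (JSON Lines); if the file holds a single JSON document, read it with json.load and write it back with one json.dump call instead.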
Here is my code that simply reads a txt file into a list:
with open('test.txt', 'r') as f:
    account_list = f.readlines()
and here is the sample of test.txt
...
teosjis232:23123/2!#
fdios2313:43242///2323#
...
When I run this code, it raises a Unicode error:
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 1632: character maps to <undefined>
I think the problem is the \ characters in the txt file. Can anyone tell me how to read a txt file that contains a lot of \?
Try this, using utf-8 encoding:
with open('test.txt', 'r', encoding='utf-8') as f:
    account_list = f.readlines()
Problem solved:
with open('test.txt', 'r', encoding='unicode_escape') as f:
    account_list = f.readlines()
The encoding type unicode_escape works for me.
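To see why unicode_escape helps when a file contains literal \uXXXX sequences, here is a small sketch (the file name is hypothetical):
# Write the nine literal characters c a f \ u 0 0 e 9 to a file.
with open('escaped.txt', 'w', encoding='ascii') as f:
    f.write('caf\\u00e9')

# unicode_escape interprets the backslash escape while decoding, so the
# six characters \u00e9 come back as the single character é.
with open('escaped.txt', 'r', encoding='unicode_escape') as f:
    print(f.read())  # café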
You can use pathlib:
import pathlib

data = pathlib.Path('test.txt').read_text(encoding='utf-8')
Could you tell me where I'm going wrong with my current way of thinking? This is my function:
def replace_line(file_name, line_num, text):
    lines = open(f"realfolder/files/{item}.html", "r").readlines()
    lines[line_num] = text
    out = open(file_name, 'w')
    out.writelines(lines)
    out.close()
This is an example of it being called:
replace_line(f'./files/{item}.html', 9, f'text {item} wordswordswords' + '\n')
I need to encode the text input as utf-8. I'm not sure why I haven't been able to do this already. I also need to retain the f-string value.
I've been doing things like adding:
str.encode(text)
# or
text.encode(encoding='utf-8')
to the top of my replace_line function. This hasn't worked. I have tried dozens of different methods, but each leaves me with this error:
UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 2982: character maps to <undefined>
You need to set the encoding to utf-8 both when opening the file to read from:
lines = open(f"realfolder/files/{item}.html", "r", encoding="utf-8").readlines()
and when opening the file to write to:
out = open(file_name, 'w', encoding="utf-8")
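Putting both together, a sketch of the corrected function (this reads from the file_name parameter rather than the hard-coded {item} path, assuming that mismatch was unintended):
def replace_line(file_name, line_num, text):
    # Read with an explicit encoding so bytes outside the platform's
    # default codepage don't raise UnicodeDecodeError.
    with open(file_name, 'r', encoding='utf-8') as infile:
        lines = infile.readlines()
    lines[line_num] = text
    # Write back with the same encoding.
    with open(file_name, 'w', encoding='utf-8') as outfile:
        outfile.writelines(lines)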
I'm trying to write 32 bytes of binary data to a file, but an extra byte is being added.
wb mode doesn't seem to accept a newline argument, so I'm not sure what to do here.
str_ = b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
with open('test.bin', 'wb') as f:
    f.write(str_)
You need to view the file in a hex editor to be able to see the extra byte being added.
Hex view of the file from Vim: https://i.imgur.com/0VcjTCT.png
os.system('touch efuse.bin')
with open('efuse.bin', 'wb') as f:
    f.write(generateBinString())
Why are you writing newline=""? In binary mode no newline translation happens, so exactly the bytes you pass are written:
c = b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
with open('efuse.bin', 'wb') as f:
    f.write(c.rstrip())
with open('efuse.bin', 'rb') as f:
    print(f.read())
b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
The extra byte is usually a line feed (0x0A) or carriage return (0x0D); see Python file.write creating extra carriage return.
Hope it helps.
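To illustrate the usual culprit, a short sketch (hypothetical file names) showing that text mode can translate newlines while binary mode writes bytes verbatim:
payload = b'\x00' * 32

# Binary mode: the 32 bytes are written exactly as given.
with open('raw.bin', 'wb') as f:
    f.write(payload)
print(len(open('raw.bin', 'rb').read()))  # 32

# Text mode on Windows: every '\n' becomes '\r\n' on disk, which is
# how an "extra byte" typically sneaks into a file.
with open('text.txt', 'w') as f:
    f.write('line\n')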
I have the following string:
>>> line = '\x00\t\x007\x00\t\x00C\x00a\x00r\x00d\x00i\x00o\x00 \x00M\x00e\x00t\x00a\x00b\x00o\x00l\x00i\x00c\x00 \x00C\x00a\x00r\x00e\x00\t\x00\t\x00\t\x00\t\x00 \x001\x002\x00,\x007\x008\x008\x00,\x005\x002\x008\x00.\x000\x004\x00\r\x00\n'
When I type the variable line in the Python terminal, it shows the following:
>>> line
'\x00\t\x007\x00\t\x00C\x00a\x00r\x00d\x00i\x00o\x00 \x00M\x00e\x00t\x00a\x00b\x00o\x00l\x00i\x00c\x00 \x00C\x00a\x00r\x00e\x00\t\x00\t\x00\t\x00\t\x00 \x001\x002\x00,\x007\x008\x008\x00,\x005\x002\x008\x00.\x000\x004\x00\r\x00\n'
When I print it, it shows the following:
>>> print line
7 Cardio Metabolic Care 12,788,528.04
In the variable line, each field is separated by \t, and I want to save it to a CSV file. So I tried the following code:
import csv

with open('test.csv', 'wb') as csvfile:
    spamwriter = csv.writer(csvfile, delimiter=',')
    spamwriter.writerow(line.split('\t'))
When I look into the test.csv file, I get only the following:
,,,,,,
Is there any way to get the words into the CSV file? Kindly help.
Your input text is not corrupted; it's encoded, as UTF-16 (big-endian, in this case). And it's already CSV, just with tab as the delimiter.
You must decode it into a string; after that you can use it normally.
Ideally you declare the proper byte encoding when you read it from a source. For example, when you open a file you can state the encoding the file uses so that the file reader will decode the contents for you.
If you have that byte string from a source where you can't declare an encoding while reading it, you can decode manually:
line = '\x00\t\x007\x00\t\x00C\x00a\x00r\x00d\x00i\x00o\x00 \x00M\x00e\x00t\x00a\x00b\x00o\x00l\x00i\x00c\x00 \x00C\x00a\x00r\x00e\x00\t\x00\t\x00\t\x00\t\x00 \x001\x002\x00,\x007\x008\x008\x00,\x005\x002\x008\x00.\x000\x004\x00\r\x00\n'
decoded = line.decode('utf_16_be')
print decoded
# 7 Cardio Metabolic Care 12,788,528.04
But since I suppose that you are actually reading it from a file:
import csv
import codecs
with codecs.open('input.txt', 'r', encoding='utf16') as in_file, \
     codecs.open('output.csv', 'w', encoding='utf8') as out_file:
    reader = csv.reader(in_file, delimiter='\t')
    writer = csv.writer(out_file, delimiter=',', quotechar='"')
    writer.writerows(reader)
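On Python 3 the same conversion works with the built-in open; a sketch assuming the same file names (newline='' is what the csv module expects for files it reads and writes):
import csv

with open('input.txt', 'r', encoding='utf_16', newline='') as in_file, \
     open('output.csv', 'w', encoding='utf-8', newline='') as out_file:
    reader = csv.reader(in_file, delimiter='\t')
    writer = csv.writer(out_file, delimiter=',', quotechar='"')
    writer.writerows(reader)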
I have the following data container, which is constantly being updated:
data = []
for val, track_id in zip(values, list(track_ids)):
    # below
    if val < threshold:
        # structure data as dictionary
        pre_data = {"artist": sp.track(track_id)['artists'][0]['name'], "track": sp.track(track_id)['name'], "feature": filter_name, "value": val}
        data.append(pre_data)

# write to file
with open('db/json/' + user + '_' + product + '_' + filter_name + '.json', 'w') as f:
    json.dump(data, f, ensure_ascii=False, indent=4, sort_keys=True)
but I am getting a lot of errors like this:
json.dump(data, f, ensure_ascii=False, indent=4, sort_keys=True)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/__init__.py", line 190, in dump
    fp.write(chunk)
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019' in position 6: ordinal not in range(128)
Is there a way I can get rid of this encoding problem once and for all?
I was told that this would do it:
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
but many people do not recommend it.
I use Python 2.7.10.
Any clues?
When you write to a file that was opened in text mode, Python encodes the string for you. The default encoding is ascii, which generates the error you see; there are a lot of characters that can't be encoded to ASCII.
The solution is to open the file in a different encoding. In Python 2 you must use the codecs module, in Python 3 you can add the encoding= parameter directly to open. utf-8 is a popular choice since it can handle all of the Unicode characters, and for JSON specifically it's the standard; see https://en.wikipedia.org/wiki/JSON#Data_portability_issues.
import codecs

with codecs.open('db/json/' + user + '_' + product + '_' + filter_name + '.json', 'w', encoding='utf-8') as f:
    json.dump(data, f, ensure_ascii=False, indent=4, sort_keys=True)
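In Python 3 the codecs module isn't needed; the equivalent (with the path shortened here for illustration) is:
with open('output.json', 'w', encoding='utf-8') as f:
    json.dump(data, f, ensure_ascii=False, indent=4, sort_keys=True)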
Your object has unicode strings, and Python 2.x's support for unicode can be a bit spotty. First, let's make a short example that demonstrates the problem:
>>> obj = {"artist": u"Björk"}
>>> import json
>>> with open('deleteme', 'w') as f:
... json.dump(obj, f, ensure_ascii=False)
...
Traceback (most recent call last):
File "<stdin>", line 2, in <module>
File "/usr/lib/python2.7/json/__init__.py", line 190, in dump
fp.write(chunk)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf6' in position 3: ordinal not in range(128)
From the json.dump help text:
If ``ensure_ascii`` is true (the default), all non-ASCII characters in the
output are escaped with ``\uXXXX`` sequences, and the result is a ``str``
instance consisting of ASCII characters only. If ``ensure_ascii`` is
``False``, some chunks written to ``fp`` may be ``unicode`` instances.
This usually happens because the input contains unicode strings or the
``encoding`` parameter is used. Unless ``fp.write()`` explicitly
understands ``unicode`` (as in ``codecs.getwriter``) this is likely to
cause an error.
Ah! There is the solution. Either use the default ensure_ascii=True and get ASCII-escaped Unicode characters, or use the codecs module to open the file with the encoding you want. This works:
>>> import codecs
>>> with codecs.open('deleteme', 'w', encoding='utf-8') as f:
... json.dump(obj, f, ensure_ascii=False)
...
>>>
Why not encode the specific string instead? Try the .encode('utf-8') method on the string that is raising the exception.
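A sketch of that idea in Python 2, assuming the same obj as the example above (serialize first, then encode explicitly and write the bytes):
import json

obj = {"artist": u"Björk"}

# json.dumps with ensure_ascii=False returns a unicode string here;
# encoding it to UTF-8 ourselves avoids the implicit ASCII encode
# that fp.write() would otherwise attempt.
with open('deleteme', 'wb') as f:
    f.write(json.dumps(obj, ensure_ascii=False).encode('utf-8'))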