Here is my code, which simply reads a txt file into a list:
with open('test.txt', 'r') as f:
    account_list = f.readlines()
and here is a sample of test.txt:
...
teosjis232:23123/2!#
fdios2313:43242///2323#
...
When I run this code to read the txt file, it raises a Unicode error:
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 1632: character maps to <undefined>
I think the problem is the \ characters in the txt file. Can anyone tell me how to read a txt file that contains a lot of \?
Try this, using UTF-8 encoding:
with open('test.txt', 'r', encoding='utf-8') as f:
    account_list = f.readlines()
Problem solved.
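If the file turns out not to be completely valid UTF-8 after all, you can also tell the codec how to handle the bad bytes instead of raising an exception; a small optional fallback sketch:
# Keep reading even if some bytes are not valid UTF-8:
# errors='replace' substitutes U+FFFD, errors='ignore' drops them.
with open('test.txt', 'r', encoding='utf-8', errors='replace') as f:
    account_list = f.readlines()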
with open('test.txt', 'r', encoding='unicode_escape') as f:
    account_list = f.readlines()
The unicode_escape encoding works for me.
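For what it's worth, unicode_escape never fails on any byte: each byte is mapped as Latin-1, and backslash escape sequences in the text are interpreted, so it is worth checking that the decoded result is really what you expect. A tiny made-up illustration:
raw = b'caf\\xe9 \x9d'                      # a literal "\xe9" escape in the text, plus a raw 0x9d byte
print(repr(raw.decode('unicode_escape')))   # 'café \x9d': the escape is interpreted, the stray byte survives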
You can use pathlib; Path.read_text() takes an encoding argument and closes the file for you:
import pathlib
data = pathlib.Path('test.txt').read_text(encoding='utf-8')
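If you still want the same list of lines that readlines() gives (keeping the account_list name from the question), you can split the text afterwards:
account_list = data.splitlines(keepends=True)   # keepends=True keeps the trailing newlines, like readlines()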
Related
I want to use a BOM with UTF-8, but my code only saves files in plain UTF-8. What can I do? I'm rather new, so could you please write the answer directly as an addition to the sample code I shared?
import os
import codecs
a = 1
filelist = os.listdir("name")
for file in filelist:
    filelen = len(os.listdir("name/" + file))
    if filelen == 10:
        with open(file + ".iadx", "w", encoding="UTF-8") as f:
            f.write("<name>")
            f.write("\n")
            f.write('something')
From the Python documentation on codecs (search for "-sig"):
On encoding the utf-8-sig codec will write 0xef, 0xbb, 0xbf as the first three bytes to the file.
So just doing:
with open(file + ".iadx", "w", encoding="utf-8-sig") as f:
# ^^^^
will do the trick.
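If you want to double-check that the BOM really ended up in the file, you can read the first three bytes back; a quick sketch reusing the same file variable from the loop above:
with open(file + ".iadx", "rb") as check:   # binary mode, so the BOM is not decoded away
    print(check.read(3))                    # should print b'\xef\xbb\xbf'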
Could you tell me where I'm going wrong with my current way of thinking? This is my function:
def replace_line(file_name, line_num, text):
    lines = open(f"realfolder/files/{item}.html", "r").readlines()
    lines[line_num] = text
    out = open(file_name, 'w')
    out.writelines(lines)
    out.close()
This is an example of it being called:
replace_line(f'./files/{item}.html', 9, f'text {item} wordswordswords' + '\n')
I need to encode the text input as UTF-8. I'm not sure why I haven't been able to do this already. I also need to retain the f-string value.
I've been doing things like adding:
str.encode(text)
# or
text.encode(encoding='utf-8')
to the top of my replace_line function. This hasn't worked. I have tried dozens of different methods, but each one leaves me with this error:
UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 2982: character maps to <undefined>
You need to set the encoding to utf-8 for both opening the file to read from
lines = open(f"realfolder/files/{item}.html", "r", encoding="utf-8").readlines()
and opening the file to write to
out = open(file_name, 'w', encoding="utf-8")
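Putting both changes together, here is a minimal sketch of the corrected function (one assumption on my part: that you also want it to read from the file_name argument rather than the hard-coded realfolder path, and that using with to close the files automatically is fine):
def replace_line(file_name, line_num, text):
    # Read the whole file with an explicit encoding
    with open(file_name, "r", encoding="utf-8") as f:
        lines = f.readlines()
    lines[line_num] = text
    # Write it back out with the same encoding
    with open(file_name, "w", encoding="utf-8") as out:
        out.writelines(lines)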
I'm trying to read what is supposed to be a cp1252 file according to Sublime Text 3, and I'm getting a UnicodeEncodeError.
import codecs
with codecs.open(config_path, mode='rb', encoding='cp1252') as f:
    lines = f.readlines()
UnicodeEncodeError: 'charmap' codec can't encode character '\x92' in position 15: character maps to <undefined>
I can read the file if I change the encoding to latin-1, which is a bit weird... I'm fairly new to encoding/decoding, and if I open the file in Notepad++/ST3/Excel it is just an incomprehensible list of what looks like binary data to me.
with codecs.open(config_path, mode='r', encoding='latin-1') as f:
    lines = f.readlines()
    for l in lines:
        utf_line = l.encode("utf-8")
        print(utf_line)
b"\x00\x03'\xc2\x9a\x00\x03'\xc2\x9a\x00\x03&\xc3\xba\x00\x03'\xc3\x9a\x00\x03'?\x00\x03'\xc2\xbd\x00\x03't\x00\x03'\xc2\xb2\x00\x03'\xc3\xac\x00\x03'\xc3\x9b\x00\x03'1\x00\x03'\xc2\x98\x00\x03'M\x00\x03'o\x00\x03'\xc3\x8b\x00\x03'\xc2\xbf\x00\x03'd\x00\x03'\xc2\xbf\x00\x03'\xc3\xb0\x00\x03'1\x00\x03'\xc2\x9f\x00\x03'\xc2\x9f\x00\x03'V\x00\x03'\xc2\xa0\x00\x03'G\x00\x03'\x15\x00\x03'u\x00\x03'\xc2\xae\x00\x03'`\x00\x03'|\x00\x03'\x17\x00\x03'Q\x00\x03'8\x00\x03'\xc2\x94\x00\x03':\x00\x03'4\x00\x03'P\x00\x03'\xc2\x9d\x00\x03'\xc2\x9f\x00\x03''\x00\x03'\xc3\x92\x00\x03't\x00\x03'\xc3\xb3\x00\x03'l\x00\x03'c\x00\x03'2\x00\x03'i\x00\x03'C\x00\x03'=\x00\x03'\x0f\x00\x03'\xc3\x89\x00\x03'\xc3\x8a\x00\x03'\xc2\xb7\x00\x03'`\x00\x03'T\x00\x03'\xc2\x90\x00\x03'\xc3\x9b\x00\x03'\xc2\x90\x00\x03'y\x00\x03'?\x00\x03'\xc2\x92\x00\x03'\xc3\xad\x00\x03'g\x00\x03'\xc2\x84\x00\x03'#\x00\x03'\xc2\xa9\x00\x03'q\x00\x03'L\x00\x03'\xc2\xae\x00\x03'
Here is the file
As suggested, I've tried to use chardet as follows:
import chardet
with open(config_path, mode='rb') as f:
    lines = f.read()
    encoding = chardet.detect(lines)
    print(encoding)
{'encoding': None, 'confidence': 0.0, 'language': None}
If I test each line individually, I get a bunch of different encodings: cp1252, cp1253, ascii...
Thank you
I have a simple (but extremely hard) question.
I'm looking for a way to convert a text file that contains emoji escape codes like \ud83d\udc40 into one that contains the actual emoji symbol 👀.
E.g.:
with open(OUTPUT, "r+") as infileInsight:
insightData = infileInsight.read()\
.replace('\ud83d\udc40','👀')\
......
with open(OUTPUT, "w+") as outfileInsight:
outfileInsight.write(insightData)
Regarding the suggestion that this is a duplicate: if I do it this way:
with open(OUTPUT, "r+") as infileInsight:
insightData = infileInsight.read()\
.replace('\ud83d\udc40','👀')\
......
with open(OUTPUT, "w+") as outfileInsight:
outfileInsight.write(insightData.decode('unicode-escape'))
I get this error:
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2600' in position 30: ordinal not in range(128)
You just need the ensure_ascii=False option in json.dump.
If you're creating this file in the first place, just pass that option.
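For illustration, a tiny made-up example of what that option changes (the {"eyes": "👀"} data is just a stand-in):
import json

print(json.dumps({"eyes": "👀"}))                      # {"eyes": "\ud83d\udc40"}  (escaped by default)
print(json.dumps({"eyes": "👀"}, ensure_ascii=False))  # {"eyes": "👀"}            (written as-is)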
If someone else gave you this JSON file and you want to change it to use Unicode characters directly in strings (as opposed to Unicode escapes as it is now), you can do something like this:
import json
with open('input.txt', 'r') as infile:
    with open('output.txt', 'w') as outfile:
        for line in infile:
            data = json.loads(line)
            json.dump(data, outfile, ensure_ascii=False)
            outfile.write('\n')
I have the following string:
>>> line = '\x00\t\x007\x00\t\x00C\x00a\x00r\x00d\x00i\x00o\x00 \x00M\x00e\x00t\x00a\x00b\x00o\x00l\x00i\x00c\x00 \x00C\x00a\x00r\x00e\x00\t\x00\t\x00\t\x00\t\x00 \x001\x002\x00,\x007\x008\x008\x00,\x005\x002\x008\x00.\x000\x004\x00\r\x00\n'
When I type the variable line in the Python terminal, it shows the following:
>>> line
'\x00\t\x007\x00\t\x00C\x00a\x00r\x00d\x00i\x00o\x00 \x00M\x00e\x00t\x00a\x00b\x00o\x00l\x00i\x00c\x00 \x00C\x00a\x00r\x00e\x00\t\x00\t\x00\t\x00\t\x00 \x001\x002\x00,\x007\x008\x008\x00,\x005\x002\x008\x00.\x000\x004\x00\r\x00\n'
When I print it, it shows the following:
>>> print line
7 Cardio Metabolic Care 12,788,528.04
In the variable line each word is separated by \t, and I want to save it to a csv file. So I tried the following code:
import csv
with open('test.csv', 'wb') as csvfile:
    spamwriter = csv.writer(csvfile, delimiter=',')
    spamwriter.writerow(line.split('\t'))
When I look into the test.csv file, I get only the following:
,,,,,,
Is there any way to get the words into the csv file? Kindly help.
Your input text is not corrupted; it's encoded, as UTF-16 (big-endian in this case). And it's already CSV, just with tab as the delimiter.
You must decode it into a string, after that you can use it normally.
Ideally you declare the proper byte encoding when you read it from a source. For example, when you open a file you can state the encoding the file uses so that the file reader will decode the contents for you.
If you have that byte string from a source where you can't declare an encoding while reading it, you can decode manually:
line = '\x00\t\x007\x00\t\x00C\x00a\x00r\x00d\x00i\x00o\x00 \x00M\x00e\x00t\x00a\x00b\x00o\x00l\x00i\x00c\x00 \x00C\x00a\x00r\x00e\x00\t\x00\t\x00\t\x00\t\x00 \x001\x002\x00,\x007\x008\x008\x00,\x005\x002\x008\x00.\x000\x004\x00\r\x00\n'
decoded = line.decode('utf_16_be')
print decoded
# 7 Cardio Metabolic Care 12,788,528.04
But since I suppose that you are actually reading it from a file:
import csv
import codecs
with codecs.open('input.txt', 'r', encoding='utf16') as in_file, codecs.open('output.csv', 'w', encoding='utf8') as out_file:
    reader = csv.reader(in_file, delimiter='\t')
    writer = csv.writer(out_file, delimiter=',', quotechar='"')
    writer.writerows(reader)
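If you're on Python 3, a rough equivalent sketch (assuming the same file names and that the input really is tab-delimited UTF-16 as above): the built-in open() does the decoding for you, and newline='' is what the csv documentation recommends for files passed to csv readers and writers.
import csv

# Use 'utf_16_be' instead of 'utf-16' if the file has no BOM, as the raw string above suggests.
with open('input.txt', 'r', encoding='utf-16', newline='') as in_file, \
        open('output.csv', 'w', encoding='utf-8', newline='') as out_file:
    reader = csv.reader(in_file, delimiter='\t')
    writer = csv.writer(out_file, delimiter=',', quotechar='"')
    writer.writerows(reader)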