Python reading a cp1252 file


I'm trying to read a file that, according to Sublime Text 3, is supposed to be cp1252, and I'm getting a UnicodeEncodeError:
with codecs.open(config_path, mode='rb', encoding='cp1252') as f:
    lines = f.readlines()
UnicodeEncodeError: 'charmap' codec can't encode character '\x92' in position 15: character maps to <undefined>
I can read the file if I change the encoding to latin-1, which is a bit weird. I'm fairly new to encoding and decoding, and if I open the file in Notepad++, ST3, or Excel it is just an incomprehensible list of what looks like binary data to me.
with codecs.open(config_path, mode='r', encoding='latin-1') as f:
    lines = f.readlines()
for l in lines:
    utf_line = l.encode("utf-8")
    print(utf_line)
b"\x00\x03'\xc2\x9a\x00\x03'\xc2\x9a\x00\x03&\xc3\xba\x00\x03'\xc3\x9a\x00\x03'?\x00\x03'\xc2\xbd\x00\x03't\x00\x03'\xc2\xb2\x00\x03'\xc3\xac\x00\x03'\xc3\x9b\x00\x03'1\x00\x03'\xc2\x98\x00\x03'M\x00\x03'o\x00\x03'\xc3\x8b\x00\x03'\xc2\xbf\x00\x03'd\x00\x03'\xc2\xbf\x00\x03'\xc3\xb0\x00\x03'1\x00\x03'\xc2\x9f\x00\x03'\xc2\x9f\x00\x03'V\x00\x03'\xc2\xa0\x00\x03'G\x00\x03'\x15\x00\x03'u\x00\x03'\xc2\xae\x00\x03'`\x00\x03'|\x00\x03'\x17\x00\x03'Q\x00\x03'8\x00\x03'\xc2\x94\x00\x03':\x00\x03'4\x00\x03'P\x00\x03'\xc2\x9d\x00\x03'\xc2\x9f\x00\x03''\x00\x03'\xc3\x92\x00\x03't\x00\x03'\xc3\xb3\x00\x03'l\x00\x03'c\x00\x03'2\x00\x03'i\x00\x03'C\x00\x03'=\x00\x03'\x0f\x00\x03'\xc3\x89\x00\x03'\xc3\x8a\x00\x03'\xc2\xb7\x00\x03'`\x00\x03'T\x00\x03'\xc2\x90\x00\x03'\xc3\x9b\x00\x03'\xc2\x90\x00\x03'y\x00\x03'?\x00\x03'\xc2\x92\x00\x03'\xc3\xad\x00\x03'g\x00\x03'\xc2\x84\x00\x03'#\x00\x03'\xc2\xa9\x00\x03'q\x00\x03'L\x00\x03'\xc2\xae\x00\x03'
Here is the file
As suggested, I've tried to use chardet as follows:
with open(config_path, mode='rb') as f:
    lines = f.read()
encoding = chardet.detect(lines)
print(encoding)
{'encoding': None, 'confidence': 0.0, 'language': None}
If I test each line separately, I get a mix of encodings: cp1252, cp1253, ascii...
Thank you
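
A note on why latin-1 "works" here while cp1252 does not, as a minimal sketch rather than a definitive diagnosis. Python's cp1252 codec leaves five byte values unassigned, whereas latin-1 maps all 256 byte values, so latin-1 will decode any file at all, including binary data. The repeating \x00\x03' pattern in the printed output suggests the file is not text, which would also be consistent with chardet returning None. The sketch assumes (a guess from the printed sample only) that the file holds 4-byte big-endian unsigned integers; sample is the first 12 raw bytes recovered by undoing the latin-1 decode / utf-8 encode round trip shown above.

```python
import struct

# Collect the byte values that Python's cp1252 codec cannot decode.
undefined_in_cp1252 = []
for value in range(256):
    try:
        bytes([value]).decode("cp1252")
    except UnicodeDecodeError:
        undefined_in_cp1252.append(value)

# latin-1 maps every byte to a code point, so it never fails,
# even when the input is not text.
all_bytes = bytes(range(256))
round_trip_ok = all_bytes.decode("latin-1").encode("latin-1") == all_bytes

# First 12 raw bytes of the file, recovered from the printed output.
sample = b"\x00\x03'\x9a\x00\x03'\x9a\x00\x03&\xfa"

# ASSUMPTION: records are 4-byte big-endian unsigned integers (">I").
values = [v for (v,) in struct.iter_unpack(">I", sample)]

print([hex(b) for b in undefined_in_cp1252])
print(round_trip_ok)
print(values)
```

If the data really is binary, opening the file with open(config_path, 'rb') and parsing the bytes avoids text decoding altogether; the actual record layout has to come from whatever produced the file.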

Related

Python can't read txt file that contain "\"

Here is my code, which simply reads a txt file into a list:
with open('test.txt', 'r') as f:
    account_list = f.readlines()
# the with block closes the file, so no f.close() is needed
and here is the sample of test.txt
...
teosjis232:23123/2!#
fdios2313:43242///2323#
...
When I run this code to read the txt file, it raises a UnicodeDecodeError:
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 1632: character maps to <undefined>
I think the problem is the \ characters in the txt file. Can anyone tell me how to read a txt file that contains a lot of \?

Try this, using UTF-8 encoding:
with open('test.txt', 'r', encoding='utf-8') as f:
    account_list = f.readlines()
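
The backslashes are probably not the culprit. The traceback says the 'charmap' codec failed on byte 0x9d, which usually means the file was decoded with a Windows default such as cp1252, where 0x9d is unassigned. A minimal sketch, assuming the file is really UTF-8 and contains a character such as U+201D (its UTF-8 bytes end in 0x9d); the sample text and the temporary file are made up for illustration:

```python
import os
import tempfile

# Hypothetical content: a backslash plus a right double quotation mark.
text = "user\\name:pass\u201d\n"
raw = text.encode("utf-8")  # U+201D becomes the bytes e2 80 9d

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "test.txt")
    with open(path, "wb") as f:
        f.write(raw)

    # Decoding as cp1252 fails on 0x9d, not on the backslash.
    try:
        with open(path, "r", encoding="cp1252") as f:
            f.read()
        cp1252_failed = False
    except UnicodeDecodeError:
        cp1252_failed = True

    # Decoding with the file's real encoding keeps the backslash intact.
    with open(path, "r", encoding="utf-8") as f:
        account_list = f.readlines()

print(cp1252_failed)
print(account_list)
```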

Problem solved.
with open('test.txt', 'r', encoding='unicode_escape') as f:
    account_list = f.readlines()
The unicode_escape encoding works for me.
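
A caveat worth checking before relying on unicode_escape, shown as a small sketch with made-up sample values: the codec never raises on arbitrary bytes because it treats them as Latin-1, and it also interprets backslash sequences found in the data. Both effects can silently change the text that was read.

```python
# Backslash sequences in the data are interpreted rather than kept.
raw_path = b"C:\\new"  # the six bytes C : \ n e w
decoded_path = raw_path.decode("unicode_escape")  # "\n" turns into a newline

# Non-ASCII UTF-8 bytes are decoded one byte at a time as Latin-1.
raw_word = "caf\u00e9".encode("utf-8")
decoded_word = raw_word.decode("unicode_escape")

print(repr(decoded_path))
print(repr(decoded_word))
```

If the file is ordinary text, passing its real encoding (for example utf-8) to open() is usually the safer choice.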

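You can use pathlib.
import pathlib
# Path objects are not context managers; call read_text() directly.
# encoding='utf-8' assumes the file really is UTF-8.
data = pathlib.Path('test.txt').read_text(encoding='utf-8')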

Encoding and decoding with utf-8 returns UnicodeError

I am both encoding and decoding with utf-8, but I still get a UnicodeDecodeError.
import pandas as pd
df.to_csv('myfile.csv', index=False, encoding='utf-8')
Then, in another .py file in the same project:
import pandas as pd
with open(file, 'r') as f:
    csv = pd.read_csv(f, encoding='utf-8')
The error is:
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 51956: character maps to <undefined>
This is not the first time I've run into this issue.

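Ok, found it. Makes a lot of sense now: the encoding has to be passed to open(), because the file object decodes the bytes before pandas ever sees them.
with open(file, 'r', encoding='utf-8') as f:
    csv = pd.read_csv(f)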
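
The same point can be checked without pandas. The sketch below uses the standard csv module as a stand-in consumer (an assumption made only to keep the example dependency-free): a text-mode file object carries the encoding chosen at open() time and hands already-decoded strings to whatever reads from it. The file name and row contents are invented for the demo.

```python
import csv
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "myfile.csv")

    # Write a small CSV as UTF-8; U+201D encodes to bytes ending in 0x9d.
    with open(path, "w", encoding="utf-8", newline="") as f:
        csv.writer(f).writerows([["name"], ["caf\u00e9 \u201cquoted\u201d"]])

    # The handle decodes with the encoding given here, not with any
    # encoding argument passed later to the consumer.
    with open(path, "r", encoding="utf-8", newline="") as f:
        handle_encoding = f.encoding
        rows = list(csv.reader(f))

print(handle_encoding)
print(rows)
```

With pandas, an alternative is to skip open() and pass the path itself, as in pd.read_csv(file, encoding='utf-8'), so that pandas opens the file with that encoding.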

UnicodeDecodeError - Add encoding to custom function input

Could you tell me where I'm going wrong with my current way of thinking? This is my function:
def replace_line(file_name, line_num, text):
    lines = open(f"realfolder/files/{item}.html", "r").readlines()
    lines[line_num] = text
    out = open(file_name, 'w')
    out.writelines(lines)
    out.close()
This is an example of it being called:
replace_line(f'./files/{item}.html', 9, f'text {item} wordswordswords' + '\n')
I need to encode the text input as utf-8. I'm not sure why I haven't been able to do this already. I also need to retain the fstring value.
I've been doing things like adding:
str.encode(text)
#or
text.encode(encoding = 'utf-8')
to the top of my replace_line function. This hasn't worked. I have tried dozens of different methods, but each one leaves me with this error:
UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 2982: character maps to <undefined>

You need to set the encoding to utf-8 for both opening the file to read from
lines = open(f"realfolder/files/{item}.html", "r", encoding="utf-8").readlines()
and opening the file to write to
out = open(file_name, 'w', encoding="utf-8")
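
Put together, the function might look like the sketch below. Note that the error is raised while reading the file, so calling text.encode(...) on the new line cannot help: it returns a new bytes object and does not change how open() decodes. As a simplifying assumption, this version reads and writes the same file_name (the original reads from a path built from a global item), and the encoding parameter and the demo file are additions for illustration.

```python
import os
import tempfile


def replace_line(file_name, line_num, text, encoding="utf-8"):
    """Replace one line of file_name in place, with an explicit encoding."""
    with open(file_name, "r", encoding=encoding) as src:
        lines = src.readlines()
    lines[line_num] = text
    with open(file_name, "w", encoding=encoding) as out:
        out.writelines(lines)


# Demo on a throwaway file containing non-ASCII characters.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.html")
    with open(demo_path, "w", encoding="utf-8") as f:
        f.write("<p>\u201cfirst\u201d</p>\n<p>second</p>\n")

    item = "demo"  # hypothetical value; the f-string is evaluated before the call
    replace_line(demo_path, 1, f"text {item} wordswordswords" + "\n")

    with open(demo_path, "r", encoding="utf-8") as f:
        result = f.read()

print(result)
```

The f-string needs no special handling: it is already an ordinary str by the time replace_line receives it.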

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc1 in position 0: invalid start byte

This is my code.
stock_code = open('/home/ubuntu/trading/456.csv', 'r')
csvReader = csv.reader(stock_code)
for st in csvReader:
    eventcode = st[1]
    print(eventcode)
I want to read the contents of a CSV file exported from Excel.
But I get a UnicodeDecodeError.
How can I fix it?

The CSV docs say,
Since open() is used to open a CSV file for reading, the file will by default be decoded into unicode using the system default encoding...
The error message shows that your system is expecting the file to be using UTF-8 encoding.
Solutions:
Make sure the file is using the correct encoding.
For example, open the file using Notepad++, select Encoding from the menu,
choose Convert to UTF-8, and then save the file.
Alternatively, specify the encoding of the file when calling open(), like this
import csv

my_encoding = 'UTF-8'  # or whatever the file's actual encoding is
with open('/home/ubuntu/trading/456.csv', 'r', encoding=my_encoding) as stock_code:
    csvReader = csv.reader(stock_code)
    for st in csvReader:
        eventcode = st[1]
        print(eventcode)
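
When the real encoding is unknown, one option is to try a short list of candidates in order. The sketch below is only an illustration: read_csv_rows is a hypothetical helper, the candidate list is an assumption, and the demo file is written as cp1252 so that its first byte is 0xc1, matching the error in the question. A fallback encoding that happens to decode without errors can still produce wrong characters, so the result should be checked by eye.

```python
import csv
import os
import tempfile


def read_csv_rows(path, encodings):
    """Return (rows, encoding) for the first candidate that decodes the file."""
    for enc in encodings:
        try:
            with open(path, "r", encoding=enc, newline="") as f:
                return list(csv.reader(f)), enc
        except UnicodeDecodeError:
            continue  # this candidate cannot decode the file; try the next
    raise ValueError(f"none of {encodings} could decode {path}")


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "456.csv")
    # U+00C1 is the single byte 0xc1 in cp1252, an invalid start byte in UTF-8.
    with open(path, "wb") as f:
        f.write("\u00c1ngel,ABC\r\n".encode("cp1252"))

    rows, used = read_csv_rows(path, ["utf-8", "cp1252"])

print(used)
print(rows)
```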

Python Error; UnicodeEncodeError: 'ascii' codec can't encode character u'\u2026'

I am trying to extract some data from a JSON file which contains tweets and write it to a csv. The file contains all kinds of characters, and I'm guessing this is why I get this error message:
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2026'
I guess I have to convert the output to utf-8 before writing the csv file, but I have not been able to do that. I have found similar questions here on Stack Overflow, but I've not been able to adapt the solutions to my problem. (I should add that I am not really familiar with Python. I'm a social scientist, not a programmer.)
import csv
import json
fieldnames = ['id', 'text']
with open('MY_SOURCE_FILE', 'r') as f, open('MY_OUTPUT', 'a') as out:
    writer = csv.DictWriter(
        out, fieldnames=fieldnames, delimiter=',', quoting=csv.QUOTE_ALL)
    for line in f:
        tweet = json.loads(line)
        user = tweet['user']
        output = {
            'text': tweet['text'],
            'id': tweet['id'],
        }
        writer.writerow(output)

You just need to encode the text to utf-8:
for line in f:
    tweet = json.loads(line)
    user = tweet['user']
    output = {
        'text': tweet['text'].encode("utf-8"),
        'id': tweet['id'],
    }
    writer.writerow(output)
The csv module does not support writing unicode in Python 2:
Note: This version of the csv module doesn't support Unicode input. Also, there are currently some issues regarding ASCII NUL characters. Accordingly, all input should be UTF-8 or printable ASCII to be safe; see the examples in section Examples.
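
For readers on Python 3, where the csv module works with str directly, a rough equivalent is sketched below. Here the encoding is set on the output file instead of calling .encode() on each field. The sample tweet and the temporary output path are invented for the demo; fieldnames and the DictWriter options follow the question.

```python
import csv
import json
import os
import tempfile

fieldnames = ["id", "text"]

# Hypothetical one-line JSON record standing in for a line of the source file.
line = json.dumps({"id": 1, "text": "to be continued\u2026", "user": {"name": "x"}})

with tempfile.TemporaryDirectory() as tmp:
    out_path = os.path.join(tmp, "out.csv")

    # The file object does the encoding; newline="" lets csv control line endings.
    with open(out_path, "a", encoding="utf-8", newline="") as out:
        writer = csv.DictWriter(
            out, fieldnames=fieldnames, delimiter=",", quoting=csv.QUOTE_ALL)
        tweet = json.loads(line)
        writer.writerow({"id": tweet["id"], "text": tweet["text"]})

    with open(out_path, "rb") as f:
        written = f.read()

print(written)
```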
