The code below was used in Python 2 to combine all the .txt files in a folder, and it worked fine:
import os

base_folder = "C:\\FDD\\"
all_files = []
for each in os.listdir(base_folder):
    if each.endswith('.txt'):
        kk = os.path.join(base_folder, each)
        all_files.append(kk)

with open(base_folder + "Combined.txt", 'w') as outfile:
    for fname in all_files:
        with open(fname) as infile:
            for line in infile:
                outfile.write(line)
In Python 3, it gives an error:
Traceback (most recent call last):
  File "C:\Scripts\thescript.py", line 26, in <module>
    for line in infile:
  File "C:\Users\User\AppData\Local\Programs\Python\Python37-32\lib\codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'CP_UTF8' codec can't decode byte 0xe4 in position 53: No mapping for the Unicode character exists in the target code page.
I made this change:
with open(fname) as infile:
to
with open(fname, 'r', encoding = 'latin-1') as infile:
It gives me “MemoryError”.
How can I correct this error in Python 3? Thank you.
As @transilvlad suggested here, use the open method from the codecs module to read in the file:
import codecs
with codecs.open(fname, 'r', encoding='utf-8',
                 errors='ignore') as infile:
This strips out (ignores) the undecodable characters and returns the string without them.
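In Python 3 the built-in open accepts the same errors argument, so codecs.open is optional. A minimal sketch of the ignore behaviour, using a throwaway file seeded with the same bad byte (0xe4) as in the traceback (file name invented for illustration):

```python
import os
import tempfile

# Create a throwaway file containing a byte that is invalid as UTF-8
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "wb") as f:
    f.write(b"caf\xe4 latte\n")  # 0xe4 is the byte from the traceback

# errors='ignore' silently drops anything that cannot be decoded
with open(path, "r", encoding="utf-8", errors="ignore") as f:
    text = f.read()

print(repr(text))  # 'caf latte\n' -- the 0xe4 byte is gone
```

Note that errors='replace' would instead substitute U+FFFD, which keeps visible evidence that something was lost.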
I get this error
Traceback (most recent call last):
  File "C:\Users\Anthony\PycharmProjects\ReadFile\main.py", line 14, in <module>
    masterFile.write("Line {}: {}\n".format(index, line.strip()))
  File "C:\Users\Anthony\AppData\Local\Programs\Python\Python39\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode characters in position 8-13: character maps to <undefined>
The program is supposed to search all the .txt files in a directory for a specific word, print each matching line to one file, and also write a copy of every line with line numbers to a master file. There are about 100 .txt files, and it works on the first 3 before I get this error message. All the files are UTF-8 encoded. I tried changing the open call to
with open(file, encoding="utf-8") as f:
but it didn't work.
import glob

searchWord = "Hello"
dataFile = open("C:/Users/Anthony/Documents/TextDataFolder/TextData.txt", 'w')
masterFile = open("C:/Users/Anthony/Documents/TextDataFolder/masterFile.txt", 'w')
files = glob.iglob("C:/Users/Anthony/Documents/Texts/*.txt", recursive=True)

for file in files:
    with open(file) as f:
        print(file)
        for index, line in enumerate(f):
            # print("Line {}: {}".format(index, line.strip()))
            masterFile.write("Line {}: {}\n".format(index, line.strip()))
            if searchWord in line:
                print("Line {}: {}".format(index, line.strip()))
                dataFile.write("Line {}: {}\n".format(index, line.strip()))
I eventually figured it out... I feel like an idiot. The problem wasn't my reading of the files; it was that my writes weren't encoded. I had only applied an encoding to my reads. The final version looks like this:
import glob

searchWord = "Hello"
dataFile = open("C:/Users/Anthony/Documents/TextDataFolder/TextData.txt", 'w', encoding="utf-8")
masterFile = open("C:/Users/Anthony/Documents/TextDataFolder/masterFile.txt", 'w', encoding="utf-8")
files = glob.iglob("C:/Users/Anthony/Documents/Texts/*.txt", recursive=True)

for file in files:
    with open(file, "r", encoding="utf-8") as f:
        print(file)
        for index, line in enumerate(f):
            # print("Line {}: {}".format(index, line.strip()))
            masterFile.write("Line {}: {}\n".format(index, line.strip()))
            if searchWord in line:
                print("Line {}: {}".format(index, line.strip()))
                dataFile.write("Line {}: {}\n".format(index, line.strip()))
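The traceback above comes from cp1252 being the default text encoding on Windows, and the failure can be reproduced in isolation by forcing that codec. A minimal sketch (the file name is invented for illustration):

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "out.txt")
text = "Line 0: caf\u00e9 \u2192"  # U+2192 (arrow) has no mapping in cp1252

# Forcing the legacy Windows code page reproduces the UnicodeEncodeError
try:
    with open(path, "w", encoding="cp1252") as f:
        f.write(text)
    failed = False
except UnicodeEncodeError:
    failed = True

# An explicit UTF-8 writer accepts any Unicode text
with open(path, "w", encoding="utf-8") as f:
    f.write(text)

print(failed)  # True
```

This is why the fix above had to add encoding="utf-8" to the output files, not just the input files.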
I run Python files over the web (via PHP).
Afterwards, an error occurs while writing a Korean string to a file with Python.
On the other hand, running the Python files directly from the terminal does not cause errors.
Do you know what the problem is?
Please help me.
Traceback (most recent call last):
  File "makeApp.py", line 171, in <module>
    modify_app_info(app_name)
  File "makeApp.py", line 65, in modify_app_info
    f.write(line+"\n")
UnicodeEncodeError: 'utf-8' codec can't encode characters in position 13-30: surrogates not allowed
Below is the code that causes the problem.
lines = read_file(read_file_path)
f = open(read_file_path, 'r', encoding='UTF-8')
lines = f.readlines()
f.close()

f = open(write_file_path, 'w', encoding='UTF-8')
for line in lines:
    if '"name": "userInputAppName"' in line:
        line = '    "name": "' + app_name + '",'
        continue
    f.write(line+"\n")
    # f.write(line)
f.close()
Remove the encoding parameter, because you open the file in an encoded mode, so you can't join arbitrary substrings onto the string.
So your code will be:
# ...
lines = read_file(read_file_path)
f = open(read_file_path, 'r')
lines = f.readlines()
f.close()

f = open(write_file_path, 'w')
for line in lines:
    if '"name": "userInputAppName"' in line:
        line = '    "name": "' + app_name + '",'
        continue
    f.write(line+"\n")
    # f.write(line)
f.close()
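For background, "surrogates not allowed" means the string contains lone surrogate code points, typically produced upstream by decoding with errors='surrogateescape' (Python's os and filesystem APIs do this for undecodable bytes). A different approach from the answer above, offered here only as a hedged sketch, is to encode with that same handler so the original bytes round-trip:

```python
raw = b"caf\xe9"  # 0xe9 is not valid UTF-8 on its own

# Decoding with surrogateescape smuggles the bad byte in as a lone surrogate
s = raw.decode("utf-8", errors="surrogateescape")
print(repr(s))  # 'caf\udce9'

# A plain encode now fails with "surrogates not allowed":
try:
    s.encode("utf-8")
except UnicodeEncodeError as e:
    print(e.reason)  # 'surrogates not allowed'

# Encoding with the same handler restores the original bytes exactly
assert s.encode("utf-8", errors="surrogateescape") == raw
```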
I've read every post I can find, but my situation seems unique. I'm totally new to Python so this could be basic. I'm getting the following error:
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8d in position 70: character maps to <undefined>
When I run the code:
import csv

input_file = 'input.csv'
output_file = 'output.csv'
cols_to_remove = [4, 6, 8, 9, 10, 11, 13, 14, 19, 20, 21, 22, 23, 24]
cols_to_remove = sorted(cols_to_remove, reverse=True)
row_count = 0  # Current amount of rows processed

with open(input_file, "r") as source:
    reader = csv.reader(source)
    with open(output_file, "w", newline='') as result:
        writer = csv.writer(result)
        for row in reader:
            row_count += 1
            print('\r{0}'.format(row_count), end='')
            for col_index in cols_to_remove:
                del row[col_index]
            writer.writerow(row)
What am I doing wrong?
In Python 3, the csv module processes the file as Unicode strings, and because of that it has to decode the input file first. You can use the exact encoding if you know it, or just use Latin1, because it maps every byte to the Unicode character with the same code point, so that decoding + encoding keeps the byte values unchanged. Your code could become:
...
with open(input_file, "r", encoding='Latin1') as source:
    reader = csv.reader(source)
    with open(output_file, "w", newline='', encoding='Latin1') as result:
        ...
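The lossless round trip that the Latin1 trick relies on can be checked directly:

```python
data = bytes(range(256))  # every possible byte value

# Latin-1 maps byte N to code point N, so decode followed by encode is lossless
text = data.decode("latin-1")
assert [ord(c) for c in text] == list(range(256))
assert text.encode("latin-1") == data
print("round trip OK")
```

The caveat is that non-Latin-1 text (e.g. UTF-8 input) will pass through byte-correct but display as mojibake if you inspect the strings in between.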
Add encoding="utf8" when opening the files. Try this instead:
with open(input_file, "r", encoding="utf8") as source:
    reader = csv.reader(source)
    with open(output_file, "w", newline='', encoding="utf8") as result:
Try pandas, which decodes the file for you and can drop the columns in one step:
import pandas
df = pandas.read_csv('input.csv', encoding='latin-1')
df.drop(df.columns[cols_to_remove], axis=1).to_csv('output.csv', index=False)
Try saving the file again as CSV UTF-8
I am trying to decode a string I read from a file:
file = open ("./Downloads/lamp-post.csv", 'r')
data = file.readlines()
data[0]
'\xff\xfeK\x00e\x00y\x00w\x00o\x00r\x00d\x00\t\x00C\x00o\x00m\x00p\x00e\x00t\x00i\x00t\x00i\x00o\x00n\x00\t\x00G\x00l\x00o\x00b\x00a\x00l\x00
\x00M\x00o\x00n\x00t\x00h\x00l\x00y\x00
\x00S\x00e\x00a\x00r\x00c\x00h\x00e\x00s\x00\t\x00D\x00e\x00c\x00
\x002\x000\x001\x000\x00\t\x00N\x00o\x00v\x00
\x002\x000\x001\x000\x00\t\x00O\x00c\x00t\x00
\x002\x000\x001\x000\x00\t\x00S\x00e\x00p\x00
\x002\x000\x001\x000\x00\t\x00A\x00u\x00g\x00
\x002\x000\x001\x000\x00\t\x00J\x00u\x00l\x00
\x002\x000\x001\x000\x00\t\x00J\x00u\x00n\x00
\x002\x000\x001\x000\x00\t\x00M\x00a\x00y\x00
\x002\x000\x001\x000\x00\t\x00A\x00p\x00r\x00
\x002\x000\x001\x000\x00\t\x00M\x00a\x00r\x00
\x002\x000\x001\x000\x00\t\x00F\x00e\x00b\x00
\x002\x000\x001\x000\x00\t\x00J\x00a\x00n\x00
\x002\x000\x001\x000\x00\t\x00A\x00d\x00
\x00s\x00h\x00a\x00r\x00e\x00\t\x00S\x00e\x00a\x00r\x00c\x00h\x00
\x00s\x00h\x00a\x00r\x00e\x00\t\x00E\x00s\x00t\x00i\x00m\x00a\x00t\x00e\x00d\x00
\x00A\x00v\x00g\x00.\x00
\x00C\x00P\x00C\x00\t\x00E\x00x\x00t\x00r\x00a\x00c\x00t\x00e\x00d\x00
\x00F\x00r\x00o\x00m\x00
\x00W\x00e\x00b\x00
\x00P\x00a\x00g\x00e\x00\t\x00L\x00o\x00c\x00a\x00l\x00
\x00M\x00o\x00n\x00t\x00h\x00l\x00y\x00
\x00S\x00e\x00a\x00r\x00c\x00h\x00e\x00s\x00\n'
Adding ignore does not really help:
In [69]: data[2]
Out[69]: u'\u6700\u6100\u7200\u6400\u6500\u6e00\u2000\u6c00\u6100\u6d00\u7000\u2000\u7000\u6f00\u7300\u7400\u0900\u3000\u2e00\u3900\u3400\u0900\u3800\u3800\u3000\u0900\u2d00\u0900\u3300\u3200\u3000\u0900\u3300\u3900\u3000\u0900\u3300\u3900\u3000\u0900\u3400\u3800\u3000\u0900\u3500\u3900\u3000\u0900\u3500\u3900\u3000\u0900\u3700\u3200\u3000\u0900\u3700\u3200\u3000\u0900\u3300\u3900\u3000\u0900\u3300\u3200\u3000\u0900\u3200\u3600\u3000\u0900\u2d00\u0900\u2d00\u0900\ua300\u3200\u2e00\u3100\u3800\u0900\u2d00\u0900\u3400\u3800\u3000\u0a00'
In [70]: data[2].decode("utf-8", "replace")
---------------------------------------------------------------------------
Traceback (most recent call last)

/Users/oleg/<ipython console> in <module>()

/opt/local/lib/python2.5/encodings/utf_8.py in decode(input, errors)
     14
     15 def decode(input, errors='strict'):
---> 16     return codecs.utf_8_decode(input, errors, True)
     17
     18 class IncrementalEncoder(codecs.IncrementalEncoder):

UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-87: ordinal not in range(128)
This looks like UTF-16 data. So try
data[0].rstrip("\n").decode("utf-16")
Edit (for your update): Try to decode the whole file at once, that is
data = open(...).read()
data.decode("utf-16")
The problem is that the line breaks in UTF-16 are "\n\x00", but using readlines() will split at the "\n", leaving the "\x00" character for the next line.
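The stray "\x00" is easy to see by splitting UTF-16-LE bytes on b"\n", which is effectively what readlines() did on the binary file:

```python
raw = "ab\ncd\n".encode("utf-16-le")
print(raw)  # b'a\x00b\x00\n\x00c\x00d\x00\n\x00'

# Splitting on b"\n" (what a byte-oriented readlines() does) leaves each
# line's trailing \x00 stuck to the front of the next chunk
parts = raw.split(b"\n")
print(parts)  # [b'a\x00b\x00', b'\x00c\x00d\x00', b'\x00']
```

Each chunk after the first starts with the orphaned b"\x00", which is why per-line decoding of the original file produced the shifted code points shown above.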
This file is a UTF-16-LE encoded file, with an initial BOM.
import codecs
fp = codecs.open("a", "r", "utf-16")
lines = fp.readlines()
EDIT
Since you posted 2.7 this is the 2.7 solution:
file = open("./Downloads/lamp-post.csv", "r")
data = [line.decode("utf-16", "replace") for line in file]
Ignoring undecodeable characters:
file = open("./Downloads/lamp-post.csv", "r")
data = [line.decode("utf-16", "ignore") for line in file]
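In Python 3 the built-in open does the same job as codecs.open; a self-contained sketch, with invented CSV content standing in for the real file:

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "lamp-post.csv")

# Writing with the "utf-16" codec adds a BOM and uses the native byte order
with open(path, "w", encoding="utf-16") as f:
    f.write("Keyword\tCompetition\nlamp post\t0.94\n")

# Reading with the same codec consumes the BOM and splits lines correctly
with open(path, "r", encoding="utf-16") as f:
    lines = f.readlines()

print(lines)  # ['Keyword\tCompetition\n', 'lamp post\t0.94\n']
```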