Problem with UTF-8. The accent is printing wrong - python

I'm new to Python (version 3.9) and I'm trying to read and print a CSV file. However, when I print it, the accents appear wrong. I've tried 'UTF-8', 'latin' and 'ISO-8859-1', but none of them worked.
My code:
import io
import csv
with io.open('filename.csv', 'r', encoding='utf-8') as f:
    text = f.read()
# process Unicode text
with io.open('filename.csv', 'w', encoding='utf-8') as f:
    f.write(text)
print(text.encode('utf-8'))
It prints:
b'\xef\xbb\xbf"N\xc3\xbamero,""Descri\xc3\xa7\xc3\xa3o"",""Fonte"",""Situa\xc3\xa7\xc3\xa3o""
How can I fix this?

Apparently the input file starts with a byte order mark (BOM), b'\xef\xbb\xbf'. Use utf_8_sig, the UTF-8 codec with BOM signature: an optional UTF-8 encoded BOM at the start of the data will be skipped.
import io
import csv
with io.open('filename.csv', 'r', encoding='utf_8_sig') as f:
    text = f.read()
# process Unicode text
with io.open('filename.csv', 'w', encoding='utf-8') as f:
    f.write(text)
print(text)
# "Número","Descrição","Fonte","Situação"
I'm not sure about the exact output, as your input and output example is not minimal, complete and verifiable.

I replaced print(text.encode('utf-8')) with print(text) and it worked.
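To see both effects in isolation, here is a minimal sketch. It writes a small BOM-prefixed sample to a temporary file (the file name and the sample row are made up for illustration), then reads it back with each codec.

```python
import codecs
import os
import tempfile

# Hypothetical sample: a UTF-8 BOM followed by one CSV header row.
sample = codecs.BOM_UTF8 + '"Número","Descrição"\n'.encode("utf-8")

path = os.path.join(tempfile.mkdtemp(), "sample.csv")
with open(path, "wb") as f:
    f.write(sample)

# Plain utf-8 keeps the BOM as the character U+FEFF at the start of the text.
with open(path, "r", encoding="utf-8") as f:
    with_bom = f.read()

# utf-8-sig skips the BOM if one is present.
with open(path, "r", encoding="utf-8-sig") as f:
    without_bom = f.read()

print(repr(with_bom[0]))            # the BOM character
print(without_bom, end="")          # readable text with accents
print(without_bom.encode("utf-8"))  # a bytes repr with \x.. escapes
```

Printing the str shows the accented characters; printing the result of .encode() shows the repr of a bytes object, which is where the \xc3\xba style escapes come from.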

Related

Python can't parse my list of ints [duplicate]

I needed to parse files generated by another tool, which unconditionally outputs a JSON file with a UTF-8 BOM header (EFBBBF). I soon found that this was the problem, as the Python 2.7 json module can't seem to parse it:
>>> import json
>>> data = json.load(open('sample.json'))
ValueError: No JSON object could be decoded
Removing the BOM solves it, but I wonder if there is another way of parsing a JSON file with a BOM header?
You can open with codecs:
import json
import codecs
json.load(codecs.open('sample.json', 'r', 'utf-8-sig'))
or decode with utf-8-sig yourself and pass to loads:
json.loads(open('sample.json').read().decode('utf-8-sig'))
Simple! You don't even need to import codecs.
with open('sample.json', encoding='utf-8-sig') as f:
    data = json.load(f)
Since json.load(stream) uses json.loads(stream.read()) under the hood, it won't be that bad to write a small helper function that strips the BOM from the left:
from codecs import BOM_UTF8
def lstrip_bom(str_, bom=BOM_UTF8):
    if str_.startswith(bom):
        return str_[len(bom):]
    else:
        return str_
json.loads(lstrip_bom(open('sample.json').read()))
In other situations where you need to wrap a stream and fix it somehow you may look at inheriting from codecs.StreamReader.
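The helper above works on Python 2 byte strings. As a rough Python 3 sketch of the same idea (the payload is made up for illustration): once a file has been decoded as plain utf-8, the BOM shows up as the single character U+FEFF, and json.loads rejects text that starts with it.

```python
import json

# Hypothetical payload: the text that a BOM-prefixed file yields when it is
# decoded as plain utf-8, so the BOM survives as U+FEFF.
raw_text = '\ufeff{"ids": [1, 2, 3]}'


def lstrip_bom(text, bom="\ufeff"):
    """Return text without a leading BOM character, if there is one."""
    return text[len(bom):] if text.startswith(bom) else text


try:
    json.loads(raw_text)
    rejected = False
except json.JSONDecodeError:
    rejected = True

data = json.loads(lstrip_bom(raw_text))
print(rejected, data)
```

Opening the file with encoding='utf-8-sig' avoids the need for such a helper, so this is mainly useful when the text arrives already decoded.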
You can also do it with the with keyword:
import codecs
with codecs.open('samples.json', 'r', 'utf-8-sig') as json_file:
    data = json.load(json_file)
or better:
import io
with io.open('samples.json', 'r', encoding='utf-8-sig') as json_file:
    data = json.load(json_file)
If this is a one-off, a very simple super high-tech solution that worked for me...
Open the JSON file in your favorite text editor.
Select-all
Create a new file
Paste
Save.
BOOM, BOM header gone!
I removed the BOM manually with Linux commands.
First I check whether the file starts with the bytes efbb bf, using head i_have_BOM.json | xxd.
Then I run dd bs=1 skip=3 if=i_have_BOM.json of=I_dont_have_BOM.json.
bs=1 processes one byte at a time, and skip=3 skips the first 3 bytes.
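A Python counterpart to the dd step might look like the sketch below. Unlike dd with skip=3, it only drops the first three bytes when they really are a UTF-8 BOM. The function names are hypothetical, and the whole file is read into memory, which is fine for small files.

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Drop a leading UTF-8 BOM (EF BB BF) from data, if present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


def strip_bom_file(src: str, dst: str) -> None:
    """Copy src to dst without a leading UTF-8 BOM."""
    with open(src, "rb") as f:
        data = f.read()
    with open(dst, "wb") as f:
        f.write(strip_utf8_bom(data))
```

For example, strip_bom_file('i_have_BOM.json', 'I_dont_have_BOM.json') would mirror the dd command above.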
I'm using utf-8-sig just with import json
with open('estados.json', encoding='utf-8-sig') as json_file:
    data = json.load(json_file)
print(data)

Raw string for variables in python?

I have seen several similar posts on this but nothing has solved my problem.
I am reading a list of numbers with backslashes and writing them to a .csv. Obviously the backslashes are causing problems.
from csv import writer
addr = "6253\342\200\2236387"
with open("output.csv", 'a') as w:
    write = writer(w)
    write.writerow([addr])
I found that using r"6253\342\200\2236387" gave me exactly what I want for the output, but since I am reading my input from a file I can't use a raw string. I tried .encode('string-escape'), but that gave me 6253\xe2\x80\x936387 as output, which is definitely not what I want. unicode-escape gave me an error. Any thoughts?
The r in front of a string is only for defining a string. If you're reading data from a file, it's already 'raw'. You shouldn't have to do anything special when reading in your data.
Note that if your data is not plain ascii, you may need to decode it or read it in binary. For example, if the data is utf-8, you can open the file like this before reading:
import codecs
f = codecs.open("test", "r", "utf-8")
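A small Python 3 sketch of that point, using a temporary file with a made-up name: text read from a file already contains literal backslashes, so it compares equal to the raw-string literal.

```python
import os
import tempfile

# Hypothetical input file holding the text with literal backslashes.
path = os.path.join(tempfile.mkdtemp(), "addresses.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write(r"6253\342\200\2236387" + "\n")

with open(path, "r", encoding="utf-8") as f:
    addr = f.readline().rstrip("\n")

# Each backslash is an ordinary character here, not the start of an escape
# sequence. Escapes are only interpreted in string literals in source code.
print(addr == r"6253\342\200\2236387")
print(len(addr))
```

The octal escapes in the question were interpreted because the string was written as a normal literal in the source file, not because of anything the csv module did.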
Text file contains...
1234\4567\7890
41\5432\345\6789
Code:
with open('c:/tmp/numbers.csv', 'ab') as w:
    f = open(textfilepath)
    wr = csv.writer(w)
    for line in f:
        line = line.strip()
        wr.writerow([line])
    f.close()
This produced a csv with whole lines in a column. Maybe use 'ab' rather than 'a' as your file open type. I was getting extra blank records in my csv when using just 'a'.
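The 'ab' mode applies to Python 2. In Python 3 the csv module expects a text-mode file opened with newline=''. A minimal sketch of the same loop under that assumption (the output path is a temporary, made-up one, and the rows are the sample lines from above):

```python
import csv
import os
import tempfile

# Sample rows from the answer above, with literal backslashes.
lines = [r"1234\4567\7890", r"41\5432\345\6789"]

out_path = os.path.join(tempfile.mkdtemp(), "numbers.csv")  # hypothetical path

# newline="" lets the csv module control line endings, which avoids the
# extra blank rows that can otherwise appear on Windows.
with open(out_path, "a", newline="", encoding="utf-8") as w:
    wr = csv.writer(w)
    for line in lines:
        wr.writerow([line.strip()])

# Read the file back to confirm the backslashes survived unchanged.
with open(out_path, newline="", encoding="utf-8") as r:
    rows = list(csv.reader(r))
print(rows)
```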
I created this awhile back. This helps you write to a csv file.
import csv
def write2csv(fileName, theData):
    # Append one row to <fileName>.csv; the with block closes the file.
    with open(fileName + '.csv', 'a') as theFile:
        wr = csv.writer(theFile, delimiter=',', quoting=csv.QUOTE_MINIMAL)
        wr.writerow(theData)

Write decoded from base64 string to file

How do I write a string decoded from base64 to a file? I use the following piece of code:
import base64
input_file = open('Input.txt', 'r')
coded_string = input_file.read()
decoded = base64.b64decode(coded_string)
output_file = open('Output.txt', 'w')
output_file.write(decoded)
output_file.close()
Input.txt contains a base64 string (something like PD94bWwgdmVyc2lvbj0iMS4wIiBlbmNvZGluZz0iVVRGLTgiPz48cmV2aW). After the script runs I expect to see XML in Output.txt, but the output file contains some wrong symbols (like <?xml version="1.0" encoding="UTF-8"?><review-case create®vFFSТ#2). At the same time, if I don't read the base64 string from Input.txt but specify it in the script as coded_string = '''PD94bWwgdmVyc2lvbj0iMS4wIiBlbm...''', then Output.txt contains correct XML. Is something wrong with the UTF encoding? How can I fix this? I use Python 2.7 on Windows 7. Thanks in advance.
You have probably figured it out by now, 5 years later, but here is the solution in case anyone needs it.
import base64
with open('Input.txt', 'r') as input_file:
    coded_string = input_file.read()
decoded = base64.b64decode(coded_string)
with open('Output.txt', 'w', encoding="utf-8") as output_file:
    output_file.write(decoded.decode("utf-8"))
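A variant worth considering, sketched here for Python 3 with a made-up payload in place of Input.txt: base64.b64decode returns bytes, so writing them in binary mode avoids both newline translation and any assumption about the payload's text encoding.

```python
import base64
import os
import tempfile

# Hypothetical payload standing in for the contents of Input.txt.
xml_bytes = '<?xml version="1.0" encoding="UTF-8"?><review/>'.encode("utf-8")
coded_string = base64.b64encode(xml_bytes).decode("ascii")

decoded = base64.b64decode(coded_string)  # bytes, not str

out_path = os.path.join(tempfile.mkdtemp(), "Output.txt")
# Binary mode writes the bytes exactly as decoded.
with open(out_path, "wb") as output_file:
    output_file.write(decoded)

# Read the file back in binary mode to check the round trip.
with open(out_path, "rb") as check:
    round_trip = check.read()
print(round_trip == xml_bytes)
```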
Under Windows, open the file with 'rb' instead of 'r'.
In your case the code should be:
input_file = open('Input.txt', 'rb')
instead of
input_file = open('Input.txt', 'r')
btw:
http://docs.python.org/2/tutorial/inputoutput.html#reading-and-writing-files
On Windows, 'b' appended to the mode opens the file in binary mode, so there are also modes like 'rb', 'wb', and 'r+b'. Python on Windows makes a distinction between text and binary files; the end-of-line characters in text files are automatically altered slightly when data is read or written. This behind-the-scenes modification to file data is fine for ASCII text files, but it’ll corrupt binary data like that in JPEG or EXE files. Be very careful to use binary mode when reading and writing such files. On Unix, it doesn’t hurt to append a 'b' to the mode, so you can use it platform-independently for all binary files.
hope it helps

Python reading from a file and saving to utf-8

I'm having problems reading from a file, processing its string and saving to a UTF-8 file.
Here is the code:
try:
    filehandle = open(filename,"r")
except:
    print("Could not open file " + filename)
    quit()
text = filehandle.read()
filehandle.close()
I then do some processing on the variable text.
And then
try:
    writer = open(output,"w")
except:
    print("Could not open file " + output)
    quit()
#data = text.decode("iso 8859-15")
#writer.write(data.encode("UTF-8"))
writer.write(text)
writer.close()
This outputs the file perfectly, but according to my editor it does so in ISO 8859-15. Since the same editor recognizes the input file (in the variable filename) as UTF-8, I don't know why this happens. As far as my research has shown, the commented lines should solve the problem. However, when I use those lines the resulting file has gibberish in the special characters, mainly in words with tildes, as the text is in Spanish. I would really appreciate any help, as I am stumped.
Process text to and from Unicode at the I/O boundaries of your program using open with the encoding parameter. Make sure to use the (hopefully documented) encoding of the file being read. The default encoding varies by OS (specifically, locale.getpreferredencoding(False) is the encoding used), so I recommend always explicitly using the encoding parameter for portability and clarity (Python 3 syntax below):
with open(filename, 'r', encoding='utf8') as f:
    text = f.read()
# process Unicode text
with open(filename, 'w', encoding='utf8') as f:
    f.write(text)
If still using Python 2 or for Python 2/3 compatibility, the io module implements open with the same semantics as Python 3's open and exists in both versions:
import io
with io.open(filename, 'r', encoding='utf8') as f:
    text = f.read()
# process Unicode text
with io.open(filename, 'w', encoding='utf8') as f:
    f.write(text)
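The same pattern also covers converting a file from one encoding to another. The sketch below assumes the source really is ISO 8859-15. The transcode helper, the file names and the sample text are all hypothetical.

```python
import os
import tempfile


def transcode(src, dst, src_encoding, dst_encoding="utf-8"):
    """Rewrite src as dst, decoding and encoding explicitly at the file boundary."""
    # newline="" keeps the original line endings untouched.
    with open(src, "r", encoding=src_encoding, newline="") as f:
        text = f.read()
    with open(dst, "w", encoding=dst_encoding, newline="") as f:
        f.write(text)


workdir = tempfile.mkdtemp()
src = os.path.join(workdir, "latin9.txt")  # hypothetical file names
dst = os.path.join(workdir, "utf8.txt")

# Create a small ISO 8859-15 file to convert.
with open(src, "wb") as f:
    f.write("año, niño\n".encode("iso-8859-15"))

transcode(src, dst, src_encoding="iso-8859-15")

with open(dst, "rb") as f:
    result = f.read()
print(result)
```

If the declared source encoding is wrong, the decode step either fails or produces the kind of gibberish described in the question, so it is worth confirming the input encoding first.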
You can also get through it by the code below:
file=open(completefilepath,'r',encoding='utf8',errors="ignore")
file.read()
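Note that errors="ignore" silently drops any bytes that cannot be decoded, so characters can disappear from the text. A short sketch with a made-up sample word shows the difference between the error handlers:

```python
# A Latin-1 encoded word (hypothetical sample) that is not valid UTF-8.
data = "mañana".encode("latin-1")  # b'ma\xf1ana'

# "ignore" drops the undecodable byte; "replace" substitutes U+FFFD.
ignored = data.decode("utf-8", errors="ignore")
replaced = data.decode("utf-8", errors="replace")

try:
    data.decode("utf-8")  # the default is errors="strict"
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

print(ignored)
print(replaced)
print(strict_failed)
```

With "ignore" the ñ is lost entirely, which hides the real problem: the file is being read with the wrong encoding.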
You can't do that using open in Python 2. Use codecs.
When you open a file in Python 2 using the built-in open function, you read and write byte strings, and there is no encoding parameter. To write it in UTF-8, try this:
import codecs
file = codecs.open('data.txt','w','utf-8')
The encoding parameter is what does the trick.
my_list = ['1', '2', '3', '4']
with open('test.txt', 'w', encoding='utf8') as file:
    for i in my_list:
        file.write(i + '\n')
You can try using utf-16, it might work.
import pandas as pd
data = pd.read_table(filename, encoding='utf-16', delimiter="\t")

Python: Problems with latin characters in output

I have a document in Spanish I'd like to format using Python. Problem is that in the output file, the accented characters are messed up, in this manner: \xc3\xad.
I succeeded in keeping the proper characters when I did some similar editing a while back, and although I've tried everything I did then and more, somehow it won't work this time.
This is current version of the code:
# -*- coding: utf-8 -*-
import re
import pickle
inputfile = open("input.txt").read()
pat = re.compile(r"(#.*\*)")
mylist = pat.findall(inputfile)
outputfile = open("output.txt", "w")
pickle.dump(mylist, outputfile)
outputfile.close()
I'm using Python 2.7 on Windows 7.
Can anyone see any obvious problems? The inputfile is encoded in utf-8, but I've tried encoding it latin-1 too. Thanks.
To clarify: My problem is that the Latin characters don't show up properly in the output.
It's solved now, I just had to add this line as suggested by mata:
inputfile = inputfile.decode('utf-8')
If the input file is encoded in utf-8, then you should decode it first to work with it:
import re
import pickle
inputfile = open("input.txt").read()
inputfile = inputfile.decode('utf-8')
pat = re.compile(r"(#.*\*)")
mylist = pat.findall(inputfile)
outputfile = open("output.txt", "w")
pickle.dump(mylist, outputfile)
outputfile.close()
The file created this way will contain a pickled version of your list. If you would rather have a human-readable file, you might want to just use a plain file.
Also, a good way to deal with different encodings is the codecs module:
import re
import codecs
with codecs.open("input.txt", "r", "utf-8") as infile:
    inp = infile.read()
pat = re.compile(r"(#.*\*)")
mylist = pat.findall(inp)
with codecs.open("output.txt", "w", "utf-8") as outfile:
    outfile.write("\n".join(mylist))
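For reference, \xc3\xad in the question is simply the two-byte UTF-8 encoding of the character í shown as escapes. The Python 3 sketch below works in memory on a made-up input line and follows the same decode-process-encode flow as the answer above.

```python
import re

# Hypothetical input line in the "#...*" shape that the pattern matches,
# held as bytes the way it would be stored on disk.
raw = "# El río está aquí *\n".encode("utf-8")

# The two bytes C3 AD are the UTF-8 encoding of the single character "í".
assert "í".encode("utf-8") == b"\xc3\xad"

text = raw.decode("utf-8")  # decode once, at the input boundary
pat = re.compile(r"(#.*\*)")
mylist = pat.findall(text)

output = "\n".join(mylist).encode("utf-8")  # encode once, at the output boundary
print(mylist)
```

Seeing \xc3\xad in a file or on screen usually means a repr of undecoded bytes was written, as happens when pickling or printing byte strings, and not that the data itself is damaged.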
