I have a document in Spanish I'd like to format using Python. The problem is that in the output file, the accented characters come out mangled, like this: \xc3\xad.
I succeeded in keeping the proper characters when I did some similar editing a while back, and although I've tried everything I did then and more, somehow it won't work this time.
This is the current version of the code:
# -*- coding: utf-8 -*-
import re
import pickle
inputfile = open("input.txt").read()
pat = re.compile(r"(#.*\*)")
mylist = pat.findall(inputfile)
outputfile = open("output.txt", "w")
pickle.dump(mylist, outputfile)
outputfile.close()
I'm using Python 2.7 on Windows 7.
Can anyone see any obvious problems? The input file is encoded in UTF-8, but I've tried encoding it as latin-1 too. Thanks.
To clarify: my problem is that the accented characters don't show up properly in the output.
It's solved now, I just had to add this line as suggested by mata:
inputfile = inputfile.decode('utf-8')
If the input file is encoded in UTF-8, then you should decode it first to work with it:
import re
import pickle
inputfile = open("input.txt").read()
inputfile = inputfile.decode('utf-8')
pat = re.compile(r"(#.*\*)")
mylist = pat.findall(inputfile)
outputfile = open("output.txt", "w")
pickle.dump(mylist, outputfile)
outputfile.close()
The file created this way will contain a pickled version of your list. If you would rather have a human-readable file, then you might want to just write plain text.
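If you do stay with pickle, here's a minimal round-trip sketch (my own example data, not from the question; binary mode is the safe choice for pickle files, especially on Windows):
import pickle

mylist = [u'mar\xedtimo']  # hypothetical example data: a list of unicode strings

# 'wb'/'rb' avoid the line-ending translation Windows applies in text mode,
# which can corrupt pickle data.
with open("output.txt", "wb") as f:
    pickle.dump(mylist, f)

with open("output.txt", "rb") as f:
    restored = pickle.load(f)
print(restored)  # [u'mar\xedtimo']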
Also, a good way to deal with different encodings is the codecs module:
import re
import codecs
with codecs.open("input.txt", "r", "utf-8") as infile:
inp = infile.read()
pat = re.compile(r"(#.*\*)")
mylist = pat.findall(inp)
with codecs.open("output.txt", "w", "utf-8") as outfile:
outfile.write("\n".join(mylist))
I have a German wordlist which contains special characters like ä, ö, ü, and words like "Nährstoffe". But when I read the text file and create a dict from it, I get a garbled word out of it.
Here is my code in Python 3:
import random
import csv
import os
permanettxtfile='wortliste.txt'
newlines = open(permanettxtfile, "r")
lines=newlines.read().split('\n')
random.shuffle(lines)
linkdict=dict.fromkeys(lines)
print(linkdict)
I get as output:
'NÃ¤hrstoffe': None
But I want:
'Nährstoffe': None
How can I solve this issue? Is this a UTF-8 issue?
Try opening the file in utf-8 encoding:
import random
import csv
import os
permanettxtfile='wortliste.txt'
with open(permanettxtfile, 'r', encoding='utf-8') as file:
    lines = file.read().split('\n')
random.shuffle(lines)
linkdict = dict.fromkeys(lines)
print(linkdict)
Also, don't forget to close the file, either with a context manager as in my example, or with newlines.close() in your version.
Specify the encoding using
open(permanettxtfile, "r", encoding="UTF-8")
It is most likely an encoding issue. You can try this:
with open("filename.txt", "rb") as f:
contents = f.read().decode("UTF-8")
or
with open("filename.txt", encoding='utf-8') as f:
contents = f.read()
I'm new to Python (version 3.9) and I'm trying to read and print a CSV file, but when it comes to printing, the accented characters come out wrong. I've tried 'UTF-8', 'latin' and 'ISO-8859-1', but none worked.
My code:
import io
import csv
with io.open('filename.csv', 'r', encoding='utf-8') as f:
    text = f.read()
# process Unicode text
with io.open('filename.csv', 'w', encoding='utf-8') as f:
    f.write(text)
print(text.encode('utf-8'))
It prints:
b'\xef\xbb\xbf"N\xc3\xbamero,""Descri\xc3\xa7\xc3\xa3o"",""Fonte"",""Situa\xc3\xa7\xc3\xa3o""
How can I fix this?
Apparently, the input file contains a byte order mark, b'\xef\xbb\xbf'. Apply utf_8_sig, the UTF-8 codec with BOM signature (an optional UTF-8 encoded BOM at the start of the data will be skipped).
import io
import csv
with io.open('filename.csv', 'r', encoding='utf_8_sig') as f:
    text = f.read()
# process Unicode text
with io.open('filename.csv', 'w', encoding='utf-8') as f:
    f.write(text)
print(text)
# "Número","Descrição","Fonte","Situação"
Not sure about the exact result, as your input and output examples are not minimal, complete, and verifiable.
I changed print(text.encode('utf-8')) to print(text) and it worked; encode() returns the raw bytes, which is why the b'...' output appeared.
The question is: how do I write a string decoded from base64 to a file? I use the following piece of code:
import base64
input_file = open('Input.txt', 'r')
coded_string = input_file.read()
decoded = base64.b64decode(coded_string)
output_file = open('Output.txt', 'w')
output_file.write(decoded)
output_file.close()
Input.txt contains a base64 string (something like PD94bWwgdmVyc2lvbj0iMS4wIiBlbmNvZGluZz0iVVRGLTgiPz48cmV2aW). After the script runs I expect to see XML in Output.txt, but the output file contains some wrong symbols (like <?xml version="1.0" encoding="UTF-8"?><review-case create®vFFSТ#2). At the same time, if I don't read the base64 string from Input.txt but instead specify it in the script as coded_string = '''PD94bWwgdmVyc2lvbj0iMS4wIiBlbm...''', then Output.txt contains correct XML. Is something wrong with the UTF encoding? How do I fix this? I'm using Python 2.7 on Windows 7. Thanks in advance.
You probably figured it out, now 5 years later, but here is the solution if anyone needs it.
import base64

with open('Input.txt', 'r') as input_file:
    coded_string = input_file.read()
decoded = base64.b64decode(coded_string)
with open('Output.txt', 'w', encoding="utf-8") as output_file:
    output_file.write(decoded.decode("utf-8"))
Under Windows you open with 'rb' instead of 'r'.
In your case your code should be:
input_file = open('Input.txt', 'rb')
instead of
input_file = open('Input.txt', 'r')
By the way, from http://docs.python.org/2/tutorial/inputoutput.html#reading-and-writing-files:
On Windows, 'b' appended to the mode opens the file in binary mode, so there are also modes like 'rb', 'wb', and 'r+b'. Python on Windows makes a distinction between text and binary files; the end-of-line characters in text files are automatically altered slightly when data is read or written. This behind-the-scenes modification to file data is fine for ASCII text files, but it’ll corrupt binary data like that in JPEG or EXE files. Be very careful to use binary mode when reading and writing such files. On Unix, it doesn’t hurt to append a 'b' to the mode, so you can use it platform-independently for all binary files.
Hope it helps.
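Putting it together, here's a minimal sketch (my adaptation, not the original poster's code) that keeps everything in binary mode, so Windows' line-ending translation can't touch the decoded bytes:
import base64

# Read the base64 text as raw bytes; 'rb' avoids newline translation on Windows.
with open('Input.txt', 'rb') as input_file:
    coded_string = input_file.read()

decoded = base64.b64decode(coded_string)

# Write the decoded bytes untouched; 'wb' likewise prevents any translation.
with open('Output.txt', 'wb') as output_file:
    output_file.write(decoded)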
I am using this Python script to convert CSV to XML. After conversion I see stray BOM characters at the start of the text (visible in vim), which cause an XML parsing error.
I have already tried the answers from here, without success.
Here is the converted XML file.
Thanks for any help!
Your input file has BOM (byte-order mark) characters, and Python doesn't strip them automatically when the file is read as utf8. See: Reading Unicode file data with BOM chars in Python
>>> s = '\xef\xbb\xbfABC'
>>> s.decode('utf8')
u'\ufeffABC'
>>> s.decode('utf-8-sig')
u'ABC'
So for your specific case, try something like
from io import StringIO
s = StringIO(open(csvFile).read().decode('utf-8-sig'))
csvData = csv.reader(s)
Terrible style, but that script is a hacked-together one-shot job anyway.
Change utf-8 to utf-8-sig
import csv
with open('example.txt', 'r', encoding='utf-8-sig') as file:
    reader = csv.reader(file)  # proceed as before; the BOM is now stripped
Here's an example of a script that uses a real XML-aware library to run a similar conversion. It doesn't have the exact same output, but, well, it's an example -- salt to taste.
import csv
import lxml.etree

csvFile = 'myData.csv'
xmlFile = 'myData.xml'

reader = csv.reader(open(csvFile, 'r'))
with lxml.etree.xmlfile(xmlFile) as xf:
    xf.write_declaration(standalone=True)
    with xf.element('root'):
        for row in reader:
            row_el = lxml.etree.Element('row')
            for col in row:
                col_el = lxml.etree.SubElement(row_el, 'col')
                col_el.text = col
            xf.write(row_el)
To refer to the content of, say, row 2, column 3, you'd then use XPath like /root/row[2]/col[3]/text().
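For instance, a small sketch (assuming the root/row/col structure generated above) of querying the result:
import lxml.etree

tree = lxml.etree.parse('myData.xml')
# XPath indices are 1-based: second row, third column.
print(tree.xpath('/root/row[2]/col[3]/text()'))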
I'm having problems reading from a file, processing its string, and saving to a UTF-8 file.
Here is the code:
try:
    filehandle = open(filename, "r")
except:
    print("Could not open file " + filename)
    quit()
text = filehandle.read()
filehandle.close()
I then do some processing on the variable text.
And then
try:
    writer = open(output, "w")
except:
    print("Could not open file " + output)
    quit()
#data = text.decode("iso 8859-15")
#writer.write(data.encode("UTF-8"))
writer.write(text)
writer.close()
This outputs the file perfectly, but according to my editor it does so in iso 8859-15. Since the same editor recognizes the input file (in the variable filename) as UTF-8, I don't know why this happens. As far as my research has shown, the commented lines should solve the problem. However, when I use those lines, the resulting file has gibberish, mainly in special characters: words with tildes, as the text is in Spanish. I would really appreciate any help, as I am stumped.
Process text to and from Unicode at the I/O boundaries of your program using open with the encoding parameter. Make sure to use the (hopefully documented) encoding of the file being read. The default encoding varies by OS (specifically, locale.getpreferredencoding(False) is the encoding used), so I recommend always explicitly using the encoding parameter for portability and clarity (Python 3 syntax below):
with open(filename, 'r', encoding='utf8') as f:
    text = f.read()
# process Unicode text
with open(filename, 'w', encoding='utf8') as f:
    f.write(text)
If still using Python 2 or for Python 2/3 compatibility, the io module implements open with the same semantics as Python 3's open and exists in both versions:
import io

with io.open(filename, 'r', encoding='utf8') as f:
    text = f.read()
# process Unicode text
with io.open(filename, 'w', encoding='utf8') as f:
    f.write(text)
You can also get around it with the code below:
file = open(completefilepath, 'r', encoding='utf8', errors="ignore")
contents = file.read()
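Be aware that errors="ignore" silently drops any bytes that don't decode. If you'd rather see where the problems are, a variant with a different error handler substitutes visible markers instead:
# errors='replace' inserts U+FFFD for undecodable bytes rather than
# silently dropping them, which makes decoding problems visible.
file = open(completefilepath, 'r', encoding='utf8', errors='replace')
contents = file.read()
file.close()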
You can't do that using open; use codecs. In Python 2, when you open a file with the built-in open function, you always read and write raw bytes, with no encoding handling. To write UTF-8, try this:
import codecs
file = codecs.open('data.txt','w','utf-8')
The encoding parameter is what does the trick.
my_list = ['1', '2', '3', '4']
with open('test.txt', 'w', encoding='utf8') as file:
    for i in my_list:
        file.write(i + '\n')
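Reading the file back with the matching encoding round-trips the data; a quick check:
# Read the file written above back with the same encoding.
with open('test.txt', 'r', encoding='utf8') as file:
    items = file.read().splitlines()
print(items)  # ['1', '2', '3', '4']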
You can try using utf-16; it might work.
import pandas as pd
data = pd.read_table(filename, encoding='utf-16', delimiter="\t")
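If you aren't sure which encoding a file uses, one option (a heuristic sketch; it assumes the file begins with a byte-order mark, which many files don't have) is to peek at the BOM before reading:
import pandas as pd

with open(filename, 'rb') as f:
    head = f.read(4)

# Known BOM byte sequences for UTF-16 and UTF-8.
if head.startswith((b'\xff\xfe', b'\xfe\xff')):
    enc = 'utf-16'
elif head.startswith(b'\xef\xbb\xbf'):
    enc = 'utf-8-sig'
else:
    enc = 'utf-8'  # fallback guess, not a detection

data = pd.read_table(filename, encoding=enc, delimiter="\t")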