This question already has answers here:
Python: Converting from ISO-8859-1/latin1 to UTF-8
(5 answers)
Closed 1 year ago.
My code looks like the following:
for file in glob.iglob(os.path.join(dir, '*.txt')):
    print(file)
    with codecs.open(file, encoding='latin-1') as f:
        infile = f.read()
    with codecs.open('test.txt', mode='w', encoding='utf-8') as f:
        f.write(infile)
The files I work with are encoded in Latin-1 (opening them as UTF-8 fails, obviously), but I want to write the resulting files in UTF-8.
But this:
<Trans audio_filename="VALE_M11_070.MP3" xml:lang="español">
<Datos clave_texto=" VALE_M11_070" tipo_texto="entrevista_semidirigida">
<Corpus corpus="PRESEEA" subcorpus="ESESUMA" ciudad="Valencia" pais="España"/>
Instead becomes this (in gedit):
<Trans audio_filename="VALE_M11_070.MP3" xml:lang="espa뇃漀氀∀㸀ഀ㰀䐀愀琀`漀猀 挀氀愀瘀攀开琀攀砀琀漀㴀∀ 嘀䄀䰀䔀开䴀开 㜀
If I print it in the terminal, it shows up fine.
Even more confusing is what I get when I open the resulting file with LibreOffice Writer:
<#T#r#a#n#s# (and so on)
So how do I properly convert a Latin-1 string to a UTF-8 string? In Python 2 it's easy, but in Python 3 it seems confusing to me.
I already tried these in different combinations:
#infile = bytes(infile,'utf-8').decode('utf-8')
#infile = infile.encode('utf-8').decode('utf-8')
But somehow I always end up with the same weird output.
Thanks in advance!
Edit: This question is different from the questions linked in the comments, as it concerns Python 3, not Python 2.7.
I have found a partial way around this. It is not exactly what you want or need, but it might point others in the right direction...
# First read the file
txt = open("file_name", "r", encoding="latin-1")  # r = read, w = write, a = append
items = txt.readlines()
txt.close()

# ... and write the changes back to the file
output = open("file_name", "w", encoding="utf-8")
for string_fin in items:
    if "é" in string_fin:
        string_fin = string_fin.replace("é", "é")
    if "ë" in string_fin:
        string_fin = string_fin.replace("ë", "ë")
    # this works if not too much needs changing...
    output.write(string_fin)
output.close()
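For comparison, the direct re-encoding (without per-character replacements) can be sketched like this. This is a minimal, self-contained example; the demo file and the `.utf8` output suffix are illustrative choices, and it assumes the inputs really are Latin-1. Writing each file to its own output name avoids overwriting a single test.txt on every loop iteration:

```python
import glob
import os

# Demo setup: a Latin-1 encoded file (just for illustration).
with open("demo.txt", "w", encoding="latin-1") as f:
    f.write("café")

# Re-encode every .txt file in the directory from Latin-1 to UTF-8,
# writing each result to a distinct <name>.utf8 output file.
for path in glob.iglob(os.path.join(".", "*.txt")):
    with open(path, encoding="latin-1") as src:
        text = src.read()
    with open(path + ".utf8", "w", encoding="utf-8") as dst:
        dst.write(text)
```

Since Latin-1 maps every byte to a code point, the read never fails; whether the result is meaningful depends on the file actually being Latin-1.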
For Python 3.6 — note the direction: to repair a string whose bytes were UTF-8 but which was decoded as Latin-1, encode it back to Latin-1 and then decode it as UTF-8:
your_str = your_str.encode('latin-1').decode('utf-8')
This question already has answers here:
Wrong encoding when reading file in Python 3?
(1 answer)
Read special characters from .txt file in python
(3 answers)
Closed last year.
I have a German wordlist that contains special characters like ä, ö, ü, for example the word "Nährstoffe". But when I read the text file and create a dict from it, I get a wrong word out of it.
Here is my code in Python 3:
import random
import csv
import os
permanettxtfile='wortliste.txt'
newlines = open(permanettxtfile, "r")
lines=newlines.read().split('\n')
random.shuffle(lines)
linkdict=dict.fromkeys(lines)
print(linkdict)
I get as output:
'NÃ¤hrstoffe': None
But i want:
'Nährstoffe': None
How can I solve this issue? Is this a UTF-8 issue?
Try opening the file in utf-8 encoding:
import random
import csv
import os
permanettxtfile='wortliste.txt'
with open(permanettxtfile, 'r', encoding='utf-8') as file:
    lines = file.read().split('\n')
random.shuffle(lines)
linkdict = dict.fromkeys(lines)
print(linkdict)
Also, don't forget to close the file, either with a context manager as in my example or with newlines.close() in your version.
Specify the encoding using
open(permanettxtfile, "r", encoding="UTF-8")
It is most likely an encoding issue; you can try this:
with open("filename.txt", "rb") as f:
    contents = f.read().decode("UTF-8")
or
with open("filename.txt", encoding='utf-8') as f:
    contents = f.read()
I have seen several similar posts on this but nothing has solved my problem.
I am reading a list of numbers with backslashes and writing them to a .csv. Obviously the backslashes are causing problems.
addr = "6253\342\200\2236387"
with open("output.csv", 'a') as w:
    write = writer(w)
    write.writerow([addr])
I found that using r"6253\342\200\2236387" gave me exactly the output I want, but since I am reading my input from a file I can't use a raw string. I tried .encode('string-escape'), but that gave me 6253\xe2\x80\x936387 as output, which is definitely not what I want, and unicode-escape gave me an error. Any thoughts?
The r in front of a string is only for defining a string literal. If you're reading data from a file, it's already 'raw'; you shouldn't have to do anything special when reading it in.
Note that if your data is not plain ascii, you may need to decode it or read it in binary. For example, if the data is utf-8, you can open the file like this before reading:
import codecs
f = codecs.open("test", "r", "utf-8")
Text file contains...
1234\4567\7890
41\5432\345\6789
Code:
with open('c:/tmp/numbers.csv', 'ab') as w:
    f = open(textfilepath)
    wr = csv.writer(w)
    for line in f:
        line = line.strip()
        wr.writerow([line])
    f.close()
This produced a csv with whole lines in a column. Maybe use 'ab' rather than 'a' as your file open mode; I was getting extra blank records in my csv when using just 'a'.
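In Python 3, the documented fix for those extra blank rows is not binary mode but opening the file in text mode with newline="", since the csv module handles its own line endings:

```python
import csv

# Python 3: open csv files with newline="" so the writer's own \r\n
# line endings are not translated again, which causes blank rows.
with open("output.csv", "a", newline="") as w:
    wr = csv.writer(w)
    wr.writerow([r"6253\342\200\2236387"])
```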
I created this a while back; it helps you write to a csv file.
def write2csv(fileName, theData):
    theFile = open(fileName + '.csv', 'a')
    wr = csv.writer(theFile, delimiter=',', quoting=csv.QUOTE_MINIMAL)
    wr.writerow(theData)
    theFile.close()
I am using this Python script to convert CSV to XML. After conversion I see stray BOM characters at the start of the text (in vim), which cause an XML parsing error.
I have already tried the answers from here, without success.
The converted XML file.
Thanks for any help!
Your input file has a BOM (byte-order mark), and Python doesn't strip it automatically when the file is encoded in UTF-8. See: Reading Unicode file data with BOM chars in Python
>>> s = '\xef\xbb\xbfABC'
>>> s.decode('utf8')
u'\ufeffABC'
>>> s.decode('utf-8-sig')
u'ABC'
So for your specific case, try something like
from io import StringIO
s = StringIO(open(csvFile).read().decode('utf-8-sig'))
csvData = csv.reader(s)
Very terrible style, but that script is a hacked-together one-shot job anyway.
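In Python 3 the same thing is simpler, since open() can decode utf-8-sig directly. A self-contained sketch (the demo file contents are just for illustration):

```python
import csv

# Demo setup: a CSV file that starts with a UTF-8 BOM (EF BB BF).
with open("myData.csv", "wb") as f:
    f.write(b"\xef\xbb\xbfa,b,c\n1,2,3\n")

# 'utf-8-sig' strips the BOM on read, so the first header cell comes
# back as plain 'a' rather than '\ufeffa'.
with open("myData.csv", newline="", encoding="utf-8-sig") as f:
    rows = list(csv.reader(f))
print(rows)  # [['a', 'b', 'c'], ['1', '2', '3']]
```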
Change utf-8 to utf-8-sig:
import csv
with open('example.txt', 'r', encoding='utf-8-sig') as file:
    reader = csv.reader(file)
Here's an example of a script that uses a real XML-aware library to run a similar conversion. It doesn't have the exact same output, but, well, it's an example -- salt to taste.
import csv
import lxml.etree
csvFile = 'myData.csv'
xmlFile = 'myData.xml'
reader = csv.reader(open(csvFile, 'r'))
with lxml.etree.xmlfile(xmlFile) as xf:
    xf.write_declaration(standalone=True)
    with xf.element('root'):
        for row in reader:
            row_el = lxml.etree.Element('row')
            for col in row:
                col_el = lxml.etree.SubElement(row_el, 'col')
                col_el.text = col
            xf.write(row_el)
To refer to the content of, say, row 2 column 3, you'd then use XPath like /row[2]/col[3]/text().
I know a similar kind of question has been asked many times, but seriously, I have not been able to get the csv writer to write properly to the csv file (it shows garbage).
I am trying to use the UnicodeWriter as mentioned in the official docs.
ff = open('a.csv', 'w')
writer = UnicodeWriter(ff)
st = unicode('Displaygrößen', 'utf-8')  # gives u'Displaygr\xf6\xdfen'
writer.writerow([st])
This does not give me any decoding or encoding error, but it writes the word Displaygrößen as DisplaygrÃ¶ÃŸen, which is not good. Can anyone help me figure out what I am doing wrong here?
You are writing a file in UTF-8, but you don't indicate that in your csv file.
You should write the UTF-8 header at the beginning of the file. Add this:
import codecs
ff = open('a.csv', 'w')
ff.write(codecs.BOM_UTF8)
And your csv file should open correctly after that with the program trying to read it.
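On Python 3, the same effect (a BOM telling Excel and friends the file is UTF-8) can be had by opening with encoding='utf-8-sig', which writes the BOM for you:

```python
import csv

# 'utf-8-sig' on write emits the UTF-8 BOM (EF BB BF) as the first
# three bytes, so spreadsheet programs detect the encoding correctly.
with open("a.csv", "w", newline="", encoding="utf-8-sig") as ff:
    csv.writer(ff).writerow(["Displaygrößen"])

with open("a.csv", "rb") as ff:
    raw = ff.read()
print(raw[:3])  # b'\xef\xbb\xbf'
```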
Opening the file with codecs.open should fix it.
I have two binary input files, firstfile and secondfile. secondfile is firstfile + additional material. I want to isolate this additional material in a separate file, newfile. This is what I have so far:
import os
import struct
origbytes = os.path.getsize(firstfile)
fullbytes = os.path.getsize(secondfile)
numbytes = fullbytes-origbytes
with open(secondfile,'rb') as f:
    first = f.read(origbytes)
    rest = f.read()
Naturally, my inclination is to do this (which seems to work):
with open(newfile,'wb') as f:
    f.write(rest)
I can't find it now, but I thought I read on SO that I should pack this first using struct.pack before writing it to the file. The following gives me an error:
with open(newfile,'wb') as f:
    f.write(struct.pack('%%%ds' % numbytes,rest))
-----> error: bad char in struct format
This works however:
with open(newfile,'wb') as f:
    f.write(struct.pack('c'*numbytes,*rest))
And for the versions that work, this gives me the right answer:
with open(newfile,'rb') as f:
    test = f.read()
len(test) == numbytes
-----> True
Is this the correct way to write a binary file? I just want to make sure I'm doing this part correctly, so I can tell whether the second part of the file is corrupted (as another reader program I feed newfile to claims) or whether I am doing this wrong. Thank you.
If you know that secondfile is the same as firstfile + appended data, why even read in the first part of secondfile?
with open(secondfile,'rb') as f:
    f.seek(origbytes)
    rest = f.read()
As for writing things out,
with open(newfile,'wb') as f:
    f.write(rest)
is just fine. The stuff with struct would just be a no-op anyway. The only thing you might consider is the size of rest. If it could be large, you may want to read and write the data in blocks.
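Reading and writing in blocks could be sketched like this; the 64 KiB chunk size and the demo file contents are arbitrary choices:

```python
import os

# Demo setup: 'secondfile' is 'firstfile' plus appended data.
with open("firstfile", "wb") as f:
    f.write(b"A" * 1000)
with open("secondfile", "wb") as f:
    f.write(b"A" * 1000 + b"B" * 500)

origbytes = os.path.getsize("firstfile")

# Copy everything after the first origbytes in fixed-size chunks,
# so memory use stays bounded no matter how large the tail is.
CHUNK = 64 * 1024
with open("secondfile", "rb") as src, open("newfile", "wb") as dst:
    src.seek(origbytes)
    while True:
        block = src.read(CHUNK)
        if not block:
            break
        dst.write(block)
```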
There is no reason to use the struct module, which is for converting between binary formats and Python objects. There's no conversion needed here.
Strings in Python 2.x are just an array of bytes and can be read and written to and from files. (In Python 3.x, the read function returns a bytes object, which is the same thing, if you open the file with open(filename, 'rb').)
So you can just read the file into a string, then write it again:
import os
origbytes = os.path.getsize(firstfile)
fullbytes = os.path.getsize(secondfile)
numbytes = fullbytes-origbytes
with open(secondfile,'rb') as f:
    f.seek(origbytes)
    rest = f.read()
with open(newfile,'wb') as f:
    f.write(rest)
You don't need to read the first origbytes bytes; just move the file pointer to the right position with f.seek(origbytes).
You don't need struct packing; just write rest to newfile.
This is not C; there is no % in the struct format string itself. What you want is:
f.write(struct.pack('%ds' % numbytes,rest))
It worked for me:
>>> struct.pack('%ds' % 5,'abcde')
'abcde'
Explanation: '%%%ds' % 15 is '%15s', while what you want is '%ds' % 15, which is '15s'.