I am trying to write a CSV file containing Arabic script. I have encoded the strings to UTF-8 and written them to the CSV.
The problem is that if I open the file in Excel it shows strange characters like آلز سندويتش كاÙيه, whereas if I open the file in Notepad++ it shows the expected Arabic text.
I checked in Notepad++ and converted the encoding to UTF-8 instead of UTF-8 without BOM, and now it also works fine in a CSV reader (Excel). So what should I do to set the encoding to "UTF-8 with BOM" in App Engine?
I am using unicodecsv.writer to write the CSV:
import unicodecsv

writer = unicodecsv.writer(self.response.out)
row = []
row.append(transaction.name.encode('utf8'))
writer.writerow(row)
The data to be written is taken from the Datastore.
To write a CSV with UTF-8 BOM, simply write the BOM first; you can use codecs.BOM_UTF8 for that:
import codecs
import csv

self.response.out.write(codecs.BOM_UTF8)
writer = csv.writer(self.response.out)
row = []
row.append(transaction.name.encode('utf8'))
writer.writerow(row)
Excel 2007 and newer pick up on the BOM and correctly open such a CSV file as UTF-8. Silly Microsoft!
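As an aside (not part of the original answer): if you are on Python 3 and writing to a local file rather than an App Engine response, opening the file with encoding='utf-8-sig' writes the BOM for you. A minimal sketch, with a placeholder file name and row:

import csv

# 'utf-8-sig' prepends the UTF-8 BOM on write, so Excel detects the encoding.
# 'names.csv' and the Arabic word below are just placeholders for illustration.
with open('names.csv', 'w', encoding='utf-8-sig', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['مثال'])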
Related
I am reading from a CSV file and printing to the console in Python. The first printed line includes some odd characters at the beginning of the string. The file in its entirety is:
john,12345
jacob,23456
jingle,34567
heimer,45678
My code is:
import csv
with open("C:\\Users\\user\\key.csv") as file:
    reader = csv.reader(file)
    for row in reader:
        print(row)
The output is:
['ï»¿john', '12345']
['jacob', '23456']
['jingle', '34567']
['heimer', '45678']
I do not know where the "ï»¿" in the first line is coming from.
The "ï»" is a byte order mark (BOM), which is used at the beginning of some Unicode files or streams to indicate the "endianness" of multi-byte encoding variants (UTF-16 or UTF-32), or, like in this case, to indicate that the file is using UTF-8. Using a BOM at the beginning of a UTF-8 file is optional, but some applications -- like Microsoft Excel -- will use a BOM to indicate that the file is using UTF-8.
If you're using Python 3 and you know that your file is UTF-8, you should be able to just open the file with the utf-8-sig encoding, which strips the BOM for you:
with open("C:\\Users\\user\\key.csv", encoding="utf-8-sig") as file:
    reader = csv.reader(file)
    for row in reader:
        print(row)
Here's the Python documentation on this:
https://docs.python.org/3/howto/unicode.html#reading-and-writing-unicode-data
I have a database export in CSV which is UTF-8 encoded.
When I open it in Excel, I have to choose Windows (ANSI) on import in order for the special characters (é, è, à for instance) to display correctly.
If I use Python pandas to open the CSV file specifying UTF-8 encoding, it does not seem to be decoded correctly (the é, è, à characters are not displayed correctly):
StŽphanie
FrŽdŽrique
GŽraldine
How should I correctly read this file with Python pandas?
Thanks a lot
This encoding is Windows-1252, referred to as "cp1252" by Python. ANSI is a misnomer; the encoding is completely unrelated to the standards organisation.
Try:
with open("filepath.csv", encoding="cp1252") as f:
pandas.read_csv(f)
The solution was actually to use latin1 encoding in my case:
Stéphanie
Frédérique
Géraldine
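For reference, the read that worked can be written directly with the encoding argument (same hypothetical file path as above):

import pandas

df = pandas.read_csv("filepath.csv", encoding="latin1")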
I'm attempting to parse an XML file and print sections of the contents into a CSV file for manipulation with a program such as Microsoft Excel. The issue I'm running into is that the XML file contains multiple alphabets (Arabic, Cyrillic, etc.) and I'm getting confused over what encoding I should be using.
import csv
import xml.etree.ElementTree as ET
import os

file = 'example.xml'
csvf = open(os.path.splitext(file)[0] + '.csv', "w+", newline='')
csvw = csv.writer(csvf, delimiter=',')

root = ET.parse(file).getroot()
name_base = root.find("name")
name_base_string = ET.tostring(name_base, encoding="unicode", method="xml").strip()
csvw.writerow([name_base_string])
csvf.close()
I do not know what encoding to pass to the tostring() method. If I use 'unicode' it returns a Unicode Python string and all is well when writing to the CSV file, but Excel seems to handle the result really poorly (all the editors I tried on Windows and Linux display the character sets properly). If I use encoding 'UTF-8' the method returns bytes, and if I pass that to the CSV writer without decoding I get the literal string b'stuff' in the CSV document.
Is there something I'm missing here? Does Excel just suck at handling certain encodings? I've read up on how UTF-8 is an encoding and Unicode is just a character set (that you can't really compare them) but I'm still confused.
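One direction that may help, following the BOM trick from the first answer in this thread and assuming Python 3: keep encoding="unicode" for tostring() and let the file object write the BOM by opening the CSV with utf-8-sig. A rough sketch using the same file and element names as above:

import csv
import os
import xml.etree.ElementTree as ET

file = 'example.xml'
# 'utf-8-sig' writes a UTF-8 BOM, which Excel uses to detect the encoding.
with open(os.path.splitext(file)[0] + '.csv', 'w', encoding='utf-8-sig', newline='') as csvf:
    csvw = csv.writer(csvf, delimiter=',')
    root = ET.parse(file).getroot()
    name_base = root.find("name")
    csvw.writerow([ET.tostring(name_base, encoding="unicode", method="xml").strip()])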
I have around 600,000 files encoded in ANSI and I want to convert them to UTF-8. I can do that individually in Notepad++, but I can't do that for 600,000 files. Can I do this in R or Python?
I have found this link, but the Python script does not run:
notepad++ converting ansi encoded file to utf-8
Why don't you read each file and write it back out as UTF-8? You can do that in Python.
# to support encodings
import codecs

# read the input file using the ANSI codepage
# ('mbcs' is the Windows ANSI codepage; 'cp1252' would also work on Western-European systems)
with codecs.open(path, 'r', encoding='mbcs') as source:
    text = source.read()

# write it back out as UTF-8
with codecs.open(path, 'w', encoding='utf8') as target:
    target.write(text)
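To cover all 600,000 files, you would wrap that in a loop over the directory tree. A rough sketch, where the root folder path is just a placeholder:

import codecs
import os

root_dir = r"C:\path\to\files"  # placeholder -- point this at your own folder

for dirpath, dirnames, filenames in os.walk(root_dir):
    for name in filenames:
        path = os.path.join(dirpath, name)
        # read as ANSI, then rewrite the same file in place as UTF-8
        with codecs.open(path, 'r', encoding='mbcs') as source:
            text = source.read()
        with codecs.open(path, 'w', encoding='utf8') as target:
            target.write(text)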
I appreciate that this is an old question, but having just resolved a similar problem recently I thought I would share my solution.
I had a file being prepared by one program that I needed to import into an sqlite3 database, but the text file was always 'ANSI' and sqlite3 requires UTF-8.
The ANSI encoding is recognised as 'mbcs' in Python, and so the code I used, adapted from something else I found, is:
import codecs

blockSize = 1048576  # copy in 1 MiB chunks
with codecs.open("your ANSI source file.txt", "r", encoding="mbcs") as sourceFile:
    with codecs.open("Your UTF-8 output file.txt", "w", encoding="UTF-8") as targetFile:
        while True:
            contents = sourceFile.read(blockSize)
            if not contents:
                break
            targetFile.write(contents)
The link below contains some information on the standard encodings that I found during my research:
https://docs.python.org/2.4/lib/standard-encodings.html
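As an aside, if you want to check which codepage 'mbcs' actually resolves to on a given Windows machine, the locale module will tell you:

import locale

# on a Western-European Windows install this typically prints 'cp1252'
print(locale.getpreferredencoding())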
I have a bunch of text files containing Korean characters with the wrong encodings. Specifically, it seems the characters are encoded in EUC-KR, but the files themselves were saved as UTF-8 with BOM.
So far I have managed to fix a file with the following steps:
1. Open the file with EditPlus (it shows the file's encoding is UTF8+BOM)
2. In EditPlus, save the file as ANSI
3. Lastly, in Python:
import codecs

with codecs.open(html, 'rb', encoding='euc-kr') as source_file:
    contents = source_file.read()

with open(html, 'w+b') as dest_file:
    dest_file.write(contents.encode('utf-8'))
I want to automate this, but I have not been able to do so. I can open the original file in Python:
codecs.open(html, 'rb', encoding='utf-8-sig')
However, I haven't been able to figure out how to do step 2.
I am presuming here that you have text already encoded to EUC-KR, then encoded again to UTF-8. If so, encoding to Latin 1 (what Windows calls ANSI) is indeed the best way to get back to the original EUC-KR bytestring.
Open the file as UTF8 with BOM, encode to Latin1, decode as EUC-KR:
import io
with io.open(html, encoding='utf-8-sig') as infh:
    data = infh.read().encode('latin1').decode('euc-kr')

with io.open(html, 'w', encoding='utf8') as outfh:
    outfh.write(data)
I'm using the io.open() function here instead of codecs as the more robust method; io is the new Python 3 library also backported to Python 2.
Demo:
>>> broken = '\xef\xbb\xbf\xc2\xb9\xc3\x8c\xc2\xbc\xc3\xba'
>>> print broken.decode('utf-8-sig').encode('latin1').decode('euc-kr')
미술
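The demo above uses Python 2 syntax; the same round trip in Python 3 looks like this (same bytes, written as a bytes literal):

broken = b'\xef\xbb\xbf\xc2\xb9\xc3\x8c\xc2\xbc\xc3\xba'
# strip the BOM, undo the bogus Latin-1 step, then decode the real EUC-KR bytes
print(broken.decode('utf-8-sig').encode('latin1').decode('euc-kr'))  # prints 미술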