Convert CSV to YAML, with Unicode? - python

I'm trying to convert a CSV file, containing Unicode strings, to a YAML file using Python 3.4.
Currently, the YAML parser escapes my Unicode text into an ASCII string. I want the YAML parser to export the Unicode string as a Unicode string, without the escape characters. I'm misunderstanding something here, of course, and I'd appreciate any assistance.
Bonus points: how might this be done with Python 2.7?
CSV input
id, title_english, title_russian
1, A Title in English, Название на русском
2, Another Title, Другой Название
current YAML output
- id: 1
  title_english: A Title in English
  title_russian: "\u041D\u0430\u0437\u0432\u0430\u043D\u0438\u0435 \u043D\u0430\
    \ \u0440\u0443\u0441\u0441\u043A\u043E\u043C"
- id: 2
  title_english: Another Title
  title_russian: "\u0414\u0440\u0443\u0433\u043E\u0439 \u041D\u0430\u0437\u0432\u0430\
    \u043D\u0438\u0435"
desired YAML output
- id: 1
  title_english: A Title in English
  title_russian: Название на русском
- id: 2
  title_english: Another Title
  title_russian: Другой Название
Python conversion code
import csv
import yaml

in_file = open('csv_file.csv', "r")
out_file = open('yaml_file.yaml', "w")

items = []

def convert_to_yaml(line, counter):
    item = {
        'id': counter,
        'title_english': line[0],
        'title_russian': line[1]
    }
    items.append(item)

try:
    reader = csv.reader(in_file)
    next(reader)  # skip headers
    for counter, line in enumerate(reader):
        convert_to_yaml(line, counter)
    out_file.write(yaml.dump(items, default_flow_style=False))
finally:
    in_file.close()
    out_file.close()
Thanks!

I ran into the same issue, and this is how I was able to resolve it based on your example above:
out_file.write(yaml.dump(items, default_flow_style=False, allow_unicode=True))
Including allow_unicode=True fixes the issue.
Also, specifically for Python 2, use safe_dump instead of dump to prevent !!python/unicode tags from being written alongside the Unicode text:
out_file.write(yaml.safe_dump(items, default_flow_style=False, allow_unicode=True))

In Python 2.x, you should use a Unicode-aware CSV reader, as Python 2's csv module doesn't support Unicode input. You can use unicodecsv for this purpose.
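A minimal sketch of that for Python 2 (assuming the unicodecsv package, which mirrors the csv API and decodes rows for you; the file name is taken from your example):
import unicodecsv

with open('csv_file.csv', 'rb') as f:  # binary mode; unicodecsv works on bytes
    reader = unicodecsv.reader(f, encoding='utf-8')  # rows come back as unicode strings
    next(reader)  # skip headers
    for row in reader:
        print row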
In your current Python 3.x code you should explicitly pass the file encoding when opening it:
import csv
with open('some.csv', newline='', encoding='utf-8') as f:
    reader = csv.reader(f)
    for row in reader:
        print(row)
It may be that your system is already doing the right thing but you're relying on defaults in that case.
Lastly, you need to make sure the YAML file is opened with the correct encoding: open("yaml_file.yaml", "w", encoding="utf-8"). And this encoding should be used later when reading the YAML file.
I'm not sure what the yaml library does when given Python objects but you also need to check that line[0] and line[1] are Unicode strings when you're setting them inside convert_to_yaml.
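Putting the pieces together with your Python 3 code, a corrected version might look roughly like this (a sketch, not tested against your actual file; adjust the column indexes to match your CSV):
import csv
import yaml

items = []

with open('csv_file.csv', newline='', encoding='utf-8') as in_file:
    reader = csv.reader(in_file)
    next(reader)  # skip headers
    for counter, line in enumerate(reader):
        items.append({
            'id': counter,
            'title_english': line[0],
            'title_russian': line[1]
        })

with open('yaml_file.yaml', 'w', encoding='utf-8') as out_file:
    yaml.dump(items, out_file, default_flow_style=False, allow_unicode=True)

# reading the YAML back later should use the same encoding:
with open('yaml_file.yaml', encoding='utf-8') as f:
    data = yaml.safe_load(f)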

Related

Python can't read CSV file downloaded from Azure DevOps (UTF-8)

I created an Azure DevOps query and chose 'download results as csv', which gave me a CSV file. If I open this CSV in VS Code, I can see in the bottom right corner it says UTF-8 with BOM.
I am trying to write a Python function that will read in each value of this CSV file. I cannot rely on parsing the text myself and splitting values on the comma character, because I will have values that include commas inside them.
If I open my CSV in Excel, everything is organized perfectly. But if I try to parse the file in Python, it reads in every row as a single string separated by commas (bad).
from csv import reader
import csv

# read in csv, convert to map organized by 'id' as index root parent value
def read_csv_as_map(csv_filename, id_format, encodingVar):
    print('filename: '+csv_filename+', id_format: '+id_format+', encoding: '+encodingVar)
    dict={}
    dict['rows']={}
    try:
        with open(csv_filename, 'r', encoding=encodingVar) as read_obj:
            csv_reader = reader(read_obj, delimiter='\t')
            csv_cols = None
            for row in csv_reader:
                print('row=',row)
            print('done')
            return dict
    except Exception as e:
        print('err=',e)
        return {}

ads_dict = read_csv_as_map(
    csv_filename="csv_migration\\ads-test-direct-download.csv",
    id_format='ID',
    encodingVar='utf-8-sig'
)
console output:
filename: csv_migration\ads-test-direct-download.csv, id_format: ID, encoding: utf-8-sig
row= ['Title,State,Work Item Type,ID,12NC']
row= ['TITLE,WITH COMMAS,To Do,NAME,6034,"value,with,commas"']
done
How can I read this file in Python so it separates each value into a list, instead of giving me this single string?
I get the same result with encodingVar='utf-8'. Should I open my CSV in some app like Notepad++ and convert it to UTF-16? My code works great for .csv files with UTF-16 encoding; it can parse each individual value into a list no problem. Why won't this work with a UTF-8 BOM CSV, even when Excel can parse the individual values perfectly fine?
csv file: https://file.io/TXh6uyXKZaug
from csv import reader
import csv

# read in csv, convert to map organized by 'id' as index root parent value
def read_csv_as_map(csv_filename, id_format, encodingVar):
    print('filename: '+csv_filename+', id_format: '+id_format+', encoding: '+encodingVar)
    dict={}
    dict['rows']={}
    try:
        with open(csv_filename, 'r', encoding=encodingVar) as read_obj:
            csv_reader = reader(read_obj, delimiter='\t')
            csv_cols = None
            for row in csv_reader:
                row_as_list = row[0].split(",")  # <-- gets the line as a list (row is a one-element list here)
                print('row=',row_as_list)
            print('done')
            return dict
    except Exception as e:
        print('err=',e)
        return {}

ads_dict = read_csv_as_map(
    csv_filename="csv_migration\\ads-test-direct-download.csv",
    id_format='ID',
    encodingVar='utf-8-sig'
)
This snippet splits the line into a list that you can index to get the information out
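For completeness, a different sketch (assuming the downloaded file is really comma-separated, as the console output suggests): dropping delimiter='\t' and letting csv.reader use its default comma delimiter makes it handle the quoted commas itself, so no manual splitting is needed.
from csv import reader

with open("csv_migration\\ads-test-direct-download.csv", 'r', encoding='utf-8-sig', newline='') as f:
    for row in reader(f):  # default delimiter=',' honours quoted fields
        print('row=', row)  # each row is already a list of separate values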

How to read binary file to text in python 3.9

I have a .sql file that I want to read into my Python session (Python 3.9). I'm opening it using the file context manager.
with open('file.sql', 'r') as f:
    text = f.read()
When I print the text, I still get the binary characters, i.e., \xff\xfe\r\x00\n\x00-\x00-..., etc.
I've tried all the arguments such as 'rb', encoding='utf-8', etc., but the result is still binary text. It should be noted that I've used this very same procedure many times over in my code before and this has not been a problem.
Did something change in python 3.9?
The first two bytes \xff\xfe look like a BOM (Byte Order Mark),
and the table on the Wikipedia page for BOM shows that \xff\xfe can mean the encoding UTF-16-LE.
So you could try:
with open('file.sql', 'r', encoding='utf-16-le') as f:
EDIT:
There is also the chardet module, which you can use to try to detect the encoding:
import chardet

with open('file.sql', 'rb') as f:  # read bytes
    data = f.read()

info = chardet.detect(data)
print(info['encoding'])

text = data.decode(info['encoding'])
Usually files don't have a BOM, but if they do, you can try to detect it using the example from unicodebook.readthedocs.io/guess_encoding/check-for-bom-markers:
from codecs import BOM_UTF8, BOM_UTF16_BE, BOM_UTF16_LE, BOM_UTF32_BE, BOM_UTF32_LE

BOMS = (
    (BOM_UTF8, "UTF-8"),
    (BOM_UTF32_BE, "UTF-32-BE"),
    (BOM_UTF32_LE, "UTF-32-LE"),
    (BOM_UTF16_BE, "UTF-16-BE"),
    (BOM_UTF16_LE, "UTF-16-LE"),
)

def check_bom(data):
    return [encoding for bom, encoding in BOMS if data.startswith(bom)]

# ---------

with open('file.sql', 'rb') as f:  # read bytes
    data = f.read()

encoding = check_bom(data)
print(encoding)

if encoding:
    text = data.decode(encoding[0])
else:
    print('unknown encoding')

Avoiding UnicodeEncodeError in python

I tried to parse an HTML table into CSV using Python with the following script:
from bs4 import BeautifulSoup
import requests
import csv

csvFile = open('log.csv', 'w', newline='')
writer = csv.writer(csvFile)

def parse():
    html = requests.get('https://en.wikipedia.org/wiki/Comparison_of_text_editors')
    bs = BeautifulSoup(html.text, 'lxml')
    table = bs.select_one('table.wikitable')
    rows = table.select('tr')
    for row in rows:
        csvRow = []
        for cell in row.findAll(['th', 'td']):
            csvRow.append(cell.getText())
        writer.writerow(csvRow)
        print(csvRow)

parse()
csvFile.close()
This code output a cleanly formatted CSV file with no encoding issues.
All was fine until the row for Enrico Tröger's Geany: my script was unable to write ö into the CSV file, so I tried this:
csvRow.append(cell.text.encode('ascii', 'replace')) instead of: csvRow.append(cell.getText())
All was fine, except that each table cell was now wrapped in b''. So, how can I get a cleanly formatted CSV file without encoding issues (like in the first screenshot), with all non-ASCII symbols replaced or ignored (like in the second screenshot), using my script?
Change this one:
csvFile = open('log.csv', 'w', newline='')
To this one:
csvFile = open('log.csv', 'w', newline='', encoding='utf8')
csv module documentation:
Since open() is used to open a CSV file for reading, the file will by default be decoded into unicode using the system default encoding (see locale.getpreferredencoding()). To decode a file using a different encoding, use the encoding argument of open:
import csv
with open('some.csv', newline='', encoding='utf-8') as f:
    reader = csv.reader(f)
    for row in reader:
        print(row)
The same applies to writing in something other than the system default encoding: specify the encoding argument when opening the output file.
I suppose your system default encoding is not utf8.
You can check it like this:
import locale
locale.getpreferredencoding()
Hope it helps!
Looks like the csv module expects strings, not bytes. So you could decode your bytes before passing them:
cell.text.encode('ascii', 'replace').decode('ascii')
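In the context of the parse() loop above, that would make the append look something like this (a sketch; non-ASCII characters end up as '?'):
csvRow.append(cell.text.encode('ascii', 'replace').decode('ascii'))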

Which encoding to use while reading Excel using xlrd

I am trying to read an Excel file using xlrd to write into txt files. Everything is being written fine except for some rows which have Spanish characters like 'Téd'. I can encode those using latin-1 encoding. However, the code then fails for other rows which contain an en dash ('–', u'\u2013'), and u'\u2013' can't be encoded using latin-1. When using UTF-8 the '–' is written out fine, but 'Téd' is written as 'TÃ©d', which is not acceptable. How do I correct this?
Code below :
#!/usr/bin/python
import xlrd
import csv
import sys
filePath = sys.argv[1]
with xlrd.open_workbook(filePath) as wb:
shNames = wb.sheet_names()
for shName in shNames:
sh = wb.sheet_by_name(shName)
csvFile = shName + ".csv"
with open(csvFile, 'wb') as f:
c = csv.writer(f)
for row in range(sh.nrows):
sh_row = []
cell = ''
for item in sh.row_values(row):
if isinstance(item, float):
cell=item
else:
cell=item.encode('utf-8')
sh_row.append(cell)
cell=''
c.writerow(sh_row)
print shName + ".csv File Created"
Python's csv module doesn't support Unicode input.
You are correctly encoding your input before writing it -- so you don't need codecs. Just open(csvFile, "wb") (the b is important) and pass that object to the writer:
with open(csvFile, "wb") as f:
    writer = csv.writer(f)
    writer.writerow([entry.encode("utf-8") for entry in row])
Alternatively, unicodecsv is a drop-in replacement for csv that handles encoding.
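For example (a sketch, assuming unicodecsv's writer takes the same encoding argument and reusing the names from the question's code), the writing loop could pass Unicode values straight through:
import unicodecsv

with open(csvFile, 'wb') as f:
    writer = unicodecsv.writer(f, encoding='utf-8')  # encodes each unicode cell on write
    for row in range(sh.nrows):
        writer.writerow(sh.row_values(row))  # no manual .encode('utf-8') needed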
You are getting Ã© instead of é because you are mistaking UTF-8 encoded text for latin-1. This is probably because you're encoding twice, once with .encode("utf-8") and once with codecs.open.
By the way, the right way to check the type of an xlrd cell is to do cell.ctype == xlrd.ONE_OF_THE_TYPES.
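By way of illustration, a sketch of that check (the constant xlrd.XL_CELL_TEXT is real, but the surrounding loop is just an assumed rework of the question's Python 2 code, reusing its sh and c variables):
for rowx in range(sh.nrows):
    sh_row = []
    for cell in sh.row(rowx):  # Cell objects carry both ctype and value
        if cell.ctype == xlrd.XL_CELL_TEXT:
            sh_row.append(cell.value.encode('utf-8'))  # text cells hold unicode in Python 2
        else:
            sh_row.append(cell.value)  # numbers, dates, empties pass through as-is
    c.writerow(sh_row)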

Python encoding conversion

I wrote a Python script that processes CSV files with non-ascii characters, encoded in UTF-8. However the encoding of the output is broken. So, from this in the input:
"d\xc4\x9bjin hornictv\xc3\xad"
I get this in the output:
"d\xe2\x99\xafjin hornictv\xc2\xa9\xc6\xaf"
Can you suggest where the encoding error might come from? Have you seen similar behaviour previously?
EDIT: I'm using the csv standard library with the UnicodeWriter class featured in the docs. I use Python version 2.6.6.
EDIT 2: The code to reproduce the behaviour:
#!/usr/bin/env python
#-*- coding:utf-8 -*-
import csv
from pymarc import MARCReader  # The pymarc package available on PyPI: http://pypi.python.org/pypi/pymarc/2.71
from UnicodeWriter import UnicodeWriter  # The UnicodeWriter from: http://docs.python.org/library/csv.html

def getRow(tag, record):
    if record[tag].is_control_field():
        row = [tag, record[tag].value()]
    else:
        row = [tag] + record[tag].subfields
    return row

inputFile = open("input.mrc", "r")
outputFile = open("output.csv", "wb")

reader = MARCReader(inputFile, to_unicode = True)
writer = UnicodeWriter(outputFile, delimiter = ",", quoting = csv.QUOTE_MINIMAL)

for record in reader:
    if bool(record["001"]):
        tags = [field.tag for field in record.get_fields()]
        tags.sort()
        for tag in tags:
            writer.writerow(getRow(tag, record))

inputFile.close()
outputFile.close()
The input data is available here (large file).
It seems adding force_utf8 = True argument to the MARCReader constructor solved the problem:
reader = MARCReader(inputFile, to_unicode = True, force_utf8 = True)
According to an inspection of the source code (via inspect), it does something like:
string.decode("utf-8", "strict")
You can try to open the file with UTF-8 encoding:
import codecs
codecs.open('myfile.txt', encoding='utf8')
