Python Special Characters Encoding

I have a Python script that reads a CSV file and writes an XML file. I have been hitting a wall trying to find out how to read special characters such as ç, á, é, í, etc. The script runs perfectly fine without special characters. This is the script header:
# coding=utf-8
'''
#modified by: Julierme Pinheiro
'''
import os
import sys
import unittest
from unittest import skip
import csv
import uuid
import xml
import xml.dom.minidom as minidom
import owslib
from owslib.iso import *
import pyproj
from decimal import *
import logging
The way I retrieve information from the CSV file is shown below:
# add the title
title = data[1]
titleElement = identificationInfo[0].getElementsByTagName('gmd:title')[0]
titleNode = record.createTextNode(title)
titleElement.childNodes[1].appendChild(titleNode)
print "Title:" + title
Note: if data[1], the second column in the CSV file, contains a special character, as in "Navegação", the script fails (it does not write anything to the XML file).
The way a new XML file is created from a blank template XML is shown below:
# write out the gemini record
filename = '../output/%s.xml' % fileId
with open(filename, 'w') as test_xml:
    test_xml.write(record.toprettyxml(newl="", encoding="utf-8"))
except:
    e = sys.exc_info()[1]
    logging.debug("Import failed for entry %s" % data[0])
    logging.debug("Specific error: %s" % e)
    #skip('')
def testOWSMetadataImport(self):
    raw_data = []
    with open('../input/metadata_cartapapel.csv') as csvfile:
        reader = csv.reader(csvfile, dialect='excel')
        for columns in reader:
            raw_data.append(columns)
    md = MD_Metadata(etree.parse('gemini-template.xml'))
    md.identification.topiccategory = ['farming', 'environment']
    print md.identification.topiccategory
    outfile = open('mdtest.xml', 'w')
    # crap, can't update the model and write back out - this is badly needed!!
    outfile.write(md.xml)

if __name__ == "__main__":
    unittest.main()
Could someone help me solve this issue, please?
Thank you in advance for your time.

Those are Unicode (non-ASCII) characters. The csv module can't read Unicode in Python 2.7; in Python 3.x you can pass the encoding option when opening the file.
In Python 2 you can decode data[1] from UTF-8 like below:
title = data[1].decode('utf-8')
Some legacy Windows components in English locales may produce 'cp1252' instead. If the above decoding fails, try this:
title = data[1].decode('cp1252')
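If you are on Python 3, as the first paragraph notes, the decoding can instead happen when the file is opened. A minimal sketch, reusing the file name and column index from the question (an illustration only, not the asker's actual script):
import csv

with open('../input/metadata_cartapapel.csv', encoding='utf-8', newline='') as csvfile:
    for columns in csv.reader(csvfile, dialect='excel'):
        title = columns[1]  # already a str; no manual .decode() needed
        print("Title:" + title)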

Related

Writing to SSol.txt was successful, but how can I read SSol.txt?

I am using Python 3.6.1 and am self-taught in Python. I am unable to understand what this error is.
Here is my code:
from tkinter import *
from tkinter import ttk
from tkinter import messagebox
import urllib.request as request

def a():
    url = ("https://rp5.ru/%EC%95%88%EB%8F%99%EC%9D%98_%EB%82%A0%EC%94%A8,_%EA%B2%BD%EB%B6%81")
    raw_data = request.urlopen(url).read()  # bytes
    text = raw_data.decode("utf-8")
    where = text.find('k;">')
    start_where = where + 4
    end_start = where + 7
    f = open("SSSol.txt", 'w+')
    decoded = int(text[start_where:end_start])
    k = f.write(str(decoded))
    t = str(f.readline())
    messagebox.showinfo("hello", t)

ttk.Button(win, text="?", command=a).grid()
No error occurred, but nothing was output.
The problem is that you're reusing the f handle after having written to it.
f = open("SSSol.txt", 'w+')
decoded = int(text[start_where:end_start])
k = f.write(str(decoded))
That part is OK. Now:
t = str(f.readline())
you're reading from the end of the file, so t is an empty string. So either:
do f.seek(0) first to rewind the file, or
close f and open it again, this time in 'r' mode (read-only), which would be safer anyway.
(That said, I suppose this is not your real code, as writing and then reading the same file in the same method is rather inefficient when you already have the buffer handy.)
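A minimal sketch of the seek-based fix, reusing the names from the question (Python 3):
f = open("SSSol.txt", 'w+')
decoded = int(text[start_where:end_start])
f.write(str(decoded))
f.seek(0)           # rewind to the beginning of the file
t = f.readline()    # now reads back what was just written
f.close()
messagebox.showinfo("hello", t)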

Unable to Save Arabic Decoded Unicode to CSV File Using Python

I am working with a Twitter streaming package for Python. I am currently using a keyword written in Unicode to search for tweets containing that word. I am then using Python to create a CSV database of the tweets. However, I want to convert the tweets back to Arabic symbols when I save them in the CSV.
The errors I am receiving (from on_data) are all similar to: "the ASCII characters in position ___ are not within the range of 128."
Here is my code:
class listener(StreamListener):
    def on_data(self, data):
        try:
            #print data
            tweet = (str((data.split(',"text":"')[1].split('","source')[0]))).encode('utf-8')
            now = datetime.now()
            tweetsymbols = tweet.encode('utf-8')
            print tweetsymbols
            saveThis = str(now) + ':::' + tweetsymbols.decode('utf-8')
            saveFile = open('rawtwitterdata.csv', 'a')
            saveFile.write(saveThis)
            saveFile.write('\n')
            saveFile.close()
            return True
Excel requires a Unicode BOM character written to the beginning of a UTF-8 file to view it properly. Without it, Excel assumes "ANSI" encoding, which is OS locale-dependent.
This writes a 3-row, 3-column CSV file with Arabic:
#!python2
#coding:utf8
import io

with io.open('arabic.csv', 'w', encoding='utf-8-sig') as f:
    s = u'إعلان يونيو وبالرغم تم. المتحدة'
    s = u','.join([s, s, s]) + u'\n'
    f.write(s)
    f.write(s)
    f.write(s)
For your specific example, just make sure to write a BOM character u'\ufeff' as the first character of your file, encoded in UTF-8. In the example above, the 'utf-8-sig' codec ensures a BOM is written.
Also consult this answer, which shows how to wrap the csv module to support Unicode, or get the third party unicodecsv module.
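If you prefer not to use io.open, a minimal sketch of writing the BOM by hand (Python 2; the file name is just an example):
import codecs

with open('arabic.csv', 'wb') as f:
    f.write(codecs.BOM_UTF8)  # b'\xef\xbb\xbf', exactly what 'utf-8-sig' emits for you
    f.write(u'إعلان,يونيو,وبالرغم\n'.encode('utf-8'))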
Here is a snippet that writes Arabic to a text file:
# coding=utf-8
import codecs
from datetime import datetime

class listener(object):
    def on_data(self, tweetsymbols):
        # python2
        # tweetsymbols is str
        # tweet = (str((data.split(',"text":"')[1].split('","source')[0]))).encode('utf-8')
        now = datetime.now()
        # work with unicode
        saveThis = unicode(now) + ':::' + tweetsymbols.decode('utf-8')
        try:
            saveFile = codecs.open('rawtwitterdata.csv', 'a', encoding="utf8")
            saveFile.write(saveThis)
            saveFile.write('\n')
        finally:
            saveFile.close()
        return self

listener().on_data("إعلان يونيو وبالرغم تم. المتحدة")
Everything you need to know about encoding: https://pythonhosted.org/kitchen/unicode-frustrations.html

DBF - encoding cp1250

I have a DBF database encoded in cp1250 and I am reading it using the following code:
import csv
from dbfpy import dbf
import os
import sys

filename = sys.argv[1]
if filename.endswith('.dbf'):
    print "Converting %s to csv" % filename
    csv_fn = filename[:-4] + ".csv"
    with open(csv_fn, 'wb') as csvfile:
        in_db = dbf.Dbf(filename)
        out_csv = csv.writer(csvfile)
        names = []
        for field in in_db.header.fields:
            names.append(field.name)
        #out_csv.writerow(names)
        for rec in in_db:
            out_csv.writerow(rec.fieldData)
        in_db.close()
    print "Done..."
else:
    print "Filename does not end with .dbf"
The problem is that the final CSV file is wrong: its encoding is ANSI and some characters are corrupted. Could you help me read the DBF file correctly?
EDIT 1
I tried different code from https://pypi.python.org/pypi/simpledbf/0.2.4, but there is an error.
Source 2:
from simpledbf import Dbf5
import os
import sys

dbf = Dbf5('test.dbf', codec='cp1250')
dbf.to_csv('junk.csv')
Output:
python program2.py
Traceback (most recent call last):
  File "program2.py", line 5, in <module>
    dbf = Dbf5('test.dbf', codec='cp1250')
  File "D:\ProgramFiles\Anaconda\lib\site-packages\simpledbf\simpledbf.py", line 557, in __init__
    assert terminator == b'\r'
AssertionError
I really don't know how to solve this problem.
Try using my dbf library:
import dbf

with dbf.Table('test.dbf') as table:
    dbf.export(table, 'junk.csv')
I wrote simpledbf. The line that is causing you problems was from some testing I was doing when developing the module. First of all, you might want to update your installation, as 0.2.6 is the most recent. Then you can try removing that particular line (#557) from the file "D:\ProgramFiles\Anaconda\lib\site-packages\simpledbf\simpledbf.py". If that doesn't work, you can ping me at the GitHub repo for simpledbf, or you could try Ethan's suggestion for the dbf module.
You can decode and encode as necessary. dbfpy assumes strings are utf8 encoded, so you can decode from utf8 and then encode again with the right encoding.
import csv
from dbfpy import dbf
import os
import sys

filename = sys.argv[1]
if filename.endswith('.dbf'):
    print "Converting %s to csv" % filename
    csv_fn = filename[:-4] + ".csv"
    with open(csv_fn, 'wb') as csvfile:
        in_db = dbf.Dbf(filename)
        out_csv = csv.writer(csvfile)
        names = []
        for field in in_db.header.fields:
            names.append(field.name)
        #out_csv.writerow(names)
        for rec in in_db:
            # decode from utf8, then re-encode in the target encoding
            row = [i.decode('utf8').encode('cp1250') if isinstance(i, str) else i for i in rec.fieldData]
            out_csv.writerow(row)  # write the re-encoded row, not the raw fieldData
        in_db.close()
    print "Done..."
else:
    print "Filename does not end with .dbf"

Python urllib-html parse

A question about parsing a web site:
My code:
#!/usr/bin/python
# -*- coding: utf-8 -*-
import sys
import os
import urllib2
import re
# Parse Web
from lxml import html
import requests

def parse():
    try:
        output = open('proba.xml', 'w')
        page = requests.get('http://www.rts.rs/page/tv/sr/broadcast/22/RTS+1.html')
        tree = html.fromstring(page.text)
        parse = tree.xpath('//div[@class="ProgramTime"]/text()|//div[@class="ProgramName"]/text()|//a[@class="recnik"]/text()')
        for line in parse:
            clean = line.strip()
            if clean:
                print clean
    except:
        pass

parse()
My question is: how can I write this result to a file? When I try this:
print >> output, line
I get only the first 6 lines in the file.
With this code:
output.write(line)
the same thing happens, so can you help me with this issue? What I want is to output the parsed content.
I am having trouble replicating the problem. Here is what I did...
import sys
import os
import urllib2
import re
from lxml import html
import requests

def parse():
    output = open('proba.xml', 'w')
    page = requests.get('http://www.rts.rs/page/tv/sr/broadcast/22/RTS+1.html')
    tree = html.fromstring(page.text)
    p = tree.xpath('//div[@class="ProgramTime"]/text()|//div[@class="ProgramName"]/text()|//a[@class="recnik"]/text()')
    for line in p:
        clean = line.strip()
        if clean:
            output.write(line.encode('utf-8') + '\n')  # the \n adds a line break
    output.close()

parse()
I think you are getting a Unicode-related error when writing to the file, but because you put everything in a try block and let the error pass silently, you aren't getting any feedback!
Try typing import this in a terminal. You will get the Zen of Python. One aphorism is "Errors should never pass silently."
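As an illustration of that aphorism, a minimal sketch of the same fetch with the error surfaced instead of swallowed (Python 2, reusing requests and lxml from the question):
import requests
from lxml import html

def parse():
    try:
        page = requests.get('http://www.rts.rs/page/tv/sr/broadcast/22/RTS+1.html')
        tree = html.fromstring(page.text)
        return tree.xpath('//div[@class="ProgramTime"]/text()')
    except Exception as e:
        print "parse failed: %r" % e  # report the real cause
        raise                         # then re-raise instead of passing silently

parse()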
Try this instead:
with open('proba.xml', 'w') as f:
    f.writelines([line.strip() + '\n' for line in parse])  # writelines does not add newlines itself
Put this in place of the for line in parse: clean = * loop and remove the output = * declaration above; there is no need for output.write anymore. Sorry if I am not clearer; I am typing this on a mobile phone.

Python encoding conversion

I wrote a Python script that processes CSV files with non-ASCII characters, encoded in UTF-8. However, the encoding of the output is broken. So, from this in the input:
"d\xc4\x9bjin hornictv\xc3\xad"
I get this in the output:
"d\xe2\x99\xafjin hornictv\xc2\xa9\xc6\xaf"
Can you suggest where the encoding error might come from? Have you seen similar behaviour previously?
EDIT: I'm using the csv standard library with the UnicodeWriter class featured in the docs, on Python 2.6.6.
EDIT 2: The code to reproduce the behaviour:
#!/usr/bin/env python
#-*- coding:utf-8 -*-
import csv
from pymarc import MARCReader  # the pymarc package on PyPI: http://pypi.python.org/pypi/pymarc/2.71
from UnicodeWriter import UnicodeWriter  # the UnicodeWriter from: http://docs.python.org/library/csv.html

def getRow(tag, record):
    if record[tag].is_control_field():
        row = [tag, record[tag].value()]
    else:
        row = [tag] + record[tag].subfields
    return row

inputFile = open("input.mrc", "r")
outputFile = open("output.csv", "wb")
reader = MARCReader(inputFile, to_unicode=True)
writer = UnicodeWriter(outputFile, delimiter=",", quoting=csv.QUOTE_MINIMAL)

for record in reader:
    if bool(record["001"]):
        tags = [field.tag for field in record.get_fields()]
        tags.sort()
        for tag in tags:
            writer.writerow(getRow(tag, record))

inputFile.close()
outputFile.close()
The input data is available here (large file).
It seems that adding the force_utf8=True argument to the MARCReader constructor solved the problem:
reader = MARCReader(inputFile, to_unicode=True, force_utf8=True)
Inspecting the source code (via the inspect module) shows that it does something like:
string.decode("utf-8", "strict")
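For context, 'strict' is one of several codec error handlers; a minimal sketch of how they differ (Python 2, sample bytes only):
print 'd\xc4\x9bjin'.decode('utf-8', 'strict')   # u'd\u011bjin'; raises UnicodeDecodeError on bad bytes
print 'd\xc4'.decode('utf-8', 'replace')         # u'd\ufffd'; bad bytes become U+FFFD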
You can try to open the file with UTF-8 encoding:
import codecs
codecs.open('myfile.txt', encoding='utf8')
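A minimal usage sketch (Python 2; myfile.txt is the placeholder from the snippet above):
import codecs

with codecs.open('myfile.txt', encoding='utf8') as f:
    for line in f:
        print repr(line)  # each line is already a unicode object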
