Converting .arff file to .csv using Python - python

I have a file "LMD.rh.arff" which I am trying to convert to a .csv file using the following code:
import pandas as pd
import matplotlib.pyplot as plt
from scipy.io import arff
# Read in .arff file-
data = arff.loadarff("LMD.rh.arff")
But this last line of code gives me the error:
---------------------------------------------------------------------------
UnicodeEncodeError                        Traceback (most recent call last)
----> 1 data = arff.loadarff("LMD.rh.arff")

~/.local/lib/python3.6/site-packages/scipy/io/arff/arffread.py in loadarff(f)
    539     ofile = open(f, 'rt')
    540     try:
--> 541         return _loadarff(ofile)
    542     finally:
    543         if ofile is not f:  # only close what we opened

~/.local/lib/python3.6/site-packages/scipy/io/arff/arffread.py in _loadarff(ofile)
    627     a = generator(ofile)
    628     # No error should happen here: it is a bug otherwise
--> 629     data = np.fromiter(a, descr)
    630     return data, meta
    631

UnicodeEncodeError: 'ascii' codec can't encode character '\xf3' in position 4: ordinal not in range(128)
You can download the file arff_file
Any ideas as to what's going wrong?
Thanks!

Try this (note: ARFF headers use `@attribute` and `@data`, and `os` must be imported):

import os

path_to_directory = "./"
files = [f for f in os.listdir(path_to_directory) if f.endswith(".arff")]

def toCsv(content):
    data = False
    header = ""
    newContent = []
    for line in content:
        if not data:
            if "@attribute" in line:
                attri = line.split()
                columnName = attri[attri.index("@attribute") + 1]
                header = header + columnName + ","
            elif "@data" in line:
                data = True
                header = header[:-1]
                header += '\n'
                newContent.append(header)
        else:
            newContent.append(line)
    return newContent

# Main loop for reading and writing files
for file in files:
    with open(path_to_directory + file, "r") as inFile:
        content = inFile.readlines()
        name, ext = os.path.splitext(inFile.name)
        new = toCsv(content)
        with open(name + ".csv", "w") as outFile:
            outFile.writelines(new)

Take a look at the error trace:

UnicodeEncodeError: 'ascii' codec can't encode character '\xf3' in position 4: ordinal not in range(128)

The error suggests you have an encoding problem with the file. Consider first opening the file with the correct encoding yourself and then passing the open file object to the arff loader:

import codecs
import arff

file_ = codecs.open('LMD.rh.arff', 'r', encoding='utf-8')  # or whatever encoding your file uses
arff.load(file_)  # now this should be fine

For reference see here
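For files that load cleanly, the scipy-plus-pandas route is a short way to get a CSV. A minimal sketch follows; the file name, relation, and attribute names are stand-ins written by the demo itself, not taken from the question's data:

```python
import pandas as pd
from scipy.io import arff

# Stand-in for LMD.rh.arff: a tiny ARFF file written just for this demo.
with open("sample.arff", "w", encoding="utf-8") as f:
    f.write("@relation demo\n"
            "@attribute width numeric\n"
            "@attribute class {yes,no}\n"
            "@data\n"
            "1.5,yes\n"
            "2.0,no\n")

# loadarff returns a (records, metadata) pair.
data, meta = arff.loadarff("sample.arff")
df = pd.DataFrame(data)

# loadarff returns nominal attributes as bytes; decode them before writing.
for col in df.select_dtypes([object]).columns:
    df[col] = df[col].str.decode("utf-8")

df.to_csv("sample.csv", index=False)
print(df.shape)  # (2, 2)
```

This keeps the numeric columns typed and only writes the CSV once the data is fully decoded.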

Related

Encoding UTF-8 issue when trying to read a JSON file

I got the error shown below when trying to read a JSON file that should be UTF-8 encoded. Does anyone know how I can resolve this issue?
reviews = pd.read_csv('reviews.csv', nrows=1000)
businesses = pd.read_csv('businesses.csv', nrows=1000)

checkins = []
with open('checkins.json', encoding='utf-8') as f:
    for row in f.readlines()[:1000]:
        checkins.append(json.loads(row))
---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-10-4f54896faeca> in <module>
      3 checkins = []
      4 with open('checkins.json', encoding='utf-8') as f:
----> 5     for row in f.readlines()[:1000]:
      6         checkins.append(json.loads(row))

~\Anaconda3\lib\codecs.py in decode(self, input, final)
    320     # decode input (taking the buffer into account)
    321     data = self.buffer + input
--> 322     (result, consumed) = self._buffer_decode(data, self.errors, final)
    323     # keep undecoded input until the next call
    324     self.buffer = data[consumed:]

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xda in position 37: invalid continuation byte
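The invalid byte 0xda suggests the file is not really UTF-8 (it is a valid Latin-1/cp1252 byte). One hedged approach is to try UTF-8 first and fall back to Latin-1, which accepts every byte value. The helper below is hypothetical, and the demo file stands in for checkins.json:

```python
import json

def read_json_lines(path, limit=1000):
    # Try UTF-8 first; fall back to Latin-1, which never raises on decode.
    for encoding in ("utf-8", "latin-1"):
        try:
            with open(path, encoding=encoding) as f:
                return [json.loads(row) for row in f.readlines()[:limit]]
        except UnicodeDecodeError:
            continue
    raise ValueError("could not decode " + path)

# Demo with a stand-in file containing a Latin-1 byte (0xda = 'Ú').
with open("checkins_demo.json", "wb") as f:
    f.write('{"name": "Úrsula"}\n'.encode("latin-1"))

rows = read_json_lines("checkins_demo.json")
print(rows[0]["name"])  # Úrsula
```

If the fallback fires, it is worth confirming what the file's real encoding is rather than assuming Latin-1 silently produced the right characters.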

Converting JSON file to CSV file

I am trying to convert a JSON file into a CSV file. My code is below. However, I keep getting this error:

Traceback (most recent call last):
  File "C:\Users\...\PythonParse.py", line 42, in <module>
    writer.writerow(data)
  File "C:\Documents and Settings\...\Python37\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode characters in position 38409-38412: character maps to <undefined>
import json
import gzip
import csv

outfile = open("VideoGamesMeta.csv", "w")
writer = csv.writer(outfile)

data = []
items = []
names = []
checkItems = False
checkUsers = False
numItems = []
numUsers = []

for line in open("meta_Video_Games.json", "r", encoding="utf-8"):
    results = json.loads(line)
    if 'title' in results:
        if 'asin' in results:
            name = results['title']
            item = results['asin']
            data = [item, name]
            writer.writerow(data)
            items.append(item)
            names.append(name)
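The traceback points at cp1252.py: the output file was opened with the Windows default codec, so any character outside cp1252 fails on write. A hedged sketch of the likely fix, with stand-in file names and row data, is to open the output with an explicit UTF-8 encoding (and newline="", as the csv module docs recommend):

```python
import csv

# Open the output with an explicit encoding so non-cp1252 characters
# (the star below) can be written without a charmap error.
with open("demo_out.csv", "w", encoding="utf-8", newline="") as outfile:
    writer = csv.writer(outfile)
    writer.writerow(["B00000XYZ", "Pokémon Snap ★"])

with open("demo_out.csv", encoding="utf-8") as f:
    content = f.read().strip()
print(content)  # B00000XYZ,Pokémon Snap ★
```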

UnicodeDecodeError when trying to read docx file

An error occurs when opening a docx file using Python 3.
When I tried to run:
file = open("jinuj.docx", "r", encoding="utf-8").read()
the error below occurred:
    319     # decode input (taking the buffer into account)
    320     data = self.buffer + input
--> 321     (result, consumed) = self._buffer_decode(data, self.errors, final)
    322     # keep undecoded input until the next call
    323     self.buffer = data[consumed:]

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb6 in position 11: invalid start byte
python-docx can open a document from a so-called file-like object (it can also save to one). A .docx file is a zip archive, not text, so it must be opened in binary mode and wrapped in a bytes stream, not a StringIO:

from docx import Document

f = open('jinuj.docx', 'rb')
document = Document(f)
f.close()

OR

from io import BytesIO

with open('jinuj.docx', 'rb') as f:
    source_stream = BytesIO(f.read())
document = Document(source_stream)
source_stream.close()

Docs

Unicode error ascii can't encode character

I am trying to import a CSV file in order to train my classifier, but I keep receiving this error:

Traceback (most recent call last):
  File "updateClassif.py", line 17, in <module>
    myClassif = NaiveBayesClassifier(fp, format="csv")
  File "C:\Python27\lib\site-packages\textblob\classifiers.py", line 191, in __init__
    super(NLTKClassifier, self).__init__(train_set, feature_extractor, format, **kwargs)
  File "C:\Python27\lib\site-packages\textblob\classifiers.py", line 123, in __init__
    self.train_set = self._read_data(train_set, format)
  File "C:\Python27\lib\site-packages\textblob\classifiers.py", line 143, in _read_data
    return format_class(dataset, **self.format_kwargs).to_iterable()
  File "C:\Python27\lib\site-packages\textblob\formats.py", line 68, in __init__
    self.data = [row for row in reader]
  File "C:\Python27\lib\site-packages\textblob\unicodecsv\__init__.py", line 106, in next
    row = self.reader.next()
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe6' in position 55: ordinal not in range(128)
The CSV file contains 1,600,000 lines of tweets, so I believe some tweets contain special characters. I tried re-saving it with OpenOffice, as someone recommended, but got the same result. I also tried Latin-1 encoding, with the same result.
This is my code :
with codecs.open('tr.csv', 'r', encoding='latin-1') as fp:
    myClassif = NaiveBayesClassifier(fp, format="csv")
This is the code from the library I am using:
def __init__(self, csvfile, fieldnames=None, restkey=None, restval=None,
             dialect='excel', encoding='utf-8', errors='strict', *args,
             **kwds):
    if fieldnames is not None:
        fieldnames = _stringify_list(fieldnames, encoding)
    csv.DictReader.__init__(self, csvfile, fieldnames, restkey, restval, dialect, *args, **kwds)
    self.reader = UnicodeReader(csvfile, dialect, encoding=encoding,
                                errors=errors, *args, **kwds)
    if fieldnames is None and not hasattr(csv.DictReader, 'fieldnames'):
        # Python 2.5 fieldnames workaround (http://bugs.python.org/issue3436)
        reader = UnicodeReader(csvfile, dialect, encoding=encoding, *args, **kwds)
        self.fieldnames = _stringify_list(reader.next(), reader.encoding)
    self.unicode_fieldnames = [_unicodify(f, encoding) for f in
                               self.fieldnames]
    self.unicode_restkey = _unicodify(restkey, encoding)

def next(self):
    row = csv.DictReader.next(self)
    result = dict((uni_key, row[str_key]) for (str_key, uni_key) in
                  izip(self.fieldnames, self.unicode_fieldnames))
    rest = row.get(self.restkey)
In Python 2, the csv module does not support Unicode, so you must pass in an iterator (such as a file opened in binary mode) that only produces byte strings.
This means your code should look like this:

with open('tr.csv', 'rb') as fp:
    myClassif = NaiveBayesClassifier(fp, format="csv")
But note that the csv file must be encoded as UTF-8. If it's not, you will obviously need to convert it to UTF-8 first, in order for the code above to work.
Note that the traceback says EncodeError, not DecodeError. It looks like the NaiveBayesClassifier is expecting ascii. Either make it accept Unicode, or, if this is OK for your application, replace non-ascii characters with '?' or something.
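The re-encoding step mentioned above can be sketched as follows: read the file with its actual encoding and rewrite it as UTF-8. The file names are stand-ins, and Latin-1 as the source encoding is an assumption to verify against the real data:

```python
# Stand-in for tr.csv: one row containing a Latin-1 byte (0xe9 = 'é').
with open("tr_demo.csv", "wb") as f:
    f.write("pos,I love caf\xe9s\n".encode("latin-1"))

# Decode with the source encoding, re-encode as UTF-8.
with open("tr_demo.csv", "r", encoding="latin-1") as src:
    text = src.read()
with open("tr_utf8.csv", "w", encoding="utf-8") as dst:
    dst.write(text)

with open("tr_utf8.csv", "rb") as f:
    out = f.read()
print(out)  # b'pos,I love caf\xc3\xa9s\n'
```

After this, the rewritten file can be opened in binary mode and passed to the classifier as the answer suggests.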

how to write my terminal in a text file using python

Part of my code is below. I want to write my terminal output to a text file, but I get the error below:
---------------------------------------------------------------------------
UnicodeEncodeError                        Traceback (most recent call last)
<ipython-input-2-c7d647fa741c> in <module>()
     34 text_file = open("Output.txt", "w")
     35
---> 36 text_file.write(data)
     37 #print (data)

UnicodeEncodeError: 'ascii' codec can't encode characters in position 150-151: ordinal not in range(128)
# data is multi line text
data = ''.join(soup1.findAll('p', text=True))
text_file = open("Output.txt", "w")
text_file.write(data)
# print (data)
Either encode the text yourself and write bytes to a file opened in binary mode, or (in Python 3) open the file with an explicit encoding before you write:

text_file = open("Output.txt", "w", encoding="utf-8")
text_file.write(data)
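A minimal round-trip sketch of that fix; soup1 is not shown in the question, so a plain string with non-ASCII characters stands in for the scraped text:

```python
# Stand-in for the scraped paragraph text.
data = "café ±40°C"

# Writing with an explicit encoding avoids the ascii codec entirely.
with open("Output_demo.txt", "w", encoding="utf-8") as text_file:
    text_file.write(data)

with open("Output_demo.txt", encoding="utf-8") as f:
    text_back = f.read()
print(text_back)  # café ±40°C
```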
