I have the code below:
stream = io.StringIO(csv_file.stream.read().decode('utf-8-sig'), newline=None)  # error is here
reader = csv.DictReader(stream)
list_of_entity = []
line_no, prev_len = 1, 0
for line in reader:
While executing the above code, I got the following error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa0 in position 252862: invalid start byte
To fix this, I then tried the following:
stream = io.StringIO(csv_file.stream.read().decode('unicode_escape'), newline=None)
reader = csv.DictReader(stream)
list_of_entity = []
line_no, prev_len = 1, 0
for line in reader:  # error is here
When I changed the decode to unicode_escape, it threw the error "_csv.Error: line contains NULL byte" at the highlighted comment line above.
There is a null byte present in the CSV, and I want to ignore or replace it. Can anyone help with this?
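One way to handle this, as a minimal sketch assuming the file is essentially UTF-8 with a few stray bytes (csv_file is the uploaded file object from the snippet above): decode with errors='replace' so bad bytes become U+FFFD instead of raising, then strip the NUL characters that the csv module cannot handle.

import csv
import io

raw = csv_file.stream.read()
# Replace undecodable bytes and drop NUL characters before parsing
text = raw.decode('utf-8-sig', errors='replace').replace('\x00', '')
stream = io.StringIO(text, newline=None)
reader = csv.DictReader(stream)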
I am trying to figure out this error that pops up from this code:
import os

filename = os.path.join(os.path.expanduser("~"), "data", "blogs",
                        "1005545.male.25.Engineering.Sagittarius.xml")
#filename = open('C:/Users/spenc/data/blogs/1005545.male.25.Engineering.Sagittarius.xml',
#                encoding='utf-8', errors = 'ignore')
all_posts = []
allPosts = []
with open(filename) as inf:
    postStart = False
    post = []
    for line in inf:
        line = line.strip()
        if line == "<post>":
            postStart = True
        elif line == "</post>":
            postStart = False
            allPosts.append("\n".join(post))
            post = []
        elif postStart:
            post.append(line)
print(allPosts[0])
print(len(allPosts))
filename.close()
and get this error:
File "D:\Anaconda-Python\lib\encodings\cp1252.py", line 23, in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 4836: character maps to <undefined> here
I am just trying to resolve the encoding error so I can find the number of posts and print the first one, but it keeps getting caught up on the allPosts.append line. I'm not sure of any workaround, or whether there is a newer way of doing this sort of thing. I was following a textbook and can't continue with the chapter until this is worked out.
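A minimal sketch of the usual fix, assuming the blog file is UTF-8 (the traceback shows Python falling back to the Windows default, cp1252): pass an explicit encoding to open(), optionally replacing any bytes that still fail to decode.

with open(filename, encoding='utf-8', errors='replace') as inf:
    for line in inf:
        ...  # same <post> parsing loop as above

The stray filename.close() at the end can also be dropped: filename is a string, and the with block already closes the file.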
I am making a program which is supposed to open a text file, then replace the letters 'æ, ø, and å' (Danish text) with 'ae, oe, aa'.
I need to run the program through the Mac terminal.
I tried using the replace() function, and tried writing:
# -*- coding: utf-8 -*-
#!/usr/bin/env python
in the beginning of the file.
But I keep getting this error:
File "replace.py", line 20, in replace_nonascii
word = word.replace('å', 'aa')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)
Any suggestions? I have been googling this for days and have no clue how to fix it.
Here is my program:
import codecs

filename_cont = {}

filepath = input('insert path for text')
with codecs.open(filepath, 'r', encoding='utf8') as file_object:
    filename_cont['text1'] = file_object.read()

def replace_nonascii(word):
    word = word.lower()
    word = word.replace('å', 'aa')
    word = word.replace('æ', 'ae')
    word = word.strip('/-.,?!')
    print(word)

for text in filename_cont:
    newtext = filename_cont[text]
    for word in newtext.split():
        replace_nonascii(word)
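The traceback ('ascii' codec, inside replace_nonascii) suggests the script is being run under Python 2, where the plain string literals are bytestrings. A minimal sketch of the fix under that assumption: use unicode literals so the replacement is unicode-to-unicode, and return the result (this also adds the ø replacement the description mentions but the code omits); under Python 3 the original literals already work.

# -*- coding: utf-8 -*-
def replace_nonascii(word):
    word = word.lower()
    word = word.replace(u'å', u'aa')
    word = word.replace(u'æ', u'ae')
    word = word.replace(u'ø', u'oe')  # ø was missing from the original
    return word.strip(u'/-.,?!')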
I have CSV files encoded as UTF-8. I need to convert them to an Excel workbook with the same encoding, but I am unable to do so. I have tried many things but cannot fix it. Here is the code snippet.
Note: I am using the xlsxwriter package.
import csv
import glob
import os

from xlsxwriter import Workbook

def csv_to_excel(input_file_path, output_file_path):
    file_path = input_file_path
    excel_file_path = output_file_path
    wb = Workbook(excel_file_path.encode('utf-8', 'ignore'), {'encoding': 'utf-8'})
    sheet1 = wb.add_worksheet(("anyname1").encode('utf-8', 'ignore'))
    sheet2 = wb.add_worksheet(("anyname2").encode('utf-8', 'ignore'))
    for filename in glob.glob(file_path):
        (f_path, f_name) = os.path.split(filename)
        w_tab = str(f_name.split('_')[2]).split('.')[0]
        if (w_tab == "anyname1"):
            w_sheet = sheet1
        elif (w_tab == "anyname2"):
            w_sheet = sheet2
        spamReader = csv.reader(open(filename, "rb"), delimiter=',', quotechar='"')
        row_count = 0
        for row in spamReader:
            for col in range(len(row)):
                w_sheet.write(row_count, col, row[col])
            row_count += 1
    try:
        os.remove(excel_file_path)
    except OSError:
        pass
    wb.close()
    print "Converted CSVs to Excel File"
Errors:
Case 1: When I try to open the UTF-8 encoded CSV file as follows:
spamReader = csv.reader(io.open(filename, "r", encoding = 'utf-8'), delimiter=',',quotechar='"')
I get this error while iterating over the spamReader object:
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2013' in position 92: ordinal not in range(128)
Case 2: When I open the same CSV file in binary mode, as in the code snippet above, I cannot save it as a UTF-8 encoded Excel file; calling wb.close() raises:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 12: ordinal not in range(128)
I have just started learning Python, so maybe this is not a big issue, but please help me with this.
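A minimal sketch of one way around both cases under Python 2, assuming the CSVs really are UTF-8: xlsxwriter handles unicode natively, so drop the .encode() calls, read the rows as bytes, and decode each cell yourself before writing. The file names below are hypothetical placeholders.

import csv
from xlsxwriter import Workbook

wb = Workbook(u'output.xlsx')          # hypothetical output path
sheet = wb.add_worksheet(u'anyname1')
with open('input.csv', 'rb') as f:     # hypothetical input path
    for row_num, row in enumerate(csv.reader(f, delimiter=',', quotechar='"')):
        for col_num, cell in enumerate(row):
            # Decode bytes to unicode; xlsxwriter encodes the workbook itself
            sheet.write(row_num, col_num, cell.decode('utf-8'))
wb.close()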
This is tx.sql:
DECLARE #Cnt INT,
#ParticipantID UNIQUEIDENTIFIER
SELECT ParticipantID INTO #ids
FROM dbo.rbd_Participants
/* sun */
WHERE surname='Пупкин'
This is the Python script:
import re

with open('tx.sql', 'r') as f:
    script = f.read().decode('utf8')
script = re.sub(r'/\*.*?\*/', '', script, flags=re.DOTALL)   # strip multiline comments
script = re.sub(r'--.*$', '', script, flags=re.MULTILINE)    # strip line comments
sql = []
do_execute = False
for line in script.split(u'\n'):
    line = line.strip()
    if not line:
        continue
    elif line.upper() == u'GO':
        do_execute = True
    else:
        sql.append(line)
        do_execute = line.endswith(u';')
        # print line
cur.execute(u'\n'.join(sql).encode('utf8'))
The problem line is script = f.read().decode('utf8'), which raises:
UnicodeDecodeError: 'utf8' codec can't decode byte 0xc0 in position 134: invalid start byte
I have tried
script = f.read().decode('cp1251')
and the decode succeeds, but then these lines
cur.execute(u'\n'.join(sql).encode('utf8'))
print (u'\n'.join(sql)).encode('utf8')
produce:
DECLARE #Cnt INT,
#ParticipantID UNIQUEIDENTIFIER
SELECT ParticipantID INTO #ids
FROM dbo.rbd_Participants
WHERE surname='РџСѓРїРєРёРЅ'
How do I get the correct line? Instead of
WHERE surname='РџСѓРїРєРёРЅ'
there must be the string
WHERE surname='Пупкин'
You are reading the data correctly. It is your print statement that is incorrect:
print (u'\n'.join(sql)).encode('utf8')
Your terminal or console doesn't support UTF-8, so it is showing you the wrong data. Don't encode; leave that to Python.
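A minimal sketch of that advice under Python 2: print the unicode string itself and let Python pick the terminal's encoding.

# Print the unicode text directly instead of calling .encode('utf8')
print u'\n'.join(sql)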
I want to write data to files where a row from a CSV should look like this list (directly from the Python console):
row = ['\xef\xbb\xbft_11651497', 'http://kozbeszerzes.ceu.hu/entity/t/11651497.xml', "Szabolcs Mag '98 Kft.", 'ny\xc3\xadregyh\xc3\xa1za', 'ny\xc3\xadregyh\xc3\xa1za', '4400', 't\xc3\xbcnde utca 20.', 47.935175, 21.744975, u'Ny\xedregyh\xe1za', u'Borb\xe1nya', u'Szabolcs-Szatm\xe1r-Bereg', u'Ny\xedregyh\xe1zai', u'20', u'T\xfcnde utca', u'Magyarorsz\xe1g', u'4405']
Py2k's csv module does not handle Unicode, but I had a UnicodeWriter wrapper:
import csv
import cStringIO, codecs

class UnicodeWriter:
    """
    A CSV writer which will write rows to CSV file "f",
    which is encoded in the given encoding.
    """

    def __init__(self, f, dialect=csv.excel, encoding="utf-8", **kwds):
        # Redirect output to a queue
        self.queue = cStringIO.StringIO()
        self.writer = csv.writer(self.queue, dialect=dialect, **kwds)
        self.stream = f
        self.encoder = codecs.getincrementalencoder(encoding)()

    def writerow(self, row):
        self.writer.writerow([unicode(s).encode("utf-8") for s in row])
        # Fetch UTF-8 output from the queue ...
        data = self.queue.getvalue()
        data = data.decode("utf-8")
        # ... and reencode it into the target encoding
        data = self.encoder.encode(data)
        # write to the target stream
        self.stream.write(data)
        # empty queue
        self.queue.truncate(0)

    def writerows(self, rows):
        for row in rows:
            self.writerow(row)
However, these lines still produce the dreaded encoding error message below:
f.write(codecs.BOM_UTF8)
writer = UnicodeWriter(f)
writer.writerow(row)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xfc' in position 9: ordinal not in range(128)
What is there to do? Thanks!
You are passing bytestrings containing non-ASCII data in, and these are being decoded to Unicode using the default codec at this line:
self.writer.writerow([unicode(s).encode("utf-8") for s in row])
unicode(bytestring) with data that cannot be decoded as ASCII fails:
>>> unicode('\xef\xbb\xbft_11651497')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position 0: ordinal not in range(128)
Decode the data to Unicode before passing it to the writer:
row = [v.decode('utf8') if isinstance(v, str) else v for v in row]
This assumes that your bytestring values contain UTF-8 data. If you have a mix of encodings, decode to Unicode at the point of origin, where your program first sourced the data. You really want to do that anyway, regardless of where the data came from or whether it was already encoded as UTF-8.
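Putting it together, a minimal sketch of the corrected write path, with row and f as in the question:

# Decode bytestring cells to unicode first; values that are already
# unicode pass through untouched.
row = [v.decode('utf8') if isinstance(v, str) else v for v in row]
f.write(codecs.BOM_UTF8)
writer = UnicodeWriter(f)
writer.writerow(row)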