DecodeError in Paramiko Remote File - python

I have a large remote file that is generated automatically each day. I have no control over how the file is generated. I'm using Paramiko to open the file and then search through it to find if a given line matches a line in the file.
However, I'm receiving the following error:
UnicodeDecodeError: 'utf8' codec can't decode byte 0x96 in position 57: invalid start byte
My code:
self.ssh = paramiko.SSHClient()
self.ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
self.ssh.connect(host, username=user, password=password)
self.sftp_client = self.ssh.open_sftp()
self.remote_file = self.sftp_client.open(filepath, mode='r')

def checkPhrase(self, phrase):
    found = 0
    self.remote_file.seek(0)
    for line in self.remote_file:
        if phrase in line:
            found = 1
            break
    return found
I'm receiving the error at the line: for line in self.remote_file: Obviously there is a character in the file that is out of the range for utf8.
Is there a way to re-encode the line as it's read or to simply ignore the error?

Files are just bytes; they may or may not be in any particular encoding. The Paramiko documentation says the 'b' flag is ignored because SSH treats all files as binary, but your traceback shows that the lines are still being decoded as UTF-8 when you iterate over a file opened with mode='r'.
Instead, decode each line yourself. Open the file in binary mode (mode='rb') so that iteration yields bytes, read a line, then try to decode that line as UTF-8. If that fails, skip the line.
def checkPhrase(self, phrase):
    self.remote_file.seek(0)
    for line in self.remote_file:
        try:
            decoded_line = line.decode('utf-8')  # Decode from bytes to text
        except UnicodeDecodeError:
            continue  # Keep processing other lines, since this one isn't valid UTF-8
        if phrase in decoded_line:
            return True  # We found the phrase, so return True (instead of 1)
    return False  # We never found the phrase, so return False (instead of 0)
Additionally, I've found Ned Batchelder's "Unipain" PyCon talk (Pragmatic Unicode, or, How Do I Stop the Pain?) immensely helpful in understanding bytes vs. unicode.
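If skipping undecodable lines loses too much, another option is to decode with an error handler instead of catching the exception. This is a minimal sketch, assuming the file has been opened in binary mode so that each line is a bytes object. `io.BytesIO` stands in for the Paramiko file handle, and `check_phrase` is a hypothetical free-function version of the method above:

```python
import io


def check_phrase(remote_file, phrase, encoding="utf-8"):
    """Return True if `phrase` occurs in any line of a binary file object."""
    remote_file.seek(0)
    for raw_line in remote_file:
        # errors="replace" swaps undecodable bytes for U+FFFD instead of raising.
        line = raw_line.decode(encoding, errors="replace")
        if phrase in line:
            return True
    return False


# Stand-in for the remote file: the second line contains the byte 0x96,
# which is not a valid UTF-8 start byte.
sample = io.BytesIO(b"first line\nprice 10 \x96 20\nlast line\n")
print(check_phrase(sample, "price 10"))  # True
print(check_phrase(sample, "missing"))   # False
```

Byte 0x96 is an en dash in Windows-1252, so if the file comes from a Windows system, passing encoding="cp1252" may recover the original text. That is a guess about the file's origin, not something the error message proves.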

Related

Python reading a PE file and changing resource section

I am trying to open a Windows PE file and alter some strings in the resource section.
f = open('c:\test\file.exe', 'rb')
file = f.read()
if b'A'*10 in file:
    s = file.replace(b'A'*10, newstring)
In the resource section I have a string that is just:
AAAAAAAAAA
And I want to replace that with something else. When I read the file I get:
\x00A\x00A\x00A\x00A\x00A\x00A\x00A\x00A\x00A\x00A
I have tried opening with UTF-16 and decoding as UTF-16, but then I run into an error:
UnicodeDecodeError: 'utf-16-le' codec can't decode bytes in position 1604-1605: illegal encoding
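Everyone I have seen with the same issue fixed it by decoding as UTF-16. I am not sure why this doesn't work for me.
If the resource inside the binary file is encoded as UTF-16, you shouldn't change the file's encoding. Leave the file contents as bytes and encode the strings you search for and replace instead.
Try this: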
f = open('c:\\test\\file.exe', 'rb')
file = f.read()
unicode_str = u'AAAAAAAAAA'
# 'utf-16-le' writes no byte order mark; plain 'utf-16' would prepend one,
# and the search would then never match
encoded_str = unicode_str.encode('utf-16-le')
if encoded_str in file:
    s = file.replace(encoded_str, new_utf_string.encode('utf-16-le'))
Keep in mind that everything inside a binary file is encoded.
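The codec name matters here: Python's 'utf-16' codec prepends a byte order mark when encoding, while 'utf-16-le' does not. The sketch below shows this with an in-memory stand-in for the file contents (the bytes and the replacement text are made up for illustration). It also pads the replacement to the same byte length, on the assumption that changing the size of a resource would invalidate offsets elsewhere in the executable:

```python
# Hypothetical stand-in for the bytes read from the executable.
data = b"\x14\x00" + "AAAAAAAAAA".encode("utf-16-le") + b"\x00\x00"

old = "AAAAAAAAAA"
new = "hello"  # made-up replacement text, shorter than the original

old_bytes = old.encode("utf-16-le")
# Pad the new text with NULs so the replacement keeps the same byte length.
new_bytes = new.ljust(len(old), "\x00").encode("utf-16-le")

patched = data.replace(old_bytes, new_bytes)

print(old.encode("utf-16") in data)  # False: the BOM-prefixed form is not in the data
print(old_bytes in data)             # True
print(len(patched) == len(data))     # True
```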

Python: Why am I getting a UnicodeDecodeError?

I have the following code that searches through files using regular expressions and, if any match is found, moves the file into a different directory.
import os
import gzip
import re
import shutil

def regEx1():
    os.chdir("C:/Users/David/myfiles")
    files = os.listdir(".")
    os.mkdir("C:/Users/David/NewFiles")
    regex_txt = input("Please enter the string you are looking for:")
    for x in (files):
        inputFile = open((x), "r")
        content = inputFile.read()
        inputFile.close()
        regex = re.compile(regex_txt, re.IGNORECASE)
        if re.search(regex, content) is not None:
            shutil.copy(x, "C:/Users/David/NewFiles")
When I run it I get the following error message:
Traceback (most recent call last):
File "<interactive input>", line 1, in <module>
File "C:\Python33\Lib\encodings\cp1252.py", line 23, in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 367: character maps to <undefined>
Could someone please explain why this message appears?
In python 3, when you open a file for reading in text mode (r) it'll decode the contained text to unicode.
Since you didn't specify what encoding to use to read the file, the platform default (from locale.getpreferredencoding) is being used, and that fails in this case.
You need to either specify an encoding that can decode the file contents, or open the file in binary mode instead (and use b'' bytes patterns for your regular expressions).
See the Python Unicode HOWTO for more information.
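Both options from the answer above can be sketched as follows. The file name, its contents, and the choice of cp1252 with a lenient error handler are assumptions made for the example; the real encoding of the files is not known from the question.

```python
import os
import re
import tempfile

# Create a sample file containing byte 0x9d, which cp1252 cannot decode.
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "wb") as f:
    f.write(b"Hello \x9d World\n")

# Option 1: text mode with an explicit encoding and a lenient error handler.
with open(path, "r", encoding="cp1252", errors="replace") as f:
    content = f.read()
text_match = re.search("world", content, re.IGNORECASE)

# Option 2: binary mode with a bytes pattern; nothing is decoded at all.
with open(path, "rb") as f:
    raw = f.read()
bytes_match = re.search(b"world", raw, re.IGNORECASE)

print(text_match is not None, bytes_match is not None)  # True True
```

With option 2, a pattern typed by the user arrives as text and has to be encoded (for example with `regex_txt.encode("ascii")`) before it can be compiled as a bytes pattern.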
I'm not too familiar with Python 3.x, but the below may work.
inputFile = open(x, "r", encoding="utf8")
There's a similar question here:
Python: Traceback codecs.charmap_decode(input,self.errors,decoding_table)[0]
But you might want to try:
open((x), "r", encoding='UTF8')
Thank you very much for this solution. It helped me with another problem. I used:
exec (open ("DIP6.py").read ())
and I got this error because I have this symbol in a comment of DIP6.py :
# ● en première colonne
It works fine with :
exec (open ("DIP6.py", encoding="utf8").read ())
It also solves a problem with :
print("été") for example
in DIP6.py
I got :
été
in the console.
Thank you :-) .

Need help to figure out a solution to this UnicodeDecodeError

When I use this code (adapted from Stephen Holiday's code; thanks, Stephen!):
#!/usr/bin/env python
# -*- coding: utf-8 -*-
"""
USSSALoader.py
"""
import os
import re
#import urllib2
from zipfile import ZipFile
import csv
import pickle

def getNameList():
    namesDict=extractNamesDict()
    maleNames=list()
    femaleNames=list()
    for name in namesDict:
        counts=namesDict[name]
        tuple=(name,counts[0],counts[1])
        if counts[0]>counts[1]:
            maleNames.append(tuple)
        elif counts[1]>counts[0]:
            femaleNames.append(tuple)
    names=(maleNames,femaleNames)
    return names

def extractNamesDict():
    zf=ZipFile('names.zip', 'r')
    filenames=zf.namelist()
    names=dict()
    genderMap={'M':0,'F':1}
    for filename in filenames:
        file=zf.open(filename,'r')
        rows=csv.reader(file, delimiter=',')
        for row in rows:
            name=row[0].upper()
            # name=row[0].upper().encode('utf-8')
            gender=genderMap[row[1]]
            count=int(row[2])
            if not names.has_key(name):
                names[name]=[0,0]
            names[name][gender]=names[name][gender]+count
        file.close()
        print '\tImported %s'%filename
    return names

if __name__ == "__main__":
    getNameList()
I got this error:
iterator = raw_query.Run(**kwargs)
File "C:\Program Files (x86)\Google\google_appengine\google\appengine\api\datastore.py", line 1622, in Run
itr = Iterator(self.GetBatcher(config=config))
File "C:\Program Files (x86)\Google\google_appengine\google\appengine\api\datastore.py", line 1601, in GetBatcher
return self.GetQuery().run(_GetConnection(), query_options)
File "C:\Program Files (x86)\Google\google_appengine\google\appengine\api\datastore.py", line 1490, in GetQuery
filter_predicate=self.GetFilterPredicate(),
File "C:\Program Files (x86)\Google\google_appengine\google\appengine\api\datastore.py", line 1534, in GetFilterPredicate
property_filters.append(datastore_query.make_filter(name, op, values))
File "C:\Program Files (x86)\Google\google_appengine\google\appengine\datastore\datastore_query.py", line 107, in make_filter
properties = datastore_types.ToPropertyPb(name, values)
File "C:\Program Files (x86)\Google\google_appengine\google\appengine\api\datastore_types.py", line 1745, in ToPropertyPb
pbvalue = pack_prop(name, v, pb.mutable_value())
File "C:\Program Files (x86)\Google\google_appengine\google\appengine\api\datastore_types.py", line 1556, in PackString
pbvalue.set_stringvalue(unicode(value).encode('utf-8'))
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe1 in position 1: ordinal not in range(128)
This happens when I have names with non-ASCII characters (like "Chávez" or "Barañao"). I tried to fix the problem by doing this:
for row in rows:
    # name=row[0].upper()
    name=row[0].upper().encode('utf-8')
    gender=genderMap[row[1]]
    count=int(row[2])
But, then, I got this other error:
File "C:\Users\CG\Desktop\Google Drive\Sci&Tech\projects\naivebayes\USSSALoader.py", line 17, in getNameList
namesDict=extractNamesDict()
File "C:\Users\CG\Desktop\Google Drive\Sci&Tech\projects\naivebayes\USSSALoader.py", line 43, in extractNamesDict
name=row[0].upper().encode('utf-8')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xed in position 3: ordinal not in range(128)
I also tried this:
def extractNamesDict():
    zf=ZipFile('names.zip', 'r', encode='utf-8')
    filenames=zf.namelist()
But ZipFile doesn't have such an argument.
So how can I avoid this UnicodeDecodeError for non-ASCII names?
I'm using this code with GAE.
It looks like your first traceback is AppEngine-related. Are you building a loader that will populate the datastore? If so, seeing the code that comprises the models and does the put'ing would be helpful. I will probably be corrected by someone, but in order for that piece to work I believe you actually need to decode instead of encode (i.e. when you read the sheet prior to the put, convert the string to unicode by using decode('utf-8') or decode('latin1'), depending on your situation).
As far as your local code, I won't pretend to know the deep internals of Unicode handling, but I've generally used decode() and encode() to handle these types of situations. I believe the correct encoding to use depends on the underlying text (meaning you'd need to know if it were encoded utf-8 or latin-1, etc.). Here is a quick test with your example:
>>> s = 'Chávez'
>>> type(s)
<type 'str'>
>>> u = s.decode('latin1')
>>> type(u)
<type 'unicode'>
>>> e = u.encode('latin1')
>>> print e
Chávez
In this case, I needed to use latin1 to decode the encoded string (I was using the terminal), but in your situation using utf-8 may very well work.
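For comparison, here is a sketch of the same loader logic in Python 3 syntax, where the decoding step is explicit: each archive member is wrapped in `io.TextIOWrapper` so that `csv.reader` receives text rather than bytes. The archive is built in memory with made-up rows and a made-up member name, and UTF-8 is an assumption about how the CSV files are encoded. In the Python 2 code from the question, the corresponding step would presumably be `row[0].decode('utf-8')` before calling `upper()`.

```python
import csv
import io
from zipfile import ZipFile

# Build a small in-memory archive with made-up rows (name, gender, count).
buffer = io.BytesIO()
with ZipFile(buffer, "w") as zf:
    zf.writestr("yob2000.txt", "Chávez,M,12\nBarañao,F,7\nChávez,M,3\n".encode("utf-8"))


def extract_names(zip_source, encoding="utf-8"):
    """Return {NAME: [male_count, female_count]} from every file in the archive."""
    gender_index = {"M": 0, "F": 1}
    names = {}
    with ZipFile(zip_source, "r") as zf:
        for filename in zf.namelist():
            with zf.open(filename, "r") as raw:
                # Decode bytes to text once, at the boundary.
                text = io.TextIOWrapper(raw, encoding=encoding, newline="")
                for row in csv.reader(text):
                    counts = names.setdefault(row[0].upper(), [0, 0])
                    counts[gender_index[row[1]]] += int(row[2])
    return names


names = extract_names(buffer)
print(len(names), "distinct names")  # 2 distinct names
```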
Unless I'm missing something, this line in the library:
pbvalue.set_stringvalue(unicode(value).encode('utf-8'))
should be:
pbvalue.set_stringvalue(value.decode(filename_encoding).encode('utf-8'))
The value filename_encoding would have to be passed in from your code if it is not stored in the zip archive somehow (and at least in the early versions of the format, I doubt it's stored). It's yet another occurrence of the classic error of assuming that bytes and "characters" are the same thing.
If you're feeling froggy, dive into the code and fix it, and maybe even contribute a patch. Otherwise, you'll have to write heroic code that checks for U+0080 and above in filenames and performs special handling.
In Python 2.7 (and Linux Mint 17.1), you must use:
hashtags=['transito','tránsito','ñandú','pingüino','fhürer']
for h in hashtags:
    u=h.decode('utf-8')
    print(u.encode('utf-8'))
transito
tránsito
ñandú
pingüino
fhürer

reading/writing files with umlauts in python (html to txt)

I know this has been asked several times, but I think I'm doing everything right and it still doesn't work, so before I go clinically insane I'll make a post. This is the code (It's supposed to convert HTML Files to txt files and leave out certain lines):
fid = codecs.open(htmlFile, "r", encoding = "utf-8")
if not fid:
    return
htmlText = fid.read()
fid.close()
stripped = strip_tags(unicode(htmlText)) ### strip html tags (this is not the prob)
lines = stripped.split('\n')
out = []
for line in lines: # just some stuff i want to leave out of the output
    if len(line) < 6:
        continue
    if '*' in line or '(' in line or '#' in line or ':' in line:
        continue
    out.append(line)
result= '\n'.join(out)
base, ext = os.path.splitext(htmlFile)
outfile = base + '.txt'
fid = codecs.open(outfile, "w", encoding = 'utf-8')
fid.write(result)
fid.close()
Thanks!
Not sure, but by doing
'\n'.join(out)
with a plain byte string as the separator rather than a unicode string, you may be falling back to some non-UTF-8 codec. Try:
u'\n'.join(out)
to make sure you're using unicode objects everywhere.
You haven't specified the problem, so this is a complete guess.
What is being returned by your strip_tags() function? Is it returning a unicode object, or is it a byte string? If the latter, it would likely cause decoding issues when you attempt to write it to a file. For example, if strip_tags() is returning a utf-8 encoded byte string:
>>> s = u'This is \xe4 test\nHere is \xe4nother line.'
>>> print s
This is ä test
Here is änother line.
>>> s_utf8 = s.encode('utf-8')
>>> f=codecs.open('test', 'w', encoding='utf8')
>>> f.write(s) # no problem with this... s is unicode, but
>>> f.write(s_utf8)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib64/python2.7/codecs.py", line 691, in write
return self.writer.write(data)
File "/usr/lib64/python2.7/codecs.py", line 351, in write
data, consumed = self.encode(object, self.errors)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 8: ordinal not in range(128)
If this is what you are seeing you need to make sure that you pass unicode in fid.write(result), which probably means ensuring that unicode is returned by strip_tags().
Also, a couple of other things I noticed in passing:
codecs.open() will raise an IOError exception if it cannot open the file. It will not return None, so the if not fid: test does not help. You need to use try/except, ideally combined with a with statement.
try:
    with codecs.open(htmlFile, "r", encoding = "utf-8") as fid:
        htmlText = fid.read()
except IOError, e:
    # handle error
    print e
And, data that you read from a file opened via codecs.open() will automatically be converted to unicode, therefore calling unicode(htmlText) achieves nothing.
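One way to guarantee that only text reaches `fid.write()` is to normalize values at the boundary. This is a minimal sketch in Python 3 syntax (the thread itself uses Python 2), where `ensure_text` is a hypothetical helper, not part of `codecs`, and UTF-8 is assumed as the encoding of any byte strings:

```python
import io


def ensure_text(value, encoding="utf-8"):
    """Return `value` as a text string, decoding it first if it is bytes."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


# One text line and one UTF-8 encoded byte string, as strip_tags() might return.
lines = [u"This is \xe4 test", u"Here is \xe4nother line.".encode("utf-8")]
result = u"\n".join(ensure_text(line) for line in lines)

# io.StringIO stands in for the output file opened with an explicit encoding.
out = io.StringIO()
out.write(result)

# ascii() keeps the printed form safe on consoles that cannot display non-ASCII.
print(ascii(out.getvalue()))
```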

Except Python codec errors?

File "/usr/lib/python3.1/codecs.py", line 300, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x92 in position 805: invalid start byte
Hi, I get this exception. How do I catch it and continue reading my files when it occurs?
My program has a loop that reads a text file line-by-line and tries to do some processing. However, some files I encounter may not be text files, or have lines that are not properly formatted (foreign language etc). I want to ignore those lines.
The following is not working
for line in sys.stdin:
    if line != "":
        try:
            matched = re.match(searchstuff, line, re.IGNORECASE)
            print (matched)
        except UnicodeDecodeError, UnicodeEncodeError:
            continue
Look at http://docs.python.org/py3k/library/codecs.html. When you open the codecs stream, you probably want to use the additional argument errors='ignore'
In Python 3, sys.stdin is by default opened as a text stream (see http://docs.python.org/py3k/library/sys.html), and has strict error checking.
You need to reopen it as an error-tolerant utf-8 stream. Something like this will work:
sys.stdin = codecs.getreader('utf8')(sys.stdin.detach(), errors='ignore')
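The `try` block in the question never fires because the decode error is raised while the `for` statement reads the next line, not inside the loop body. As an alternative to `codecs.getreader`, the underlying byte stream can be wrapped in `io.TextIOWrapper` with an error handler. In this sketch `io.BytesIO` stands in for `sys.stdin.buffer` so the example runs on its own, and the search pattern and the function name `matching_lines` are made up:

```python
import io
import re


def matching_lines(byte_stream, pattern, encoding="utf-8"):
    """Yield decoded lines from `byte_stream` whose start matches `pattern`."""
    # errors="ignore" drops undecodable bytes instead of raising.
    text_stream = io.TextIOWrapper(byte_stream, encoding=encoding, errors="ignore")
    for line in text_stream:
        if re.match(pattern, line, re.IGNORECASE):
            yield line.rstrip("\n")


# Stand-in for sys.stdin.buffer; the second line contains the invalid byte 0x92.
fake_stdin = io.BytesIO(b"error: disk full\nError: can\x92t open file\ninfo: ok\n")
print(list(matching_lines(fake_stdin, r"error")))
# ['error: disk full', 'Error: cant open file']
```

For real input, pass `sys.stdin.buffer` in place of `fake_stdin`. Using errors="replace" instead would keep a visible marker where bytes were dropped.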
