Python 3 Unicode not found

I'm aware that unicode was changed to str in python 3 but I keep getting the same issue no matter how I write this code, can anyone tell me why?
I'm using boilerpipe for a specific set of webcrawls:
for urls in allUrls:
    fileW = open('article('+ str(counter)+')', 'w')
    articleDate = Article(urls)
    articleDate.download()
    articleDate.parse()
    print(articleDate.publish_date)
    fileW.write(str(Extractor(extractor='ArticleExtractor', url=urls).getText() + "\n\n\n" + str(articleDate.publish_date)+"\n\n\n"))
    fileW.close()
    counter +=1
error:
Traceback (most recent call last):
File "/Users/Adrian/anaconda3/lib/python3.6/site-packages/boilerpipe/extract/__init__.py", line 45, in __init__
self.data = unicode(self.data, encoding)
NameError: name 'unicode' is not defined
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "webcrawl.py", line 26, in <module>
fileW.write(str(Extractor(extractor='ArticleExtractor', url=urls).getText() + "\n\n\n" + str(articleDate.publish_date)+"\n\n\n"))
File "/Users/Adrian/anaconda3/lib/python3.6/site-packages/boilerpipe/extract/__init__.py", line 47, in __init__
self.data = self.data.decode(encoding)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte

The error message points to a line in boilerpipe/extract/__init__.py that calls unicode, a built-in function that exists in Python 2 but not in Python 3.
I assume the link below is the source code for the package you are using. If so, it appears to be written for Python 2.7, as you can see near the end of this file:
https://github.com/misja/python-boilerpipe/blob/master/setup.py
As far as I can see, you have several options:
1. Find a Python 3 port of this package. There are at least a few out there.
2. Port the package to Python 3 yourself. If that is the only error, you could simply change that line to use str, though other parts of the package may then fail and need similar changes. The official porting tool and the official porting guide should help.
3. Port your project to Python 2.7 and continue using the same package.
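To illustrate the second option, here is a minimal sketch of a decoding helper that runs on Python 3 without calling unicode. The name to_text is hypothetical and not part of boilerpipe. The gzip check is an inference from the second traceback: byte 0x8b at position 1 is consistent with gzip-compressed data, whose streams begin with the bytes 0x1f 0x8b, but the question does not confirm that this is what the server returned.

```python
import gzip


def to_text(data, encoding="utf-8"):
    """Return data as str, decoding bytes if necessary (hypothetical helper)."""
    if isinstance(data, str):
        # Already text: Python 3 str plays the role of Python 2 unicode.
        return data
    if data[:2] == b"\x1f\x8b":
        # Assumption: a gzip magic number means the payload is compressed.
        data = gzip.decompress(data)
    return data.decode(encoding)
```

If the page body really is gzip-compressed, decoding it directly as UTF-8 fails exactly as in the traceback, while decompressing first yields readable text.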
I hope this helps!

Related

Why does the Python shelve module give an error at start-up when opening a file

I have been using the Python shelve module to store face encodings from the Python face-recognition module linked below. I did this to make the live image recognition process faster.
I then load these encodings in another script, again using the shelve module, assign them to variables, and use them further down the script. This all works fine in the Python IDLE environment and when I run it from the terminal. However, when the script runs at start-up, the shelve module fails to load the data. Can anyone tell me why this happens at start-up? The error I get in the log file is below. I have been stuck on it for a few days now. Is there a better way of storing and reusing the encodings? Thanks in advance.
The bit of code that fails at start-up but runs fine otherwise:
import shelve
shelfFile = shelve.open('face_encodings')
known_encodings = shelfFile['known_encodings']
known_names = shelfFile['known_names']
shelfFile.close()
the error on startup
Traceback (most recent call last):
File "/usr/lib/python3.7/shelve.py", line 111, in __getitem__
value = self.cache[key]
KeyError: 'known_encodings'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/pi/Desktop/run/testing.py", line 4, in <module>
known_encodings = shelfFile['known_encodings']
File "/usr/lib/python3.7/shelve.py", line 113, in __getitem__
f = BytesIO(self.dict[key.encode(self.keyencoding)])
KeyError: b'known_encodings'
face-recognition module
https://pypi.org/project/face-recognition/
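One possible cause, stated as an assumption because the question does not show how the script is launched at start-up: shelve.open('face_encodings') uses a relative path, which is resolved against the current working directory. At boot that directory may differ from the script's folder, and because shelve.open defaults to flag 'c', it silently creates a new, empty shelf there, which would produce this KeyError. The sketch below builds the path from the script's own location and opens the shelf read-only so a missing file fails loudly. The helper names shelf_path and load_encodings are hypothetical.

```python
import shelve
from pathlib import Path


def shelf_path(name, base_dir=None):
    """Build an absolute shelf path next to this script (or under base_dir)."""
    if base_dir is None:
        base_dir = Path(__file__).resolve().parent
    return str(Path(base_dir) / name)


def load_encodings(path):
    """Read both keys from an existing shelf; flag='r' never creates a new file."""
    with shelve.open(path, flag="r") as shelf:
        return shelf["known_encodings"], shelf["known_names"]
```

In the start-up script this would be called as load_encodings(shelf_path('face_encodings')), assuming the shelf file sits in the same folder as the script.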

Encoding Error with Beautiful Soup: Character Maps to Undefined (Python)

I've written a script that is supposed to retrieve html pages off a site and update their contents. The following function looks for a certain file on my system, then attempts to open it and edit it:
def update_sn(files_to_update, sn, table, title):
    paths = files_to_update['files']
    print('updating the sn')
    try:
        sn_htm = [s for s in paths if re.search('^((?!(Default|Notes|Latest_Addings)).)*htm$', s)][0]
        notes_htm = [s for s in paths if re.search('_Notes\.htm$', s)][0]
    except Exception:
        print('no sns were found')
        pass
    new_path_name = new_path(sn_htm, files_to_update['predecessor'], files_to_update['original'])
    new_sn_number = sn
    htm_text = open(sn_htm, 'rb').read().decode('cp1252')
    content = re.findall(r'(<table>.*?<\/table>.*)(?:<\/html>)', htm_text, re.I | re.S)
    minus_content = htm_text.replace(content[0], '')
    table_soup = BeautifulSoup(table, 'html.parser')
    new_soup = BeautifulSoup(minus_content, 'html.parser')
    head_title = new_soup.title.string.replace_with(new_sn_number)
    new_soup.link.insert_after(table_soup.div.next)
    with open(new_path_name, "w+") as file:
        result = str(new_soup)
        try:
            file.write(result)
        except Exception:
            print('Met exception. Changing encoding to cp1252')
            try:
                file.write(result('cp1252'))
            except Exception:
                print('cp1252 did\'nt work. Changing encoding to utf-8')
                file.write(result.encode('utf8'))
                try:
                    print('utf8 did\'nt work. Changing encoding to utf-16')
                    file.write(result.encode('utf16'))
                except Exception:
                    pass
This works in the majority of cases, but sometimes it fails to write, at which point the exception kicks in and I try every feasible encoding without success:
updating the sn
Met exception. Changing encoding to cp1252
cp1252 did'nt work. Changing encoding to utf-8
Traceback (most recent call last):
File "C:\Users\Joseph\Desktop\SN Script\update_files.py", line 145, in update_sn
file.write(result)
File "C:\Users\Joseph\AppData\Local\Programs\Python\Python36\lib\encodings\cp1252.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode characters in position 4006-4007: character maps to <undefined>
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:\Users\Joseph\Desktop\SN Script\update_files.py", line 149, in update_sn
file.write(result('cp1252'))
TypeError: 'str' object is not callable
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "scraper.py", line 79, in <module>
get_latest(entries[0], int(num), entries[1])
File "scraper.py", line 56, in get_latest
update_files.update_sn(files_to_update, data['number'], data['table'], data['title'])
File "C:\Users\Joseph\Desktop\SN Script\update_files.py", line 152, in update_sn
file.write(result.encode('utf8'))
TypeError: write() argument must be str, not bytes
Can anyone give me any pointers on how to better handle html data that might have inconsistent encoding?
In your code you open the file in text mode, but then you attempt to write bytes (str.encode returns bytes) and so Python throws an exception:
TypeError: write() argument must be str, not bytes
If you want to write bytes, you should open the file in binary mode.
BeautifulSoup detects the document's encoding (when it is given bytes) and converts the content to a string automatically. The detected encoding is available as .original_encoding, and we can use it to encode the content when writing to a file. For example:
from bs4 import BeautifulSoup

soup = BeautifulSoup(b'<tag>ascii characters</tag>', 'html.parser')
data = soup.tag.text
encoding = soup.original_encoding or 'utf-8'
print(encoding)
#ascii
with open('my.file', 'wb+') as file:
    file.write(data.encode(encoding))
In order for this to work you should pass your html as bytes to BeautifulSoup, so don't decode the response content.
If BeautifulSoup fails to detect the correct encoding for some reason, then you could try a list of possible encodings, like you have done in your code.
data = 'Somé téxt'
encodings = ['ascii', 'utf-8', 'cp1252']
with open('my.file', 'wb+') as file:
    for encoding in encodings:
        try:
            file.write(data.encode(encoding))
            break
        except UnicodeEncodeError:
            print(encoding + ' failed.')
Alternatively, you could open the file in text mode and pass the encoding to open (instead of encoding the content yourself). Note that Python 2's built-in open does not accept an encoding argument; there, io.open does.
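A minimal sketch of the text-mode approach on Python 3, assuming UTF-8 is an acceptable output encoding. The function name write_html and the sample markup are illustrative only. The traceback in the question goes through cp1252.py, which suggests the file was opened with the platform default encoding, and cp1252 cannot represent every Unicode character.

```python
import os
import tempfile


def write_html(path, markup, encoding="utf-8"):
    """Write str markup to path and let open() handle the encoding."""
    with open(path, "w", encoding=encoding) as handle:
        handle.write(markup)


# Illustrative usage: the check mark (U+2713) has no cp1252 representation.
target = os.path.join(tempfile.mkdtemp(), "my.file")
write_html(target, "<p>Som\u00e9 t\u00e9xt \u2713</p>")
with open(target, "rb") as handle:
    raw = handle.read()
```

With an explicit encoding the write no longer depends on the platform default, so the same script behaves the same way on Windows and elsewhere.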
Just out of curiosity, is this line of code a typo: file.write(result('cp1252'))? It seems to be missing the .encode method.
Traceback (most recent call last):
File "C:\Users\Joseph\Desktop\SN Script\update_files.py", line 149, in update_sn
file.write(result('cp1252'))
TypeError: 'str' object is not callable
Would it work if you changed the code to file.write(result.encode('cp1252'))?
I once had a similar problem writing to a file with a specific encoding, and worked out my own solution from the following thread:
Saving utf-8 texts in json.dumps as UTF8, not as \u escape sequence
My problem was solved by changing the parser from html.parser to html5lib. The root cause was a malformed HTML tag, which the html5lib parser handled. For reference, the BeautifulSoup documentation describes each parser it supports.
Hope this helps

tf.gfile.Glob gives me a UnicodeDecodeError; is there any way to fix this?

I was trying to get the names of the txt files (written in Korean) in the specified directory with the code below:
dir_list = tf.gfile.Glob(engine.TXT_DIR+"/*.txt")
However, this gives me the following error:
Traceback (most recent call last):
File "D:/Prj_mayDay/Prj_FrankenShtine/shakespear_reborn/main.py", line 108, in <module>
dir_list = tf.gfile.Glob(engine.TXT_DIR+"/*.txt")
File "D:\KimKanna's Class\python35\lib\site-packages\tensorflow\python\lib\io\file_io.py", line 326, in get_matching_files
compat.as_bytes(filename), status)
File "D:\KimKanna's Class\python35\lib\site-packages\tensorflow\python\lib\io\file_io.py", line 325, in <listcomp>
for matching_filename in pywrap_tensorflow.GetMatchingFiles(
File "D:\KimKanna's Class\python35\lib\site-packages\tensorflow\python\util\compat.py", line 106, in as_str_any
return as_str(value)
File "D:\KimKanna's Class\python35\lib\site-packages\tensorflow\python\util\compat.py", line 84, in as_text
return bytes_or_text.decode(encoding)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xbb in position 19: invalid start byte
After some research, I found the reason:
The error is because there is some non-ascii character in the dictionary and it can't be encoded/decoded
However, I don't see how to apply that solution to my code. Is there a way?
If there is alternative code for this, it should work for both a cloud storage bucket and my local hard drive, as the code above does.
I'm using Python 3 and TensorFlow 1.2.0-rc2.
After a few hours of fiddling with my code, I finally found the solution.
It turned out that one of the files in the directory I specified had a name in Korean. After I took it out of the directory, the problem was gone.
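To find the offending file without inspecting the directory by hand, a small standard-library check can list the entries whose names contain non-ASCII characters. This is only a sketch: the function name non_ascii_names is hypothetical, and it works on local directories only, not on cloud storage buckets.

```python
import os


def non_ascii_names(directory):
    """Return the entries in directory whose names contain non-ASCII characters."""
    flagged = []
    for name in os.listdir(directory):
        try:
            name.encode("ascii")
        except UnicodeEncodeError:
            # At least one character falls outside the ASCII range.
            flagged.append(name)
    return sorted(flagged)
```

Running it on the directory passed to tf.gfile.Glob would show which files to rename or move before globbing.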

Python 3 [TypeError: 'str' object cannot be interpreted as an integer] when working with sockets

I run the script with python3 in the terminal but when I reach a certain point in it, I get the following error:
Traceback (most recent call last):
File "client.py", line 50, in <module>
client()
File "client.py", line 45, in client
s.send(bytes((msg, 'utf-8')))
TypeError: 'str' object cannot be interpreted as an integer
This is the code it refers to.
else :
    # user entered a message
    msg = sys.stdin.readline()
    s.send(bytes((msg, 'utf-8')))
    sys.stdout.write(bytes('[Me] '))
    sys.stdout.flush()
I read the official documentation for bytes() and another source
https://docs.python.org/3.1/library/functions.html#bytes
http://www.pythoncentral.io/encoding-and-decoding-strings-in-python-3-x/
but I am no closer to understanding how to fix this. I realise that my msg is a string and I need an integer, but I am confused about how to convert it. Can you please help me, or direct me to a source that will help me?
Edit 1: I changed the line
s.send(bytes((msg, 'utf-8')))
to
s.send(bytes(msg, 'utf-8'))
but now I get the following error:
Traceback (most recent call last):
File "client.py", line 50, in <module>
client()
File "client.py", line 46, in client
sys.stdout.write(bytes('[Me] '))
TypeError: string argument without an encoding
Edit 2: Following @falsetru's updated answer,
using a bytes literal gives me
TypeError: must be str, not bytes
Change the following line:
s.send(bytes((msg, 'utf-8')))
as:
s.send(bytes(msg, 'utf-8'))
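In other words, pass a string and an encoding name to bytes instead of a tuple. Given a single tuple, bytes treats it as an iterable of integers, which is why it complains that a str cannot be interpreted as an integer.
UPDATE according to the question change:
You need to pass a string to sys.stdout.write. Simply pass a string literal: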
sys.stdout.write('[Me] ')
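Putting both fixes together, here is a self-contained sketch of the str/bytes split: bytes go over the socket, str goes to sys.stdout. It uses socket.socketpair() as a stand-in for the real client connection, which is an assumption made only so the example runs on its own, and the helper name send_text is hypothetical.

```python
import socket
import sys


def send_text(sock, msg, encoding="utf-8"):
    """Encode a str to bytes before sending; sockets only accept bytes."""
    sock.sendall(msg.encode(encoding))


# A connected pair of sockets stands in for the client and server.
left, right = socket.socketpair()
send_text(left, "hello\n")
received = right.recv(1024).decode("utf-8")

# sys.stdout.write expects str, so the prompt stays a plain string literal.
sys.stdout.write("[Me] " + received)
sys.stdout.flush()

left.close()
right.close()
```

msg.encode('utf-8') and bytes(msg, 'utf-8') produce the same bytes; which one to use is a matter of style.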

Can't get pythonXY to work on my laptop

I'm migrating from Matlab to Python, so I decided to try the pythonxy distribution, but after installation I can't open it. When I double-click the icon, nothing happens. I already tried to submit an issue on the pythonxy page but didn't get any answer.
Does anyone know what the problem could be?
I'm using Win7 x64.
This is the traceback information displayed on the interactive console:
Traceback (most recent call last):
File "C:\Python27\Scripts\xyhome.pyw", line 21, in <module>
xyhome.main()
File "C:\Python27\lib\site-packages\xy\xyhome.pyw", line 689, in main
form = MainWindow(options)
File "C:\Python27\lib\site-packages\xy\xyhome.pyw", line 134, in __init__
self.scanstartup()
File "C:\Python27\lib\site-packages\xy\xyhome.pyw", line 574, in scanstartup
default_startup()
File "C:\Python27\lib\site-packages\xy\config.py", line 85, in default_startup
filename = osp.join(STARTUP_PATH, CONF.get(None, 'startup'))
File "C:\Python27\lib\ntpath.py", line 109, in join
path += "\\" + b
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe7 in position 17: ordinal not in range(128)
This is a known bug in pythonxy: http://code.google.com/p/pythonxy/issues/detail?id=146
The problem is that your home path contains non-ASCII characters. For now, you would probably have to run it as a user whose home path contains only ASCII characters. There are patches in the bug report comments, but they do not seem to work as intended.
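The failure can be reproduced in a few lines. The sketch below uses a made-up home path and is written for Python 3, where the decode has to be explicit. In Python 2, joining a unicode string with a byte string triggers the same ASCII decode implicitly, which is what happens inside ntpath.join in the traceback. Treating the path bytes as cp1252 is an assumption about the Windows code page.

```python
# Hypothetical home path stored as bytes; 0xe7 is the letter c-cedilla in cp1252.
raw = b"C:\\Users\\Fran\xe7ois"

try:
    raw.decode("ascii")
    message = ""
except UnicodeDecodeError as exc:
    # Same error family as the pythonxy traceback.
    message = str(exc)

# Decoding with the matching code page succeeds.
decoded = raw.decode("cp1252")
```

Here message reports byte 0xe7 as not in range(128), while decoded holds the readable path.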
Do you have any non-ascii characters in your path? If so, maybe you would like to change your installation path. It seems that it has a problem with the character "7".
>>> chr(231)
'\xe7'
>>> chr(55)
'7'
My guess is that your 7 in C:\Python27\ is not really a 7.
