How to get pool.py to accept non ascii characters? - python

I am using Python 2.7.18
The idea is to use python to gather songs from specified directories, then create and run the commands to run them through a bunch of converters and sound processors.
Some of my songs have characters with accents and any song with a ? in the title gets changed to a ¿ (Inverted Question Mark) in the file name.
My convert_song function works correctly when run directly, but when I try to run it in a Pool and the file name or directory has a non-ASCII character in it, it fails with:
Traceback (most recent call last):
  File "C:\StreamLine.py", line 270, in <module>
    result = pool.map(convert_song, qTheStack)
  File "C:\Python27\lib\multiprocessing\pool.py", line 253, in map
    return self.map_async(func, iterable, chunksize).get()
  File "C:\Python27\lib\multiprocessing\pool.py", line 572, in get
    raise self._value
UnicodeEncodeError: 'ascii' codec can't encode character u'\xbf' in position 27: ordinal not in range(128)
Here's my main where I set up the pool:
if __name__ == '__main__':
    print('Reading artists.')
    predir = 'G:\\Vault\\The Music\\'
    artistfile = open('C:\\Controls\\ArtistList.txt', 'r')
    artistlist = artistfile.readlines()
    dirs = []
    for artist in artistlist:
        dirs.append(predir + artist.strip())
    qTheStack = []
    for currentPath in dirs:
        for wFile in generate_next_file(currentPath):
            print(repr(wFile))
            #print(convert_song(wFile))
            qTheStack.append(wFile)
    print('List loaded.')
    pool = Pool(12)
    result = pool.map(convert_song, qTheStack)
    for item in result:
        print(item)
The print(repr(wFile)) output looks like this when run:
'G:\\Vault\\The Music\\Chicago\\1989 - Greatest Hits 1982-1989\\04 - Will You Still Love Me\xbf.flac'
'G:\\Vault\\The Music\\Chicago\\1989 - Greatest Hits 1982-1989\\06 - What Kind of Man Would I Be\xbf [Remix].flac'
How can I get the built-in Pool from multiprocessing to accept my input?

Change to Python 3, dude.
As much as I wanted there to be an answer that stayed on Python 2.7, I tried Python 3 and it didn't disappoint.
I did have to go back through the obscure steps I found to generate a file that will run a COM/DLL in Python, and I had to remove all the str.decode and encode calls throughout my script. After only one import change, I hit run and it ran as expected.

Related

On the wrong foot with regular python reading of a text file, error line at the exception is wrong

When reading a UTF-8 text file in Python you may encounter an invalid UTF-8 byte. The obvious next step is to find the line (number) containing the invalid byte, but this will probably fail, as the code below illustrates.
Step 1: Create a file containing an illegal utf-8 character (a1 hex = 161 decimal)
filename=r"D:\wrong_utf8.txt"
longstring = "test just_a_text"*10
with open(filename, "wb") as f:
    for lineno in range(1,100):
        if lineno==85:
            f.write(f"{longstring}\terrrocharacter->".encode('utf-8')+bytes.fromhex('a1')+"\r\n".encode('utf-8'))
        else:
            f.write(f"{longstring}\t{lineno}\r\n".encode('utf-8'))
Step 2: Read the file and catch the error:
print("First pass, regular Python textline read.")
with open(filename, "r",encoding='utf8') as f:
    lineno=0
    while True:
        try:
            lineno+=1
            line=f.readline()
            if not line:
                break
            print(lineno)
        except UnicodeDecodeError:
            print (f"UnicodeDecodeError at line {lineno}\n")
            break
It prints: UnicodeDecodeError at line 50
I would expect the error line to be line 85. However, lineno 50 is printed! So the customer who sent the file to us was unable to find the illegal character. I tried to find additional parameters to modify the open statement (including buffering) but was unable to get the right error line number.
Note: if you sufficiently shorten the longstring, the problem goes away. So the problem probably has to do with python's internal buffering.
I succeeded by using the following code to find the error line:
print("Second pass, Python byteline read.")
with open(filename,'rb') as f:
    lineno=0
    while True:
        try:
            lineno+=1
            line = f.readline()
            if not line:
                break
            lineutf8=line.decode('utf8')
            print(lineno)
        except UnicodeDecodeError: #Exception as e:
            mybytelist=line.split(b'\t')
            for index,field in enumerate(mybytelist):
                try:
                    fieldutf8=field.decode('utf8')
                except UnicodeDecodeError:
                    print(f'UnicodeDecodeError in line {lineno}, field {index+1}, offending field: {field}!')
                    break
            break
Now it prints the right lineno: UnicodeDecodeError in line 85, field 2, offending field: b'errrocharacter->\xa1\r\n'!
Is this the pythonic way of finding the error line? It works all right but I somehow have the feeling that a better method should be available where it is not required to read the file twice and/or use a binary read.
The actual cause is indeed the way Python internally processes text files. They are read in chunks, each chunk is decoded according to the specified encoding, and then, if you use readline or iterate over the file object, the decoded buffer is split into lines, which are returned one at a time.
You can see evidence of this by examining the UnicodeDecodeError object at the time of the error:
....
        except UnicodeDecodeError as e:
            print (f"UnicodeDecodeError at line {lineno}\n")
            print(repr(e)) # or err = e to save the object and examine it later
            break
With your example data, you can find that Python was trying to decode a buffer of 8149 bytes, and that the offending character occurs at position 5836 in that buffer.
This processing happens deep inside the Python io library, because text files have to be buffered and the binary buffer is decoded before being split into lines. So IMHO little can be done here, and the best way is probably your second attempt: read the file as a binary file and decode the lines one at a time.
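As a minimal sketch of that "read as binary, decode line by line" approach, the helper below reports every line that fails to decode. The function name find_undecodable_lines and the small sample file are illustrative assumptions, not part of the question's code. Splitting on the newline byte before decoding is safe for UTF-8, where byte 0x0A never appears inside a multi-byte sequence; it would not be safe for encodings such as UTF-16.

```python
import os
import tempfile


def find_undecodable_lines(path, encoding="utf-8"):
    """Yield (line_number, raw_line) for each line that fails to decode."""
    with open(path, "rb") as f:
        # Iterating a binary file splits on b"\n" only, so the numbering
        # is the same for "\n" and "\r\n" line endings.
        for lineno, raw in enumerate(f, start=1):
            try:
                raw.decode(encoding)
            except UnicodeDecodeError:
                yield lineno, raw


# Build a small sample file: line 3 contains the invalid byte 0xa1.
sample_lines = [b"one\r\n", b"two\r\n", b"bad->\xa1\r\n", b"four\r\n"]
fd, sample_path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "wb") as f:
    f.writelines(sample_lines)

bad_lines = list(find_undecodable_lines(sample_path))
os.remove(sample_path)
print(bad_lines)
```

Because each line is decoded on its own, the reported number does not depend on how the io layer chunks the file.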
Alternatively, you could use errors='replace' to replace any offending byte with a REPLACEMENT CHARACTER (U+FFFD). But then, you would no longer test for an error, but search for that character in the line:
with open(filename, "r",encoding='utf8', errors='replace') as f:
    lineno=0
    while True:
        lineno+=1
        line=f.readline()
        if not line:
            break
        if chr(0xfffd) in line:
            print (f"UnicodeDecodeError at line {lineno}\n")
            break
        print(lineno)
This one also gives as expected:
...
80
81
82
83
84
UnicodeDecodeError at line 85
The UnicodeDecodeError has information about the error that can be used to improve the reporting of the error.
My proposal would be to decode the whole file in one go. If the content is good then there is no need to iterate around a loop. Especially as reading a binary file doesn't have the concept of lines.
If there is an error raised with the decode, then the UnicodeDecodeError has the start and end values of the bad content.
Decoding only up to that bad character allows the lines to be counted efficiently with len and splitlines.
If you want to display the bad line then doing the decode with replace errors set might be useful along with the line number from the previous step.
I would also consider raising a custom exception with the new information.
Here is an example:
from pathlib import Path


def create_bad(filename):
    longstring = "test just_a_text" * 10
    with open(filename, "wb") as f:
        for lineno in range(1, 100):
            if lineno == 85:
                f.write(f"{longstring}\terrrocharacter->".encode('utf-8') + bytes.fromhex('a1') + "\r\n".encode('utf-8'))
            else:
                f.write(f"{longstring}\t{lineno}\r\n".encode('utf-8'))


class BadUnicodeInFile(Exception):
    """Add information about line numbers"""
    pass


def new_read_bad(filename):
    file = Path(filename)
    data = file.read_bytes()
    try:
        file_content = data.decode('utf8')
    except UnicodeDecodeError as err:
        bad_line_no = len(err.object[:err.start].decode('utf8').splitlines())
        bad_line_content = err.object.decode('utf8', 'replace').splitlines()[bad_line_no - 1]
        bad_content = err.object[err.start:err.end]
        raise BadUnicodeInFile(
            f"{filename} has bad content ({bad_content}) on: line number {bad_line_no}\n"
            f"\t{bad_line_content}")
    return file_content


if __name__ == '__main__':
    create_bad("/tmp/wrong_utf8.txt")
    new_read_bad("/tmp/wrong_utf8.txt")
This gave the following output:
Traceback (most recent call last):
  File "/home/user1/stack_overflow/wrong_utf8.py", line 39, in new_read_bad
    file_content = data.decode('utf8')
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa1 in position 14028: invalid start byte

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/user1/stack_overflow/wrong_utf8.py", line 52, in <module>
    new_read_bad("/tmp/wrong_utf8.txt")
  File "/home/user1/stack_overflow/wrong_utf8.py", line 44, in new_read_bad
    raise BadUnicodeInFile(
__main__.BadUnicodeInFile: /tmp/wrong_utf8.txt has bad content (b'\xa1') on: line number 85
test just_a_texttest just_a_texttest just_a_texttest just_a_texttest just_a_texttest just_a_texttest just_a_texttest just_a_texttest just_a_texttest just_a_text errrocharacter->�

Python Camelot PDF - UnicodeEncodeError when using Stream flavor, on Windows

Python 3.7 on Windows 10. Camelot 0.8.2
I'm using the following code to convert a pdf file to HTML:
import camelot
import os

def CustomScript(args):
    # raw string, so that "\a" in the path is not read as an escape sequence
    path_to_pdf = r"C:\PDFfolder\abc.pdf"
    folder_to_pdf = os.path.dirname(path_to_pdf)
    tables = camelot.read_pdf(os.path.normpath(path_to_pdf), flavor='stream', pages='1-end')
    tables.export(os.path.normpath(os.path.join(folder_to_pdf,"temp","foo.html")), f='html')
    return CustomScriptReturn.Empty()
I receive the following error at the tables.export line:
"UnicodeEncodeError -'charmap' codec can't encode character '\u2010'
in position y: character maps to undefined.
This code runs without issue on Mac. This error seems to pertain to Windows, which is the environment I will need to run this on.
I have now spent two entire days researching this error ad nauseam. I have tried many of the solutions offered here on Stack Overflow in the several posts related to this, and the error persists. The problem with the lines of code suggested in those solutions is that they are all arguments to be added to vanilla Python methods. These arguments are not available in Camelot's export method.
EDIT 1: Updated post to specify which line is throwing the error.
EDIT 2: PDF file used: http://tsbde.texas.gov/78i8ljhbj/Fiscal-Year-2014-Disciplinary-Actions.pdf
EDIT 3: Here is the full Traceback from Windows console:
Traceback (most recent call last):
  File "main.py", line 18, in <module>
    tables.export(os.path.normpath(os.path.join(folder_to_pdf, "foo.html")), f='html')
  File "C:\Users\stpete\AppData\Local\Programs\Python\Python37\lib\site-packages\camelot\core.py", line 737, in export
    self._write_file(f=f, **kwargs)
  File "C:\Users\stpete\AppData\Local\Programs\Python\Python37\lib\site-packages\camelot\core.py", line 699, in _write_file
    to_format(filepath)
  File "C:\Users\stpete\AppData\Local\Programs\Python\Python37\lib\site-packages\camelot\core.py", line 636, in to_html
    f.write(html_string)
  File "C:\Users\stpete\AppData\Local\Programs\Python\Python37\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u2010' in position 5737: character maps to <undefined>
The problem you are facing is related to the method camelot.core.Table.to_html:
def to_html(self, path, **kwargs):
    """Writes Table to an HTML file.
    For kwargs, check :meth:`pandas.DataFrame.to_html`.
    Parameters
    ----------
    path : str
        Output filepath.
    """
    html_string = self.df.to_html(**kwargs)
    with open(path, "w") as f:
        f.write(html_string)
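Here, the output file is opened without an explicit encoding, so Python falls back to the platform default (cp1252 on this Windows machine, as the traceback shows). It should be opened with UTF-8 encoding instead.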
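To illustrate why the default encoding matters here, this short check (written for Python 3, independent of camelot) tries to encode the character from the traceback with both codecs. The variable names are only for illustration.

```python
hyphen = "\u2010"  # HYPHEN, the character named in the traceback

# cp1252 has no mapping for U+2010, so encoding fails.
try:
    hyphen.encode("cp1252")
    cp1252_ok = True
except UnicodeEncodeError:
    cp1252_ok = False

# UTF-8 can encode every Unicode code point.
utf8_bytes = hyphen.encode("utf-8")

print(cp1252_ok, utf8_bytes)
```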
This is my solution, which uses a monkey patch to replace original camelot method:
import camelot
import os

# here I define the corrected method
def to_html(self, path, **kwargs):
    """Writes Table to an HTML file.
    For kwargs, check :meth:`pandas.DataFrame.to_html`.
    Parameters
    ----------
    path : str
        Output filepath.
    """
    html_string = self.df.to_html(**kwargs)
    with open(path, "w", encoding="utf-8") as f:
        f.write(html_string)

# monkey patch: I replace the original method with the corrected one
camelot.core.Table.to_html=to_html

def CustomScript(args):
    # raw string, so that "\a" in the path is not read as an escape sequence
    path_to_pdf = r"C:\PDFfolder\abc.pdf"
    folder_to_pdf = os.path.dirname(path_to_pdf)
    tables = camelot.read_pdf(os.path.normpath(path_to_pdf), flavor='stream', pages='1-end')
    tables.export(os.path.normpath(os.path.join(folder_to_pdf,"temp","foo.html")), f='html')
    return CustomScriptReturn.Empty()
I tested this solution and it works for Python 3.7, Windows 10, Camelot 0.8.2.
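If monkey patching feels too invasive, a possible alternative is to skip tables.export and write each table's DataFrame yourself with an explicit encoding. This is only a sketch: the helper name export_tables_as_utf8_html and the file naming scheme are assumptions, and it relies on each table exposing a pandas DataFrame as .df (as the to_html source above does) and on the table list being iterable.

```python
import os


def export_tables_as_utf8_html(tables, out_dir, stem="table"):
    """Write each table's DataFrame to its own UTF-8 encoded HTML file."""
    os.makedirs(out_dir, exist_ok=True)
    written = []
    for index, table in enumerate(tables, start=1):
        html_string = table.df.to_html()
        path = os.path.join(out_dir, f"{stem}-{index}.html")
        # The explicit encoding avoids the cp1252 default on Windows.
        with open(path, "w", encoding="utf-8") as f:
            f.write(html_string)
        written.append(path)
    return written
```

With the code from the question this would be called as export_tables_as_utf8_html(tables, os.path.join(folder_to_pdf, "temp")) after camelot.read_pdf.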
You're getting UnicodeEncodeError, which in this case means that the output to be written to file contains a character that cannot be encoded in the default encoding for your platform, cp1252.
camelot does not seem to handle setting an encoding when writing to an html file.
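A workaround might be to enable Python's UTF-8 mode (available since Python 3.7) by setting the PYTHONUTF8 environment variable to 1 before running your program:
C:\> set PYTHONUTF8=1
C:\> python myprog.py
In UTF-8 mode, open() defaults to UTF-8 instead of cp1252. Note that the PYTHONIOENCODING variable, which is often suggested for this error, only changes the encoding of stdin, stdout and stderr, so it does not affect files that camelot opens with open().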

Strange UnicodeEncodeError/AttributeError in my script

Currently I am writing a script in Python 2.7 that works fine, except that after running for a few seconds it hits an error:
Enter Shopify website URL (without HTTP): store.highsnobiety.com
Scraping! Check log file # z:\shopify_output.txt to see output.
!!! Also make sure to clear file every hour or so !!!
Copper Bracelet - 3mm - Polished ['3723603267']
Traceback (most recent call last):
  File "shopify_sitemap_scraper.py", line 38, in <module>
    print(prod, variants).encode('utf-8')
AttributeError: 'NoneType' object has no attribute 'encode'
The script is to get data from a Shopify website and then print it to console. Code here:
# -*- coding: utf-8 -*-
from __future__ import print_function
from lxml.html import fromstring
import requests
import time
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
# Log file location, change "z://shopify_output.txt" to your location.
logFileLocation = "z:\shopify_output.txt"
log = open(logFileLocation, "w")
# URL of Shopify website from user input (for testing, just use store.highsnobiety.com during input)
url = 'http://' + raw_input("Enter Shopify website URL (without HTTP): ") + '/sitemap_products_1.xml'
print ('Scraping! Check log file # ' + logFileLocation + ' to see output.')
print ("!!! Also make sure to clear file every hour or so !!!")
while True :
    page = requests.get(url)
    tree = fromstring(page.content)
    # skip first url tag with no image:title
    url_tags = tree.xpath("//url[position() > 1]")
    data = [(e.xpath("./image/title//text()")[0],e.xpath("./loc/text()")[0]) for e in url_tags]
    for prod, url in data:
        # add xml extension to url
        page = requests.get(url + ".xml")
        tree = fromstring(page.content)
        variants = tree.xpath("//variants[@type='array']//id[@type='integer']//text()")
        print(prod, variants).encode('utf-8')
The craziest part is that when I take out the .encode('utf-8'), it gives me a UnicodeEncodeError instead, seen here:
Enter Shopify website URL (without HTTP): store.highsnobiety.com
Scraping! Check log file # z:\shopify_output.txt to see output.
!!! Also make sure to clear file every hour or so !!!
Copper Bracelet - 3mm - Polished ['3723603267']
Copper Bracelet - 5mm - Brushed ['3726247811']
Copper Bracelet - 7mm - Polished ['3726253635']
Highsnobiety x EARLY - Leather Pouch ['14541472963', '14541473027', '14541473091']
Traceback (most recent call last):
  File "shopify_sitemap_scraper.py", line 38, in <module>
    print(prod, variants)
  File "C:\Python27\lib\encodings\cp437.py", line 12, in encode
    return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode character u'\xae' in position 13: character maps to <undefined>
Any ideas? Have no idea what else to try after hours of googling.
snakecharmerb almost got it, but missed the cause of your first error. Your code
print(prod, variants).encode('utf-8')
means you print the values of the prod and variants variables, then try to run the encode() function on the output of print. Unfortunately, print() (as a function in Python 2 and always in Python 3) returns None. To fix it, use the following instead:
print(prod.encode("utf-8"), variants)
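The point about print() returning None can be checked directly. The snippet below is a small Python 3 illustration with made-up sample values. It is not the asker's script, and in Python 3 the encoded product name is displayed as a bytes literal.

```python
prod = "Copper Bracelet - 3mm - Polished"
variants = ["3723603267"]

# print() writes to stdout and returns None ...
result = print(prod.encode("utf-8"), variants)

# ... so chaining .encode() onto the call fails after the text was printed.
try:
    print(prod, variants).encode("utf-8")
    error_message = ""
except AttributeError as exc:
    error_message = str(exc)

print(result is None, error_message)
```

This matches the behaviour in the question: the first product line appears on the console, and only then is the AttributeError raised.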
Your console has a default encoding of cp437, and cp437 is unable to represent the character u'\xae'.
>>> print (u'\xae')
®
>>> print (u'\xae'.encode('utf-8'))
b'\xc2\xae'
>>> print (u'\xae'.encode('cp437'))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.5/encodings/cp437.py", line 12, in encode
    return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode character '\xae' in position 0: character maps to <undefined>
You can see that it's trying to convert to cp437 in the traceback:
File "C:\Python27\lib\encodings\cp437.py", line 12, in encode
(I reproduced the problem in Python3.5, but it's the same issue in both versions of Python)
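One way to keep such a script running is to replace whatever the console cannot show before printing. The following is a sketch for Python 3. The helper name safe_for_console is made up for illustration, and cp437 is hard-coded as the default only because it is the code page in the traceback; in a real script sys.stdout.encoding would normally be passed instead.

```python
def safe_for_console(text, console_encoding="cp437"):
    """Replace characters the console encoding cannot represent with '?'."""
    encoded = text.encode(console_encoding, errors="replace")
    return encoded.decode(console_encoding)


title = "Brand\u00ae Bracelet"  # contains the registered sign from the error
printable = safe_for_console(title)
print(printable)
```

The trade-off is that the registered sign is lost in the console output, so the original string should still be used when writing to a UTF-8 log file.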

nltk NERTagger UnicodeDecodeError in python

I am writing a program in Python 2.7.6 that uses nltk with the Stanford named entity tagger on Windows 7 Professional to tag a text and print the result as follows:
import re
from nltk.tag.stanford import NERTagger
WORD = re.compile(r'\w+')
st = NERTagger("./classifiers/english.all.3class.distsim.crf.ser.gz", "stanford-ner.jar")
text = "title Wienfilm 1896-1976 (1976)"
words = WORD.findall(text )
print words
answer = st.tag(words )
print answer
The last print statement in the program is supposed to print a list of five tuples:
[(u'title', u'O'), (u'Wienfilm', u'O'), (u'1896', u'O'), (u'1976', u'O'), (u'1976', u'O')]
However when I run the program, it gives me the following error message:
['title', 'Wienfilm', '1896', '1976', '1976']
Traceback (most recent call last):
  File "E:\Google Drive\myPyPrgs\testNLP.py", line 27, in <module>
    answer = st.tag(words )
  File "C:\Python27\lib\site-packages\nltk\tag\stanford.py", line 59, in tag
    return self.tag_sents([tokens])[0]
  File "C:\Python27\lib\site-packages\nltk\tag\stanford.py", line 82, in tag_sents
    stanpos_output = stanpos_output.decode(encoding)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xa0 in position 23: ordinal not in range(128)
Note that if I remove the number, '-1976' from the text string the program tags and prints the correct answer. But if the number '-1976' is within the text, I always have the above error.
In this forum, somebody suggested that I change the default encoding in nltk's stanford.py. When I changed the default encoding in stanford.py from ascii to UTF-16 and replaced the last print statement of the above code with the following loop:
for i, word_pos in enumerate(answer):
    word, pos = word_pos
    print i , word.encode('utf-16'), pos.encode('utf-16')
I got the following incorrect output:
0 ÿþ ÿþtitle/O Wienfilm/O 1896 1976 1976/O
Please any clues on how to deal with this issue? Thanks in advance.
This worked for me: specify the encoding argument as UTF-8 when you create NERTagger object
st = NERTagger("./classifiers/english.all.3class.distsim.crf.ser.gz", "stanford-ner.jar", encoding='utf-8')
Open a terminal (cmd) and type:
chcp
It should return something like:
active code page: 857
Then type:
chcp 1254
After that, add this line at the top of your .py script:
# -*- coding: cp1254 -*-
This should solve your problem. If it does not, copy these lines and paste them at the top of your script:
# -*-coding:utf-8-*-
import locale
locale.setlocale(locale.LC_ALL, '')
I have had many decoding problems before, and these methods solved them.
ASCII can represent only 2^7 = 128 characters, which is why you are getting that error, as the message "ordinal not in range(128)" indicates.
And check this website please. Use arrow keys for switching pages :-) I believe it's going to solve your problem.
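The "ordinal not in range(128)" part of the message can be reproduced in isolation. The bytes below are a made-up stand-in for the tagger output, not the real data, and the snippet is written for Python 3. Decoding with Latin-1 is shown only to make the byte visible; it is not a claim about which encoding the tagger actually uses.

```python
raw = b"1896\xa01976"  # hypothetical output containing byte 0xa0

# The ascii codec only accepts bytes 0-127, so 0xa0 (160) is rejected.
try:
    raw.decode("ascii")
    ascii_error = None
except UnicodeDecodeError as exc:
    ascii_error = exc

# Latin-1 maps every byte 0-255 to a code point; 0xa0 is NO-BREAK SPACE.
as_latin1 = raw.decode("latin-1")

print(ascii_error.start, repr(as_latin1))
```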
At the top of your app add:
import sys
reload(sys)
sys.setdefaultencoding("utf-8")
I was dealing with the same problem and I solved it by adding the encoding options on internals.py in nltk.
You must open internals.py, saved at:
%YourPythonFolder%\Lib\site-packages\nltk\internals.py
Then go to the method java and add this line after the comment # Construct the full command string (about line 147):
cmd = cmd + ['-inputEncoding', 'utf-8', '-outputEncoding', 'utf-8']
This section of the code must then look like:
# Construct the full command string.
cmd = list(cmd)
cmd = ['-cp', classpath] + cmd
cmd = [_java_bin] + _java_options + cmd
cmd = cmd + ['-inputEncoding', 'utf-8', '-outputEncoding', 'utf-8']
Hope it helps.

Getting error when trying to rename multiple files with python

I have 112 music files in a folder. All of them start with the type of music, like 【House】Supermans Feinde Shine.
All of them start with that 【, and I want to rename them to the form House - Supermans Feinde Shine.
I have tried:
import os
for filename in os.listdir("C:/MYMUSICSFOLDER"):
    if filename.startswith("【"):
        os.rename(filename, filename[7:])
but I get:
Error : sys:1: DeprecationWarning: Non-ASCII character '\xe3' in file C:\MYPROGRAMSFOLDER\ne11.py on line 6,but no enconding declared
How do I do that? Rename all of the music files this way?
I tried various code ... but I can't do that.
I have a program that plays music when I say "songs", but when I try to do that I get an error; all other functions work perfectly.
Here's the code ...
import os,sys,random
import webbrowser
import speech
import sys

def callback(phrase, listener):
    print ": %s" % phrase
    if phrase == "songs":
        folder = os.listdir("C:/users/william/desktop/music/xkito music")
        file = random.choice(folder)
        ext3= ['.mp3','.mp4','.wmv']
        while file[-4:] not in ext3 :
            file = random.choice(folder)
        else:
            os.startfile(file)
            speech.say('Playing Music')
    if phrase == "open opera":
        webbrowser.open('http://www.google.com')
        speech.say("Opening opera")
    if phrase == "turn off":
        speech.say("Goodbye.")
        listener.stoplistening()
        sys.exit()

print "Anything you type, speech will say back."
print "Anything you say, speech will print out."
print "Say or type 'turn off' to quit."
print
listener= speech.listenforanything(callback)
while listener.islistening():
    text = raw_input("> ")
    if text == "turn off":
        listener.stoplistening()
        sys.exit()
    else:
        speech.say(text)
And I'm getting this error when trying to execute the music:
pythoncom error: Python error invoking COM method.
Traceback (most recent call last):
File "C:\Python24\Lib\site-packages\win32com\server\policy.py", line 277, in _Invoke_
return self._invoke_(dispid, lcid, wFlags, args)
File "C:\Python24\Lib\site-packages\win32com\server\policy.py", line 282, in _invoke_
return S_OK, -1, self._invokeex_(dispid, lcid, wFlags, args, None, None)
File "C:\Python24\Lib\site-packages\win32com\server\policy.py", line 585, in _invokeex_
return func(*args)
File "C:\Users\william\Desktop\speech-0.5.2\speech.py", line 138, in OnRecognition
self._callback(phrase, self._listener)
File "C:\Users\william\Desktop\speech-0.5.2\example.py", line 21, in callback
os.startfile(file)
WindowsError: [Errno 2] The system can not find the specified file: '?Glitch Hop?Chinese Man - I Got That Tune (Tha Trickaz Remix) [Free Download].mp4
That ? in the beginning of the name is 【 and 】
The error is about handling non-ASCII (Unicode) characters.
I would think that filename[7:] even splits a UTF-8 character between two of its bytes, so that rename() sees a partial character.
The right way to fix it is to handle UTF-8 correctly, of course.
But one way to work around it altogether is to not work with individual bytes of the string, but with whole strings only, so that the encoding is not relevant:
To convert 【House】Superfoo to House - Superfoo, you can
replace 【 with the empty string, and 】 with " - ".
Use the result of that as the new file name. If the original name is not of the expected format, the name is not changed, and nothing happens. It's not an error, the program does not even notice.
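The replace-based idea could look like the sketch below. It is written for Python 3, where file names are Unicode strings, and the function names are made up for illustration. The brackets are written as the escapes \u3010 and \u3011 so the script needs no encoding declaration. It also joins each name with the folder path, since os.listdir returns bare file names.

```python
import os


def strip_genre_brackets(name):
    """Turn '【House】Title' into 'House - Title'; other names stay unchanged."""
    return name.replace("\u3010", "").replace("\u3011", " - ")


def rename_music_files(folder):
    """Rename every file in folder whose name contains the brackets."""
    renamed = []
    for old_name in os.listdir(folder):
        new_name = strip_genre_brackets(old_name)
        if new_name != old_name:
            # os.listdir returns bare names, so join them with the folder.
            os.rename(os.path.join(folder, old_name),
                      os.path.join(folder, new_name))
            renamed.append(new_name)
    return renamed
```

Calling rename_music_files("C:/MYMUSICSFOLDER") would then rename the matching files in place and leave all other files alone.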
