Python cPickle unable to load an OCR model library

I have just installed the ocropus OCR package with all its dependencies on my Windows 7 machine (I am using 32-bit Python 2.7). It seems to work fine, except that I cannot load the default OCR model, en-default.pyrnn.gz; I get a traceback instead. I am using the following command:
python ocropus-rpred -m en-default.pyrnn.gz book\0001\*.png
Here is the error:
INFO: #inputs47
# loading object /usr/local/share/ocropus/en-default.pyrnn.gz
Traceback (most recent call last):
File "ocropus-rpred" line 109, in <module>
network = ocrolib.load_object(args.model,verbose=1)
File "C:\anaconda32\lib\site-packages\ocrolib\common.py", line 513, in load_object
return unpickler.load()
EOFError
I have checked that the file is not empty, double-checked that the binary mode flags ("wb" and "rb") are set, and converted the newlines of common.py using dos2unix. I am still unable to solve this problem. If anyone has experienced a similar issue, kindly share. Here is the relevant code from common.py:
import cPickle
import gzip
def save_object(fname,obj,zip=0):
    if zip==0 and fname.endswith(".gz"):
        zip = 1
    if zip>0:
        # with gzip.GzipFile(fname,"wb") as stream:
        with os.popen("gzip -9 > '%s'"%fname,"wb") as stream:
            cPickle.dump(obj,stream,2)
    else:
        with open(fname,"wb") as stream:
            cPickle.dump(obj,stream,2)
def unpickle_find_global(mname,cname):
    if mname=="lstm.lstm":
        return getattr(lstm,cname)
    if not mname in sys.modules.keys():
        exec "import "+mname
    return getattr(sys.modules[mname],cname)
def load_object(fname,zip=0,nofind=0,verbose=0):
    """Loads an object from disk. By default, this handles zipped files
    and searches in the usual places for OCRopus. It also handles some
    class names that have changed."""
    if not nofind:
        fname = ocropus_find_file(fname)
    if verbose:
        print "# loading object",fname
    if zip==0 and fname.endswith(".gz"):
        zip = 1
    if zip>0:
        # with gzip.GzipFile(fname,"rb") as stream:
        with os.popen("gunzip < '%s'"%fname,"rb") as stream:
            unpickler = cPickle.Unpickler(stream)
            unpickler.find_global = unpickle_find_global
            return unpickler.load()
    else:
        with open(fname,"rb") as stream:
            unpickler = cPickle.Unpickler(stream)
            unpickler.find_global = unpickle_find_global
            return unpickler.load()
UPDATE: I have switched to Python's native gzip module, and it is working fine. Thank you for pointing that out. Here is the line that works on Windows: with gzip.GzipFile(fname,"rb") as stream:

Your use of gunzip (in the load_object function) is incorrect. Unless passed the -c argument, gunzip writes the decompressed data to a new file, not to its stdout (which is what you seem to be attempting to do).
As a result, it doesn't write anything to its stdout, and your stream variable contains no data, hence the EOFError.
A quick fix is to change your gunzip command line to give it the -c argument.
More info here: http://linux.die.net/man/1/gzip
That said, why are you even shelling out to gunzip to decompress your data? Python's built-in gzip module should handle that without problems.
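A minimal sketch of the gzip-module approach, written for Python 3 (where cPickle has been folded into pickle). The function names mirror the ones in the question, but this is not the ocrolib implementation: it leaves out the file search and the find_global hook, and simply shows that no external gzip or gunzip process is needed.

```python
import gzip
import pickle


def save_object(fname, obj):
    """Pickle obj to fname, compressing when the name ends in .gz."""
    opener = gzip.open if fname.endswith(".gz") else open
    with opener(fname, "wb") as stream:
        pickle.dump(obj, stream, protocol=2)


def load_object(fname):
    """Load a pickled object, decompressing when the name ends in .gz."""
    opener = gzip.open if fname.endswith(".gz") else open
    with opener(fname, "rb") as stream:
        return pickle.load(stream)
```

Because both directions go through the same module, the behaviour does not depend on which command-line tools happen to be installed on the machine.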

Related

Specify a download path using wget module in python

I'm trying to download files from a site using the wget module.
The code is really simple:
image = 'linkoftheimage'
wget.download(image)
This works fine, but it saves the file in the folder with the python script. My goal is to download it in a different folder, but I can't find a way to specify it.
I tried a different approach with the os module:
os.system(f'wget -O {directory} {image}')
This method gives me an error: sh: -c: line 0: syntax error near unexpected token `('
So I tried another method:
with open(f'{directory}/photo %s.jpg' %a,'wb') as handler:
    handler.write(image)
This didn't work either.
Does anyone have an idea how I could solve this?
The package you specified has not been updated since 2015 and its repository is gone, so it should probably be avoided. You can download files using the third-party requests library (installable with pip) like so:
import requests
image_url = 'https://www.fillmurray.com/200/300'
file_destination = 'desired/destination/file.jpg'
res = requests.get(image_url)
if res.status_code == 200: # http 200 means success
    with open(file_destination, 'wb') as file_handle: # wb means Write Binary
        file_handle.write(res.content)
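If installing a third-party package is not an option, roughly the same thing can be done with the standard library. The sketch below is only an illustration: download_to is a hypothetical helper name, and it assumes the last path component of the URL is a usable file name.

```python
import os
import shutil
import urllib.parse
import urllib.request


def download_to(url, directory):
    """Download url into directory and return the path of the saved file."""
    os.makedirs(directory, exist_ok=True)
    # Derive the file name from the last path component of the URL.
    name = os.path.basename(urllib.parse.urlparse(url).path) or "download"
    destination = os.path.join(directory, name)
    # Stream the response body straight into the destination file.
    with urllib.request.urlopen(url) as response, open(destination, "wb") as out:
        shutil.copyfileobj(response, out)
    return destination
```

Calling download_to(image, directory) with the variables from the question would then save the file under the chosen folder instead of next to the script.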

WSAStartup error (10093) when calling exiftool through PyExifTool

I have installed PyExifTool (https://smarnach.github.io/pyexiftool/). The installation was successful. However, when I try to run the example code provided there:
import exiftool
files = ["test.jpg"]
with exiftool.ExifTool() as et:
    metadata = et.get_metadata_batch(files)
for d in metadata:
    print("{:20.20} {:20.20}".format(d["SourceFile"],
                                     d["EXIF:DateTimeOriginal"]))
I am getting this error:
Traceback (most recent call last):
File "extract_metadata_03.py", line 5, in <module>
metadata = et.get_metadata_batch(files)
File "c:\Python38\lib\site-packages\exiftool.py", line 264, in get_metadata_batch
return self.execute_json(*filenames)
File "c:\Python38\lib\site-packages\exiftool.py", line 256, in execute_json
return json.loads(self.execute(b"-j", *params).decode("utf-8"))
File "c:\Python38\lib\site-packages\exiftool.py", line 227, in execute
inputready,outputready,exceptready = select.select([fd],[],[])
OSError: [WinError 10093] Either the application has not called WSAStartup, or WSAStartup failed
I have tried with the exiftool.exe Version 11.91 stand-alone Windows executable (from https://exiftool.org/) in my path, as well as installing exiftool using Oliver Betz's ExifTool Windows installer (https://oliverbetz.de/pages/Artikel/ExifTool-for-Windows).
I have tried two separate Python installations (Python 3.8 and also Python 2.7) with the same behaviour.
Any assistance with this or suggestions for troubleshooting would be greatly appreciated.
You get the error because the select.select() call used in exiftool.py does not work on Windows, where select() only accepts sockets and not the pipe file descriptor that exiftool.py passes to it. To solve this you can manually add the following to exiftool.py:
if sys.platform == 'win32':
    # windows does not support select() for anything except sockets
    # https://docs.python.org/3.7/library/select.html
    output += os.read(fd, block_size)
else:
    # this does NOT work on windows... and it may not work on other systems... in that case, put more things to use the original code above
    inputready,outputready,exceptready = select.select([fd],[],[])
    for i in inputready:
        if i == fd:
            output += os.read(fd, block_size)
Source: https://github.com/sylikc/pyexiftool/commit/03a8595a2eafc61ac21deaa1cf5e109c6469b17c
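To show how that platform check fits into a complete read loop, here is a self-contained sketch. It is not the PyExifTool code: read_until is a hypothetical helper, and the sentinel is assumed to be whatever marker the child process prints when it has finished a command.

```python
import os
import select
import sys


def read_until(fd, sentinel, block_size=4096):
    """Read from file descriptor fd until the data ends with sentinel."""
    output = b""
    while not output.rstrip().endswith(sentinel):
        if sys.platform != "win32":
            # select() on a pipe only works outside Windows; there it
            # just waits until the descriptor has data to read.
            select.select([fd], [], [])
        # os.read() blocks until data arrives, so on Windows it is
        # called directly without select().
        chunk = os.read(fd, block_size)
        if not chunk:
            break  # the writer closed the pipe
        output += chunk
    return output
```

The blocking os.read() call is what makes the select() step optional: select() only adds value when several descriptors or a timeout are involved.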

Python mrjob mapreduce how to preprocess the input file

I am trying to pre-process an XML file to extract certain nodes before passing it to mapreduce. I have the following code:
from mrjob.compat import jobconf_from_env
from mrjob.job import MRJob
from mrjob.util import cmd_line, bash_wrap
class MRCountLinesByFile(MRJob):
    def configure_options(self):
        super(MRCountLinesByFile, self).configure_options()
        self.add_file_option('--filter')
    def mapper_cmd(self):
        cmd = cmd_line([self.options.filter, jobconf_from_env('mapreduce.map.input.file')])
        return cmd
if __name__ == '__main__':
    MRCountLinesByFile.run()
And on the command line, I type:
python3 test_job_conf.py --filter ./filter.py -r local < test.txt
test.txt is an ordinary XML file, and filter.py is a script that finds all the title information.
However, I am getting the following errors:
Creating temp directory /tmp/test_job_conf.vagrant.20160406.042648.689625
Running step 1 of 1...
Traceback (most recent call last):
File "./filter.py", line 8, in <module>
with open(filename) as f:
FileNotFoundError: [Errno 2] No such file or directory: 'None'
Step 1 of 1 failed: Command '['./filter.py', 'None']' returned non-zero exit status 1
It looks like mapreduce.map.input.file evaluates to None in this case. How can I make the mapper_cmd function read the file that mrjob is currently reading?
As far as I understand, your self.add_file_option call should be given the path to your file:
self.add_file_option('--items', help='Path to u.item')
I may not have fully understood your scenario, but here is my understanding.
You use the configure option to make sure a given file is sent to all the mappers, for example when you want to do an ancillary lookup on data in a file other than the source. That ancillary lookup file is made available by self.add_file_option('--items', help='Path to u.item').
To preprocess something before a mapper or reducer phase, you use mapper_init or reducer_init. These init steps also need to be listed in your steps function, for example:
def steps(self):
    return [
        MRStep(mapper=self.mapper_get_name,
               reducer_init=self.reducer_init,
               reducer=self.reducer_count_name),
        MRStep(reducer=self.reducer_find_maxname)
    ]
Within your init function you do the actual pre-processing that needs to happen before data reaches the mapper or reducer. For example, open a file xyz and store the first field of each line as a key, with the second field as its value, for later use in the reducer:
def reducer_init(self):
    self.movieNames = {}
    with open("xyz") as f:
        for line in f:
            fields = line.split('|')
            self.movieNames[fields[0]] = fields[1]
Hope this helps!!
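One plausible reason for the 'None' in the question is timing: mapper_cmd() builds the command line when the job is set up, before any task exists, whereas Hadoop Streaming only exports job configuration to the environment of a running task, with the dots in each property name replaced by underscores. A sketch, assuming filter.py is free to look the file name up itself at run time (current_input_file is a hypothetical helper, and the thread does not confirm whether mrjob's local runner sets these variables for a job fed from stdin):

```python
import os


def current_input_file(environ=None):
    """Return the input file of the running streaming task, or None."""
    environ = os.environ if environ is None else environ
    # Newer Hadoop versions export mapreduce.map.input.file as
    # mapreduce_map_input_file; older ones used map_input_file.
    for key in ("mapreduce_map_input_file", "map_input_file"):
        value = environ.get(key)
        if value:
            return value
    return None
```

Inside filter.py, a None result would be the cue to fall back to reading the records from standard input rather than trying to open a file.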

How to know which file is calling which file, filesystem

How can I find out which file is calling which other file in the filesystem? For example, file1.exe calls file2.exe, so file2.exe is modified and file1.exe is recorded in a log file. I am on Windows.
I have searched the Internet but was not able to find any samples.
To find out which file is calling which, you can use the trace module.
For example, suppose you have two files:
***file1.py***
import file2
def call1():
    file2.call2()
***file2.py***
def call2():
    print "---------"
You can run it from the console:
$ python -m trace --trackcalls path/to/file1.py
or within a program using a Trace object:
***tracefile.py***
import trace,sys
from file1 import call1
#specify what to trace here
tracer = trace.Trace(ignoredirs=[sys.prefix, sys.exec_prefix], trace=0, count=1)
tracer.runfunc(call1) #call the function call1 in file1
results = tracer.results()
results.write_results(summary=True, coverdir='.')
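The command-line --trackcalls option corresponds to the countcallers argument of trace.Trace, so the calling relationships can also be collected in-process. The following Python 3 sketch keeps both functions in one file so that it runs on its own; caller, callee and call_pairs are illustrative names, not part of the answer above.

```python
import trace


def callee():
    return "done"


def caller():
    return callee()


def call_pairs(func):
    """Run func under the tracer and return (calling, called) function names."""
    tracer = trace.Trace(count=False, trace=False, countcallers=True)
    tracer.runfunc(func)
    # Each key is ((file, module, function), (file, module, function)):
    # the first triple is the caller, the second the function it called.
    callers = tracer.results().callers
    return sorted({(parent[2], child[2]) for parent, child in callers})
```

Running call_pairs(caller) yields a list of name pairs that should include the caller-to-callee relationship; the file and module entries of each triple are available in the same dictionary if the file names are what you need.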

Pickle cross platform __dict__ attribute error

I'm having an issue with pickle. Things work fine between OSX and Linux, but not Windows and Linux. All pickled strings are stored in memory and sent via an SSL socket. To be 100% clear I have replaced all '\n's with ":::" and all '\r's with "===" (there were none). Scenario:
Client-Win: Small Business Server 2011 running Python 2.7
Client-Lin: Fedora Linux running Python 2.7
Server: Fedora Linux running Python 2.7
Client-Lin sends a pickled object to Server:
ccopy_reg:::_reconstructor:::p0:::(c__main__:::infoCollection:::p1:::c__builtin__:::tuple:::p2:::(VSTRINGA:::p3:::VSTRINGB:::p4:::VSTRINGC:::p5:::tp6:::tp7:::Rp8:::.
Client-Win sends a pickled object to Server:
ccopy_reg:::_reconstructor:::p0:::(c__main__:::infoCollection:::p1:::c__builtin__:::tuple:::p2:::(VSTRINGA:::p3:::VSTRINGB:::p4:::VSTRINGC:::p5:::tp6:::tp7:::Rp8:::ccollections:::OrderedDict:::p9:::((lp10:::(lp11:::S'string_a':::p12:::ag3:::aa(lp13:::S'string_b':::p14:::ag4:::aa(lp15:::S'string_c':::p16:::ag5:::aatp17:::Rp18:::b.
For some reason the Windows client sends extra information along with the pickle, and when the Linux client tries to load the pickle string I get:
Unhandled exception in thread started by <function TestThread at 0x107de60>
Traceback (most recent call last):
File "./test.py", line 212, in TestThread
info = pickle.loads(p_string)
File "/usr/lib64/python2.7/pickle.py", line 1382, in loads
return Unpickler(file).load()
File "/usr/lib64/python2.7/pickle.py", line 858, in load
dispatch[key](self)
File "/usr/lib64/python2.7/pickle.py", line 1224, in load_build
d = inst.__dict__
AttributeError: 'infoCollection' object has no attribute '__dict__'
Any ideas?
EDIT
Adding additional requested information.
The infoCollection class is defined the same way:
infoCollection = collections.namedtuple('infoCollection', 'string_a, string_b, string_c')
def runtest():
    info = infoCollection('STRINGA', 'STRINGB', 'STRINGC')
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    ssl_sock = ssl.wrap_socket(s, ssl_version=ssl.PROTOCOL_TLSv1)
    ssl_sock.connect((server, serverport))
    ssl_sock.write(pickle.dumps(info))
    ssl_sock.close()
And the receiving function is much the same but does a
p_string = ssl_sock.read()
info = pickle.loads(p_string)
Are you using different minor versions of Python? There's a bug in 2.7.3 that makes pickling namedtuples incompatible with older versions. See this:
http://ronrothman.com/public/leftbraned/python-2-7-3-bug-broke-my-namedtuple-unpickling/
A hack, but the cross-platform issue appears to be due to namedtuples and pickle together in a cross-platform environment. I have replaced the namedtuple with my own class and all works well.
class infoClass(object):
    pass
def infoCollection(string_a, string_b, string_c):
    i = infoClass()
    i.string_a = string_a
    i.string_b = string_b
    i.string_c = string_c
    return i
Have you tried saving as a binary pickle file?
with open('pickle.file', 'wb') as po:
    pickle.dump(obj, po)
Also, if you're moving data between operating systems and info is just a namedtuple, have you looked at JSON? It's generally considered safer than pickle.
with open('pickle.json', 'w') as po:
    json.dump(obj, po)
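One detail worth knowing about the JSON route: json.dump on a namedtuple writes a plain JSON array, so the field names are lost. A small sketch that keeps them by going through a dictionary (the InfoCollection class and the two helper names are illustrative, modelled on the question's namedtuple):

```python
import json
from collections import namedtuple

InfoCollection = namedtuple("InfoCollection", "string_a string_b string_c")


def to_json(info):
    """Serialise a namedtuple as a JSON object keyed by field name."""
    return json.dumps(info._asdict())


def from_json(text):
    """Rebuild the namedtuple from the JSON object produced by to_json."""
    return InfoCollection(**json.loads(text))
```

The resulting text contains only strings and a mapping, so it does not depend on how a particular Python version pickles namedtuples.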
Edit
From the ssl .read() docs, it seems that .read() reads at most 1024 bytes by default, and I'll wager that your pickled info namedtuple may be larger than that. It is difficult to know a priori how big info is, and I don't know whether simply passing a huge byte count would do the trick (I think perhaps not).
What happens if you do the following?
p_string = ssl_sock.read(1000000)
info = pickle.loads(p_string)
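A single read on a stream socket may return fewer bytes than were sent, whatever limit is passed, so a more robust pattern is to keep reading until the peer closes the connection. The sketch below assumes the sender closes the socket after writing, as runtest() in the question does; recv_all is a hypothetical helper, shown here with the generic recv() method in Python 3.

```python
import pickle


def recv_all(sock, chunk_size=4096):
    """Read from sock until the peer closes the connection."""
    chunks = []
    while True:
        chunk = sock.recv(chunk_size)
        if not chunk:
            break  # an empty result means the sender has closed its side
        chunks.append(chunk)
    return b"".join(chunks)


def recv_object(sock):
    """Receive one complete pickle from sock and load it."""
    return pickle.loads(recv_all(sock))
```

As with any use of pickle over a network, this should only be done with peers you trust, since loading a pickle can execute arbitrary code.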
Just install Python 2.7.8 from https://www.python.org/ftp/python/2.7.8/python-2.7.8.amd64.msi
