spaCy: errors attempting to load serialized Doc - python

I am trying to serialize/deserialize spaCy documents (setup is Windows 7, Anaconda) and am getting errors. I haven't been able to find any explanations. Here is a snippet of code and the error it generates:
import spacy
nlp = spacy.load('en')
text = 'This is a test.'
doc = nlp(text)
fout = 'test.spacy' # <-- according to the API for Doc.to_disk(), this needs to be a directory (but for me, spaCy writes a file)
doc.to_disk(fout)
doc.from_disk(fout)
Traceback (most recent call last):
File "<ipython-input-7-aa22bf1b9689>", line 1, in <module>
doc.from_disk(fout)
File "doc.pyx", line 763, in spacy.tokens.doc.Doc.from_disk
File "doc.pyx", line 806, in spacy.tokens.doc.Doc.from_bytes
ValueError: [E033] Cannot load into non-empty Doc of length 5.
I have also tried creating a new Doc object and loading from that, as shown in the example ("Example: Saving and loading a document") in the spaCy docs, which results in a different error:
from spacy.tokens import Doc
from spacy.vocab import Vocab
new_doc = Doc(Vocab()).from_disk(fout)
Traceback (most recent call last):
File "<ipython-input-16-4d99a1199f43>", line 1, in <module>
Doc(Vocab()).from_disk(fout)
File "doc.pyx", line 763, in spacy.tokens.doc.Doc.from_disk
File "doc.pyx", line 838, in spacy.tokens.doc.Doc.from_bytes
File "stringsource", line 646, in View.MemoryView.memoryview_cwrapper
File "stringsource", line 347, in View.MemoryView.memoryview.__cinit__
ValueError: buffer source array is read-only
EDIT:
As pointed out in the replies, the path provided should be a directory. However, the first code snippet creates a file. Changing this to a non-existing directory path doesn't help as spaCy still creates a file. Attempting to write to an existing directory causes an error too:
fout = 'data'
doc.to_disk(fout) Traceback (most recent call last):
File "<ipython-input-8-6c30638f4750>", line 1, in <module>
doc.to_disk(fout)
File "doc.pyx", line 749, in spacy.tokens.doc.Doc.to_disk
File "C:\Users\Username\AppData\Local\Continuum\anaconda3\lib\pathlib.py", line 1161, in open
opener=self._opener)
File "C:\Users\Username\AppData\Local\Continuum\anaconda3\lib\pathlib.py", line 1015, in _opener
return self._accessor.open(self, flags, mode)
File "C:\Users\Username\AppData\Local\Continuum\anaconda3\lib\pathlib.py", line 387, in wrapped
return strfunc(str(pathobj), *args)
PermissionError: [Errno 13] Permission denied: 'data'
Python has no problem writing at this location via standard file operations (open/read/write).
Trying with a Path object yields the same results:
from pathlib import Path
import os
fout = Path(os.path.join(os.getcwd(), 'data'))
doc.to_disk(fout)
Traceback (most recent call last):
File "<ipython-input-17-6c30638f4750>", line 1, in <module>
doc.to_disk(fout)
File "doc.pyx", line 749, in spacy.tokens.doc.Doc.to_disk
File "C:\Users\Username\AppData\Local\Continuum\anaconda3\lib\pathlib.py", line 1161, in open
opener=self._opener)
File "C:\Users\Username\AppData\Local\Continuum\anaconda3\lib\pathlib.py", line 1015, in _opener
return self._accessor.open(self, flags, mode)
File "C:\Users\Username\AppData\Local\Continuum\anaconda3\lib\pathlib.py", line 387, in wrapped
return strfunc(str(pathobj), *args)
PermissionError: [Errno 13] Permission denied: 'C:\\Users\\Username\\workspace\\data'
Any ideas why this might be happening?

doc.to_disk(fout)
must be
a path to a directory, which will be created if it doesn't exist.
Paths may be either strings or Path-like objects.
as the documentation for spaCy states in https://spacy.io/api/doc
Try changing fout to a directory, it might do the trick.
EDIT:
Examples from the spacy documentation:
for doc.to_disk:
doc.to_disk('/path/to/doc')
and for doc.from_disk:
from spacy.tokens import Doc
from spacy.vocab import Vocab
doc = Doc(Vocab()).from_disk('/path/to/doc')

Related

Create binary of spaCy with PyInstaller

I would like to create a binary of my python code, that contains spaCy.
# main.py
import spacy
import en_core_web_sm
def main() -> None:
nlp = spacy.load("en_core_web_sm")
# nlp = en_core_web_sm.load()
doc = nlp("This is an example")
print([(w.text, w.pos_) for w in doc])
if __name__ == "__main__":
main()
Besides my code, I created two PyInstaller-hooks, as described here
To create the binary I use the following command pyinstaller main.py --additional-hooks-dir=..
On the execution of the binary I get the following error message:
Traceback (most recent call last):
File "main.py", line 19, in <module>
main()
File "main.py", line 12, in main
nlp = spacy.load("en_core_web_sm")
File "spacy/__init__.py", line 47, in load
File "spacy/util.py", line 329, in load_model
OSError: [E050] Can't find model 'en_core_web_sm'. It doesn't seem to be a Python package or a valid path to a data directory.
If I use nlp = en_core_web_sm.load() instead if nlp = spacy.load("en_core_web_sm") to load the spacy model, I get the following error:
Traceback (most recent call last):
File "main.py", line 19, in <module>
main()
File "main.py", line 13, in main
nlp = en_core_web_sm.load()
File "en_core_web_sm/__init__.py", line 10, in load
File "spacy/util.py", line 514, in load_model_from_init_py
File "spacy/util.py", line 389, in load_model_from_path
File "spacy/util.py", line 426, in load_model_from_config
File "spacy/language.py", line 1662, in from_config
File "spacy/language.py", line 768, in add_pipe
File "spacy/language.py", line 659, in create_pipe
File "thinc/config.py", line 722, in resolve
File "thinc/config.py", line 771, in _make
File "thinc/config.py", line 826, in _fill
File "thinc/config.py", line 825, in _fill
File "thinc/config.py", line 1016, in make_promise_schema
File "spacy/util.py", line 137, in get
catalogue.RegistryError: [E893] Could not find function 'spacy.Tok2Vec.v1' in function registry 'architectures'. If you're using a custom function, make sure the code is available. If the function is provided by a third-party package, e.g. spacy-transformers, make sure the package is installed in your environment.
I had this same issue. After the error message you posted above, did you see an "Available names: ..." message? This message suggested that spacy.Tok2Vec.v2 was available but not v1. I was able to edit the config file for en_core_web_sm (for me at dist<name>\en_core_web_sm\en_core_web_sm-3.0.0\config.cfg) and change all references for spacy.Tok2Vec.v1 -> spacy.Tok2Vec.v2. I also had to do this for spacy.MaxoutWindowEncoder.v1. It's still a mystery to me as to why I'm having the issue only in the pyinstaller distributable and not my non-compiled script.
I encountered the same issue and nailed it by copying the spacy-legacy package to the compiled destination directory.
You can also hook it up by Pyinstaller but I did not really try that.
I hope my answer helps.

TypeError: expected str, bytes or os.PathLike object, not WindowsPath while converting .py using Pyinstaller

While trying to build an .exe using Pyinstaller, this error is thrown:
133235 INFO: Loading module hook 'hook-matplotlib.backends.py' from 'c:\\users\\jimit vaghela\\appdata\\local\\programs\\python\\python37\\lib\\site-packages\\PyInstaller\\hooks'...
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "c:\users\jimit vaghela\appdata\local\programs\python\python37\lib\site-packages\matplotlib\__init__.py", line 901, in <module>
fail_on_error=True)
File "c:\users\jimit vaghela\appdata\local\programs\python\python37\lib\site-packages\matplotlib\__init__.py", line 796, in _rc_params_in_file
with _open_file_or_url(fname) as fd:
File "c:\users\jimit vaghela\appdata\local\programs\python\python37\lib\contextlib.py", line 112, in __enter__
return next(self.gen)
File "c:\users\jimit vaghela\appdata\local\programs\python\python37\lib\site-packages\matplotlib\__init__.py", line 770, in _open_file_or_url
fname = os.path.expanduser(fname)
File "c:\users\jimit vaghela\appdata\local\programs\python\python37\lib\ntpath.py", line 291, in expanduser
path = os.fspath(path)
TypeError: expected str, bytes or os.PathLike object, not WindowsPath
134074 INFO: Loading module hook 'hook-matplotlib.py' from 'c:\\users\\jimit vaghela\\appdata\\local\\programs\\python\\python37\\lib\\site-packages\\PyInstaller\\hooks'...
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "c:\users\jimit vaghela\appdata\local\programs\python\python37\lib\site-packages\matplotlib\__init__.py", line 901, in <module>
fail_on_error=True)
File "c:\users\jimit vaghela\appdata\local\programs\python\python37\lib\site-packages\matplotlib\__init__.py", line 796, in _rc_params_in_file
with _open_file_or_url(fname) as fd:
File "c:\users\jimit vaghela\appdata\local\programs\python\python37\lib\contextlib.py", line 112, in __enter__
return next(self.gen)
File "c:\users\jimit vaghela\appdata\local\programs\python\python37\lib\site-packages\matplotlib\__init__.py", line 770, in _open_file_or_url
fname = os.path.expanduser(fname)
File "c:\users\jimit vaghela\appdata\local\programs\python\python37\lib\ntpath.py", line 291, in expanduser
path = os.fspath(path)
TypeError: expected str, bytes or os.PathLike object, not WindowsPath
I have found a solution posted on the Stackoverflow which states that there needs to be inserted a code into the backend.py in the Pyinstaller folder. But that does not work either.
What is going wrong here?
So, I found out that matplotlib was the issue. Excluding that in the module argument in Pyinstaller fixed it.
Like this:
pyinstaller --exclude-module matplotlib main.py
That is currently an issue with matplotlib.
I solved it by editing the source file. If you don't want to edit source file, it is said that lower versions of matplotlib like 3.0.3 can be installed to forgo this issue.
I my case it didn't work. Anyways, below are the steps I took.
First open Python interpreter & copy the path which you get as output.
>>> import matplotlib
>>> matplotlib.get_data_path() # copy the below path
'C:\\<Python Path>\\lib\\site-packages\\matplotlib\\mpl-data'
Next, open file C:\<Python Path>\lib\site-packages\PyInstaller\hooks\hook-matplotlib.py. Just to be on safe side make a back-up of this file if you need
from PyInstaller.utils.hooks import exec_statement
# old line; delete this
mpl_data_dir = exec_statement(
"import matplotlib; print(matplotlib.get_data_path())")
# Add this line
mpl_data_dir = 'C:\\<Python Path>\\lib\\site-packages\\matplotlib\\mpl-data'
assert mpl_data_dir, "Failed to determine matplotlib's data directory!"
datas = [
(mpl_data_dir, "matplotlib/mpl-data"),
]
And don't forget to save.

Traceback (most recent call last) when docx.Document() is used

import docx
f = open('~/Desktop/python/test/draft.docx','rb')
document = docx.Document(f)
Traceback (most recent call last):
File "./test.py", line 56, in <module>
document = docx.Document(f)
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/docx/api.py", line 25, in Document
document_part = Package.open(docx).main_document_part
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/docx/opc/package.py", line 116, in open
pkg_reader = PackageReader.from_file(pkg_file)
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/docx/opc/pkgreader.py", line 32, in from_file
phys_reader = PhysPkgReader(pkg_file)
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/docx/opc/phys_pkg.py", line 101, in __init__
self._zipf = ZipFile(pkg_file, 'r')
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/zipfile.py", line 1200, in __init__
self._RealGetContents()
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/zipfile.py", line 1267, in _RealGetContents
raise BadZipFile("File is not a zip file")
zipfile.BadZipFile: File is not a zip file
Uninstalled docx and installed python-docx
Uninstalled lxml too
None of it works.
Any help would be appreciated
Running python3.7 on OS X10.13
Don't open the file before you pass it to Document(). Just give it the path like you did in the open() call above.
It needs to be an actual Word .docx file. Note that you can just call document = Document() to get started. The "save as" file name is provided in the document.save() call. The file (if any) provided in the Document() call is just the starting-point "template" to use.
See the related documentation here:
https://python-docx.readthedocs.io/en/latest/user/documents.html

“git.exc.GitCommandNotFound: [WinError 2] The system cannot find the file specified” error in Python 3.5

I am trying to lemmatize a Latin text using Python 3.5 in Pycharm 5.0.4 with the CLTK library, but there seems to be a problem with Git. I get the error git.exc.GitCommandNotFound: [WinError 2] The system cannot find the file specified among other errors I believe are related—see below for the full output. I have tried adding a Git repository to the project folder and adding the git.exe path to version control but that seems to have done nothing. What can I do to get Git to work properly—please keep in mind that I am a complete neophyte when it comes to Python in particular and not very experienced with programming in general.
Code:
from cltk.stem.lemma import LemmaReplacer
from cltk.stem.latin.j_v import JVReplacer
from cltk.corpus.utils.importer import CorpusImporter
corpus_importer = CorpusImporter('latin')
corpus_importer.import_corpus('latin_text_latin_library')
corpus_importer.import_corpus('latin_models_cltk')
#corpus_importer.import_corpus('phi5', '~/PHI5/')
#t.convert_corpus(corpus='phi5')
j = JVReplacer()
lemmatizer = LemmaReplacer('latin')
In = open("CIC.txt","rt")
Out = open("CIC4.txt","wt")
text = In.read()
text = text.lower()
text = j.replace(text)
Out.write(str(lemmatizer.lemmatize(text)))
In.close()
Out.close()
Output:
C:\Users\Rune\AppData\Local\Programs\Python\Python35-32\python.exe "C:\Program Files (x86)\JetBrains\PyCharm 5.0.4\helpers\pydev\pydevd.py" --multiproc --qt-support --client 127.0.0.1 --port 58508 --file C:/Users/Rune/PycharmProjects/untitled/Pucker.py
pydev debugger: process 14648 is connecting
Connected to pydev debugger (build 143.1919)
--- Logging error ---
Traceback (most recent call last):
File "C:\Users\Rune\AppData\Local\Programs\Python\Python35-32\lib\site-packages\git\cmd.py", line 604, in execute
**subprocess_kwargs
File "C:\Users\Rune\AppData\Local\Programs\Python\Python35-32\lib\subprocess.py", line 950, in __init__
restore_signals, start_new_session)
File "C:\Users\Rune\AppData\Local\Programs\Python\Python35-32\lib\subprocess.py", line 1220, in _execute_child
startupinfo)
File "C:\Program Files (x86)\JetBrains\PyCharm 5.0.4\helpers\pydev\pydev_monkey.py", line 387, in new_CreateProcess
return getattr(_subprocess, original_name)(appName, patch_arg_str_win(commandLine), *args)
FileNotFoundError: [WinError 2] The system cannot find the file specified
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:\Users\Rune\AppData\Local\Programs\Python\Python35-32\lib\site-packages\cltk\corpus\utils\importer.py", line 134, in import_corpus
Repo.clone_from(git_uri, target_dir, depth=1)
File "C:\Users\Rune\AppData\Local\Programs\Python\Python35-32\lib\site-packages\git\repo\base.py", line 885, in clone_from
return cls._clone(git, url, to_path, GitCmdObjectDB, progress, **kwargs)
File "C:\Users\Rune\AppData\Local\Programs\Python\Python35-32\lib\site-packages\git\repo\base.py", line 826, in _clone
v=True, **add_progress(kwargs, git, progress))
File "C:\Users\Rune\AppData\Local\Programs\Python\Python35-32\lib\site-packages\git\cmd.py", line 450, in <lambda>
return lambda *args, **kwargs: self._call_process(name, *args, **kwargs)
File "C:\Users\Rune\AppData\Local\Programs\Python\Python35-32\lib\site-packages\git\cmd.py", line 878, in _call_process
return self.execute(make_call(), **_kwargs)
File "C:\Users\Rune\AppData\Local\Programs\Python\Python35-32\lib\site-packages\git\cmd.py", line 607, in execute
raise GitCommandNotFound(str(err))
git.exc.GitCommandNotFound: [WinError 2] The system cannot find the file specified
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:\Users\Rune\AppData\Local\Programs\Python\Python35-32\lib\logging\__init__.py", line 980, in emit
msg = self.format(record)
File "C:\Users\Rune\AppData\Local\Programs\Python\Python35-32\lib\logging\__init__.py", line 830, in format
return fmt.format(record)
File "C:\Users\Rune\AppData\Local\Programs\Python\Python35-32\lib\logging\__init__.py", line 567, in format
record.message = record.getMessage()
File "C:\Users\Rune\AppData\Local\Programs\Python\Python35-32\lib\logging\__init__.py", line 330, in getMessage
msg = msg % self.args
TypeError: not enough arguments for format string
Call stack:
File "C:\Program Files (x86)\JetBrains\PyCharm 5.0.4\helpers\pydev\pydevd.py", line 2411, in <module>
globals = debugger.run(setup['file'], None, None, is_module)
File "C:\Program Files (x86)\JetBrains\PyCharm 5.0.4\helpers\pydev\pydevd.py", line 1802, in run
launch(file, globals, locals) # execute the script
File "C:\Program Files (x86)\JetBrains\PyCharm 5.0.4\helpers\pydev\_pydev_imps\_pydev_execfile.py", line 18, in execfile
exec(compile(contents+"\n", file, 'exec'), glob, loc)
File "C:/Users/Rune/PycharmProjects/untitled/Pucker.py", line 5, in <module>
corpus_importer.import_corpus('latin_text_latin_library')
File "C:\Users\Rune\AppData\Local\Programs\Python\Python35-32\lib\site-packages\cltk\corpus\utils\importer.py", line 136, in import_corpus
logger.error("Git clone of '%s' failed: '%s'", (git_uri, e))
Message: "Git clone of '%s' failed: '%s'"
Arguments: (('https://github.com/cltk/latin_text_latin_library.git', GitCommandNotFound('[WinError 2] The system cannot find the file specified',)),)
Process finished with exit code 0
You can manually download the corpus from https://github.com/cltk/latin_models_cltk and place it in the
~/cltk_data/latin/model/ folder (so ~/cltk_data/latin/model/latin_models_cltk/lemmata/ is an existing folder afterwards). Then you should be able to run the following just fine:
from cltk.stem.lemma import LemmaReplacer
LemmaReplacer('latin').lemmatize('some_latin_here')
For the same in Greek, just replace the 'latin' by 'greek' everywhere in these instructions. I imagine (but haven't tried) that it works the same for other languages too.
I am pretty sure that CLTK does not work with any Python version below 3.6, at least as of August 2017. I had a devil of a time getting 3.6 installed on Ubuntu (Ubuntu linus running dual boot from my PC laptop) but eventually got it to work. Also I specifically solved the problem that OP articulates above by following the instructions at https://disiectamembra.wordpress.com/2016/07/01/current-state-of-the-cltk-latin-lemmatizer/. Good luck!

Python Scipy library: saving images

I am learning image processing with scipy. I experience some diffuculties with rather basic opeartions as saving an image. Here is my code:
import scipy
from scipy import misc
img=misc.imread("C:\\..\\name.jpg")
misc.imsave("image.jpg",img)
I obtain error message:
Traceback (most recent call last):
File "<pyshell#4>", line 1, in <module>
misc.imsave("image.jpg",img)
File "C:\Python27\lib\site-packages\scipy\misc\pilutil.py", line 158, in imsave
im.save(name)
File "C:\Python27\lib\site-packages\PIL\Image.py", line 1461, in save
fp = builtins.open(fp, "wb")
IOError: [Errno 13] Permission denied: 'image.jpg'
Try to use the full path when saving:
misc.imsave(r'C:\path\image.jpg', img)
your error is a permission error, so probably you do not have access to write in the current directory. You can also change the current directory using os.chdir( newpath ).
The code above creates a file in a given directory but it is empty (0 bytes) and produces error in the IDLE:
Traceback (most recent call last):
File "<pyshell#9>", line 1, in <module>
misc.imsave(r"D:\Darek\back3.jpg", img)
File "C:\Python27\lib\site-packages\scipy\misc\pilutil.py", line 158, in imsave
im.save(name)
File "C:\Python27\lib\site-packages\PIL\Image.py", line 1467, in save
save_handler(self, fp, filename)
File "C:\Python27\lib\site-packages\PIL\JpegImagePlugin.py", line 557, in _save
ImageFile._save(im, fp, [("jpeg", (0,0)+im.size, 0, rawmode)])
File "C:\Python27\lib\site-packages\PIL\ImageFile.py", line 466, in _save
e = Image._getencoder(im.mode, e, a, im.encoderconfig)
File "C:\Python27\lib\site-packages\PIL\Image.py", line 395, in _getencoder
return encoder(mode, *args + extra)
TypeError: function takes at most 11 arguments (13 given)
Hmm, your code works fine for me in the dreampie shell
import scipy
from scipy import misc
img = misc.imread("C:/folder/name.jpg")
misc.imsave("C:/folder2/image.jpg",img)
I don't know PIL well enough, but there seems to be an encoder issue involved. Have you tried your code with different image files?

Categories