I've been trying to get into visualizing proteins in Python, so I went on YouTube and found some tutorials. I ended up on one that teaches you how to visualize a protein from the COVID-19 virus. I set up Anaconda, got Jupyter Notebook working in VS Code, downloaded the necessary files from the PDB database, and made sure they were in the same directory as my notebook. But when I run the nglview.show_biopython(structure) function I get ValueError: I/O operation on closed file. I'm stumped; this is my first time using Jupyter Notebook, so maybe there is something I'm missing, I don't know.
This is what the code looks like:
from Bio.PDB import *
import nglview as nv
parser = PDBParser()
structure = parser.get_structure("6YYT", "6YYT.pdb")
view = nv.show_biopython(structure)
This is the error
Output exceeds the size limit. Open the full output data in a text editor
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
~\AppData\Local\Temp\ipykernel_1728\2743687014.py in <module>
----> 1 view = nv.show_biopython(structure)
c:\Users\jerem\anaconda3\lib\site-packages\nglview\show.py in show_biopython(entity, **kwargs)
450 '''
451 entity = BiopythonStructure(entity)
--> 452 return NGLWidget(entity, **kwargs)
453
454
c:\Users\jerem\anaconda3\lib\site-packages\nglview\widget.py in __init__(self, structure, representations, parameters, **kwargs)
243 else:
244 if structure is not None:
--> 245 self.add_structure(structure, **kwargs)
246
247 if representations:
c:\Users\jerem\anaconda3\lib\site-packages\nglview\widget.py in add_structure(self, structure, **kwargs)
1111 if not isinstance(structure, Structure):
1112 raise ValueError(f'{structure} is not an instance of Structure')
-> 1113 self._load_data(structure, **kwargs)
1114 self._ngl_component_ids.append(structure.id)
1115 if self.n_components > 1:
...
--> 200 return io_str.getvalue()
201
202
ValueError: I/O operation on closed file
I only get this error when using nglview.show_biopython; when I run the get_structure() function it reads the file just fine. I can visualize other molecules just fine, or maybe that's because I was using the ASE library instead of a file. I don't know; that's why I'm here.
Update: Recently I found out that I can visualize the protein using nglview.show_file() instead of nglview.show_biopython(). Even though I can visualize proteins now, and technically my problem has been solved, I would still like to know why the show_biopython() function isn't working properly.
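For reference, the workaround is just this (a minimal sketch; the file name matches the question):

import nglview as nv

view = nv.show_file("6YYT.pdb")  # reads the PDB file directly, bypassing the Biopython adaptor
view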
I also figured out another way to fix this problem. Going back to the tutorial I mentioned, I saw that it was made in 2021, which made me wonder whether we were using the same versions of each package. It turns out we were not. I'm not sure what version of nglview they were using, but they were using Biopython 1.79, the latest version back in 2021, while I was using Biopython 1.80. With Biopython 1.80 I was getting the error seen above, but now that I'm using Biopython 1.79 I get this:
file = "6YYT.pdb"
parser = PDBParser()
structure = parser.get_structure("6YYT", file)
structure
view = nv.show_biopython(structure)
view
Output:
c:\Users\jerem\anaconda3\lib\site-packages\Bio\PDB\StructureBuilder.py:89:
PDBConstructionWarning: WARNING: Chain A is discontinuous at line 12059.
warnings.warn(
So I guess there is something going on with Biopython 1.80; I'm going to stick with 1.79.
I had a similar problem with:
from Bio.PDB import *
import nglview as nv
parser = PDBParser(QUIET = True)
structure = parser.get_structure("2ms2", "2ms2.pdb")
save_pdb = PDBIO()
save_pdb.set_structure(structure)
save_pdb.save('pdb_out.pdb')
view = nv.show_biopython(structure)
view
The error was the same as in the question:
.................site-packages/nglview/adaptor.py:201, in BiopythonStructure.get_structure_string(self)
199 io_str = StringIO()
200 io_pdb.save(io_str)
--> 201 return io_str.getvalue()
ValueError: I/O operation on closed file
I modified site-packages/nglview/adaptor.py:201, in BiopythonStructure.get_structure_string(self), replacing:
def get_structure_string(self):
    from Bio.PDB import PDBIO
    from io import StringIO
    io_pdb = PDBIO()
    io_pdb.set_structure(self._entity)
    io_str = StringIO()
    io_pdb.save(io_str)
    return io_str.getvalue()
with:
def get_structure_string(self):
    from Bio.PDB import PDBIO
    import mmap
    io_pdb = PDBIO()
    io_pdb.set_structure(self._entity)
    mo = mmap_str()
    io_pdb.save(mo)
    return mo.read()
and added this new class, mmap_str(), in the same file:
import mmap  # added import at top of adaptor.py
import copy

class mmap_str():
    instance = None

    def __init__(self):
        self.mm = mmap.mmap(-1, 2)
        self.a = ''
        b = '\n'
        self.mm.write(b.encode(encoding='utf-8'))
        self.mm.seek(0)
        #print('self.mm.read().decode() ', self.mm.read().decode(encoding='utf-8'))
        self.mm.seek(0)

    def __new__(cls, *args, **kwargs):
        if not isinstance(cls.instance, cls):
            cls.instance = object.__new__(cls)
        return cls.instance

    def write(self, string):
        self.a = str(copy.deepcopy(self.mm.read().decode(encoding='utf-8'))).lstrip('\n')
        self.mm.seek(0)
        #print('a -> ', self.a)
        len_a = len(self.a)
        self.mm = mmap.mmap(-1, len(self.a) + len(string))
        #print('a :', self.a)
        #print('len self.mm ', len(self.mm))
        #print('length string : ', len(string))
        #print(bytes((self.a + string).encode()))
        self.mm.write(bytes((self.a + string).encode()))
        self.mm.seek(0)
        #print('written once ')
        #self.mm.seek(0)

    def read(self):
        self.mm.seek(0)
        a = self.mm.read().decode().lstrip('\n')
        self.mm.seek(0)
        return a

    def __enter__(self):
        return self

    def __exit__(self, *args):
        pass
If I uncomment the print statements I'll get the:
IOPub data rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
error, but with them commented out the structure displays fine (screenshot omitted). Using nglview.show_file(filename) instead, I get the full assembly with crystallographic symmetry (screenshot omitted). That's because, as can be seen by looking at the pdb_out.pdb file written out by my code, Bio.PDB.PDBParser.get_structure(name, filename) doesn't retrieve the PDB header responsible for generating the full CRYSTALLOGRAPHIC SYMMETRY (or Biopython can't handle it; not sure about this, help if you know better), but just the coordinates.
I still don't understand what is going on with the:
--> 201 return io_str.getvalue()
ValueError: I/O operation on closed file
Could it be something related to the Jupyter ipykernel? I hope somebody can shed more light on this. I don't know how the framework runs, but it is definitely different from a normal Python interpreter. As an example: the same code in one of my Python virtualenvs will run forever, so could it be that ipykernel doesn't like StringIO()s, or does something strange to them?
OK, thanks to the hint in the answer below, I went and inspected PDBIO.py in the GitHub repo for Biopython 1.80 and compared the save method of PDBIO, def save(self, file, select=_select, write_end=True, preserve_atom_numbering=False):, with the one in Biopython 1.79 (screenshots of the first and last bits omitted).
Apparently the big difference is the with fhandle: block in version 1.80.
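A minimal demonstration of why that matters (my own sketch, not code from either library): StringIO's __exit__ closes the buffer, so once save() wraps the handle in a with block, a later getvalue() fails.

from io import StringIO

io_str = StringIO()
with io_str as fhandle:        # what Biopython 1.80's save() effectively does
    fhandle.write("ATOM ...")  # stand-in for the PDB records PDBIO writes
print(io_str.closed)           # True: leaving the with block closed the buffer
io_str.getvalue()              # ValueError: I/O operation on closed file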
So I realized that modifying adaptor.py by adding a subclass of StringIO that looks like:
from io import StringIO

class StringIO(StringIO):
    def __exit__(self, *args, **kwargs):
        print('exiting from subclassed StringIO !!!!!')
        pass
and modifying def get_structure_string(self): like this:
def get_structure_string(self):
    from Bio.PDB import PDBIO
    #from io import StringIO
    io_pdb = PDBIO()
    io_pdb.set_structure(self._entity)
    io_str = StringIO()
    io_pdb.save(io_str)
    return io_str.getvalue()
was enough to get my Biopython 1.80 working with nglview in Jupyter.
That said, I am not sure what the pitfalls are of not closing the StringIO object we use for the visualization, but apparently that is what Biopython 1.79 was doing, and what my first answer using an mmap object was doing too (it never closes the mmap_str).
Another way to solve the problem:
While trying to understand git, I ended up with this. It seems more coherent with the previous habits in the Biopython project, but I can't push it.
It makes use of as_handle from Bio.File: https://github.com/biopython/biopython/blob/e1902d1cdd3aa9325b4622b25d82fbf54633e251/Bio/File.py#L28
@contextlib.contextmanager
def as_handle(handleish, mode="r", **kwargs):
    r"""Context manager to ensure we are using a handle.

    Context manager for arguments that can be passed to SeqIO and AlignIO read, write,
    and parse methods: either file objects or path-like objects (strings, pathlib.Path
    instances, or more generally, anything that can be handled by the builtin 'open'
    function).

    When given a path-like object, returns an open file handle to that path, with provided
    mode, which will be closed when the manager exits.

    All other inputs are returned, and are *not* closed.

    Arguments:
     - handleish - Either a file handle or path-like object (anything which can be
       passed to the builtin 'open' function, such as str, bytes,
       pathlib.Path, and os.DirEntry objects)
     - mode - Mode to open handleish (used only if handleish is a string)
     - kwargs - Further arguments to pass to open(...)

    Examples
    --------
    >>> from Bio import File
    >>> import os
    >>> with File.as_handle('seqs.fasta', 'w') as fp:
    ...     fp.write('>test\nACGT')
    ...
    10
    >>> fp.closed
    True

    >>> handle = open('seqs.fasta', 'w')
    >>> with File.as_handle(handle) as fp:
    ...     fp.write('>test\nACGT')
    ...
    10
    >>> fp.closed
    False
    >>> fp.close()
    >>> os.remove("seqs.fasta")  # tidy up
    """
    try:
        with open(handleish, mode, **kwargs) as fp:
            yield fp
    except TypeError:
        yield handleish
Could anyone pass it along? (Of course it needs to be checked; my tests are OK, but I am a novice.)
Related
New to forensics, but we're thinking that, in order to pull registries from many machines for baseline analysis, we would use PowerShell to run:
reg export HKLM hklm.reg
on every machine and then parse the exported hklm.reg files (same with HKCU, etc).
Seemed simple enough, so I tried using yarp to parse it like:
from yarp import *
# A primary file is specified here.
primary_path = './data/registry/hklm.reg'
# Open the primary file and each transaction log file discovered.
primary_file = open(primary_path, 'rb')
# Open the hive and recover it, if required.
hive = Registry.RegistryHive(primary_file)
and got this error:
---------------------------------------------------------------------------
BaseBlockException Traceback (most recent call last)
~\AppData\Local\Temp\ipykernel_28960\903824968.py in <module>
9
10 # Open the hive and recover it, if required.
---> 11 hive = Registry.RegistryHive(primary_file)
12
13 '''
~\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.7_qbz5n2kfra8p0\LocalCache\local-packages\Python37\site-packages\yarp\Registry.py in __init__(self, file_object, tolerate_minor_errors)
205
206 def __init__(self, file_object, tolerate_minor_errors = True):
--> 207 self.registry_file = RegistryFile.PrimaryFile(file_object, tolerate_minor_errors)
208 self.tolerate_minor_errors = tolerate_minor_errors
209 self.effective_slack = set()
~\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.7_qbz5n2kfra8p0\LocalCache\local-packages\Python37\site-packages\yarp\RegistryFile.py in __init__(self, file_object, tolerate_minor_errors)
1107 self.last_sequence_number = None
1108
-> 1109 self.baseblock = BaseBlock(self.file_object)
1110 if not self.baseblock.is_primary_file:
1111 raise NotSupportedException('Invalid file type')
~\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.7_qbz5n2kfra8p0\LocalCache\local-packages\Python37\site-packages\yarp\RegistryFile.py in __init__(self, file_object, no_hive_bins)
307 signature = self.get_signature()
308 if signature != b'regf': # This is the only check possible before we validate the base block.
--> 309 raise BaseBlockException('Invalid signature: {}'.format(signature))
310
311 # We have to trust these fields even if the base block is not valid. We can adjust these values later (according to the log file).
BaseBlockException: "Invalid signature: b'\\xff\\xfeW\\x00'"
Am I misusing the library? How else would I parse these exported registries (into python dictionaries or DataFrames)?
From the owner of the repo:
Hello.
You are trying to use the library against a text file containing
exported registry data. This is not supported.
And such text files do not contain deleted data, as well as some
important metadata for allocated keys and values. For analysis, a
better option is "reg save " (see:
https://dfir.ru/2020/10/03/exporting-registry-hives-from-a-live-system/).
This was obviously the correct answer.
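For completeness, a hedged sketch of what that looks like end to end (it assumes you first ran something like reg save HKLM\SOFTWARE software.hiv in an elevated prompt; the path below is made up):

from yarp import Registry

# A binary hive produced by "reg save" starts with the b'regf' signature yarp expects.
primary_file = open('./data/registry/software.hiv', 'rb')
hive = Registry.RegistryHive(primary_file)
print(hive.root_key().path())  # walk the tree from here via subkeys()/values()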
I have a pickle file that was created with python 2.7 that I'm trying to port to python 3.6. The file is saved in py 2.7 via pickle.dumps(self.saved_objects, -1)
and loaded in python 3.6 via loads(data, encoding="bytes") (from a file opened in rb mode). If I try opening in r mode and pass encoding=latin1 to loads I get UnicodeDecode errors. When I open it as a byte stream it loads, but literally every string is now a byte string. Every object's __dict__ keys are all b"a_variable_name" which then generates attribute errors when calling an_object.a_variable_name because __getattr__ passes a string and __dict__ only contains bytes. I feel like I've tried every combination of arguments and pickle protocols already. Apart from forcibly converting all objects' __dict__ keys to strings I'm at a loss. Any ideas?
** Skip to 4/28/17 update for better example
-------------------------------------------------------------------------------------------------------------
** Update 4/27/17
This minimum example illustrates my problem:
From py 2.7.13
import pickle
class test(object):
    def __init__(self):
        self.x = u"test ¢"  # including a unicode str breaks things

t = test()
dumpstr = pickle.dumps(t)
>>> dumpstr
"ccopy_reg\n_reconstructor\np0\n(c__main__\ntest\np1\nc__builtin__\nobject\np2\nNtp3\nRp4\n(dp5\nS'x'\np6\nVtest \xa2\np7\nsb."
From py 3.6.1
import pickle
class test(object):
    def __init__(self):
        self.x = "xyz"
dumpstr = b"ccopy_reg\n_reconstructor\np0\n(c__main__\ntest\np1\nc__builtin__\nobject\np2\nNtp3\nRp4\n(dp5\nS'x'\np6\nVtest \xa2\np7\nsb."
t = pickle.loads(dumpstr, encoding="bytes")
>>> t
<__main__.test object at 0x040E3DF0>
>>> t.x
Traceback (most recent call last):
File "<pyshell#15>", line 1, in <module>
t.x
AttributeError: 'test' object has no attribute 'x'
>>> t.__dict__
{b'x': 'test ¢'}
>>>
-------------------------------------------------------------------------------------------------------------
Update 4/28/17
To re-create my issue I'm posting my actual raw pickle data here
The pickle file was created in python 2.7.13, windows 10 using
with open("raw_data.pkl", "wb") as fileobj:
pickle.dump(library, fileobj, protocol=0)
(protocol 0 so it's human readable)
To run it you'll need classes.py
# classes.py
class Library(object): pass
class Book(object): pass
class Student(object): pass
class RentalDetails(object): pass
And the test script here:
# load_pickle.py
import pickle, sys, itertools, os

raw_pkl = "raw_data.pkl"
is_py3 = sys.version_info.major == 3
read_modes = ["rb"]
encodings = ["bytes", "utf-8", "latin-1"]
fix_imports_choices = [True, False]
files = ["raw_data_%s.pkl" % x for x in range(3)]

def py2_test():
    with open(raw_pkl, "rb") as fileobj:
        loaded_object = pickle.load(fileobj)
    print("library dict: %s" % (loaded_object.__dict__.keys()))
    return loaded_object

def py2_dumps():
    library = py2_test()
    for protocol, path in enumerate(files):
        print("dumping library to %s, protocol=%s" % (path, protocol))
        with open(path, "wb") as writeobj:
            pickle.dump(library, writeobj, protocol=protocol)

def py3_test():
    # this test iterates over the different options trying to load
    # the data pickled with py2 into a py3 environment
    print("starting py3 test")
    for (read_mode, encoding, fix_import, path) in itertools.product(read_modes, encodings, fix_imports_choices, files):
        py3_load(path, read_mode=read_mode, fix_imports=fix_import, encoding=encoding)

def py3_load(path, read_mode, fix_imports, encoding):
    from traceback import print_exc
    print("-" * 50)
    print("path=%s, read_mode = %s fix_imports = %s, encoding = %s" % (path, read_mode, fix_imports, encoding))
    if not os.path.exists(path):
        print("start this file with py2 first")
        return
    try:
        with open(path, read_mode) as fileobj:
            loaded_object = pickle.load(fileobj, fix_imports=fix_imports, encoding=encoding)
        # print the object's __dict__
        print("library dict: %s" % (loaded_object.__dict__.keys()))
        # consider the test a failure if any member attributes are saved as bytes
        test_passed = not any((isinstance(k, bytes) for k in loaded_object.__dict__.keys()))
        print("Test %s" % ("Passed!" if test_passed else "Failed"))
    except Exception:
        print_exc()
        print("Test Failed")
        input("Press Enter to continue...")
    print("-" * 50)

if is_py3:
    py3_test()
else:
    # py2_test()
    py2_dumps()
Put all 3 in the same directory and run c:\python27\python load_pickle.py first, which will create one pickle file for each of the 3 protocols. Then run the same command with Python 3 and notice that every version converts the __dict__ keys to bytes. I had it working for about 6 hours, but for the life of me I can't figure out how I broke it again.
In short, you're hitting bug 22005 with datetime.date objects in the RentalDetails objects.
That can be worked around with the encoding='bytes' parameter, but that leaves your classes with __dict__ containing bytes:
>>> library = pickle.loads(pickle_data, encoding='bytes')
>>> dir(library)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: '<' not supported between instances of 'str' and 'bytes'
It's possible to manually fix that based on your specific data:
def fix_object(obj):
    """Decode obj.__dict__ containing bytes keys"""
    obj.__dict__ = dict((k.decode("ascii"), v) for k, v in obj.__dict__.items())

def fix_library(library):
    """Walk all library objects and decode __dict__ keys"""
    fix_object(library)
    for student in library.students:
        fix_object(student)
    for book in library.books:
        fix_object(book)
        for rental in book.rentals:
            fix_object(rental)
But that's fragile and enough of a pain you should be looking for a better option.
1) Implement __getstate__/__setstate__ that maps datetime objects to a non-broken representation, for instance:
class Event(object):
    """Example class working around datetime pickling bug"""
    def __init__(self):
        self.date = datetime.date.today()

    def __getstate__(self):
        state = self.__dict__.copy()
        state["date"] = state["date"].toordinal()
        return state

    def __setstate__(self, state):
        self.__dict__.update(state)
        self.date = datetime.date.fromordinal(self.date)
2) Don't use pickle at all. Along the lines of __getstate__/__setstate__, you can just implement to_dict/from_dict methods or similar in your classes for saving their content as json or some other plain format.
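For instance, a minimal sketch of that idea (a hypothetical Book class; the ordinal round-trip mirrors the Event example above):

import json
import datetime

class Book(object):
    def __init__(self, title, due_date):
        self.title = title
        self.due_date = due_date

    def to_dict(self):
        return {"title": self.title, "due_date": self.due_date.toordinal()}

    @classmethod
    def from_dict(cls, d):
        return cls(d["title"], datetime.date.fromordinal(d["due_date"]))

book = Book("Dune", datetime.date.today())
data = json.dumps(book.to_dict())            # plain text, readable from py2 and py3 alike
restored = Book.from_dict(json.loads(data))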
A final note, having a backreference to library in each object shouldn't be required.
You should treat pickle data as specific to the (major) version of Python that created it.
(See Gregory Smith's message w.r.t. issue 22005.)
The best way to get around this is to write a Python 2.7 program to read the pickled data, and write it out in a neutral format.
Taking a quick look at your actual data, it seems to me that an SQLite database is appropriate as an interchange format, since the Books contain references to a Library and RentalDetails. You could create separate tables for each.
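A sketch of that conversion, run under Python 2.7 (the attribute names are assumptions, since the real classes are empty shells in classes.py):

import pickle, json

with open("raw_data.pkl", "rb") as fileobj:
    library = pickle.load(fileobj)

# Flatten to a neutral format; Python 3 then reads the JSON (or an SQLite DB) directly.
rows = [{"title": book.title, "n_rentals": len(book.rentals)}
        for book in library.books]  # assumption: Library has .books, Book has .rentals

with open("library.json", "w") as out:
    json.dump(rows, out)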
Question: Porting pickle py2 to py3 strings become bytes
The given encoding='latin-1' below is OK.
Your problem with b'' is the result of using encoding='bytes'.
This results in dict keys being unpickled as bytes instead of as str.
The problem data are the datetime.date values '\x07á\x02\x10', starting at line 56 in raw-data.pkl.
It's a known issue, as already pointed out:
Unpickling python2 datetime under python3
http://bugs.python.org/issue22005
As a workaround, I have patched pickle.py and got the unpickled object, e.g.
book.library.books[0].rentals[0].rental_date=2017-02-16
This works for me:
t = pickle.loads(dumpstr, encoding="latin-1")
Output:
<__main__.test object at 0xf7095fec>
t.__dict__={'x': 'test ¢'}
test ¢
Tested with Python 3.4.2.
I'm writing a piece of software over on GitHub. It's basically a tray icon with some extra features. I want to provide working code without making the user install what are essentially dependencies for optional features, and I don't actually want to import things I'm not going to use, so I thought code like this would be a "good solution":
---- IN LOADING FUNCTION ----
features = []
for path in sys.path:
    if os.path.exists(os.path.join(path, 'pynotify')):
        features.append('pynotify')
    if os.path.exists(os.path.join(path, 'gnomekeyring.so')):
        features.append('gnome-keyring')

# user dialog to ask for stuff
# notifications available, do you want them enabled?
dlg = ConfigDialog(features)
if not dlg.get_notifications():
    features.remove('pynotify')

service_start(features ...)
---- SOMEWHERE ELSE ------
def service_start(features, other_config):
    if 'pynotify' in features:
        import pynotify
        # use pynotify...
There are some issues, however. If a user formats his machine, installs the newest version of his OS, and redeploys this application, features suddenly disappear without warning. The solution is to present this in the configuration window:
if 'pynotify' in features:
    # gtk checkbox
else:
    # gtk label reading "Get pynotify and enjoy notification pop ups!"
But if this is say, a mac, how do I know I'm not sending the user on a wild goose chase looking for a dependency they can never fill?
The second problem is the:
if os.path.exists(os.path.join(path, 'gnomekeyring.so')):
issue. Can I be sure that the file is always called gnomekeyring.so across all Linux distros?
How do other people test these features? The problem with the basic
try:
    import pynotify
except:
    pynotify = disabled
is that the code is global; these might be littered around, and even if the user doesn't want pynotify... it's loaded anyway.
So what do people think is the best way to solve this problem?
The try: method does not need to be global — it can be used in any scope and so modules can be "lazy-loaded" at runtime. For example:
def foo():
    try:
        import external_module
    except ImportError:
        external_module = None

    if external_module:
        external_module.some_whizzy_feature()
    else:
        print("You could be using a whizzy feature right now, if you had external_module.")
When your script is run, no attempt will be made to load external_module. The first time foo() is called, external_module is (if available) loaded and inserted into the function's local scope. Subsequent calls to foo() reinsert external_module into its scope without needing to reload the module.
In general, it's best to let Python handle import logic — it's been doing it for a while. :-)
You might want to have a look at the imp module, which basically does what you do manually above: you can first look for a module with find_module() and then load it via load_module() or by simply importing it (after checking the config).
And by the way, when using except: I would always add the specific exception (here ImportError) so as not to accidentally catch unrelated errors.
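A side note: imp has been deprecated since Python 3.4; on modern Python the same existence check can be done with importlib.util.find_spec, which locates a module without importing it. A minimal sketch:

import importlib.util

features = []
if importlib.util.find_spec("pynotify") is not None:  # found on sys.path, nothing imported yet
    features.append("pynotify")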
Not sure if this is good practice, but I created a function that does the optional import (using importlib) and error handling:
def _optional_import(module: str, name: str = None, package: str = None):
    import importlib
    try:
        module = importlib.import_module(module)
        return module if name is None else getattr(module, name)
    except ImportError as e:
        if package is None:
            package = module
        msg = f"install the '{package}' package to make use of this feature"
        raise ValueError(msg) from e
If an optional module is not available, the user will at least get the idea what to do. E.g.
# code ...
if file.endswith('.json'):
    from json import load
elif file.endswith('.yaml'):
    # equivalent to 'from yaml import safe_load as load'
    load = _optional_import('yaml', 'safe_load', package='pyyaml')
# code using load ...
The main disadvantage with this approach is that your imports have to be done in-line and are not all on the top of your file. Therefore, it might be considered better practice to use a slight adaptation of this function (assuming that you are importing a function or the like):
def _optional_import_(module: str, name: str = None, package: str = None):
    import importlib
    try:
        module = importlib.import_module(module)
        return module if name is None else getattr(module, name)
    except ImportError as e:
        if package is None:
            package = module
        msg = f"install the '{package}' package to make use of this feature"
        import_error = e

        def _failed_import(*args):
            raise ValueError(msg) from import_error

        return _failed_import
Now, you can make the imports with the rest of your imports and the error will only be raised when the function that failed to import is actually used. E.g.
from utils import _optional_import_  # let's assume we import the function
from json import load as json_load

yaml_load = _optional_import_('yaml', 'safe_load', package='pyyaml')

# unimportant code ...

with open('test.txt', 'r') as fp:
    result = yaml_load(fp)  # will raise a value error if import was not successful
PS: sorry for the late answer!
Another option is to use @contextmanager and with. In this situation, you do not know beforehand which dependencies are needed:
from contextlib import contextmanager

@contextmanager
def optional_dependencies(error: str = "ignore"):
    assert error in {"raise", "warn", "ignore"}
    try:
        yield None
    except ImportError as e:
        if error == "raise":
            raise e
        if error == "warn":
            msg = f'Missing optional dependency "{e.name}". Use pip or conda to install.'
            print(f'Warning: {msg}')
Usage:
with optional_dependencies("warn"):
import module_which_does_not_exist_1
import module_which_does_not_exist_2
z = 1
print(z)
Output:
Warning: Missing optional dependency "module_which_does_not_exist_1". Use pip or conda to install.
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
Cell In [43], line 5
3 import module_which_does_not_exist_2
4 z = 1
----> 5 print(z)
NameError: name 'z' is not defined
Here, you should define all your imports immediately after with. The first module which is not installed will throw ImportError, which is caught by optional_dependencies. Depending on how you want to handle this error, it will either ignore it, print a warning, or raise it again.
The entire code will only run if all the modules are installed.
Here's a production-grade solution, using importlib and Pandas's import_optional_dependency, as suggested by @dre-hh:
from typing import *
import importlib, types
def module_exists(
        *names: Union[List[str], str],
        error: str = "ignore",
        warn_every_time: bool = False,
        __INSTALLED_OPTIONAL_MODULES: Dict[str, bool] = {}
) -> Optional[Union[Tuple[types.ModuleType, ...], types.ModuleType]]:
    """
    Try to import optional dependencies.
    Ref: https://stackoverflow.com/a/73838546/4900327

    Parameters
    ----------
    names: str or list of strings.
        The module name(s) to import.
    error: str {'raise', 'warn', 'ignore'}
        What to do when a dependency is not found.
        * raise : Raise an ImportError.
        * warn: print a warning.
        * ignore: If any module is not installed, return None, otherwise,
          return the module(s).
    warn_every_time: bool
        Whether to warn every time an import is tried. Only applies when error="warn".
        Setting this to True will result in multiple warnings if you try to
        import the same library multiple times.

    Returns
    -------
    maybe_module : Optional[ModuleType, Tuple[ModuleType...]]
        The imported module(s), if all are found.
        None is returned if any module is not found and `error!="raise"`.
    """
    assert error in {"raise", "warn", "ignore"}
    if isinstance(names, (list, tuple, set)):
        names: List[str] = list(names)
    else:
        assert isinstance(names, str)
        names: List[str] = [names]
    modules = []
    for name in names:
        try:
            module = importlib.import_module(name)
            modules.append(module)
            __INSTALLED_OPTIONAL_MODULES[name] = True
        except ImportError:
            modules.append(None)

    def error_msg(missing: Union[str, List[str]]):
        if not isinstance(missing, (list, tuple)):
            missing = [missing]
        missing_str: str = ' '.join([f'"{name}"' for name in missing])
        dep_str = 'dependencies'
        if len(missing) == 1:
            dep_str = 'dependency'
        msg = f'Missing optional {dep_str} {missing_str}. Use pip or conda to install.'
        return msg

    missing_modules: List[str] = [name for name, module in zip(names, modules) if module is None]
    if len(missing_modules) > 0:
        if error == "raise":
            raise ImportError(error_msg(missing_modules))
        if error == "warn":
            for name in missing_modules:
                ## Ensures warning is printed only once
                if warn_every_time is True or name not in __INSTALLED_OPTIONAL_MODULES:
                    print(f'Warning: {error_msg(name)}')
                    __INSTALLED_OPTIONAL_MODULES[name] = False
        return None
    if len(modules) == 1:
        return modules[0]
    return tuple(modules)
Usage: ignore errors (error="ignore", default behavior)
Suppose we want to run certain code only if the required libraries exist:
if module_exists("pydantic", "sklearn"):
from pydantic import BaseModel
from sklearn.metrics import accuracy_score
class AccuracyCalculator(BaseModel):
num_decimals: int = 5
def calculate(self, y_pred: List, y_true: List) -> float:
return round(accuracy_score(y_true, y_pred), self.num_decimals)
print("Defined AccuracyCalculator in global context")
If either dependency, pydantic or sklearn, does not exist, then the class AccuracyCalculator will not be defined and the print statement will not run.
Usage: raise ImportError (error="raise")
Alternatively, you can raise an error if any module does not exist:
if module_exists("pydantic", "sklearn", error="raise"):
from pydantic import BaseModel
from sklearn.metrics import accuracy_score
class AccuracyCalculator(BaseModel):
num_decimals: int = 5
def calculate(self, y_pred: List, y_true: List) -> float:
return round(accuracy_score(y_true, y_pred), self.num_decimals)
print("Defined AccuracyCalculator in global context")
Output:
line 60, in module_exists(error, __INSTALLED_OPTIONAL_MODULES, *names)
58 if len(missing_modules) > 0:
59 if error == "raise":
---> 60 raise ImportError(error_msg(missing_modules))
61 if error == "warn":
62 for name in missing_modules:
ImportError: Missing optional dependencies "pydantic" "sklearn". Use pip or conda to install.
Usage: print a warning (error="warn")
Alternatively, you can print a warning if the module does not exist.
if module_exists("pydantic", "sklearn", error="warn"):
from pydantic import BaseModel
from sklearn.metrics import accuracy_score
class AccuracyCalculator(BaseModel):
num_decimals: int = 5
def calculate(self, y_pred: List, y_true: List) -> float:
return round(accuracy_score(y_true, y_pred), self.num_decimals)
print("Defined AccuracyCalculator in global context")
if module_exists("pydantic", "sklearn", error="warn"):
from pydantic import BaseModel
from sklearn.metrics import roc_auc_score
class RocAucCalculator(BaseModel):
num_decimals: int = 5
def calculate(self, y_pred: List, y_true: List) -> float:
return round(roc_auc_score(y_true, y_pred), self.num_decimals)
print("Defined RocAucCalculator in global context")
Output:
Warning: Missing optional dependency "pydantic". Use pip or conda to install.
Warning: Missing optional dependency "sklearn". Use pip or conda to install.
Here, we ensure that only one warning is printed for each missing module, otherwise you would get a warning each time you try to import.
This is very useful for Python libraries where you might try to import the same optional dependencies many times, and only want to see one Warning.
You can pass warn_every_time=True to always print the warning when you try to import.
I'm really excited to share this new technique I came up with to handle optional dependencies!
The concept is to produce the error when the uninstalled package is used, not when it is imported.
Just add a single call before your imports. You don't need to change any code at all. No more try: when importing. No more conditional skip decorators when writing tests.
Main components
An importer to return a fake module for missing imports
A fake module that raises an exception when it's used
A custom Exception that will skip tests automatically if raised within one
Minimal Code Example
import sys
import importlib
from unittest.case import SkipTest
from _pytest.outcomes import Skipped

class MissingOptionalDependency(SkipTest, Skipped):
    def __init__(self, msg=None):
        self.msg = msg

    def __repr__(self):
        return f"MissingOptionalDependency: {self.msg}" if self.msg else "MissingOptionalDependency"

class GeneralImporter:
    def __init__(self, *names):
        self.names = names
        sys.meta_path.insert(0, self)

    def find_spec(self, fullname, path=None, target=None):
        if fullname in self.names:
            return importlib.util.spec_from_loader(fullname, self)

    def create_module(self, spec):
        return FakeModule(name=spec.name)

    def exec_module(self, module):
        pass

class FakeModule:
    def __init__(self, name):
        self.name = name

    def __call__(self, *args, **kwargs):
        raise MissingOptionalDependency(f"Optional dependency '{self.name}' was used but it isn't installed.")

GeneralImporter("notinstalled")

import notinstalled  # No error

print(notinstalled)  # <__main__.FakeModule object at 0x0000014B7F6D9E80>

notinstalled()  # MissingOptionalDependency: Optional dependency 'notinstalled' was used but it isn't installed.
Package
The technique above has some shortcomings that my package fixes.
It's open-source, lightweight, and has no dependencies!
Some key differences to the example above:
Covers more than 100 dunder methods (All tested)
Covers 15 common dunder attribute lookups
Entry function is generalimport which returns an ImportCatcher
ImportCatcher holds names, scope, and caught names
It can be enabled and disabled
The scope prevents external packages from being affected
Wildcard support to allow any package to be imported
Puts the importer first in sys.meta_path
Lets it catch namespace imports (Usually occurs with uninstalled packages)
Generalimport on GitHub
pip install generalimport
Minimal example
from generalimport import generalimport
generalimport("notinstalled")
from notinstalled import missing_func # No error
missing_func() # Error occurs here
The readme on GitHub goes more in-depth
One way to handle the problem of different dependencies for different features is to implement the optional features as plugins. That way the user has control over which features are activated in the app but isn't responsible for tracking down the dependencies herself. That task then gets handled at the time of each plugin's installation.
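A minimal sketch of that plugin idea using entry points (the importlib.metadata API, Python 3.10+; the group name "myapp.features" is hypothetical): each optional feature ships as its own installable package, and the app only loads the ones that are actually installed.

from importlib.metadata import entry_points

def load_features():
    features = {}
    for ep in entry_points(group="myapp.features"):  # only installed plugins appear here
        try:
            features[ep.name] = ep.load()            # the real import happens on load()
        except ImportError:
            pass  # plugin present but one of its own dependencies is missing
    return features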
I have an interesting problem. I am mocking urllib2.urlopen with the python mock library as follows:
from StringIO import StringIO  # Python 2; the tests also need: from mock import Mock

def mock_url_open_conn_for_json_feed():
    json_str = """
    {"actions":[{"causes":[{"shortDescription":"Started by user anonymous","userId":null,"userName":"anonymous"}]}],"artifacts":[],"building":false,"description":null,"duration":54,"estimatedDuration":54,
    "fullDisplayName":"test3#1",
    "id":"2012-08-24_14-10-34","keepLog":false,"number":1,"result":"SUCCESS","timestamp":1345842634000,
    "url":"http://localhost:8080/job/test3/1/","builtOn":"","changeSet":{"items":[],"kind":null},"culprits":[]}
    """
    return StringIO(json_str)

def test_case_foo(self):
    io = mock_url_open_conn_for_json_feed()
    io.seek(0)
    mylib.urllib2.urlopen = Mock(return_value=io)
    test_obj.do_your_thing()

def test_case_foo_bar(self):
    io = mock_url_open_conn_for_json_feed()
    io.seek(0)
    mylib.urllib2.urlopen = Mock(return_value=io)
    test_obj.param = xyz
    test_obj.do_your_thing()

class ObjUnderTest():
    def do_your_thing(self):
        conn = urllib2.urlopen(url)
        simplejson.load(conn)
The first unit test, test_case_foo, runs without a problem. But simplejson.load closes the StringIO, so when test_case_foo_bar calls do_your_thing() it tries to simplejson.load the same StringIO object (even though I return a newly constructed StringIO from the helper), and it's already been closed. I get the following error:
json = simplejson.load(conn)
File "/Users/sam/Library/Python/2.7/lib/python/site-packages/simplejson/__init__.py", line 391, in load
return loads(fp.read(),
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/StringIO.py", line 127, in read
_complain_ifclosed(self.closed)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/StringIO.py", line 40, in _complain_ifclosed
raise ValueError, "I/O operation on closed file"
ValueError: I/O operation on closed file
I have two questions:
1) Why is the StringIO constructor not returning a new object?
2) Is there a workaround for this? Or a better way to achieve what I'm trying to achieve?
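One common workaround (my suggestion, not from the original post): give the mock a side_effect instead of a return_value. A return_value is a single object handed back on every call, whereas a side_effect callable runs on each call, so every urlopen() gets a fresh StringIO that simplejson.load can close harmlessly.

mylib.urllib2.urlopen = Mock(
    side_effect=lambda *args, **kwargs: mock_url_open_conn_for_json_feed())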
I'm trying to extract text from a large number of PDFs using PDFMiner python bindings. The module I wrote works for many PDFs, but I get this somewhat cryptic error for a subset of PDFs:
ipython stack trace:
/usr/lib/python2.7/dist-packages/pdfminer/pdfparser.pyc in set_parser(self, parser)
331 break
332 else:
--> 333 raise PDFSyntaxError('No /Root object! - Is this really a PDF?')
334 if self.catalog.get('Type') is not LITERAL_CATALOG:
335 if STRICT:
PDFSyntaxError: No /Root object! - Is this really a PDF?
Of course, I immediately checked to see whether or not these PDFs were corrupted, but they can be read just fine.
Is there any way to read these PDFs despite the absence of a root object? I'm not too sure where to go from here.
Many thanks!
Edit:
I tried using PyPDF in an attempt to get some differential diagnostics. The stack trace is below:
In [50]: pdf = pyPdf.PdfFileReader(file(fail, "rb"))
---------------------------------------------------------------------------
PdfReadError Traceback (most recent call last)
/home/louist/Desktop/pdfs/indir/<ipython-input-50-b7171105c81f> in <module>()
----> 1 pdf = pyPdf.PdfFileReader(file(fail, "rb"))
/usr/lib/pymodules/python2.7/pyPdf/pdf.pyc in __init__(self, stream)
372 self.flattenedPages = None
373 self.resolvedObjects = {}
--> 374 self.read(stream)
375 self.stream = stream
376 self._override_encryption = False
/usr/lib/pymodules/python2.7/pyPdf/pdf.pyc in read(self, stream)
708 line = self.readNextEndLine(stream)
709 if line[:5] != "%%EOF":
--> 710 raise utils.PdfReadError, "EOF marker not found"
711
712 # find startxref entry - the location of the xref table
PdfReadError: EOF marker not found
Quonux suggested that perhaps PDFMiner stopped parsing after reaching the first EOF character. This would seem to suggest otherwise, but I'm very much clueless. Any thoughts?
The solution in slate pdf is to use 'rb' --> read binary mode.
Because slate pdf depends on PDFMiner and I had the same problem, this should solve your problem:
fp = open('C:\Users\USER\workspace\slate_minner\document1.pdf','rb')
doc = slate.PDF(fp)
print doc
Interesting problem. I performed some research.
The function which parses the pdf (from PDFMiner's source code):
def set_parser(self, parser):
    "Set the document to use a given PDFParser object."
    if self._parser: return
    self._parser = parser
    # Retrieve the information of each header that was appended
    # (maybe multiple times) at the end of the document.
    self.xrefs = parser.read_xref()
    for xref in self.xrefs:
        trailer = xref.get_trailer()
        if not trailer: continue
        # If there's an encryption info, remember it.
        if 'Encrypt' in trailer:
            #assert not self.encryption
            self.encryption = (list_value(trailer['ID']),
                               dict_value(trailer['Encrypt']))
        if 'Info' in trailer:
            self.info.append(dict_value(trailer['Info']))
        if 'Root' in trailer:
            # Every PDF file must have exactly one /Root dictionary.
            self.catalog = dict_value(trailer['Root'])
            break
    else:
        raise PDFSyntaxError('No /Root object! - Is this really a PDF?')
    if self.catalog.get('Type') is not LITERAL_CATALOG:
        if STRICT:
            raise PDFSyntaxError('Catalog not found!')
    return
If you have a problem with the EOF, another exception will be raised:
'''another function from source'''
def load(self, parser, debug=0):
    while 1:
        try:
            (pos, line) = parser.nextline()
            if not line.strip(): continue
        except PSEOF:
            raise PDFNoValidXRef('Unexpected EOF - file corrupted?')
        if not line:
            raise PDFNoValidXRef('Premature eof: %r' % parser)
        if line.startswith('trailer'):
            parser.seek(pos)
            break
        f = line.strip().split(' ')
        if len(f) != 2:
            raise PDFNoValidXRef('Trailer not found: %r: line=%r' % (parser, line))
        try:
            (start, nobjs) = map(long, f)
        except ValueError:
            raise PDFNoValidXRef('Invalid line: %r: line=%r' % (parser, line))
        for objid in xrange(start, start+nobjs):
            try:
                (_, line) = parser.nextline()
            except PSEOF:
                raise PDFNoValidXRef('Unexpected EOF - file corrupted?')
            f = line.strip().split(' ')
            if len(f) != 3:
                raise PDFNoValidXRef('Invalid XRef format: %r, line=%r' % (parser, line))
            (pos, genno, use) = f
            if use != 'n': continue
            self.offsets[objid] = (int(genno), long(pos))
    if 1 <= debug:
        print >>sys.stderr, 'xref objects:', self.offsets
    self.load_trailer(parser)
    return
From the wiki (PDF specs):
A PDF file consists primarily of objects, of which there are eight types:
Boolean values, representing true or false
Numbers
Strings
Names
Arrays, ordered collections of objects
Dictionaries, collections of objects indexed by Names
Streams, usually containing large amounts of data
The null object
Objects may be either direct (embedded in another object) or indirect. Indirect objects are numbered with an object number and a generation number. An index table called the xref table gives the byte offset of each indirect object from the start of the file. This design allows for efficient random access to the objects in the file, and also allows for small changes to be made without rewriting the entire file (incremental update). Beginning with PDF version 1.5, indirect objects may also be located in special streams known as object streams. This technique reduces the size of files that have large numbers of small indirect objects and is especially useful for Tagged PDF.
I think the problem is that your "damaged pdf" has a few root elements in it.
Possible solution:
You can download the sources and add print functions everywhere xref objects are retrieved and wherever the parser tries to parse those objects. That would make it possible to determine the full stack of the error (before the error appears).
PS: I think it is some kind of bug in the product.
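As a first diagnostic step, a small sketch of my own (not from PDFMiner): scan the raw bytes for the markers the parsers look for. Several %%EOF/startxref hits usually indicate incremental updates (multiple trailers), while zero hits suggest a truncated or non-PDF file. 'fail.pdf' is a placeholder name.

data = open('fail.pdf', 'rb').read()
print(data[:8])                             # should start with b'%PDF-1.'
print(data.count(b'%%EOF'), 'EOF markers')
print(data.count(b'startxref'), 'startxref entries')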
I have had this same problem in Ubuntu. I have a very simple solution: just print the pdf file to a new PDF. If you are in Ubuntu:
Open the pdf file using the (Ubuntu) document viewer.
Go to File.
Go to Print.
Choose "print to file" and select "pdf" as the output format.
If you want to make the process automatic, use a script to print all your pdf files automatically; a Linux script like this works:
for f in *.pdfx
do
lowriter --headless --convert-to pdf "$f"
done
Note that I gave the original (problematic) pdf files the extension .pdfx.
I got this error as well and kept trying:
fp = open('example','rb')
However, I still got the error the OP indicated. What I found is that I had a bug in my code where the PDF was still held open by another function.
So make sure you don't have the PDF open in memory elsewhere as well.
An answer above is right: this error appears only on Windows, and the workaround is to replace
with open(path, 'rb')
with
fp = open(path,'rb')