Pipeline code spanning multiple files in Apache Beam / Dataflow - python

After a lengthy search, I haven't found an example of a Dataflow / Beam pipeline that spans several files. Beam docs do suggest a file structure (under the section "Multiple File Dependencies"), but the Juliaset example they give has in effect a single code/source file (and the main file that calls it). Based on the Juliaset example, I need a similar file structure:
juliaset/__init__.py
juliaset/juliaset.py # actual code
juliaset/some_conf.py
__init__.py
juliaset_main.py
setup.py
Now I want to import .some_conf from juliaset/juliaset.py, which works when run locally but gives me an error when run on Dataflow
INFO:root:2017-12-15T17:34:09.333Z: JOB_MESSAGE_ERROR: (8cdf3e226105b90a): Traceback (most recent call last):
File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/batchworker.py", line 706, in run
self._load_main_session(self.local_staging_directory)
File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/batchworker.py", line 446, in _load_main_session
pickler.load_session(session_file)
File "/usr/local/lib/python2.7/dist-packages/apache_beam/internal/pickler.py", line 247, in load_session
return dill.load_session(file_path)
File "/usr/local/lib/python2.7/dist-packages/dill/dill.py", line 363, in load_session
module = unpickler.load()
File "/usr/lib/python2.7/pickle.py", line 858, in load
dispatch[key](self)
File "/usr/lib/python2.7/pickle.py", line 1133, in load_reduce
value = func(*args)
File "/usr/local/lib/python2.7/dist-packages/dill/dill.py", line 767, in _import_module
return getattr(__import__(module, None, None, [obj]), obj)
ImportError: No module named package_name.juliaset.some_conf
A full working example would be very much appreciated!

Can you verify that your setup.py contains a structure like this:
import setuptools

setuptools.setup(
    name='My Project',
    version='1.0',
    install_requires=[],
    packages=setuptools.find_packages(),
)
Import your modules like from juliaset.juliaset import SomeClass
And when you call the Python script, use python -m juliaset_main (without the .py)
Not sure if you already tried this, but just to be sure.
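For completeness, here is a sketch of how the package typically gets staged when launching on Dataflow: the standard --setup_file pipeline option points Beam at your setup.py so the whole package is built as an sdist and installed on the workers. The project and bucket names below are placeholders, and the trivial pipeline body is only there to make the fragment complete.

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# --setup_file makes Beam build a source distribution from setup.py and
# install it on every Dataflow worker, so submodules like
# juliaset.some_conf become importable there too.
options = PipelineOptions([
    '--runner=DataflowRunner',
    '--project=my-gcp-project',            # placeholder
    '--temp_location=gs://my-bucket/tmp',  # placeholder
    '--setup_file=./setup.py',
])

with beam.Pipeline(options=options) as p:
    p | beam.Create([1, 2, 3]) | beam.Map(print)
```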

Related

Executable generated from PyInstaller is not working

I'm writing a program that manages a bookstore and I'm nearly finished with it. I'm creating an executable from it, but it gives an error when run and I don't know what the problem is.
Error:
Traceback (most recent call last):
File "main.py", line 3494, in <module>
File "db_manager.py", line 278, in titulo_livros
File "pandas\io\parsers.py", line 605, in read_csv
File "pandas\io\parsers.py", line 457, in _read
File "pandas\io\parsers.py", line 814, in __init__
File "pandas\io\parsers.py", line 1045, in _make_engine
File "pandas\io\parsers.py", line 1862, in __init__
File "pandas\io\parsers.py", line 1357, in _open_handles
File "pandas\io\common.py", line 642, in get_handle
FileNotFoundError: [Errno 2] No such file or directory: 'livros.csv'
[432] Failed to execute script main
The command I am using to generate the exe file is "pyinstaller --onefile main.py".
And that is my tree folder:
[screenshot of the project folder tree]
Please help me, I have no idea what is going on.
Thank you very much in advance.
Somewhere you are doing pandas.read_csv(fname) where fname='livros.csv'.
You need to give it the right path to the CSV (or bundle the CSV into the executable, but that probably doesn't make sense; I'm not sure why you would ever bundle the CSV into the executable).
After a lot of back and forth, I think this is what you want:
import os
import pandas
from sqlalchemy import create_engine

# Keep the database at a fixed, absolute location (the user's home
# directory) so it is found no matter where the exe is launched from.
db_path = os.path.expanduser('~/my_file.db')
engine = create_engine('sqlite:///' + db_path, echo=False)

try:
    existing = pandas.read_sql('SELECT title, author FROM books', engine)
except Exception:
    # First run: the table doesn't exist yet, so start from sample data.
    existing = pandas.DataFrame({'title': ['Title 1', 'Title 2'],
                                 'author': ['Bob Roberts', 'Sam Spade']})

print("DBPATH:", db_path)
# ... do some stuff (add/edit/remove items from your dataframe)
existing.to_sql("books", engine, if_exists='replace')
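If you do decide to bundle the CSV into the executable (via --add-data), here is a minimal sketch of the usual path-resolution helper. The name resource_path is my own; sys._MEIPASS is the temporary directory that PyInstaller's --onefile bundles unpack into at runtime.

```python
import os
import sys

def resource_path(relative_path):
    # In a --onefile bundle, PyInstaller unpacks data files into a temp
    # directory exposed as sys._MEIPASS; during normal development that
    # attribute doesn't exist, so fall back to this file's directory.
    base = getattr(sys, '_MEIPASS', os.path.dirname(os.path.abspath(__file__)))
    return os.path.join(base, relative_path)

# Hypothetical usage in db_manager.py:
# livros = pandas.read_csv(resource_path('livros.csv'))
```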

Create binary of spaCy with PyInstaller

I would like to create a binary of my python code, that contains spaCy.
# main.py
import spacy
import en_core_web_sm

def main() -> None:
    nlp = spacy.load("en_core_web_sm")
    # nlp = en_core_web_sm.load()
    doc = nlp("This is an example")
    print([(w.text, w.pos_) for w in doc])

if __name__ == "__main__":
    main()
Besides my code, I created two PyInstaller-hooks, as described here
To create the binary I use the following command pyinstaller main.py --additional-hooks-dir=..
On the execution of the binary I get the following error message:
Traceback (most recent call last):
File "main.py", line 19, in <module>
main()
File "main.py", line 12, in main
nlp = spacy.load("en_core_web_sm")
File "spacy/__init__.py", line 47, in load
File "spacy/util.py", line 329, in load_model
OSError: [E050] Can't find model 'en_core_web_sm'. It doesn't seem to be a Python package or a valid path to a data directory.
If I use nlp = en_core_web_sm.load() instead of nlp = spacy.load("en_core_web_sm") to load the spaCy model, I get the following error:
Traceback (most recent call last):
File "main.py", line 19, in <module>
main()
File "main.py", line 13, in main
nlp = en_core_web_sm.load()
File "en_core_web_sm/__init__.py", line 10, in load
File "spacy/util.py", line 514, in load_model_from_init_py
File "spacy/util.py", line 389, in load_model_from_path
File "spacy/util.py", line 426, in load_model_from_config
File "spacy/language.py", line 1662, in from_config
File "spacy/language.py", line 768, in add_pipe
File "spacy/language.py", line 659, in create_pipe
File "thinc/config.py", line 722, in resolve
File "thinc/config.py", line 771, in _make
File "thinc/config.py", line 826, in _fill
File "thinc/config.py", line 825, in _fill
File "thinc/config.py", line 1016, in make_promise_schema
File "spacy/util.py", line 137, in get
catalogue.RegistryError: [E893] Could not find function 'spacy.Tok2Vec.v1' in function registry 'architectures'. If you're using a custom function, make sure the code is available. If the function is provided by a third-party package, e.g. spacy-transformers, make sure the package is installed in your environment.
I had this same issue. After the error message you posted above, did you see an "Available names: ..." message? That message suggested that spacy.Tok2Vec.v2 was available but not v1. I was able to edit the config file for en_core_web_sm (for me at dist\<name>\en_core_web_sm\en_core_web_sm-3.0.0\config.cfg) and change all references from spacy.Tok2Vec.v1 to spacy.Tok2Vec.v2. I also had to do this for spacy.MaxoutWindowEncoder.v1. It's still a mystery to me why I'm having the issue only in the PyInstaller distributable and not in my non-compiled script.
I encountered the same issue and solved it by copying the spacy-legacy package to the compiled destination directory.
You could also hook it up via PyInstaller, but I did not really try that.
I hope my answer helps.
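If you'd rather have PyInstaller pick it up automatically instead of copying files by hand, a hook file along these lines should work. This is a sketch: collect_all is a real PyInstaller hook helper, and spacy_legacy is the import name of the spacy-legacy package that provides the legacy v1 architectures.

```python
# hook-spacy_legacy.py -- place in a directory passed via --additional-hooks-dir
from PyInstaller.utils.hooks import collect_all

# Bundle spacy-legacy's code and data so registry entries like
# spacy.Tok2Vec.v1 are available in the frozen app.
datas, binaries, hiddenimports = collect_all('spacy_legacy')
```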

Package level logging in python

I am trying to set up package-level logging via my __init__.py file.
I have been testing this in a virtual environment but cannot seem to understand the error.
My package structure is as so:
extracts/
    extracts/
        __init__.py
        logconfig.ini
        methods.py
    foo.py
    setup.py
    requirements.txt
setup.py looks like so:
setup(name='extracts',
      version='0.0.1',
      description='Extracts',
      packages=['extracts'],
      package_data={'extracts': ['*.ini']},
      install_requires=requirements,
      zip_safe=False)
__init__.py is causing me problems:
import logging.config
from pkg_resources import resource_stream

logfile = resource_stream(__name__, 'logconfig.ini')
logging.config.fileConfig(logfile)
log = logging.getLogger(__name__)
log.info('Importing extracts 0.0.1')
foo.py is the script I'm trying to run. It imports functions from the module methods.py. When that happens, __init__.py should be triggered and set up logging for foo.py. Unfortunately, every time I try this, I get the following error:
Traceback (most recent call last):
File "test.py", line 2, in <module>
from extracts import methods
File "build/bdist.linux-i686/egg/extracts/__init__.py", line 19, in <module>
logfile = resource_stream(__name__, 'logconfig.ini')
File "/usr/lib/python2.7/dist-packages/pkg_resources.py", line 942, in resource_stream
self, resource_name
File "/usr/lib/python2.7/dist-packages/pkg_resources.py", line 1346, in get_resource_stream
return BytesIO(self.get_resource_string(manager, resource_name))
File "/usr/lib/python2.7/dist-packages/pkg_resources.py", line 1349, in get_resource_string
return self._get(self._fn(self.module_path, resource_name))
File "/usr/lib/python2.7/dist-packages/pkg_resources.py", line 1425, in _get
return self.loader.get_data(path)
IOError: [Errno 0] Error: 'extracts/logconfig.ini'
https://pythonhosted.org/setuptools/pkg_resources.html#basic-resource-access mentions that the package_or_requirement argument in resource_stream(package_or_requirement, resource_name) needs to be importable. But what I don't understand is that the extracts package is importable, since
from extracts import methods
does not fail. Any ideas are greatly appreciated. Thank you.
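As an aside, logging.config.fileConfig accepts any readable file-like object, so however you obtain the config bytes (pkgutil.get_data, for instance, works even inside zipped eggs, where a loader's get_data path lookup can fail like the traceback above), you can feed them in via an in-memory stream. A self-contained sketch, with a minimal config equivalent to a logconfig.ini:

```python
import io
import logging
import logging.config

# Minimal INI-style logging config; in the real package these bytes would
# come from pkgutil.get_data('extracts', 'logconfig.ini') instead.
INI = """\
[loggers]
keys=root

[handlers]
keys=console

[formatters]
keys=plain

[logger_root]
level=INFO
handlers=console

[handler_console]
class=StreamHandler
level=INFO
formatter=plain
args=(sys.stderr,)

[formatter_plain]
format=%(name)s %(levelname)s %(message)s
"""

# fileConfig only needs a file-like object, so wrap the string in StringIO.
logging.config.fileConfig(io.StringIO(INI))
log = logging.getLogger('extracts')
log.info('Importing extracts 0.0.1')
```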

When trying to build a Python executable with Pyinstaller, fails to find an existing scipy module

I have a build script for one of my established Python applications that uses Pyinstaller. This script has been working fine for over a year. Then today, I added to one of the source files for this application the line
import scipy.stats
because I want to use scipy.stats.linregress. This now causes the build script to crash with a long error traceback (apparently going back through a sequence of modules that import each other) ending with
File "C:\Users\462974\Documents\Local Sandbox\fof\TRUNK\programs\CDFParsing\build\pyi.win32\CDFGUI\outPYZ1.pyz/scipy.sparse.csgraph", line 148, in <module>
File "C:\Python27\pyinstaller-1.5\iu.py", line 436, in importHook
mod = _self_doimport(nm, ctx, fqname)
File "C:\Python27\pyinstaller-1.5\iu.py", line 495, in doimport
mod = importfunc(nm)
File "C:\Python27\pyinstaller-1.5\iu.py", line 297, in getmod
mod = owner.getmod(nm)
File "C:\Python27\pyinstaller-1.5\archive.py", line 468, in getmod
return iu.DirOwner.getmod(self, self.prefix+'.'+nm)
File "C:\Python27\pyinstaller-1.5\iu.py", line 109, in getmod
mod = imp.load_module(nm, fp, attempt, (ext, mode, typ))
File "_shortest_path.pyx", line 18, in init scipy.sparse.csgraph._shortest_path (scipy\sparse\csgraph\_shortest_path.c:14224)
File "C:\Python27\pyinstaller-1.5\iu.py", line 455, in importHook
raise ImportError, "No module named %s" % fqname
ImportError: No module named scipy.sparse.csgraph._validation
This is puzzling because the module located at C:\Python27\Lib\site-packages\scipy\sparse\csgraph\_validation.py very much exists. Why did adding scipy to my build break it (importing numpy works just fine)? Could PyInstaller be failing to find the module?
Not entirely sure why, but including the following definition in my code after the import statement fixed it:
def fix_dependencies():
    from scipy.sparse.csgraph import _validation

What could cause a python module to be imported twice?

As far as I understand, a Python module is never imported twice, i.e. the code in the module only gets executed the first time it is imported. Subsequent import statements just bind the already-loaded module into the importing scope.
I have a module called "TiledConvC3D.py" that seems to be imported multiple times though. I use pdb to print the stack at the top of the code for this module.
Here is the end of the stack trace from the first time the module is executed:
File "<anonymized>/python_modules/Theano/theano/gof/cmodule.py", line 328, in refresh
key = cPickle.load(open(key_pkl, 'rb'))
File "<anonymized>/ops/TiledConvG3D.py", line 565, in <module>
import TiledConvC3D
File "<anonymized>/ops/TiledConvC3D.py", line 18, in <module>
pdb.traceback.print_stack()
It goes on to be executed several more times. However, the complete stack trace for the second time it is called does not show any calls to reload, so these executions should not be occurring:
File "sup_train_conj_grad.py", line 103, in <module>
dataset = Config.get_dataset(dataset_node)
File "<anonymized>/Config.py", line 279, in get_dataset
from datasets import NewWiskott
File "<anonymized>/datasets/NewWiskott.py", line 16, in <module>
normalizer_train = video.ContrastNormalizer3D(sigma, global_per_frame = False, input_is_5d = True)
File "<anonymized>/util/video.py", line 204, in __init__
self.f = theano.function([input],output)
File "<anonymized>/python_modules/Theano/theano/compile/function.py", line 105, in function
allow_input_downcast=allow_input_downcast)
File "<anonymized>/python_modules/Theano/theano/compile/pfunc.py", line 270, in pfunc
accept_inplace=accept_inplace, name=name)
File "<anonymized>/python_modules/Theano/theano/compile/function_module.py", line 1105, in orig_function
fn = Maker(inputs, outputs, mode, accept_inplace = accept_inplace).create(defaults)
File "/u/goodfeli/python_modules/Theano/theano/compile/function_module.py", line 982, in create
_fn, _i, _o = self.linker.make_thunk(input_storage = input_storage_lists)
File "<anonymized>/python_modules/Theano/theano/gof/link.py", line 321, in make_thunk
output_storage = output_storage)[:3]
File "<anonymized>/python_modules/Theano/theano/gof/cc.py", line 1178, in make_all
output_storage = node_output_storage)
File "<anonymized>/python_modules/Theano/theano/gof/cc.py", line 774, in make_thunk
cthunk, in_storage, out_storage, error_storage = self.__compile__(input_storage, output_storage)
File "<anonymized>/python_modules/Theano/theano/gof/cc.py", line 723, in __compile__
output_storage)
File "<anonymized>/python_modules/Theano/theano/gof/cc.py", line 1037, in cthunk_factory
module = get_module_cache().module_from_key(key=key, fn=self.compile_cmodule)
File "<anonymized>/python_modules/Theano/theano/gof/cc.py", line 59, in get_module_cache
return cmodule.get_module_cache(config.compiledir)
File "<anonymized>/python_modules/Theano/theano/gof/cmodule.py", line 576, in get_module_cache
_module_cache = ModuleCache(dirname, force_fresh=force_fresh)
File "<anonymized>/python_modules/Theano/theano/gof/cmodule.py", line 268, in __init__
self.refresh()
File "<anonymized>/python_modules/Theano/theano/gof/cmodule.py", line 326, in refresh
key = cPickle.load(open(key_pkl, 'rb'))
File "<anonymized>/ops/TiledConvV3D.py", line 504, in <module>
import TiledConvG3D
File "<anonymized>/ops/TiledConvG3D.py", line 565, in <module>
import TiledConvC3D
File "<anonymized>/ops/TiledConvC3D.py", line 22, in <module>
pdb.traceback.print_stack()
Moreover, I also check the id of __builtin__.__import__. At the very start of my main script, I import __builtin__ and print id(__builtin__.__import__) before doing any other imports. I also print id(__builtin__.__import__) from inside the module that is being imported multiple times, and the value of the id does not change.
Are there other mechanisms besides calling reload and overriding __builtin__.__import__ that could explain my module getting loaded multiple times?
A Python module can be imported twice if the module is found twice in the path. For example, say your project is laid out like so:
src/
    package1/
        spam.py
        eggs.py
Suppose your PYTHONPATH (sys.path) includes src and src/package1:
PYTHONPATH=/path/to/src:/path/to/src/package1
If that's the case, you can import the same module twice like this:
from package1 import spam
import spam
And Python will think they are different modules. Is that what's going on?
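To make this failure mode concrete, here's a self-contained sketch that builds exactly that layout in a temporary directory and shows the same file being imported as two distinct module objects:

```python
import os
import sys
import tempfile

# Recreate the layout above: src/package1/spam.py, with BOTH src and
# src/package1 placed on sys.path.
src = tempfile.mkdtemp()
pkg = os.path.join(src, 'package1')
os.makedirs(pkg)
open(os.path.join(pkg, '__init__.py'), 'w').close()
with open(os.path.join(pkg, 'spam.py'), 'w') as f:
    f.write("print('spam module executing')\n")

sys.path[:0] = [src, pkg]

import spam                           # resolved as top-level 'spam' via src/package1
from package1 import spam as spam2    # resolved as 'package1.spam' via src

# The module body ran twice, and the two names refer to different objects.
print(spam is spam2)                  # → False
print('spam' in sys.modules, 'package1.spam' in sys.modules)
```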
Also, per the discussion below (for users searching this question), another way a module can be executed twice is if an exception occurs midway through the first import. For example, if spam imports eggs, but importing eggs raises an exception at module level, Python discards the partially-initialized module, so a later import re-executes it from the top.
In case this might help anyone, if you're running Flask in debug mode it might load modules not just twice, but several times. It happened to me and I just couldn't wrap my head around it, until I found this question. Here's more info:
Why does running the Flask dev server run itself twice?
