Google Dataflow - Failed to import custom python modules - python

My Apache Beam pipeline implements custom Transforms and ParDos in Python modules that in turn import other modules I wrote. With the local runner this works fine, since all the files are available on the same path. With the Dataflow runner, the pipeline fails with a module import error.
How do I make my custom modules available to all the Dataflow workers? Please advise.
Below is an example:
ImportError: No module named DataAggregation
at find_class (/usr/lib/python2.7/pickle.py:1130)
at find_class (/usr/local/lib/python2.7/dist-packages/dill/dill.py:423)
at load_global (/usr/lib/python2.7/pickle.py:1096)
at load (/usr/lib/python2.7/pickle.py:864)
at load (/usr/local/lib/python2.7/dist-packages/dill/dill.py:266)
at loads (/usr/local/lib/python2.7/dist-packages/dill/dill.py:277)
at loads (/usr/local/lib/python2.7/dist-packages/apache_beam/internal/pickler.py:232)
at apache_beam.runners.worker.operations.PGBKCVOperation.__init__ (operations.py:508)
at apache_beam.runners.worker.operations.create_pgbk_op (operations.py:452)
at apache_beam.runners.worker.operations.create_operation (operations.py:613)
at create_operation (/usr/local/lib/python2.7/dist-packages/dataflow_worker/executor.py:104)
at execute (/usr/local/lib/python2.7/dist-packages/dataflow_worker/executor.py:130)
at do_work (/usr/local/lib/python2.7/dist-packages/dataflow_worker/batchworker.py:642)

The issue is probably that you haven't grouped your files as a package. The Beam documentation has a section on it.
Multiple File Dependencies
Often, your pipeline code spans multiple files. To run your project remotely, you must group these files as a Python package and specify the package when you run your pipeline. When the remote workers start, they will install your package. To group your files as a Python package and make it available remotely, perform the following steps:
Create a setup.py file for your project. The following is a very basic setup.py file:
import setuptools

setuptools.setup(
    name='PACKAGE-NAME',
    version='PACKAGE-VERSION',
    install_requires=[],
    packages=setuptools.find_packages(),
)
Structure your project so that the root directory contains the setup.py file, the main workflow file, and a directory with the rest of the files.
root_dir/
    setup.py
    main.py
    other_files_dir/
See Juliaset for an example that follows this required project structure.
Run your pipeline with the following command-line option:
--setup_file /path/to/setup.py
Note: If you created a requirements.txt file and your project spans multiple files, you can get rid of the requirements.txt file and instead, add all packages contained in requirements.txt to the install_requires field of the setup call (in step 1).
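For concreteness, the same option can also be set programmatically when building the pipeline. A minimal sketch, assuming a hypothetical project ID, bucket, and trivial pipeline (none of these values come from the question):
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions, SetupOptions

options = PipelineOptions(
    runner='DataflowRunner',
    project='my-gcp-project',            # placeholder project ID
    temp_location='gs://my-bucket/tmp',  # placeholder bucket
    region='us-central1',
)
# Point the workers at the package definition so they install it on startup.
options.view_as(SetupOptions).setup_file = './setup.py'

with beam.Pipeline(options=options) as p:
    (p
     | 'Create' >> beam.Create([1, 2, 3])
     | 'Double' >> beam.Map(lambda x: x * 2))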

I ran into the same issue and unfortunately, the docs are not as verbose as they need to be.
As it turns out, the problem is that both root_dir and other_files_dir must contain an __init__.py file. When a directory contains an __init__.py file (even if it's empty), Python treats that directory as a package, which in this instance is what we want. So your final folder structure should look something like this:
root_dir/
    __init__.py
    setup.py
    main.py
    other_files_dir/
        __init__.py
        module_1.py
        module_2.py
What you'll also find is that Python builds an .egg-info folder that describes your package, including all pip dependencies. It also contains a top_level.txt file, which holds the name of the directory containing the modules (i.e. other_files_dir).
Then you would simply import the modules in main.py as below:
from other_files_dir import module_1

Related

Google Cloud Buildpack custom source directory for Python app

I am experimenting with Google Cloud Platform buildpacks, specifically for Python. I started with the Sample Functions Framework Python example app, and got that running locally, with commands:
pack build --builder=gcr.io/buildpacks/builder sample-functions-framework-python
docker run -it -ePORT=8080 -p8080:8080 sample-functions-framework-python
Great, let's see if I can apply this concept to a legacy project (Python 3.7, if that matters).
The legacy project has a structure similar to:
.gitignore
source/
    main.py
    lib/
        helper.py
    requirements.txt
tests/
    <test files here>
The Dockerfile that came with this project packaged the source directory contents without the "source" directory, like this:
COPY lib/ /app/lib
COPY main.py /app
WORKDIR /app
... rest of Dockerfile here ...
Is there a way to package just the contents of the source directory using the buildpack?
I tried to add this config to the project.toml file:
[[build.env]]
name = "GOOGLE_FUNCTION_SOURCE"
value = "./source/main.py"
But the Python modules/imports aren't set up correctly for that, as I get this error:
File "/workspace/source/main.py", line 2, in <module>
from source.lib.helper import mymethod
ModuleNotFoundError: No module named 'source'
Putting both main.py and /lib into the project root dir would make this work, but I'm wondering if there is a better way.
Related question, is there a way to see what project files are being copied into the image by the buildpack? I tried using verbose logging but didn't see anything useful.
Update:
The python module error:
File "/workspace/source/main.py", line 2, in <module>
from source.lib.helper import mymethod
ModuleNotFoundError: No module named 'source'
was happening because I moved the lib dir into source in my test project, and when I did this, IntelliJ updated the import statement in main.py without me catching it. I fixed the import, then applied the solution listed below and it worked.
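For reference, with lib/ sitting next to main.py inside source/ (and source/ being the buildpack root), the corrected import presumably looks like:
from lib.helper import mymethod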
I had been searching the buildpack and Google Cloud Functions documentation, but I discovered the option I needed on the pack build documentation page: the --path option.
This command only captures the source directory contents:
pack build --builder=gcr.io/buildpacks/builder --path source sample-functions-framework-python
If you change the path, the project.toml descriptor needs to be in that directory too (or you can point to it with --descriptor on the command line).

Iterate over package_data files and copy them to current working directory

Background
I'm developing a python package with roughly the following directory structure:
mossutils/
    setup.py
    mossutils/
        __init__.py
        init.py
        data/
            style.css
            script.js
            ...
My package's setup.py declares console_scripts and includes package_data files:
setup(
    name='mossutils',
    packages=['mossutils'],
    package_data={"mossutils": ["data/*"]},
    entry_points={
        "console_scripts": ['mu-init = mossutils.init:main']
    },
    ...)
Installing the package via pip install works as expected: everything is installed in my Python's Lib\site-packages, including the data directory and all files in it, and script mu-init can be executed from the shell (or rather, command prompt, since I'm using Windows).
Goal
Script mu-init is supposed to do some kind of project scaffolding in the current working directory it is invoked from. In particular, it should copy all package_data files (data/style.css, data/script.js, ...) to the current directory.
Solution Attempt
Using module pkgutil, I can read the content of a file, e.g.
import pkgutil
...
data = pkgutil.get_data(__name__, "data/style.css")
Questions
Is there a way for my init.py script to iterate over the contents of the data directory, without hard-coding the file names (in init.py)?
Can the files from the data directory be copied to the current working directory, without opening the source file, reading the content, and writing it to a destination file?
You can get the list of files in the directory using the pkg_resources library, which is distributed together with setuptools.
import pkg_resources
pkg_resources.resource_listdir("mossutils", "data")
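Building on that, here is a sketch of how init.py could both list and copy the bundled files into the current working directory without hard-coding file names (the function name is just an example):
import os
import shutil
import pkg_resources

def copy_data_to_cwd():
    # List everything bundled under mossutils/data without hard-coding names.
    for name in pkg_resources.resource_listdir("mossutils", "data"):
        # resource_filename yields a real filesystem path (extracting from a
        # zipped egg if necessary), so the file can be copied directly.
        src = pkg_resources.resource_filename("mossutils", "data/" + name)
        if os.path.isfile(src):
            shutil.copy(src, os.getcwd())
resource_filename transparently extracts the file if the package is installed in zipped form, so nothing has to be read and rewritten by hand.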

Python module by SWIG in conda: Where do I have to place which file?

I am trying to generate Python bindings for a C++ shared library with SWIG and distribute the project with conda. The build process seems to work as I can execute
import mymodule as m
ret = m.myfunctioninmymodule()
in my build directory. Now I want to install files that are created (namely, mymodule.py and _mymodule.pyd) in my conda environment on Windows so that I can access them from everywhere. But where do I have to place the files?
What I have tried so far is to put both files in a package together with an __init__.py (which is empty, however) and write a setup.py as suggested here. The package has the form
- mypackage
|- __init__.py
|- mymodule.py
|- _mymodule.pyd
and is installed under C:\mypathtoconda\conda\envs\myenvironmentname\Lib\site-packages\mypackage-1.0-py3.6-win-amd64.egg. However, the python import (see above) fails with
ImportError: cannot import name '_mymodule'
It should be noted that under Linux this approach with the package works perfectly fine.
Edit: The __init__.py is empty because this is sufficient to build a package. I am not sure, however, what belongs in there. Do I have to give a path to certain components?
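For what it's worth, an empty __init__.py is enough to mark the directory as a package; nothing else is required in it. A common optional addition is to re-export the wrapped module's public names so callers can use the package directly. This is only an illustration of what typically goes there, not a fix for the ImportError above:
# mypackage/__init__.py
# Optional: expose the SWIG wrapper's names at package level.
from .mymodule import *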

How to submit a job with multiple Python scripts for training to Cloud ML Engine

I have a project with more than one file of Python code.
I have one file for the model, one for data utilities, and one for training the model.
I know how to submit a job when all the code is in one file.
How can I indicate that I have more files in my project?
Maybe something needs to be added to the setup.py file or __init__.py.
My directory looks like this:
setup.py
trainer/
    __init__.py
    task.py
    model/
        seq2seq.py
        model.py
    data_util.py
You do not need to manually create your own package, though you're welcome to if you want.
There are two important steps in getting the package to work automatically:
Create a proper python package
Ensure your setup.py is correct.
In your case, the model subdirectory is causing issues. The quick fix is to move trainer/model/* to trainer/. Otherwise, you probably want to make model a proper sub-package by adding a (probably blank) __init__.py file in the model/ subdirectory.
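With that __init__.py in place, task.py can then pull in the other modules with absolute imports. A sketch based on the directory listing above (the module names mirror the question; the actual usage is hypothetical):
# trainer/task.py
from trainer import data_util
from trainer.model import model, seq2seq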
Next, ensure your setup.py file is correctly specified. A sample script is provided in this documentation, repeated here for convenience:
from setuptools import find_packages
from setuptools import setup

setup(
    name='trainer',
    version='0.1',
    include_package_data=True,
    description='blah',
    packages=find_packages(),
)
You can verify that it worked by running:
python setup.py sdist
That will create a dist subdirectory with a file trainer-0.1.tar.gz. Extracting the contents of that file shows that all of the files were correctly included:
$ cd dist
$ tar -xvf trainer-0.1.tar.gz
$ find trainer-0.1/
trainer-0.1/
trainer-0.1/setup.py
trainer-0.1/setup.cfg
trainer-0.1/trainer
trainer-0.1/trainer/data_util.py
trainer-0.1/trainer/task.py
trainer-0.1/trainer/__init__.py
trainer-0.1/trainer/model
trainer-0.1/trainer/model/__init__.py
trainer-0.1/trainer/model/model.py
trainer-0.1/trainer/model/seq2seq.py
trainer-0.1/PKG-INFO
trainer-0.1/trainer.egg-info
trainer-0.1/trainer.egg-info/dependency_links.txt
trainer-0.1/trainer.egg-info/PKG-INFO
trainer-0.1/trainer.egg-info/SOURCES.txt
trainer-0.1/trainer.egg-info/top_level.txt
I found the answer in the Cloud ML Engine documentation:
https://cloud.google.com/ml-engine/docs/how-tos/packaging-trainer#to_use_the_gcloud_tool_to_use_an_existing_package_already_in_the_cloud

Running a script from a package

I'm new to Python, coming from Java. I created a folder called 'Project'. In 'Project' I created many packages (with __init__.py files) like 'test1' and 'test2'. 'test1' contains a Python script x.py that uses code from 'test2' (it imports a module from test2). I want to run the script x.py in 'test1' from the command line. How can I do that?
Edit: if you have recommendations on how I could better organize my files, I would be thankful. (Notice my Java mentality.)
Edit: I need to run the script from a bash script, so I need to provide full paths.
There are probably several ways to achieve what you want.
One thing that I do when I need to make sure the module paths are correct in an executable script is to get the parent directory and insert it into the module search path (sys.path):
import sys, os

# Add the parent of this script's directory to the module search path.
sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.realpath(__file__))))

import test1  # next imports go here...
from test2 import something
# any import that works from the parent dir will work here
This way you are safe to run your scripts without worrying how the script is called.
Python code is organized into modules and packages. A module is just a .py file that can contain class definitions, function definitions, and variables. A package is a directory with an __init__.py file.
A standard Python project might look something like this:
thingsproject/
    README
    setup.py
    doc/
        ...
    things/
        __init__.py
        animals.py
        vegetables.py
        minerals.py
    test/
        test_animals.py
        test_vegetables.py
        test_minerals.py
The setup.py file describes the metadata about your project. See Writing the Setup Script and particularly the section on installing scripts.
Entry points exist to help distribute command line tools in Python. An entry point is defined in setup.py like this:
setup(
    name='thingsproject',
    ....
    entry_points={
        'console_scripts': ['dog = things.animals:dog_main_function']
    },
    ...
)
The effect is that when the package is installed using python setup.py install, a script is automatically created in some reasonable place according to your OS, such as /usr/local/bin. The script then calls the dog_main_function in the animals module of the things package.
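For completeness, the target of that entry point is just an ordinary function; a hypothetical things/animals.py might contain something like:
# things/animals.py
def dog_main_function():
    # Invoked by the "dog" console script installed via setup.py.
    print("woof")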
Yet another Python convention to consider is having a __main__.py file. This signifies the "main" script within a directory or zip file full of Python code. It is a good place to define a command-line interface to your code using the argparse parser for command-line arguments.
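A minimal sketch of such a __main__.py using argparse (the argument names are placeholders, not part of the example project above):
# things/__main__.py
import argparse

def main():
    parser = argparse.ArgumentParser(description="Command line interface for the things package.")
    parser.add_argument("name", help="Name of the thing to look up.")
    args = parser.parse_args()
    print("You asked about:", args.name)

if __name__ == "__main__":
    main()
The package can then be run with python -m things.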
Good and up-to-date information on the somewhat muddled world of Python packaging can be found in the Python Packaging User Guide.
