As part of upgrading Kedro from 0.16.2 to 0.17.3 in our organization, I've made changes to the Kedro-related files in our codebase, based on the Kedro starter pyspark-iris for 0.17.3.
Now kedro run fails with Error: No such command 'run'.
setup.py
from setuptools import find_packages, setup

entry_point = "kedro-project = kedro-package.__main__:main"

# get the dependencies and installs
with open("requirements.txt", "r", encoding="utf-8") as f:
    # Make sure we strip all comments and options (e.g. "--extra-index-url")
    # that arise from a modified pip.conf file that configures global options
    # when running kedro build-reqs
    requires = []
    for line in f:
        req = line.split("#", 1)[0].strip()
        if req and not req.startswith("--"):
            requires.append(req)

setup(
    name="kedro-package",
    version="0.1",
    packages=find_packages(exclude=["tests"]),
    entry_points={"console_scripts": [entry_point]},
    install_requires=requires,
    extras_require={
        "docs": [
            "sphinx~=3.4.3",
            "sphinx_rtd_theme==0.5.1",
            "nbsphinx==0.8.1",
            "nbstripout==0.3.3",
            "recommonmark==0.7.1",
            "sphinx-autodoc-typehints==1.11.1",
            "sphinx_copybutton==0.3.1",
            "jupyter_client>=5.1.0, <6.0",
            "tornado>=4.2, <6.0",
            "ipykernel~=5.3",
        ]
    },
)
main.py
from pathlib import Path
import logging

from kedro.framework.project import configure_project

from .cli import run


def main():
    package_name = str(Path(__file__).resolve().parent.name)
    logging.getLogger(__name__).info(f"package name is: {package_name}")
    configure_project(package_name=package_name)
    run()


if __name__ == "__main__":
    main()
cli.py sits at the same level as main.py, both directly inside the package (renamed to kedro-package here for anonymity).
This only happens when performing kedro run on EMR. When we run it locally we don't see that error; instead it fails because it can't connect to S3, which is expected.
Additionally, I've tried running
Related
I am using pytest for coverage, and the following is the test script I run from the command line to generate the coverage reports:
"""Manager script to run the commands on the Flask API."""
import os
import click
from api import create_app
from briqy.logHelper import b as logger
import pytest
app = create_app()
#app.cli.command("tests")
#click.argument("option", required=False)
def run_test_with_option(option: str = None):
from subprocess import run
from shlex import split
if option is None:
raise SystemExit(
pytest.main(
[
"--disable-pytest-warnings",
"--cov=lib",
"--cov-config=.coveragerc",
"--cov-report=term",
"--cov-report=xml",
"--cov-report=html",
"--junitxml=./tests/coverage/junit.xml",
"--cov-append",
"--no-cov-on-fail",
]
)
)
elif option == "watch":
run(
split(
'ptw --runner "python3 -m pytest tests --durations=5 '
'--disable-pytest-warnings"'
)
)
elif option == "debug":
run(
split(
'ptw --pdb --runner "python3 -m pytest tests --durations=5 '
'--disable-pytest-warnings"'
)
)
if __name__ == "__main__":
HOST = os.getenv("FLASK_RUN_HOST", default="127.0.0.1")
PORT = os.getenv("FLASK_RUN_PORT", default=5000)
DEBUG = os.getenv("FLASK_DEBUG", default=False)
app.run(
host=HOST,
port=PORT,
debug=DEBUG,
)
if bool(DEBUG):
logger.info(f"Flask server is running in {os.getenv('ENV')}")
However, after running the tests, the coverage report shows the imports at the top of the file as uncovered. Does anyone know how to get rid of those uncovered lines?
You haven't shown the whole program, but I will guess that you are importing your product code at the top of this file. That means all of the top-level statements in the product code (import, class, def, etc.) will have already run by the time the pytest-cov plugin starts the coverage measurement.
There are a few things you can do:
Run pytest from the command line instead of in-process in your product code.
Change this code to start pytest in a subprocess so that everything will be re-imported by pytest (see the sketch below).
What you should not do: exclude import lines from coverage measurement. The .coveragerc you showed will ignore any line with the string "from" or "import" in it, which isn't what you want ("import" is in "do_important_thing()" for example!)
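For the second option, here is a minimal sketch (reusing the coverage flags from the question; the function name is illustrative, not part of the original code). Launching pytest in a fresh interpreter means the product code is imported only after pytest-cov has started measuring:

import sys
from subprocess import run


def run_tests_in_subprocess():
    # A fresh interpreter re-imports the product code under coverage,
    # so the top-level import/class/def lines are measured too.
    result = run(
        [
            sys.executable, "-m", "pytest",
            "--disable-pytest-warnings",
            "--cov=lib",
            "--cov-config=.coveragerc",
            "--cov-report=term",
        ]
    )
    raise SystemExit(result.returncode)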
I am trying to execute an Apache Beam pipeline as a Dataflow job in Google Cloud Platform.
My project structure is as follows:
root_dir/
    __init__.py
    setup.py
    main.py
    utils/
        __init__.py
        log_util.py
        config_util.py
Here's my setup.py
import setuptools

setuptools.setup(
    name='dataflow_example',
    version='1.0',
    install_requires=[
        "google-cloud-tasks==2.2.0",
        "google-cloud-pubsub>=0.1.0",
        "google-cloud-storage==1.39.0",
        "google-cloud-bigquery==2.6.2",
        "google-cloud-secret-manager==2.0.0",
        "google-api-python-client==2.3.0",
        "oauth2client==4.1.3",
        "apache-beam[gcp]>=2.20.0",
        "wheel>=0.36.2"
    ],
    packages=setuptools.find_packages()
)
Here's my pipeline code:
import math
import apache_beam as beam
from datetime import datetime
from apache_beam.options.pipeline_options import PipelineOptions

from utils.log_util import LogUtil
from utils.config_util import ConfigUtil


class DataflowExample:
    config = {}

    def __init__(self):
        self.config = ConfigUtil.get_config(module_config=["config"])
        self.project = self.config['project']
        self.region = self.config['location']
        self.bucket = self.config['core_bucket']
        self.batch_size = 10

    def execute_pipeline(self):
        try:
            LogUtil.log_n_notify(log_type="info", msg=f"Dataflow started")

            query = "SELECT id, name, company FROM `<bigquery_table>` LIMIT 10"

            beam_options = {
                "project": self.project,
                "region": self.region,
                "job_name": "dataflow_example",
                "runner": "DataflowRunner",
                "temp_location": f"gs://{self.bucket}/temp_location/"
            }

            options = PipelineOptions(**beam_options, save_main_session=True)

            with beam.Pipeline(options=options) as pipeline:
                data = (
                    pipeline
                    | 'Read from BQ ' >> beam.io.Read(beam.io.ReadFromBigQuery(query=query, use_standard_sql=True))
                    | 'Count records' >> beam.combiners.Count.Globally()
                    | 'Print ' >> beam.ParDo(PrintCount(), self.batch_size)
                )

            LogUtil.log_n_notify(log_type="info", msg=f"Dataflow completed")
        except Exception as e:
            LogUtil.log_n_notify(log_type="error", msg=f"Exception in execute_pipeline - {str(e)}")


class PrintCount(beam.DoFn):

    def __init__(self):
        self.logger = LogUtil()

    def process(self, row_count, batch_size):
        try:
            current_date = datetime.today().date()
            total = int(math.ceil(row_count / batch_size))

            self.logger.log_n_notify(log_type="info", msg=f"Records pulled from table on {current_date} is {row_count}")
            self.logger.log_n_notify(log_type="info", msg=f"Records per batch: {batch_size}. Total batches: {total}")
        except Exception as e:
            self.logger.log_n_notify(log_type="error", msg=f"Exception in PrintCount.process - {str(e)}")


if __name__ == "__main__":
    df_example = DataflowExample()
    df_example.execute_pipeline()
The functionality of the pipeline is:
Query against BigQuery Table.
Count the total records fetched from querying.
Print using the custom Log module present in utils folder.
I am running the job from Cloud Shell using the command python3 - main.py.
Though the Dataflow job starts, the worker nodes throw an error after a few minutes saying "ModuleNotFoundError: No module named 'utils'".
The "utils" folder is present, and the same code works fine when executed with "DirectRunner".
log_util and config_util are custom utility modules for logging and config fetching respectively.
Also, I tried running with the setup_file option as python3 - main.py --setup_file </path/of/setup.py>, which makes the job just freeze and not proceed even after 15 minutes.
How do I resolve the ModuleNotFoundError with "DataflowRunner"?
Posting as community wiki. As confirmed by @GopinathS, the error and fix are as follows:
The error encountered by the workers is Beam SDK base version 2.32.0 does not match Dataflow Python worker version 2.28.0. Please check Dataflow worker startup logs and make sure that correct version of Beam SDK is installed.
To fix this, "apache-beam[gcp]>=2.20.0" is removed from install_requires of setup.py, since the '>=' pulls in the latest available version (2.32.0 as of this writing) while the workers' version is only 2.28.0.
Updated setup.py:
import setuptools

setuptools.setup(
    name='dataflow_example',
    version='1.0',
    install_requires=[
        "google-cloud-tasks==2.2.0",
        "google-cloud-pubsub>=0.1.0",
        "google-cloud-storage==1.39.0",
        "google-cloud-bigquery==2.6.2",
        "google-cloud-secret-manager==2.0.0",
        "google-api-python-client==2.3.0",
        "oauth2client==4.1.3",  # removed apache-beam[gcp]>=2.20.0
        "wheel>=0.36.2"
    ],
    packages=setuptools.find_packages()
)
Updated beam_options in the pipeline code:
beam_options = {
    "project": self.project,
    "region": self.region,
    "job_name": "dataflow_example",
    "runner": "DataflowRunner",
    "temp_location": f"gs://{self.bucket}/temp_location/",
    "setup_file": "./setup.py"
}
Also make sure that you pass all the pipeline options at once and not partially.
If you pass --setup_file </path/of/setup.py> on the command line, then make sure to read the setup file path and append it to the already defined beam_options variable using an argument parser in your code.
To avoid parsing the argument and appending it to beam_options, I instead added it directly to beam_options as "setup_file": "./setup.py".
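A minimal sketch of that argument-parsing approach (the helper name and its parameters are illustrative, not from the original code):

import argparse


def build_beam_options(argv, project, region, bucket):
    # Parse only the flag we care about; unknown arguments are ignored
    # here and can still be consumed by Beam itself.
    parser = argparse.ArgumentParser()
    parser.add_argument("--setup_file", default="./setup.py")
    known_args, _ = parser.parse_known_args(argv)

    return {
        "project": project,
        "region": region,
        "job_name": "dataflow_example",
        "runner": "DataflowRunner",
        "temp_location": f"gs://{bucket}/temp_location/",
        "setup_file": known_args.setup_file,
    }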
Dataflow can have problems installing platform-locked packages in an isolated network.
It won't be able to compile them if there is no network access.
Or maybe it tries to install them but, since it cannot compile, downloads wheels instead? Hard to say.
Still, to be able to use packages like psycopg2 (binary), or google-cloud-secret-manager (no binaries, but its dependencies have binaries), you need to install everything that has no binaries (none-any) and no binary dependencies via requirements.txt, and the rest via the --extra_packages parameter with a wheel. Example:
--extra_packages=package_1_needed_by_2-manylinux.whl \
--extra_packages=package_2_needed_by_3-manylinux.whl \
--extra_packages=what-you-need_needing_3-none-any.whl
I'm trying to test file parsing with pytest. I have a directory tree that looks something like this for my project:
project
    project/
        cool_code.py
    setup.py
    setup.cfg
    test/
        test_read_files.py
        test_files/
            data_file1.txt
            data_file2.txt
My setup.py file looks something like this:
from setuptools import setup

setup(
    name='project',
    description='The coolest project ever!',
    setup_requires=['pytest-runner'],
    tests_require=['pytest'],
)
My setup.cfg file looks something like this:
[aliases]
test=pytest
I've written several unit tests with pytest to verify that files are properly read. They work fine when I run pytest from within the "test" directory. However, if I execute any of the following from my project directory, the tests fail because they cannot find data files in test_files:
>> py.test
>> python setup.py pytest
The test seems to be sensitive to the directory from which pytest is executed.
How can I get pytest unit tests to discover the files in "data_files" for parsing when I call it from either the test directory or the project root directory?
One solution is to define a rootdir fixture with the path to the test directory, and reference all data files relative to this. This can be done by creating a test/conftest.py (if not already created) with some code like this:
import os
import pytest


@pytest.fixture
def rootdir():
    return os.path.dirname(os.path.abspath(__file__))
Then use os.path.join in your tests to get absolute paths to test files:
import os


def test_read_favorite_color(rootdir):
    test_file = os.path.join(rootdir, 'test_files/favorite_color.csv')
    data = read_favorite_color(test_file)
    # ...
One solution is to try multiple paths to find the files.
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from coolprogram import *
import os


def test_file_locations():
    """Possible locations where test data could be found."""
    return(['./test_files',
            './tests/test_files',
            ])


def find_file(filename):
    """ Searches for a data file to use in tests """
    for location in test_file_locations():
        filepath = os.path.join(location, filename)
        if os.path.exists(filepath):
            return(filepath)
    raise IOError('Could not find test file.')


def test_read_favorite_color():
    """ Test that favorite color is read properly """
    filename = 'favorite_color.csv'
    test_file = find_file(filename)
    data = read_favorite_color(test_file)
    assert(data['first_name'][1] == 'King')
    assert(data['last_name'][1] == 'Arthur')
    assert(data['correct_answers'][1] == 2)
    assert(data['cross_bridge'][1] == True)
    assert(data['favorite_color'][1] == 'green')
One way is to pass a dictionary mapping command names to custom command classes to the cmdclass argument of the setup function.
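A minimal sketch of that cmdclass approach (the command class is illustrative and assumes the tests live in the test/ directory):

import sys

from setuptools import Command, setup


class PyTestCommand(Command):
    """Run the test suite with pytest via `python setup.py test`."""
    user_options = []

    def initialize_options(self):
        pass

    def finalize_options(self):
        pass

    def run(self):
        import pytest  # imported here so it is only needed when testing
        sys.exit(pytest.main(["test"]))


setup(
    name="project",
    cmdclass={"test": PyTestCommand},
)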
Another way is the following, posted here for quick reference:
pytest-runner will install itself on every invocation of setup.py. In some cases, this causes delays for invocations of setup.py that will never invoke pytest-runner. To help avoid this contingency, consider requiring pytest-runner only when pytest is invoked:
import sys

from setuptools import setup

needs_pytest = {'pytest', 'test', 'ptr'}.intersection(sys.argv)
pytest_runner = ['pytest-runner'] if needs_pytest else []

# ...

setup(
    # ...
    setup_requires=[
        # ... (other setup requirements)
    ] + pytest_runner,
)
Make sure all the data you read in your test module is addressed relative to the location of the setup.py directory.
In the OP's case the data file path would be test/test_files/data_file1.txt.
I made a project with the same structure, read data_file1.txt with some text in it, and it works for me.
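For illustration, a test like the following works when pytest is run from the project root (the exact file contents are a placeholder, not from the original project):

def test_read_data_file():
    # Path is relative to the directory pytest is run from, i.e. where setup.py lives.
    with open("test/test_files/data_file1.txt") as f:
        contents = f.read()
    assert contents  # the file just needs to contain some text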
I have the following directory structure to my python project:
eplusplus/
    __main__.py
    model/
    exception/
    controller/
    view/
The directories model, exception, controller, and view each have their own __init__.py. When I run the program on my machine I always use the command py -m eplusplus. But when I tried to use py2exe or pyinstaller, it fails with permission denied. From what I found, this is because I am trying to compile a directory; when I compiled __main__.py it compiled normally, but when I try to execute it, it says: Error! No eplusplus module found!
I have no setup.py file and I don't know how they work.
After some very intensive research and trial and error, I succeeded by doing this:
I added an empty __init__.py to the eplusplus folder.
Outside the eplusplus folder, I had to write a compilation.py file (it doesn't necessarily have to have this name) to include all the libraries I was using (I will post the file at the end of this answer).
Finally, in PowerShell, all I had to type was py compilation.py py2exe.
Thanks to all who tried to help me!
compilation.py file:
# To compile we need to run: python compilation.py py2exe
from distutils.core import setup
from glob import glob
import os
import py2exe
import pyDOE

VERSION = 1.0

includes = [
    "sip",
    "PyQt5",
    "PyQt5.QtCore",
    "PyQt5.QtGui",
    "PyQt5.QtWidgets",
    "scipy.linalg.cython_blas",
    "scipy.linalg.cython_lapack",
    "pyDOE"
]

platforms = ["C:\\Python34\\Lib\\site-packages\\PyQt5\\plugins" +
             "\\platforms\\qwindows.dll"]

dll = ["C:\\windows\\syswow64\\MSVCP100.dll",
       "C:\\windows\\syswow64\\MSVCR100.dll"]

media = ["C:\\Users\\GUSTAVO\\EPlusPlus\\media\\title.png",
         "C:\\Users\\GUSTAVO\\EPlusPlus\\media\\icon.png"]

documents = ["C:\\Users\\GUSTAVO\\EPlusPlus\\docs\\" +
             "documentacaoEPlusPlus.pdf"]

examples = ["C:\\Users\\GUSTAVO\\EPlusPlus\\files\\" +
            "\\examples\\baseline2A.idf",
            "C:\\Users\\GUSTAVO\\EPlusPlus\\files\\" +
            "\\examples\\vectors.csv",
            "C:\\Users\\GUSTAVO\\EPlusPlus\\files\\" +
            "\\examples\\BRA_SC_Florianopolis.838970_INMET.epw"]

datafiles = [("platforms", platforms),
             ("", dll),
             ("media", media),
             ("docs", documents),
             ("Examples", examples)]

imageformats = glob("C:\\Python34\\Lib\\site-packages\\PyQt5\\" +
                    "plugins\\imageformats\\*")
datafiles.append(("imageformats", imageformats))

setup(
    name="eplusplus",
    version=VERSION,
    packages=["eplusplus"],
    url="",
    license="",
    windows=[{"script": "eplusplus/__main__.py"}],
    scripts=[],
    data_files=datafiles,
    options={
        "py2exe": {
            "includes": includes,
        }
    }
)
I am running the following script inside AWS Lambda:
#!/usr/bin/python
from __future__ import print_function

import json
import os

import ansible.inventory
import ansible.playbook
import ansible.runner
import ansible.constants
from ansible import utils
from ansible import callbacks

print('Loading function')


def run_playbook(**kwargs):
    stats = callbacks.AggregateStats()
    playbook_cb = callbacks.PlaybookCallbacks(verbose=utils.VERBOSITY)
    runner_cb = callbacks.PlaybookRunnerCallbacks(
        stats, verbose=utils.VERBOSITY)

    # use /tmp instead of $HOME
    ansible.constants.DEFAULT_REMOTE_TMP = '/tmp/ansible'

    out = ansible.playbook.PlayBook(
        callbacks=playbook_cb,
        runner_callbacks=runner_cb,
        stats=stats,
        **kwargs
    ).run()

    return out


def lambda_handler(event, context):
    return main()


def main():
    out = run_playbook(
        playbook='little.yml',
        inventory=ansible.inventory.Inventory(['localhost'])
    )
    return(out)


if __name__ == '__main__':
    main()
However, I get the following error: failed=True msg='boto required for this module'
However, according to this comment (https://github.com/ansible/ansible/issues/5734#issuecomment-33135727), it works.
But I don't understand how I would specify that in my script. Or can I have a separate hosts file and include it in the script, the same way I call my playbook?
If so, then how?
[EDIT - 1]
I have added the line inventory=ansible.inventory.Inventory('hosts')
with hosts file as:
[localhost]
127.0.0.1 ansible_python_interpreter=/usr/local/bin/python
But, I get this error: /bin/sh: /usr/local/bin/python: No such file or directory
So, where is python located inside AWS Lambda?
I installed boto just like I installed other packages in the Lambda's deployment package: pip install boto -t <folder-name>
The bash command which python will usually give the location of the Python binary. There's an example of how to call a bash script from AWS Lambda here.
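As a quick way to check from inside Lambda itself, here is a hedged sketch of a throwaway handler (the handler name and return shape are illustrative):

import subprocess
import sys


def lambda_handler(event, context):
    # sys.executable is the interpreter this Lambda runs under; `which python`
    # shows what a shell spawned by Ansible would resolve.
    try:
        which_python = subprocess.check_output(["which", "python"]).decode().strip()
    except (subprocess.CalledProcessError, OSError):
        which_python = None
    return {"sys_executable": sys.executable, "which_python": which_python}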