We have a Databricks platform where repos and files in repos are enabled. As such, we can have .py files within the repos which can be called by Databricks notebooks.
We are currently testing the viability of running our unit tests on Databricks clusters instead of using a (PySpark) image in our Git / CI environment.
The repo within Databricks looks like
| - notebook
| - mycode.py
| - mycode_test.py
Here, mycode.py contains a function that applies a transformation on a Spark DataFrame. The file mycode_test.py contains a unit test built with pytest (and some fixtures to create test data and handle the Spark session / Spark context).
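For context, a minimal sketch of what such a pair of files could look like (illustrative only; the function, fixture, and column names are assumptions, not the actual code):

# mycode.py -- hypothetical transformation on a Spark DataFrame
from pyspark.sql import DataFrame, functions as F

def with_flag(df: DataFrame) -> DataFrame:
    return df.withColumn("flag", F.lit(True))

# mycode_test.py -- hypothetical pytest test; on Databricks the notebook's
# existing session can be reused via SparkSession.builder.getOrCreate()
import pytest
from pyspark.sql import SparkSession
from mycode import with_flag

@pytest.fixture(scope="session")
def spark():
    return SparkSession.builder.getOrCreate()

def test_with_flag(spark):
    df = spark.createDataFrame([(1,), (2,)], ["id"])
    assert with_flag(df).filter("flag").count() == 2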
We run pytest from the notebook, instead of from the command line. Hence, the Databricks notebook looks like:
import pytest

retcode = pytest.main([
    '-k', 'mycode_test',
    '-o', 'cache_dir=/dbfs/FileStore/',
    '--junitxml', '/dbfs/FileStore/pytestreport.xml',
    '-v',
])
This code snippet runs fine on a standard Databricks cluster (with runtime 10.4 LTS and pytest installed) and the results of the unit testing are printed out below the cell.
However, no output is stored at the cache directory or at the path specified for the JUnit XML file.
Questions:
Are we missing something here?
Can we assume that output was actually generated at some unknown location, given that pytest.main did not crash?
Are the .fuse-mounts within Databricks causing the issue here?
It seems I made some mistakes in my initial setup of the paths in the pytest.main command. I have updated the paths and they now work.
Thus, the snippet below generates the XML and cache files in the Databricks FileStore.
Again, this probably only works when you are working within a Databricks Repo with Files in Repos enabled.
import pytest

retcode = pytest.main([
    '-k', 'mycode_test',
    '-o', 'cache_dir=/dbfs/FileStore/',
    '--junitxml', '/dbfs/FileStore/pytestreport.xml',
    '-v',
])
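As a quick sanity check, you can list the FileStore from another notebook cell to confirm that the report and cache files were actually written (this sketch assumes the standard dbutils utilities available on Databricks, and the path from the snippet above):

# List the target directory to verify pytestreport.xml and the pytest cache exist
display(dbutils.fs.ls("dbfs:/FileStore/"))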
Related
I have a python file (myfile.py) that I typically run with a command like
spark2-submit --master yarn --deploy-mode client myfile.py arg1 arg2
I need to get coverage on this file, and I have been trying things like
coverage run myfile.py arg1 arg2
coverage xml -o coverage-myfile.xml
This works fine and gives me the coverage XML, but some lines don't get covered properly because the file needs to be run with spark-submit rather than plain python. As a result, my coverage is a little lower than I would like.
Is there a way to do this, but using Spark?
In a test environment (so it's not perfectly one-to-one, but should be similar), I've done this by implementing a session-wide fixture that builds a SparkSession. Something like:
import pytest
from pyspark.sql import SparkSession


@pytest.fixture(scope="session")
def spark():
    spark = (
        SparkSession.builder
        # In place for Spark 3.x to work.
        .config(
            "spark.driver.extraJavaOptions",
            "-Dio.netty.tryReflectionSetAccessible=true",
        )
        .config(
            "spark.executor.extraJavaOptions",
            "-Dio.netty.tryReflectionSetAccessible=true",
        )
        .appName("pytest-provider-test")
        .master("local[2]")
        .getOrCreate()
    )
    return spark
Then, for all the functions I need to test (and get coverage on), use the spark fixture.
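For illustration, a test consuming this fixture might look like the following (my_transform and its module name are hypothetical placeholders, not code from the original project):

from mymodule import my_transform  # hypothetical module/function under test


def test_my_transform(spark):
    # The session-scoped spark fixture is injected by pytest.
    df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "label"])
    result = my_transform(df)
    assert result.count() == 2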
In your case, you may need to have a test/coverage-specific module call a relevant starting function in myfile.py with a SparkSession object built as above, and pass that down through your codebase. Your coverage would still be accurate for those functions and any submodules.
This can then be run as a regular python module.
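For example, a coverage-specific entry point could look roughly like this (run_coverage.py and myfile.main are assumptions about how your code is factored; adjust to your actual entry function):

# run_coverage.py -- hypothetical wrapper so that
# "coverage run run_coverage.py arg1 arg2" exercises the Spark code paths
import sys

from pyspark.sql import SparkSession

import myfile  # assumes myfile.py exposes a main(spark, args)-style function


def build_spark():
    # Local SparkSession so the script runs as a plain Python process.
    return (
        SparkSession.builder
        .appName("coverage-run")
        .master("local[2]")
        .getOrCreate()
    )


if __name__ == "__main__":
    spark = build_spark()
    try:
        myfile.main(spark, sys.argv[1:])
    finally:
        spark.stop()

Then coverage run run_coverage.py arg1 arg2 followed by coverage xml -o coverage-myfile.xml should attribute the Spark-dependent lines correctly.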
I'm trying to submit an experiment to Azure ML using a Python script.
The Environment being initialised uses a custom Dockerfile.
env = Environment(name="test")
env.docker.base_image = None
env.docker.base_dockerfile = './Docker/Dockerfile'
env.docker.enabled = True
However, the Dockerfile needs a few COPY statements, and those fail as follows:
Step 9/23 : COPY requirements-azure.txt /tmp/requirements-azure.txt
COPY failed: stat /var/lib/docker/tmp/docker-builder701026190/requirements-azure.txt: no such file or directory
The Azure host environment responsible for building the image does not contain the files the Dockerfile requires; those exist only on my local development machine, from where I initiate the Python script.
I've been searching the whole day for a way to make these files available to the build environment, but without success.
Below an excerpt from the Dockerfile and the python script that submits the experiment.
FROM mcr.microsoft.com/azureml/base:intelmpi2018.3-ubuntu16.04 as base
COPY ./Docker/requirements-azure.txt /tmp/requirements-azure.txt # <- breaks here
[...]
Here is how I'm submitting the experiment:
from azureml.core import Workspace, Experiment
from azureml.core.environment import Environment
from azureml.core.compute import ComputeTarget
from azureml.train.estimator import Estimator

ws = Workspace.from_config(path='/mnt/azure/config/workspace-config.json')

env = Environment(name="test")
env.docker.base_image = None
env.docker.base_dockerfile = './Docker/Dockerfile'
env.docker.enabled = True

compute_target = ComputeTarget(workspace=ws, name='GRComputeInstance')

estimator = Estimator(
    source_directory='/workspace/',
    compute_target=compute_target,
    entry_script="./src/ml/train/main.py",
    environment_definition=env
)

experiment = Experiment(workspace=ws, name="estimator-test")
run = experiment.submit(estimator)
run.wait_for_completion(show_output=True, wait_post_processing=True)
Any idea?
I think the correct way to set up the requirements for your project is by defining an inference configuration, e.g.:
name: project_environment
dependencies:
- python=3.6.2
- scikit-learn=0.20.0
- pip:
# You must list azureml-defaults as a pip dependency
- azureml-defaults>=1.0.45
- inference-schema[numpy-support]
See this
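If you go this route, one way to attach such a conda specification to the Environment is via the v1 azureml-core SDK (a sketch; the conda.yml file name is an assumption):

from azureml.core import Environment

# Load the conda specification above from a local file
env = Environment.from_conda_specification(
    name="project_environment",
    file_path="./conda.yml",
)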
I think you need to look for "using your own base image", e.g. in the Azure docs here. For building the actual Docker image you have two options:
Build on Azure build servers. Here you need to upload all required files together with your Dockerfile to the build environment. (Alternatively, you could consider making the requirements-azure.txt file available via HTTP, such that the build environment can fetch it from anywhere.)
Build locally with your own Docker installation and upload the final image to the correct Azure registry.
This is just the broad outline; at the moment I can't give more detailed recommendations. Hope it helps anyway.
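For the second option, a minimal sketch of pointing the Environment at a pre-built image (registry address and image tag are placeholders, and this assumes the v1 azureml-core SDK):

from azureml.core import Environment

env = Environment(name="test")
# Use an image you built and pushed yourself; address and tag are placeholders.
env.docker.base_image = "myregistry.azurecr.io/myproject/train:latest"
env.docker.base_dockerfile = None
env.docker.base_image_registry.address = "myregistry.azurecr.io"
# If the image already contains the full Python environment:
env.python.user_managed_dependencies = True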
I have a problem running a PySpark script through Oozie, using Hue. I can run the same code included in a script through a notebook or with spark-submit without error, leading me to suspect that something in my Oozie workflow is misconfigured. The Spark action part of the generated workflow XML is:
<action name="spark-51d9">
<spark xmlns="uri:oozie:spark-action:0.2">
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<master>yarn</master>
<mode>cluster</mode>
<name>MySpark</name>
<jar>myapp.py</jar>
<file>/path/to/local/spark/hue-oozie-1511868018.89/lib/MyScript.py#MyScript.py</file>
</spark>
<ok to="hive2-07c2"/>
<error to="Kill"/>
</action>
The only message I find in my logs is:
Launcher ERROR, reason: Main class [org.apache.oozie.action.hadoop.SparkMain], exit code [1]
This is what I have tried so far without solving the problem:
I have tried running it in both yarn client and cluster modes. I have also tried using paths to a separate directory, and to the lib directory of the Hue-generated Oozie workflow directory in which I have my script. I think it can find the script, because if I specify another directory I get a message that it is not found. Any help with this is greatly appreciated.
The way this works for me is:
First, you create an .sh file that will run your Python script.
The file should contain the spark-submit command:
....spark-submit
then all the flags you need:
--master yarn-cluster ... --executor-cores 3 ... --conf spark.executor.extraClassPath=jar1.jar:jar2.jar --driver-class-path jar1.jar:jar2.jar:jar3.jar
and at the end:
..... my_pyspark_script.py
Then you create a workflow, choose the shell option, add your .sh file as the "shell command", and also add it under "files".
From here it's a bit of work to make sure everything is connected properly.
For example, I had to add "export" statements in my .sh file so that my "spark/conf" would be properly added.
I have a repo "A" with shared Python build scripts which I currently run in various "Execute shell" build steps in Jenkins. I seed these steps/scripts from job-dsl Groovy code.
Using the newer Jenkins 2 Pipeline concept in a repo "B" (where my app source code resides), what must my Jenkinsfile in this repo look like to keep it DRY and reuse my existing Python build scripts?
I have studied the 'workflow-cps-global-lib' plugin and tried to set up "Pipeline Libraries" on my Jenkins master, but since this setup is Groovy-oriented it doesn't quite feel like the right way to go, or I just don't get the hang of the correct syntax. I cannot find any examples of this specific use case.
Basically I just want to do this in my Jenkinsfile:
Clone my source repo ('B') for my app
Make my shared python build scripts from my repo "A" available
Execute the python build scripts from various "execute shell" steps
Etcetera...
workflow-cps-global-lib is the way to go. Install it and set it up under 'Manage Jenkins -> Configure System -> Global Pipeline Libraries' to use your repository.
If you decide to use Python scripts rather than Groovy, put all your Python scripts in the (root)/resources dir.
In your Jenkinsfile, load the script with libraryResource:
script = libraryResource 'my_script.py'
and use it
sh script
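For reference, the shared-library repository would then roughly follow the standard layout expected by the plugin (a sketch; only the resources directory is strictly needed for this Python-only approach):

| - resources
|   | - my_script.py (loaded via libraryResource 'my_script.py')
| - vars (optional, for custom Groovy steps)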
(not enough reputation to add a comment to the accepted answer above)
Given a python script in /resources/myscript.py like this:
#!/usr/bin/env python3
import argparse
parser = argparse.ArgumentParser()
parser.add_argument("--echo")
args = parser.parse_args()
print(args.echo)
Use a Jenkins function like this:
def runPy(String scriptPath, def args) {
    // Load the script text from the shared library's resources/ dir and run it inline.
    String script = libraryResource(scriptPath)
    String argsString = args.join(' ')
    sh "python3 -c '${script}' ${argsString}"
}
runPy('myscript.py', ['--echo', 'foo'])
I'm writing some unit tests for my Spark code in python. My code depends on spark-csv. In production I use spark-submit --packages com.databricks:spark-csv_2.10:1.0.3 to submit my python script.
I'm using pytest to run my tests with Spark in local mode:
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName('myapp').setMaster('local[1]')
sc = SparkContext(conf=conf)
My question is, since pytest isn't using spark-submit to run my code, how can I provide my spark-csv dependency to the python process?
You can use spark.driver.extraClassPath in your config file to sort out the problem.
Edit spark-defaults.conf
and add the property
spark.driver.extraClassPath /Volumes/work/bigdata/CHD5.4/spark-1.4.0-bin-hadoop2.6/lib/spark-csv_2.11-1.1.0.jar:/Volumes/work/bigdata/CHD5.4/spark-1.4.0-bin-hadoop2.6/lib/commons-csv-1.1.jar
After setting the above, you don't even need the --packages flag when running from the shell.
sqlContext = SQLContext(sc)
df = sqlContext.read.format('com.databricks.spark.csv').options(header='false').load(BASE_DATA_PATH + '/ssi.csv')
Both jars are important, as spark-csv depends on the Apache commons-csv jar. You can either build the spark-csv jar yourself or download it from the Maven repository.
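Alternatively, depending on your Spark version, you can let Spark resolve the package itself by setting spark.jars.packages in the test configuration (a sketch; verify the property is supported by your Spark release):

from pyspark import SparkConf, SparkContext

conf = (
    SparkConf()
    .setAppName('myapp')
    .setMaster('local[1]')
    # Resolve spark-csv (and its commons-csv dependency) from Maven at startup.
    .set('spark.jars.packages', 'com.databricks:spark-csv_2.10:1.0.3')
)
sc = SparkContext(conf=conf)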