Getting coverage xml for a python file run with spark - python

I have a python file (myfile.py) that I typically run by running a command like
spark2-submit --master yarn --deploy-mode client myfile.py arg1 arg2
I need to get coverage on this file, and I have been trying things like
coverage run myfile.py arg1 arg2
coverage xml -o coverage-myfile.xml
This works fine and gives me the coverage xml, but some lines aren't exercised properly because the file needs to be run with spark-submit rather than plain python. As a result, my coverage is a little lower than I would like.
Is there a way to do this but using spark?

In a test environment (so it's not perfectly one-to-one, but should be similar), I've done this by implementing a session-wide fixture that builds a SparkSession. Something like:
import pytest
from pyspark.sql import SparkSession

@pytest.fixture(scope="session")
def spark():
    spark = (
        SparkSession.builder
        # In place for Spark 3.x to work.
        .config(
            "spark.driver.extraJavaOptions",
            "-Dio.netty.tryReflectionSetAccessible=true",
        )
        .config(
            "spark.executor.extraJavaOptions",
            "-Dio.netty.tryReflectionSetAccessible=true",
        )
        .appName("pytest-provider-test")
        .master("local[2]")
        .getOrCreate()
    )
    return spark
Then, for all the functions I need to test (and get coverage on), I use the spark fixture, for example:
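A minimal sketch of such a test, assuming a hypothetical my_transformation function exported by myfile.py:
from myfile import my_transformation

def test_my_transformation(spark):
    # Build a tiny DataFrame through the session-scoped fixture and exercise
    # the function under test; coverage sees these lines like any other code.
    df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "label"])
    result = my_transformation(df)
    assert result.count() == 2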
In your case, you may need to have a test/coverage-specific module call a relevant starting function in myfile.py with a SparkSession object built as above, and pass that down through your codebase. Your coverage would still be accurate for those functions and any submodules.
This can then be run as a regular python module.
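As a rough sketch (run_pipeline is an assumed name for whatever entry point myfile.py actually exposes), such a coverage-specific module could look like:
# coverage_entry.py -- hypothetical wrapper, run with:
#   coverage run coverage_entry.py arg1 arg2
#   coverage xml -o coverage-myfile.xml
import sys

from pyspark.sql import SparkSession

from myfile import run_pipeline  # assumed entry point in myfile.py

if __name__ == "__main__":
    spark = (
        SparkSession.builder
        .appName("coverage-run")
        .master("local[2]")
        .getOrCreate()
    )
    try:
        run_pipeline(spark, *sys.argv[1:])
    finally:
        spark.stop()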

Related

Pytest does not output junitxml when running in Databricks repo

We have a Databricks platform where repos and files in repos are enabled. As such, we can have .py files within the repos which can be called by Databricks notebooks.
We are currently testing the viability of running our unit tests on Databricks clusters instead of using a (PySpark) image in our Git / CI environment.
The repo within Databricks looks like
| - notebook
| - mycode.py
| - mycode_test.py
Here, mycode.py contains a function that applies a transformation on a Spark DataFrame. The file mycode_test.py contains a unit test built with pytest (and some fixtures to create test data and handle the Spark session / Spark context).
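For context, a minimal shape for such a pair of files might be the following (the names and the transformation are illustrative, not the actual repo contents):
# mycode.py (illustrative)
import pyspark.sql.functions as F

def add_double_column(df):
    # Example transformation: add a column that doubles `value`.
    return df.withColumn("value_doubled", F.col("value") * 2)

# mycode_test.py (illustrative), with `spark` coming from a session fixture
from mycode import add_double_column

def test_add_double_column(spark):
    df = spark.createDataFrame([(1,), (2,)], ["value"])
    result = add_double_column(df)
    assert [r.value_doubled for r in result.collect()] == [2, 4]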
We run pytest from the notebook, instead of from the command line. Hence, the Databricks notebook looks like:
import pytest
retcode = pytest.main([
    '-k', 'mycode_test',
    '-o', 'cache_dir=/dbfs/FileStore/',
    '--junitxml', '/dbfs/FileStore/pytestreport.xml',
    '-v',
])
This code snippet runs fine on a standard Databricks cluster (with runtime 10.4 LTS and pytest installed) and the results of the unit testing are printed out below the cell.
However, no output is stored at the cache directory or at the path given for the junit xml file.
Questions:
Are we missing something here?
Can we assume that it actually generated output at an unknown location because the pytest.main did not crash?
Are the .fuse-mounts within Databricks causing the issue here?
It turned out that I had made some mistakes in my initial setup of the paths in the pytest.main command. I have updated these paths and they now work.
Thus, the snippet below generates the XML and caching files in the Databricks FileStore.
Again, this probably only works when you are working within a Databricks Repo with files in repos enabled.
import pytest
retcode = pytest.main([
    '-k', 'mycode_test',
    '-o', 'cache_dir=/dbfs/FileStore/',
    '--junitxml', '/dbfs/FileStore/pytestreport.xml',
    '-v',
])
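To confirm where the files ended up, a quick sanity check (not part of the original answer) can go through the /dbfs fuse mount from another notebook cell, using plain Python on the driver and the paths used above:
# Databricks notebook cell: verify the report was written to the FileStore.
import os

print(os.path.exists("/dbfs/FileStore/pytestreport.xml"))
print(os.listdir("/dbfs/FileStore/"))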

Jenkinsfile syntax - is there a DRY-example of shared python build steps?

I have a repo "A" with shared python build scripts which I today run in various "Execute shell" build steps in Jenkins. I seed these steps/scripts from job-dsl groovy code.
Using the newer Jenkins 2 Pipeline-concept in a repo "B" (where my app source code resides) what must my Jenkinsfile in this repo look like to keep it DRY and reuse my existing python build scripts?
I have studied the plugin 'workflow-cps-global-lib' and have tried to set up "Pipeline Libraries" on my Jenkins master, but since this setup is groovy-oriented it does not quite feel like the right way to go, or I just don't get the hang of the correct syntax. I cannot find any examples of this specific use case.
Basically I just want to do this in my Jenkinsfile:
Clone my source repo ('B') for my app
Make my shared python build scripts from my repo "A" available
Execute the python build scripts from various "execute shell" steps
Etcetera...
workflow-cps-global-lib is the way to go. Install it and set it up under 'Manage Jenkins -> Configure System -> Global Pipeline Libraries' to use your repository.
If you decide to use python scripts rather than groovy, put all your python scripts in the (root)/resources dir.
In your Jenkinsfile, load the script with libraryResource:
script = libraryResource 'my_script.py'
and use it
sh script
(not enough reputation to add comment to accepted answer above)
Given a python script in /resources/myscript.py like this:
#!/usr/bin/env python3
import argparse
parser = argparse.ArgumentParser()
parser.add_argument("--echo")
args = parser.parse_args()
print(args.echo)
Use a Jenkins function like this:
def runPy(String scriptPath, def args) {
    String script = libraryResource(scriptPath)
    String argsString = args.join(' ')
    sh "python3 -c '${script}' ${argsString}"
}

runPy('myscript.py', ['--echo', 'foo'])

py.test use -m AS WELL as directory path in command

I'd like to only run selenium tests in my test suite, in addition to filtering it down to only run tests in a specific file/folder. It seems like I should be able to accomplish this with the -m option, and the path positional argument. Furthermore, I'm doing this in a bash script.
So for example, I tried something like this:
#!/bin/bash
# ...some logic here for determining `EXTRA` arg
EXTRA="not selenium"
py.test -m $EXTRA -v -s --tb=long --no-flaky-report ~/project/mytests/test_blerg.py
And then my test looks like this (still using xunit-style classes):
@pytest.mark.selenium
class BaseTest(UnitTest):
    pass

class ChildTest(BaseTest):
    def test_first_case(self):
        pass
When I run the py.test command as I described above, I get this:
============================================================================ no tests ran in 0.01 seconds ============================================================================
ERROR: file not found: selenium"
Not completely sure why this doesn't work. I'll try manually overriding pytest_runtest_setup(), but I feel like I should be able to accomplish what I want without doing that. Also, just FYI, this is a django project, using Django==1.8.7 and pytest-django==2.9.1.
Any help would be greatly appreciated :)
Figured it out. This has nothing to do with py.test itself. I had an error in how I was calling the py.test command in my bash script. The amended command looks like this:
py.test -m "$EXTRA" -v -s --tb=long --no-flaky-report ~/project/mytests/test_blerg.py
Works as expected!
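For reference, the same selection can also be driven from Python with pytest.main, which avoids shell quoting entirely (the path is the one from the question; --no-flaky-report is omitted here since it comes from the flaky plugin):
# Passing the marker expression as a single list element sidesteps word splitting.
import os
import pytest

retcode = pytest.main([
    "-m", "not selenium",
    "-v", "-s", "--tb=long",
    os.path.expanduser("~/project/mytests/test_blerg.py"),
])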

Include package in Spark local mode

I'm writing some unit tests for my Spark code in python. My code depends on spark-csv. In production I use spark-submit --packages com.databricks:spark-csv_2.10:1.0.3 to submit my python script.
I'm using pytest to run my tests with Spark in local mode:
conf = SparkConf().setAppName('myapp').setMaster('local[1]')
sc = SparkContext(conf=conf)
My question is, since pytest isn't using spark-submit to run my code, how can I provide my spark-csv dependency to the python process?
You can set spark.driver.extraClassPath in your config file to sort out the problem.
In spark-defaults.conf, add the property:
spark.driver.extraClassPath /Volumes/work/bigdata/CHD5.4/spark-1.4.0-bin-hadoop2.6/lib/spark-csv_2.11-1.1.0.jar:/Volumes/work/bigdata/CHD5.4/spark-1.4.0-bin-hadoop2.6/lib/commons-csv-1.1.jar
After setting the above, you don't even need the --packages flag when running from the shell.
sqlContext = SQLContext(sc)
df = sqlContext.read.format('com.databricks.spark.csv').options(header='false').load(BASE_DATA_PATH + '/ssi.csv')
Both jars are important, as spark-csv depends on the Apache commons-csv jar. You can either build the spark-csv jar or download it from the Maven repository.
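Another option (an alternative to pinning jar paths, not part of the answer above) is to let pyspark resolve the package from Maven by setting PYSPARK_SUBMIT_ARGS before the context is created. A minimal sketch using the coordinates from the question:
# Must be set before the SparkContext starts: pyspark's launcher reads
# PYSPARK_SUBMIT_ARGS when it spins up the JVM, so the package is fetched
# just as it would be with spark-submit --packages. The trailing
# "pyspark-shell" token is required.
import os

os.environ["PYSPARK_SUBMIT_ARGS"] = (
    "--packages com.databricks:spark-csv_2.10:1.0.3 pyspark-shell"
)

from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName('myapp').setMaster('local[1]')
sc = SparkContext(conf=conf)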

python nose xunit report file is empty

I have a problem running nose tests and getting the results inside Jenkins.
The job has a shell script like this:
. /var/lib/jenkins/envs/itravel/bin/activate
python api/manage.py syncdb --noinput
DJANGO_SETTINGS_MODULE=ci_settings nosetests --verbosity=0 --processes=1 --with-xunit --xunit-file=nosetests.xml
deactivate
Part of the test suite is run using the django_nose.NoseTestSuiteRunner.
All the tests are run and the resulting nosetests.xml file is created, but it does not seem to be filled with the test results:
<?xml version="1.0" encoding="UTF-8"?><testsuite name="nosetests" tests="0" errors="0" failures="0" skip="0"></testsuite>
I noticed that on an import error failure the file is filled with one error, but otherwise, nothing...
Any idea? Is there something special to do from the tests side? Any property to set or so?
Thanks.
As far as I know, the --processes option is not compatible with --with-xunit. When you ask nosetests to run with the processes plugin, the tests are run in the specified number of subprocesses. The xunit plugin does not know how to gather the results into the xml file.
Just remove the --processes option and you should be fine.
Nose has had an open and unresolved GitHub issue for this since 2011. As @sti said, everything works fine if you don't use --processes. For everyone else, consider using Ignas/nose_xunitmp instead:
pip install nose_xunitmp
nosetests --with-xunitmp
nosetests --xunitmp-file results.xml
