I have a PySpark application I would like to schedule with Oozie, using the shell action.
My submit-application.sh script simply initializes a Python virtualenv (present on all worker nodes) and calls the application.py Python application script.
The application.py script is a PySpark application that comes with its own local Python module, let's call it foobar, which is simply imported and used throughout the code.
So I have a directory structure similar to this:
.
├── foobar
│ ├── config.py
│ ├── foobar.py
│ └── __init__.py
├── application.DEV.ini
├── application.PROD.ini
├── application.py
├── requirements.txt
└── submit-application.sh
I am trying to use an Oozie workflow to package all the script and local module files, but apparently they are always delivered flattened, dumped into the root directory of the container, regardless of any configuration I use. This prevents the Python script from loading the local module, causing ModuleNotFoundError: No module named 'foobar' errors.
Is there no way to tell Oozie to place file artifacts in a sub-directory?
It seems that the # notation is just ignored.
This is my Oozie workflow.xml file
<workflow-app name="Data-Extraction-WF" xmlns="uri:oozie:workflow:0.5">
<global>
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
</global>
<start to="Data-Extraction"/>
<action name="Data-Extraction">
<shell xmlns="uri:oozie:shell-action:1.0">
<exec>submit-application.sh</exec>
<file>foobar/__init__.py#foobar/__init__.py</file>
<file>foobar/config.py#foobar/config.py</file>
<file>foobar/foobar.py#foobar/foobar.py</file>
<file>application.DEV.ini#application.DEV.ini</file>
<file>application.PROD.ini#application.PROD.ini</file>
<file>application.py#application.py</file>
<file>submit-application.sh#submit-application.sh</file>
<capture-output/>
</shell>
<ok to="success"/>
<error to="failure"/>
</action>
<kill name="failure">
<message>Workflow failed, error message: [${wf:errorMessage(wf:lastErrorNode())}]</message>
</kill>
<end name="success"/>
</workflow-app>
I ended up creating a wrapper script that fetches the files from HDFS, and I simply use that within the Oozie workflow. In addition to the HDFS location of the workflow, the step (sub-directory) is passed to this script, which then downloads the whole directory and executes the run script inside it.
<workflow-app name="Data-Extraction-WF" xmlns="uri:oozie:workflow:0.5">
<global>
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
</global>
<start to="Data-Extraction"/>
<action name="Data-Extraction">
<shell xmlns="uri:oozie:shell-action:1.0">
<exec>execute_workflow_step.sh</exec>
<argument>-w</argument>
<argument>${wf:conf('oozie.wf.application.path')}</argument>
<argument>-s</argument>
<argument>data-transformation</argument>
<file>execute_workflow_step.sh</file>
</shell>
<ok to="success"/>
<error to="failure"/>
</action>
<kill name="failure">
<message>Workflow failed, error message: [${wf:errorMessage(wf:lastErrorNode())}]</message>
</kill>
<end name="success"/>
</workflow-app>
This is my execute_workflow_step.sh script: it downloads the step directory from the HDFS directory of the workflow and executes its run script:
#!/usr/bin/env bash
set -euo pipefail
IFS=$'\n\t'
err_trap() {
echo "*** FAILED: Error on line $1"
exit 1
}
trap 'err_trap $LINENO' ERR
usage() { echo "Usage: $0 [-w <workflow HDFS path>] [-s <step-directory>] [-p <step submit script parameters>]" 1>&2; exit 1; }
PARAMETERS=""
while getopts ":w:s:p:" o; do
case "${o}" in
w)
WORKFLOW_PATH=${OPTARG}
;;
s)
STEP_DIRECTORY=${OPTARG}
;;
p)
PARAMETERS=${OPTARG}
;;
*)
usage
;;
esac
done
shift $((OPTIND-1))
if [ -z "${WORKFLOW_PATH}" ] || [ -z "${STEP_DIRECTORY}" ]; then
usage
fi
HDFS_BASEDIR=$(dirname "${WORKFLOW_PATH}")
WORKFLOW_STEP_DIRECTORY="${HDFS_BASEDIR}/${STEP_DIRECTORY}"
echo "Getting: ${WORKFLOW_STEP_DIRECTORY}"
hdfs dfs -get "${WORKFLOW_STEP_DIRECTORY}"
STEP_SCRIPT="${STEP_DIRECTORY}/submit-application.sh"
chmod 755 "$STEP_SCRIPT"
echo "Step submit script: ${STEP_SCRIPT}"
echo "Parameters: ${PARAMETERS}"
echo "Invoking: ${STEP_SCRIPT} ${PARAMETERS}"
"${STEP_SCRIPT}" "${PARAMETERS}"
Related
I have some Ansible roles and I would like to test them with Molecule.
When I execute the command molecule init scenario -r get_files_uid -d docker, I get the following file structure:
get_files_uid
├── molecule
│ └── default
│ ├── converge.yml
│ ├── molecule.yml
│ └── verify.yml
├── tasks
│ └── main.yml
└── vars
└── main.yml
After that, I execute molecule test and I receive the following error:
PLAY [Converge] ****************************************************************
TASK [Gathering Facts] *********************************************************
fatal: [instance]: FAILED! => {"ansible_facts": {}, "changed": false, "failed_modules": {"ansible.legacy.setup": {"failed": true, "module_stderr": "/bin/sh: python: command not found\n", "module_stdout": "", "msg": "MODULE FAILURE\nSee stdout/stderr for the exact error", "rc": 127}}, "msg": "The following modules failed to execute: ansible.legacy.setup\n"}
PLAY RECAP *********************************************************************
instance : ok=0 changed=0 unreachable=0 failed=1 skipped=0 rescued=0 ignored=0
My ansible.cfg looks like this:
[defaults]
roles_path = roles
ansible_python_interpreter = /usr/bin/python3
And I use macOS with Ansible:
ansible [core 2.13.3]
config file = None
configured module search path = ['/Users/scherevko/.ansible/plugins/modules', '/usr/share/ansible/plugins/modules']
ansible python module location = /opt/homebrew/Cellar/ansible/6.3.0/libexec/lib/python3.10/site-packages/ansible
ansible collection location = /Users/scherevko/.ansible/collections:/usr/share/ansible/collections
executable location = /opt/homebrew/bin/ansible
python version = 3.10.6 (main, Aug 11 2022, 13:36:31) [Clang 13.1.6 (clang-1316.0.21.2.5)]
jinja version = 3.1.2
libyaml = True
molecule version:
molecule 4.0.1 using python 3.10
ansible:2.13.3
delegated:4.0.1 from molecule
docker:2.0.0 from molecule_docker requiring collections: community.docker>=3.0.0-a2
podman:2.0.2 from molecule_podman requiring collections: containers.podman>=1.7.0 ansible.posix>=1.3.0
When I run molecule --debug test I see
ANSIBLE_PYTHON_INTERPRETER: python not found
How can I fix that?
The default scaffold for Molecule role initialization uses quay.io/centos/centos:stream8 as the test instance image (see molecule/default/molecule.yml).
This image does not have any /usr/bin/python3 file available:
$ docker run -it --rm quay.io/centos/centos:stream8 ls -l /usr/bin/python3
ls: cannot access '/usr/bin/python3': No such file or directory
If you let Ansible discover the available Python by itself, you'll see that the interpreter actually found is /usr/libexec/platform-python, as in the following demo (no ansible.cfg in use):
$ docker run -d --rm --name instance quay.io/centos/centos:stream8 tail -f /dev/null
2136ad2e8b91f73d21550b2403a6b37f152a96c2373fcb5eb0491a323b0ed093
$ ansible instance -i instance, -e ansible_connection=docker -m setup | grep discovered
"discovered_interpreter_python": "/usr/libexec/platform-python",
$ docker stop instance
instance
Since your ansible.cfg only contains a default value for the roles path besides that wrong Python interpreter path, I suggest you simply remove the file, which will fix your problem. At the very least, remove the line defining ansible_python_interpreter so that the default settings apply.
Note that you should also make sure that ANSIBLE_PYTHON_INTERPRETER is not set as a variable in your current shell (and remove that definition from your shell init file if it is).
Hardcoding the path of the Python interpreter should in any case be your very last resort, reserved for a few edge cases.
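As a rough sketch (assuming ansible.cfg sits in the role's root directory and you are on macOS, where sed -i needs an empty suffix), cleaning this up and re-running the scenario would be:
# drop the hardcoded interpreter line from ansible.cfg
sed -i '' '/ansible_python_interpreter/d' ansible.cfg
# make sure the shell is not forcing an interpreter either
unset ANSIBLE_PYTHON_INTERPRETER
molecule test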
I have xml files that reside in the directory like this
src
|---lib
| |---folder
| | |---XML files
| |---script.py
|---app.py
The app.py file runs the code in script.py, and script.py requires the XML files. When I run the server locally (Windows) I can just use the relative path "lib\folder\'xml files'". But when I deploy my server to Cloud Run, it says the files don't exist.
I've tried to specify the absolute path by doing this in script.py
package_directory = os.path.dirname(os.path.abspath(__file__))
path = os.path.join(package_directory, "folder\'xml files")
and tried changing all backslashes to forward slashes, but the error still occurs.
In the dockerfile, I had this:
ENV APP_HOME /app
WORKDIR $APP_HOME
COPY . ./
which I believe copies everything in the src folder except the entries specified in .dockerignore, which contains:
Dockerfile
README.md
*.pyc
*.pyo
*.pyd
__pycache__
.pytest_cache
Because Cloud Run requires a container, a good test for you would be to create the container and run it locally. I suspect that it's your container that's incorrect rather than Cloud Run.
I created the following repro of your code:
.
├── app.py
├── Dockerfile
└── lib
├── folder
│ └── XML files
│ └── test
├── __init__.py
└── script.py
app.py:
from lib import script
script.foo()
script.py:
import os
def foo():
package_directory = os.path.dirname(os.path.abspath(__file__))
path = os.path.join(package_directory, "folder/XML files")
for f in os.listdir(path):
if os.path.isfile(os.path.join(path,f)):
print(f)
Dockerfile:
FROM docker.io/python:3.9.9
ENV APP_HOME /app
WORKDIR ${APP_HOME}
COPY . ./
ENTRYPOINT ["python","app.py"]
And, when I build and run the container, it correctly reports test:
Q="70734734"
podman build \
--tag=${Q} \
--file=./Dockerfile \
.
podman run \
--interactive --tty \
localhost/${Q}
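If you only have Docker rather than podman, roughly the same check works, and you can also list the copied files directly (the xml-test tag is just an example):
docker build --tag=xml-test --file=./Dockerfile .
docker run --rm --entrypoint ls xml-test -R /app/lib/folder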
I'm confident that, if I were to push it to Cloud Run, it would work correctly there too.
NOTE
Try to avoid spaces in directory names, even though os.path.join accommodates them
You describe XML files but your code references xml files
You don't include a full repro of your issue, which makes it more difficult to help you
I have a .sh file in which I call two python scripts:
For fileMaster.sh:
python script1.py && python script2.py
Now, the problem is that I want to add a step so that, after script2.py, the data is uploaded into Apache Cassandra with the DataStax Bulk Loader.
So, if I do this:
python script1.py && python script2.py && fileSlave.sh
where fileSlave.sh is:
export PATH=/home/mypc/dsbulk-1.7.0/bin:$PATH
source ~/.bashrc
dsbulk load -url /home/mypc/Desktop/foldertest/data.csv -k data_test -t data_table -delim "," -header true -m '0=time_exp, 1=p'
it gives me an "access denied" error when loading into Cassandra. As you can imagine, the same thing happens if I add the code of fileSlave.sh directly below the Python calls in fileMaster.sh.
How can I do that?
I solved the problem: fileMaster.sh needs this:
python script1.py && python script2.py
chmod u+x ./fileSlave.sh
./fileSlave.sh
It works!
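Put together, a minimal sketch of fileMaster.sh along these lines (assuming fileSlave.sh sits in the same directory) would be:
#!/usr/bin/env bash
set -e                    # stop at the first failing step
python script1.py && python script2.py
chmod u+x ./fileSlave.sh  # make sure the loader script is executable
./fileSlave.sh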
The idea is to have a single container that holds all the small projects and runs one of them based on parameters.
What is the current situation:
I have folders with the project this way:
MAIN_PROJECT_FOLDER
├── PROJECT_SUB_CATEGORY1
│   ├── PROJECT_NAME_FOLDER1
│   │   ├── run.sh
│   │   ├── main.py
│   │   └── config.py
│   └── PROJECT_NAME_FOLDER2
│       ├── run.sh
│       ├── main.py
│       └── config.py
└── PROJECT_SUB_CATEGORY2
    ├── PROJECT_NAME_FOLDER1
    │   ├── run.sh
    │   ├── main.py
    │   └── config.py
    └── PROJECT_NAME_FOLDER2
        ├── run.sh
        ├── main.py
        └── config.py
Each run.sh file takes a prod/dev parameter and can be executed like this:
sudo ./run.sh prod   # runs in prod mode
sudo ./run.sh dev    # runs in dev mode
sudo ./run.sh        # defaults to dev
What is the way to create another .sh file or Dockerfile so that, in the end, it can be executed like this:
sudo docker run CONTAINER_NAME PROJECT_NAME PROD/DEV
sudo docker run test_container test_project1 prod
sudo docker run test_container test_project1 dev
sudo docker run test_container test_project2 prod
... and so on
Basically, each project name is a parameter, and prod/dev is somehow passed on to the run.sh execution.
Looking for the best practice to make this happen.
The best practice is generally to have an image that does only one thing. In your example that would imply four separate Docker images; each directory would have its own Dockerfile.
It also tends to be easier to configure settings like this using environment variables than command-line parameters. Sites like https://12factor.net/ describe this and some other practices for building services. (In YAML specifications like Docker Compose or Kubernetes, it is easier to add another key/value environment pair than to build up a correct command line from multiple disparate parts, in my experience.)
This leads you to a sequence like
sudo docker build -t me/cat1proj1 CATEGORY_1/PROJECT_1
sudo docker run -e ENVIRONMENT=prod me/cat1proj1
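A run.sh honoring that variable could then be as simple as this sketch (the ENVIRONMENT name matches the command above; the --env flag on main.py is just a hypothetical way the project might consume it):
#!/bin/sh
# default to dev when ENVIRONMENT is not set
MODE="${ENVIRONMENT:-dev}"
exec python main.py --env "$MODE"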
Architecturally, a Docker container runs a single process, and absolutely nothing stops you from writing the wrapper script you describe. That single command is specified as a combination of an "entrypoint" and a "command"; if you specify both then the command is passed as arguments to the entrypoint. The "command" part can be specified in the Dockerfile CMD, but it can also be overridden at the docker run command line.
If you write no special scripts at all, you can run (assuming you've COPYed the projects to the right directories)
sudo docker run test_image ./test_project1/run.sh prod
(I have a couple of projects that are the same application with different scripts to start them in different ways – a Web server vs. an async job runner with the same code, for instance – and just launch them with alternate startup scripts this way.)
There is a pattern of making some other script be the ENTRYPOINT and interpreting the "command" as just arguments to that script. The command simply gets passed to that script as arguments $1, $2, "$@". The problem with doing this is that it breaks some routine debugging paths.
# "test_project1" "prod" passed as arguments to entrypoint script
sudo docker run test_image test_project1 prod
# But that breaks getting a debug shell
sudo docker run --rm -it test_image bash
# More complex commands get awkward
sudo docker run --rm --entrypoint=/bin/ls test_image -l /app
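For completeness, a minimal sketch of such an entrypoint script (a hypothetical entrypoint.sh, assuming every project directory keeps its own run.sh) could look like this:
#!/bin/sh
# first argument selects the project directory; the rest (e.g. prod/dev) is forwarded to its run.sh
PROJECT="$1"
shift
exec "./${PROJECT}/run.sh" "$@"
With ENTRYPOINT ["./entrypoint.sh"] in the Dockerfile, docker run test_image test_project1 prod resolves to ./test_project1/run.sh prod, with the debugging caveats shown above.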
I would personally use a tool like Supervisor, which can run inside a single Docker container.
Installing Supervisor on Ubuntu and Debian-based distros:
sudo apt install supervisor
Starting supervisor daemon:
sudo service supervisor start
In /etc/supervisor/supervisord.conf you will find the include section that tells you where to put the configs for your projects:
[include]
files = /etc/supervisor/conf.d/*.conf
Now you can create a configuration for Supervisor and copy it to /etc/supervisor/conf.d/. Example Supervisor config for project PROJECT_1:
project_1_supervisor.conf:
[program:project_1_app]
command=/usr/bin/bash /project_1_path/run.sh prod
directory=/project_1_path/
autostart=true
autorestart=true
startretries=3
stderr_logfile=/var/log/project_1.err.log
stdout_logfile=/var/log/project_1.out.log
After this, reload the Supervisor configuration:
sudo supervisorctl reread
sudo supervisorctl update
After this, you can check whether your program is running:
$ supervisorctl
project_1_app RUNNING pid 590, uptime 0:02:45
I think the best way to handle this is with ENV variables. Here is a complete example of what you are looking for.
Here is the Dockerfile that clones a sample app with the layout described above and does the smart thing ;) It takes four ENV variables (described below); by default it runs project A.
ENV BASE_PATH="/opt/project"
This ENV is the project base path used during the clone.
ENV PROJECT_PATH="/main/sub_folder_a/project_a"
This ENV is the path of the project to run, for example project B.
ENV SCRIPT_NAME="hello.py"
This ENV names the actual file to run; it can be run.sh or main.py in your case.
ENV SYSTEM_ENV=dev
This ENV is passed to run.sh and can be either dev or prod.
FROM python:3.7.4-alpine3.10
WORKDIR /opt/project
# Required Tools
RUN apk add --no-cache supervisor git tree && \
mkdir -p /etc/supervisord.d/
# clone remote project or copy your own one
RUN echo "Starting remote clonning...."
RUN git clone https://github.com/Adiii717/python-demo-app.git /opt/project
RUN tree /opt/project
# ENVs to start different projects, can be overridden at run time
ENV BASE_PATH="/opt/project"
ENV PROJECT_PATH="/main/sub_folder_a/project_a"
ENV SCRIPT_NAME="hello.py"
# possible dev or prod
ENV SYSTEM_ENV=dev
RUN chmod +x /opt/project/main/*/*/run.sh
# general config
RUN echo $'[supervisord] \n\
[unix_http_server] \n\
file = /tmp/supervisor.sock \n\
chmod = 0777 \n\
chown= nobody:nogroup \n\
[supervisord] \n\
logfile = /tmp/supervisord.log \n\
logfile_maxbytes = 50MB \n\
logfile_backups=10 \n\
loglevel = info \n\
pidfile = /tmp/supervisord.pid \n\
nodaemon = true \n\
umask = 022 \n\
identifier = supervisor \n\
[supervisorctl] \n\
serverurl = unix:///tmp/supervisor.sock \n\
[rpcinterface:supervisor] \n\
supervisor.rpcinterface_factory = supervisor.rpcinterface:make_main_rpcinterface \n\
[include] \n\
files = /etc/supervisord.d/*.conf' >> /etc/supervisord.conf
# script supervisord Config
RUN echo $'[supervisord] \n\
nodaemon=true \n\
[program:run_project ] \n\
command= /run_project.sh \n\
stdout_logfile=/dev/fd/1 \n\
stdout_logfile_maxbytes=0MB \n\
stderr_logfile_maxbytes = 0 \n\
stderr_logfile=/dev/fd/2 \n\
redirect_stderr=true \n\
autorestart=false \n\
startretries=0 \n\
exitcodes=0 ' >> /etc/supervisord.d/run_project.conf
RUN echo $'#!/bin/ash \n\
echo -e "\x1B[31m starting project having name ${BASE_PATH}${PROJECT_PATH}/${SCRIPT_NAME} \x1B[0m" \n\
fullfilename=${BASE_PATH}${PROJECT_PATH}/${SCRIPT_NAME} \n\
filename=$(basename "$fullfilename") \n\
extension="${filename##*.}" \n\
if [[ ${extension} == "sh" ]];then \n\
sh ${BASE_PATH}${PROJECT_PATH}/${SCRIPT_NAME} ${SYSTEM_ENV} \n\
else \n\
python ${BASE_PATH}${PROJECT_PATH}/${SCRIPT_NAME} \n\
fi ' >> /run_project.sh
RUN chmod +x /run_project.sh
EXPOSE 9080 8000 9088 80
ENTRYPOINT ["supervisord", "--nodaemon", "--configuration", "/etc/supervisord.conf"]
Build the docker image
docker build -t multipy .
Run the docker container
docker run --rm -it multipy
This will run project A by default.
To run project B, your command will be:
docker run --rm -it --env PROJECT_PATH=/main/sub_folder_b/project_b --env SCRIPT_NAME=hello.py multipy
To run your run.sh bash file, the command will be:
docker run --rm -it --env SCRIPT_NAME=run.sh multipy
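And to run a project's run.sh in prod mode, the environment variables can be combined, for example:
docker run --rm -it --env PROJECT_PATH=/main/sub_folder_b/project_b --env SCRIPT_NAME=run.sh --env SYSTEM_ENV=prod multipy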
Could you please show me how to implement a git hook?
Before committing, the hook should run a python script. Something like this:
cd c:\my_framework & run_tests.py --project Proxy-Tests\Aeries \
--client Aeries --suite <Commit_file_Name> --dryrun
If the dry run fails, then the commit should be stopped.
You need to tell us in what way the dry run will fail. Will there be an output .txt with errors? Will there be an error displayed in the terminal?
In any case, you must name the pre-commit script pre-commit and save it in the .git/hooks/ directory.
Since your dry run script seems to be in a different path than the pre-commit script, here's an example that finds and runs your script.
I assume from the backslash in your path that you are on a Windows machine, and I also assume that your dry-run script lives in the same project where your Git repository is, in a folder called tools (of course you can change this to your actual folder).
#!/bin/sh
#Directory containing your python script
FILE_PATH=tools/
#Get relative path of the root directory of the project
rdir=`git rev-parse --git-dir`
rel_path="$(dirname "$rdir")"
#Cd to that path and run the file.
cd "$rel_path/$FILE_PATH"
echo "Running dryrun script..."
python run_tests.py
#From that point on you need to handle the dry run error/s.
#For demonstration purposes I'll assume that an output.txt file that holds
#the result is produced.
#Extract the last non-empty line from the output file
final_res="$(tac output.txt | grep -m 1 .)"
echo -e "--------Dry run result---------\n${final_res}"
#If that line reports a warning and/or error, abort the commit
if echo "$final_res" | grep -q 'error'; then
    echo -e "Dry run failed.\nAborting commit..."
    exit 1
fi
Now every time you run git commit, the pre-commit script will run the dry run file and abort the commit if any errors have occurred, keeping your files in the staging area.
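One detail worth adding: the hook file has to be executable, otherwise Git will not run it:
chmod +x .git/hooks/pre-commit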
I have implemented this in my hook. Here is the code snippet.
#!/bin/sh
#Path of your python script
RUN_TESTS="run_tests.py"
FRAMEWORK_DIR="/my-framework/"
CUR_DIR=`echo ${PWD##*/}`
#Get the full path of the root directory of the project
rDIR=`git rev-parse --git-dir --show-toplevel | head -2 | tail -1`
OneStepBack=/../
CD_FRAMEWORK_DIR="$rDIR$OneStepBack$FRAMEWORK_DIR"
#Find list of modified files - to be committed
LIST_OF_FILES=`git status --porcelain | awk -F" " '{print $2}' | grep ".txt" `
for FILE in $LIST_OF_FILES; do
    cd "$CD_FRAMEWORK_DIR"
    python $RUN_TESTS --dryrun --project "$CUR_DIR/$FILE"
    OUT=$?
    if [ $OUT -eq 0 ]; then
        continue
    else
        exit 1
    fi
done