GCP Dataflow custom template creation - python

I am trying to create a custom template in Dataflow, so that I can use a DAG from Composer to run it on a set frequency. I have understood that I need to deploy my Dataflow template first, and then the DAG.
I have used this example - https://cloud.google.com/dataflow/docs/guides/templates/creating-templates#:~:text=The%20following%20example%20shows%20how%20to%20stage%20a%20template%20file%3A
My code:
python3 -m job.process_file \
--runner DataflowRunner \
--project project \
--staging_location gs://bucketforjob/staging \
--temp_location gs://bucketforjob/temp \
--template_location gs://bucketfordataflow/templates/df_job_template.py \
--region eu-west2 \
--output_bucket gs://cleanfilebucket \
--output_name file_clean \
--input gs://rawfilebucket/file_raw.csv
The issue I am having is that it just tries to run the pipeline (the input file doesn't exist in the bucket yet, and I don't want it to randomly process it by putting it in there), so it fails saying that file_raw.csv doesn't exist in the bucket. How do I get it to just create/compile the pipeline as a template in the template folder, for me to call on with my DAG?
This is really confusing me, and there seems to be little guidance out there from what I could find. Any help would be appreciated.

I think you want to separate the template-creation command from the job-execution command.
The example on the page you linked shows the necessary parameters:
python -m examples.mymodule \
    --runner DataflowRunner \
    --project PROJECT_ID \
    --staging_location gs://BUCKET_NAME/staging \
    --temp_location gs://BUCKET_NAME/temp \
    --template_location gs://BUCKET_NAME/templates/TEMPLATE_NAME \
    --region REGION
where examples.mymodule is the module containing your pipeline source code (as I understand it), and --template_location gs://BUCKET_NAME/templates/TEMPLATE_NAME is where the resulting template file is stored. When --template_location is set, running the command stages the template instead of executing the pipeline on Dataflow.
In order to execute the job, you might like to run a command following the Running classic templates using gcloud documentation example:
gcloud dataflow jobs run JOB_NAME \
    --gcs-location gs://YOUR_BUCKET_NAME/templates/MyTemplate \
    --parameters inputFile=gs://YOUR_BUCKET_NAME/input/my_input.txt,outputFile=gs://YOUR_BUCKET_NAME/output/my_output
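One thing to keep in mind for the template-creation step: in a Python classic template, parameters you want to supply at run time (like the --parameters values in the gcloud command above) have to be declared as ValueProvider options. Regular options are resolved while the template is being built, which is likely why your missing file_raw.csv is already being touched at creation time (ReadFromText validates a static path by default). A minimal sketch, assuming option names that mirror the flags in your command (your actual job.process_file code will differ):

# Sketch only: a pipeline-options class for a classic template. The option
# names mirror the question's flags; the "Clean" transform is a placeholder.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

class ProcessFileOptions(PipelineOptions):
    @classmethod
    def _add_argparse_args(cls, parser):
        parser.add_value_provider_argument('--input', type=str)
        parser.add_value_provider_argument('--output_bucket', type=str)
        parser.add_value_provider_argument('--output_name', type=str)

def run():
    options = ProcessFileOptions()
    with beam.Pipeline(options=options) as p:
        (p
         | 'Read' >> beam.io.ReadFromText(options.input)    # accepts a ValueProvider
         | 'Clean' >> beam.Map(lambda line: line.strip())   # placeholder transform
         | 'Write' >> beam.io.WriteToText(options.output_bucket))

if __name__ == '__main__':
    run()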
Or, in your case, you probably would like to start the job Using the REST API
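For example, from Python the Dataflow v1b3 templates.launch method can start a job from the staged template; a sketch, assuming the google-api-python-client package and using placeholder project, region, bucket and parameter names:

# Sketch: launch a classic template via the Dataflow REST API (v1b3).
# Project, region, GCS paths and parameter keys below are placeholders.
from googleapiclient.discovery import build

dataflow = build('dataflow', 'v1b3')
request = dataflow.projects().locations().templates().launch(
    projectId='project',
    location='europe-west2',
    gcsPath='gs://bucketfordataflow/templates/df_job_template',
    body={
        'jobName': 'process-file-run',
        'parameters': {
            'input': 'gs://rawfilebucket/file_raw.csv',
            'output_bucket': 'gs://cleanfilebucket',
            'output_name': 'file_clean',
        },
    },
)
response = request.execute()
print(response)

In Composer, the Dataflow operators from the Google provider package wrap this same launch call, which may be more convenient to use from your DAG.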
In any case, don't forget about the relevant IAM roles and permissions for the service account under which the job runs.

Related

Passing Azure secret variables to pytest in a pipeline?

We are running integration tests, written in Python, in an Azure Pipeline. These tests access a database, and the credentials for accessing the database are stored in a variable group in Azure, including secret variables. This is the part of the YAML file where the integration tests are started:
jobs:
  - job: IntegrationTests
    variables:
      - group: <some_variable_group>
    steps:
      - script: |
          pdm run pytest \
            --variables "$VARIABLE_FILE" \
            --test-run-title="$TEST_TITLE" \
            --napoleon-docstrings \
            --doctest-modules \
            --color=yes \
            --junitxml=junit/test-results.xml \
            integration
        env:
          DB_USER: $(SMDB_USER)
          DB_PASSWORD: $(SMDB_PASSWORD)
          DB_HOST: $(SMDB_HOST)
          DB_DATABASE: $(SMDB_DATABASE)
The problem is that we cannot read the value of SMDB_PASSWORD, as it is a secret variable. In order to use secret variables, it is advised to use arguments in a PythonScript task (like here: Passing arguments to python script in Azure Devops),
but I am not aware of how to modify this script to be defined as a PythonScript task, as it includes using pdm.
Actually, according to the docs they should be available as environment variables once they are mapped in the env: section (as in your snippet): https://learn.microsoft.com/en-us/azure/devops/pipelines/process/set-secret-variables?view=azure-devops&tabs=yaml%2Cbash#use-a-secret-variable-in-the-ui
import os
os.environ.get('DB_USER')
edit: repro:
python -c "import os, base64; print(base64.b64encode(bytes(os.environ.get('TEST_PLAIN'), 'ascii')))"
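If the env: mapping is in place, the pytest process should see the values. A small sketch of a fixture that fails fast when one of the mapped variables is missing (the names assume the env: block shown above):

# conftest.py sketch -- assumes the env: mapping from the pipeline YAML above.
import os
import pytest

@pytest.fixture(scope="session")
def db_credentials():
    required = ["DB_USER", "DB_PASSWORD", "DB_HOST", "DB_DATABASE"]
    missing = [name for name in required if not os.environ.get(name)]
    if missing:
        # A secret that was never mapped via env: shows up here as missing/empty.
        pytest.fail("Missing environment variables: " + ", ".join(missing))
    return {name: os.environ[name] for name in required}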

Passing folder as argument to a Docker container with the help of volumes

I have a Python script that takes two arguments, -input and -output, which are both directory paths. I would first like to know whether this is a recommended use case for Docker, and I would also like to know how to run the Docker container while specifying custom input and output folders with the help of Docker volumes.
My post is similar to this : Passing file as argument to Docker container.
Still I was not able to solve the problem.
It's common practice to use volumes to persist data or mount input data. See the postgres image for example.
docker run -d \
--name some-postgres \
-e PGDATA=/var/lib/postgresql/data/pgdata \
-v /custom/mount:/var/lib/postgresql/data \
postgres
You can see how the path to the data dir is set via env var and then a volume is mounted at this path. So the produced data will end up in the volume.
You can also see in the docs that there is a directory /docker-entrypoint-initdb.d/ where you can mount input scripts, that run on first DB setup.
In your case it might look something like this.
docker run \
-v "$PWD/input:/app/input" \
-v "$PWD/output:/app/output" \
myapp --input /app/input --output /app/output
Or you use the same volume for both.
docker run \
-v "$PWD/data:/app/data" \
myapp --input /app/data/input --output /app/data/output
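Inside the container the script just sees the mounted paths as ordinary directories. A minimal sketch of how the entrypoint script might consume them (the -input/-output flag names follow the question's description; adjust to your actual script):

# Sketch of the script inside the image; flag names and the processing step
# are placeholders, not the asker's actual code.
import argparse
import pathlib

parser = argparse.ArgumentParser()
parser.add_argument("-input", dest="input_dir", required=True, help="e.g. /app/input")
parser.add_argument("-output", dest="output_dir", required=True, help="e.g. /app/output")
args = parser.parse_args()

in_dir = pathlib.Path(args.input_dir)
out_dir = pathlib.Path(args.output_dir)
out_dir.mkdir(parents=True, exist_ok=True)

# Example processing step: copy each file's contents, uppercased.
for src in in_dir.iterdir():
    if src.is_file():
        (out_dir / src.name).write_text(src.read_text().upper())

With the docker run commands above, /app/input and /app/output are the container-side paths, and the -v flags decide which host folders they map to.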

Sagemaker Training Job Not Uploading/Saving Training Model to S3 Output Path

OK, I've been dealing with this issue in SageMaker for almost a week and I'm ready to pull my hair out. I've got a custom training script paired with a data processing script in a BYO-algorithm Docker deployment scenario. It's a PyTorch model built with Python 3.x, and the BYO Dockerfile was originally built for Python 2, but I can't see how that would cause the problem I am having, which is that after a successful training run SageMaker doesn't save the model to the target S3 bucket.
I've searched far and wide and can't seem to find an applicable answer anywhere. This is all done inside a notebook instance. Note: I am working on this as a contractor and don't have full permissions to the rest of AWS, including downloading the Docker image.
Dockerfile:
FROM ubuntu:18.04
MAINTAINER Amazon AI <sage-learner@amazon.com>
RUN apt-get -y update && apt-get install -y --no-install-recommends \
    wget \
    python-pip \
    python3-pip \
    nginx \
    ca-certificates \
    && rm -rf /var/lib/apt/lists/*
RUN wget https://bootstrap.pypa.io/get-pip.py && python3 get-pip.py && \
pip3 install future numpy torch scipy scikit-learn pandas flask gevent gunicorn && \
rm -rf /root/.cache
ENV PYTHONUNBUFFERED=TRUE
ENV PYTHONDONTWRITEBYTECODE=TRUE
ENV PATH="/opt/program:${PATH}"
COPY decision_trees /opt/program
WORKDIR /opt/program
Docker Image Build:
%%sh
algorithm_name="name-this-algo"
cd container
chmod +x decision_trees/train
chmod +x decision_trees/serve
account=$(aws sts get-caller-identity --query Account --output text)
region=$(aws configure get region)
region=${region:-us-east-2}
fullname="${account}.dkr.ecr.${region}.amazonaws.com/${algorithm_name}:latest"
aws ecr describe-repositories --repository-names "${algorithm_name}" > /dev/null 2>&1
if [ $? -ne 0 ]
then
aws ecr create-repository --repository-name "${algorithm_name}" > /dev/null
fi
# Get the login command from ECR and execute it directly
$(aws ecr get-login --region ${region} --no-include-email)
# Build the docker image locally with the image name and then push it to ECR
# with the full name.
docker build -t ${algorithm_name} .
docker tag ${algorithm_name} ${fullname}
docker push ${fullname}
Env setup and session start:
common_prefix = "pytorch-lstm"
training_input_prefix = common_prefix + "/training-input-data"
batch_inference_input_prefix = common_prefix + "/batch-inference-input-data"
import os
from sagemaker import get_execution_role
import sagemaker as sage
sess = sage.Session()
role = get_execution_role()
print(role)
Training Directory, Image, and Estimator Setup, then a fit call:
TRAINING_WORKDIR = "a/local/directory"
training_input = sess.upload_data(TRAINING_WORKDIR, key_prefix=training_input_prefix)
print ("Training Data Location " + training_input)
account = sess.boto_session.client('sts').get_caller_identity()['Account']
region = sess.boto_session.region_name
image = '{}.dkr.ecr.{}.amazonaws.com/image-that-works:working'.format(account, region)
tree = sage.estimator.Estimator(image,
                                role, 1, 'ml.p2.xlarge',
                                output_path="s3://sagemaker-directory-that-definitely/exists",
                                sagemaker_session=sess)
tree.fit(training_input)
The above script is working, for sure. I have print statements in my script and they are printing the expected results to the console. This runs as it's supposed to, finishes up, and says that it's deploying model artifacts when IT DEFINITELY DOES NOT.
Model Deployment:
model = tree.create_model()
predictor = tree.deploy(1, 'ml.m4.xlarge')
This throws an error that the model can't be found. A call to aws sagemaker describe-training-job shows that the training was completed but I found that the time it took to upload the model was super fast, so obviously there's an error somewhere and it's not telling me. Thankfully it's not just uploading it to the aether.
{
    "Status": "Uploading",
    "StartTime": 1595982984.068,
    "EndTime": 1595982989.994,
    "StatusMessage": "Uploading generated training model"
},
Here's what I've tried so far:
I've tried uploading it to a different bucket. I figured my permissions were the problem, so I pointed it to one that I knew allowed me to upload, as I had done before with that bucket. No dice.
I tried backporting the script to Python 2.x, but that caused more problems than it probably would have solved, and I don't really see how that would be the problem anyways.
I made sure the notebook's IAM role has sufficient permissions, and it does have the SageMakerFullAccess policy attached.
What bothers me is that there's no error log I can see. If I could be directed to that I would be happy too, but if there's some hidden Sagemaker kungfu that I don't know about I would be forever grateful.
EDIT
The training job runs and prints to both the Jupyter cell and CloudWatch as expected. I've since lost the cell output in the notebook, but below are the last few lines in CloudWatch. The first number is the epoch and the rest are various custom model metrics.
Can you verify from the training job logs that your training script is running? It doesn't look like your Docker image would respond to the command train, which is what SageMaker requires, and so I suspect that your model isn't actually getting trained/saved to /opt/ml/model.
AWS documentation about how SageMaker runs the Docker container: https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms-training-algo-dockerfile.html
edit: summarizing from the comments below - the training script must also save the model to /opt/ml/model (the model isn't saved automatically).
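In other words, at the end of training the script itself has to write the artifacts into /opt/ml/model; SageMaker then tars that directory and uploads it to the output_path you configured. A minimal sketch for a PyTorch script (the model here is a placeholder, not the asker's network):

# Sketch: persist the trained model where SageMaker expects it.
import os
import torch
import torch.nn as nn

model = nn.Linear(10, 1)  # stands in for whatever network was actually trained

# In a BYO container the output location is /opt/ml/model; SM_MODEL_DIR is only
# set when the SageMaker training toolkit is installed, hence the fallback.
model_dir = os.environ.get("SM_MODEL_DIR", "/opt/ml/model")
os.makedirs(model_dir, exist_ok=True)

torch.save(model.state_dict(), os.path.join(model_dir, "model.pth"))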
Have you tried saving the model to a local file and moving it to S3 yourself? I would save it locally (to the root directory of the script) and upload it via boto3.
The sagemaker session object may not have a bucket attribute initialized; doing the upload explicitly isn't much of an extra step.
import boto3

s3 = boto3.client('s3')
# The third argument is the destination object key in the bucket (it is required).
with open("FILE_NAME", "rb") as f:
    s3.upload_fileobj(f, "BUCKET_NAME", "OBJECT_KEY")

Is it possible to run / serialize Dataflow job without having all dependencies locally?

I have created a pipeline for Google Cloud Dataflow using Apache Beam, but I cannot have Python dependencies locally. However, there are no problems for those dependencies to be installed remotely.
Is it somehow possible to run the job or create a template without executing Python code in my local (development) environment?
Take a look at this tutorial. Basically, you write the Python pipeline, then deploy it from the command line with:
python your_pipeline.py \
--project $YOUR_GCP_PROJECT \
--runner DataflowRunner \
--temp_location $WORK_DIR/beam-temp \
--setup_file ./setup.py \
--work-dir $WORK_DIR
The crucial part is --runner DataflowRunner: it makes the pipeline run on Google Dataflow (and not on your local installation). Obviously, you have to set up your Google account and credentials.
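For instance, one common way to provide credentials when running outside GCP is Application Default Credentials via a service-account key; a sketch (the key path is a placeholder, and the google-auth package is assumed to be installed):

# Sketch: point Application Default Credentials at a service-account key
# before submitting the pipeline.
import os

os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "/path/to/service-account-key.json"

# Optional sanity check that credentials resolve.
import google.auth

credentials, project = google.auth.default()
print("Submitting as project:", project)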
Well, I am not 100% sure that this is possible, but you may:
Define a requirements.txt file with all of the dependencies for pipeline execution
Avoid importing and using your dependencies in pipeline-construction time, only in execution time code.
So, for instance, your file may look like this:
import apache_beam as beam

with beam.Pipeline(...) as p:
    result = (p | ReadSomeData(...)
                | beam.ParDo(MyForbiddenDependencyDoFn()))
And in the same file, your DoFn would import your dependency from within the pipeline execution-time code, i.e. the process method. See:
class MyForbiddenDependencyDoFn(beam.DoFn):
    def process(self, element):
        import forbidden_dependency as fd
        yield fd.totally_cool_operation(element)
When you execute your pipeline, you can do:
python your_pipeline.py \
--project $GCP_PROJECT \
--runner DataflowRunner \
--temp_location $GCS_LOCATION/temp \
--requirements_file=requirements.txt
I have never tried this, but it just may work : )

How to load additional JARs for an Hadoop Streaming job on Amazon EMR

TL;DR
How can I upload or specify additional JARs for a Hadoop Streaming job on Amazon Elastic MapReduce (Amazon EMR)?
Long version
I want to analyze a set of Avro files (> 2000 files) using Hadoop on Amazon Elastic MapReduce (Amazon EMR). It should be a simple exercise through which I should gain some confidence with MapReduce and Amazon EMR (I am new to both).
Since Python is my favorite language, I decided to use Hadoop Streaming. I built a simple mapper and reducer in Python and tested them on a local Hadoop (single-node install). The command I was issuing on my local install was this:
$HADOOP_PREFIX/bin/hadoop jar $HADOOP_PREFIX/share/hadoop/tools/lib/hadoop-streaming-2.4.0-amzn-1.jar \
-files avro-1.7.7.jar,avro-mapred-1.7.7.jar \
-libjars avro-1.7.7.jar,avro-mapred-1.7.7.jar \
-input "input" \
-mapper "python2.7 $PWD/mapper.py" \
-reducer "python2.7 $PWD/reducer.py" \
-output "output/outdir" \
-inputformat org.apache.avro.mapred.AvroAsTextInputFormat
and the job completed successfully.
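(For context, a Hadoop Streaming mapper/reducer in Python is just a pair of scripts that read from stdin and write tab-separated key/value pairs to stdout; with AvroAsTextInputFormat each input line is the JSON rendering of one Avro record. A rough sketch of what such scripts can look like, not the actual mapper.py/reducer.py from this question, and the "event_type" field is made up:)

# mapper_sketch.py -- one JSON record per line comes in via AvroAsTextInputFormat.
import json
import sys

for line in sys.stdin:
    line = line.strip()
    if not line:
        continue
    record = json.loads(line)
    print("%s\t1" % record.get("event_type", "unknown"))

# reducer_sketch.py -- input is sorted by key, so counts can be summed per key.
import sys

current_key, count = None, 0
for line in sys.stdin:
    key, value = line.rstrip("\n").split("\t", 1)
    if key != current_key:
        if current_key is not None:
            print("%s\t%d" % (current_key, count))
        current_key, count = key, 0
    count += int(value)
if current_key is not None:
    print("%s\t%d" % (current_key, count))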
I have a bucket on Amazon S3 with a folder containing all the input files and another folder with the mapper and reducer scripts (mapper.py and reducer.py respectively).
Using the interface I have created a small cluster, then I have added a bootstrap action to install all the required python modules on each node and then I have added an "Hadoop Streaming" step specifying the location of the mapper and reducer scripts on S3.
The problem is that I don't have the slightest idea how I can upload, or specify in the options, the two JARs - avro-1.7.7.jar and avro-mapred-1.7.7.jar - required to run this job.
I have tried several things:
using the -files flag in combination with -libjars in the optional arguments;
adding another bootstrap action that downloads the JARs on every node (and I have tried to download it on different locations on the nodes);
I have tried to upload the JARs to my bucket and specify a full s3://... path as the argument to -libjars (note: these files are actively ignored by Hadoop, and a warning is issued) in the options;
If I don't pass the two JARs the job fails (it does not recognize the -inputformat class), but I have tried all the possibilities (and combinations thereof!) I could think of to no avail.
In the end, I figured it out (and it was, of course, obvious):
Here's how I have done it:
add a bootstrap action that downloads the JARs on every node; for example, you can upload the JARs to your bucket, make them public and then do:
wget https://yourbucket/path/somejar.jar -O $HOME/somejar.jar
wget https://yourbucket/path/avro-1.7.7.jar -O $HOME/avro-1.7.7.jar
wget https://yourbucket/path/avro-mapred-1.7.7.jar -O $HOME/avro-mapred-1.7.7.jar
when you specify -libjars in the optional arguments, use the absolute path, so:
-libjars /home/hadoop/somejar.jar,/home/hadoop/avro-1.7.7.jar,/home/hadoop/avro-mapred-1.7.7.jar
I have lost a number of hours on this, I am ashamed to say; I hope this helps somebody else.
Edit (Feb 10th, 2015)
I have double-checked, and I want to point out that environment variables do not seem to be expanded when they are passed in the optional arguments field. So, use the explicit path instead of $HOME (i.e. /home/hadoop).
Edit (Feb 11th, 2015)
If you want to launch a streaming job on Amazon EMR using the AWS CLI, you can use the following command.
aws emr create-cluster --ami-version '3.3.2' \
--instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType='m1.medium' InstanceGroupType=CORE,InstanceCount=2,InstanceType='m3.xlarge' \
--name 'TestStreamingJob' \
--no-auto-terminate \
--log-uri 's3://path/to/your/bucket/logs/' \
--no-termination-protected \
--enable-debugging \
--bootstrap-actions Path='s3://path/to/your/bucket/script.sh',Name='ExampleBootstrapScript' Path='s3://path/to/your/bucket/another_script.sh',Name='AnotherExample' \
--steps file://./steps_test.json
and you can specify the steps in a JSON file:
[
  {
    "Name": "Avro",
    "Args": ["-files", "s3://path/to/your/mapper.py,s3://path/to/your/reducer.py",
             "-libjars", "/home/hadoop/avro-1.7.7.jar,/home/hadoop/avro-mapred-1.7.7.jar",
             "-inputformat", "org.apache.avro.mapred.AvroAsTextInputFormat",
             "-mapper", "mapper.py",
             "-reducer", "reducer.py",
             "-input", "s3://path/to/your/input_directory/",
             "-output", "s3://path/to/your/output_directory/"],
    "ActionOnFailure": "CONTINUE",
    "Type": "STREAMING"
  }
]
(Please note that the official Amazon documentation is somewhat outdated; in fact, it uses the old Amazon EMR CLI tool, which is deprecated in favor of the more recent AWS CLI.)
