Apache Beam wordcount pipeline produces no output using Docker Container - python

I am able to successfully execute the command in the Apache Beam Python SDK Quickstart
tutorial. Specifically, the command
python -m apache_beam.examples.wordcount --input data.txt --output /tmp/out.beam
creates a file /tmp/out.beam-00000-of-00001 containing correct word-counts for data.txt. However, when I try to execute the same pipeline using a Docker container, per the Custom Containers tutorial, the command appears to produce no output.
Specifically, I run
python -m apache_beam.examples.wordcount \
--input=data.txt \
--output=/tmp/out.beam \
--runner=PortableRunner \
--job_endpoint=embed \
--environment_type="DOCKER" \
--environment_config="apache/beam_python3.9_sdk:latest"
But no file matching /tmp/out.beam* is produced. I have scanned through the output and see no errors. Here is a gist with the output.
I should add that this works when I use DirectRunner:
python -m apache_beam.examples.wordcount \
--input=data.txt \
--output=/tmp/out.beam \
--runner=DirectRunner \
--job_endpoint=embed \
--environment_type="DOCKER" \
--environment_config="apache/beam_python3.9_sdk:latest"
But my impression is that DirectRunner is not performant.
Thank you for your help!

Related

Passing folder as argument to a Docker container with the help of volumes

I have a python script that takes two arguments, -input and -output, both of which are directory paths. I would like to know, first, whether this is a recommended use case for Docker, and second, how to run the Docker container with custom input and output folders using Docker volumes.
My post is similar to this one: Passing file as argument to Docker container.
Still, I was not able to solve the problem.
It's common practice to use volumes to persist data or mount some input data. See the postgres image for example.
docker run -d \
--name some-postgres \
-e PGDATA=/var/lib/postgresql/data/pgdata \
-v /custom/mount:/var/lib/postgresql/data \
postgres
You can see how the path to the data directory is set via an environment variable and a volume is mounted at that path, so the produced data ends up in the volume.
You can also see in the docs that there is a directory /docker-entrypoint-initdb.d/ where you can mount input scripts that run on first database setup.
In your case it might look something like this:
docker run \
-v "$PWD/input:/app/input" \
-v "$PWD/output:/app/output" \
myapp --input /app/input --output /app/output
Or you can use the same volume for both:
docker run \
-v "$PWD/data:/app/data" \
myapp --input /app/data/input --output /app/data/output
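For completeness, here is a minimal sketch of what the script inside myapp might look like, assuming it reads -input and -output as directory paths with argparse (the file names and the upper-casing step are purely illustrative):
# process.py - a minimal sketch; the upper-casing step only demonstrates the flow
import argparse
import pathlib

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--input", required=True, help="directory mounted into the container")
    parser.add_argument("--output", required=True, help="directory backed by a host volume")
    args = parser.parse_args()

    in_dir = pathlib.Path(args.input)
    out_dir = pathlib.Path(args.output)
    out_dir.mkdir(parents=True, exist_ok=True)

    # Write an upper-cased copy of every input file into the output directory.
    for src in in_dir.iterdir():
        if src.is_file():
            (out_dir / src.name).write_text(src.read_text().upper())

if __name__ == "__main__":
    main()
Because the input and output paths point into mounted volumes, whatever the script writes is visible on the host after the container exits.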

GCP Dataflow custom template creation

I am trying to create a custom template in Dataflow so that I can run it on a set schedule from a Composer DAG. I understand that I need to deploy my Dataflow template first, and then the DAG.
I have used this example - https://cloud.google.com/dataflow/docs/guides/templates/creating-templates#:~:text=The%20following%20example%20shows%20how%20to%20stage%20a%20template%20file%3A
My code:
python3 -m job.process_file \
--runner DataflowRunner \
--project project \
--staging_location gs://bucketforjob/staging \
--temp_location gs://bucketforjob/temp \
--template_location gs://bucketfordataflow/templates/df_job_template.py \
--region eu-west2 \
--output_bucket gs://cleanfilebucket \
--output_name file_clean \
--input gs://rawfilebucket/file_raw.csv
The issue I am having is that it just tries to run the pipeline (the input file doesn't exist in the bucket yet, and I don't want it processed just because I drop it in there), so it fails saying that file_raw.csv doesn't exist in the bucket. How do I get it to just create/compile the pipeline as a template in the template folder, for me to call on with my DAG?
This is really confusing me, and there seems to be little guidance out there from what I could find... any help would be appreciated.
I think you would like to separate the template creation command from the job execution command.
The example on the page you provided shows the necessary parameters...
python -m examples.mymodule \
--runner DataflowRunner \
--project PROJECT_ID \
--staging_location gs://BUCKET_NAME/staging \
--temp_location gs://BUCKET_NAME/temp \
--template_location gs://BUCKET_NAME/templates/TEMPLATE_NAME \
--region REGION
where examples.mymodule is the pipeline source code (as I understand it), and --template_location gs://BUCKET_NAME/templates/TEMPLATE_NAME is the place where the resulting template is stored.
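One detail to watch out for, not shown in that example: if the pipeline reads --input eagerly at construction time, template creation will still try to touch the file. For classic templates, runtime parameters are usually declared as ValueProviders so they are only resolved when the templated job actually runs. A minimal sketch, assuming a custom options class (the class and argument names are illustrative, not taken from your code):
from apache_beam.options.pipeline_options import PipelineOptions

class ProcessFileOptions(PipelineOptions):
    @classmethod
    def _add_argparse_args(cls, parser):
        # ValueProvider arguments are resolved at job run time,
        # so --input does not have to exist when the template is built.
        parser.add_value_provider_argument("--input", type=str)
        parser.add_value_provider_argument("--output_bucket", type=str)
        parser.add_value_provider_argument("--output_name", type=str)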
In order to execute the job, you might like to run a command according to the Running classic templates using gcloud documentation example...
gcloud dataflow jobs run JOB_NAME \
--gcs-location gs://YOUR_BUCKET_NAME/templates/MyTemplate \
--parameters inputFile=gs://YOUR_BUCKET_NAME/input/my_input.txt,outputFile=gs://YOUR_BUCKET_NAME/output/my_output
Or, in your case, you might prefer to start the job using the REST API.
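For instance, here is a sketch of launching a classic template with the google-api-python-client library (the project, region, bucket, job name, and parameter names are placeholders, and the accepted parameters depend on how the template was built):
from googleapiclient.discovery import build

dataflow = build("dataflow", "v1b3")
request = dataflow.projects().locations().templates().launch(
    projectId="PROJECT_ID",
    location="REGION",
    gcsPath="gs://BUCKET_NAME/templates/TEMPLATE_NAME",
    body={
        "jobName": "process-file-job",
        "parameters": {"input": "gs://rawfilebucket/file_raw.csv"},
    },
)
response = request.execute()
print(response["job"]["id"])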
In any case, don't forget about the relevant IAM roles and permissions for the service account under which the job runs.

Does pyinstaller have any parameters like gcc -static?

I have a similar question to this: Is there a way to compile a Python program to binary and use it with a Scratch Dockerfile?
On that page, someone said that a C application runs well when compiled with -static.
So I have a new question: does pyinstaller have any parameters like gcc -static to make a Python application run well in a scratch Docker image?
From the question Docker Minimal Image PyInstaller Binary File?, I got links about how to make a Python binary static, like the Go application demo that says hello world in scratch.
So I made a single, easy demo, app.py:
print("test")
Then I ran docker build with this Dockerfile:
FROM bigpangl/python:3.6-slim AS compiler
WORKDIR /app
COPY app.py ./app.py
RUN apt-get update \
    && apt-get install -y build-essential patchelf \
    && pip install staticx pyinstaller \
    && pyinstaller -F app.py \
    && staticx /app/dist/app /tu2k1ed

FROM scratch
WORKDIR /
COPY --from=compiler /tu2k1ed /
COPY --from=compiler /tmp /tmp
CMD ["/tu2k1ed"]
The resulting image is just 7.22 MB, and running it with docker run test works successfully.
PS: from my tests:
The CMD must be written in exec form, e.g. CMD ["/tu2k1ed"], not as a plain string.
The /tmp directory is required in the demo.
I have not tested other Python applications, only this print demo.
The -F (i.e. --onefile) parameter should do what you are looking for. You'll likely want to take a look at your spec file and tweak it accordingly.
Using --onefile will compile it into (you guessed it) one file, and you can include extra binaries with the --add-binary parameter.
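If you prefer to drive this from the spec file rather than command-line flags, a one-file spec might look roughly like the sketch below (exact contents vary by PyInstaller version, and the extra library path is purely illustrative). You build it with pyinstaller app.spec; Analysis, PYZ and EXE are provided by PyInstaller when it executes the spec:
# app.spec - minimal one-file sketch
a = Analysis(
    ["app.py"],
    binaries=[("/usr/lib/x86_64-linux-gnu/libexample.so", ".")],  # same effect as --add-binary
    datas=[],
    hiddenimports=[],
)
pyz = PYZ(a.pure)
exe = EXE(
    pyz,
    a.scripts,
    a.binaries,
    a.zipfiles,
    a.datas,
    name="app",
    console=True,
)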
These pages in the docs may have some useful details on all of the parameters: https://pyinstaller.readthedocs.io/en/stable/spec-files.html#adding-binary-files
https://pyinstaller.readthedocs.io/en/stable/usage.html

Is it possible to run / serialize Dataflow job without having all dependencies locally?

I have created a pipeline for Google Cloud Dataflow using Apache Beam, but I cannot install the Python dependencies locally. However, there is no problem installing those dependencies remotely.
Is it somehow possible to run the job or create a template without executing Python code in my local (development) environment?
Take a look at this tutorial. Basically, you write the Python pipeline, then deploy it from the command line with:
python your_pipeline.py \
--project $YOUR_GCP_PROJECT \
--runner DataflowRunner \
--temp_location $WORK_DIR/beam-temp \
--setup_file ./setup.py \
--work-dir $WORK_DIR
The crucial part is --runner DataflowRunner, so the pipeline runs on Google Cloud Dataflow (and not your local installation). Obviously, you have to set up your Google account and credentials.
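For reference, your_pipeline.py itself can stay very small. A minimal sketch (the transforms and the output path are placeholders, not taken from the tutorial):
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions, SetupOptions

def run(argv=None):
    # Picks up --runner, --project, --temp_location, --setup_file, ... from the command line.
    options = PipelineOptions(argv)
    options.view_as(SetupOptions).save_main_session = True
    with beam.Pipeline(options=options) as p:
        (p
         | "Create" >> beam.Create(["hello", "world"])
         | "Upper" >> beam.Map(str.upper)
         | "Write" >> beam.io.WriteToText("gs://YOUR_BUCKET/output/result"))

if __name__ == "__main__":
    run()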
Well, I am not 100% sure that this is possible, but you may:
Define a requirements.txt file with all of the dependencies for pipeline execution
Avoid importing and using your dependencies at pipeline-construction time; use them only in execution-time code.
So, for instance, your file may look like this:
import apache_beam as beam

with beam.Pipeline(...) as p:
    result = (p | ReadSomeData(...)
                | beam.ParDo(MyForbiddenDependencyDoFn()))
And in the same file, your DoFn would import your dependency from within the pipeline execution-time code, i.e. inside the process method. See:
class MyForbiddenDependencyDoFn(beam.DoFn):
    def process(self, element):
        import forbidden_dependency as fd
        yield fd.totally_cool_operation(element)
When you execute your pipeline, you can do:
python your_pipeline.py \
--project $GCP_PROJECT \
--runner DataflowRunner \
--temp_location $GCS_LOCATION/temp \
--requirements_file=requirements.txt
I have never tried this, but it just may work : )

Setting the number of reducers does not work

I am using Hadoop streaming with -io typedbytes and mapred.reduce.tasks=2, but I end up with only one output file. If I set mapred.reduce.tasks=0, I get many output files. I am very confused.
So my question is:
How can I make mapred.reduce.tasks=num (num > 1) take effect when using -io typedbytes in streaming?
PS: my mapper's output is (key: Python string, value: NumPy array).
And my .sh file:
hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-1.2.1.jar \
-D mapred.reduce.tasks=2 \
-fs local \
-jt local \
-io typedbytes \
-inputformat org.apache.hadoop.mapred.SequenceFileAsBinaryInputFormat \
-input FFT_SequenceFile \
-output pinvoutput \
-mapper 'pinvmap.py' \
-file pinvmap.py \
-D mapred.reduce.tasks=2 \
-fs local \
-jt local
By checking the values of -fs and -jt, I can see that you are running in local mode.
In local mode, at most one reducer can run.
Because it uses the local file system and a single JVM, there are no Hadoop daemons in this mode.
In pseudo-distributed mode, where all the daemons run on the same machine, the property -D mapred.reduce.tasks=n will work and result in n reducers.
So you should use pseudo-distributed mode to work with multiple reducers.
Hope it helps!
