What does an Apache Beam Dataflow job do locally? - python

I'm having some issues with a Dataflow job defined with the Apache Beam Python SDK. If I step through my code, it reaches the pipeline.run() step, which I assume means the execution graph was successfully defined. However, the job never registers in the Dataflow monitoring tool, which makes me think it never reaches the pipeline validation step.
I'd like to know more about what happens between these two steps to help in debugging the issue. I see output indicating packages in my requirements.txt and apache-beam are getting pip installed and it seems like something is getting pickled before being sent to Google's servers. Why is that? If I already have apache-beam downloaded, why download it again? What exactly is getting pickled?
I'm not looking for a solution to my problem here, just trying to understand the process better.

During graph construction, Dataflow checks for errors and any illegal operations in the pipeline. Once the checks succeed, the execution graph is translated into JSON and transmitted to the Dataflow service, where the JSON graph is validated and becomes a job.
However, if the pipeline is executed locally, the graph is not translated to JSON or transmitted to the Dataflow service. So the graph will not show up as a job in the monitoring tool; it will run on the local machine [1]. You can follow the documentation to configure local execution [2].
[1] https://cloud.google.com/dataflow/service/dataflow-service-desc#pipeline-lifecycle-from-pipeline-code-to-dataflow-job
[2] https://cloud.google.com/dataflow/pipelines/specifying-exec-params#configuring-pipelineoptions-for-local-execution
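To make the distinction concrete, here is a minimal sketch of how the runner choice determines whether the graph is serialized and submitted to the Dataflow service or simply run on the local machine; the project, region, and bucket names below are placeholders.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# 'DataflowRunner' serializes the graph to JSON and submits it as a job;
# 'DirectRunner' executes the same graph on the local machine instead.
options = PipelineOptions(
    runner='DataflowRunner',
    project='my-gcp-project',             # placeholder
    region='us-central1',                 # placeholder
    temp_location='gs://my-bucket/temp',  # placeholder
)

with beam.Pipeline(options=options) as p:
    (p
     | 'Create' >> beam.Create([1, 2, 3])
     | 'Print' >> beam.Map(print))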

The packages from requirements.txt are downloaded with pip download and staged to the staging location. Dataflow uses this staging location as a cache to look up packages when pip install -r requirements.txt is run on the Dataflow workers, which reduces calls to PyPI.
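As a rough illustration, the staging behaviour is driven by the requirements_file and staging_location pipeline options (the bucket path here is a placeholder):
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    requirements_file='requirements.txt',       # packages are fetched locally with `pip download`
    staging_location='gs://my-bucket/staging',  # placeholder; downloaded packages are uploaded here
)
The workers then install from that staged location instead of reaching out to PyPI for every package.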

Related

Is there a way to run code on GPUs from my terminal?

I have been using GPU acceleration services like Google Colab for a while, but I am not satisfied. I don't like having to write all my code in Jupyter Notebooks, and I have some other issues too. I am wondering if there is a way to get something set up where I could just run a command from my terminal, something like upload train.py to upload a file train.py to a server, and then later run run train.py or something like that to run it on that server, and have the output appear in my local terminal. Does anyone know a way to achieve something like this?
.. if there is a way to get something set up where I could just run a command from my terminal, something like upload train.py to upload a file train.py to a server, and then later run run train.py or something like that to run it on that server
If you are talking about running code on the Google Colab server with a GPU, no.
As I remember, they updated their policy and now you can only use the GPU on Google Colab via Colab notebooks. If you have a Linux server with a GPU, you can connect to it via SSH, install CUDA and libraries like tensorflow_gpu or pytorch, and run your code.
If you are looking for cheap alternatives for GPU servers, check this and this link.
Heroku is a non-GPU alternative where you can prototype your code and then use any of the cloud providers such as AWS or Google Cloud. (As I remember, AWS provides a decent number of free GPU hours at signup.)
Then there is another alternative called FloydHub, which I have heard people call "Heroku for deep learning". I haven't used it personally, but it might also be what you are looking for.
On a personal note, even though it's not that efficient, I prototype my code locally, upload it to my Google Drive, and do the final training on a Colab GPU. It's an unnecessary step, but that's the best I could find without renting a server.
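If you do end up with SSH access to a GPU machine as suggested above, the upload-then-run workflow from the question can be scripted; below is a rough sketch using plain scp/ssh from Python, where the host name and remote directory are placeholders for your own server.
# Rough sketch of the "upload, then run remotely" workflow over SSH.
# Assumes scp/ssh are installed locally and the server is reachable.
import subprocess

HOST = "user@my-gpu-server"   # placeholder
REMOTE_DIR = "~/experiments"  # placeholder

def upload(local_path: str) -> None:
    """Copy a local file to the remote server."""
    subprocess.run(["scp", local_path, f"{HOST}:{REMOTE_DIR}/"], check=True)

def run_remote(script_name: str) -> None:
    """Run the uploaded script on the server; output streams to the local terminal."""
    subprocess.run(["ssh", HOST, f"cd {REMOTE_DIR} && python3 {script_name}"], check=True)

if __name__ == "__main__":
    upload("train.py")
    run_remote("train.py")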

How to upload packages to an instance in a Processing step in Sagemaker?

I have to do large-scale feature engineering on some data. My current approach is to spin up an instance using SKLearnProcessor and then scale the job by choosing a larger instance size or increasing the number of instances. I need some packages that are not installed on SageMaker instances by default, so I want to install them using .whl files.
Another hurdle is that the Sagemaker role does not have internet access.
import boto3
import sagemaker
from sagemaker import get_execution_role
from sagemaker.sklearn.processing import SKLearnProcessor

sess = sagemaker.Session()
sess.default_bucket()
region = boto3.session.Session().region_name
role = get_execution_role()

sklearn_processor = SKLearnProcessor(framework_version='0.20.0',
                                     role=role,
                                     sagemaker_session=sess,
                                     instance_type="ml.t3.medium",
                                     instance_count=1)

sklearn_processor.run(code='script.py')
Attempted resolutions:
Upload the packages to a CodeCommit repository and clone the repo into the SKLearnProcessor instances. This failed with the error fatal: could not read Username for 'https://git-codecommit.eu-west-1.amazonaws.com': No such device or address. I tried cloning the repo into a SageMaker notebook instance and it works, so it's not a problem with my script.
Use a bash script to copy the packages from S3 using the CLI. The bash script I used is based on this post. But the packages never get copied, and no error message is thrown.
I also looked into using the s3fs package, but it didn't seem suitable for copying the wheel files.
Alternatives
My client is hesitant to spin up containers from custom docker images. Any alternatives?
2. Use a bash script to copy the packages from s3 using the CLI. The bash script I used is based off this post. But the packages never get copied, and an error message is not thrown.
This approach seems sound.
You may be better off overriding the command field on the SKLearnProcessor to /bin/bash and running a bash script such as install_and_run_my_python_code.sh that installs the wheels containing your Python dependencies and then runs your main Python entry point script.
Additionally, rather than downloading your code with AWS CLI calls in a bash script, you could use a ProcessingInput to download it; this is what the SKLearnProcessor itself does to distribute your entry point script.py across all the instances.
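A rough sketch of that combination is below, assuming the wheels live under an S3 prefix; the image URI, bucket path, and the install_and_run.sh wrapper script are placeholders, and ScriptProcessor is used here because it exposes the command field directly.
# Sketch only: image URI, S3 paths, and script names are placeholders.
from sagemaker.processing import ScriptProcessor, ProcessingInput

script_processor = ScriptProcessor(
    image_uri='<sklearn-processing-image-uri>',  # the image SKLearnProcessor resolves for your region
    command=['/bin/bash'],                       # run a bash wrapper instead of python3
    role=role,                                   # role and sess as defined in the question
    sagemaker_session=sess,
    instance_type='ml.t3.medium',
    instance_count=1,
)

script_processor.run(
    code='install_and_run.sh',                   # pip installs the staged wheels, then calls script.py
    inputs=[
        ProcessingInput(
            source='s3://my-bucket/wheels/',               # your .whl files
            destination='/opt/ml/processing/input/wheels', # downloaded to every instance
        ),
        ProcessingInput(
            source='script.py',
            destination='/opt/ml/processing/input/code',
        ),
    ],
)
Inside install_and_run.sh you would pip install the wheels from /opt/ml/processing/input/wheels and then invoke the copied script.py.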

Unable to pinpoint issue with GCP Composer (Airflow) DAG task failure

I am new at using Apache Airflow.
Some operators in my DAG have a failed status, and I am trying to understand the origin of the error.
Here are the details of the problem:
My dag is pretty big, and certain parts of it are composed of sub-dags.
What I notice in the Composer UI is that the SubDAGs that failed all did so in a task_id named download_file that uses XCom with a GoogleCloudStorageDownloadOperator.
>> GoogleCloudStorageDownloadOperator(
       task_id='download_file',
       bucket="sftp_sef",
       object="{{task_instance.xcom_pull(task_ids='find_file') | first }}",
       filename="/home/airflow/gcs/data/zips/{{{{ds_nodash}}}}_{0}.zip".format(table)
   )
The logs in the SubDAG in question do not show anything useful.
LOG:
[2020-04-07 15:19:25,618] {models.py:1359} INFO - Dependencies all met for
[2020-04-07 15:19:25,660] {models.py:1359} INFO - Dependencies all met for
[2020-04-07 15:19:25,660] {models.py:1577} INFO -
--------------------------------------------------------------------------------
Starting attempt 10 of 1
[2020-04-07 15:19:25,685] {models.py:1599} INFO - Executing on 2020-04-06T11:44:31+00:00
[2020-04-07 15:19:25,685] {base_task_runner.py:118} INFO - Running: ['bash', '-c', 'airflow run datamart_integration.consentement_email download_file 2020-04-06T11:44:31+00:00 --job_id 156313 --pool integration --raw -sd DAGS_FOLDER/datamart/datamart_integration.py --cfg_path /tmp/tmpacazgnve']
I am not sure if there is somewhere I am not checking... Here are my questions:
How do I debug errors in my Composer DAGs in general?
Is it a good idea to create a local Airflow environment to run and debug my DAGs locally?
How do I verify whether there are errors in XCom?
Regarding your three questions:
First, when using Cloud Composer you have several ways of debugging errors in your code. According to the documentation, you should:
Check the Airflow logs.
These logs are related to single DAG tasks. It is possible to view them in the Cloud Storage logs folder and in the Airflow web interface.
When you create a Cloud Composer environment, a Cloud Storage bucket is also created and associated with it. Cloud Composer stores the logs for single DAG tasks in the logs folder inside this bucket; each workflow folder has a folder for its DAGs and sub-DAGs. You can check its structure here.
Regarding the Airflow web interface, it is refreshed every 60 seconds. Also, you can read more about it here.
Review the Google Cloud's operations suite.
You can use Cloud Monitoring and Cloud Logging with Cloud Composer. Whereas Cloud Monitoring provides visibility into the performance and overall health of cloud-powered applications, Cloud Logging shows the logs that the scheduler and worker containers produce. Therefore, you can use both or just the one you find more useful based on your need.
In the Cloud Console, check for errors on the pages for the Google Cloud components running your environment.
In the Airflow web interface, check in the DAG's Graph View for failed task instances.
Thus, these are the steps recommended when troubleshooting your DAG.
Second, regarding testing and debugging, it is recommended that you separate the production and test environments to avoid DAG interference.
Furthermore, it is possible to test your DAG locally; there is a tutorial about this topic in the documentation, here. Testing locally allows you to identify syntax and task errors. However, I must point out that it won't be possible to check/evaluate dependencies and communication with the database.
Third, in general, in order to verify errors in XCom you should check:
Whether there is any error code/number;
Whether your syntax matches the sample code in the documentation;
Whether the packages you use are deprecated.
I would like to point out that, according to this documentation, the path to GoogleCloudStorageDownloadOperator was updated to GCSToLocalOperator.
In addition, I also encourage you to have a look at this code and documentation to check XCom syntax and errors.
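For reference, here is a minimal, made-up sketch of the XCom pattern behind the failing task, which can help when checking the syntax; it assumes a DAG object named dag already exists and the task id and return value are illustrative only.
# Minimal XCom sketch (task id and return value are made up for illustration).
from airflow.operators.python_operator import PythonOperator

def find_file(**context):
    # Whatever this returns is pushed to XCom automatically under the key 'return_value'.
    return ['incoming/file_1.zip']

find_file_task = PythonOperator(
    task_id='find_file',
    python_callable=find_file,
    provide_context=True,
    dag=dag,  # assumes a DAG object named `dag` is already defined
)

# In a templated field of a downstream operator, the value is pulled with:
#   "{{ task_instance.xcom_pull(task_ids='find_file') | first }}"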
Feel free to share the error code with me if you need further help.

Google Cloud Dataflow Dependencies

I want to use dataflow to process in parallel a bunch of video clips I have stored in google storage. My processing algorithm has non-python dependencies and is expected to change over development iterations.
My preference would be to use a dockerized container with the logic to process the clips, but it appears that custom containers are not supported (in 2017):
use docker for google cloud data flow dependencies
Although they may be supported now, since it was being worked on:
Posthoc connect FFMPEG to opencv-python binary for Google Cloud Dataflow job
According to this issue a custom docker image may be pulled, but I couldn't find any documentation on how to do it with dataflow.
https://issues.apache.org/jira/browse/BEAM-6706?focusedCommentId=16773376&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16773376
Another option might be to use setup.py to install any dependencies as described in this dated example:
https://cloud.google.com/blog/products/gcp/how-to-do-distributed-processing-of-landsat-data-in-python
However, when running the example I get an error that there is no module named osgeo.gdal.
For pure Python dependencies I have also tried passing the --requirements_file argument; however, I still get an error: Pip install failed for package: -r
I could find documentation for adding dependencies to apache_beam, but not to Dataflow, and it appears the apache_beam instructions do not work, based on my tests of --requirements_file and --setup_file.
This was answered in the comments; rewriting here for clarity:
In Apache Beam you can modify the setup.py file, which will be run once per container on start-up. This file allows you to perform arbitrary commands before the SDK harness starts to receive commands from the runner harness.
A complete example can be found in the Apache Beam repo.
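The pattern in that example looks roughly like the sketch below: a setup.py with a custom build step that shells out to install system packages before the worker starts (the apt-get packages and project name are placeholders for your own non-Python dependencies).
# setup.py sketch, loosely based on the Apache Beam juliaset example.
# The apt-get packages and project name are placeholders.
import subprocess
import setuptools
from distutils.command.build import build as _build

CUSTOM_COMMANDS = [
    ['apt-get', 'update'],
    ['apt-get', '--assume-yes', 'install', 'ffmpeg'],
]

class CustomCommands(setuptools.Command):
    """Runs the shell commands above on each worker container at start-up."""
    user_options = []

    def initialize_options(self):
        pass

    def finalize_options(self):
        pass

    def run(self):
        for command in CUSTOM_COMMANDS:
            subprocess.check_call(command)

class build(_build):
    # Hook the custom commands into the normal build step.
    sub_commands = _build.sub_commands + [('CustomCommands', None)]

setuptools.setup(
    name='my-pipeline',
    version='0.0.1',
    packages=setuptools.find_packages(),
    cmdclass={'build': build, 'CustomCommands': CustomCommands},
)
The pipeline is then launched with --setup_file ./setup.py so that Dataflow ships and runs this file on each worker container.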
As of 2020, you can use Dataflow Flex Templates, which allow you to specify a custom Docker container in which to execute your pipeline.

How to run python code on AWS lambda with package dependencies >500MB?

The requirement is that I have to trigger a SageMaker endpoint from Lambda to get predictions (which is easy), but I also have to do some extra processing for variable importance using packages such as XGBoost and SHAP.
I am able to hit the endpoint and get variable importance using the SageMaker Jupyter notebook. Now, I want to replicate the same thing on AWS lambda.
1) How do I run Python code on AWS Lambda with package dependencies for Pandas, XGBoost and SHAP (total package size greater than 500 MB)? The unzipped deployment package size is greater than 250 MB, hence Lambda does not allow the deployment. I even tried using a Lambda function from Cloud9 and got the same error due to the size restrictions. I have also tried Lambda layers, but no luck.
2) Is there a way for me to run the code with such big packages on or through Lambda, bypassing the 250 MB deployment package size limitation?
3) Is there a way to trigger a SageMaker notebook execution through Lambda which would do the calculations and return the output back to Lambda?
Try uploading your dependencies to a Lambda Layer. FYI: https://docs.aws.amazon.com/lambda/latest/dg/configuration-layers.html
In addition to using multiple layers for your dependencies, you may want to shrink the *.so files with the Linux strip command, which discards symbols from compiled object files that may not be necessary in production.
In order to strip all *.so files:
Use a Linux/Docker container with access to your dependencies directory.
cd into your dependencies directory.
Run
find . -name "*.so" -exec strip {} \;
This executes the strip command on every *.so file under the current working directory.
It helped me reduce one of my dependency objects from 94 MB to just 7 MB.
I found the 250 MB limit on AWS Lambda package size to be draconian. A single file, libxgboost.so from the xgboost package, is already around 140 MB, which leaves only 110 MB for everything else. That makes AWS Lambdas useless for anything but simple "hello world" stuff.
As an ugly workaround, you can store the xgboost package somewhere on S3, copy it to the /tmp folder from the Lambda invocation routine, and point your Python path to it. The allowed /tmp space is a bit higher (around 500 MB), so it might work.
I am not sure, though, whether the /tmp folder is preserved between Lambda function runs.
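A rough sketch of that workaround, with the bucket name, key, and archive layout as placeholders: download a pre-built bundle of the heavy packages from S3 into /tmp on cold start and put it on sys.path before importing.
# Sketch of the /tmp workaround (bucket, key, and archive layout are placeholders).
import os
import sys
import zipfile

import boto3

PKG_DIR = "/tmp/python-packages"

def _load_dependencies():
    """Download and unpack heavy dependencies into /tmp on cold start."""
    if not os.path.isdir(PKG_DIR):
        s3 = boto3.client("s3")
        s3.download_file("my-bucket", "deps/xgboost-bundle.zip", "/tmp/deps.zip")
        with zipfile.ZipFile("/tmp/deps.zip") as zf:
            zf.extractall(PKG_DIR)
    if PKG_DIR not in sys.path:
        sys.path.insert(0, PKG_DIR)

def handler(event, context):
    _load_dependencies()
    import xgboost  # resolved from /tmp/python-packages
    return {"xgboost_version": xgboost.__version__}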
You can try using SageMaker Inference Pipelines to do pre-processing before making actual predictions. Basically, you can use the same pre-processing script used for training for inference as well. When the pipeline model is deployed, the full set of containers with pre-processing tasks is installed and run on each EC2 instance in the endpoint or transform job. Feature processing and inference are executed with low latency because the containers deployed in an inference pipeline are co-located on the same EC2 instance (endpoint). You can refer to the documentation here.
The following blog posts/notebooks cover this feature in detail:
Preprocess input data before making predictions using Amazon SageMaker inference pipelines and Scikit-learn
https://github.com/awslabs/amazon-sagemaker-examples/blob/master/advanced_functionality/inference_pipeline_sparkml_xgboost_abalone/inference_pipeline_sparkml_xgboost_abalone.ipynb
https://github.com/awslabs/amazon-sagemaker-examples/blob/master/advanced_functionality/inference_pipeline_sparkml_blazingtext_dbpedia/inference_pipeline_sparkml_blazingtext_dbpedia.ipynb
