How to install package dependencies for a custom Airbyte connector?


I'm developing a custom connector for Airbyte, and it involves extracting files from different compressed formats, like .zip or .7z. My plan was to use patool for this, and indeed it works in local tests, running:
python main.py read --config valid_config.json --catalog configured_catalog_old.json
However, since Airbyte runs in docker containers, I need those containers to have packages like p7zip installed. So my question is, what is the proper way to do that?
I just downloaded and deployed Airbyte Open Source on my own machine using the recommended commands from the Airbyte documentation:
git clone https://github.com/airbytehq/airbyte.git
cd airbyte
docker compose up
I tried using docker exec -it CONTAINER_ID bash into airbyte/worker and airbyte/connector-builder-server, to install p7zip directly, but it's not working yet. My connector calls patoolib from a Python script, but it is unable to process the given file, because it fails to find a program to extract it. This is the log output:
> patool: Extracting /tmp/tmpan2mjkmn ...
> unknown archive format for file `/tmp/tmpan2mjkmn'

It turns out I had completely overlooked that the connector template comes with a Dockerfile, which is used precisely to configure the container that runs the connector code. So all I had to do was add this line to the Dockerfile:
RUN apt-get update && apt-get install -y file p7zip p7zip-full lzma lzma-dev
Specifically, to use patoolib, I had to install the file package, so it could detect the mime type of archive files.
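Since patool relies on external programs to do the actual extraction, a missing system package only surfaces when the connector tries to read an archive. Below is a minimal sketch of a start-up check the connector could run inside the container. The executable names `file` and `7z` are assumptions about what the apt packages above provide, and `missing_tools` is a hypothetical helper, not part of Airbyte or patool.

```python
import shutil


def missing_tools(tools):
    """Return the names in `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    # Executable names are assumptions; adjust to the archive formats you need.
    missing = missing_tools(["file", "7z"])
    if missing:
        print("Missing system tools: " + ", ".join(missing))
    else:
        print("All required system tools are available.")
```

Running this once with docker run against the connector image shows whether the Dockerfile change took effect before a full sync is attempted.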

Related

Cannot run Google Vision API on AWS Lambda [duplicate]

I am trying to use Google Cloud Platform (specifically, the Vision API) for Python with AWS Lambda. Thus, I have to create a deployment package for my dependencies. However, when I try to create this deployment package, I get several compilation errors, regardless of the version of Python (3.6 or 2.7). Considering the version 3.6, I get the issue "Cannot import name 'cygrpc'". For 2.7, I get some unknown error with the .path file. I am following the AWS Lambda Deployment Package instructions here. They recommend two options, and both do not work / result in the same issue. Is GCP just not compatible with AWS Lambda for some reason? What's the deal?
Neither Python 3.6 nor 2.7 work for me.
NOTE: I am posting this question here to answer it myself because it took me quite a while to find a solution, and I would like to share my solution.

TL;DR: You cannot compile the deployment package on your Mac or whatever pc you use. You have to do it using a specific OS/"setup", the same one that AWS Lambda uses to run your code. To do this, you have to use EC2.
I will provide here an answer on how to get Google Cloud Vision working on AWS Lambda for Python 2.7. This answer can potentially be extended to other APIs and other programming languages on AWS Lambda.
My journey to a solution began with this initial posting on GitHub by others who had the same issue. One solution someone posted was:
> I had the same issue "cannot import name 'cygrpc'" while running
> the lambda. Solved it with pip install google-cloud-vision in the AMI
> amzn-ami-hvm-2017.03.1.20170812-x86_64-gp2 instance and exported the
> lib/python3.6/site-packages to aws lambda Thank you @tseaver
This is partially correct, unless I read it wrong, but regardless it led me on the right path. You will have to use EC2. Here are the steps I took:
1. Set up an EC2 instance by going to EC2 on Amazon. Do a quick read about AWS EC2 if you have not already. Set one up for amzn-ami-hvm-2018.03.0.20180811-x86_64-gp2 or something along those lines (i.e. the most updated one).
2. Get your EC2 .pem file. In your Terminal, cd into the folder where your .pem file is and ssh into your instance using
ssh -i "your-file-name-here.pem" ec2-user@ec2-ip-address-here.compute-1.amazonaws.com
3. Create the following folders on your instance using mkdir: google-cloud-vision, protobuf, google-api-python-client, httplib2, uritemplate, google-auth-httplib2.
4. On your EC2 instance, cd into google-cloud-vision. Run the command:
pip install google-cloud-vision -t .
Note: if you get "bash: pip: command not found", run "sudo easy_install pip".
5. Repeat step 4 with the following packages, cd'ing into the respective folder each time: protobuf, google-api-python-client, httplib2, uritemplate, google-auth-httplib2.
6. Copy each folder to your computer. You can do this using the scp command. Again, in your local Terminal (not your EC2 instance and not the Terminal window you used to access your EC2 instance), run the command below. It is an example for your "google-cloud-vision" folder, but repeat it for every folder:
sudo scp -r -i your-pem-file-name.pem ec2-user@ec2-ip-address-here.compute-1.amazonaws.com:~/google-cloud-vision ~/Documents/your-local-directory/
7. Stop your EC2 instance from the AWS console so you don't get overcharged.
8. For your deployment package, you will need a single folder containing all your modules and your Python scripts. To begin combining all of the modules, create an empty folder titled "modules." Copy and paste all of the contents of the "google-cloud-vision" folder into the "modules" folder. Now place only the folder titled "protobuf" from the "protobuf" (sic) main folder in the "Google" folder of the "modules" folder. Also from the "protobuf" main folder, paste the Protobuf .pth file and the -info folder in the Google folder.
9. For each module after protobuf, copy and paste into the "modules" folder the folder titled with the module name, the .pth file, and the "-info" folder.
10. You now have all of your modules properly combined (almost). To finish the combination, remove these two files from your "modules" folder: googleapis_common_protos-1.5.3-nspkg.pth and google_cloud_vision-0.34.0-py3.6-nspkg.pth. Copy and paste everything in the "modules" folder into your deployment package folder. Also, if you're using GCP, paste in your .json credentials file as well.
11. Finally, put your Python scripts in this folder, zip the contents (not the folder), upload to S3, paste the link in your AWS Lambda function, and get going!
If something here doesn't work as described, please forgive me and either message me or feel free to edit my answer. Hope this helps.
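The last step, zipping the contents rather than the folder itself, is easy to get wrong, because Lambda expects the handler script and the modules at the root of the archive. A minimal sketch using only the standard library is shown here; `zip_folder_contents` and the paths passed to it are hypothetical names for illustration.

```python
import os
import zipfile


def zip_folder_contents(folder, zip_path):
    """Zip everything inside `folder` so entries sit at the archive root."""
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for root, _dirs, files in os.walk(folder):
            for name in files:
                full_path = os.path.join(root, name)
                # Store paths relative to `folder`, not including `folder` itself.
                archive.write(full_path, os.path.relpath(full_path, folder))
```

Write the archive to a location outside the folder being zipped, for example zip_folder_contents("deployment_package", "deployment_package.zip"), so the archive does not try to include itself.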

Building off the answer from @Josh Wolff (thanks a lot, btw!), this can be streamlined a bit by using a Docker image for Lambdas that Amazon makes available.
You can either bundle the libraries with your project source or, as I did below in a Makefile script, upload it as an AWS layer.
layer:
	set -e ;\
	docker run -v "$(PWD)/src":/var/task "lambci/lambda:build-python3.6" /bin/sh -c "rm -R python; pip install -r requirements.txt -t python/lib/python3.6/site-packages/; exit" ;\
	pushd src ;\
	zip -r my_lambda_layer.zip python > /dev/null ;\
	rm -R python ;\
	aws lambda publish-layer-version --layer-name my_lambda_layer --description "Lambda layer" --zip-file fileb://my_lambda_layer.zip --compatible-runtimes "python3.6" ;\
	rm my_lambda_layer.zip ;\
	popd ;
The above script will:
1. Pull the Docker image if you don't have it yet (above uses Python 3.6)
2. Delete the python directory (only useful for running a second time)
3. Install all requirements to the python directory, created in your project's /src directory
4. ZIP the python directory
5. Upload the AWS layer
6. Delete the python directory and zip file
Make sure your requirements.txt file includes the modules listed above by Josh: google-cloud-vision, protobuf, google-api-python-client, httplib2, uritemplate, google-auth-httplib2
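Before building the layer, it can help to confirm that requirements.txt really lists all six modules. The following is a rough sketch, assuming a simple requirements file of plain package names with optional version specifiers; `missing_requirements` and `REQUIRED` are hypothetical names, and this is not a full parser for pip's requirements format.

```python
import re

REQUIRED = [
    "google-cloud-vision", "protobuf", "google-api-python-client",
    "httplib2", "uritemplate", "google-auth-httplib2",
]


def missing_requirements(requirements_text, required=REQUIRED):
    """Return required package names that requirements_text does not list."""
    listed = set()
    for line in requirements_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line or line.startswith("-"):  # skip blanks and pip options
            continue
        # The package name ends at the first version specifier, extra, or marker.
        name = re.split(r"[<>=!~\[;\s]", line, maxsplit=1)[0]
        listed.add(name.lower().replace("_", "-"))
    return [pkg for pkg in required if pkg not in listed]
```

Calling missing_requirements(open("src/requirements.txt").read()) returns an empty list when nothing is missing.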

There's a fast solution that doesn't require much coding.
Cloud9 runs on an Amazon AMI, so packages installed with pip in its virtual environment should work on Lambda.
I created a Lambda from the Cloud9 UI and, from the console, activated the venv for the EC2 machine. I then installed google-cloud-speech with pip. That was enough to fix the issue.

I was facing the same error using the google-ads API.
{
"errorMessage": "Unable to import module 'lambda_function': cannot import name 'cygrpc' from 'grpc._cython' (/var/task/grpc/_cython/__init__.py)",
"errorType": "Runtime.ImportModuleError",
"stackTrace": []
}
My Lambda runtime was Python 3.9 and the architecture was x86_64.
If anybody encounters a similar ImportModuleError, see my answer here: Cannot import name 'cygrpc' from 'grpc._cython' - Google Ads API
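An error like this typically means the compiled cygrpc extension in the package was built for a different Python version or platform than the one running the function. One way to check is to compare the file name of the bundled extension with the suffixes the running interpreter accepts. This is a small diagnostic sketch, where `loadable_by_this_interpreter` is a hypothetical helper and the file name in the example is made up for illustration.

```python
from importlib.machinery import EXTENSION_SUFFIXES


def loadable_by_this_interpreter(module_name, filename):
    """True if `filename` is a name this interpreter would try for `module_name`."""
    return any(filename == module_name + suffix for suffix in EXTENSION_SUFFIXES)


if __name__ == "__main__":
    # Suffixes the running interpreter looks for when importing extensions.
    print(EXTENSION_SUFFIXES)
    # Hypothetical file name, as a build on another Python version might produce.
    print(loadable_by_this_interpreter("cygrpc", "cygrpc.cpython-36m-x86_64-linux-gnu.so"))
```

Run inside the Lambda runtime (or a matching container), a False result for the file found under grpc/_cython points to a package built in the wrong environment.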

How to create docker container with Python and Orange

Does anyone know how to create a Docker container with Python and Orange without installing the whole Anaconda package?
I managed to make it work with a container of 8.0 GB, but that is way too big.

From the GitHub project page, look at the README, and download the appropriate requirements-* files. Create a directory containing the file(s), and write a Dockerfile like this:
FROM python:3.7
RUN pip install PyQt5
COPY requirements-core.txt /tmp/
RUN pip install -r /tmp/requirements-core.txt
# repeat the previous two commands with other files, if needed
RUN pip install git+https://github.com/biolab/orange3
Add any other commands as needed, e.g. to COPY your source code.

Docker Python Image

I have a RHEL host with Docker installed; it ships with Python 2.7 by default. My Python scripts need a few more modules, which I can't install because I lack sudo access. Moreover, I don't want to interfere with the default Python, which the host needs in order to function.
Now I am trying to get Python in a Docker container, where I can add the few modules I need.
Issue: the RHEL host with Docker installed is not connected to the internet and cannot be connected either.
My laptop doesn't have Docker, and I can't install it there (no admin access) to create the Docker image and copy it to the RHEL host.
I was hoping that if a Docker image with Python can be downloaded from the internet, I might be able to use it as is.
Any pointers in an appropriate direction would be appreciated.
What I have done: searched for Python images and gone through the Docker documentation on creating images.
Apologies if the above question sounds silly; I am getting better with Docker over time :)

If your environment is restricted enough that you can't use sudo to install packages, you won't be able to use Docker: if you can run any docker run command at all you can trivially get unrestricted root access on the host.
> My python scripts needs a bit more modules which I can't install due to lack of sudo access & moreover, I dont want to screw up with the default Py which is needed for host to function.
That sounds like a perfect use for a virtual environment: it gives you an isolated local package tree that you can install into as an unprivileged user and doesn't interfere with the system Python. For Python 2 you need a separate tool for it, with a couple of steps to install:
export PYTHONUSERBASE=$HOME
pip install --user virtualenv
~/bin/virtualenv vpy
. vpy/bin/activate
pip install ... # installs into vpy/lib/python2.7/site-packages
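On Python 3 the separate virtualenv tool is not needed, because the standard library ships a venv module that does the same job. A minimal sketch of creating an environment programmatically follows; `create_env` and the directory name vpy are illustrative choices, not anything the host requires.

```python
import os
import tempfile
import venv


def create_env(env_dir):
    """Create an isolated environment at env_dir and return its pyvenv.cfg path."""
    # with_pip=False keeps the sketch quick; pass with_pip=True to bootstrap pip.
    venv.create(env_dir, with_pip=False)
    return os.path.join(env_dir, "pyvenv.cfg")


if __name__ == "__main__":
    target = os.path.join(tempfile.mkdtemp(), "vpy")
    print(create_env(target))
```

The command-line equivalent is python3 -m venv vpy, followed by the same activate step as above.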

You can create a Docker image on any standalone machine and push the final image to a Docker registry (Docker Hub). Then, on your laptop, you can pull that image and start working :)
Below are the key commands required.
To create an image, write a Dockerfile that installs all the packages you need.
Alternatively, run sudo docker run -it ubuntu:16.04, install Python and the other packages inside the container, and then save it as an image:
sudo docker commit container_id name
sudo docker tag SOURCE_IMAGE[:TAG] TARGET_IMAGE[:TAG]
sudo docker push IMAGE_NAME
Then pull this image on your laptop and start working.
You can refer to this link for more docker commands https://github.com/akasranjan005/docker-k8s/blob/master/docker/basic-commands.md
Hope this helps. Thanks

Shared Library libpython3.5 not found in Docker Container (but override works fine)

I am trying to deploy a Python web service (with Flask) that uses CNTK in a Docker container. I use an Ubuntu base image from Microsoft that is supposed to contain all the necessary and correct programs and libraries to run CNTK.
The Script works locally (on Windows) and also when I run the container and start a bash from cmd-line with
docker exec -it <container_id> bash
and start the script from "within the container".
An important addition is that the python script uses two precompiled modules that are *.pyd files for windows and *.so files for Linux. So for the docker image I replaced the former for the latter for running the script from within the container.
The problems start when I start the script with a CMD in the Dockerfile. The creation of the image shows no problems. But when I start the container with
docker run -p 1234:80 app
I get the following error:
...
ImportError: libpython3.5m.so.1.0: cannot open shared object file: No such file or directory
It seems like the library is missing. But (I repeat) when I run the script from within a bash running in the container (which should only have the containers libraries as far as I can see), everything works fine. I even can look up the library with
ldd $(which python)
And the file is definitely in the folder. So the question is why python can't find its dependency when running the docker container.
It gets even weirder when I try to give the path to the library explicitly by writing it in the environment variable:
ENV LD_LIBRARY_PATH="${LD_LIBRARY_PATH}:/root/anaconda3/pkgs/python-3.5.2-0/lib/"
Then it seems the library is found, but it is not accepted as correct:
ImportError: dynamic module does not define init function (initcython_bbox)
"cython_bbox" is the name of one of the *.pyd / *.so files/libraries that is to be imported. This is apparently a typical error for these kinds of file types, but I don't have any experience with them.
I am also not at the point (in my personal development) to be able to compile my own files from foreign source or create the docker image itself on my own. I rely on the parts I got from Microsoft. But I would be open to suggestions.
I also already tried to install the library anew inside my Dockerfile after importing the base image with
RUN apt-get install -y libpython3.5
But it provoked the same error as when I put the path in the environment variable.
I am really eager to know what goes wrong here. Why does everything run smoothly "inside the container" but not with Autostart at Initialization of a Container with CMD?
For additional info I add the Dockerfile:
# Use an official Python runtime as a parent image
FROM microsoft/cntk:2.5.1-cpu-python3.5
# Set the working directory to /app
WORKDIR /app
# Copy the current directory contents into the container at /app
ADD . /app
RUN apt-get update && apt-get install -y python-pip
RUN pip install --upgrade pip
# Install any needed packages specified in requirements.txt
RUN pip install --trusted-host pypi.python.org -r requirements.txt
# Make port 80 available to the world outside this container
EXPOSE 80
# Run app.py when the container launches
CMD ["python", "test.py"]
The rest of the project is a pretty straightforward flask-webapp that runs without problems when I comment out all import of the actual CNTK-project. It is the CNTK Object Detection with Faster-RCNN by the way, as it can be found in the cntk-git-repository.
EDIT:
I found out what the actual problem is, yet I still have no way to solve it. The thing is that when I start bash with "docker exec" it runs a script at startup that activates an anaconda environment with python3.5 and all the neat libraries. But when CMD just starts python this is done by the standard Bourne shell "sh" which (as I tried out) runs with python2.7.
So I need a way either to start my container with bash (including its autostart scripts) or somehow activate the environment on startup in another way.
I looked up the script and it basically checks if bash is the current shell, sets some environment variables and activates the environment.
if [ -z "$BASH_VERSION" ]; then
echo Error: only Bash is supported.
elif [ "$(basename "$0" 2> /dev/null)" == "activate-cntk" ]; then
echo Error: this script is meant to be sourced. Run 'source activate-cntk'
else
export PATH="/cntk/cntk/bin:$PATH"
export LD_LIBRARY_PATH="/cntk/cntk/lib:/cntk/cntk/dependencies/lib:$LD_LIBRARY_PATH"
source "/root/anaconda3/bin/activate" "/root/anaconda3/envs/cntk-py35"
cat <<MESSAGE
************************************************************
CNTK is activated.
Please checkout tutorials and examples here:
/cntk/Tutorials
/cntk/Examples
To deactivate the environment run
source /root/anaconda3/bin/deactivate
************************************************************
MESSAGE
fi
I tried some dozens of things like linking sh to bash
RUN ln -fs /bin/bash /bin/sh
or using bash as ENTRYPOINT.
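To see exactly how the two start-up paths differ, it can help to print the interpreter and the search paths from inside the script itself, once when launched through CMD and once from the docker exec bash session. This is a small diagnostic sketch, written so that it also runs under Python 2.7; `describe_runtime` is a hypothetical helper name.

```python
import os
import sys


def describe_runtime():
    """Collect the interpreter and library-path details relevant to this problem."""
    return {
        "executable": sys.executable,
        "version": "%d.%d.%d" % sys.version_info[:3],
        "PATH": os.environ.get("PATH", ""),
        "LD_LIBRARY_PATH": os.environ.get("LD_LIBRARY_PATH", ""),
    }


if __name__ == "__main__":
    for key, value in sorted(describe_runtime().items()):
        print("%s = %s" % (key, value))
```

Any value that differs between the two runs is something the activation script sets and the CMD path does not.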

I have found a workaround that works for now.
First I manually link python to python3 in my environment:
RUN ln -fs /root/anaconda3/envs/cntk-py35/bin/python3.5 /usr/bin/python
Then I add the environment libraries to the Library-Path:
ENV LD_LIBRARY_PATH "/cntk/cntk/lib:/cntk/cntk/dependencies/lib:$LD_LIBRARY_PATH"
And to be sure I add all important folders to PATH:
ENV PATH "/cntk/cntk/bin:$PATH"
ENV PATH "/root/anaconda3/envs/cntk-py35/bin:$PATH"
I then have to install my python packages again:
RUN pip install flask
And can finally just start my script with:
CMD ["python", "app.py"]

I have also found this Git repository doing pretty much the same thing I did, and they also need to start their environment. It made me realize that I need to learn how to write better Dockerfiles. I think this is the correct way to do it, i.e. using a shell script as ENTRYPOINT
ENTRYPOINT ["/app/run.sh"]
which activates the environment, installs additional Python packages (this could be a problem), and starts the actual app.
#!/bin/bash
source /root/anaconda3/bin/activate root
pip install easydict
pip install azure-ml-api-sdk==0.1.0a9
pip install sanic
python3 /app/app.py

Set JAVA_HOME for docker in NLTK for Stanford NLP

I am a beginner in using Docker.
I'm using Docker Toolbox for Windows 7. I built an image for my Python web app and everything works fine.
However, this app uses the nltk module, which also needs Java and the JAVA_HOME variable pointing to the Java installation.
When running on my computer, I can manually set JAVA_HOME, but how do I do it in the Dockerfile so that it won't produce an error when running on another machine?
Here is my error :
p.s : Answer below

When you are running a container you have the option of passing in environment variables that will be set in your container using the -e flag. This answer explains environment variables nicely: How do I pass environment variables to Docker containers?
docker container run -e JAVA_HOME='/path/to/java' <your image>
Make sure your image actually contains Java as well. You might want to look at something like the openjdk:8 image on docker hub.
It sounds like you need a Dockerfile to build your image. Have a look at the ENV instruction documented here to set the JAVA_HOME variable: https://docs.docker.com/engine/reference/builder/#env and then build your image with docker build /path/to/directory, where the directory is the one containing your Dockerfile.
I see you've already tried that and didn't have much luck. Run the container and, instead of running your application process, just run a bash script along the lines of echo $JAVA_HOME so you can at least verify that part is working.
Also make sure you copy in whatever files/binaries needed to the appropriate directories within the image in the docker file as noted below.
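The same verification can be done from the Python side before NLTK is asked to start Java. Here is a minimal sketch, assuming a Linux container where the executable lives at bin/java under JAVA_HOME; `java_from_home` is a hypothetical helper, not an NLTK function.

```python
import os


def java_from_home(environ=None):
    """Return the java executable under JAVA_HOME, or None if it is not usable."""
    environ = os.environ if environ is None else environ
    java_home = environ.get("JAVA_HOME")
    if not java_home:
        return None
    candidate = os.path.join(java_home, "bin", "java")
    return candidate if os.path.isfile(candidate) else None


if __name__ == "__main__":
    print(java_from_home() or "JAVA_HOME is unset or does not contain bin/java")
```

A None result separates the two failure modes mentioned above: either the variable is not set in the container, or it is set but the image does not actually contain Java at that path.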

I finally found a way to install Java in the Dockerfile: use the Java install commands for the Ubuntu image.
Below is the Dockerfile. Thanks for reading.
RUN apt-get update
RUN apt-get install -y python-software-properties
RUN apt-get install -y software-properties-common
RUN add-apt-repository -y ppa:openjdk-r/ppa
RUN apt-get update
RUN apt-get install -y openjdk-8-jdk
ENV JAVA_HOME /usr/lib/jvm/java-8-openjdk-amd64/
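# ENV already makes JAVA_HOME available to later build steps and to the running container,
# so a separate "RUN export JAVA_HOME" has no lasting effect and is not needed.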
