Installing python packages in Serverless Dataproc GCP - python

I want to install some Python packages (e.g. python-json-logger) on Serverless Dataproc. Is there a way to use an initialization action to install Python packages on Serverless Dataproc? Please let me know.

You have two options:
Using the gcloud command in a terminal:
You can create a custom image with your dependencies (Python packages) in GCR (Google Container Registry), then pass its URI as a parameter to the command below, e.g.:
$ gcloud beta dataproc batches submit \
    --container-image=gcr.io/my-project-id/my-image:1.0.1 \
    --project=my-project-id --region=us-central1 \
    --jars=file:///usr/lib/spark/external/spark-avro.jar \
    --subnet=projects/my-project-id/regions/us-central1/subnetworks/my-subnet-name
See how to create a custom container image for Dataproc Serverless for Spark.
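For reference, a complete submission also names the workload type and main file, which the command above omits; a minimal sketch, assuming a PySpark batch whose entry point is a hypothetical gs://my-bucket/job.py:
$ gcloud beta dataproc batches submit pyspark gs://my-bucket/job.py \
    --container-image=gcr.io/my-project-id/my-image:1.0.1 \
    --project=my-project-id --region=us-central1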
Using the Airflow operator DataprocCreateBatchOperator:
Add the script below to your Python file; it installs the desired packages at runtime and then loads them from a path inside the container (Dataproc Serverless). The file must be saved in a bucket. It uses the Secret Manager package as an example.
python-file.py
import sys
import pip
import importlib
from warnings import warn
from dataclasses import dataclass

def load_package(package, path):
    warn("Update path order. Watch out for importing errors!")
    if path not in sys.path:
        sys.path.insert(0, path)
    module = importlib.import_module(package)
    return importlib.reload(module)

@dataclass
class PackageInfo:
    import_path: str
    pip_id: str

packages = [PackageInfo("google.cloud.secretmanager", "google-cloud-secret-manager==2.4.0")]
path = '/tmp/python_packages'

# Install the packages into a writable directory, then import them from there.
pip.main(['install', '-t', path, *[package.pip_id for package in packages]])
for package in packages:
    load_package(package.import_path, path=path)
...
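One caveat: pip.main was removed from pip's public API in pip 10, so on images with a newer pip the same install step is usually run through a subprocess instead; a minimal sketch using the path and packages defined above:
import subprocess
import sys

# Equivalent of pip.main(['install', '-t', path, ...]) on modern pip versions.
subprocess.check_call([sys.executable, '-m', 'pip', 'install', '-t', path,
                       *[package.pip_id for package in packages]])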
Finally, the operator calls python-file.py:
create_batch = DataprocCreateBatchOperator(
    task_id="batch_create",
    batch={
        "pyspark_batch": {
            "main_python_file_uri": "gs://bucket-name/python-file.py",
            "args": [
                "value1",
                "value2"
            ],
            "jar_file_uris": ["gs://bucket-name/jar-file.jar"],
        },
        "environment_config": {
            "execution_config": {
                "subnetwork_uri": "projects/my-project-id/regions/us-central1/subnetworks/my-subnet-name"
            },
        },
    },
    batch_id="batch-create",
)
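For context, here is a minimal sketch of the DAG around this operator; the import path comes from the apache-airflow-providers-google package, while the DAG id, start date, project, and region are illustrative assumptions:
import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.dataproc import DataprocCreateBatchOperator

with DAG(
    dag_id="dataproc_serverless_batch",        # illustrative name
    start_date=datetime.datetime(2022, 1, 1),  # illustrative date
    schedule_interval=None,
) as dag:
    create_batch = DataprocCreateBatchOperator(
        task_id="batch_create",
        project_id="my-project-id",   # assumption: same project as above
        region="us-central1",         # assumption: same region as the subnet
        batch_id="batch-create",
        batch={
            "pyspark_batch": {
                "main_python_file_uri": "gs://bucket-name/python-file.py",
            },
        },
    )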

Related

Module doesn't have member in JSII Python package

I am following the AWS User Guide to create a Python package from TypeScript source via JSII. My TypeScript source looks like this:
export interface GreeterProps {
  readonly greetee: string;
}

export class Greeter {
  private readonly greetee: string;

  public constructor(props: GreeterProps) {
    this.greetee = props.greetee;
  }

  public greet(): string {
    return `Hello, ${this.greetee}!`;
  }
}
This is the relevant section from the JSII config for Python:
"targets": {
"python": {
"distName": "jsii-test.jsii-test",
"module": "jsii_test.jsii_test"
}
}
The project builds without errors, and the Python package is created successfully. I uploaded the package (via twine) to AWS CodeArtifact and installed it (via pip). When I import it in an interactive Python console (import jsii_test) it imports successfully, but it doesn't seem to have the members exported from the original TypeScript source (GreeterProps, Greeter). What am I missing?
Project source: https://github.com/YuriGal/jsii-test
It turned out pip was installing the package into the wrong location (I have to sort out my Python versions). When I added the path where the package was installed:
import sys
sys.path.append('/usr/local/lib/python3.10/site-packages')
everything worked.
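If you hit the same symptom, a quick sanity check (standard library only) is to compare where pip installs packages with where the interpreter you are running actually looks:
import site
import sys

print(sys.executable)          # which interpreter is actually running
print(sys.version)             # and which Python version it is
print(site.getsitepackages())  # the site-packages directories it searches
If pip show <your-dist-name> reports a location outside these directories, pip and your console are using different Python installations.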

Runtime.ImportModuleError: Unable to import module (lambda)

When I check the CloudWatch logs of my Lambda function, I see these errors:
[ERROR] Runtime.ImportModuleError: Unable to import module 'trigger_bitbucket_pipeline_from_s3': No module named 'requests'
File structure:
/bin
--trigger_bitbucket_pipeline_from_s3.zip
/src
--trigger_bitbucket_pipeline_from_s3.py
--/requests (lib folder)
lambda.tf
lambda.tf:
data "archive_file" "lambda_zip" {
  type             = "zip"
  source_file      = "${path.module}/src/trigger_bitbucket_pipeline_from_s3.py"
  output_file_mode = "0666"
  output_path      = "${path.module}/bin/trigger_bitbucket_pipeline_from_s3.zip"
}

resource "aws_lambda_function" "processing_lambda" {
  filename         = data.archive_file.lambda_zip.output_path
  function_name    = "triggering_pipleline_lambda"
  handler          = "trigger_bitbucket_pipeline_from_s3.lambda_handler"
  source_code_hash = data.archive_file.lambda_zip.output_base64sha256
  role             = aws_iam_role.processing_lambda_role.arn
  runtime          = "python3.9"
}
My lambda function in src/trigger_bitbucket_pipeline_from_s3.py is pretty straightforward for now:
import logging
import requests

logger = logging.getLogger()
logger.setLevel(logging.INFO)

def lambda_handler(event, context):
    logger.info(f'## EVENT: {event}')
    return {
        'statusCode': 200,
    }
What am I doing wrong? I have already double checked file names.
That is because there is no module named 'requests' in Lambda; remember, Lambda is serverless, so you need to package all your dependencies before you run it.
One way to solve this is to install that dependency locally in your project:
pip install requests -t ./
Then build the .zip file again (with the dependency in it) and upload it to your Lambda function.
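Note that the archive_file block in the question only zips the single source_file, so the requests folder under src/ never makes it into the package; either switch it to source_dir = "${path.module}/src" or build the zip yourself. A rough manual sketch, following the question's layout:
pip install requests -t ./src
cd src
zip -r ../bin/trigger_bitbucket_pipeline_from_s3.zip .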
Another way to solve it is to use a custom layer in AWS Lambda that contains the relevant 'requests' site-packages you require. Example:
https://dev.to/razcodes/how-to-create-a-lambda-layer-in-aws-106m
You typically receive this error when your Lambda environment can't find the library referenced in your Python code. This is because Lambda isn't prepackaged with all Python libraries.
To resolve this error, create a deployment package or Lambda layer that includes the libraries you want to use in your Python code for Lambda.
Make sure that you put the library that you import inside a /python folder.
In your local environment, install all the library files into the python folder by running the following:
pip install librarywhatyouneed -t python/
Then zip the python folder with all of its dependencies inside and attach it to your function as a layer ("Add a layer") on AWS.
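Concretely, a minimal sketch of building such a layer for requests (the layer zip name is illustrative):
pip install requests -t python/
zip -r requests-layer.zip python/
# Upload requests-layer.zip as a Lambda layer, then attach the layer to the function.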

ModuleNotFoundError in pipenv shell?

A Python project I'm working on recently switched from using a virtualenv with a requirements.txt to using pipenv. The root directory contains the following Pipfile:
[[source]]
url = "https://pypi.org/simple"
verify_ssl = true
name = "pypi"
[packages]
# AWS SDK for Python
boto3 = "==1.6.17"
# Use DATABASE_URL env variable to configure Django application
dj-database-url = "==0.4.2"
# Web framework
django = "==1.11.9"
# Django email integration for transactional ESPs
django-anymail = "==0.10"
# Log of changes made to a model
django-auditlog = "==0.4.5"
# Compresses linked and inline JavaScript or CSS into a single cached file
django-compressor = "==2.2"
# Save and retrieve current request object anywhere in your code
django-crequest = "==2018.5.11"
# Blocks people from brute forcing login attempts
django-defender = "==0.5.4"
# Wrap standard Django fields with encryption
django-encrypted-model-fields = "==0.5.3"
# Custom extensions for the Django Framework
django-extensions = "==2.0.0"
# A set of high-level abstractions for Django forms
django-formtools = "==2.1"
# Import and export data in multiple formats (Excel, CSV, JSON, and so on)
django-import-export = "==0.5.1"
# OAuth2 for Django
django-oauth-toolkit = "==1.0.0"
# SASS integration
django-sass-processor = "==0.5.7"
# Collection of custom storage backends for Django
django-storages = "==1.6.6"
# Two-Factor Authentication for Django
django-two-factor-auth = "==1.7.0"
# Tweak the form field rendering in templates
django-widget-tweaks = "==1.4.1"
# Toolkit for building Web APIs
djangorestframework = "==3.6.3"
# Fixtures replacement
factory-boy = "==2.10.0"
# Style Guide Enforcement
flake8 = "==3.5.0"
# Allows tests to travel through time by mocking the datetime module
freezegun = "==0.3.9"
# Python WSGI HTTP Server
gunicorn = "==19.7.1"
# Newrelic adapter
newrelic = "==2.90.0.75"
# Parsing, formatting, and validating international phone numbers
phonenumbers = "==8.9.1"
# Imaging processing library
pillow = "==5.0.0"
# PostgreSQL adapter
psycopg2 = "==2.7.1"
# Python exception notifier for Airbrake
pybrake = "==0.3.3"
# ISO databases for languages, countries and subdivisions
pycountry = "==18.2.23"
# Extensions to the standard datetime module
python-dateutil = "==2.6.0"
# Loads environment variables from .env file
python-dotenv = "==0.7.1"
# HTTP library
requests = "==2.19.1"
# Python library to capitalize strings
titlecase = "==0.12.0"
# Communication with the Twilio API
twilio = "==6.4.3"
# Static file serving
whitenoise = "==3.3.1"
[dev-packages]
[requires]
python_version = "3.7.0"
As you can see, the packages include python-dotenv, which is used in our Django project's manage.py. However, if I activate the pipenv shell:
Kurts-MacBook-Pro-2:lucy-web kurtpeek$ pipenv shell
Loading .env environment variables...
Creating a virtualenv for this project...
Pipfile: /Users/kurtpeek/Documents/Dev/lucy2/lucy-web/Pipfile
Using /Library/Frameworks/Python.framework/Versions/3.7/bin/python3.7m (3.7.0) to create virtualenv...
Running virtualenv with interpreter /Library/Frameworks/Python.framework/Versions/3.7/bin/python3.7m
Using base prefix '/Library/Frameworks/Python.framework/Versions/3.7'
/usr/local/Cellar/pipenv/2018.7.1/libexec/lib/python3.7/site-packages/virtualenv.py:1041: DeprecationWarning: the imp module is deprecated in favour of importlib; see the module's documentation for alternative uses
import imp
New python executable in /Users/kurtpeek/.local/share/virtualenvs/lucy-web-CVxkrCFK/bin/python3.7m
Also creating executable in /Users/kurtpeek/.local/share/virtualenvs/lucy-web-CVxkrCFK/bin/python
Installing setuptools, pip, wheel...done.
Setting project for lucy-web-CVxkrCFK to /Users/kurtpeek/Documents/Dev/lucy2/lucy-web
Virtualenv location: /Users/kurtpeek/.local/share/virtualenvs/lucy-web-CVxkrCFK
Launching subshell in virtual environment…
bash-3.2$ . /Users/kurtpeek/.local/share/virtualenvs/lucy-web-CVxkrCFK/bin/activate
(lucy-web-CVxkrCFK) bash-3.2$
and then try to run python manage.py shell, I get a ModuleNotFoundError for dotenv:
(lucy-web-CVxkrCFK) bash-3.2$ python manage.py shell
Traceback (most recent call last):
File "manage.py", line 4, in <module>
from dotenv import load_dotenv, find_dotenv
ModuleNotFoundError: No module named 'dotenv'
Also, if I do a pip freeze command I don't see any packages installed:
(lucy-web-CVxkrCFK) bash-3.2$ pip freeze
(lucy-web-CVxkrCFK) bash-3.2$
Should the pipenv shell not have the packages in the Pipfile already installed? Or do I need to do an additional step?
It turns out that you indeed need to run pipenv install to actually install the packages. The pipenv shell command only activates the virtual environment.
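In other words:
pipenv install   # creates the virtualenv if needed and installs everything from the Pipfile
pipenv shell     # only launches a subshell with that virtualenv activated
pip freeze       # should now list boto3, django, python-dotenv, ...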
Alternatively, try using a plain virtual environment instead:
python -m venv venv
.\venv\Scripts\activate    # on Windows; on macOS/Linux: source venv/bin/activate
pip install -r requirements.txt

How can I get my gradle test task to use python pip install for library that isn't on maven central?

I am trying to set up a Gradle task that will run Robot tests. Robot uses a Python library to interact with Selenium in order to test a web page through a browser. But unfortunately it seems the only way to install https://github.com/robotframework/Selenium2Library is via pip: pip install robotframework-selenium2library. Is there a way to get Gradle to run this command in my task?
Here's what I have:
build.gradle:
configurations {
    //...
    acceptanceTestRuntime { extendsFrom testCompile, runtime }
}

dependencies {
    //...
    acceptanceTestRuntime group: 'org.robotframework', name: 'robotframework', version: '2.8.7'
    // The following doesn't work, apparently this library isn't on Maven...
    //acceptanceTestRuntime group: 'org.robotframework', name: 'Selenium2Library', version: '1.+'
}

sourceSets {
    //...
    acceptanceTest {
        runtimeClasspath = sourceSets.test.output + configurations.acceptanceTestRuntime
    }
}

task acceptanceTest(type: JavaExec) {
    classpath = sourceSets.acceptanceTest.runtimeClasspath
    main = 'org.robotframework.RobotFramework'
    args '--variable', 'BROWSER:gc'
    args '--outputdir', 'target'
    args 'src/testAcceptance'
}
My robot resources file - login.resource.robot:
*** Settings ***
Documentation     A resource file for my example login page test.
Library           Selenium2Library

*** Variables ***
${SERVER}         localhost:8080
(etc.)

*** Keywords ***
Open Browser to Login Page
    Open Browser    ${LOGIN_URL}    ${BROWSER}
    Maximize Browser Window
    Set Selenium Speed    ${DELAY}
    Login Page Should Be Open

Login Page Should Be Open
    Location Should Be    ${LOGIN_URL}
And when I run this task, my Robot tests are run, but they fail because certain keywords defined in robotframework-selenium2library, such as "Open Browser", aren't recognized, and an exception is thrown.
How can I get Gradle to import this Selenium library for this task? Can I install and call pip via some Python plugin?
I had to use a Gradle Exec task to run a Python script that then kicked off the Robot tests. It looked like this:
build.gradle
task acceptanceTest(type: Exec) {
    workingDir 'src/testAcceptance'
    commandLine 'python', 'run.py'
}
src/testAcceptance/run.py
import os
import robot
import setup  # importing setup.py runs the pip installs below

os.environ['ROBOT_OPTIONS'] = '--variable BROWSER:gc --outputdir results'
robot.run('.')
src/testAcceptance/setup.py
import os
import sys
import pip
import re

pip.main(['install', 'robotframework==3.0'])
pip.main(['install', 'robotframework-selenium2library==1.8.0'])
# Checksums can be looked up by chromedriver version here - http://chromedriver.storage.googleapis.com/index.html
pip.main(['install', '--upgrade', 'chromedriver_installer',
          '--install-option=--chromedriver-version=2.24',
          '--install-option=--chromedriver-checksums=1a46c83926f891d502427df10b4646b9,d117b66fac514344eaf80691ae9a4687,' +
          'c56e41bdc769ad2c31225b8495fc1a93,8e6b6d358f1b919a0d1369f90d61e1a4'])

# Add the Scripts dir to the PATH, since that's where chromedriver is installed
scriptsDir = re.sub(r'[A-Za-z0-9\.]+$', '', sys.executable) + 'Scripts'
os.environ['PATH'] += os.pathsep + scriptsDir

AWS Lambda not detecting pyopenssl

I have an AWS Lambda function that uses oauth2client and SignedJwtAssertionCredentials.
I have installed my requirements locally, at the root of my Lambda function directory.
requirements.txt
boto3==1.2.5
gspread==0.3.0
oauth2client==1.5.2
pyOpenSSL==0.15.1
pycrypto==2.6.1
My lambda function looks like:
import boto3
import gspread
from oauth2client.client import SignedJwtAssertionCredentials

def lambda_handler(event, context):
    dynamodb = boto3.resource('dynamodb')
    scope = ['https://spreadsheets.google.com/feeds']
    private_key = "!--some-private-key"
    google_email = "some-email"
    credentials = SignedJwtAssertionCredentials(google_email, private_key, scope)
    gc = gspread.authorize(credentials)
However, when running this, I get the following stack trace:
{
  "stackTrace": [
    [
      "/var/task/lambda_function.py",
      20,
      "lambda_handler",
      "credentials = SignedJwtAssertionCredentials(google_email, private_key, scope)"
    ],
    [
      "/var/task/oauth2client/util.py",
      140,
      "positional_wrapper",
      "return wrapped(*args, **kwargs)"
    ],
    [
      "/var/task/oauth2client/client.py",
      1630,
      "__init__",
      "_RequireCryptoOrDie()"
    ],
    [
      "/var/task/oauth2client/client.py",
      1581,
      "_RequireCryptoOrDie",
      "raise CryptoUnavailableError('No crypto library available')"
    ]
  ],
  "errorType": "CryptoUnavailableError",
  "errorMessage": "No crypto library available"
}
From everything I've read online, I need to install pyopenssl. However, I already have it installed, along with pycrypto.
Is there something I'm missing?
This is a bit of an old question, but if you are still looking for an answer:
This occurs because one or more of pyopenssl's dependencies is a native package or has native bindings that are not compiled for the target platform (cryptography is a dependency of pyopenssl and depends on libssl).
Unfortunately, the process for obtaining compiled versions varies. The simplest way (which works only if the difference is in the platform, not in missing .so libraries) is to:
Create an EC2 host (use t2.micro and the Amazon Linux AMI)
Install python and virtualenv
Create a virtualenv
Install your target library
Zip up the virtualenv's site-packages and dist-packages directories and move them off the machine
Discard the machine
This zip will then need to be expanded into your Lambda zip before uploading. The result will be the required packages residing at the root of your zip file (not in site-packages or dist-packages folders); see the sketch below.
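As a rough sketch of those steps on the build host (the Python version, virtualenv name, and package list are assumptions tied to this question's era):
# On the t2.micro Amazon Linux host:
virtualenv build-env
source build-env/bin/activate
pip install pyOpenSSL pycrypto oauth2client gspread
cd build-env/lib/python2.7/site-packages
zip -r9 ~/site-packages.zip .
# Copy site-packages.zip off the host and merge its contents
# into the root of your Lambda deployment zip.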
For simple dependencies this works; if you require native libraries as well (such as for NumPy or SciPy), you will need to take more elaborate approaches, such as the ones outlined here: http://thankcoder.com/questions/jns3d/using-moviepy-scipy-and-numpy-in-amazon-lambda
