I am trying to run my custom Python code, which requires libraries that are not supported by AWS Glue (pandas). So I created a zip file with the necessary libraries and uploaded it to an S3 bucket, and when running the job I pointed to that S3 path in the advanced properties. Still, my job is not running successfully. Can anyone suggest why?
1. Do I have to include my code in the zip file? If yes, how will Glue know which file is the code?
2. Also, do I need to create a proper package, or will a plain zip file do?
Appreciate the help!
An update on AWS Glue Jobs released on 22nd Jan 2019.
Introducing Python Shell Jobs in AWS Glue -- Posted On: Jan 22, 2019
Python shell jobs in AWS Glue support scripts that are compatible with
Python 2.7 and come pre-loaded with libraries such as the Boto3,
NumPy, SciPy, pandas, and others. You can run Python shell jobs using
1 DPU (Data Processing Unit) or 0.0625 DPU (which is 1/16 DPU). A
single DPU provides processing capacity that consists of 4 vCPUs of
compute and 16 GB of memory.
More info:
https://aws.amazon.com/about-aws/whats-new/2019/01/introducing-python-shell-jobs-in-aws-glue/
https://docs.aws.amazon.com/glue/latest/dg/add-job-python.html
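If you go down the Python shell route, a minimal sketch of creating such a job with boto3 might look like the following; the job name, role, region and script location are placeholders, and it assumes the script is already uploaded to S3:

```python
import boto3

# Placeholder region, role, job name and script path.
glue = boto3.client("glue", region_name="us-east-1")

glue.create_job(
    Name="my-pandas-shell-job",
    Role="MyGlueServiceRole",
    Command={
        "Name": "pythonshell",  # a Python shell job rather than a Spark ETL job
        "ScriptLocation": "s3://my-bucket/scripts/my_job.py",
    },
    MaxCapacity=0.0625,  # 1/16 DPU; use 1 for heavier workloads
)
```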
According to AWS Glue Documentation:
Only pure Python libraries can be used. Libraries that rely on C
extensions, such as the pandas Python Data Analysis Library, are not
yet supported.
I think it wouldn't work even if we upload the Python library as a zip file, if the library you are using depends on C extensions. I tried using pandas, holidays, etc. the same way you have, and on contacting AWS Support they mentioned that support for these Python libraries is on their to-do list, but there is no ETA as of now.
So, at this point, any library that is not pure Python will not work in AWS Glue, but that should change in the near future, since this is a popular request.
If you would still like to try it out, please refer to this link, where it is explained how to package external libraries to run in AWS Glue. I tried it, but it didn't work for me.
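For reference, that packaging approach boils down to zipping the pure-Python package and pointing the job's Python library path (the --extra-py-files argument) at it. A rough boto3 sketch, with placeholder bucket, script and role names:

```python
import boto3

glue = boto3.client("glue")

# Placeholder names. The zip should contain only pure-Python packages,
# with each package directory at the root of the archive.
glue.create_job(
    Name="job-with-external-libs",
    Role="MyGlueServiceRole",
    Command={"Name": "glueetl", "ScriptLocation": "s3://my-bucket/scripts/etl.py"},
    DefaultArguments={
        # Comma-separated S3 paths to .zip/.egg files holding the libraries
        "--extra-py-files": "s3://my-bucket/libs/external_libs.zip",
    },
)
```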
As Yuva's answer mentioned, I believe it is currently impossible to import a library that is not written purely in Python, and the documentation reflects that.
However, in case someone comes here looking for an answer on how to import a Python library into AWS Glue in general, there is a good explanation in this post of how to do it with the pg8000 library:
AWS Glue - Truncate destination postgres table prior to insert
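To sketch what that looks like inside the job script, assuming pg8000 has already been supplied via the Python library path and using placeholder connection details:

```python
import pg8000  # pure-Python PostgreSQL driver, shipped to the job as a zip/egg

# Placeholder connection details -- substitute your own host, credentials and table.
conn = pg8000.connect(
    host="mydb.example.com",
    port=5432,
    database="mydatabase",
    user="glue_user",
    password="secret",
)

cursor = conn.cursor()
cursor.execute("TRUNCATE TABLE destination_table")  # clear the target before inserting
conn.commit()
conn.close()
```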
Because Google Cloud Functions does not support Python, I was hoping to use google-cloud-datastore in AWS Lambda, but I hit the same error as in:
AWS Lambda to Firestore error: cannot import name 'cygrpc'
google-cloud-storage works just fine in Lambda, so the core packaging is not the issue, but could pip-installing the datastore package be missing some dependencies?
Setting the environment variable GOOGLE_CLOUD_DISABLE_GRPC=true does not help, as the error occurs at import itself. I wonder if this is not a design flaw: you would want to import any gRPC-related libraries only if gRPC is enabled (which is the default). It seems cygrpc is being loaded regardless.
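Roughly the ordering I tried (simplified sketch; the handler body and project ID are just placeholders):

```python
import os

# Must be set before any google.cloud import to have a chance of taking effect.
os.environ["GOOGLE_CLOUD_DISABLE_GRPC"] = "true"

# This import is where it blows up: cygrpc gets pulled in unconditionally,
# regardless of the variable set above.
from google.cloud import datastore  # ImportError: cannot import name 'cygrpc'


def handler(event, context):
    client = datastore.Client(project="my-gcp-project")  # placeholder project ID
    key = client.key("Task", 1234)
    return client.get(key)
```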
I could try downgrading the datastore module to a version that only supported the HTTP API, but I'm not sure which one, and that would most likely require changing the storage version too.
If we manage to get the datastore library working, our next attempt will be BigQuery, which should work since it doesn't use gRPC (yet?).
I am running MapReduce code on Amazon EMR using Python, which uses the native boto library. I need to know which packages are pre-installed on the cluster nodes. Also, how do I automatically install some modules while bootstrapping?
I'm looking for a visualisation and analytics notebook engine for BigQuery and am interested in Apache Zeppelin.
We have internal capability in Python and R and want to use this with our BigQuery back end.
All the installation scenarios I've seen so far (e.g. https://cloud.google.com/blog/big-data/2016/09/analyzing-bigquery-datasets-using-bigquery-interpreter-for-apache-zeppelin) seem to require the installation of a fairly hefty Scala/Spark cluster, which I don't see the need for (and which would cost a lot).
Is it possible to install Zeppelin without the cluster in Google Cloud?
Starting with version 0.6.1, a native BigQuery interpreter for Apache Zeppelin is available.
It allows you to process and analyze datasets stored in Google BigQuery by directly running SQL against it from within an Apache Zeppelin notebook.
So you no longer need to query BigQuery through Apache Spark, which was the only way before.
I'm trying to get Google sign-in working using their Python API, but the server I'm on (4UHosting) has Python 2.6 and pyOpenSSL 0.10-2 (five years old).
This means that the API's call to OpenSSL.crypto.verify() fails as it doesn't exist in this version.
I can't install these myself, even with pip install --user, as they require a compiler, which I don't have access to. The admins are reluctant to install any updates that are not vetted, and they won't install pyOpenSSL or Python 2.7 locally just for me. I can't find anything in the pyOpenSSL 0.10-2 documentation that would offer an equivalent of verify().
I'm looking for some suggestions as to where to go from here.
Any suggestions would be greatly appreciated,
Cyrille
A few ideas:
You could make your API calls directly against the Google API endpoints instead of using the Python client library; for example, the token info endpoint can be used to verify ID tokens (see the sketch after this list)
You could do the sign-in operations client-side and transfer data to your server once a session is established
You could use another language (e.g. Ruby) for the sign-in operations
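For the first idea, here is a rough sketch of verifying an ID token against the tokeninfo endpoint using only the standard library, so it should also run on your Python 2.6 setup; the token value and client ID below are placeholders:

```python
import json
import urllib
import urllib2


def verify_id_token(id_token, expected_client_id):
    """Check a Google ID token via the tokeninfo endpoint (no pyOpenSSL needed)."""
    url = "https://www.googleapis.com/oauth2/v3/tokeninfo?" + urllib.urlencode(
        {"id_token": id_token}
    )
    try:
        payload = json.loads(urllib2.urlopen(url).read())
    except urllib2.HTTPError:
        return None  # invalid or expired token
    # The audience must match your own OAuth client ID.
    if payload.get("aud") != expected_client_id:
        return None
    return payload  # contains 'sub', 'email', etc.


# Placeholder values for illustration only.
info = verify_id_token("ID_TOKEN_FROM_CLIENT",
                       "1234567890-abc.apps.googleusercontent.com")
```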