Accessing Delta Lake through PySpark on EMR notebooks - python

I have a question about using external libraries like delta-core with AWS EMR notebooks. Currently there isn't any mechanism for installing the delta-core libraries as PyPI packages. The available options include:
Launching the PySpark kernel with the --packages option.
Setting the packages option from the Python script through os configuration, but I don't see it downloading the packages, and I still get an import error when importing the delta.tables library.
Downloading the JARs manually, but there doesn't appear to be a way to do that on EMR notebooks.
Has anyone tried this out before?

You can download the jars while creating the EMR cluster using bootstrap scripts.
You can also place the jars in S3 and pass them to PySpark with the --jars option.
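For EMR notebooks specifically, the session configuration can also be set from a %%configure cell before the Spark session starts. A rough sketch, assuming a Spark 3 cluster and a hypothetical S3 bucket/path for the jar (adjust the Delta version and Scala suffix to match your cluster; the extension settings apply to Delta 0.7+):
%%configure -f
{
    "conf": {
        "spark.jars": "s3://your-bucket/jars/delta-core_2.12-0.8.0.jar",
        "spark.sql.extensions": "io.delta.sql.DeltaSparkSessionExtension",
        "spark.sql.catalog.spark_catalog": "org.apache.spark.sql.delta.catalog.DeltaCatalog"
    }
}
If the cluster has outbound internet access, spark.jars.packages with the Maven coordinate (e.g. io.delta:delta-core_2.12:0.8.0) can be used instead of staging the jar in S3.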

Related

How to connect Azure Python package in WebJob

I have code on my local machine related to our business, and I am trying to deploy it to Azure, but I am getting a few import errors and a few internal server errors.
I am interacting with some services like Storage, so I installed all the packages with pip (pip is also at the latest version).
I am new to Azure and to working with its SDKs. Any suggestions or steps are highly appreciated.
Locally, all the packages live under site-packages. Whenever you install packages, install them with the virtual environment activated locally, so that they are accessible when you import them.
You can try something like the below in your code so that your WebJob will load all your packages when the code runs:
import sys

# Use a raw string so the backslashes in the Windows path are not treated as escape sequences.
package = r"D:\home\site\wwwroot\env\Lib\site-packages"
sys.path.append(package)
You can also refer to this SO post, which has a clear explanation of a problem similar to yours; thanks to Gary for covering it.

Is there any way to run and deploy Ubuntu packages on Azure Functions startup?

In my Azure Function app, I have some Ubuntu packages, like the Azure CLI and kubectl, that I need to install on the host whenever it starts a new container. I have already tried start-up commands and also going into the Bash console. The former doesn't work, and the latter tells me permission is denied and the resource is locked. Is there any way to install these packages on function start-up in Azure Functions?
Trying to install the package via Bash is impossible and will not work at all. When you write functions in Python and deploy them to Linux on Azure, the platform installs the various packages according to requirements.txt and merges them into a single bundle; when you run the function on Azure, you run against that bundle. Therefore, trying to install a package after deployment is the wrong approach: specify the packages to be installed in requirements.txt before deployment, and then deploy to Azure.

SageMaker notebook connected to EMR import custom Python module

I looked through similar questions but none of them solved my problem.
I have a SageMaker notebook instance and have opened a SparkMagic PySpark notebook connected to an AWS EMR cluster. I also have a SageMaker repo called dsci-Python connected to this notebook.
Directory looks like:
/home/ec2-user/SageMaker/dsci-Python
/home/ec2-user/SageMaker/dsci-Python/pyspark_mle/datalake_data_object/SomeClass
/home/ec2-user/SageMaker/dsci-Python/Pyspark_playground.ipynb
There are __init__.py files under both the pyspark_mle and datalake_data_object directories, and I have no problem importing them in other environments.
When I run this code in Pyspark_playground.ipynb:
from pyspark_mle.datalake_data_object.SomeClass.SomeClass import Something
I get No module named 'pyspark_mle'.
I think this is an environment path thing.
The repo is on your Notebook Instance, whereas the PySpark kernel is executing code on the EMR cluster.
To access these local modules on the EMR cluster, you can clone the repository on the EMR cluster.
Also, SparkMagic has a useful magic, send_to_spark, which can be used to send data from the notebook locally to the Spark kernel: https://github.com/jupyter-incubator/sparkmagic/blob/master/examples/Send%20local%20data%20to%20Spark.ipynb
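If you go the clone route, a minimal sketch of making the package importable from the PySpark kernel, assuming the repository was cloned to /home/hadoop/dsci-Python on the cluster (the path is a placeholder):
import sys

# Placeholder clone location on the EMR cluster; adjust to wherever you cloned dsci-Python.
sys.path.append("/home/hadoop/dsci-Python")

from pyspark_mle.datalake_data_object.SomeClass.SomeClass import Something
Note this only puts the package on the driver's path; if the executors also need it, you would additionally have to distribute the code, for example by zipping the package and passing the archive to sc.addPyFile.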

How to Connect to RDS Instance from AWS Glue Python Shell?

I am trying to access an RDS instance from AWS Glue. I have a few Python scripts running on EC2 instances, and I currently use pyodbc to connect, but while trying to schedule jobs in Glue I cannot import pyodbc, as it is not natively supported by AWS Glue; I am also not sure how drivers would work in the Glue shell.
From the Introducing Python Shell Jobs in AWS Glue announcement:
Python shell jobs in AWS Glue support scripts that are compatible with Python 2.7 and come pre-loaded with libraries such as the Boto3, NumPy, SciPy, pandas, and others.
The module list doesn't include the pyodbc module, and it cannot be provided as a custom .egg file because it depends on the libodbc.so.2 and pyodbc.so libraries.
I think you have 2 options:
Create a JDBC connection to your DB from Glue's console, and use Glue's internal methods to query it. This will require code changes, of course (a rough sketch follows below).
Use a Lambda function instead. You'll need to pack pyodbc and the required libs along with your code in a zip file. Someone has already compiled those libs for AWS Lambda; see here.
Hope it helps
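If it helps, a rough sketch of the first option, assuming you switch to a Glue Spark (ETL) job rather than a Python shell job and that a crawler has already populated the Data Catalog through the JDBC connection (the database and table names below are placeholders):
from pyspark.context import SparkContext
from awsglue.context import GlueContext

# Reuse the Glue job's Spark context and read the table registered in the Data Catalog.
glueContext = GlueContext(SparkContext.getOrCreate())
dyf = glueContext.create_dynamic_frame.from_catalog(database="glue_db", table_name="my_rds_table")
df = dyf.toDF()  # convert to a Spark DataFrame for further querying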
For AWS Glue, use either a DataFrame or a DynamicFrame and specify the SQL Server JDBC driver. AWS Glue already contains the JDBC driver for SQL Server in its environment, so you don't need to add any additional driver jar to the Glue job.
df1 = (spark.read.format("jdbc")
    .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
    .option("url", url_src)
    .option("dbtable", dbtable_src)
    .option("user", userID_src)
    .option("password", password_src)
    .load())
If you are using a SQL query instead of a table:
df1 = (spark.read.format("jdbc")
    .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
    .option("url", url_src)
    .option("dbtable", "(your select statement here) A")
    .option("user", userID_src)
    .option("password", password_src)
    .load())
As an alternative, you can also use the jTDS driver for SQL Server in your Python script running in AWS Glue.
If anyone needs a Postgres connection with SQLAlchemy from a Python shell job, it is possible by referencing the sqlalchemy, scramp, and pg8000 wheel files; it's important to rebuild the pg8000 wheel after removing the scramp dependency from its setup.py.
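For reference, once those wheels are attached to the job, a minimal connection sketch using the pg8000 dialect (host and credentials are placeholders):
from sqlalchemy import create_engine, text

# Placeholder RDS endpoint and credentials; assumes the sqlalchemy, scramp and pg8000 wheels are referenced by the job.
engine = create_engine("postgresql+pg8000://user:password@my-rds-endpoint:5432/mydb")
with engine.connect() as conn:
    print(conn.execute(text("select 1")).fetchall())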
I needed to do something similar and ended up creating another Glue job in Scala while using Python for everything else. I know it may not work for everyone, but I wanted to mention it: How to run DDL SQL statement using AWS Glue
I was able to use the Python library psycopg2 even though it is not written in pure Python and does not come preloaded with the AWS Glue Python shell environment, which runs contrary to the AWS Glue documentation. So you might be able to use ODBC-related Python libraries in a similar way. I created .egg files for the psycopg2 library and used them successfully within the Glue Python shell environment. The following are the logs from the Glue Python shell when your script contains import psycopg2 and the Glue job references the related psycopg2 .egg files:
Creating /glue/lib/installation/site.py
Processing psycopg2-2.8.3-py2.7.egg
Copying psycopg2-2.8.3-py2.7.egg to /glue/lib/installation
Adding psycopg2 2.8.3 to easy-install.pth file
Installed /glue/lib/installation/psycopg2-2.8.3-py2.7.egg
Processing dependencies for psycopg2==2.8.3
Searching for psycopg2==2.8.3
Reading https://pypi.org/simple/psycopg2/
Downloading https://files.pythonhosted.org/packages/5c/1c/6997288da181277a0c29bc39a5f9143ff20b8c99f2a7d059cfb55163e165/psycopg2-2.8.3.tar.gz#sha256=897a6e838319b4bf648a574afb6cabcb17d0488f8c7195100d48d872419f4457
Best match: psycopg2 2.8.3
Processing psycopg2-2.8.3.tar.gz
Writing /tmp/easy_install-dml23ld7/psycopg2-2.8.3/setup.cfg
Running psycopg2-2.8.3/setup.py -q bdist_egg --dist-dir /tmp/easy_install-dml23ld7/psycopg2-2.8.3/egg-dist-tmp-9qwen3l_
creating /glue/lib/installation/psycopg2-2.8.3-py3.6-linux-x86_64.egg
Extracting psycopg2-2.8.3-py3.6-linux-x86_64.egg to /glue/lib/installation
Removing psycopg2 2.8.3 from easy-install.pth file
Adding psycopg2 2.8.3 to easy-install.pth file
Installed /glue/lib/installation/psycopg2-2.8.3-py3.6-linux-x86_64.egg
Finished processing dependencies for psycopg2==2.8.3
These are the steps that I used to connect to an RDS instance from a Glue Python shell job:
Package your dependencies into an egg file (these packages must be pure Python, if I remember correctly) and put it in S3.
Set your job to reference that egg file under Job configuration > Python library path.
Verify that your job can import the package/module.
Create a Glue connection to your RDS instance (it's under Database > Tables, Connections) and test the connection to make sure it can reach your RDS instance.
Now set your job to reference/use this connection; it's in the required connections section when you configure or edit your job.
Once those steps are done and verified, you should be able to connect. In my sample I used pymysql (a rough sketch follows below).
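A minimal sketch of that last step with pymysql, assuming the egg file and the Glue connection are set up as above (endpoint and credentials are placeholders):
import pymysql

# Placeholder RDS endpoint and credentials.
conn = pymysql.connect(host="my-rds-endpoint", user="admin", password="secret", database="mydb", connect_timeout=5)
with conn.cursor() as cur:
    cur.execute("SELECT 1")
    print(cur.fetchone())
conn.close()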

AWS EMR install python libraries

I am running a MapReduce job on Amazon EMR using Python, which uses the native boto library. I need to know which packages are pre-installed on the cluster nodes. Also, how do I automatically install some modules while bootstrapping?
