I'm currently playing around with the Apache Spark Service in IBM Bluemix. There is a quick start composite application (Boilerplate) consisting of the Spark Service itself, an OpenStack Swift service for the data and an IPython/Jupyter Notebook.
I want to add some 3rd party libraries to the system and I'm wondering how this could be achieved. Using a Python import statement doesn't really help, since the libraries are then expected to already be present on the Spark worker nodes.
Is there a way of loading Python libraries in Spark from an external source during job runtime (e.g. a Swift or FTP source)?
thanks a lot!
You cannot add 3rd party libraries at this point in the beta. This will most certainly be coming later in the beta as it's a popular requirement ;-)
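For reference, outside this managed beta, stock Spark does support shipping Python dependencies to the workers at job runtime via SparkContext.addPyFile, which accepts local paths as well as HTTP/HTTPS/FTP URIs. A minimal sketch; the URL and module name below are placeholders, not real endpoints:

# Sketch of the stock Spark mechanism (not available in this beta):
# ship a zipped package to the driver and workers at job runtime.
from pyspark import SparkContext

sc = SparkContext(appName="third-party-libs-demo")

# addPyFile accepts a local path or an HTTP/HTTPS/FTP URI; the file is
# distributed to every worker and added to the Python path.
sc.addPyFile("https://example.com/packages/mylib.zip")  # placeholder URL

import mylib  # hypothetical module inside the zip; import after shipping
result = sc.parallelize([1, 2, 3]).map(mylib.transform).collect()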
I wanted to share my Python program with my friends, but the problem is that they would first have to install Python and then all the libraries I used in order to run it. That might be hard to do, since I have used quite a lot of libraries, something like 15-20.
My questions:
Q1. How can I share my Python program without making them install so much stuff?
Q2. Is there any other language in which this could be done?
Thank you.
Regards
Google Colab
You can write your Python program in Google Colab and then share the notebook with others to run.
It's free to use, and your dependencies can be imported or installed based on the information from this Stack Overflow post.
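For example, a dependency Colab doesn't ship by default can usually be installed at the top of the notebook; the package here is only an illustration:

# In a Colab notebook cell, "!" runs a shell command; install any
# missing dependency before importing it. "emoji" is just an example.
!pip install emoji

import emoji
print(emoji.emojize("Sharing Python is :thumbs_up:"))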
Streamlit
Streamlit allows you to build custom, shareable web apps using Python. Its marketed use is for data science and machine learning projects. You should check their website to see if it satisfies your specific needs.
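As a taste of what that looks like, here is a minimal sketch (the file name and widget are arbitrary); you run it locally with streamlit run app.py and then deploy it through their sharing options:

# app.py - minimal Streamlit sketch; run with: streamlit run app.py
import streamlit as st

st.title("My shared program")
name = st.text_input("Your name")  # renders an input box in the browser
if name:
    st.write("Hello, " + name + "!")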
repl.it
On the website repl.it you can create public Python projects, which can even include PyPI dependencies. The user can then run and edit them, for example here: https://replit.com/#TedTaras/Monster-Hunter. Projects are public by default; private ones cost extra.
I am trying to provide my custom Python code, which requires libraries that are not supported by AWS Glue (pandas). So I created a zip file with the necessary libraries and uploaded it to an S3 bucket. While running the job, I pointed to the S3 bucket path in the advanced properties. Still, my job is not running successfully. Can anyone suggest why?
1. Do I have to include my code in the zip file?
If yes, how will Glue understand that it's the code?
2. Also, do I need to create a package, or will just a zip file do?
Appreciate the help!
An update on AWS Glue Jobs released on 22nd Jan 2019.
Introducing Python Shell Jobs in AWS Glue -- Posted On: Jan 22, 2019
Python shell jobs in AWS Glue support scripts that are compatible with Python 2.7 and come pre-loaded with libraries such as the Boto3, NumPy, SciPy, pandas, and others. You can run Python shell jobs using 1 DPU (Data Processing Unit) or 0.0625 DPU (which is 1/16 DPU). A single DPU provides processing capacity that consists of 4 vCPUs of compute and 16 GB of memory.
More info at: https://aws.amazon.com/about-aws/whats-new/2019/01/introducing-python-shell-jobs-in-aws-glue/
https://docs.aws.amazon.com/glue/latest/dg/add-job-python.html
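For anyone scripting this, a rough sketch of creating such a Python shell job with boto3 follows; the job name, role ARN, and script location are placeholders, not working values:

# Sketch: create a Glue Python shell job via boto3 (placeholder values).
import boto3

glue = boto3.client("glue", region_name="us-east-1")

glue.create_job(
    Name="my-pandas-shell-job",                      # placeholder name
    Role="arn:aws:iam::123456789012:role/GlueRole",  # placeholder role
    Command={
        "Name": "pythonshell",  # marks this as a Python shell job
        "ScriptLocation": "s3://my-bucket/scripts/job.py",  # placeholder
    },
    MaxCapacity=0.0625,  # 1/16 DPU, the smaller of the two options
)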
According to AWS Glue Documentation:
Only pure Python libraries can be used. Libraries that rely on C extensions, such as the pandas Python Data Analysis Library, are not yet supported.
I think it won't work, even if you upload the Python library as a zip file, if the library you are using depends on C extensions. I had tried using pandas, Holidays, etc. the same way you have, and on contacting AWS Support, they mentioned it is on their to-do list (support for these Python libraries), but there is no ETA as of now.
So, any libraries that are not pure Python will not work in AWS Glue at this point, but they should become available in the near future, since this is a popular demand.
If you would still like to try it out, please refer to this link, where it is explained how to package external libraries to run in AWS Glue. I tried it, but it didn't work for me.
As Yuva's answer mentioned, I believe it's currently impossible to import a library that is not pure Python, and the documentation reflects that.
However, in case someone came here looking for an answer on how to import a Python library in AWS Glue in general, there is a good explanation in this post on how to do it with the pg8000 library:
AWS Glue - Truncate destination postgres table prior to insert
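The gist of that approach: pg8000 is pure Python, so you can zip it, upload the zip to S3, point the job's "Python library path" (the --extra-py-files special parameter) at it, and then import it from your script. A hedged sketch; all connection details below are placeholders:

# Glue script sketch: import a pure-Python library shipped via the
# job's "Python library path" / --extra-py-files setting (e.g. a
# placeholder like s3://my-bucket/libs/pg8000.zip).
import pg8000

# Placeholder connection details for illustration only.
conn = pg8000.connect(
    host="mydb.example.com",
    user="glue_user",
    password="secret",
    database="reporting",
)
cur = conn.cursor()
cur.execute("TRUNCATE TABLE staging.events")  # e.g. clear before insert
conn.commit()
conn.close()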
I am new to Google Cloud Platform, and on the whole I have been working in Python 3. I am trying to find out which version of Python is more complete for Google App Engine: Python 2.7 or Python 3.
As I'm starting to work with Google App Engine, I have realised that continuing to use Python 3 seems too painful, as basic tools like dev_appserver.py are written for Python 2 only. Now I am hitting the opposite problem: the cloudstorage module seems to exist only for Python 3. Again, once I install it, it seems the only way I can test reads from and writes to a Google bucket locally is by authenticating with google.appengine.ext, which in turn only works within dev_appserver.py or remotely. This leaves me confused about which environment to choose.
What is the general agreement on, or the focus of, Google App Engine: Python 2 or Python 3?
In App Engine, you have two options: the Standard environment and the Flexible environment.
Python 2.7 is available in both Standard and Flexible, while Python 3.6 is only available in Flexible.
Also, the choice between Standard and Flexible depends on what you want to do/what libraries you need:
There are some third-party libraries already built into the Standard environment, and you can include other libraries, but those can't include C extensions; they must be written in pure Python. If you need libraries with C extensions, you will have to move to Flexible.
In Standard, you can use proprietary libraries (like google.appengine.ext, as you mentioned) to do tasks like accessing databases, while in Flexible you can use other libraries (like the client you mentioned).
There are also other important differences, like pricing, scaling, etc. The choice will depend, as I said, on the needs of your application.
EDIT
dev_appserver.py is only used when developing in Standard. There is a tutorial here, with Flask. If you are in Flexible, you can test the app locally as if you were running a Python file as usual, like in this other example.
You can use buckets in both Standard and Flexible.
The assumption, based on the SO post you referenced, that cloudstorage is supported only on Python 3 is not correct:
the import appears to be done in a regular Python shell or as a standalone script, not from a standard environment GAE app, which is a different thing; see import cloudstorage, ImportError: No module named google.appengine.api.
it is not specified where that library comes from
GCS is definitely supported in the standard env GAE (i.e. on Python 2); you just need to follow the steps from the official documentation: Setting Up Google Cloud Storage and Reading and Writing to Google Cloud Storage.
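A minimal standard env (Python 2) sketch of that documented read/write flow, with a placeholder bucket and file name:

# Runs inside a standard env GAE app (Python 2), using the
# GoogleAppEngineCloudStorageClient library from the docs above.
import cloudstorage as gcs

filename = "/my-bucket/demo.txt"  # placeholder bucket/object

# Write an object.
with gcs.open(filename, "w", content_type="text/plain") as f:
    f.write("hello from GAE standard\n")

# Read it back.
with gcs.open(filename) as f:
    print(f.read())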
Both are good, but the question is what kind of environment you want: the Standard environment or the Flexible environment.
Find your answer in this document: https://cloud.google.com/appengine/docs/python/
It kind of depends on what you're using it for. If you're doing data science, for example, I'm seeing a few notices of Python libraries that are (finally) dropping support for Python 2; numpy is one of them.
Generally speaking, I would recommend Python 3 over Python 2. Why spend time developing in an aging version when its replacement has matured nicely and is more consistent?
I am investigating whether I can use a library like GHMM with my Python web service, which runs on App Engine.
Short answer: no
https://developers.google.com/appengine/kb/commontasks
What third party libraries can I use in my application?
You can use any pure Python third party libraries in your Google App Engine application. In order to use a third party library, simply include the files in your application's directory, and they will be uploaded with your application when you deploy it to our system. You can import the files as you would any other Python files with your application.
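In practice, on the first-generation standard runtime this is usually done by pip-installing the pure-Python packages into a folder inside the app and registering that folder; a sketch, assuming a ./lib directory populated beforehand:

# appengine_config.py - vendoring sketch for a standard env app.
from google.appengine.ext import vendor

# Add lib/ to the import path so the bundled pure-Python packages
# (installed with: pip install -t lib <package>) become importable.
vendor.add("lib")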
As @gahooa has said, the generic answer is no.
For more popular libraries that have C dependencies, your best option right now is to file a ticket[1], get others to upvote (star) it, and have the App Engine team add it as a supported library.
[1] http://code.google.com/p/googleappengine/issues/entry?template=Feature%20request
In 2021, yes you can.
The Flexible version of App Engine allows you to do this; Standard does not.
If, like me, you cannot justify the full-time running costs of Flexible, an alternative is to host the C library on Cloud Run and make API calls to it. You then pay for both App Engine Standard and Cloud Run, but both are on-demand only.
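The call from the Standard app to that Cloud Run service is then just an HTTP request; a sketch with a placeholder URL and payload shape:

# Sketch: call a C-backed library wrapped behind an HTTP endpoint on
# Cloud Run. The service URL and the /predict payload are placeholders.
import requests

CLOUD_RUN_URL = "https://my-ghmm-service-abc123-uc.a.run.app"

resp = requests.post(
    CLOUD_RUN_URL + "/predict",
    json={"observations": [1, 2, 3]},
    timeout=30,
)
resp.raise_for_status()
print(resp.json())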
Is there a way to access a JET database from Python? I'm on Linux. All I found was a .mdb viewer in the repositories, but it's very faulty. Thanks
MDB Tools is a set of open source libraries and utilities to facilitate exporting data from MS Access databases (mdb files) without using the Microsoft DLLs, so that non-Windows OSes can read the data. Or, to put it another way, they are reverse engineering the layout of the MDB file.
Jackcess is a pure Java library for reading from and writing to MS Access databases. It is part of the OpenHMS project from Health Market Science, Inc. It is not an application; there is no GUI. It's a library, intended for other developers to use to build Java applications.
ACCESSdb is a JavaScript library used to dynamically connect to and query locally available Microsoft Access database files within Internet Explorer.
Both Jackcess and ACCESSdb are much newer than MDB Tools, are more active, and have write support.
Install your distribution's packaged version of mdbtools, use mdb-export to export the Jet data to text files, import the data into a SQLite database, and have a combination of code and data that works in almost any computing environment you might get your hands on.
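A sketch of that pipeline driven from Python, using the mdbtools command-line utilities and the standard library; the file and database names are examples:

# Sketch: export every table from an .mdb with mdbtools and load the
# CSV output into SQLite. Assumes mdbtools is installed on the PATH.
import csv
import sqlite3
import subprocess

MDB = "mydata.mdb"  # example input file

# mdb-tables -1 prints one table name per line.
names = subprocess.check_output(["mdb-tables", "-1", MDB]).decode()
tables = [t for t in names.splitlines() if t]

conn = sqlite3.connect("mydata.sqlite")  # example output database
for table in tables:
    # mdb-export emits CSV with a header row.
    out = subprocess.check_output(["mdb-export", MDB, table]).decode()
    rows = list(csv.reader(out.splitlines()))
    header, data = rows[0], rows[1:]
    cols = ", ".join('"%s"' % c for c in header)
    conn.execute('CREATE TABLE "%s" (%s)' % (table, cols))
    marks = ", ".join("?" * len(header))
    conn.executemany('INSERT INTO "%s" VALUES (%s)' % (table, marks), data)
conn.commit()
conn.close()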
Probably the most simple solution:
Download VirtualBox and install Windows and MS access in it.
Write a small Python server which uses ODBC to access the database and receives commands from a network socket (a sketch follows below).
On Linux, connect to the server in the virtual machine and access the database this way.
This gives you full access to all features. Every other solution will either limit the features you can use (for example, you won't be able to modify the data) or be pretty unsafe.
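For illustration, the Windows-side server might look roughly like this, using pyodbc with a placeholder DSN; it is query-only and, per the safety note above, should only listen on the VirtualBox host-only interface:

# Sketch of the small Windows-side server: read one SQL query per
# connection, run it over ODBC, and send the rows back as JSON.
import json
import socket
import pyodbc

conn = pyodbc.connect("DSN=MyAccessDb")  # placeholder ODBC DSN

srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.bind(("0.0.0.0", 9999))  # in practice, bind the host-only adapter only
srv.listen(1)

while True:
    client, _ = srv.accept()
    sql = client.recv(65536).decode()
    cursor = conn.cursor()
    cursor.execute(sql)
    rows = [[str(v) for v in row] for row in cursor.fetchall()]
    client.sendall(json.dumps(rows).encode())
    client.close()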
If you build the CVS version of mdb-tools, it works rather well; it fixed a lot of issues related to memo field size that I had with the version in the repositories. mdb-tools is basically a dead project, but people have still been occasionally contributing code to the CVS. The build in Ubuntu is from 2004, I think.
CVS instructions here:
http://sourceforge.net/scm/?type=cvs&group_id=2294
If using Ubuntu, before downloading the sources you'll want to enable source repositories and do:
apt-get build-dep mdbtools
That will get the required packages you'll need to manually build the sources from CVS.