Import error: No module in AWS Glue job script - Python

I am trying to run my custom Python code, which requires libraries that AWS Glue does not support (pandas). So I created a zip file with the necessary libraries and uploaded it to an S3 bucket. While running the job, I pointed to the S3 path in the advanced properties, but the job still does not run successfully. Can anyone suggest why?
1. Do I have to include my code in the zip file? If yes, how will Glue know which file is the code?
2. Also, do I need to create a package, or will a plain zip file do?
Appreciate the help!

An update on AWS Glue jobs, released on 22 Jan 2019:
Introducing Python Shell Jobs in AWS Glue -- Posted On: Jan 22, 2019
Python shell jobs in AWS Glue support scripts that are compatible with
Python 2.7 and come pre-loaded with libraries such as the Boto3,
NumPy, SciPy, pandas, and others. You can run Python shell jobs using
1 DPU (Data Processing Unit) or 0.0625 DPU (which is 1/16 DPU). A
single DPU provides processing capacity that consists of 4 vCPUs of
compute and 16 GB of memory.
More info at: https://aws.amazon.com/about-aws/whats-new/2019/01/introducing-python-shell-jobs-in-aws-glue/
https://docs.aws.amazon.com/glue/latest/dg/add-job-python.html
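For reference, a Python shell job is just an ordinary script that Glue executes, so the pre-loaded libraries can be imported directly. A minimal sketch (bucket and key names below are made up) that reads a CSV from S3 with pandas and writes a summary back:

import io

import boto3
import pandas as pd

# Read a CSV object from S3 into a DataFrame (bucket/key are placeholders).
s3 = boto3.client("s3")
obj = s3.get_object(Bucket="my-input-bucket", Key="data/input.csv")
df = pd.read_csv(io.BytesIO(obj["Body"].read()))

# Write a simple summary report back to S3.
summary = df.describe()
s3.put_object(
    Bucket="my-output-bucket",
    Key="reports/summary.csv",
    Body=summary.to_csv().encode("utf-8"),
)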

According to AWS Glue Documentation:
Only pure Python libraries can be used. Libraries that rely on C
extensions, such as the pandas Python Data Analysis Library, are not
yet supported.
I don't think it would work even if we upload the Python library as a zip file, if the library you are using depends on C extensions. I had tried using pandas, holidays, etc. the same way you did, and on contacting AWS Support, they mentioned it is on their to-do list (support for these Python libraries), but there is no ETA as of now.
So, any library that is not pure Python will not work in AWS Glue at this point. But it should be available in the near future, since this is a popular demand.
If you would still like to try it out, please refer to this link, which explains how to package external libraries to run in AWS Glue. I tried it, but it didn't work for me.

As Yuva's answer mentioned, I believe it's currently impossible to import a library that is not pure Python, and the documentation reflects that.
However, in case someone comes here looking for how to import a Python library into AWS Glue in general, this post gives a good explanation of how to do it with the pg8000 library:
AWS Glue - Truncate destination postgres table prior to insert
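To sketch that approach: pg8000 is pure Python, so it can be zipped, uploaded to S3, and referenced in the job's Python library path, then imported from the Glue script like any other module. All connection details below are placeholders:

import pg8000

# Connection details are placeholders; because pg8000 ships as pure Python,
# the zipped package on S3 is enough for Glue to import it.
conn = pg8000.connect(
    host="my-postgres-host",
    port=5432,
    database="mydb",
    user="glue_user",
    password="secret",
)
cur = conn.cursor()
cur.execute("TRUNCATE TABLE destination_table")  # clear the table before inserting
conn.commit()
conn.close()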

Related

Best practices with Google Cloud App Engine: Python 2 or Python 3?

I am new to Google Cloud Platform, and so far I have been working entirely in Python 3. I am trying to find out which version of Python is more complete for Google App Engine: Python 2.7 or Python 3.
As I'm starting to work with Google App Engine, I have realised that continuing to use Python 3 seems too painful, as basic tools like dev_appserver.py are written for Python 2 only. Now I am hitting the opposite problem: the cloudstorage module seems to exist only for Python 3. Again, when I install it, it seems the only way I can test reads/writes to a Google bucket locally is by authenticating with google.appengine.ext, which in turn only works within dev_appserver.py or remotely. This leaves me confused about which environment to choose.
What is the general consensus / what is the focus of Google App Engine: Python 2 or Python 3?
In App Engine, you have two options: the Standard environment and the Flexible environment.
Python 2.7 is available in both Standard and Flexible, while Python 3.6 is only available in Flexible.
Also, the choice between Standard and Flexible depends on what you want to do/what libraries you need:
There are some third-party libraries already built into the Standard environment, and you can include other libraries, but those libraries can't include C extensions; they must be written in pure Python. If you need libraries with C extensions, you will have to move to Flexible.
In Standard, you can use proprietary libraries (like google.appengine.ext, as you mentioned) to do tasks like accessing databases, while in Flexible you can use other libraries (like the client you mentioned).
There are also other important differences, like pricing, scaling, etc. The choice will depend, as I said, on what your application needs.
EDIT
dev_appserver.py is only used when developing in Standard. There is a tutorial here, with Flask. If you are in Flexible, you can test the app locally by running it as you would any ordinary Python file, as in this other example.
You can use buckets in both Standard and Flexible.
The assumption that cloudstorage is Python 3-only, based on the SO post you referenced, is not correct:
the import appears to be done in a regular Python shell or as a standalone script, not from a standard environment GAE app - different things, see import cloudstorage, ImportError: No module named google.appengine.api.
it is not specified where that library comes from
GCS is definitely supported in the standard env GAE (i.e. on Python 2); you just need to follow the steps from the official documentation: Setting Up Google Cloud Storage and Reading and Writing to Google Cloud Storage.
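As a rough sketch of what those docs describe (the bucket name is a placeholder, and this assumes the GoogleAppEngineCloudStorageClient library is available to the standard environment app):

import cloudstorage as gcs

def write_and_read_demo():
    # Bucket name is a placeholder; cloudstorage filenames are "/bucket/object" paths.
    filename = "/my-bucket/demo.txt"

    # Write an object.
    gcs_file = gcs.open(filename, "w", content_type="text/plain")
    gcs_file.write("hello from the standard environment\n")
    gcs_file.close()

    # Read it back.
    gcs_file = gcs.open(filename)
    contents = gcs_file.read()
    gcs_file.close()
    return contents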
Both are good. But the question is what kind of environment you want: the Standard environment or the Flexible environment.
Find your answer in this document: https://cloud.google.com/appengine/docs/python/
It kind of depends on what you're using it for. If you're doing data science, for example, I'm seeing a few notices of Python libraries that are (finally) dropping support for Python 2; NumPy is one that is dropping support.
Generally speaking, I would recommend Python 3 over Python 2. Why spend time developing in an aging version when its replacement has matured nicely and is more consistent?

Command glossary for dataflow?

I'm experimenting with the Dataflow Python SDK and would like some sort of reference as to what the various commands do, their required args and their recommended syntax.
So after
import google.cloud.dataflow as df
Where can I read up on df.Create, df.Write, df.FlatMap, df.CombinePerKey, etc.? Has anybody put together such a reference?
Is there anyplace (link please) where all the possible Apache Beam / Dataflow commands are collected and explained?
There is not yet a pydoc server running for Dataflow Python. However, you can easily run your own in order to browse: https://github.com/GoogleCloudPlatform/DataflowPythonSDK#a-quick-tour-of-the-source-code
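For what it's worth, that SDK has since been folded into Apache Beam, so the transforms you list now live under the apache_beam package. A minimal, self-contained sketch of Create / FlatMap / CombinePerKey (word-count style, run with the default local runner):

import apache_beam as beam

# Each transform below corresponds to one of the df.* names in the question,
# just under the apache_beam package instead of google.cloud.dataflow.
with beam.Pipeline() as p:
    (
        p
        | "Create" >> beam.Create(["the quick brown fox", "the lazy dog"])
        | "Split" >> beam.FlatMap(lambda line: line.split())
        | "PairWithOne" >> beam.Map(lambda word: (word, 1))
        | "CountPerWord" >> beam.CombinePerKey(sum)
        | "Print" >> beam.Map(print)
    )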

How can I reference libraries for ApacheSpark using IPython Notebook only?

I'm currently playing around with the Apache Spark Service in IBM Bluemix. There is a quick start composite application (Boilerplate) consisting of the Spark Service itself, an OpenStack Swift service for the data and an IPython/Jupyter Notebook.
I want to add some 3rd-party libraries to the system and I'm wondering how this could be achieved. Using a Python import statement doesn't really help, since the libraries are then expected to already be located on the Spark worker nodes.
Is there a way of loading Python libraries into Spark from an external source (e.g. a Swift or FTP source) during job runtime?
Thanks a lot!
You cannot add 3rd party libraries at this point in the beta. This will most certainly be coming later in the beta as it's a popular requirement ;-)

Python: Windows registry hive access NOT using registry APIs

I am trying to extract some data out of the Windows registry, both the software hive and ntuser.dat, from XP computers. Currently I'm using reg.exe to load the hive and _winreg to extract the data. I need to use reg.exe because the computers I'm backing up data from are usually offline: I put their hard drives in an external drive bay and load the hives from that drive in another Windows session. It's not feasible to boot up the computers being backed up, as they often have failing hard drives or are otherwise unbootable.
I've seen a utility called hivex, which runs under Linux and combines a C module with a Python wrapper to allow read-only (limited write) access to the Windows registry without using the Windows registry APIs. Sadly, there doesn't appear to be a Windows version of hivex, presumably because no one saw a need to access the Windows registry under Windows by reading the hive files directly.
I'd love to drop the dependency on calling reg.exe via subprocess.Popen(), as calling an external executable has a host of issues, plus it makes the backup utility platform-limited.
Does anyone know of a Python module which allows direct access to the hive files themselves? I already know of, and am currently using, _winreg, so suggesting that would be less than helpful. Thanks in advance.
I'm not sure how much better it is, but the pywin32 library supplies bindings to most of the Windows API. I don't know the Windows API well enough to say whether you can open arbitrary hive files, but it could be worth a quick look (the release contains a CHM with the full API mapping).
Have you had a look at regobj? It provides Pythonic access to registry values (but it is still based on _winreg).
Is your problem with calling an external application, or with using the registry APIs? If it is the former, you can load and unload hives yourself using RegLoadKey / RegUnLoadKey. If it is the latter, I'm sure somebody has written a C library to parse hives directly. A quick Google search gave me Microsoft's Offline Registry Library.
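To illustrate the RegLoadKey / RegUnLoadKey route with pywin32 (hive path, temporary key name and value name below are made up; the process needs to be elevated and hold the SeBackup/SeRestore privileges):

import win32api
import win32con
import _winreg as winreg  # "import winreg" on Python 3

HIVE_FILE = r"E:\WINDOWS\system32\config\software"  # offline hive on the external drive
TEMP_NAME = "OfflineSoftware"                        # temporary mount point under HKLM

# Mount the offline hive under HKEY_LOCAL_MACHINE\OfflineSoftware.
win32api.RegLoadKey(win32con.HKEY_LOCAL_MACHINE, TEMP_NAME, HIVE_FILE)
try:
    key = winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE,
                         TEMP_NAME + r"\Microsoft\Windows NT\CurrentVersion")
    value, _ = winreg.QueryValueEx(key, "ProductName")
    print(value)
    winreg.CloseKey(key)
finally:
    # Always unload, or the hive file stays locked.
    win32api.RegUnLoadKey(win32con.HKEY_LOCAL_MACHINE, TEMP_NAME)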

Accessing a JET (.mdb) database in Python

Is there a way to access a JET database from Python? I'm on Linux. All I found was a .mdb viewer in the repositories, but it's very faulty. Thanks
MDB Tools is a set of open source libraries and utilities to facilitate exporting data from MS Access databases (mdb files) without using the Microsoft DLLs. Thus non-Windows OSes can read the data. Or, to put it another way, they are reverse engineering the layout of the MDB file.
Jackcess is a pure Java library for reading from and writing to MS Access databases. It is part of the OpenHMS project from Health Market Science, Inc. It is not an application. There is no GUI. It's a library, intended for other developers to use to build Java applications.
ACCESSdb is a JavaScript library used to dynamically connect to and query locally available Microsoft Access database files within Internet Explorer.
Both Jackcess and ACCESSdb are much newer than MDB tools, are more active and have write support.
Install your distribution's packaged version of mdbtools, use mdb-export to export the Jet data to text files, import the data into a SQLite database, and you have a combination of code and data that works in almost any computing environment you might get your hands on.
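A rough sketch of that pipeline driven from Python (file, table and database names are hypothetical): export one table with mdb-export, which writes CSV to stdout, then load it into SQLite.

import csv
import sqlite3
import subprocess

MDB_FILE = "legacy.mdb"   # hypothetical Access database
TABLE = "Customers"       # hypothetical table name

# mdb-export prints the table as CSV on stdout.
out = subprocess.check_output(["mdb-export", MDB_FILE, TABLE])
rows = list(csv.reader(out.decode("utf-8").splitlines()))
header, data = rows[0], rows[1:]

conn = sqlite3.connect("legacy.sqlite")
cols = ", ".join('"%s"' % c for c in header)   # quote column names that contain spaces
marks = ", ".join("?" for _ in header)
conn.execute('CREATE TABLE "%s" (%s)' % (TABLE, cols))
conn.executemany('INSERT INTO "%s" VALUES (%s)' % (TABLE, marks), data)
conn.commit()
conn.close()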
Probably the most simple solution:
Download VirtualBox and install Windows and MS Access in it.
Write a small Python server which uses ODBC to access the database and which receives commands from a network socket.
On Linux, connect to the server in the virtual machine and access the database this way.
This gives you full access to all features. Every other solution will either limit the features you can use (for example, you won't be able to modify the data) or be pretty unsafe.
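A minimal sketch of the Windows-side piece only (driver string, path and table name are assumptions; wrapping this in the small socket or HTTP server that the Linux side talks to is left out for brevity):

import pyodbc

# Runs inside the Windows VM; the DSN-less connection string and the .mdb path
# are placeholders for whatever the Access ODBC driver on that machine expects.
conn = pyodbc.connect(
    r"Driver={Microsoft Access Driver (*.mdb)};Dbq=C:\data\legacy.mdb;")
for row in conn.execute("SELECT TOP 5 * FROM Customers"):
    print(row)
conn.close()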
If you build the CVS version of mdbtools, it works rather well. It fixed a lot of issues I had with the version in the repositories related to memo field size. mdbtools is basically a dead project, but people have still occasionally been contributing code via CVS. The build in Ubuntu is from 2004, I think.
CVS instructions here:
http://sourceforge.net/scm/?type=cvs&group_id=2294
If using Ubuntu, before downloading the sources you'll want to enable source repositories and do:
apt-get build-dep mdbtools
That will get the required packages you'll need to manually build the sources from CVS.
