Command glossary for dataflow? - python

I'm experimenting with the Dataflow Python SDK and would like some sort of reference as to what the various commands do, their required args and their recommended syntax.
So after
import google.cloud.dataflow as df
Where can I read up on df.Create, df.Write, df.FlatMap, df.CombinePerKey, etc.? Has anybody put together such a reference?
Is there anyplace (link please) where all the possible Apache Beam / Dataflow commands are collected and explained?

There is not yet a pydoc server running for Dataflow Python. However, you can easily run your own in order to browse: https://github.com/GoogleCloudPlatform/DataflowPythonSDK#a-quick-tour-of-the-source-code
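In the meantime, Python's built-in help()/pydoc tooling also works on the installed SDK. A quick sketch, assuming the SDK is already installed locally:

import google.cloud.dataflow as df

# Print the docstring of an individual transform in an interactive session.
help(df.FlatMap)
help(df.CombinePerKey)

Running python -m pydoc -p 8888 in a terminal starts a local pydoc HTTP server at http://localhost:8888 where the same documentation can be browsed module by module.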

Related

Which tools could I use for launching an instance in OpenStack with a Python script?

Basically, I need to write a Python script which takes arguments with argparse and launches a VM instance in OpenStack, and optionally creates a disk and mounts it to the VM.
I've tried to search for similar scripts and found this; generally it should work, but it is quite old. When I looked for Python SDK documentation on the OpenStack website, I found many different clients and Python APIs for those clients, so which one should I use?
Each OpenStack service has its own Python client library, such as python-novaclient, python-cinderclient, and python-glanceclient. They also provide user guides, e.g. How to use cinderclient; have a look and you will find the answer.
Generally, I prefer trying the command line in a terminal first, like cinder create --display-name corey-volume 10 or nova boot --image xxx --block-device source=volume,id=xxx corey-vm, to verify that the command exists and the idea works, and then translate it to Python code. If I don't know how to use a client or get unexpected errors in the script, I go to GitHub to check its source code; it really helps, especially when debugging.
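For illustration, here is a rough Python sketch of those two commands using python-cinderclient and python-novaclient with a shared keystoneauth session; the auth options, image ID and flavor ID are placeholders you would normally take from argparse:

from keystoneauth1 import loading, session
from cinderclient import client as cinder_client
from novaclient import client as nova_client

# Authenticate once and share the session between the service clients.
loader = loading.get_plugin_loader('password')
auth = loader.load_from_options(auth_url='http://controller:5000/v3',
                                username='demo', password='secret',
                                project_name='demo',
                                user_domain_name='Default',
                                project_domain_name='Default')
sess = session.Session(auth=auth)

# Roughly: cinder create --display-name corey-volume 10
cinder = cinder_client.Client('3', session=sess)
volume = cinder.volumes.create(size=10, name='corey-volume')

# Roughly: nova boot corey-vm (an image and a flavor are required by the API)
nova = nova_client.Client('2', session=sess)
server = nova.servers.create(name='corey-vm', image='IMAGE_ID', flavor='FLAVOR_ID')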

Can we clone a terminated EMR cluster using a Lambda function, and will there be any differences in the new cluster?

I have come across answers saying that it is not entirely possible to clone a cluster using Lambda and boto3, and some saying that it is possible only through the AWS CLI. I have also come across the run_job_flow function, but it involves passing all the parameters separately, and I couldn't figure out how to use the terminated cluster along with run_job_flow to get a new clone. If you can, please suggest a way to do this. Thank you.
Option 1: If you have enough permissions in the AWS console, then with a single click you should be able to get a new cluster that is an exact clone of the terminated cluster.
I am attaching the screenshot for your reference.
Option 2: You can also use the AWS CLI export highlighted in the above image so that you can launch the cluster from the command line. Paste the output of the CLI export into a file and run it from a place that has enough access to launch the EMR cluster.
Option 3: You can also write an AWS Lambda function that is responsible for spawning the EMR cluster. You can find multiple examples of this online; a rough boto3 sketch is shown below.
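For option 3, a rough boto3 sketch (the cluster ID is a placeholder, and settings such as bootstrap actions, configurations and security groups would have to be copied over in the same way): read the terminated cluster's configuration and feed the relevant pieces into run_job_flow.

import boto3

emr = boto3.client('emr', region_name='us-east-1')
old_id = 'j-XXXXXXXXXXXXX'  # the terminated cluster

# Pull the old cluster's settings.
old = emr.describe_cluster(ClusterId=old_id)['Cluster']
groups = emr.list_instance_groups(ClusterId=old_id)['InstanceGroups']

# Re-create a cluster with the same shape.
response = emr.run_job_flow(
    Name=old['Name'] + '-clone',
    ReleaseLabel=old['ReleaseLabel'],
    Applications=[{'Name': a['Name']} for a in old['Applications']],
    ServiceRole=old['ServiceRole'],
    JobFlowRole=old['Ec2InstanceAttributes']['IamInstanceProfile'],
    Instances={
        'Ec2SubnetId': old['Ec2InstanceAttributes']['Ec2SubnetId'],
        'KeepJobFlowAliveWhenNoSteps': True,
        'InstanceGroups': [{'InstanceRole': g['InstanceGroupType'],
                            'InstanceType': g['InstanceType'],
                            'InstanceCount': g['RequestedInstanceCount']}
                           for g in groups],
    },
)
print(response['JobFlowId'])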

How to set up modules in a Glue job script

I'm following this tutorial: https://www.youtube.com/watch?v=EzQArFt_On4
In this tutorial only one Python script is used. What if I need to import some functions from another Python script? For example: import script2
I wonder what the correct way to set this up in a Glue job is. I've tried storing the script in an S3 bucket and adding its location under Edit job -> Security configuration, script libraries, and job parameters (optional) -> Python library path, but it gave me the error ModuleNotFoundError: No module named 'script2'. Does anyone know how to fix this? Thanks.
In the video tutorial there is no such import as import script2. So if you do this in your script and don't provide a script2.py library, the import is going to fail with the message you are getting.
How to write modules is best explained in the Python docs.
The best way to start programming Glue jobs is to auto-generate Glue scripts in the Glue console. Then you can use the generated scripts as a starting point for customization. What's more, you can set up Glue development endpoints or even run Glue locally (or on an EC2 instance) for learning and development purposes.
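For reference, a minimal sketch of the module setup, assuming script2.py is uploaded to an S3 path of your choosing and that path is given as the Python library path (equivalently, the --extra-py-files job parameter):

# script2.py -- uploaded to e.g. s3://your-bucket/libs/script2.py (placeholder path)
def clean_name(value):
    return value.strip().lower()

# main Glue job script -- script2 is importable once the library path points at it
import sys
from awsglue.utils import getResolvedOptions
import script2

args = getResolvedOptions(sys.argv, ['JOB_NAME'])
print(script2.clean_name('  Some Value  '))

A common cause of ModuleNotFoundError is pointing the library path at a folder, or at a zip where the module sits inside a subdirectory; point it at the .py file itself, or at a zip that contains the module at its top level.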

Using kubectl rollout restart equivalent with k8s python client

I am trying to develop an AWS Lambda to perform a rollout restart of a deployment using the Python client. I cannot find any implementation in the GitHub repo or references, and using -v with kubectl rollout restart is not giving me enough hints to continue with the development.
Anyway, it is more related to the Python client:
https://github.com/kubernetes-client/python
Any ideas? Perhaps I am missing something.
The Python client interacts directly with the Kubernetes API, similar to what kubectl does. However, kubectl adds some utility commands that contain logic not present in the Kubernetes API itself, and rollout is one of those utilities.
In this case that means you have two approaches. You could reverse engineer the API calls that kubectl rollout restart makes (a Python sketch of this approach is shown below). Pro tip: with Go, you can actually import internal kubectl behaviour and libraries, making this quite easy, so consider writing your Lambda in Go.
Alternatively, you can have your Lambda call the kubectl binary (using the process-exec libraries in Python). However, this does mean you need to include the binary in your Lambda in some way, either by uploading it with your Lambda or by building a Lambda layer containing kubectl.
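For completeness, here is a minimal Python sketch of the first approach (the deployment name and namespace are placeholders): it patches the pod template with the same kubectl.kubernetes.io/restartedAt annotation that kubectl rollout restart sets.

import datetime
from kubernetes import client, config

# Load credentials; inside a cluster use config.load_incluster_config() instead.
config.load_kube_config()
apps = client.AppsV1Api()

# Changing the pod-template annotation triggers a new rollout of the deployment.
now = datetime.datetime.utcnow().replace(microsecond=0).isoformat() + 'Z'
body = {
    'spec': {
        'template': {
            'metadata': {
                'annotations': {'kubectl.kubernetes.io/restartedAt': now}
            }
        }
    }
}
apps.patch_namespaced_deployment(name='my-deployment', namespace='default', body=body)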
@Andre Pires, it can be done like this:
// Strategic-merge patch with the same annotation that kubectl rollout restart sets.
// Note: "strategy" must sit inside "spec" for the patch document to be valid JSON.
data := fmt.Sprintf(`{"spec":{"template":{"metadata":{"annotations":{"kubectl.kubernetes.io/restartedAt":"%s"}}},"strategy":{"type":"RollingUpdate","rollingUpdate":{"maxUnavailable":"%s","maxSurge":"%s"}}}}`, time.Now().String(), "25%", "25%")
newDeployment, err := clientImpl.ClientSet.AppsV1().Deployments(item.Pod.Namespace).Patch(context.Background(), deployment.Name, types.StrategicMergePatchType, []byte(data), metav1.PatchOptions{FieldManager: "kubectl-rollout"})

Import error: No module in AWS Glue job script - Python

I am trying to run my custom Python code, which requires libraries that are not supported by AWS Glue (pandas). So I created a zip file with the necessary libraries and uploaded it to an S3 bucket. While running the job, I pointed to the S3 path in the advanced properties. Still, my job is not running successfully. Can anyone suggest why?
1. Do I have to include my code in the zip file? If yes, then how will Glue understand that it's the code?
2. Also, do I need to create a package, or will just a zip file do?
Appreciate the help!
An update on AWS Glue Jobs released on 22nd Jan 2019.
Introducing Python Shell Jobs in AWS Glue -- Posted On: Jan 22, 2019
Python shell jobs in AWS Glue support scripts that are compatible with Python 2.7 and come pre-loaded with libraries such as Boto3, NumPy, SciPy, pandas, and others. You can run Python shell jobs using 1 DPU (Data Processing Unit) or 0.0625 DPU (which is 1/16 DPU). A single DPU provides processing capacity that consists of 4 vCPUs of compute and 16 GB of memory.
More info at: https://aws.amazon.com/about-aws/whats-new/2019/01/introducing-python-shell-jobs-in-aws-glue/
https://docs.aws.amazon.com/glue/latest/dg/add-job-python.html
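If it helps, a rough boto3 sketch of creating such a Python shell job (job name, role and script location are placeholders):

import boto3

glue = boto3.client('glue', region_name='us-east-1')

# 'pythonshell' (rather than 'glueetl') selects a Python shell job.
glue.create_job(
    Name='my-pandas-job',
    Role='MyGlueServiceRole',
    Command={'Name': 'pythonshell',
             'ScriptLocation': 's3://your-bucket/scripts/job.py'},
    MaxCapacity=0.0625,  # the 1/16 DPU option mentioned above
)
glue.start_job_run(JobName='my-pandas-job')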
According to AWS Glue Documentation:
Only pure Python libraries can be used. Libraries that rely on C extensions, such as the pandas Python Data Analysis Library, are not yet supported.
I think it wouldn't work even if we upload the Python library as a zip file, if the library you are using has a dependency on C extensions. I had tried using pandas, Holidays, etc. the same way you have tried, and on contacting AWS Support, they mentioned it is on their to-do list (support for these Python libraries), but there is no ETA as of now.
So any libraries that are not native Python will not work in AWS Glue at this point, but they should be available in the near future, since this is a popular demand.
If you would still like to try it out, please refer to this link, where it's explained how to package external libraries to run in AWS Glue. I tried it, but it didn't work for me.
As Yuva's answer mentioned, I believe it's currently impossible to import a library that is not purely in Python and the documentation reflects that.
However, in case someone came here looking for an answer on how to import a python library in AWS Glue in general, there is a good explanation in this post on how to do it with the pg8000 library:
AWS Glue - Truncate destination postgres table prior to insert
