Create or Replace AWS Glue Crawler - python

Using boto3:
Is it possible to check whether an AWS Glue Crawler already exists and create it if it doesn't?
If it already exists I need to update it.
What would the crawler creation script look like?
Would this be similar to CREATE OR REPLACE TABLE in an RDBMS...
Has anyone done this or have any recommendations?
Thank you :)
Michael

As far as I know, there is no single API for this. We manually list the crawlers using list_crawlers and iterate through the list to decide whether to add a crawler (create_crawler) or update it (update_crawler).
Check out the API docs:
https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/glue.html
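A minimal sketch of that approach (the crawler name, role, database, and S3 path below are placeholders, not values from the question):
import boto3

glue = boto3.client('glue')

crawler_name = 'my-crawler'  # placeholder
crawler_config = {
    # settings shared by create_crawler and update_crawler
    'Role': 'AWSGlueServiceRoleDefault',
    'DatabaseName': 'my_database',
    'Targets': {'S3Targets': [{'Path': 's3://my-bucket/some/prefix/'}]},
}

# list_crawlers is paginated; follow NextToken if you have many crawlers
existing = glue.list_crawlers()['CrawlerNames']
if crawler_name in existing:
    glue.update_crawler(Name=crawler_name, **crawler_config)
else:
    glue.create_crawler(Name=crawler_name, **crawler_config)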

Yes, you can do all of that using boto3; however, there is no single function that does it all at once. Instead, you have to make a series of the following API calls:
list_crawlers
get_crawler
update_crawler
create_crawler
Each of these functions returns a response that you need to parse/verify/check manually.
The AWS documentation is pretty good, so definitely check it out. It might seem overwhelming, but at the beginning you might find it easiest to simply copy and paste the request syntax provided in the docs and then strip out the unnecessary parts. For autocompletion/suggestions on top of boto3 there are projects that can help: mypy_boto3_builder and its predecessors mypy_boto3 and boto3_type_annotations.
If something goes wrong, e.g. you haven't specified some parameters correctly, the error responses are pretty good and helpful.
Here is an example of how you can list all existing crawlers:
import boto3
from pprint import pprint

client = boto3.client('glue')

response = client.list_crawlers()
available_crawlers = response["CrawlerNames"]

for crawler_name in available_crawlers:
    response = client.get_crawler(Name=crawler_name)
    pprint(response)
Assuming that in IAM you have a role AWSGlueServiceRoleDefault with all the permissions required by a Glue crawler, here is how you can create one:
response = client.create_crawler(
    Name='my-crawler-via-api',
    Role='AWSGlueServiceRoleDefault',
    Description='Crawler generated with Python API',  # optional
    Targets={
        'S3Targets': [
            {
                'Path': 's3://some/path/in/s3/bucket',
            },
        ],
    },
)

I ended up using standard Python exception handling:
# Instantiate the Glue client.
glue_client = boto3.client(
    'glue',
    region_name='us-east-1'
)

# Attempt to create and start a Glue crawler on the PSV table,
# or update and start it if it already exists.
try:
    glue_client.create_crawler(
        Name='crawler name',
        Role='role to be used by glue to create the crawler',
        DatabaseName='database where the crawler should create the table',
        Targets={
            'S3Targets': [
                {
                    'Path': 'full s3 path to the directory that crawler should process'
                }
            ]
        }
    )
    glue_client.start_crawler(
        Name='crawler name'
    )
except glue_client.exceptions.AlreadyExistsException:
    # The crawler already exists, so update it instead.
    glue_client.update_crawler(
        Name='crawler name',
        Role='role to be used by glue to create the crawler',
        DatabaseName='database where the crawler should create the table',
        Targets={
            'S3Targets': [
                {
                    'Path': 'full s3 path to the directory that crawler should process'
                }
            ]
        }
    )
    glue_client.start_crawler(
        Name='crawler name'
    )

Related

How to create tags on an Azure disk using python?

I want to add or create a new tag on an Azure disk using Python, but I'm not able to. Can anyone please help me with the Python SDK/code for this?
for disk in compute_client.disks.list():
    if disk.as_dict()["name"] == "test_disk_rohit":
        tags = target_disk.tags["DetachedTime"] = datetime.now()
        compute_client.disks.begin_create_or_update(resrc, disk.as_dict()["name"], disk)
This is what I tried in order to add/create a new tag for my Azure disk called "test_disk_rohit".
Can anyone help me with this?
Instead of using begin_create_or_update you can use create_or_update.
With the code snippet below I was able to create/update the tags on the disk:
from azure.common.credentials import ServicePrincipalCredentials
from azure.mgmt.resource import ResourceManagementClient
from azure.mgmt.compute import ComputeManagementClient
from azure.mgmt.compute.models import DiskCreateOption

AZURE_TENANT_ID = '<Tenant ID>'
AZURE_CLIENT_ID = '<Client ID>'
AZURE_CLIENT_SECRET = '<Client Secret>'
AZURE_SUBSCRIPTION_ID = '<Sub_ID>'

credentials = ServicePrincipalCredentials(client_id=AZURE_CLIENT_ID, secret=AZURE_CLIENT_SECRET, tenant=AZURE_TENANT_ID)
resource_client = ResourceManagementClient(credentials, AZURE_SUBSCRIPTION_ID)
compute_client = ComputeManagementClient(credentials, AZURE_SUBSCRIPTION_ID)

Diskdetails = compute_client.disks.create_or_update(
    '<ResourceGroupName>',
    '<Disk Name>',
    {
        'location': 'eastasia',
        'creation_data': {
            'create_option': DiskCreateOption.copy,
            'source_resource_id': '<Source Resource ID>'
        },
        'tags': {
            'tagtest': 'testtagGanesh'
        },
    }
)
disk_resource = Diskdetails.result()

# Get disk details
disk = compute_client.disks.get('<ResourceGroupName>', '<Disk Name>')
print(disk.sku)
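If the disk already exists and you only want to add a tag, a smaller sketch with the same old-style SDK could look like this (it assumes your azure-mgmt-compute version exposes disks.update, which also returns a poller):
from datetime import datetime

# Read the existing disk, extend its tags, and patch only the tags.
disk = compute_client.disks.get('<ResourceGroupName>', 'test_disk_rohit')
tags = disk.tags or {}
tags['DetachedTime'] = datetime.now().isoformat()  # tag values must be strings

poller = compute_client.disks.update(
    '<ResourceGroupName>',
    'test_disk_rohit',
    {'tags': tags}
)
print(poller.result().tags)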

Invalid API key/application pair – Clarifai

I was reading the documentation and wanted to try out the API in PyCharm. So I copied the code, and it told me that I had an "Invalid API key/application pair". I copied my API key straight from the app I made on https://portal.clarifai.com/ and put it in.
My code is literally an exact copy, except that the API key (copied straight from my app) has since been deleted, and I'm running it in PyCharm.
from clarifai_grpc.channel.clarifai_channel import ClarifaiChannel
from clarifai_grpc.grpc.api import resources_pb2, service_pb2, service_pb2_grpc
from clarifai_grpc.grpc.api.status import status_pb2, status_code_pb2

# Construct one of the channels you want to use
channel = ClarifaiChannel.get_json_channel()
channel = ClarifaiChannel.get_insecure_grpc_channel()

# Note: You can also use a secure (encrypted) ClarifaiChannel.get_grpc_channel(), however
# it is currently not possible to use it with the latest gRPC version
stub = service_pb2_grpc.V2Stub(channel)

# This will be used by every Clarifai endpoint call.
metadata = (('authorization', 'Key {9f3d8b8ea01245e6b61c2a1311622db1}'),)

# Insert here the initialization code as outlined on this page:
# https://docs.clarifai.com/api-guide/api-overview/api-clients#client-installation-instructions

post_inputs_response = stub.PostInputs(
    service_pb2.PostInputsRequest(
        inputs=[
            resources_pb2.Input(
                data=resources_pb2.Data(
                    image=resources_pb2.Image(
                        url="https://samples.clarifai.com/metro-north.jpg",
                        allow_duplicate_url=True
                    )
                )
            )
        ]
    ),
    metadata=metadata
)

if post_inputs_response.status.code != status_code_pb2.SUCCESS:
    raise Exception("Post inputs failed, status: " + post_inputs_response.status.description)
Philip, your metadata declaration is not correct. You don't put your API key within {}.
# This will be used by every Clarifai endpoint call.
metadata = (('authorization', 'Key {9f3d8b8ea01245e6b61c2a1311622db1}'),)
The curly braces around the key are extra; please adjust it to this:
metadata = (('authorization', 'Key 9f3d8b8ea01245e6b61c2a1311622db1'),)
Could you also please try generating a new API key and using that?

Specify job type when creating glue job with boto3

I'm trying to create a Glue ETL job with boto3, using the script below. I want to create it as type=Spark, but the script below creates type=Python Shell. It also doesn't disable bookmarks. Does anyone know what I need to add to make it type Spark and disable bookmarks?
script:
response = glue_assumed_client.create_job(
    Name='mlxxxx',
    Role='Awsxxxx',
    Command={
        'Name': 'mlxxxx',
        'ScriptLocation': 's3://aws-glue-scripts-xxxxx-us-west-2/xxxx',
        'PythonVersion': '3'
    },
    Connections={
        'Connections': [
            'sxxxx',
            'spxxxxxx',
        ]
    },
    Timeout=2880,
    MaxCapacity=10
)
To create Spark jobs, you have to specify the command name as 'glueetl', as shown below. If you are not running a Python shell job, you need not specify the Python version in the Command parameters.
response = client.create_job(
    Name='mlxxxyu',
    Role='Awsxxxx',
    Command={
        'Name': 'glueetl',  # <-- set the name to glueetl to create a Spark job
        'ScriptLocation': 's3://aws-glue-scripts-xxxxx-us-west-2/xxxx'
    },
    Connections={
        'Connections': [
            'sxxxx',
            'spxxxxxx',
        ]
    },
    Timeout=2880,
    MaxCapacity=10
)
Regarding job bookmarks: they are disabled by default, so if you don't specify a job bookmark parameter, the created job will have bookmarks disabled.
If you want to explicitly disable bookmarks, you can specify that in the DefaultArguments, as shown below.
response = client.create_job(
    Name='mlxxxyu',
    Role='Awsxxxx',
    Command={
        'Name': 'glueetl',
        'ScriptLocation': 's3://aws-glue-scripts-xxxxx-us-west-2/xxxx'
    },
    DefaultArguments={
        '--job-bookmark-option': 'job-bookmark-disable'
    },
    Timeout=2880,
    MaxCapacity=10
)
See the documentation.
Command (dict) -- [REQUIRED] The JobCommand that executes this job.
Name (string) -- The name of the job command. For an Apache Spark ETL job, this must be glueetl. For a Python shell job, it must be pythonshell.
You may reset the bookmark by using the function
client.reset_job_bookmark(
    JobName='string',
    RunId='string'
)
where JobName is required. It can be obtained from response['Name'] of the create_job() call.
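You can also override the bookmark option for a single run, since arguments passed to start_job_run take precedence over the job's DefaultArguments for that run (a sketch, reusing the hypothetical job name from above):
response = client.start_job_run(
    JobName='mlxxxyu',
    Arguments={
        '--job-bookmark-option': 'job-bookmark-disable'  # overrides DefaultArguments for this run only
    }
)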

How to create a terraform backend.tf file from python before execution to eliminate the interpolation state file issue

We built a web app from which we pass variables to Terraform like below:
terraform apply -input=false -auto-approve -var ami="%ami%" -var region="%region%" -var icount="%count%" -var type="%instance_type%"
The problem is that the backend configuration does not support variables, and I need to pass those values from the app as well.
To resolve this I found a suggestion that we need to create backend.tf before execution.
But I can't figure out how to do it; if anyone has any examples, please help me.
Thanks in advance.
I need to create the backend.tf file from Python using the variables below,
and I need to substitute key="${profile}/tfstate
for each profile.
I am thinking of using a git repo: with git we create the files, pull the values, commit again, and execute.
Please help me with some examples and ideas.
My code is like below.
main.tf:
terraform {
  backend "s3" {
    bucket  = "terraform-007"
    key     = "key"
    region  = "ap-south-1"
    profile = "venu"
  }
}

provider "aws" {
  profile = "${var.aws_profile}"
  region  = "${var.aws_region}"
}

resource "aws_instance" "VM" {
  count         = var.icount
  ami           = var.ami
  instance_type = var.type
  tags = {
    Environment = "${var.env_indicator}"
  }
}
vars.tf:
variable "aws_profile" {
  default     = "default"
  description = "AWS profile name, as set in ~/.aws/credentials"
}

variable "aws_region" {
  type        = "string"
  default     = "ap-south-1"
  description = "AWS region in which to create resources"
}

variable "env_indicator" {
  type        = "string"
  default     = "dev"
  description = "What environment are we in?"
}

variable "icount" {
  default = 1
}

variable "ami" {
  default = "ami-54d2a63b"
}

variable "bucket" {
  default = "terraform-002"
}

variable "type" {
  default = "t2.micro"
}
output.tf:
output "ec2_public_ip" {
  value = ["${aws_instance.VM.*.public_ip}"]
}

output "ec2_private_ip" {
  value = ["${aws_instance.VM.*.private_ip}"]
}
Since the configuration for the backend cannot use interpolation, we have used a configuration by convention approach.
The terraform for all of our state collections (microservices and other infrastructure) use the same S3 bucket for state storage and the same DynamoDB table for locking.
When executing terraform, we use the same IAM role (a dedicated terraform only user).
We define the key for the state via convention, so that it does not need to be generated.
key = "platform/services/{name-of-service}/terraform.tfstate"
I would avoid a process that changes the infrastructure code as it is being deployed, to ensure maximum understandability for the engineers reading/maintaining the code.
EDIT: Adding key examples
For the user service:
key = "platform/services/users/terraform.tfstate"
For the search service:
key = "platform/services/search/terraform.tfstate"
For the product service:
key = "platform/services/products/terraform.tfstate"

how to fetch firebase data?

I am new to Python and Firebase and I am trying to flatten my Firebase database.
I have a database in this format, where each cat has thousands of records in it.
All I want is to fetch the cat names and put them in an array, for example ['cat1', 'cat2', ...].
I was using this tutorial
http://ozgur.github.io/python-firebase/
from firebase import firebase
firebase = firebase.FirebaseApplication('https://your_storage.firebaseio.com', None)
result = firebase.get('/Data', None)
The problem with the above code is that it will attempt to fetch all the data under Data. How can I fetch only the "cats"?
If you want to get the values inside the cats as columns, try using Pyrebase. Install it with pip install pyrebase at a cmd/Anaconda prompt (the latter is preferred if you didn't add pip or Python to your environment paths). After installing:
import pyrebase

config = {
    "apiKey": "your_api_key",
    "authDomain": "your_auth_domain",
    "databaseURL": "your_database_url",
    "storageBucket": "your_storage_bucket",
    "serviceAccount": "your_service_account"
}
Note: you can find all the information above in your Firebase console:
https://console.firebase.google.com/project/ >>> your project >>> click on the "</>" icon labelled "Add Firebase to your web app".
Back to the code: make a neat function definition so you can store it in a .py file:
def connect_firebase():
    # add a way to encrypt these; I'm a starter myself and don't know how
    username = "usernameyoucreatedatfirebase"
    password = "passwordforaboveuser"

    firebase = pyrebase.initialize_app(config)
    auth = firebase.auth()

    # authenticate a user > figure out how not to leave this hardcoded
    user = auth.sign_in_with_email_and_password(username, password)

    # user['idToken']
    # Per pyrebase's docs the token expires every hour, so it needs to be refreshed
    user = auth.refresh(user['refreshToken'])

    # set up the database handle
    db = firebase.database()
    return db
OK, now save this into a neat .py file.
Next, in your new notebook or main .py, import this new file, which we'll call auth.py from now on:
from auth import *

import pandas as pd

# assign the connection to a variable
db = connect_firebase()

# and now the hard/easy part that took me a while to figure out:
# notice the value inside .child(); it should be the parent node holding all the cat keys
values = db.child('cats').get()

# to load everything into a dataframe you'll need to use .val()
data = pd.DataFrame(values.val())
And that's it; print(data.head()) to check whether the values/columns are where you expect them to be.
Firebase Realtime Database is one big JSON tree:
when you fetch data at a location in your database, you also retrieve
all of its child nodes.
The best practice is to denormalize your data, creating multiple locations (nodes) for the same data:
Many times you can denormalize the data by using a query to retrieve a
subset of the data
In your case, you may create a second node named "categories" where you list "only" the category names.
/cat1
    /...
/cat2
    /...
/cat3
    /...
/cat4
    /...
/categories
    /cat1
    /cat2
    /cat3
    /cat4
In this scenario you can use the update() method to write to more than one location at the same time.
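A sketch of that fan-out write with Pyrebase (assuming the db handle from the first answer; the cat5 data is purely hypothetical): since the underlying REST API accepts path-style keys in a single PATCH, one update can touch both locations.
# Write the cat's data and its index entry under /categories in one update.
db.update({
    "cats/cat5/name": "cat5",   # hypothetical data under the new cat node
    "categories/cat5": True     # index entry that stores only the key
})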
I was exploring the Pyrebase documentation. According to it, we can extract only the keys at some path.
To return just the keys at a particular path use the shallow() method.
all_user_ids = db.child("users").shallow().get()
In your case, it'll be something like:
firebase = pyrebase.initialize_app(config)
db = firebase.database()
allCats = db.child("data").shallow().get()
Let me know if it didn't help.
