Azure Data Factory Python SDK - Convert PipelineResource to JSON

Azure Data Factory Python SDK - Convert PipelineResource to JSON - python

I am creating Azure Data Factory pipeline using Python SDK (azure.mgmt.datafactory.models.PipelineResource). I need to convert PipelineResource object to JSON file. Is it possible anyhow?
I tried json.loads(pipeline_object) , json.dumps(pipeline_object) but no luck.

I need to convert PipelineResource object to JSON file. Is it possible anyhow?
You can try the following code snippet as suggested by mccoyp:
You can add a default argument to json.dumps to make objects that are not JSON serializable into dict
import json
from azure.mgmt.datafactory.models import Activity, PipelineResource
activity = Activity(name="activity-name")
resource = PipelineResource(activities=[activity])
json_dict = json.dumps(resource, default=lambda obj: obj.__dict__)
print(json_dict)

you can use this.
# Create a copy activity
act_name = 'copyBlobtoBlob'
blob_source = BlobSource()
blob_sink = BlobSink()
dsin_ref = DatasetReference(reference_name=ds_name)
dsOut_ref = DatasetReference(reference_name=dsOut_name)
copy_activity = CopyActivity(name=act_name,inputs=[dsin_ref], outputs=[dsOut_ref], source=blob_source, sink=blob_sink)
#Create a pipeline with the copy activity
#Note1: To pass parameters to the pipeline, add them to the json string params_for_pipeline shown below in the format { “ParameterName1” : “ParameterValue1” } for each of the parameters needed in the pipeline.
#Note2: To pass parameters to a dataflow, create a pipeline parameter to hold the parameter name/value, and then consume the pipeline parameter in the dataflow parameter in the format #pipeline().parameters.parametername.
p_name = 'copyPipeline'
params_for_pipeline = {}
p_name = 'copyPipeline'
params_for_pipeline = {}
p_obj = PipelineResource(activities=[copy_activity], parameters=params_for_pipeline)
p = adf_client.pipelines.create_or_update(rg_name, df_name, p_name, p_obj)
print_item(p)

Related

Python boto3, list contents of specific dir in bucket, limit depth

This is the same as this question, but I also want to limit the depth returned.
Currently, all answers return all the objects after the specified prefix. I want to see just what's in the current hierarchy level.
Current code that returns everything:
self._session = boto3.Session(
aws_access_key_id=aws_access_key_id,
aws_secret_access_key=aws_secret_access_key,
)
self._session.resource("s3")
bucket = self._s3.Bucket(bucket_name)
detections_contents = bucket.objects.filter(Prefix=prefix)
for object_summary in detections_contents:
print(object_summary.key)
How to see only the files and folders directly under prefix? How to go n levels deep?
I can parse everything locally, and this is clearly not what I am looking for here.

There is no definite way to do this using list objects without getting all the objects in the dir.
But there is a way using s3 select which uses sql query like format to get n levels deep to get the file content as well as to get object keys.
If you are fine with writing sql then use this.
reference doc
import boto3
import json
s3 = boto3.client('s3')
bucket_name = 'my-bucket'
prefix = 'my-directory/subdirectory/'
input_serialization = {
'CompressionType': 'NONE',
'JSON': {
'Type': 'LINES'
}
}
output_serialization = {
'JSON': {}
}
# Set the SQL expression to select the key field for all objects in the subdirectory
expression = 'SELECT s.key FROM S3Object s WHERE s.key LIKE \'' + prefix + '%\''
response = s3.select_object_content(
Bucket=bucket_name,
ExpressionType='SQL',
Expression=expression,
InputSerialization=input_serialization,
OutputSerialization=output_serialization
)
# The response will contain a Payload field with the selected data
payload = response['Payload']
for event in payload:
if 'Records' in event:
records = event['Records']['Payload']
data = json.loads(records.decode('utf-8'))
# The data will be a list of objects, each with a "key" field representing the file name
for item in data:
print(item['key'])

There is not built in way with the Boto3 or S3 APIs to do this. You'll need some version of processing each level and asking in turn for a list of objects at that level:
import boto3
s3 = boto3.client('s3')
max_depth = 2
paginator = s3.get_paginator('list_objects_v2')
# Track all prefixes to show with a list
common_prefixes = [(0, "")]
while len(common_prefixes) > 0:
# Pull out the next prefix to show
current_depth, current_prefix = common_prefixes.pop(0)
# Loop through all of the items using a paginator to handle common prefixes with more
# than a thousand items
for page in paginator.paginate(Bucket=bucket_name, Prefix=current_prefix, Delimiter='/'):
for cur in page.get("CommonPrefixes", []):
# Show each common prefix, here just use a format like AWS CLI does
print(" " * 27 + f"PRE {cur['Prefix']}")
if current_depth < max_depth:
# This is below the max depth we want to show, so
# add it to the list to be shown
common_prefixes.append((current_depth + 1, cur['Prefix']))
for cur in page.get("Contents", []):
# Show each item sharing this common prefix using a format like the AWS CLI
print(f"{cur['LastModified'].strftime('%Y-%m-%d %H:%M:%S')}{cur['Size']:11d} {cur['Key']}")

How to convert Hive schema to Bigquery schema using Python?

What i get from api:
"name":"reports"
"col_type":"array<struct<imageUrl:string,reportedBy:string>>"
So in hive schema I got:
reports array<struct<imageUrl:string,reportedBy:string>>
Note: I got hive array schema as string from api
My target:
bigquery.SchemaField("reports", "RECORD", mode="NULLABLE",
fields=(
bigquery.SchemaField('imageUrl', 'STRING'),
bigquery.SchemaField('reportedBy', 'STRING')
)
)
Note: I would like to create universal code that can handle when i receive any number of struct inside of the array.
Any tips are welcome.

I tried creating a script that parses your input which is reports array<struct<imageUrl:string,reportedBy:string>>. This converts your input to a dictionary that could be used as schema when creating a table. The main idea of the apporach is instead of using SchemaField(), you can create a dictionary which is much easier than creating SchemaField() objects with parameters using your example input.
NOTE: The script is only tested based on your input and it can parse more fields if added in struct<.
import re
from google.cloud import bigquery
def is_even(number):
if (number % 2) == 0:
return True
else:
return False
def clean_string(str_value):
return re.sub(r'[\W_]+', '', str_value)
def convert_to_bqdict(api_string):
"""
This only works for a struct with multiple fields
This could give you an idea on constructing a schema dict for BigQuery
"""
num_even = True
main_dict = {}
struct_dict = {}
field_arr = []
schema_arr = []
# Hard coded this since not sure what the string will look like if there are more inputs
init_struct = sample.split(' ')
main_dict["name"] = init_struct[0]
main_dict["type"] = "RECORD"
main_dict["mode"] = "NULLABLE"
cont_struct = init_struct[1].split('<')
num_elem = len(cont_struct)
# parse fields inside of struct<
for i in range(0,num_elem):
num_even = is_even(i)
# fields are seen on even indices
if num_even and i != 0:
temp = list(filter(None,cont_struct[i].split(','))) # remove blank elements
for elem in temp:
fields = list(filter(None,elem.split(':')))
struct_dict["name"] = clean_string(fields[0])
# "type" works for STRING as of the moment refer to
# https://cloud.google.com/bigquery/docs/schemas#standard_sql_data_types
# for the accepted data types
struct_dict["type"] = clean_string(fields[1]).upper()
struct_dict["mode"] = "NULLABLE"
field_arr.append(struct_dict)
struct_dict = {}
main_dict["fields"] = field_arr # assign dict to array of fields
schema_arr.append(main_dict)
return schema_arr
sample = "reports array<struct<imageUrl:string,reportedBy:string,newfield:bool>>"
bq_dict = convert_to_bqdict(sample)
client = bigquery.Client()
project = client.project
dataset_ref = bigquery.DatasetReference(project, '20211228')
table_ref = dataset_ref.table("20220203")
table = bigquery.Table(table_ref, schema=bq_dict)
table = client.create_table(table)
Output:

Failed to create for each activity in azure data factory using python

I am trying to create azure data factory pipelines and resources using python. I was successful with certain ADF activities like Lookup, Copy .. but the problem I am facing here is I am trying to copy few tables from SQL to blob using FOR Each activity and it is throwing the below error
How would you create activities inside for each activity? Any inputs is greatly appreciated. Thanks!
Ref: https://learn.microsoft.com/en-us/python/api/azure-mgmt-datafactory/azure.mgmt.datafactory.models.foreachactivity?view=azure-python
Error Message
TypeError: 'CopyActivity' object is not iterable
Code Block
## Lookup Activity
ls_sql_name = 'ls_'+project_name+'_'+src_svr_type+'_dev'
linked_service_name =LinkedServiceReference(reference_name=ls_sql_name)
lkp_act_name ='Get Table Names'
sql_reader_query = "SELECT top 3 name from sys.tables where name like '%dim'"
source = SqlSource(sql_reader_query= sql_reader_query)
dataset= {"referenceName": "ds_sql_Dim_input","type": "DatasetReference"}
LookupActivity_ = LookupActivity(name =lkp_act_name, linked_service_name= linked_service_name, source = source, dataset = dataset
,first_row_only =False)
#create copy activity
ds_name = 'ds_sql_dim_input' #these datasets already created
dsOut_name ='ds_blob_dim_output' #these datasets already created
copy_act_name = 'Copy SQL to Blob(parquet)'
sql_reader_query = {"value": "#item().name","type": "Expression"}
sql_source = SqlSource(sql_reader_query=sql_reader_query)
blob_sink = ParquetSink()
dsin_ref = DatasetReference(reference_name=ds_name)
dsOut_ref = DatasetReference(reference_name=dsOut_name)
copy_activity = CopyActivity(name=copy_act_name,inputs=[dsin_ref], outputs=[dsOut_ref], source=sql_source, sink=blob_sink)
## For Each Activity
pl_name ='pl_Test'
items= {"value": "#activity('Get Table Names').output.value","type": "Expression"}
dependsOn = [{"activity": "Get Table Names","dependencyConditions": ["Succeeded"]}]
ForEachActivity_= ForEachActivity(name = 'Copy tables in loop',items=items,depends_on=dependsOn ,activities =copy_activity)
params_for_pipeline = {}
p_obj = PipelineResource(activities=[LookupActivity_,ForEachActivity_], parameters=params_for_pipeline)
p = adf_client.pipelines.create_or_update(rg_name, df_name, pl_name, p_obj)

Activities needs to be a list of Activity, and you are passing a single one. Try creating a list and add the copy activity to it, and the pass that list in the activities parameter.

Clone Kubernetes objects programmatically using the Python API

The Python API is available to read objects from a cluster. By cloning we can say:
Get a copy of an existing Kubernetes object using kubectl get
Change the properties of the object
Apply the new object
Until recently, the option to --export api was deprecated in 1.14. How can we use the Python Kubernetes API to do the steps from 1-3 described above?
There are multiple questions about how to extract the code from Python API to YAML, but it's unclear how to transform the Kubernetes API object.

Just use to_dict() which is now offered by Kubernetes Client objects. Note that it creates a partly deep copy. So to be safe:
copied_obj = copy.deepcopy(obj.to_dict())
Dicts can be passed to create* and patch* methods.
For convenience, you can also wrap the dict in Prodict.
copied_obj = Prodict.from_dict(copy.deepcopy(obj.to_dict()))
The final issue is getting rid of superfluous fields. (Unfortunately, Kubernetes sprinkles them throughout the object.) I use kopf's internal facility for getting the "essence" of an object. (It takes care of the deep copy.)
copied_obj = kopf.AnnotationsDiffBaseStorage().build(body=kopf.Body(obj.to_dict()))
copied_obj = Prodic.from_dict(copied_obj)

After looking at the requirement, I spent a couple of hours researching the Kubernetes Python API. Issue 340 and others ask about how to transform the Kubernetes API object into a dict, but the only workaround I found was to retrieve the raw data and then convert to JSON.
The following code uses the Kubernetes API to get a deployment and its related hpa from the namespaced objects, but retrieving their raw values as JSON.
Then, after transforming the data into a dict, you can alternatively clean up the data by removing null references.
Once you are done, you can transform the dict as YAML payload to then save the YAML to the file system
Finally, you can apply either using kubectl or the Kubernetes Python API.
Note:
Make sure to set KUBECONFIG=config so that you can point to a cluster
Make sure to adjust the values of origin_obj_name = "istio-ingressgateway" and origin_obj_namespace = "istio-system" with the name of the corresponding objects to be cloned in the given namespace.
import os
import logging
import yaml
import json
logging.basicConfig(level = logging.INFO)
import crayons
from kubernetes import client, config
from kubernetes.client.rest import ApiException
LOGGER = logging.getLogger(" IngressGatewayCreator ")
class IngressGatewayCreator:
#staticmethod
def clone_default_ingress(clone_context):
# Clone the deployment
IngressGatewayCreator.clone_deployment_object(clone_context)
# Clone the deployment's HPA
IngressGatewayCreator.clone_hpa_object(clone_context)
#staticmethod
def clone_deployment_object(clone_context):
kubeconfig = os.getenv('KUBECONFIG')
config.load_kube_config(kubeconfig)
v1apps = client.AppsV1beta1Api()
deployment_name = clone_context.origin_obj_name
namespace = clone_context.origin_obj_namespace
try:
# gets an instance of the api without deserialization to model
# https://github.com/kubernetes-client/python/issues/574#issuecomment-405400414
deployment = v1apps.read_namespaced_deployment(deployment_name, namespace, _preload_content=False)
except ApiException as error:
if error.status == 404:
LOGGER.info("Deployment %s not found in namespace %s", deployment_name, namespace)
return
raise
# Clone the object deployment as a dic
cloned_dict = IngressGatewayCreator.clone_k8s_object(deployment, clone_context)
# Change additional objects
cloned_dict["spec"]["selector"]["matchLabels"]["istio"] = clone_context.name
cloned_dict["spec"]["template"]["metadata"]["labels"]["istio"] = clone_context.name
# Save the deployment template in the output dir
context.save_clone_as_yaml(cloned_dict, "deployment")
#staticmethod
def clone_hpa_object(clone_context):
kubeconfig = os.getenv('KUBECONFIG')
config.load_kube_config(kubeconfig)
hpas = client.AutoscalingV1Api()
hpa_name = clone_context.origin_obj_name
namespace = clone_context.origin_obj_namespace
try:
# gets an instance of the api without deserialization to model
# https://github.com/kubernetes-client/python/issues/574#issuecomment-405400414
hpa = hpas.read_namespaced_horizontal_pod_autoscaler(hpa_name, namespace, _preload_content=False)
except ApiException as error:
if error.status == 404:
LOGGER.info("HPA %s not found in namespace %s", hpa_name, namespace)
return
raise
# Clone the object deployment as a dic
cloned_dict = IngressGatewayCreator.clone_k8s_object(hpa, clone_context)
# Change additional objects
cloned_dict["spec"]["scaleTargetRef"]["name"] = clone_context.name
# Save the deployment template in the output dir
context.save_clone_as_yaml(cloned_dict, "hpa")
#staticmethod
def clone_k8s_object(k8s_object, clone_context):
# Manipilate in the dict level, not k8s api, but from the fetched raw object
# https://github.com/kubernetes-client/python/issues/574#issuecomment-405400414
cloned_obj = json.loads(k8s_object.data)
labels = cloned_obj['metadata']['labels']
labels['istio'] = clone_context.name
cloned_obj['status'] = None
# Scrub by removing the "null" and "None" values
cloned_obj = IngressGatewayCreator.scrub_dict(cloned_obj)
# Patch the metadata with the name and labels adjusted
cloned_obj['metadata'] = {
"name": clone_context.name,
"namespace": clone_context.origin_obj_namespace,
"labels": labels
}
return cloned_obj
# https://stackoverflow.com/questions/12118695/efficient-way-to-remove-keys-with-empty-strings-from-a-dict/59959570#59959570
#staticmethod
def scrub_dict(d):
new_dict = {}
for k, v in d.items():
if isinstance(v, dict):
v = IngressGatewayCreator.scrub_dict(v)
if isinstance(v, list):
v = IngressGatewayCreator.scrub_list(v)
if not v in (u'', None, {}):
new_dict[k] = v
return new_dict
# https://stackoverflow.com/questions/12118695/efficient-way-to-remove-keys-with-empty-strings-from-a-dict/59959570#59959570
#staticmethod
def scrub_list(d):
scrubbed_list = []
for i in d:
if isinstance(i, dict):
i = IngressGatewayCreator.scrub_dict(i)
scrubbed_list.append(i)
return scrubbed_list
class IngressGatewayContext:
def __init__(self, manifest_dir, name, hostname, nats, type):
self.manifest_dir = manifest_dir
self.name = name
self.hostname = hostname
self.nats = nats
self.ingress_type = type
self.origin_obj_name = "istio-ingressgateway"
self.origin_obj_namespace = "istio-system"
def save_clone_as_yaml(self, k8s_object, kind):
try:
# Just try to create if it doesn't exist
os.makedirs(self.manifest_dir)
except FileExistsError:
LOGGER.debug("Dir already exists %s", self.manifest_dir)
full_file_path = os.path.join(self.manifest_dir, self.name + '-' + kind + '.yaml')
# Store in the file-system with the name provided
# https://stackoverflow.com/questions/12470665/how-can-i-write-data-in-yaml-format-in-a-file/18210750#18210750
with open(full_file_path, 'w') as yaml_file:
yaml.dump(k8s_object, yaml_file, default_flow_style=False)
LOGGER.info(crayons.yellow("Saved %s '%s' at %s: \n%s"), kind, self.name, full_file_path, k8s_object)
try:
k8s_clone_name = "http2-ingressgateway"
hostname = "my-nlb-awesome.a.company.com"
nats = ["123.345.678.11", "333.444.222.111", "33.221.444.23"]
manifest_dir = "out/clones"
context = IngressGatewayContext(manifest_dir, k8s_clone_name, hostname, nats, "nlb")
IngressGatewayCreator.clone_default_ingress(context)
except Exception as err:
print("ERROR: {}".format(err))

Not python, but I've used jq in the past to quickly clone something with the small customisations required for each use case (usually cloning secrets into a new namespace).
kc get pod whatever-85pmk -o json \
| jq 'del(.status, .metadata ) | .metadata.name="newname"' \
| kc apply -f - -o yaml --dry-run

This is really easy to do with Hikaru.
Here is an example from my own open source project:
def duplicate_without_fields(obj: HikaruBase, omitted_fields: List[str]):
"""
Duplicate a hikaru object, omitting the specified fields
This is useful when you want to compare two versions of an object and first "cleanup" fields that shouldn't be
compared.
:param HikaruBase obj: A kubernetes object
:param List[str] omitted_fields: List of fields to be omitted. Field name format should be '.' separated
For example: ["status", "metadata.generation"]
"""
if obj is None:
return None
duplication = obj.dup()
for field_name in omitted_fields:
field_parts = field_name.split(".")
try:
if len(field_parts) > 1:
parent_obj = duplication.object_at_path(field_parts[:-1])
else:
parent_obj = duplication
setattr(parent_obj, field_parts[-1], None)
except Exception:
pass # in case the field doesn't exist on this object
return duplication
Dumping the object to yaml afterwards or re-applying it to the cluster is trivial with Hikaru
We're using this to clean up objects so that can show users a github-style diff when objects change, without spammy fields that change often like generation

JSON Serialization of Custom Objects (Encoding and Decoding)

I've been searching the internet and couldn't find a simple example to encode and decode a custom object using JSON in python.
Let's say I have the following class:
class Test:
def __init__(self, name=None, grade=None):
self.name = name
self.grade = grade
and also have a list of Test objects:
t1 = Test("course1", 80)
t2 = Test("course2", 90)
list_of_tests = [t1, t2]
How can I serialize the class Test and the object list_of_tests
using JSON? I want to be able to write it to a file and read it from a file, using python.

To be honest the easiest thing to do here is to manually create a list of dictionaries from your objects. Then you can pass that directly to the JSON functions.
data = [{'name': x.name, 'grade': x.grade} for x in list_of_tests]
with open('output.json', 'w') as out:
json.dump(data, out)
and read it back:
with open('output.json') as inp:
data = json.load(inp)
list_of_tests = [Test(x['name'], x['grade']) for x in data]

You can control how an unrecognised object is serialised by dumps(default=converter_function). For it to be valid JSON you'd have to return a plain dict with the fields you want plus some tag field identifying that it is to be treated specially by loads.
Then have another converter function to reverse the process passed to loads() as object_hook.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Azure Data Factory Python SDK - Convert PipelineResource to JSON - python

I am creating Azure Data Factory pipeline using Python SDK (azure.mgmt.datafactory.models.PipelineResource). I need to convert PipelineResource object to JSON file. Is it possible anyhow? I tried json.loads(pipeline_object) , json.dumps(pipeline_object) but no luck.

Related

Python boto3, list contents of specific dir in bucket, limit depth

How to convert Hive schema to Bigquery schema using Python?

Failed to create for each activity in azure data factory using python

Clone Kubernetes objects programmatically using the Python API

JSON Serialization of Custom Objects (Encoding and Decoding)

Categories

Resources