Great Expectation with Azure and Databricks

Great Expectation with Azure and Databricks - python

I want to run great_expectation test suites against csv files in my ADLS Gen2. On my ADLS, I have a container called "input" in which I have a file at input/GE/ind.csv. I use a InferredAssetAzureDataConnector. I was able to create and test/validate the data source configuration. But when i validate my data I'm getting below error.
import datetime
import pandas as pd
from ruamel import yaml
from great_expectations.core.batch import RuntimeBatchRequest
from great_expectations.data_context import BaseDataContext
from great_expectations.data_context.types.base import (
DataContextConfig,
FilesystemStoreBackendDefaults,
)
from ruamel import yaml
import great_expectations as ge
from great_expectations.core.batch import Batch, BatchRequest
#Root Directory
root_directory = "/dbfs/FileStore/great_expectation_official/"
#Data Context
data_context_config = DataContextConfig(
store_backend_defaults=FilesystemStoreBackendDefaults(
root_directory=root_directory
),
)
context = BaseDataContext(project_config=data_context_config)
#Configure your Datasource
datasource_config = {
"name": "my_azure_datasource",
"class_name": "Datasource",
"execution_engine": {
"class_name": "SparkDFExecutionEngine",
"azure_options": {
"account_url": "https://<account_Name>.blob.core.windows.net",
"credential": "ADLS_key",
},
},
"data_connectors": {
"default_inferred_data_connector_name": {
"class_name": "InferredAssetAzureDataConnector",
"azure_options": {
"account_url": "https://<account_Name>.blob.core.windows.net",
"credential": "ADLS_key",
},
"container": "input",
"name_starts_with": "/GE/",
"default_regex": {
"pattern": "(.*)\\.csv",
"group_names": ["data_asset_name"],
},
},
},
}
context.test_yaml_config(yaml.dump(datasource_config))
context.add_datasource(**datasource_config)
batch_request = BatchRequest(
datasource_name="my_azure_datasource",
data_connector_name="default_inferred_data_connector_name",
data_asset_name="data_asset_name",
batch_spec_passthrough={"reader_method": "csv", "reader_options": {"header": True}},
)
context.create_expectation_suite(
expectation_suite_name="test_suite", overwrite_existing=True
)
validator = context.get_validator(
batch_request=batch_request, expectation_suite_name="test_suite"
)
[Error_snapshot_click_here]
[csv_data_snapshot]
Can someone help me to find out the issue?

You can check with the following code whether your batch list is indeed empty.
context.get_batch_list(batch_request=batch_request)
If this is empty, you probably have an issue with your data_asset_names.
You can check whether the correct data asset name has been used in the output of the following code.
context.test_yaml_config(yaml.dump(my_spark_datasource_config))
In the output there is a list of available data_asset_names that you can choose from. If the data asset_name of your BatchRequest is not in the list, you will have an empty batch_list as the data asset is not available. There should be a warning here but I think it is not implemented.
I had the same issue but figured it out yesterday. Also, you should produce a workable example of your error so people can explore the code. That way it is easier to help you.
Hopefully this helps!

Related

singer tap-zendesk - how to extract catalog.json with selected:True from discovery mode

I am using singer's tap-zendesk library and want to extract data from specific schemas.
I am running the following command in sync mode:
tap-zendesk --config config.json --catalog catalog.json.
Currently my config.json file has the following parameters:
{
"email": "<email>",
"api_token": "<token>",
"subdomain": "<domain>",
"start_date": "<start_date>"
}
I've managed to extract data by putting 'selected':true under schema, properties and metadata in the catalog.json file. But I was wondering if there was an easier way to do this? There are around 15 streams I need to go through.
I manage to get the catalog.json file through the discovery mode command:
tap-zendesk --config config.json --discover > catalog.json
The output looks something like the following, but that means that I have to go and add selected:True under every field.
{
"streams": [
{
"stream": "tickets",
"tap_stream_id": "tickets",
"schema": {
**"selected": "true"**,
"properties": {
"organization_id": {
**"selected": "true"**,},
"metadata": [
{
"breadcrumb": [],
"metadata": {
**"selected": "true"**
}

The selected=true needs to be applied only once per stream. This needs to be added to the metadata section under the stream where breadcrumbs = []. This is very poorly documented.
Please see this blog post for some helpful details: https://medium.com/getting-started-guides/extracting-ticket-data-from-zendesk-using-singer-io-tap-zendesk-57a8da8c3477

How to set schema in Python to use a json file on BigQuery?

I am looking for a way how the schema is set with a json file in Python on Big Query. The following document says I can set it with Schema field one by one, but I want to find out more efficient way.
https://cloud.google.com/bigquery/docs/schemas
Autodetect would be skeptical to make it in this case.
I will appreciate it if you helped me.

You can create a JSON file with columns/data types and use the below code to build BigQuery Schema.
JSON File (schema.json):
[
{
"name": "emp_id",
"type": "INTEGER"
},
{
"name": "emp_name",
"type": "STRING"
}
]
Python Code:
import json
from google.cloud import bigquery
bigquerySchema = []
with open('schema.json') as f:
bigqueryColumns = json.load(f)
for col in bigqueryColumns:
bigquerySchema.append(bigquery.SchemaField(col['name'], col['type']))
print(bigquerySchema)

Soumendra Mishra is already helpful, but here is a bit more general version that can optionally accept addition fields such as mode or description:
JSON File (schema.json):
[
{
"name": "emp_id",
"type": "INTEGER",
"mode": "REQUIRED"
},
{
"name": "emp_name",
"type": "STRING",
"description": "Description of this field"
}
]
Python Code:
import json
from google.cloud import bigquery
table_schema = []
# open JSON file read only
with open('schema.json', 'r') as f:
table_schema = json.load(f)
for entry in table_schema:
# rename key; bigquery.SchemaField expects `field` to be called `field_type`
entry["field_type"] = entry.pop("type")
# ** effectively provides data as argument:value pairs (e.g. name="emp_id")
table_schema.append(bigquery.SchemaField(**entry))

How to insert variable value into JSON string for use with PDAL

I'm trying to use the Python extension for PDAL to read in a laz file.
To do so, I'm using the simple pipeline structure as exampled here: https://gis.stackexchange.com/questions/303334/accessing-raw-data-from-laz-file-in-python-with-open-source-software. It would be useful for me, however, to insert the value contained in a variable for the "filename:" field. To do so, I've tried the following, where fullFileName is a str variable containing the name (full path) of the file, but I am getting an error that no such file exists. I am assuming my JSON syntax is slightly off or something; can anyone help?
pipeline="""{
"pipeline": [
{
"type": "readers.las",
"filename": "{fullFileName}"
}
]
}"""

You can follow this code:
import json
import pdal
file = "D:/Lidar data/input.laz"
pipeline={
"pipeline": [
{
"type": "readers.las",
"filename": file
},
{
"type": "filters.sort",
"dimension": "Z"
}
]
}
r = pdal.Pipeline(json.dumps(pipeline))
r.validate()
points = r.execute()
print(points)

Access GlobalParameters in Azure ML Python script

How can one access the global parameters ("GlobalParameters") sent from a web service in a Python script on Azure ML?
I tried:
if 'GlobalParameters' in globals():
myparam = GlobalParameters['myparam']
but with no success.
EDIT: Example
In my case, I'm sending a sound file over the web service (as a list of samples). I would also like to send a sample rate and the number of bits per sample. I've successfully configured the web service (I think) to take these parameters, so the GlobalParameters now look like:
"GlobalParameters": {
"sampleRate": "44100",
"bitsPerSample": "16",
}
However, I cannot access these variables from the Python script, neither as GlobalParameters["sampleRate"] nor as sampleRate. Is it possible? Where are they stored?

based on our understanding of your question, here may has a miss conception that Azure ML parameters are not “Global Parameters”, as a matter of fact they are just parameter substitution tied to a particular module. So in affect there are no global parameters that are accessible throughout the experiment you have mentioned. Such being the case, we think the experiment below accomplishes what you are asking for:
Please add an “Enter Data” module to the experiment and add Data in csv format. Then for the Data click the parameter to create a web service parameter. Add in the CSV data which will be substituted from data passed by the client application. I.e.
Please add an “Execute Python” module and hook up the “Enter Data” output to the “Execute Python” input1. Add the python code to take the dataframe1 and add it to a python list. Once you have it in a list you can use it anywhere in your python code.
Python code snippet
def azureml_main(dataframe1 = None, dataframe2 = None):
import pandas as pd
global_list = []
for g in dataframe1["Col3"]:
global_list.append(g)
df_global = pd.DataFrame(global_list)
print('Input pandas.DataFrame:\r\n\r\n{0}'.format(df_global))
return [df_global]
Once you publish your experiment, you can add in new values in the “Data”: “”, section below with the new values that you was substituted for the “Enter Data” values in the experiment.
data = {
"Inputs": {
"input1":
{
"ColumnNames": ["Col1", "Col2", "Col3"],
"Values": [ [ "0", "value", "0" ], [ "0", "value", "0" ], ]
}, },
"GlobalParameters": {
"Data": "1,sampleRate,44500\\n2,bitsPerSample,20",
}
}
Please feel free to let us know if this makes sense.

The GlobalParameters parameter can not be used in a Python script. It is used to override certain parameters in other modules.
If you, for example, take the 'Split Data' module, you'll find an option to turn a parameter into a web service parameter:
Once you click that, a new section appears titled "Web Service Parameters". There you can change the default parameter name to one of your choosing.
If you deploy your project as a web service, you can override that parameter by putting it in the GlobalParameters parameter:
"GlobalParameters": {
"myFraction": 0.7
}
I hope that clears things up a bit.

Although it is not possible to use GlobalParameters in the Python script (see my previous answer), you can however hack/abuse the second input of the Python script to pass in other parameters. In my example I call them metadata parameters.
To start, I added:
a Web service input module with name: "realdata" (for your real data off course)
a Web service input module with name: "metadata" (we will abuse this one to pass parameters to our Python).
a Web service output module with name: "computedMetadata"
Connect the modules as follows:
As you can see, I also added a real data set (Restaurant ratings) as wel as a dummy metadata csv (the Enter Data Manually) module.
In this manual data you will have to predefine your metadata parameters as if they were a csv with a header and a only a single row to hold the data:
In the example both sampleRate and bitsPerSample are set to 0.
My Python scripts then takes in that fake csv as metadata, does some dummy calculation with it and returns it as column name:
import pandas as pd
def azureml_main(realdata = None, metadata = None):
theSum = metadata["sampleRate"][0] + metadata["bitsPerSample"][0]
outputString = "The sum of the sampleRate and the bitsPerSecond is " + str(theSum)
print(outputString)
return pd.DataFrame([outputString])
I then published this as a web service and called it using Node.js like this:
httpreq.post('https://ussouthcentral.services.azureml.net/workspaces/xxx/services/xxx', {
headers: {
Authorization: 'Bearer xxx'
},
json: {
"Inputs": {
"realdata": {
"ColumnNames": [
"userID",
"placeID",
"rating"
],
"Values": [
[
"100",
"101",
"102"
],
[
"200",
"201",
"202"
]
]
},
"metadata": {
"ColumnNames": [
"sampleRate",
"bitsPerSample"
],
"Values": [
[
44100,
16
]
]
}
},
"GlobalParameters": {}
}
}, (err, res) => {
if(err) return console.log(err);
console.log(JSON.parse(res.body));
});
The output was as expected:
{ Results:
{ computedMetadata:
{ type: 'table',
value:
{ ColumnNames: [ '0' ],
ColumnTypes: [ 'String' ],
Values:
[ [ 'The sum of the sampleRate and the bitsPerSecond is 44116' ] ] } } } }
Good luck!

Convert a JSON schema to a python class

Is there a python library for converting a JSON schema to a python class definition, similar to jsonschema2pojo -- https://github.com/joelittlejohn/jsonschema2pojo -- for Java?

So far the closest thing I've been able to find is warlock, which advertises this workflow:
Build your schema
>>> schema = {
'name': 'Country',
'properties': {
'name': {'type': 'string'},
'abbreviation': {'type': 'string'},
},
'additionalProperties': False,
}
Create a model
>>> import warlock
>>> Country = warlock.model_factory(schema)
Create an object using your model
>>> sweden = Country(name='Sweden', abbreviation='SE')
However, it's not quite that easy. The objects that Warlock produces lack much in the way of introspectible goodies. And if it supports nested dicts at initialization, I was unable to figure out how to make them work.
To give a little background, the problem that I was working on was how to take Chrome's JSONSchema API and produce a tree of request generators and response handlers. Warlock doesn't seem too far off the mark, the only downside is that meta-classes in Python can't really be turned into 'code'.
Other useful modules to look for:
jsonschema - (which Warlock is built on top of)
valideer - similar to jsonschema but with a worse name.
bunch - An interesting structure builder thats half-way between a dotdict and construct
If you end up finding a good one-stop solution for this please follow up your question - I'd love to find one. I poured through github, pypi, googlecode, sourceforge, etc.. And just couldn't find anything really sexy.
For lack of any pre-made solutions, I'll probably cobble together something with Warlock myself. So if I beat you to it, I'll update my answer. :p

python-jsonschema-objects is an alternative to warlock, build on top of jsonschema
python-jsonschema-objects provides an automatic class-based binding to JSON schemas for use in python.
Usage:
Sample Json Schema
schema = '''{
"title": "Example Schema",
"type": "object",
"properties": {
"firstName": {
"type": "string"
},
"lastName": {
"type": "string"
},
"age": {
"description": "Age in years",
"type": "integer",
"minimum": 0
},
"dogs": {
"type": "array",
"items": {"type": "string"},
"maxItems": 4
},
"gender": {
"type": "string",
"enum": ["male", "female"]
},
"deceased": {
"enum": ["yes", "no", 1, 0, "true", "false"]
}
},
"required": ["firstName", "lastName"]
} '''
Converting the schema object to class
import python_jsonschema_objects as pjs
import json
schema = json.loads(schema)
builder = pjs.ObjectBuilder(schema)
ns = builder.build_classes()
Person = ns.ExampleSchema
james = Person(firstName="James", lastName="Bond")
james.lastName
u'Bond' james
example_schema lastName=Bond age=None firstName=James
Validation :
james.age = -2
python_jsonschema_objects.validators.ValidationError: -2 was less
or equal to than 0
But problem is , it is still using draft4validation while jsonschema has moved over draft4validation , i filed an issue on the repo regarding this .
Unless you are using old version of jsonschema , the above package will work as shown.

I just created this small project to generate code classes from json schema, even if dealing with python I think can be useful when working in business projects:
pip install jsonschema2popo
running following command will generate a python module containing json-schema defined classes (it uses jinja2 templating)
jsonschema2popo -o /path/to/output_file.py /path/to/json_schema.json
more info at: https://github.com/frx08/jsonschema2popo

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Great Expectation with Azure and Databricks - python

Related

singer tap-zendesk - how to extract catalog.json with selected:True from discovery mode

How to set schema in Python to use a json file on BigQuery?

How to insert variable value into JSON string for use with PDAL

Access GlobalParameters in Azure ML Python script

Convert a JSON schema to a python class

Categories

Resources