Change run parameters of Databricks python_wheel_task with CLI - python

I have created a job in Databricks. Now I'm trying to run it using the Databricks CLI. See part of the job JSON:
"tasks": [
{
"task_key": "task_key",
"python_wheel_task": {
"package_name": "package_name",
"entry_point": "entry_point",
"parameters": [
"command",
"parameter1",
]
},
"libraries": [
{
"whl": "dbfs:/wheels/anme.whl"
}
]
Command I'm using to run the job (this command works):
databricks jobs run-now --job-id 1111
Then, I try to change parameter1 by running
databricks jobs run-now --job-id 1111 --python-params '"["value1", "value2"]"'
But I get error:
Error: JSONDecodeError: Expecting value: line 1 column 2 (char 1)
How can I provide new parameters to the python_wheel_task?

On top of the answer from Alex, I found that we have to use escape characters for the command to work.
The solution is:
databricks jobs run-now --job-id 1111 --python-params '[\"value1\", \"value2\"]'

The value for --python-params should be a JSON array of strings. In your case you don't need the extra double quotes around the array - just use '["value1", "value2"]'.
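If the shell quoting keeps getting in the way, another option is to skip the CLI and call the Jobs API run-now endpoint directly from Python. This is a minimal sketch, assuming Jobs API 2.1, a placeholder workspace URL, and a personal access token; the new parameters go in python_params:

import requests

DATABRICKS_HOST = "https://<your-workspace>.cloud.databricks.com"  # placeholder workspace URL
TOKEN = "<personal-access-token>"  # placeholder token

resp = requests.post(
    f"{DATABRICKS_HOST}/api/2.1/jobs/run-now",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={"job_id": 1111, "python_params": ["value1", "value2"]},
)
resp.raise_for_status()
print(resp.json())  # contains the run_id of the triggered run

Because the parameters travel as a real JSON body, no shell escaping is needed.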

Related

Keeping quotes from std out for passing to bash

Okay, this is a bit convoluted, but I've got a Python script that digests a JSON file and prints a string representation of that file like so:
secret = "'{"  # assumed initialisation; the question omits how secret starts
for id in pwds.keys():
    secret += f"'{id}' : '{pwds[id]['username']},{pwds[id]['pswd']}',"
secret = secret[:-1] + "}'"
print(secret)
This is taken in by a Jenkins pipeline so it can be passed to a bash script:
def secret_string = sh (script: "python3 syncToSecrets.py", returnStdout: true)
sh label: 'SYNC', script: "bash sync.sh ${ENVIRONMENT} ${secret_string}"
I can see that when Python prints the output it looks like
'{"key" : "value", "key" : "value"...}'
But when it gets to secret_string, and also the bash script it then looks like
{key : value, key : value}
This is how the bash script is calling it
ENV=$1; SECRET_STRING=$2;
aws secretsmanager create-secret --name NAME --secret-string "${SECRET_STRING}"
Which technically works, it just uploads the whole thing as a string instead of discrete KV-pairs.
I'm trying to run some stuff with the AWS CLI, and it requires that the data be wrapped in quotes, but so far, I've been totally unable to keep the quotes in between processes. Any advice?
Sample pwds dict data:
import json

pwds = {
    'id001': {
        'username': 'user001',
        'pswd': 'pwd123'
    },
    'id002': {
        'username': 'user002',
        'pswd': 'pwd123'
    }
}
As suggested by SuperStormer, it's better to use Python types (dict, list, etc.) instead of building your own JSON.
secrets = [{id: f"{val['username']}, {val['pswd']}"} for id, val in pwds.items()]
json.dumps(secrets)
'[{"id001": "user001, pwd123"}, {"id002": "user002, pwd123"}]'
The JSON string should be usable within Jenkins script blocks.
Try experimenting with single quotes or --secret-string file://secrets.json as alternatives.
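If the quotes still get stripped somewhere between Jenkins and bash, one way to avoid the problem entirely is to never put the JSON on a command line: write it to a file from Python and point the AWS CLI at that file. A minimal sketch, assuming the pwds dict from the question and a placeholder secret name:

import json
import subprocess

# pwds is assumed to be the dict shown in the question
secret = {key: f"{val['username']},{val['pswd']}" for key, val in pwds.items()}

# write real JSON to disk so no shell quoting is involved
with open("secrets.json", "w") as f:
    json.dump(secret, f)

subprocess.run(
    ["aws", "secretsmanager", "create-secret",
     "--name", "NAME",  # placeholder secret name
     "--secret-string", "file://secrets.json"],
    check=True,
)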

Azure CLI returning second array when only expecting one

I'm working with the Azure CLI to script out a storage upgrade as well as add a policy, all in a Python script. However, when I run the script I'm getting some expected and some very unexpected output.
What I'm using so far:
from azure.cli.core import get_default_cli

def az_cli(args_str):
    args = args_str.split()
    cli = get_default_cli()
    cli.invoke(args)
    if cli.result.result:
        return cli.result.result
    elif cli.result.error:
        raise cli.result.error
    return True

sas = az_cli("storage account list --query [].{Name:name,ResourceGroup:resourceGroup,Kind:kind}")
print(sas)
Using this SO article as a reference, I'm pretty easily making Azure CLI calls; however, my output is the following:
[
    {
        "Kind": "StorageV2",
        "Name": "TestStorageName",
        "ResourceGroup": "my_test_RG"
    },
    {
        "Kind": "Storage",
        "Name": "TestStorageName2",
        "ResourceGroup": "my_test_RG_2"
    }
]
[OrderedDict([('Name', 'TestStorageName'), ('ResourceGroup', 'my_test_RG'), ('Kind', 'StorageV2')]), OrderedDict([('Name', 'TestStorageName2'), ('ResourceGroup', 'my_test_RG_2'), ('Kind', 'Storage')])]
I appear to be getting two arrays back, and I'm unsure of what the cause is. I'm assuming it has to do with my using --query to narrow down the output I get back, but I'm at a loss as to why it then repeats itself. The expected result would just be the first part, which is in JSON format. I have also tried tsv output, with the same results. I appreciate any insight!
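As a side note, the duplication is probably not in the data itself: cli.invoke() prints the CLI's formatted JSON to stdout as a side effect, while the raw Python objects land in cli.result.result, which the script then prints as the OrderedDict list. A minimal sketch, assuming invoke() accepts an out_file argument (as in recent azure-cli-core/knack releases), that silences the printed copy:

import os
from azure.cli.core import get_default_cli

def az_cli(args_str):
    cli = get_default_cli()
    # send the CLI's own formatted output to /dev/null so only the
    # Python objects returned by this function remain
    with open(os.devnull, "w") as devnull:
        cli.invoke(args_str.split(), out_file=devnull)
    if cli.result.result:
        return cli.result.result
    elif cli.result.error:
        raise cli.result.error
    return True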

Navigating Event in AWS Lambda Python

So I'm fairly new to both AWS and Python. I'm on a uni assignment and have hit a road block.
I'm uploading data to AWS S3, this information is being sent to an SQS Queue and passed into AWS Lambda. I know, it would be much easier to just go straight from S3 to Lambda...but apparently "that's not the brief".
So I've got my event accurately coming into AWS Lambda, but no matter how deep I dig, I can't reach the information I need. In AWS Lambda, I run the following handler:
def lambda_handler(event, context):
    print(event)
Via CloudWatch, I get the output
{'Records': [{'messageId': '1d8e0a1d-d7e0-42e0-9ff7-c06610fccae0', 'receiptHandle': 'AQEBr64h6lBEzLk0Xj8RXBAexNukQhyqbzYIQDiMjJoLLtWkMYKQp5m0ENKGm3Icka+sX0HHb8gJoPmjdTRNBJryxCBsiHLa4nf8atpzfyCcKDjfB9RTpjdTZUCve7nZhpP5Fn7JLVCNeZd1vdsGIhkJojJ86kbS3B/2oBJiCR6ZfuS3dqZXURgu6gFg9Yxqb6TBrAxVTgBTA/Pr35acEZEv0Dy/vO6D6b61w2orabSnGvkzggPle0zcViR/shLbehROF5L6WZ5U+RuRd8tLLO5mLFf5U+nuGdVn3/N8b7+FWdzlmLOWsI/jFhKoN4rLiBkcuL8UoyccTMJ/QTWZvh5CB2mwBRHectqpjqT4TA3Z9+m8KNd/h/CIZet+0zDSgs5u', 'body': '{"Records":[{"eventVersion":"2.1","eventSource":"aws:s3","awsRegion":"eu-west-2","eventTime":"2021-03-26T01:03:53.611Z","eventName":"ObjectCreated:Put","userIdentity":{"principalId":"MY_ID"},"requestParameters":{"sourceIPAddress":"MY_IP_ADD"},"responseElements":{"x-amz-request-id":"BQBY06S20RYNH1XJ","x-amz-id-2":"Cdo0RvX+tqz6SZL/Xw9RiBLMCS3Rv2VOsu2kVRa7PXw9TsIcZeul6bzbAS6z4HF6+ZKf/2MwnWgzWYz+7jKe07060bxxPhsY"},"s3":{"s3SchemaVersion":"1.0","configurationId":"test","bucket":{"name":"MY_BUCKET","ownerIdentity":{"principalId":"MY_ID"},"arn":"arn:aws:s3:::MY_BUCKET"},"object":{"key":"test.jpg","size":246895,"eTag":"c542637a515f6df01cbc7ee7f6e317be","sequencer":"00605D33019AD8E4E5"}}}]}', 'attributes': {'ApproximateReceiveCount': '1', 'SentTimestamp': '1616720643174', 'SenderId': 'AIDAIKZTX7KCMT7EP3TLW', 'ApproximateFirstReceiveTimestamp': '1616720648174'}, 'messageAttributes': {}, 'md5OfBody': '1ab703704eb79fbbb58497ccc3f2c555', 'eventSource': 'aws:sqs', 'eventSourceARN': 'arn:aws:sqs:eu-west-2:ARN', 'awsRegion': 'eu-west-2'}]}
[Disclaimer, I've tried to edit out any identifying information but if there's any sensitive data I'm not understanding or missed, please let me know]
Anyways, just for a sample, I want to get the object key, which is test.jpg. I tried to drill down as much as I could, finally getting to:
def lambda_handler(event, context):
    print(event['Records'][0]['body'])
This returned the following (which was nice to see fully stylized):
{
    "Records": [
        {
            "eventVersion": "2.1",
            "eventSource": "aws:s3",
            "awsRegion": "eu-west-2",
            "eventTime": "2021-03-26T01:08:16.823Z",
            "eventName": "ObjectCreated:Put",
            "userIdentity": {
                "principalId": "MY_ID"
            },
            "requestParameters": {
                "sourceIPAddress": "MY_IP"
            },
            "responseElements": {
                "x-amz-request-id": "ZNKHRDY8GER4F6Q5",
                "x-amz-id-2": "i1Cazudsd+V57LViNWyDNA9K+uRbSQQwufMC6vf50zQfzPaH7EECsvw9SFM3l3LD+TsYEmnjXn1rfP9GQz5G5F7Fa0XZAkbe"
            },
            "s3": {
                "s3SchemaVersion": "1.0",
                "configurationId": "test",
                "bucket": {
                    "name": "MY_BUCKET",
                    "ownerIdentity": {
                        "principalId": "MY_ID"
                    },
                    "arn": "arn:aws:s3:::MY_BUCKET"
                },
                "object": {
                    "key": "test.jpg",
                    "size": 254276,
                    "eTag": "b0052ab9ba4b9395e74082cfd51a8f09",
                    "sequencer": "00605D3407594DE184"
                }
            }
        }
    ]
}
However, from this stage on, if I try to write print(event['Records'][0]['body']['Records']) or print(event['Records'][0]['s3']), I get told that the indices need to be integers, not strings. If I try to write print(event['Records'][0]['body'][0]), I get a single character every time (in this case the first { bracket).
I'm not sure if this has something to do with tuples, or if at this stage it's all saved as one large string, but at least in the output view it doesn't appear to be saved that way.
Does anyone have any idea what I'd do from this stage to access the further information? In the full release after I'm done testing, I'll be wanting to save an audio file and the file name as opposed to a picture.
Thanks.
You are having this problem because the content of the body is JSON, but in string format. You should parse it to be able to access it like a normal dictionary, like so:
import json

def handler(event: dict, context: object):
    body = event['Records'][0]['body']
    body = json.loads(body)
    # use the body as a normal dictionary
You are getting only a single char when using integer indexes because it is a string. So, using [n] on a string will return the nth char.
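Putting that together, a minimal sketch (based on the event you printed above) of drilling down to the object key:

import json

def lambda_handler(event, context):
    # the SQS record body is a JSON string that contains the S3 event
    body = json.loads(event['Records'][0]['body'])
    key = body['Records'][0]['s3']['object']['key']    # e.g. "test.jpg"
    size = body['Records'][0]['s3']['object']['size']
    print(key, size)
    return key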
It's because you're getting stringified JSON data. You need to load it back into a Python dict.
There is a useful package called lambda_decorators. You can install it with pip install lambda_decorators.
Then you can do this:
from lambda_decorators import load_json_body

@load_json_body
def lambda_handler(event, context):
    print(event['Records'][0]['body'])
    # Now you can access the items in the body using their indexes and keys.
This will extract the JSON for you.

Access GlobalParameters in Azure ML Python script

How can one access the global parameters ("GlobalParameters") sent from a web service in a Python script on Azure ML?
I tried:
if 'GlobalParameters' in globals():
    myparam = GlobalParameters['myparam']
but with no success.
EDIT: Example
In my case, I'm sending a sound file over the web service (as a list of samples). I would also like to send a sample rate and the number of bits per sample. I've successfully configured the web service (I think) to take these parameters, so the GlobalParameters now look like:
"GlobalParameters": {
"sampleRate": "44100",
"bitsPerSample": "16",
}
However, I cannot access these variables from the Python script, neither as GlobalParameters["sampleRate"] nor as sampleRate. Is it possible? Where are they stored?
Based on our understanding of your question, there may be a misconception here: Azure ML parameters are not "global parameters"; they are in fact just parameter substitutions tied to a particular module. So in effect there are no global parameters that are accessible throughout the experiment you have mentioned. Such being the case, we think the experiment below accomplishes what you are asking for:
Please add an “Enter Data” module to the experiment and add data in CSV format. Then, for the data, click the parameter to create a web service parameter. Add in the CSV data, which will be substituted with the data passed by the client application.
Please add an “Execute Python” module and hook up the “Enter Data” output to the “Execute Python” module's input1. Add Python code to take dataframe1 and add it to a Python list. Once you have it in a list you can use it anywhere in your Python code.
Python code snippet
def azureml_main(dataframe1 = None, dataframe2 = None):
    import pandas as pd
    global_list = []
    for g in dataframe1["Col3"]:
        global_list.append(g)
    df_global = pd.DataFrame(global_list)
    print('Input pandas.DataFrame:\r\n\r\n{0}'.format(df_global))
    return [df_global]
Once you publish your experiment, you can add new values in the “Data”: “” section below; these are the values that will be substituted for the “Enter Data” values in the experiment.
data = {
    "Inputs": {
        "input1": {
            "ColumnNames": ["Col1", "Col2", "Col3"],
            "Values": [ [ "0", "value", "0" ], [ "0", "value", "0" ] ]
        }
    },
    "GlobalParameters": {
        "Data": "1,sampleRate,44500\\n2,bitsPerSample,20"
    }
}
Please feel free to let us know if this makes sense.
The GlobalParameters parameter can not be used in a Python script. It is used to override certain parameters in other modules.
If you, for example, take the 'Split Data' module, you'll find an option to turn a parameter into a web service parameter:
Once you click that, a new section appears titled "Web Service Parameters". There you can change the default parameter name to one of your choosing.
If you deploy your project as a web service, you can override that parameter by putting it in the GlobalParameters parameter:
"GlobalParameters": {
"myFraction": 0.7
}
I hope that clears things up a bit.
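To make that concrete, here is a minimal sketch of calling such a deployed web service from Python; the endpoint URL and API key are placeholders, and the Inputs block just reuses the column layout from the other examples on this page:

import requests

# placeholder endpoint and key for a classic Azure ML (Studio) web service
url = "https://ussouthcentral.services.azureml.net/workspaces/<workspace>/services/<service>/execute?api-version=2.0"
api_key = "<your API key>"

payload = {
    "Inputs": {
        "input1": {
            "ColumnNames": ["Col1", "Col2", "Col3"],
            "Values": [["0", "value", "0"]]
        }
    },
    # overrides the web service parameter defined in the experiment
    "GlobalParameters": {
        "myFraction": 0.7
    }
}

resp = requests.post(url, json=payload, headers={"Authorization": f"Bearer {api_key}"})
print(resp.json())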
Although it is not possible to use GlobalParameters in the Python script (see my previous answer), you can however hack/abuse the second input of the Python script to pass in other parameters. In my example I call them metadata parameters.
To start, I added:
a Web service input module with name: "realdata" (for your real data off course)
a Web service input module with name: "metadata" (we will abuse this one to pass parameters to our Python).
a Web service output module with name: "computedMetadata"
Connect the modules as follows:
As you can see, I also added a real data set (Restaurant ratings) as well as a dummy metadata CSV (the Enter Data Manually module).
In this manual data you will have to predefine your metadata parameters as if they were a CSV with a header and only a single row to hold the data:
In the example both sampleRate and bitsPerSample are set to 0.
My Python script then takes in that fake CSV as metadata, does a dummy calculation with it, and returns the result as a single-column DataFrame:
import pandas as pd

def azureml_main(realdata = None, metadata = None):
    theSum = metadata["sampleRate"][0] + metadata["bitsPerSample"][0]
    outputString = "The sum of the sampleRate and the bitsPerSecond is " + str(theSum)
    print(outputString)
    return pd.DataFrame([outputString])
I then published this as a web service and called it using Node.js like this:
httpreq.post('https://ussouthcentral.services.azureml.net/workspaces/xxx/services/xxx', {
    headers: {
        Authorization: 'Bearer xxx'
    },
    json: {
        "Inputs": {
            "realdata": {
                "ColumnNames": [
                    "userID",
                    "placeID",
                    "rating"
                ],
                "Values": [
                    [
                        "100",
                        "101",
                        "102"
                    ],
                    [
                        "200",
                        "201",
                        "202"
                    ]
                ]
            },
            "metadata": {
                "ColumnNames": [
                    "sampleRate",
                    "bitsPerSample"
                ],
                "Values": [
                    [
                        44100,
                        16
                    ]
                ]
            }
        },
        "GlobalParameters": {}
    }
}, (err, res) => {
    if (err) return console.log(err);
    console.log(JSON.parse(res.body));
});
The output was as expected:
{ Results:
{ computedMetadata:
{ type: 'table',
value:
{ ColumnNames: [ '0' ],
ColumnTypes: [ 'String' ],
Values:
[ [ 'The sum of the sampleRate and the bitsPerSecond is 44116' ] ] } } } }
Good luck!

How to distribute MongoDB test data over VCS?

I'm working on a Python/MongoDB project on both my computer at home and my laptop.
The schema in document stores, naturally, is best represented by the data itself - and that's why I want to distribute my test data over Mercurial, together with the code itself.
Would the best way be to simply dump the BSONs in a file and add it to the mercurial repository?
Dumping BSON and putting it into a VCS would not make much sense, since it's a binary format and can't be viewed easily.
You can export a collection to JSON by using the mongoexport tool. You can even pass it a query filter to limit the number of exported documents.
Here's an example (reformatted for readability):
sergio#soviet-russia$ mongoexport -d test -c geo \
    -q '{"_id": ObjectId("4efa5f7d8840e680c850cd94") }'
connected to: 127.0.0.1
{ "_id" : { "$oid" : "4efa5f7d8840e680c850cd94" },
"longg" : [ { "start" : 322815488, "end" : 322817535 },
{ "start" : 822815488, "end" : 822817535 } ],
"m" : "Cracow",
"postal" : 55050,
"lat" : [ "XX.89XXX", "XX.74XXX" ] }
exported 1 records
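On the consuming side, a minimal sketch of loading such an exported file back into a test database with pymongo; geo.json is a placeholder name for the mongoexport output committed to the repository (one JSON document per line), and bson.json_util takes care of extended-JSON fields like $oid:

from pymongo import MongoClient
from bson.json_util import loads  # parses extended JSON such as {"$oid": ...}

client = MongoClient()
collection = client["test"]["geo"]

# geo.json: the mongoexport output kept under version control,
# one JSON document per line
with open("geo.json") as f:
    docs = [loads(line) for line in f if line.strip()]

if docs:
    collection.insert_many(docs)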
