How do I create a partitioned collection in Cosmos DB with pydocumentdb? - python

The pydocumentdb.document_client.DocumentClient object has a CreateCollection() method, defined here.
When creating a collection with this method, one needs to specify the database link (already known), the collection (I don't know how to reference it if it hasn't been made) and options.
Parameters that I would like to control when creating the collection are:
- name of the collection
- type of collection (fixed size vs. partitioned)
- partition keys
- RU value
- indexing policy (or at least be able to create a default template somewhere and automatically copy it to the newly created collection)
Enums for some of these parameters seem to be defined here, but I don't see any potentially useful HTTP headers in http_constants.py, and I don't see where RUs come into play or where a cohesive "Collection" object would be passed as a parameter.

You could refer to the sample source code here and the REST API here.
import pydocumentdb
import pydocumentdb.errors as errors
import pydocumentdb.document_client as document_client

config = {
    'ENDPOINT': 'https://***.documents.azure.com:443/',
    'MASTERKEY': '***'
}

# Initialize the Python DocumentDB client
client = document_client.DocumentClient(config['ENDPOINT'], {'masterKey': config['MASTERKEY']})

databaseLink = "dbs/db"

coll = {
    "id": "testCreate",
    "indexingPolicy": {
        "indexingMode": "lazy",
        "automatic": False
    },
    "partitionKey": {
        "paths": ["/AccountNumber"],
        "kind": "Hash"
    }
}

collection_options = {'offerThroughput': 400}

client.CreateCollection(databaseLink, coll, collection_options)
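As a quick sanity check, here is a small follow-up sketch (assuming the client and databaseLink defined above, and the "testCreate" id used in the collection definition) that reads the new collection back to confirm its partition key and indexing policy:

# Read the collection back to verify it was created as a partitioned collection.
collection_link = databaseLink + "/colls/testCreate"
created = client.ReadCollection(collection_link)
print(created["partitionKey"])
print(created["indexingPolicy"])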
Hope it helps you.

Related

Python: Help converting a messy if-elif ladder within a loop into a dictionary switch case for creating gRPC SSL certificates

I'm having trouble cleaning up some of my code.
In general terms, the Python code is meant to read two JSON files. The first JSON file contains the address of each microservice; the second contains a list of the services each service communicates with. The purpose of the code is to create gRPC SSL certificates for each microservice by looping over these two JSON files.
I have written the code using if statements, and it is very messy; I am struggling to clean it up using dictionaries.
Below I will list a sample of the two JSON files described above:
ServicesAddresses.json
[
    {
        "service": "service-A",
        "address": ["localhost1"]
    },
    {
        "service": "service-B",
        "address": ["localhost2"]
    }
]
ServicesUsed.json
[
    {
        "service": "service-A",
        "services_used": ["service-B", "service-C"]
    },
    {
        "service": "service-B",
        "services_used": ["service-C", "service-D"]
    }
]
Below is the code I use to loop over the first JSON and assign the addresses to variables:
for address in addressData:
    if address["service"] == "service-A":
        addresses.serviceA = address["address"]
    elif address["service"] == "service-B":
        addresses.serviceB = address["address"]
Finally, here is the code used to loop over the second JSON and generate the SSL certificates using a function called cert_create, which takes the address of each service as input:
for service in runningData:
    if service["service"] == "service-A" and service["services_used"] == ["service-B", "service-C"]:
        os.chdir('/certs/service-A/service-B')
        cert_create(str(addresses.serviceA))
        os.chdir('/certs/service-A/service-C')
        cert_create(str(addresses.serviceA))
    elif service["service"] == "service-B" and service["services_used"] == ["service-C", "service-D"]:
        os.chdir('/certs/service-B/service-C')
        cert_create(str(addresses.serviceB))
        os.chdir('/certs/service-B/service-D')
        cert_create(str(addresses.serviceB))
As you can see, with a large number of services this logic becomes quite a monstrosity. The issue is that, with my limited experience of Python, I don't see how a dictionary-based switch will simplify this logic while keeping the same assignments and behaviour as the if statements.
Any ideas? I know this is quite basic, but I feel that in a different language such as Go or Java I would have been able to make this code a lot cleaner.
Maybe you need an additional data structure (like the dict below) or a modification of the existing addresses object so you can look up an address by name.
Of course, if your address data comes from user input, consider sanitizing/checking it.
services_addresses = {
    'service-A': str(addresses.serviceA),
    'service-B': str(addresses.serviceB),
    'service-C': str(addresses.serviceC),
}

for service in runningData:
    service_name = service["service"]
    for service_name_used in service["services_used"]:
        os.chdir(f'/certs/{service_name}/{service_name_used}')
        # here comes the difference - you need the additional dict
        cert_create(services_addresses[service_name])
        # or a modification of the addresses object ?..
        # cert_create(addresses.getServiceAddress(service_name))
import os

running_data = [
    {
        "service": "service-A",
        "services_used": ["service-B", "service-C"]
    },
    {
        "service": "service-B",
        "services_used": ["service-C", "service-D"]
    },
]

address_data = [
    {
        "service": "service-A",
        "address": ["localhost1"]
    },
    {
        "service": "service-B",
        "address": ["localhost2"]
    },
]

def get_service_address(service):
    return [i.get("address")[0] for i in address_data if i.get("service") == service][0]

for data in running_data:
    service = data.get("service")
    used = data.get("services_used")
    if service is None or used is None:
        raise ValueError("Missing information")
    for u in used:
        print(os.path.join("certs", service, u))
        print(get_service_address(service))
Of course, replace the prints with os.chdir and cert_create.
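Putting the two answers together, here is a minimal sketch (assuming the two JSON files shown in the question are valid JSON; cert_create below is a placeholder standing in for the questioner's real helper) that builds the address lookup with a dict comprehension so no per-service if/elif branches are needed:

import json
import os

def cert_create(address):
    # Placeholder for the questioner's real certificate-creation helper.
    print(f"creating cert for {address} in {os.getcwd()}")

with open("ServicesAddresses.json") as f:
    address_data = json.load(f)
with open("ServicesUsed.json") as f:
    running_data = json.load(f)

# service name -> first address in its list
addresses_by_service = {entry["service"]: entry["address"][0] for entry in address_data}

for entry in running_data:
    for used in entry["services_used"]:
        os.chdir(os.path.join("/certs", entry["service"], used))
        cert_create(addresses_by_service[entry["service"]])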

How to create a Dataproc cluster with time to live using Python SDK

I am trying to create a Dataproc cluster with a time to live of one day using the Python SDK. For this purpose, v1beta2 of the Dataproc API introduces the LifecycleConfig object, which is a child of the ClusterConfig object.
I use this object in the JSON request that I pass to the create_cluster method. To set the TTL, I use the field auto_delete_ttl, which should have the value 86,400 seconds (one day).
The Google Protocol Buffers documentation is rather specific about how to represent a duration in JSON: durations shall be represented as a string with the suffix s for seconds, with 0, 3, 6, or 9 fractional digits.
However, if I pass the duration using this format, I get the error:
Parameter to MergeFrom() must be instance of same class: expected google.protobuf.Duration got str
This is how I create the cluster:
from google.cloud import dataproc_v1beta2

project = "your_project_id"
region = "europe-west4"
cluster = ""  # see below for the cluster JSON

client = dataproc_v1beta2.ClusterControllerClient(client_options={
    'api_endpoint': '{}-dataproc.googleapis.com:443'.format(region)
})

# Create the cluster
operation = client.create_cluster(project, region, cluster)
The variable cluster holds the JSON object describing the desired cluster:
{
    "cluster_name": "my_cluster",
    "config": {
        "config_bucket": "my_conf_bucket",
        "gce_cluster_config": {
            "zone_uri": "europe-west4-a",
            "metadata": {
                "PIP_PACKAGES": "google-cloud-storage google-cloud-bigquery"
            },
            "subnetwork_uri": "my subnet",
            "service_account_scopes": [
                "https://www.googleapis.com/auth/cloud-platform"
            ],
            "tags": [
                "some tags"
            ]
        },
        "master_config": {
            "num_instances": 1,
            "machine_type_uri": "n1-highmem-4",
            "disk_config": {
                "boot_disk_type": "pd-standard",
                "boot_disk_size_gb": 200,
                "num_local_ssds": 0
            },
            "accelerators": []
        },
        "software_config": {
            "image_version": "1.4-debian9",
            "properties": {
                "dataproc:dataproc.allow.zero.workers": "true",
                "yarn:yarn.log-aggregation-enable": "true",
                "dataproc:dataproc.logging.stackdriver.job.driver.enable": "true",
                "dataproc:dataproc.logging.stackdriver.enable": "true",
                "dataproc:jobs.file-backed-output.enable": "true"
            },
            "optional_components": []
        },
        "lifecycle_config": {
            "auto_delete_ttl": "86400s"
        },
        "initialization_actions": [
            {
                "executable_file": "gs://some-init-script"
            }
        ]
    },
    "project_id": "project_id"
}
Package versions I am using:
google-cloud-dataproc: 0.6.1
protobuf: 3.11.3
googleapis-common-protos: 1.6.0
Am I doing something wrong here, is it an issue with wrong package versions or is it even a bug?
You should use the "100s" format for a duration type when you construct the protobuf in a text format (i.e. JSON, etc.), but you are constructing the API request body as a Python object, which is why you need to create a Duration object instead of a string:
duration_message.FromSeconds(86400)
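For illustration, a minimal sketch (assuming the cluster variable holds the dict shown in the question rather than the empty string, and the client/project/region from the question's snippet) that builds the TTL as a protobuf Duration and places it into the cluster dict before calling create_cluster:

from google.protobuf.duration_pb2 import Duration

# Build the TTL as a Duration message rather than the string "86400s".
ttl = Duration()
ttl.FromSeconds(86400)  # one day

cluster["config"]["lifecycle_config"]["auto_delete_ttl"] = ttl
operation = client.create_cluster(project, region, cluster)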

How to create a Terraform backend.tf file from Python before execution to eliminate the interpolation state file issue

We built a web app, and from there we pass variables to Terraform like below:
terraform apply -input=false -auto-approve -var ami="%ami%" -var region="%region%" -var icount="%count%" -var type="%instance_type%"
The problem here is that the backend does not support variables, and I need to pass those values from the app as well. To resolve this I found suggestions that I need to create backend.tf before execution, but I can't figure out how to do it. If anyone has any examples regarding this, please help me. Thanks in advance.
I need to create the backend.tf file from Python using the variables below, and I need to replace key="${profile}/tfstate so that the key changes for each profile. I am thinking of using a git repo: create the files with git, pull the values, then commit and execute. Please help me with some examples and ideas.
The code is like below. My main.tf:
terraform {
  backend "s3" {
    bucket  = "terraform-007"
    key     = "key"
    region  = "ap-south-1"
    profile = "venu"
  }
}

provider "aws" {
  profile = "${var.aws_profile}"
  region  = "${var.aws_region}"
}

resource "aws_instance" "VM" {
  count         = var.icount
  ami           = var.ami
  instance_type = var.type
  tags = {
    Environment = "${var.env_indicator}"
  }
}
vars.tf like
variable “aws_profile” {
default = “default”
description = “AWS profile name, as set in ~/.aws/credentials”
}
variable “aws_region” {
type = “string”
default = “ap-south-1”
description = “AWS region in which to create resources”
}
variable “env_indicator” {
type = “string”
default = “dev”
description = “What environment are we in?”
}
variable “icount” {
default = 1
}
variable “ami” {
default =“ami-54d2a63b”
}
variable “bucket” {
default=“terraform-002”
}
variable “type” {
default=“t2.micro”
}
output.tf like:
output “ec2_public_ip” {
value = ["${aws_instance.VM.*.public_ip}"]
}
output “ec2_private_ip” {
value = ["${aws_instance.VM.*.private_ip}"]
}
Since the backend configuration cannot use interpolation, we have used a configuration-by-convention approach.
The terraform for all of our state collections (microservices and other infrastructure) use the same S3 bucket for state storage and the same DynamoDB table for locking.
When executing terraform, we use the same IAM role (a dedicated terraform only user).
We define the key for the state via convention, so that it does not need to be generated.
key = "platform/services/{name-of-service}/terraform.tfstate"
I would avoid a process that changes the infrastructure code as it is being deployed, to ensure maximum understandability for the engineers reading and maintaining the code.
EDIT: Adding key examples
For the user service:
key = "platform/services/users/terraform.tfstate"
For the search service:
key = "platform/services/search/terraform.tfstate"
For the product service:
key = "platform/services/products/terraform.tfstate"
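For completeness, since the question asks how to generate backend.tf from Python before execution, here is a minimal sketch under those assumptions (the bucket, region, and profile values are just the examples from the question, and the key follows the per-service convention described above):

import subprocess

BACKEND_TEMPLATE = """terraform {{
  backend "s3" {{
    bucket  = "terraform-007"
    key     = "platform/services/{service}/terraform.tfstate"
    region  = "ap-south-1"
    profile = "{profile}"
  }}
}}
"""

def write_backend(service, profile):
    # Render backend.tf before Terraform runs, since backend blocks cannot use variables.
    with open("backend.tf", "w") as f:
        f.write(BACKEND_TEMPLATE.format(service=service, profile=profile))

write_backend("users", "venu")
subprocess.run(["terraform", "init", "-reconfigure"], check=True)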

Google Sheet API - Get Data Validation

I'm trying to set data validation rules for my current spreadsheet. One thing that would help me would be the ability to view, in JSON, the data validation rules I have already set (in the spreadsheet UI or via an API call).
Example.
request = {
    "requests": [
        {
            "setDataValidation": {
                "range": {
                    "sheetId": SHEET_ID,
                    "startRowIndex": 1,
                    "startColumnIndex": 0,
                    "endColumnIndex": 1
                },
                "rule": {
                    "condition": {
                        "type": "BOOLEAN"
                    },
                    "inputMessage": "Value MUST BE BOOLEAN",
                    "strict": "True"
                }
            }
        }
    ]
}

service.spreadsheets().batchUpdate(spreadsheetId=SPREADSHEET_ID, body=request).execute()
But what API calls do I use to see the data validation on this range of cells? This is useful if I set the data validation rules in the spreadsheet and want to see how Google interprets them. I'm having a lot of trouble setting complex data validations through the API.
Thank you
To obtain only the "Data Validation" components of a given spreadsheet, you simply request the appropriate field in the call to spreadsheets.get:
service = get_authed_sheets_service_somehow()
params = {
    'spreadsheetId': 'your ssid',
    # 'range': 'some range',
    'fields': 'sheets(data/rowData/values/dataValidation,properties(sheetId,title))'
}
request = service.spreadsheets().get(**params)
response = request.execute()

# Example print code (not tested :p )
for sheet in response['sheets']:
    for range in sheet['data']:
        for r, row in enumerate(range['rowData']):
            for c, col in enumerate(row['values']):
                if 'dataValidation' in col:
                    # print "Sheet1!R1C1" & associated data validation object.
                    # Assumes whole grid was requested (add appropriate indices if not).
                    print(f'\'{sheet["properties"]["title"]}\'!R{r}C{c}', col['dataValidation'])
By specifying fields, includeGridData is not required to obtain data on a per-cell basis from the range you requested. By not supplying a range, we target the entire file. This particular fields specification requests the rowData.values.dataValidation object and the sheetId and title of the properties object, for every sheet in the spreadsheet.
You can use the Google APIs Explorer to interactively determine the appropriate valid "fields" specification, and additionally examine the response:
https://developers.google.com/apis-explorer/#p/sheets/v4/sheets.spreadsheets.get
For more about how "fields" specifiers work, read the documentation: https://developers.google.com/sheets/api/guides/concepts#partial_responses
(For certain write requests, field specifications are not optional so it is in your best interest to determine how to use them effectively.)
I think I found the answer: pass includeGridData=True in your spreadsheets().get call.
from pprint import pprint

response = service.spreadsheets().get(
    spreadsheetId=SPREADSHEETID, fields='*',
    ranges='InputWorking!A2:A', includeGridData=True).execute()
You get a monster data structure back. To look at the very first piece of data in your range you could do:
pprint(response['sheets'][0]['data'][0]['rowData'][0]['values'][0]['dataValidation'])

{'condition': {'type': 'BOOLEAN'},
 'inputMessage': 'Value MUST BE BOOLEAN',
 'strict': True}
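As a hypothetical follow-up (SHEET_ID and the range indices below are placeholders), a dataValidation object read back this way can be reused verbatim as the rule of a setDataValidation request, which helps when reproducing complex rules that were first set in the spreadsheet UI:

# Copy a rule read from the spreadsheet onto another column.
rule = response['sheets'][0]['data'][0]['rowData'][0]['values'][0]['dataValidation']

copy_request = {
    "requests": [{
        "setDataValidation": {
            "range": {
                "sheetId": SHEET_ID,        # placeholder sheet ID
                "startRowIndex": 1,
                "startColumnIndex": 1,      # target column B in this example
                "endColumnIndex": 2
            },
            "rule": rule
        }
    }]
}
service.spreadsheets().batchUpdate(spreadsheetId=SPREADSHEETID, body=copy_request).execute()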

Access GlobalParameters in Azure ML Python script

How can one access the global parameters ("GlobalParameters") sent from a web service in a Python script on Azure ML?
I tried:
if 'GlobalParameters' in globals():
    myparam = GlobalParameters['myparam']
but with no success.
EDIT: Example
In my case, I'm sending a sound file over the web service (as a list of samples). I would also like to send a sample rate and the number of bits per sample. I've successfully configured the web service (I think) to take these parameters, so the GlobalParameters now look like:
"GlobalParameters": {
"sampleRate": "44100",
"bitsPerSample": "16",
}
However, I cannot access these variables from the Python script, neither as GlobalParameters["sampleRate"] nor as sampleRate. Is it possible? Where are they stored?
Based on our understanding of your question, there may be a misconception here: Azure ML parameters are not "global parameters"; they are in fact just parameter substitutions tied to a particular module. So, in effect, there are no global parameters that are accessible throughout the experiment you have mentioned. Such being the case, we think the experiment below accomplishes what you are asking for:
Please add an "Enter Data" module to the experiment and add data in CSV format. Then, for the data, click the parameter to create a web service parameter. Add in the CSV data which will be substituted by data passed from the client application.
Please add an "Execute Python" module and hook up the "Enter Data" output to the "Execute Python" input1. Add Python code to take dataframe1 and add it to a Python list. Once you have it in a list you can use it anywhere in your Python code.
Python code snippet
def azureml_main(dataframe1 = None, dataframe2 = None):
    import pandas as pd
    global_list = []
    for g in dataframe1["Col3"]:
        global_list.append(g)
    df_global = pd.DataFrame(global_list)
    print('Input pandas.DataFrame:\r\n\r\n{0}'.format(df_global))
    return [df_global]
Once you publish your experiment, you can add new values in the "GlobalParameters" "Data" section below; these values will be substituted for the "Enter Data" values in the experiment.
data = {
    "Inputs": {
        "input1": {
            "ColumnNames": ["Col1", "Col2", "Col3"],
            "Values": [["0", "value", "0"], ["0", "value", "0"]]
        }
    },
    "GlobalParameters": {
        "Data": "1,sampleRate,44500\\n2,bitsPerSample,20"
    }
}
Please feel free to let us know if this makes sense.
The GlobalParameters parameter cannot be used in a Python script. It is used to override certain parameters in other modules.
If you, for example, take the 'Split Data' module, you'll find an option to turn a parameter into a web service parameter:
Once you click that, a new section appears titled "Web Service Parameters". There you can change the default parameter name to one of your choosing.
If you deploy your project as a web service, you can override that parameter by putting it in the GlobalParameters parameter:
"GlobalParameters": {
"myFraction": 0.7
}
I hope that clears things up a bit.
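To illustrate the call itself, here is a hypothetical sketch (the endpoint URL shape, workspace/service IDs, and API key are placeholders, and "myFraction" is just the example parameter name from above) of posting such an override to a published Azure ML classic web service with the requests library:

import json
import requests

url = ("https://ussouthcentral.services.azureml.net/workspaces/<workspace-id>"
       "/services/<service-id>/execute?api-version=2.0&details=true")
headers = {
    "Content-Type": "application/json",
    "Authorization": "Bearer <api-key>",
}
body = {
    "Inputs": {},  # fill in your experiment's web service inputs here
    "GlobalParameters": {"myFraction": 0.7},
}

response = requests.post(url, headers=headers, data=json.dumps(body))
print(response.json())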
Although it is not possible to use GlobalParameters in the Python script (see my previous answer), you can however hack/abuse the second input of the Python script to pass in other parameters. In my example I call them metadata parameters.
To start, I added:
a Web service input module with name: "realdata" (for your real data, of course)
a Web service input module with name: "metadata" (we will abuse this one to pass parameters to our Python).
a Web service output module with name: "computedMetadata"
Connect the modules as follows:
As you can see, I also added a real data set (Restaurant ratings) as well as a dummy metadata CSV (the Enter Data Manually module).
In this manual data you will have to predefine your metadata parameters as if they were a CSV with a header and only a single row to hold the data:
In the example both sampleRate and bitsPerSample are set to 0.
My Python script then takes in that fake CSV as metadata, does a dummy calculation with it, and returns the result as a column name:
import pandas as pd

def azureml_main(realdata = None, metadata = None):
    theSum = metadata["sampleRate"][0] + metadata["bitsPerSample"][0]
    outputString = "The sum of the sampleRate and the bitsPerSecond is " + str(theSum)
    print(outputString)
    return pd.DataFrame([outputString])
I then published this as a web service and called it using Node.js like this:
httpreq.post('https://ussouthcentral.services.azureml.net/workspaces/xxx/services/xxx', {
    headers: {
        Authorization: 'Bearer xxx'
    },
    json: {
        "Inputs": {
            "realdata": {
                "ColumnNames": ["userID", "placeID", "rating"],
                "Values": [
                    ["100", "101", "102"],
                    ["200", "201", "202"]
                ]
            },
            "metadata": {
                "ColumnNames": ["sampleRate", "bitsPerSample"],
                "Values": [
                    [44100, 16]
                ]
            }
        },
        "GlobalParameters": {}
    }
}, (err, res) => {
    if (err) return console.log(err);
    console.log(JSON.parse(res.body));
});
The output was as expected:
{ Results:
  { computedMetadata:
    { type: 'table',
      value:
      { ColumnNames: [ '0' ],
        ColumnTypes: [ 'String' ],
        Values:
        [ [ 'The sum of the sampleRate and the bitsPerSecond is 44116' ] ] } } } }
Good luck!
