Working with Python, I have many processes that need to update/insert data into an Azure Table storage at the same time using:
table_service.update_entity(table_name, task)
table_service.insert_entity(table_name, task)
However, the following error occurs:
AzureConflictHttpError: Conflict
{"odata.error":{"code":"EntityAlreadyExists","message":{"lang":"en-US","value":"The specified entity already exists.\nRequestId:57d9b721-6002-012d-3d0c-b88bef000000\nTime:2019-01-29T19:55:53.5984026Z"}}}
Maybe I need to use a global lock to avoid operating on the same table entity concurrently, but I don't know how to use one.
Azure Tables has a new SDK for Python that is available as a preview release on pip; here's an update for the newest library.
On a create method you can use a try/except block to catch the expected error:
from azure.data.tables import TableClient
from azure.core.exceptions import ResourceExistsError
table_client = TableClient.from_connection_string(conn_str, table_name="myTableName")
try:
    table_client.create_entity(entity=my_entity)
except ResourceExistsError:
    print("Entity already exists")
You can use ETag to update entities conditionally after creation.
from azure.data.tables import UpdateMode
from azure.core import MatchConditions
received_entity = table_client.get_entity(
    partition_key="my_partition_key",
    row_key="my_row_key",
)
etag = received_entity._metadata["etag"]
resp = table_client.update_entity(
    entity=my_new_entity,
    etag=etag,
    mode=UpdateMode.REPLACE,
    match_condition=MatchConditions.IfNotModified,
)
On update, you can choose between replace and merge; more information here.
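As a minimal sketch of the merge option (assuming my_new_entity holds only the properties you want to change):
# MERGE keeps the entity's existing properties and only overwrites the ones supplied
resp = table_client.update_entity(
    entity=my_new_entity,
    mode=UpdateMode.MERGE,
)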
(FYI, I am a Microsoft employee on the Azure SDK for Python team)
There isn't a global "Lock" in Azure Table Storage, since it uses optimistic concurrency via ETags (i.e. the If-Match header in raw HTTP requests).
If your thread A is performing insert_entity, it should catch the 409 Conflict error.
If your threads B & C are performing update_entity, they should catch the 412 Precondition Failed error, then use a loop to retrieve the latest entity and try the update again.
For more details, please check the Managing Concurrency in the Table Service section in https://azure.microsoft.com/en-us/blog/managing-concurrency-in-microsoft-azure-storage-2/
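For illustration, here is a rough sketch of that fetch-and-retry loop using the azure-data-tables SDK from the answer above (in that SDK a 409 typically surfaces as ResourceExistsError and a 412 as ResourceModifiedError); the keys, values, and retry limit are placeholders:
from azure.core import MatchConditions
from azure.core.exceptions import ResourceModifiedError
from azure.data.tables import UpdateMode

def update_with_retry(table_client, partition_key, row_key, new_values, max_attempts=5):
    for _ in range(max_attempts):
        # Fetch the latest version of the entity together with its current ETag
        entity = table_client.get_entity(partition_key=partition_key, row_key=row_key)
        entity.update(new_values)  # apply our changes on top of the latest version
        try:
            return table_client.update_entity(
                entity=entity,
                mode=UpdateMode.REPLACE,
                etag=entity._metadata["etag"],  # same attribute as the example above
                match_condition=MatchConditions.IfNotModified,
            )
        except ResourceModifiedError:
            # 412 Precondition Failed: another writer got there first, so retry
            continue
    raise RuntimeError("entity kept changing; giving up after max_attempts")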
I am looking to create a CI/CD pipeline for my Azure SQL Database. I have read about the state-based approach and the migration-based approach described here:
https://devblogs.microsoft.com/azure-sql/devops-for-azure-sql/
However, I want to know if there is an approach I can use to do this in Python. I am looking to deploy both schema and data changes to the other environment through my pipeline. It would be great if I could implement a method that deploys only chosen data points, though; for example, filtering on a stage column for production records.
What kind of approach can I take to accomplish this?
It does not matter if I need to trigger this CI/CD pipeline manually through an API call or something similar; I believe this is also possible in Azure Pipelines.
What you can do is deploy the SQL server normally and then apply the changes by adding a separate task to the .yaml file that executes them.
For this you can either use a DACPAC or run a SQL script directly.
In both cases you have to create your SQL-based script before deploying.
For DACPAC, add the following type of task:
- task: SqlAzureDacpacDeployment@1
  displayName: 'Execute Azure SQL : DacpacTask'
  inputs:
    azureSubscription: '<Azure service connection>'
    ServerName: '<Database server name>'
    DatabaseName: '<Database name>'
    SqlUsername: '<SQL user name>'
    SqlPassword: '<SQL user password>'
    DacpacFile: '<Location of Dacpac file in $(Build.SourcesDirectory) after compilation>'
For a SQL script, add the following type of task:
- task: AzureMysqlDeployment@1
  inputs:
    ConnectedServiceName: # Or alias azureSubscription
    ServerName:
    #DatabaseName: # Optional
    SqlUsername:
    SqlPassword:
    #TaskNameSelector: 'SqlTaskFile' # Optional. Options: SqlTaskFile, InlineSqlTask
    #SqlFile: # Required when taskNameSelector == SqlTaskFile
    #SqlInline: # Required when taskNameSelector == InlineSqlTask
    #SqlAdditionalArguments: # Optional
    #IpDetectionMethod: 'AutoDetect' # Options: AutoDetect, IPAddressRange
    #StartIpAddress: # Required when ipDetectionMethod == IPAddressRange
    #EndIpAddress: # Required when ipDetectionMethod == IPAddressRange
    #DeleteFirewallRule: true # Optional
For a detailed explanation, please refer to the following documentation on the DACPAC task, and to this documentation for the SQL script task.
As of now there is no API to start/stop a preexisting pipeline; I have consulted this documentation on that.
I have been tasked lately with ingesting JSON responses into Databricks Delta Lake. I have to hit the REST API endpoint URL 6500 times with different parameters and pull the responses.
I have tried two modules, ThreadPool and Pool from the multiprocessing library, to make each execution a little quicker.
ThreadPool:
How to choose the number of threads for ThreadPool, when the Azure Databricks cluster is set to autoscale from 2 to 13 worker nodes?
Right now I've set n_pool = multiprocessing.cpu_count(); will it make any difference if the cluster auto-scales?
Pool
When I use Pool to use processes instead of threads, I see the following errors randomly on each execution. I understand from the error that the Spark session/conf is missing and I need to set it from each process, but I am on Databricks with the default Spark session enabled, so why do I see these errors?
Py4JError: SparkConf does not exist in the JVM
**OR**
py4j.protocol.Py4JError: org.apache.spark.api.python.PythonUtils.getEncryptionEnabled does not exist in the JVM
Lastly, I am planning to replace multiprocessing with concurrent.futures.ProcessPoolExecutor. Does it make any difference?
If you're using thread pools, they will run only on the driver node and the executors will be idle. Instead you need to use Spark itself to parallelize the requests. This is usually done by creating a dataframe with a list of URLs (or URL parameters, if the base URL is the same), and then using a Spark user-defined function to make the actual requests. Something like this:
import urllib.request

from pyspark.sql.functions import col, udf

df = spark.createDataFrame([("url1", "params1"), ("url2", "params2")],
                           ("url", "params"))

@udf("body string, status int")
def do_request(url: str, params: str):
    full_url = url + "?" + params  # adjust this as required
    with urllib.request.urlopen(full_url) as f:
        status = f.status
        body = f.read().decode("utf-8")
    return {'status': status, 'body': body}

res = df.withColumn("result", do_request(col("url"), col("params")))
This will return a dataframe with a new column called result that has two fields: status and body (the JSON answer as a string).
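As a small usage sketch on top of that (column names as above), you can then flatten the struct into ordinary columns before parsing or writing the payload:
# Pull the struct fields returned by the UDF out into top-level columns
flat = res.select("url", "result.status", "result.body")
flat.show(truncate=False)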
You can try the following way to resolve these errors:
Py4JError: SparkConf does not exist in the JVM
**OR**
py4j.protocol.Py4JError: org.apache.spark.api.python.PythonUtils.getEncryptionEnabled does not exist in the JVM
Install findspark:
$ pip install findspark
Code:
import findspark
findspark.init()
References: Py4JError: SparkConf does not exist in the JVM and py4j.protocol.Py4JError: org.apache.spark.api.python.PythonUtils.getEncryptionEnabled does not exist in the JVM
I had to upgrade a Docker container that was using the older version of Microsoft Azure's Python packages to download data from an API and then upload a JSON file to Azure Blob Storage. Since the pip install of the former "azure" metapackage is no longer allowed, I have to use the new standalone packages (azure-storage-blob==12.6.0).
After switching from the create_blob_from_path function of the BlockBlobService integrated in the old "azure" package to the new standalone package, BlobClient.upload_blob() fails on larger files with a timeout error that completely ignores the timeout parameter of the function.
I get a ServiceResponseError with the message "Connection aborted / The write operation timed out".
Is there any way to solve that error?
The new function feels like a huge step backwards from create_blob_from_path; the absence of a progress_callback, in particular, is deplorable...
The correct solution, if your control flow allows it, seems to be setting the max_single_put_size to something smaller (like 4MB) when you create the BlobClient. You can do this with a keyword parameter when calling the constructor.
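For example, a minimal sketch of passing that keyword when constructing the client (the connection string, container, and blob names are placeholders):
from azure.storage.blob import BlobClient

blob_client = BlobClient.from_connection_string(
    conn_str="<connection string>",       # placeholder
    container_name="my-container",        # placeholder
    blob_name="my-blob.json",             # placeholder
    max_single_put_size=4 * 1024 * 1024,  # upload in blocks once the payload exceeds 4MB
)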
However, as near as I can tell, this parameter cannot be configured if creating a BlobClient through the BlobClient.from_blob_url control flow. The default value for this is 64MB, and it is easy to hit the default connection timeout before a 64MB PUT is done. In some applications, you may not have access to auth credentials for the storage account (i.e. you're just using a signed URL), so the only way to create a BlobClient is from a BlobClient.from_blob_url call.
It seems like the workaround is to set the poorly documented connection_timeout parameter on the upload_blob call, instead of the timeout parameter. So, something like:
upload_result = block_blob_client.upload_blob(
    data,
    blob_type="BlockBlob",
    content_settings=content_settings,
    length=file_size,
    connection_timeout=600,
)
That parameter is documented on this page:
https://github.com/Azure/azure-sdk-for-python/tree/main/sdk/storage/azure-storage-blob#other-client--per-operation-configuration
However, it is not currently documented in the official BlobClient documentation:
https://learn.microsoft.com/en-us/python/api/azure-storage-blob/azure.storage.blob.blobclient?view=azure-python
I've filed this documentation bug: https://github.com/Azure/azure-sdk-for-python/issues/22936
I tested with the following code, and it uploaded the file (~10 MB) successfully.
from azure.storage.blob import BlobServiceClient

blob_service_client = BlobServiceClient.from_connection_string(connect_str)

# Create a blob client using the local file name as the name for the blob
blob_client = blob_service_client.get_blob_client(container=container_name, blob=local_file_name)

# Upload content to block blob
with open(SOURCE_FILE, "rb") as data:
    blob_client.upload_blob(data, blob_type="BlockBlob")
Not sure how you set the timeout value; here is an example of uploading a blob with a timeout setting:
with open(upload_file_path,"rb") as data:
blob_client.upload_blob(data=data,timeout=600) # timeout is set to 600 seconds
If the timeout is ignored, another workaround is to upload the blob in chunks, with code like below:
import uuid

from azure.storage.blob import BlobBlock

# upload data in chunks
block_list = []
chunk_size = 1024
with open(upload_file_path, 'rb') as f:
    while True:
        read_data = f.read(chunk_size)
        if not read_data:
            break  # done
        blk_id = str(uuid.uuid4())
        blob_client.stage_block(block_id=blk_id, data=read_data)
        block_list.append(BlobBlock(block_id=blk_id))
blob_client.commit_block_list(block_list)
This worked for me:
blob_service_client = BlobServiceClient.from_connection_string(connect_str)
blob_service_client.max_single_put_size = 4*1024*1024
blob_service_client.timeout = 180
container_client = blob_service_client.get_container_client(container_name)
container_client.upload_blob(data=file, name=key, max_concurrency=12)
Also check
https://learn.microsoft.com/en-us/python/api/azure-storage-blob/azure.storage.blob.blobclient?view=azure-python
I am developing an application with the Cloud Datastore Emulator (2.1.0) and the google-cloud-ndb Python library (1.6).
I find that there is an intermittent delay on entities being retrievable via a query.
For example, if I create an entity like this:
my_entity = MyEntity(foo='bar')
my_entity.put()
get_my_entity = MyEntity.query().filter(MyEntity.foo == 'bar').get()
print(get_my_entity.foo)
it will fail intermittently because the get() method returns None.
This only happens on about 1 in 10 calls.
To demonstrate, I've created this script (also available with ready to run docker-compose setup on GitHub):
import random
from google.cloud import ndb
from google.auth.credentials import AnonymousCredentials
client = ndb.Client(
    credentials=AnonymousCredentials(),
    project='local-dev',
)

class SampleModel(ndb.Model):
    """Sample model."""

    some_val = ndb.StringProperty()

for x in range(1, 1000):
    print(f'Attempt {x}')
    with client.context():
        random_text = str(random.randint(0, 9999999999))
        new_model = SampleModel(some_val=random_text)
        new_model.put()
        retrieved_model = SampleModel.query().filter(
            SampleModel.some_val == random_text
        ).get()
        print(f'Model Text: {retrieved_model.some_val}')
What would be the correct way to avoid this intermittent failure? Is there a way to ensure the entity is always available after the put() call?
Update
I can confirm that this is only an issue with the Datastore emulator. When testing on App Engine with Firestore in Datastore mode, entities are available immediately after calling put().
The issue turned out to be related to the emulator trying to replicate eventual consistency.
Unlike relational databases, Datastore does not guarantee that the data will be available immediately after it is written, because there are often replication and indexing delays.
For things like unit tests, this can be resolved by passing --consistency=1.0 to the datastore start command (e.g. gcloud beta emulators datastore start --consistency=1.0), as documented here.
I am currently implementing the Aerospike Python Client in order to benchmark it along with our Redis implementation, to see which is faster and/or more stable.
I'm still taking baby steps, currently unit-testing basic functionality, for example whether I correctly add records to my set. For that reason, I want to create a function to count them.
I saw in Aerospike's documentation that:
"to perform an aggregation on query, you first need to register a UDF
with the database".
It seems that this is the suggested way that aggregations, counting and other custom functionality should be run in Aerospike.
Therefore, to count the records in a set I have, I created the following module:
# "counter.lua"
function count(s)
    return s : map(function() return 1 end) : reduce(function(a, b) return a + b end)
end
I'm trying to use the Aerospike Python client's function to register a UDF (User-Defined Function) module:
udf_put(filename, udf_type, policy)
My code is as follows:
# aerospike_client.py:
# "udf_put" parameters
policy = {'timeout': 1000}
lua_module = os.path.join(os.path.dirname(os.path.realpath(__file__)), "counter.lua") #same folder
udf_type = aerospike.UDF_TYPE_LUA # equals to "0", which is for "Lua"
self.client.udf_put(lua_module, udf_type, policy) # Exception is thrown here
query = self.client.query(self.aero_namespace, self.aero_set)
query.select()
result = query.apply('counter', 'count')
An exception is thrown:
exceptions.Exception: (-2L, 'Filename should be a string', 'src/main/client/udf.c', 82)
Is there anything I'm missing or doing wrong?
Is there a way to "debug" it without compiling C code?
Is there any other suggested way to count the records in my set? Or I'm fine with the Lua module?
First, I'm not seeing that exception, but I am seeing a bug with udf_put where the module is registered but the Python process hangs. I can see the module appear on the server using aql's show modules.
I opened a bug with the Python client's repo on Github, aerospike/aerospike-client-python.
There's a best practices document regarding UDF development here: https://www.aerospike.com/docs/udf/best_practices.html
In general, using a stream UDF to aggregate the records through the count function is the correct way to go about it. There are examples here and here.
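For reference, a rough sketch of registering and applying such a stream UDF from the Python client (hosts, namespace, and set names are placeholders; error handling omitted):
import aerospike

client = aerospike.client({'hosts': [('127.0.0.1', 3000)]}).connect()

# Register the Lua module, then run the aggregation against the set
client.udf_put('counter.lua', aerospike.UDF_TYPE_LUA, {'timeout': 1000})
query = client.query('test', 'demo')   # namespace, set -- placeholders
query.apply('counter', 'count')        # apply the stream UDF as an aggregation
results = query.results()              # e.g. [42] -- the record count
print(results)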