if_modified_since not working with azure blob storage python

I am working with Azure Functions to create triggers that aggregate data on an hourly basis. The triggers get data from blob storage, and to avoid aggregating the same data twice I want to add a condition that lets me only process blobs that were modified within the last hour.
I am using the SDK, and my code for doing this looks like this:
import json
from datetime import datetime, timedelta
from pytz import utc  # assumption: utc here is pytz.utc (datetime.timezone.utc would also work)
from azure.storage.blob import BlockBlobService

''' Timestamp variables for t.now and t-1 '''
timestamp = datetime.now(tz=utc)
timestamp_negative1hr = timestamp+timedelta(hours=1)
''' Read data from input environment '''
data = BlockBlobService(account_name='accname', account_key='key')
generator = data.list_blobs('directory')
dataloaded = []
for blob in generator:
    loader = data.get_blob_to_text('collection', blob.name, if_modified_since=timestamp_negative1hr)
    trackerstatusobjects = loader.content.split('\n')
    for trackerstatusobject in trackerstatusobjects:
        dataloaded.append(json.loads(trackerstatusobject))
When I run this, the error I get is azure.common.AzureHttpError: The condition specified using HTTP conditional header(s) is not met. It is also specified that it is due to a timeout. The blobs are receiving data when I run it, so in any case it is not the correct return message. If I add .strftime("%Y-%m-%d %H:%M:%S:%z") to the end of my timestamp, I get another error: AttributeError: 'str' object has no attribute 'tzinfo'. This must mean that Azure expects a datetime object, but for some reason it is not working for me.
Any ideas on how to solve it? Thanks
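For reference, a minimal sketch of passing if_modified_since as a timezone-aware datetime with the legacy BlockBlobService SDK used above; account, key and container names are the placeholders from the question, and the cutoff is deliberately one hour in the past rather than the future. The try/except reflects that this SDK raises AzureHttpError when a blob has not been modified since the cutoff:
import json
from datetime import datetime, timedelta, timezone

from azure.common import AzureHttpError
from azure.storage.blob import BlockBlobService

cutoff = datetime.now(timezone.utc) - timedelta(hours=1)  # one hour in the past, timezone-aware

service = BlockBlobService(account_name='accname', account_key='key')
recent = []
for blob in service.list_blobs('collection'):
    try:
        loader = service.get_blob_to_text('collection', blob.name,
                                          if_modified_since=cutoff)
    except AzureHttpError:
        # raised when the blob has NOT been modified since the cutoff; skip it
        continue
    for line in loader.content.split('\n'):
        if line:
            recent.append(json.loads(line))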

Related

PYTHON: Issues with Infusionsoft Deleting Method

I am new to Python and working with the Infusionsoft API, and I am hitting a snag here. I am writing a script that retrieves all of the contacts in our system and adds them to a Pandas DataFrame if they contain a given string. From what I can tell, my code for retrieving the contacts is good, and it will even break the result down to just the ID number of the contact I want to delete. The issue comes when I try to pass that data into my delete method.
When I first started looking into this, I found a GitHub project (see here: https://github.com/GearPlug/infusionsoft-python) and planned to use the method delete_contact = Client.delete_contact('ID'), which takes the param 'ID' as a string. I have broken it down in my code so that the IDs are read into an array as strings, and my program iterates over them and prints out all of the strings like so:
1
2
3
What has me thrown off is that when I try to pass them into the method delete_contact = client.delete_contact('ID'), it comes back with:
File "C:\Users\Bryan\OneDrive\Desktop\Python_Scripts\site-packages\NEW_Infusion_Script.py", line 28, in <module>
delete_contact(infusion_id)
File "C:\Users\Bryan\OneDrive\Desktop\Python_Scripts\site-packages\NEW_Infusion_Script.py", line 26, in delete_contact
Client.delete_contact('id')
TypeError: Client.delete_contact() missing 1 required positional argument: 'id'
Here is my code with the obvious API keys removed:
import pandas as pd
import infusionsoft
from infusionsoft.client import Client
import xmlrpc.client
#Python has built-in support for xml-rpc. All you need to do is add the
#line above.
#Set up the API server variable
server = xmlrpc.client.ServerProxy("https://productname.infusionsoft.com:443/api/xmlrpc")
key = "#The encrypted API key"
test_rigor = []
var = server.DataService.findByField(key,"Contact",100,0,"Email","%testrigor-mail.com",["LastName","Id",'Email'] )
for result in var:
    server.DataService.update(key,"Contact",result["Id"],{"LastName":" "})
    test_rigor.append
##create a Data Frame from the info pull above
df = pd.DataFrame.from_dict(var)
print("Done")
print(var)
df
##Pull the data and put into a separate array and feed that into the delete method
infusion_ids = []
for num in df['Id']:
    infusion_ids.append(num)
def delete_contact(x):
    Client.delete_contact('id')
for infusion_id in infusion_ids:
    infusion_id[0]
    delete_contact(infusion_id[0])
    infusion_id.pop(0)
##print(df)
Any suggestions or obvious missteps would be greatly appreciated thanks!
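For what it's worth, a hedged sketch of what the traceback above points at: Client.delete_contact is called on the class itself, so the string 'id' is consumed as self and the real contact ID never reaches the method. The constructor arguments below are placeholders, not the library's confirmed signature (check the GearPlug infusionsoft-python README for the exact auth parameters):
from infusionsoft.client import Client

# Assumption: these auth arguments are placeholders; the exact constructor
# signature is documented in the GearPlug infusionsoft-python README.
client = Client('CLIENT_ID', 'CLIENT_SECRET', 'ACCESS_TOKEN')

def delete_contact(contact_id):
    # call the method on an instance and pass the real ID, not the literal string 'id'
    return client.delete_contact(str(contact_id))

for infusion_id in infusion_ids:
    delete_contact(infusion_id)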

PySpark groupby applyInPandas save objects as files problem

I run PySpark 3.1 on a Windows computer in local mode on Jupyter Notebook. I call "applyInPandas" on a Spark DataFrame.
The function below applies a few data transformations to the input Pandas DataFrame and trains an SGBT model. Then it serializes the trained model into binary and saves it to an S3 bucket as an object. Finally it returns the DataFrame. I call this function from a Spark DataFrame grouped by two columns in the last line. I receive no error, and the returned DataFrame is the same length as the input. Data for each group is returned.
The problem is the saved model objects. There are objects saved in S3 only for 2 groups when there were supposed to be models for each group. There is no missing/wrong data point that would cause model training to fail. (I'd receive an error or warning anyway.) What I have tried so far:
Replace S3 and save to local file system: The same result.
Replace "pickle" with "joblib" and "BytesIO": The same result.
Repartition before calling the function: Now I had more objects saved for different groups, but not all. [I did this by calling "val_large_df.coalesce(1).groupby('la..." in the last line.]
So I suspect this is about parallelism and distribution, but I could not figure it out. Thank you already.
def train_sgbt(pdf):
    ##Some data transformations here##
    #Train the model
    sgbt_mdl = GradientBoostingRegressor(--Params.--).fit(--Params.--)
    sgbt_mdl_b = pickle.dumps(sgbt_mdl)  #Serialize
    #Initiate s3_client
    s3_client = boto3.client(--Params.--)
    #Put file in S3
    s3_client.put_object(Body=sgbt_mdl_b, Bucket='my-bucket-name',
                         Key="models/BT_"+str(pdf.latGroup_m[0])+"_"+str(pdf.lonGroup_m[0])+".mdl")
    return pdf

dummy_df = val_large_df.groupby("latGroup_m", "lonGroup_m").applyInPandas(train_sgbt,
                                                                          schema="fcast_error double")
dummy_df.show()
Spark evaluates dummy_df lazily, and therefore train_sgbt will only be called for the groups that are required to complete the Spark action.
The Spark action here is show(). This action prints only the first 20 rows, so train_sgbt is only called for the groups that have at least one element in the first 20 rows. Spark may evaluate more groups, but there is no guarantee for it.
One way to solve the problem would be to call another action, for example writing the result out as csv.
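A minimal sketch of that suggestion, reusing the dummy_df from the question; the output path is a placeholder:
dummy_df = val_large_df.groupby("latGroup_m", "lonGroup_m").applyInPandas(
    train_sgbt, schema="fcast_error double")

# show() only materializes the first 20 rows; writing the full result forces
# train_sgbt to run for every (latGroup_m, lonGroup_m) group.
dummy_df.write.mode("overwrite").csv("output/dummy_df_csv")  # placeholder path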

Access data on AML datastore from training script

I am looking for a working example of how to access data on an Azure Machine Learning managed datastore from within a train.py script. I followed the instructions in the link, and my script is able to resolve the datastore.
However, whatever I tried (as_download(), as_mount()), the only thing I always got was a DataReference object. Or maybe I just don't understand how to actually read data from a file with that.
from azureml.core import Run, Datastore

run = Run.get_context()
exp = run.experiment
ws = run.experiment.workspace
ds = Datastore.get(ws, datastore_name='mydatastore')
data_folder_mount = ds.path('mnist').as_mount()
# So far this all works. But how to go from here?
You can pass in the DataReference object you created as an input to your training job (ScriptRun/Estimator/HyperDrive/Pipeline). Then, in your training script, you can access the mounted path via a script argument.
full tutorial: https://learn.microsoft.com/en-us/azure/machine-learning/service/tutorial-train-models-with-aml
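A sketch of that pattern, assuming the older azureml-sdk v1 Estimator API from the linked tutorial; the compute target, script name and experiment name are placeholders:
from azureml.core import Workspace, Datastore, Experiment
from azureml.train.estimator import Estimator

ws = Workspace.from_config()
ds = Datastore.get(ws, datastore_name='mydatastore')

# the DataReference goes in as a script parameter...
est = Estimator(source_directory='.',
                entry_script='train.py',                  # placeholder script name
                compute_target='my-compute',              # placeholder compute target
                script_params={'--data-folder': ds.path('mnist').as_mount()})
run = Experiment(ws, 'mnist-train').submit(est)

# ...and train.py reads the mounted path back as a plain directory:
#
#   import argparse, os
#   parser = argparse.ArgumentParser()
#   parser.add_argument('--data-folder', type=str)
#   args = parser.parse_args()
#   print(os.listdir(args.data_folder))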

Python API sentinelsat error in download

I am having a go at using the sentinelsat python API to download satellite imagery. However, I am receiving error messages when I try to convert to a pandas dataframe. This code works and downloads my requested sentinel satellite images:
from sentinelsat import SentinelAPI, read_geojson, geojson_to_wkt
from datetime import date
api = SentinelAPI('*****', '*****', 'https://scihub.copernicus.eu/dhus')
footprint = geojson_to_wkt(read_geojson('testAPIpoly.geojson'))
products = api.query(footprint, cloudcoverpercentage = (0,10))
#this works
api.download_all(products)
However, if I instead attempt to convert to a pandas dataframe:
#api.download_all(products)
#this does not work
products_df = api.to_dataframe(products)
api.download_all(products_df)
I receive an extensive error message that includes:
"sentinelsat.sentinel.SentinelAPIError: HTTP status 500 Internal Server Error: InvalidKeyException : Invalid key (processed) to access Products"
(where processed is also replaced with title, platformname, processingbaseline, etc.). I've tried a few different ways to convert to a dataframe and filter/sort results and have received an error message every time (note: I have pandas/geopandas installed). How can I convert to a dataframe and filter/sort with the sentinelsat API?
Instead of
api.download_all(products_df)
try
api.download_all(products_df.index)
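If you also want to filter or sort before downloading, a small sketch along the same lines (cloudcoverpercentage is one of the metadata columns returned by the query):
products_df = api.to_dataframe(products)

# sort by cloud cover and keep the five clearest scenes
products_df_sorted = products_df.sort_values('cloudcoverpercentage', ascending=True)
products_df_subset = products_df_sorted.head(5)

# download_all expects product IDs, which are the dataframe index, not the dataframe itself
api.download_all(products_df_subset.index)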

Redis get data using python

I have done the following to get JSON file data into Redis using this Python script:
import json
import redis

r = redis.StrictRedis(host='127.0.0.1', port=6379, db=1)
with open('products.json') as data_file:
    test_data = json.load(data_file)
r.set('test_json', test_data)
When I use the get command from redis-cli (get test_json) I get nil back.
I must be using the wrong command?
Please help me understand this.
You should use hmset instead of set and hgetall instead of get to store and read multiple fields; your code should look like:
r.hmset('test_json', test_data) #to set multiple index data
r.hgetall('test_json') #to get multiple index data
I deleted the previous answer. I hadn't noticed that the problem there is that you specified db=1 in the Redis constructor, so you are saving the data in database 1. Type 'select 1' in the redis client, or remove db=1 from the constructor (by default, redis-cli connects to database 0).
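Putting both answers together, a minimal sketch, assuming products.json holds a flat JSON object (nested values would need json.dumps first):
import json
import redis

r = redis.StrictRedis(host='127.0.0.1', port=6379, db=1, decode_responses=True)

with open('products.json') as data_file:
    test_data = json.load(data_file)

r.hmset('test_json', test_data)   # each top-level field becomes a hash field
print(r.hgetall('test_json'))     # returns the whole hash back as a dict

# In redis-cli remember to run: SELECT 1, then HGETALL test_json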
