Amazon SageMaker - Unable to evaluate payload provided - python

I built a SageMaker endpoint that I am attempting to invoke using Lambda + API Gateway. I'm getting the following error:
"An error occurred (ModelError) when calling the InvokeEndpoint operation: Received client error (400) from model with message \"unable to evaluate payload provided\""
I know what it's complaining about, but I don't quite understand why it's occurring. I have confirmed that the shape of the input data in my Lambda function matches what I trained the model on. The following is my input payload in Lambda:
X = pd.concat([X, rx_norm_dummies, urban_dummies], axis=1)
payload = X.to_numpy()
response = runtime.invoke_endpoint(EndpointName=ENDPOINT_NAME,
                                   ContentType='application/json',
                                   Body=payload)
In the Jupyter notebook where I trained the model and created the endpoint, I can also invoke the model with a NumPy ndarray, so I'm confused about why I'm getting this error.
y = X[0:10]
result = linear_predictor.predict(y)
print(result)
Here is a modification I make to the serialization of the endpoint:
from sagemaker.predictor import csv_serializer, json_deserializer
linear_predictor.content_type = 'text/csv'
linear_predictor.serializer = csv_serializer
linear_predictor.deserializer = json_deserializer
I'm new to SageMaker/Lambda, so any help would be appreciated, and I can share more code for context if needed. I've tried various formats and cannot get this to work.

Not sure which algorithm you are trying to use in SageMaker, but the documentation has the quote below. If your CSV has a header, or if you don't specify label_size, a bad request (HTTP status code 400) may be what you get, because the payload isn't compliant with what AWS expects to receive.
Amazon SageMaker requires that a CSV file does not have a header record and that the target variable is in the first column. To run unsupervised learning algorithms that don't have a target, specify the number of label columns in the content type. For example, in this case 'content_type=text/csv;label_size=0'.
From: https://docs.aws.amazon.com/sagemaker/latest/dg/cdf-training.html#cdf-csv-format
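As an illustration of that requirement, something like this might work (a sketch, not the poster's exact setup; ENDPOINT_NAME and X are assumed to be the same objects as in the question):
import io
import boto3

runtime = boto3.client('sagemaker-runtime')

# The built-in algorithms reject CSV payloads that contain a header row,
# so write the features without header or index.
csv_buffer = io.StringIO()
X.to_csv(csv_buffer, header=False, index=False)

response = runtime.invoke_endpoint(
    EndpointName=ENDPOINT_NAME,
    ContentType='text/csv',  # per the quoted docs, 'text/csv;label_size=0' when there is no target column
    Body=csv_buffer.getvalue(),
)
print(response['Body'].read().decode('utf-8'))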

In your code, you're telling the SDK that you are passing JSON, but payload is actually a NumPy array.
I would suggest replacing payload = X.to_numpy() with payload = X.to_json(orient='records'), or something similar, to convert it to JSON first.
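A minimal sketch of that suggestion, reusing the runtime client and X from the question, and assuming the model container actually accepts application/json in this layout:
payload = X.to_json(orient='records')  # a JSON string instead of an ndarray
response = runtime.invoke_endpoint(
    EndpointName=ENDPOINT_NAME,
    ContentType='application/json',  # the declared content type now matches the body
    Body=payload,
)
print(response['Body'].read().decode('utf-8'))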

Related

how to generate the json format for google cloud predictions

I am trying to make predictions from my custom model on Vertex AI but am getting errors.
I have deployed the model to an endpoint and I request prediction using this line
gcloud beta ai endpoints predict 1234 --region=europe-west4 --json-request=request2.json
and I get this response
Using endpoint [https://europe-west4-prediction-aiplatform.googleapis.com/]ERROR: (gcloud.beta.ai.endpoints.predict) FAILED_PRECONDITION: "Prediction failed: Exception during xgboost prediction: feature_names mismatch:
I have created the data with this code (and later renamed it to request2.json):
test3 = {}
test3["instances"] = test_set2.head(20).values.tolist()
with open('05_GCP/test_v3.jsonl', 'w') as outfile:
    json.dump(test3, outfile, indent=2)
This generates a file which looks like this.
The error tells me that it's expecting column names per value instead of nothing, which are then interpreted as f0, f1, etc.
My challenge is that I don't know how to generate data that looks like this (also from the help file).
Though the result with the mismatched column names suggests I need a different format.
I tried:
import json
test4 = X_test.head(20).to_json(orient='records', lines=True)
with open('05_GCP/test_v4.json', 'w') as outfile:
    json.dump(test4, outfile, indent=2)
which gives me data with many line breaks in it, and Cloud Shell tells me this isn't JSON.
I also replicated this format and was informed it is not a JSON file.
So, two questions:
1. How do I create a JSON file that has the appropriate format for a live prediction?
2. How do I create a JSONL file so I can run batch jobs? This is actually what I am trying to get to. I have used CSV, but it returns errors:
('Post request fails. Cannot get predictions. Error: Predictions are not in the response. Got: {"error": "Prediction failed: Could not initialize DMatrix from inputs: ('Expecting 2 dimensional numpy.ndarray, got: ', (38,))"}.', 2)
This CSV data is the exact same data I use to measure the model error whilst training (I know that's not good practice, but this is just a test run).
UPDATE
Following Raj's suggestion, I tried creating two extra models: one where I changed my training code to use X_train.values, and another where I renamed all the columns to F0:F426, since the response from the endpoint said the JSON file didn't match the column names.
Both of these SUCCEEDED with the test3 JSON file above when deployed to endpoints.
However, I want this to return batch predictions, and there it returns the same errors. This is therefore clearly a formatting error, but I have no clue how to get there.
It should be pointed out that batch predictions need a JSONL file, not a JSON file; passing a JSON file doesn't work. All I have done is change the extension on the JSON file when I loaded it so that it appears to be a JSONL file. I have not found anything that helps me create one properly. Tips are welcome.
I noticed that my data was in double brackets, so I created a version with only a single set of brackets and ran it against one of the models, but it also returned errors. That was also only one prediction, per the other comment.
Appreciate the help,
James
This worked for me:
If you train the model with df.values as per Raj's answer, you can then pass the instances you want batch predictions for in a JSONL file ("input.jsonl", for example), with the following format for each instance/row:
[3.0,1.0,30.0,1.0,0.0,16.1]
File would look something like this for 5 rows to predict:
[3.0,1.0,30.0,1.0,0.0,16.1]
[3.0,0.0,22.0,0.0,0.0,9.8375]
[2.0,0.0,45.0,1.0,1.0,26.25]
[1.0,0.0,21.0,0.0,0.0,26.55]
[3.0,1.0,16.0,4.0,1.0,39.6875]
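A minimal sketch of one way to produce a file in that format from a DataFrame (X_test and the 20-row slice are taken from the question; the file name is just an example):
import json

# Write one bare JSON array per line: the JSONL layout shown above,
# with no "instances" wrapper and no column names.
with open('input.jsonl', 'w') as outfile:
    for row in X_test.head(20).values.tolist():
        outfile.write(json.dumps(row) + '\n')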

Azure ML time series model inference error during data input (python)

In the Azure ML Studio I prepared a model with AutoML for time series forecasting. The data have some rare gaps in all data sets.
I am using the following code to call for a deployed Azure AutoML model as a web service:
import requests
import json
import pandas as pd
# URL for the web service
scoring_uri = 'http://xxxxxx-xxxxxx-xxxxx-xxxx.xxxxx.azurecontainer.io/score'
# Four rows of data to score, so we get four results back
new_data = pd.DataFrame([
    ['2020-10-04 19:30:00', 1.29281, 1.29334, 1.29334, 1.29334, 1],
    ['2020-10-04 19:45:00', 1.29334, 1.29294, 1.29294, 1.29294, 1],
    ['2020-10-04 21:00:00', 1.29294, 1.29217, 1.29334, 1.29163, 34],
    ['2020-10-04 21:15:00', 1.29217, 1.29257, 1.29301, 1.29115, 195]],
    columns=['1', '2', '3', '4', '5', '6']
)
# Convert to a JSON string
input_data = json.dumps({'data': new_data.to_dict(orient='records')})
# Set the content type
headers = {'Content-Type': 'application/json'}
# Make the request and display the response
resp = requests.post(scoring_uri, input_data, headers=headers)
print(resp.text)
I am getting an error:
{\"error\": \"DataException:\\n\\tMessage: No y values were provided. We expected non-null target values as prediction context because there is a gap between train and test and the forecaster depends on previous values of target. If it is expected, please run forecast() with ignore_data_errors=True. In this case the values in the gap will be imputed.\\n\\tInnerException: None\\n\\tErrorResponse \\n{\\n
I tried adding ignore_data_errors=True in different parts of the code without success, and instead got another error:
TypeError: __init__() got an unexpected keyword argument 'ignore_data_errors'
I would very much appreciate any help, as I am stuck on this.
To avoid the error you're seeing in time series forecasting, enable Autodetect for the Forecast Horizon. Only ideal time series data tends to work with a manually set horizon, which is not helpful for real-world cases.

Access data on AML datastore from training script

I am looking for a working example of how to access data on an Azure Machine Learning managed datastore from within a train.py script. I followed the instructions in the link, and my script is able to resolve the datastore.
However, whatever I tried (as_download(), as_mount()), the only thing I ever got back was a DataReference object. Or maybe I just don't understand how to actually read data from a file with that.
from azureml.core import Run, Datastore

run = Run.get_context()
exp = run.experiment
ws = run.experiment.workspace
ds = Datastore.get(ws, datastore_name='mydatastore')
data_folder_mount = ds.path('mnist').as_mount()
# So far this all works. But how to go from here?
You can pass the DataReference object you created as an input to your training job (ScriptRunConfig/Estimator/HyperDrive/pipeline). Then, in your training script, you can access the mounted path via a script argument.
full tutorial: https://learn.microsoft.com/en-us/azure/machine-learning/service/tutorial-train-models-with-aml
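A minimal sketch of that pattern, following the linked tutorial and the SDK v1 Estimator; it assumes the Datastore/as_mount() lines from the question run on the submission side, and that an Experiment exp and a compute_target already exist:
from azureml.train.estimator import Estimator

# Submission side: hand the mounted DataReference to the script as a command-line argument.
est = Estimator(source_directory='.',
                entry_script='train.py',
                compute_target=compute_target,
                script_params={'--data-folder': data_folder_mount})
run = exp.submit(est)
Inside train.py the mounted path then behaves like a local directory:
import argparse
import os

parser = argparse.ArgumentParser()
parser.add_argument('--data-folder', type=str, help='mounted path of the datastore folder')
args = parser.parse_args()
print(os.listdir(args.data_folder))  # e.g. the files under 'mnist' on mydatastore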

Python API sentinelsat error in download

I am having a go at using the sentinelsat Python API to download satellite imagery. However, I am receiving error messages when I try to convert the results to a pandas DataFrame. This code works and downloads my requested Sentinel satellite images:
from sentinelsat import SentinelAPI, read_geojson, geojson_to_wkt
from datetime import date
api = SentinelAPI('*****', '*****', 'https://scihub.copernicus.eu/dhus')
footprint = geojson_to_wkt(read_geojson('testAPIpoly.geojson'))
products = api.query(footprint, cloudcoverpercentage = (0,10))
#this works
api.download_all(products)
However if I instead attempt to convert to a pandas dataframe
#api.download_all(products)
#this does not work
products_df = api.to_dataframe(products)
api.download_all(products_df)
I receive an extensive error message that includes
"sentinelsat.sentinel.SentinelAPIError: HTTP status 500 Internal Server Error: InvalidKeyException : Invalid key (processed) to access Products
"
(where processed is also replaced with title, platformname, processingbaseline, etc.). I've tried a few different ways to convert to a dataframe and filter/sort results and have received an error message every time (note: I have pandas/geopandas installed). How can I convert to a dataframe and filter/sort with the sentinelsat API?
Instead of
api.download_all(products_df)
try
api.download_all(products_df.index)
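For the filter/sort part of the question, a sketch of the usual pattern (cloudcoverpercentage is assumed to be a column in the result because it was used in the query; keeping five products is arbitrary):
products_df = api.to_dataframe(products)

# Sort by cloud cover and keep the five least cloudy products.
# download_all() expects product IDs, which are the DataFrame's index,
# not the DataFrame itself.
products_df_sorted = products_df.sort_values('cloudcoverpercentage', ascending=True)
api.download_all(products_df_sorted.head(5).index)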

if_modified_since not working with azure blob storage python

I am working with Azure Functions to create triggers that aggregate data on an hourly basis. The triggers get data from blob storage, and to avoid aggregating the same data twice I want to add a condition that lets me only process blobs that were modified in the last hour.
I am using the SDK, and my code for doing this looks like this:
''' Timestamp variables for t.now and t-1 '''
timestamp = datetime.now(tz=utc)
timestamp_negative1hr = timestamp + timedelta(hours=1)
''' Read data from input environment '''
data = BlockBlobService(account_name='accname', account_key='key')
generator = data.list_blobs('directory')
dataloaded = []
for blob in generator:
    loader = data.get_blob_to_text('collection', blob.name, if_modified_since=timestamp_negative1hr)
    trackerstatusobjects = loader.content.split('\n')
    for trackerstatusobject in trackerstatusobjects:
        dataloaded.append(json.loads(trackerstatusobject))
When I run this, the error I get is azure.common.AzureHttpError: The condition specified using HTTP conditional header(s) is not met. It is also specified that it is due to a timeout. The blobs are receiving data when I run it, so in any case it is not the correct return message. If I add .strftime("%Y-%m-%d %H:%M:%S:%z") to the end of my timestamp, I get another error: AttributeError: 'str' object has no attribute 'tzinfo'. This must mean that Azure expects a datetime object, but for some reason it is not working for me.
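For reference, a minimal sketch of what passing a datetime object (rather than a strftime string) might look like here, assuming utc comes from pytz as in the snippet above and that the intent is "blobs modified in the last hour":
from datetime import datetime, timedelta
from pytz import utc

# if_modified_since takes a timezone-aware datetime, not a formatted string;
# subtracting the timedelta matches the "t-1" intent in the comment above.
one_hour_ago = datetime.now(tz=utc) - timedelta(hours=1)
loader = data.get_blob_to_text('collection', blob.name, if_modified_since=one_hour_ago)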
Any ideas on how to solve it? Thanks
