I'm trying to use SageMaker to serve precomputed predictions. The predictions live in a Python dictionary with the following format:
customer_group prediction
1 50
2 60
3 25
4 30
...
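In dictionary form that is roughly:

# The precomputed predictions, keyed by customer group (values from the table above)
predictions = {1: 50, 2: 60, 3: 25, 4: 30}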
Currently the Docker serve API code goes to S3 and downloads the data daily.
The problem is that downloading the data blocks the API from responding to the SageMaker health endpoint (/ping) calls.
This is a case study of how Zappos did it using Amazon DynamoDB. However, is there a way to do it in SageMaker?
Where and how can I add the S3 download function so that it doesn't interrupt the health check?
Could this work? -> https://github.com/seomoz/s3po
https://blog.miguelgrinberg.com/post/the-flask-mega-tutorial-part-x-email-support
This is my current serve code:
# Imports assumed for this snippet; ScoringService comes from elsewhere in the container
import flask
import pandas as pd
from io import StringIO

app = flask.Flask(__name__)

@app.route('/ping', methods=['GET'])
def ping():
    """Determine if the container is working and healthy. In this sample container, we declare
    it healthy if we can load the model successfully."""
    health = ScoringService.get_model() is not None  # You can insert a health check here
    status = 200 if health else 404
    return flask.Response(response='\n', status=status, mimetype='application/json')

@app.route('/invocations', methods=['POST'])
def transformation():
    """Do an inference on a single batch of data. In this sample server, we take data as CSV, convert
    it to a pandas data frame for internal use and then convert the predictions back to CSV (which really
    just means one prediction per line, since there's a single column).
    """
    data = None

    # Convert from CSV to pandas
    if flask.request.content_type == 'text/csv':
        data = flask.request.data.decode('utf-8')
        s = StringIO(data)
        data = pd.read_csv(s, header=None)
    else:
        return flask.Response(response='This predictor only supports CSV data', status=415, mimetype='text/plain')

    print('Invoked with {} records'.format(data.shape[0]))

    # Do the prediction
    predictions = ScoringService.predict(data)

    # Convert from numpy back to CSV
    out = StringIO()
    pd.DataFrame({'results': predictions}).to_csv(out, header=False, index=False)
    result = out.getvalue()

    return flask.Response(response=result, status=200, mimetype='text/csv')
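I was thinking of something along these lines: a background thread that refreshes the predictions from S3 so /ping and /invocations never block on the download (an untested sketch; the bucket/key names and JSON format are placeholders, not my actual setup):

import json
import threading
import time

import boto3

PREDICTIONS = {}                 # customer_group -> prediction, used by /invocations
REFRESH_SECONDS = 24 * 60 * 60   # daily refresh

def refresh_predictions():
    """Download the precomputed predictions from S3, off the request path."""
    s3 = boto3.client('s3')
    while True:
        try:
            obj = s3.get_object(Bucket='my-bucket', Key='predictions.json')  # placeholders
            PREDICTIONS.update({int(k): v for k, v in json.loads(obj['Body'].read()).items()})
        except Exception as exc:
            # Keep serving the previous (stale) predictions if the download fails.
            print('Prediction refresh failed: {}'.format(exc))
        time.sleep(REFRESH_SECONDS)

# Start the refresher when the server process starts; the Flask routes stay responsive.
threading.Thread(target=refresh_predictions, daemon=True).start()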
Why not call batch transform instead and let AWS do the heavy lifting?
You can either schedule it to run every day, or trigger it manually.
After that, use either API Gateway with a Lambda function or CloudFront to serve the results from S3.
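For illustration, the daily schedule could be a small Lambda or cron job that starts the transform job with boto3; a sketch (the model, bucket and instance names below are placeholders, not part of the original answer):

import boto3
from datetime import datetime

sagemaker = boto3.client('sagemaker')

# Start a batch transform job over the customer groups stored in S3.
sagemaker.create_transform_job(
    TransformJobName='daily-predictions-{}'.format(datetime.utcnow().strftime('%Y%m%d')),
    ModelName='my-precomputed-model',                                        # placeholder
    TransformInput={
        'DataSource': {'S3DataSource': {'S3DataType': 'S3Prefix',
                                        'S3Uri': 's3://my-bucket/input/'}},  # placeholder
        'ContentType': 'text/csv',
    },
    TransformOutput={'S3OutputPath': 's3://my-bucket/output/'},              # placeholder
    TransformResources={'InstanceType': 'ml.m5.large', 'InstanceCount': 1},
)

The output lands in S3, where a Lambda behind API Gateway (or CloudFront) can serve it per customer group.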
Related
In Azure ML Studio I prepared a model with AutoML for time series forecasting. The data has some rare gaps in all data sets.
I am using the following code to call a deployed Azure AutoML model as a web service:
import requests
import json
import pandas as pd
# URL for the web service
scoring_uri = 'http://xxxxxx-xxxxxx-xxxxx-xxxx.xxxxx.azurecontainer.io/score'
# Two sets of data to score, so we get two results back
new_data = pd.DataFrame([
    ['2020-10-04 19:30:00', 1.29281, 1.29334, 1.29334, 1.29334, 1],
    ['2020-10-04 19:45:00', 1.29334, 1.29294, 1.29294, 1.29294, 1],
    ['2020-10-04 21:00:00', 1.29294, 1.29217, 1.29334, 1.29163, 34],
    ['2020-10-04 21:15:00', 1.29217, 1.29257, 1.29301, 1.29115, 195]],
    columns=['1', '2', '3', '4', '5', '6']
)
# Convert to JSON string
input_data = json.dumps({'data': new_data.to_dict(orient='records')})
# Set the content type
headers = {'Content-Type': 'application/json'}
# Make the request and display the response
resp = requests.post(scoring_uri, input_data, headers=headers)
print(resp.text)
I am getting an error:
{\"error\": \"DataException:\\n\\tMessage: No y values were provided. We expected non-null target values as prediction context because there is a gap between train and test and the forecaster depends on previous values of target. If it is expected, please run forecast() with ignore_data_errors=True. In this case the values in the gap will be imputed.\\n\\tInnerException: None\\n\\tErrorResponse \\n{\\n
I tried adding "ignore_data_errors=True" in different parts of the code without success, and instead got another error:
TypeError: __init__() got an unexpected keyword argument 'ignore_data_errors'
I would very much appreciate any help, as I am stuck on this.
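For context, the error message refers to the forecast() method of the fitted model itself, so the flag cannot simply be added to the web-service JSON payload. Called locally against the AutoML run it would look roughly like this (a sketch; the experiment name and run id are illustrative):

from azureml.core import Workspace, Experiment
from azureml.train.automl.run import AutoMLRun

ws = Workspace.from_config()
experiment = Experiment(ws, 'my-forecasting-experiment')        # placeholder
automl_run = AutoMLRun(experiment, run_id='AutoML_xxxxxxxx')    # placeholder run id

best_run, fitted_model = automl_run.get_output()

# ignore_data_errors is an argument of forecast(); passing it to a constructor
# is what raises the TypeError above.
y_forecast, X_trans = fitted_model.forecast(new_data, ignore_data_errors=True)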
To avoid this error in time series forecasting, you should enable Autodetect for the Forecast Horizon. A manually set horizon only really works for ideal, gap-free time series, which is not much help for real-world cases.
[Screenshot of the Forecast Horizon "Autodetect" setting]
I have a Dash application which takes in multiple CSV files and creates a combined DataFrame for analysis and visualization. This computation usually takes around 30-35 seconds for datasets of 600-650 MB. I'm using the Flask filesystem cache to store this DataFrame once, and every subsequent time I request the data, it comes from the cache.
I used the code from Dash's example here
I have two problems here:
1. Since the cache is on the filesystem, it seems to take twice the amount of time (nearly 70 seconds) to get the DataFrame on the first try; after that it comes back quickly on subsequent requests. Can I use any other cache type to avoid this overhead?
2. I tried automatically clearing my cache by setting CACHE_THRESHOLD (for example, I set it to 1), but it's not working and I keep seeing files being added to the directory.
Sample code:
import dash
from flask_caching import Cache

app = dash.Dash(__name__)
cache = Cache(app.server, config={
    'CACHE_TYPE': 'filesystem',
    'CACHE_DIR': 'my-cache-directory',
    'CACHE_THRESHOLD': 1
})
app.layout = app_layout

@cache.memoize()
def getDataFrame():
    df = createLargeDataFrame()
    return df

@app.callback(...)  # Callback that uses the DataFrame
def useDataFrame():
    df = getDataFrame()
    # Using the DataFrame here
    return value
Can someone help me with this? Thanks.
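For reference, switching flask-caching to a memory-backed store is only a config change; a Redis-backed sketch (host and port are placeholders, and the very first computation of the DataFrame will still take the full 30-35 seconds):

from flask_caching import Cache

cache = Cache(app.server, config={
    'CACHE_TYPE': 'redis',            # in-memory store instead of 'filesystem'
    'CACHE_REDIS_HOST': 'localhost',  # placeholder
    'CACHE_REDIS_PORT': 6379,
    'CACHE_DEFAULT_TIMEOUT': 3600,    # seconds before a cached entry expires
})

A plain 'simple' in-process cache also avoids disk I/O, but it is not shared between worker processes.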
I'm currently building an ETL on a Google Cloud based VM (Windows Server 2019 - 4 vCPUs) to execute the following process:
Extract some tables from a MySQL replica db
Adjust data types for Google BigQuery conformities
Upload the data to BigQuery using Python's pandas_gbq library
To illustrate, here are some parts of the actual code (Python, iterator over one table):
while True:
    # Generate a MySQL query based on the columns and their respective types,
    # using a dictionary to convert MySQL dtypes to Python dtypes.
    sql_query = gen_query(cols_dict=col_types, table=table,
                          pr_key=p_key, offset=offset)

    cursor = cnx.cursor(buffered=True)
    cursor.execute(sql_query)

    if cursor.rowcount == 0:
        break

    num_fields = len(cursor.description)
    field_names = [i[0] for i in cursor.description]

    records = cursor.fetchall()
    df = pd.DataFrame(records, columns=columns)  # 'columns' and the other names come from earlier in the script
    offset += len(df.index)
    print('Ok, df structured')

    # Check for datetime columns
    col_parse_date = []
    for column in columns:
        if col_types[column] == 'datetime64':
            try:
                df[column] = df[column].astype(col_types[column])
                col_parse_date.append(column)
            except Exception:
                df[column] = df[column].astype(str)
                for i in to_bgq:
                    if i['name'] == column:
                        i['type'] = 'STRING'

    # Upload the dataframe to Google BigQuery (via an intermediate CSV on disk)
    df.to_csv('carga_etl.csv', float_format='%.2f',
              index=False, sep='|')
    print('Ok, csv recorded')

    df = ''
    df = pd.read_csv('carga_etl.csv', sep='|')
    print('Ok, csv read')

    df.to_gbq(destination_table='tr.{}'.format(table),
              project_id='iugu-bi', if_exists='append', table_schema=to_bgq)
The logic is based on a query generator: it gets the MySQL table schema and adjusts it to BigQuery formats (e.g. BLOB to STRING, int(n) to INTEGER, etc.), queries the full results (paginated with an offset, 500K rows per page) and saves them in a dataframe, which is then uploaded to my new database.
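To give an idea, the query generated for one page looks roughly like this (simplified; the column names are illustrative, the real ones come from the table schema):

# Roughly what gen_query() produces for a single 500K-row page:
page_size = 500000
sql_query = (
    "SELECT id, created_at, amount "
    "FROM {table} "
    "ORDER BY {p_key} "
    "LIMIT {limit} OFFSET {offset}"
).format(table=table, p_key=p_key, limit=page_size, offset=offset)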
Well, the ETL does its job, and I'm currently migrating my tables to the cloud. However, I'm worried I'm underutilizing my resources because of gaps in network traffic. Here is the network report (bytes/sec) from my VM's reporting section:
VM Network Bytes report
According to that report, my in/out network traffic peaks at 2-3 MB/s, which is really low compared to the roughly 1 GB/s available if I use the machine to download something from my browser, for example.
My point is, what am I doing wrong here? Is there any way to increase my MySQL query/fetch speed and my upload speed to BigQuery?
I understand that you are transforming datetime64 to a compatible BigQuery data type; correct me if I am wrong.
I have a few recommendations:
You can use Dataflow, as it is an ETL product and it is optimized for performance.
Depending on your overall use case, and if you are using Cloud SQL/MySQL, you can use BigQuery federated queries.
Again depending on your use case, you could take a MySQL dump and upload the data to GCS or directly to BigQuery (see the sketch below).
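For the third option, once the dump is in GCS the file can be loaded straight into BigQuery without pandas in the middle; a minimal sketch with the google-cloud-bigquery client (the bucket, file and table names are placeholders):

from google.cloud import bigquery

client = bigquery.Client(project='iugu-bi')    # project id taken from the question

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    field_delimiter='|',
    skip_leading_rows=1,
    autodetect=True,    # or pass an explicit schema, as done with table_schema above
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
)

load_job = client.load_table_from_uri(
    'gs://my-bucket/dumps/my_table.csv',    # placeholder
    'iugu-bi.tr.my_table',                  # placeholder table id
    job_config=job_config,
)
load_job.result()    # BigQuery pulls the file server-side, so the VM's bandwidth is not the bottleneck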
I'm using Datalab for a Python notebook that loads data from Cloud Storage into BigQuery, basically following this example.
I then saw that my original data in the Cloud Storage bucket is in the EU (europe-west3-a), and the VM that executes Datalab is in the same region, but the final data in BigQuery ends up in the US.
According to this post I tried setting the location for the dataset in code, but that did not work, because there is no such option defined in the Datalab BigQuery Python module.
So my question is: how do I set the location (zone and region) for the BigQuery dataset and the tables it contains?
This is my code:
# data: https://www.kaggle.com/benhamner/sf-bay-area-bike-share/data
%%gcs read --object gs://my_bucket/kaggle/station.csv --variable stations
# CSV will be read as bytes first
df_stations = pd.read_csv(StringIO(stations))
schema = bq.Schema.from_data(df_stations)
# Create an empty dataset
#bq.Dataset('kaggle_bike_rentals').create(location='europe-west3-a')
bq.Dataset('kaggle_bike_rentals').create()
# Create an empty table within the dataset
table_stations = bq.Table('kaggle_bike_rentals.stations').create(schema = schema, overwrite = True)
# load data directly from cloud storage into the bigquery table. the locally loaded Pandas dataframe won't be used here
table_stations.load('gs://my_bucket/kaggle/station.csv', mode='append', source_format = 'csv', csv_options=bq.CSVOptions(skip_leading_rows = 1))
Update: In the meantime I manually created the dataset in the BigQuery web UI and used it in code without creating it there. Now an exception is raised if the dataset does not exist, which prevents me from creating one in code that would end up in the default US location.
Have you tried bq.Dataset('[your_dataset]').create(location='EU')?
BigQuery locations are set on a dataset level. Tables take their location based on the dataset they are in.
Setting the location of a dataset works at least outside of Datalab, using the google-cloud-bigquery client:
from google.cloud import bigquery

bigquery_client = bigquery.Client(project='your_project')
dataset_ref = bigquery_client.dataset('your_dataset_name')
dataset = bigquery.Dataset(dataset_ref)
dataset.location = 'EU'    # must be set before the dataset is created
dataset = bigquery_client.create_dataset(dataset)
Based on the code snippet from here: https://cloud.google.com/bigquery/docs/datasets
I am working with Azure Functions to create triggers that aggregate data on an hourly basis. The triggers get data from Blob Storage, and to avoid aggregating the same data twice I want to add a condition that lets me only process blobs that were modified in the last hour.
I am using the SDK, and my code for doing this looks like this:
# Imports assumed for this snippet
import json
from datetime import datetime, timedelta, timezone
from azure.storage.blob import BlockBlobService

utc = timezone.utc

# Timestamp variables for t.now and t-1
timestamp = datetime.now(tz=utc)
timestamp_negative1hr = timestamp + timedelta(hours=1)

# Read data from input environment
data = BlockBlobService(account_name='accname', account_key='key')
generator = data.list_blobs('directory')
dataloaded = []
for blob in generator:
    loader = data.get_blob_to_text('collection', blob.name, if_modified_since=timestamp_negative1hr)
    trackerstatusobjects = loader.content.split('\n')
    for trackerstatusobject in trackerstatusobjects:
        dataloaded.append(json.loads(trackerstatusobject))
When I run this, the error I get is azure.common.AzureHttpError: The condition specified using HTTP conditional header(s) is not met. It also says it is due to a timeout. The blobs are receiving data when I run it, so in any case that is not the correct response. If I add .strftime("%Y-%m-%d %H:%M:%S:%z") to the end of my timestamp, I get another error: AttributeError: 'str' object has no attribute 'tzinfo'. This must mean that Azure expects a datetime object, but for some reason it is not working for me.
Any ideas on how to solve it? Thanks
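For reference, a minimal sketch of passing a timezone-aware datetime directly to if_modified_since, with the subtraction the variable name suggests (untested; the account name, key and container are placeholders):

from datetime import datetime, timedelta, timezone

from azure.common import AzureHttpError
from azure.storage.blob import BlockBlobService

# "One hour ago" as a timezone-aware datetime object, no strftime/string conversion.
one_hour_ago = datetime.now(timezone.utc) - timedelta(hours=1)

blob_service = BlockBlobService(account_name='accname', account_key='key')  # placeholders
for blob in blob_service.list_blobs('collection'):
    try:
        loader = blob_service.get_blob_to_text('collection', blob.name,
                                               if_modified_since=one_hour_ago)
    except AzureHttpError:
        # Blobs not modified in the last hour fail the conditional header check; skip them.
        continue
    # process loader.content here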