How to generate the JSON format for Google Cloud predictions - Python

I am trying to make predictions from my custom model on Vertex AI but am getting errors.
I have deployed the model to an endpoint and I request a prediction with this command:
gcloud beta ai endpoints predict 1234 --region=europe-west4 --json-request=request2.json
and I get this response
Using endpoint [https://europe-west4-prediction-aiplatform.googleapis.com/]ERROR: (gcloud.beta.ai.endpoints.predict) FAILED_PRECONDITION: "Prediction failed: Exception during xgboost prediction: feature_names mismatch:
I created the data with this code (the file was later renamed to request2.json):
test3 = {}
test3["instances"] = test_set2.head(20).values.tolist()
with open('05_GCP/test_v3.jsonl', 'w') as outfile:
    json.dump(test3, outfile, indent=2)
This generates a file containing an "instances" key whose value is a list of bare value lists, with no feature names.
The error tells me that it's expecting column names for each value rather than nothing, and the missing names are then interpreted as f0, f1, etc.
My challenge is that I don't know how to generate data that looks like the example in the help file, though the mismatched-column-names result suggests I need a different format.
I tried:
import json
test4 = X_test.head(20).to_json(orient='records', lines=True)
with open('05_GCP/test_v4.json', 'w') as outfile:
    json.dump(test4, outfile, indent=2)
which gives me data with many line breaks in it, and Cloud Shell tells me this isn't JSON.
I also replicated this format by hand and was again informed it is not a JSON file.
So, two questions:
1. How do I create a JSON file with the appropriate format for a live prediction?
2. How do I create a JSONL file so I can run batch jobs? This is actually what I am trying to get to. I have used CSV, but it returns errors:
('Post request fails. Cannot get predictions. Error: Predictions are not in the response. Got: {"error": "Prediction failed: Could not initialize DMatrix from inputs: ('Expecting 2 dimensional numpy.ndarray, got: ', (38,))"}.', 2)
This CSV data is exactly the same data I use to measure the model error whilst training (I know that's not good practice, but this is just a test run).
UPDATE
Following Raj's suggestion, I tried creating two extra models: one where I changed my training code to use X_train.values, and another where I replaced all the column names with F0:F426, since the endpoint's response to the JSON file said the column names didn't match.
Both of these SUCCEEDED with the test3 JSON file above when deployed to endpoints.
However, I want batch predictions, and there I get the same errors. This is therefore clearly a formatting problem, but I have no clue how to get there.
It should be pointed out that batch predictions need a JSONL file, not a JSON file; passing a JSON file doesn't work. All I have done is change the extension on the JSON file when I uploaded it so that it appears to be a JSONL file. I have not found anything that shows me how to create one properly. Tips are welcome.
I noticed that my data was wrapped in double brackets, so I created a version with only single brackets and ran it against one of the models, but it also returned errors. That was also only a single prediction, per the other comment.
Appreciate the help,
James

This worked for me:
If you train the model with df.values, as per Raj's answer, you can then pass the instances you want batch predictions for in a JSONL file ("input.jsonl", for example) with the following format for each instance/row:
[3.0,1.0,30.0,1.0,0.0,16.1]
The file would look something like this for 5 rows to predict:
[3.0,1.0,30.0,1.0,0.0,16.1]
[3.0,0.0,22.0,0.0,0.0,9.8375]
[2.0,0.0,45.0,1.0,1.0,26.25]
[1.0,0.0,21.0,0.0,0.0,26.55]
[3.0,1.0,16.0,4.0,1.0,39.6875]
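In case it helps, here is a minimal sketch of how such a file could be generated with pandas and json; X_test and the 05_GCP/ path are taken from the question, and input.jsonl is a hypothetical output name:
import json

# X_test is assumed to already be a pandas DataFrame of raw feature values.
# Each line of the JSONL file is a single JSON array, i.e. one instance per line.
with open("05_GCP/input.jsonl", "w") as outfile:
    for row in X_test.head(20).values.tolist():
        outfile.write(json.dumps(row) + "\n")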

Related

How to properly return dataframe as JSON using FastAPI?

I created an API using FastAPI that returns JSON. At first I turned the DataFrame into JSON using the pandas .to_json() method, which let me choose the correct "orient" parameter. This saved a .json file, which I then opened so FastAPI could return its contents, as follows:
DATA2.to_json("json_records.json", orient="records")
with open('json_records.json', 'r') as f:
    data = json.load(f)
return(data)
This worked perfectly, but I was told that my script shouldn't save any files, since it will be running on my company's server, so I had to turn the DataFrame into JSON directly and return it. I tried doing this:
data = DATA2.to_json(orient="records")
return(data)
But now the API's output is a JSON string full of "\" characters. I guess there is a problem with the serialization, but I can't really find a way to do it properly.
The output now looks like this:
"[{\"ExtraccionHora\":\"12:53:00\",\"MiembroCompensadorCodigo\":117,\"MiembroCompensadorDescripcion\":\"OMEGA CAPITAL S.A.\",\"CuentaCompensacionCodigo\":\"1143517\",\"CuentaNeteoCodigo\":\"160234117\",\"CuentaNeteoDescripcion\":\"UNION FERRO SRA A\",\"ActivoDescripcion\":\"X17F3\",\"ActivoID\":8,\"FinalidadID\":2,\"FinalidadDescripcion\":\"Margenes\",\"Cantidad\":11441952,\"Monto\":-16924935.3999999985,\"Saldo\":-11379200.0,\"IngresosVerificados\":11538288.0,\"IngresosNoVerificado\":0.0,\"MargenDelDia\":0.0,\"SaldoConsolidadoFinal\":-16765847.3999999985,\"CuentaCompensacionCodigoPropia\":\"80500\",\"SaldoCuentaPropia\":-7411284.3200000003,\"Resultado\":\"0\",\"MiembroCompensadorID\":859,\"CuentaCompensacionID\":15161,\"CuentaNeteoID\":7315285}.....
What would be a proper way of turning my dataframe into a JSON using the "records" orient, and then returning it as the FastAPI output?
Thanks!
Update: I changed the to_json() method to to_dict() with the same parameters and it seems to work... I don't know if it's correct.
data = DATA2.to_dict(orient="records")
return(data)
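For what it's worth, the "\" characters appear because to_json() already returns a JSON string, which FastAPI then JSON-encodes a second time. A minimal sketch of two ways around that, assuming DATA2 is the DataFrame from the question and the route paths are hypothetical:
from fastapi import FastAPI, Response

app = FastAPI()

@app.get("/records")
def get_records():
    # Option 1: return plain Python objects and let FastAPI serialize them
    # (this is what the to_dict(orient="records") update does).
    return DATA2.to_dict(orient="records")

@app.get("/records-raw")
def get_records_raw():
    # Option 2: return the already-serialized string as-is, marked as JSON,
    # so FastAPI does not encode it again.
    return Response(content=DATA2.to_json(orient="records"),
                    media_type="application/json")
One difference worth noting: to_json() writes missing values as null, while to_dict() leaves NaN floats for the framework's encoder to handle.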

PySpark groupby applyInPandas save objects as files problem

I run PySpark 3.1 in local mode on a Windows computer, from a Jupyter Notebook. I call "applyInPandas" on a Spark DataFrame.
The function below applies a few data transformations to the input Pandas DataFrame and trains an SGBT model. It then serializes the trained model to binary and saves it to an S3 bucket as an object. Finally it returns the DataFrame. I call this function on a Spark DataFrame grouped by two columns in the last line. I receive no error, and the returned DataFrame is the same length as the input; data for each group is returned.
The problem is the saved model objects. There are objects saved in S3 for only 2 groups, when there were supposed to be models for every group. There is no missing/wrong data point that would cause model training to fail (I'd receive an error or warning anyway). What I have tried so far:
Replace S3 and save to local file system: The same result.
Replace "pickle" with "joblib" and "BytesIO": The same result.
Repartition before calling the function: Now I had more objects saved for different groups, but not all. [I did this by calling "val_large_df.coalesce(1).groupby('la..." in the last line.]
So I suspect this is about parallelism and distribution, but I could not figure it out. Thanks in advance.
import pickle
import boto3
from sklearn.ensemble import GradientBoostingRegressor

def train_sgbt(pdf):
    ## Some data transformations here ##
    # Train the model
    sgbt_mdl = GradientBoostingRegressor(--Params.--).fit(--Params.--)
    sgbt_mdl_b = pickle.dumps(sgbt_mdl)  # Serialize
    # Initiate s3_client
    s3_client = boto3.client(--Params.--)
    # Put file in S3
    s3_client.put_object(Body=sgbt_mdl_b, Bucket='my-bucket-name',
                         Key="models/BT_" + str(pdf.latGroup_m[0]) + "_" + str(pdf.lonGroup_m[0]) + ".mdl")
    return pdf

dummy_df = val_large_df.groupby("latGroup_m", "lonGroup_m").applyInPandas(train_sgbt,
                                                                          schema="fcast_error double")
dummy_df.show()
Spark evaluates dummy_df lazily, and therefore train_sgbt will only be called for the groups that are required to complete the Spark action.
The Spark action here is show(). This action prints only the first 20 rows, so train_sgbt is only called for the groups that have at least one element in those first 20 rows. Spark may evaluate more groups, but there is no guarantee of it.
One way to solve the problem would be to call another action that consumes every row, for example writing the result out as CSV, as sketched below.
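A rough illustration of that idea (the output path is hypothetical): replacing show() with an action that materializes every row forces train_sgbt to run, and a model to be saved, for every group.
# show() only needs the first 20 rows; a full action touches every group.
dummy_df.write.mode("overwrite").csv("output/dummy_df_fcast_error")
# or simply:
# dummy_df.count()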

Prevent New Line Formation in json_normalize Data Frames

I am working to flatten some tweets into a wide data frame. I simply use the pandas.json_normalize function on my tweet data to perform this.
I then save this data frame to a CSV file. When the CSV is uploaded elsewhere, some rows spill onto extra lines that really belong to the row above, rather than holding all the data on a single row. I discovered this issue when uploading the CSV into R and into Domo.
When I run the following command in a Jupyter notebook, the CSV loads fine:
sb_2019 = pd.read_csv('flat_tweets.csv',lineterminator='\n',low_memory=False)
Without the lineterminator I see this error:
Error tokenizing data. C error: Buffer overflow caught - possible malformed input file.
Needs:
I am looking for a post-processing step to eliminate the need for the lineterminator. I need to open the CSV in platforms and languages that do not have this option. How might I go about doing this?
Note:
I am working with over 700k tweets. The json_normalize function works great on small pieces of my data where issues are being found. When I run json_normalize on the whole dataset I am finding this issue.
Try using '\r\n' or '\r' as lineterminator, and not '\n'.
This solution would be helpful too, opening in universal-new-line mode:
sb_2019 = pd.read_csv(open('flat_tweets.csv','rU'), encoding='utf-8', low_memory=False)
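Regarding the post-processing step asked for in the question: another option is to strip embedded newlines (common inside tweet text) from the flattened frame before writing the CSV, so no lineterminator setting is needed when the file is read elsewhere. A minimal sketch, where flat_tweets is a hypothetical name for the json_normalize output:
# Replace carriage returns / line feeds inside string fields with a space,
# then write a CSV that any reader can parse line by line.
flat_tweets = flat_tweets.replace({r'[\r\n]+': ' '}, regex=True)
flat_tweets.to_csv('flat_tweets.csv', index=False)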

How to avoid loading a large file into a python script repeatedly?

I've written a python script to take a large file (a matrix ~50k rows X ~500 cols) and use it as a dataset to train a random forest model.
My script has two functions, one to load the dataset and the other to train the random forest model using said data. These both work fine, but the file upload takes ~45 seconds and it's a pain to do this every time I want to train a subtly different model (testing many models on the same dataset). Here is the file upload code:
import io
import numpy as np

def load_train_data(train_file):
    # Read in training file
    train_f = io.open(train_file)
    train_id_list = []
    train_val_list = []
    for line in train_f:
        list_line = line.strip().split("\t")
        if list_line[0] != "Domain":
            train_identifier = list_line[9]
            train_values = list_line[12:]
            train_id_list.append(train_identifier)
            train_val_float = [float(x) for x in train_values]
            train_val_list.append(train_val_float)
    train_f.close()
    train_val_array = np.asarray(train_val_list)
    return (train_id_list, train_val_array)
This returns a numpy array with col. 9 as the label and cols. 12-end as the metadata to train the random forest.
I am going to train many different forms of my model with the same data, so I just want to load the file once and have it available to feed into my random forest function. I want the file to be an object, I think (I am fairly new to Python).
If I understand you correctly, the data set does not change but the model parameters do change and you are changing the parameters after each run.
I would put the file load script in one file, and run this in the python interpreter. Then the file will load and be saved in memory with whatever variable you use.
Then you can import another file with your model code, and run that with the training data as argument.
If all your model changes can be determined as parameters in a function call, all you need is to import your model and then call the training function with different parameter settings.
If you need to change the model code between runs, save with a new filename and import that one, run again and send the source data to that one.
If you don't want to save each model modification with a new filename, you might be able to use the reload functionality depending on python version, but it is not recommended (see Proper way to reload a python module from the console)
Simplest way would be to cache the results, like so:
_train_data_cache = {}

def load_cached_train_data(train_file):
    if train_file not in _train_data_cache:
        _train_data_cache[train_file] = load_train_data(train_file)
    return _train_data_cache[train_file]
Try learning about Python data serialization. You would basically store the parsed data as a Python-specific serialized binary object, for example using Python's marshal module, which would drastically speed up I/O on the file. See these benchmarks for the performance differences. However, if these random forest models are all trained at the same time, you could just train them against the dataset you already have in memory and then release the training data after completion.
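As a rough sketch of that approach, using pickle rather than marshal (pickle handles the NumPy array directly) and reusing load_train_data from the question; the cache filename is arbitrary:
import os
import pickle

def load_train_data_pickled(train_file, cache_file="train_data.pkl"):
    # Reuse the parsed result from disk if it exists; otherwise parse the
    # tab-separated file once and cache the result for the next run.
    if os.path.exists(cache_file):
        with open(cache_file, "rb") as f:
            return pickle.load(f)
    train_id_list, train_val_array = load_train_data(train_file)
    with open(cache_file, "wb") as f:
        pickle.dump((train_id_list, train_val_array), f,
                    protocol=pickle.HIGHEST_PROTOCOL)
    return train_id_list, train_val_array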
Load your data in IPython:
my_data = open("data.txt")
Write your code in a Python script, say example.py, which uses this data. At the top of example.py add these lines:
import sys
args = sys.argv
data = args[1]
...
Now run the script in IPython:
%run example.py $my_data
This way, you don't need to load the data multiple times when running your script.

scikit-learn load_mlcomp exception "Could not find dataset with metadata line: name: 20news-18828"

I am trying some simple examples from a book, but somehow I get an error.
import sklearn.datasets
MLCOMP_DIR = r"~/my/data/"
data = sklearn.datasets.load_mlcomp("20news-18828", mlcomp_root=MLCOMP_DIR)
ValueError: Could not find dataset with metadata line: name: 20news-18828
For this function to work with the path you specified, you need to make sure you download dataset 379 and extract it into the folder ~/my/data.
The problem is probably your file structure.
Make sure you have the folder ~/my/data/379, and inside it the metadata file along with the folders test, train and raw.
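For illustration, a sketch of the expected layout and call (the home path is the one from the question; the second argument selects the train split, as in the other answer below):
# Expected layout after extracting dataset 379:
# ~/my/data/
#     379/
#         metadata
#         raw/
#         train/
#         test/
import sklearn.datasets

MLCOMP_DIR = r"~/my/data"  # the parent of the 379 folder, not the 379 folder itself
data = sklearn.datasets.load_mlcomp("20news-18828", "train", mlcomp_root=MLCOMP_DIR)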
Have you followed the instructions in the example?
From the file
The dataset used in this example is the 20 newsgroups dataset and should be downloaded from http://mlcomp.org (free registration required):
http://mlcomp.org/datasets/379
I am not sure what you mean by "on a book"; this function is for extracting this specific dataset.
First you should extract the zip file, dataset-379-20news-18828_WJQIG.zip (or a similar file that can be found at http://mlcomp.org/datasets/379#). After unzipping the file you get the folder 379, which contains raw, train, test, and a metadata file.
To work with:
data = sklearn.datasets.load_mlcomp("20news-18828", 'train', mlcomp_root=MLCOMP_DIR)
you should set MLCOMP_DIR to, for example, "D:\data\ML", with the folder 379 inside the ML folder.
So MLCOMP_DIR should be "D:\data\ML", not "D:\data\ML\379".
