I am looking for a working example of how to access data on an Azure Machine Learning managed datastore from within a train.py script. I followed the instructions in the link and my script is able to resolve the datastore.
However, whatever I tried (as_download(), as_mount()), the only thing I ever got back was a DataReference object. Or maybe I just don't understand how to actually read data from a file with it.
from azureml.core import Run, Datastore

run = Run.get_context()
exp = run.experiment
ws = run.experiment.workspace
ds = Datastore.get(ws, datastore_name='mydatastore')
data_folder_mount = ds.path('mnist').as_mount()
# So far this all works. But how do I go from here?
You can pass the DataReference object you created as an input to your training job (ScriptRun/Estimator/HyperDrive/pipeline). Then, in your training script, you can access the mounted path via a script argument.
Full tutorial: https://learn.microsoft.com/en-us/azure/machine-learning/service/tutorial-train-models-with-aml
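A minimal sketch of that pattern, assuming the Estimator API used in the tutorial (compute_target is a placeholder for an existing compute target):

# Submitting side: hand the mounted DataReference to the run as a script parameter
from azureml.train.estimator import Estimator

est = Estimator(source_directory='.',
                entry_script='train.py',
                compute_target=compute_target,
                script_params={'--data-folder': ds.path('mnist').as_mount()})
run = exp.submit(est)

Inside train.py, the mounted path then arrives as an ordinary directory path:

# train.py: parse the argument and read files as from a local folder
import argparse
import os

parser = argparse.ArgumentParser()
parser.add_argument('--data-folder', type=str, dest='data_folder')
args = parser.parse_args()
print(os.listdir(args.data_folder))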
Summary:
I am working on a use case where I want to write images via cv2 to ADLS from within a PySpark streaming job in Databricks; however, it doesn't work if the target directory doesn't exist.
But I want to store each image in a specific structure depending on the image attributes,
so basically I need to check at runtime whether a directory exists and create it if it doesn't.
Initially I tried using dbutils, but dbutils can't be used inside the PySpark API.
https://github.com/MicrosoftDocs/azure-docs/issues/28070
Expected results:
To be able to create a directory in ADLS Gen2 at runtime from within a PySpark streaming job.
Reproducible code:
import cv2
from pyspark.sql import functions as F

# Read images in batch for simplicity
df = (spark.read.format('binaryFile')
      .option('recursiveFileLookup', True)
      .option('pathGlobFilter', '*.jpg')
      .load(path_to_source))

# Derive the necessary columns
df = (df.withColumn('ingestion_timestamp', F.current_timestamp())
        .withColumn('source_ingestion_date', F.to_date(F.split('path', '/')[10]))
        .withColumn('source_image_path', F.regexp_replace(F.col('path'), 'dbfs:', '/dbfs/'))
        .withColumn('source_image_time', F.substring(F.split('path', '/')[12], 0, 8))
        .withColumn('year', F.date_format(F.to_date(F.col('source_ingestion_date')), 'yyyy'))
        .withColumn('month', F.date_format(F.to_date(F.col('source_ingestion_date')), 'MM'))
        .withColumn('day', F.date_format(F.to_date(F.col('source_ingestion_date')), 'dd'))
        .withColumn('base_path', F.concat(F.lit('/dbfs/mnt/development/testing'),
                                          F.lit('/year='), F.col('year'),
                                          F.lit('/month='), F.col('month'),
                                          F.lit('/day='), F.col('day'))))
# Function to be called in the foreach call
def processRow(row):
    source_image_path = row['source_image_path']
    base_path = row['base_path']
    source_image_time = row['source_image_time']
    if not CheckPathExists(base_path):
        dbutils.fs.mkdirs(base_path)
    full_path = f"{base_path}/{source_image_time}.jpg"
    im = cv2.imread(source_image_path)
    cv2.imwrite(full_path, im)

# This fails
df.foreach(processRow)
# Due to the code block below
if not CheckPathExists(base_path):
    dbutils.fs.mkdirs(base_path)
full_path = f"{base_path}/{source_image_time}.jpg"
im = cv2.imread(source_image_path)
cv2.imwrite(full_path, im)
Does anyone have any suggestions, please?
AFAIK, dbutils.fs.mkdirs(base_path) works for paths of the form dbfs:/mnt/mount_point/folder.
I have reproduced this: when I pass a path of the form /dbfs/mnt/mount_point/folder to mkdirs, the folder is not created in ADLS even though Databricks returns True.
But for dbfs:/mnt/mount_point/folder it works fine.
This might be the issue here. So first check whether the path exists using the /dbfs/mnt/mount_point/folder form, and if it does not, create the directory using the dbfs:/ form.
Example:
import os

base_path = "/dbfs/mnt/data/folder1"
print("before : ", os.path.exists(base_path))
if not os.path.exists(base_path):
    base_path2 = "dbfs:" + base_path[5:]
    dbutils.fs.mkdirs(base_path2)
print("after : ", os.path.exists(base_path))
You can see that the folder is created.
If you don't want to use os directly, check whether the path exists with a listing call such as dbutils.fs.ls and create the directory if it is missing, as in the sketch below.
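If you prefer to stay entirely within dbutils, a small helper along these lines should work (a sketch; the helper name and mount path are only examples):

def path_exists(path):
    # dbutils.fs.ls raises an exception when the path does not exist
    try:
        dbutils.fs.ls(path)
        return True
    except Exception:
        return False

base_path = "dbfs:/mnt/data/folder1"
if not path_exists(base_path):
    dbutils.fs.mkdirs(base_path)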
Context
Hi!
In wandb I can download a model based on a tag (prod, for example), but I would also like to get all metrics associated with that run by using tags.
The problem is that I don't know how to get a specific run ID based on a tag.
Example
Using the code below we can extract a run's summary metrics, but having to hard-code run IDs is setting me back.
So if I could get run IDs based on a tag, or just explicitly download metrics with another API call, like a special syntax in api.run, that would be great! In the code example below I would like to use the what_i_want_to_use string to call the API instead of what_i_use.
import wandb
from ast import literal_eval
api = wandb.Api()
what_i_use = "team_name/project_name/runID_h3h3h4h4h4h4"
# what_i_want_to_use = "team_name/project_name/artifact_name/prod_tag"
# run is specified by <entity>/<project>/<run_id>
run = api.run(what_i_use)
# save the metrics for the run to a csv file
metrics_dataframe = run.summary
print(metrics_dataframe['a_summary_metric'])
Running through the docs I didn't find any solution so far. Any ideas?
wandb public api run details
Thanks for reading!
It is possible to filter runs by tags as well. You can read more about it here:
You can filter by config.*, summary_metrics.*, tags, state, entity, createdAt, etc.
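For example, something along these lines should return every run in the project carrying a prod tag (entity and project names are the placeholders from the question; filters use the MongoDB-style query syntax of the public API):

import wandb

api = wandb.Api()
runs = api.runs("team_name/project_name", filters={"tags": {"$in": ["prod"]}})

for run in runs:
    print(run.id, run.summary.get("a_summary_metric"))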
import json
from os import path, makedirs
_default_dir = path.expanduser('~/.config/gspread_pandas')
_default_file = 'google_secret.json'
def ensure_path(pth):
    if not path.exists(pth):
        makedirs(pth)
Hi, I'm currently working on data collection via Selenium, parsing the data with pandas, and sending it to a Google spreadsheet.
However, the gspread-pandas module I'm using needs the google_secret JSON file to be placed in '~/.config/gspread_pandas', which is a fixed location as described in the link below:
https://pypi.python.org/pypi/gspread-pandas/0.15.1
I want to write a function that sets a custom location so the app environment is self-contained.
For example, I want to locate the file here:
default_folder = os.getcwd()
The default_folder would be where my project lives (the same folder).
What can I do to achieve this?
If you look at the source https://github.com/aiguofer/gspread-pandas/blob/master/gspread_pandas/conf.py you can see that you can create your own config and pass it to the Spread object constructor.
But yes, this part is really badly documented.
So, this code works well for me:
from gspread_pandas import Spread, conf
c = conf.get_config('[Your Path]', '[Your filename]')
spread = Spread('username', 'spreadname', config=c)
Thank you for this. It really should be documented better. I was getting so frustrated trying to get this to work with Heroku, but it worked perfectly. I had to change it to the following:
c = gspread_pandas.conf.get_config('/app/', 'google_secret.json')
spread = gspread_pandas.Spread('google_sheet_key_here_that_is_a_long_string', config=c)
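For the original question of keeping the secret next to the project, the same pattern should work with os.getcwd() (a sketch, assuming google_secret.json sits in the current working directory):

import os
from gspread_pandas import Spread, conf

# Look for google_secret.json in the current working directory
c = conf.get_config(os.getcwd(), 'google_secret.json')
spread = Spread('username', 'spreadname', config=c)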
I have the following question:
I am loading a JModelica model and can simulate it easily by doing:
from pymodelica import compile_fmu
from pyfmi import load_fmu
model = load_fmu(SOME_FMU)
res = model.simulate()
Everything works fine and it even saves the resulting .txt file. The problem is that I have not found any functionality so far within the JModelica Python packages to load such a .txt result file again later on into a result object (like the one returned by simulate()) to easily extract the previously saved data.
Implementing that by hand is of course possible, but I find it quite nasty and just wanted to ask if anyone knows of a method that loads that JModelica-format result file into a result object for me.
Thanks!!!!
The functionality that you need is located in the io module:
from pyfmi.common.io import ResultDymolaTextual
res = ResultDymolaTextual("MyResult.txt")
var = res.get_variable_data("MyVariable")
var.x  # trajectory values
var.t  # corresponding time vector
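The returned trajectory is plain NumPy data, so you can, for example, plot it directly (a small usage sketch):

import matplotlib.pyplot as plt

plt.plot(var.t, var.x)   # value of "MyVariable" over time
plt.xlabel("time")
plt.ylabel("MyVariable")
plt.show()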
I've written a Python script that takes a large file (a matrix of ~50k rows x ~500 cols) and uses it as the dataset to train a random forest model.
My script has two functions, one to load the dataset and the other to train the random forest model on that data. Both work fine, but loading the file takes ~45 seconds, and it's a pain to do this every time I want to train a subtly different model (testing many models on the same dataset). Here is the file-loading code:
import io
import numpy as np

def load_train_data(train_file):
    # Read in training file
    train_f = io.open(train_file)
    train_id_list = []
    train_val_list = []
    for line in train_f:
        list_line = line.strip().split("\t")
        if list_line[0] != "Domain":
            train_identifier = list_line[9]
            train_values = list_line[12:]
            train_id_list.append(train_identifier)
            train_val_float = [float(x) for x in train_values]
            train_val_list.append(train_val_float)
    train_f.close()
    train_val_array = np.asarray(train_val_list)
    return (train_id_list, train_val_array)
This returns a list of identifiers (col. 9, the labels) and a NumPy array of cols. 12 to the end, the features used to train the random forest.
I am going to train many different versions of my model with the same data, so I just want to load the file once and have it available to feed into my random forest function. I think I want the data to be an object (I am fairly new to Python).
If I understand you correctly, the dataset does not change but the model parameters do, and you are changing the parameters after each run.
I would put the file-loading code in one file and run it in the Python interpreter. The data then loads and stays in memory under whatever variable you use.
Then you can import another file with your model code and run it with the training data as an argument.
If all your model changes can be expressed as parameters in a function call, all you need to do is import your model and call the training function with different parameter settings, as in the sketch at the end of this answer.
If you need to change the model code between runs, save it under a new filename, import that one, run it again, and pass the source data to it.
If you don't want to save each model modification under a new filename, you may be able to use the reload functionality, depending on your Python version, but it is not recommended (see Proper way to reload a python module from the console).
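A sketch of that workflow with hypothetical module and function names (loader.py holding the load_train_data function from the question, model.py holding a train_model function):

# Run this once in the interpreter: the slow file parse happens a single time
from loader import load_train_data
from model import train_model

train_ids, train_vals = load_train_data("train_matrix.tsv")

# Re-train as often as needed against the data already in memory
for n_trees in (100, 300, 500):
    rf = train_model(train_ids, train_vals, n_estimators=n_trees)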
Simplest way would be to cache the results, like so:
_train_data_cache = {}

def load_cached_train_data(train_file):
    if train_file not in _train_data_cache:
        _train_data_cache[train_file] = load_train_data(train_file)
    return _train_data_cache[train_file]
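Usage stays the same as before, and only the first call pays the ~45-second parse (the file name is just an example):

ids, values = load_cached_train_data("train_matrix.tsv")  # slow: parses the file
ids, values = load_cached_train_data("train_matrix.tsv")  # fast: returned from the cache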
Try to learn about Python data serialization. You would basically store the large file as a Python-specific, serialized binary object using Python's marshal module. This would drastically speed up I/O on the file. See these benchmarks for the performance differences. However, if these random forest models are all trained at the same time, then you could just train them against the dataset you already have in memory and release the training data after completion.
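A minimal sketch of the serialization idea, using pickle instead of marshal since pickle handles NumPy arrays and tuples directly (the cache file name is just an example):

import pickle

def load_or_unpickle(train_file, cache_file="train_data.pkl"):
    # Parse the text file once, then reuse a fast binary cache on later runs
    try:
        with open(cache_file, "rb") as f:
            return pickle.load(f)               # fast binary load
    except FileNotFoundError:
        data = load_train_data(train_file)      # slow text parse, first run only
        with open(cache_file, "wb") as f:
            pickle.dump(data, f, protocol=pickle.HIGHEST_PROTOCOL)
        return data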
Load your data in IPython:
my_data = open("data.txt")
Write your code in a Python script, say example.py, which uses this data. At the top of example.py, add these lines:
import sys
args = sys.argv
data = args[1]
...
Now run the Python script in IPython:
%run example.py $my_data
Now, when running your python script, you don't need to load data multiple times.