Can I specify a SageMaker estimator's entry point script that lives in a subdirectory? So far, it fails for me. Here is what I want to do:
sklearn = SKLearn(
    entry_point="RandomForest/my_script.py",
    source_dir="../",
    hyperparameters={...
I want to do this so I don't have to break my directory structure. I have some modules that I use in several SageMaker projects, and each project lives in its own directory:
my_git_repo/
    RandomForest/
        my_script.py
        my_sagemaker_notebook.ipynb
    TensorFlow/
        my_script.py
        my_other_sagemaker_notebook.ipynb
    module_imported_in_both_scripts.py
If I try to run this, SageMaker fails because it seems to derive a module name from the entry point script's path, and it does not do a good job with the subdirectory:
/usr/bin/python3 -m RandomForest/my_script --bootstrap True --case nf_2 --max_features 0.5 --min_impurity_decrease 5.323785009485933e-06 --model_name model --n_estimators 455 --oob_score True
...
/usr/bin/python3: No module named RandomForest/my_script
Does anyone know a way around this other than putting my_script.py in the source_dir?
Unfortunately, this is a gap in functionality. There is some related work in https://github.com/aws/sagemaker-python-sdk/pull/941 which should also solve this issue, but for now, you do need to put my_script.py in source_dir.
What if you set source_dir = my_git_repo/RandomForest?
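If that layout works for you, here is a minimal sketch of it (hedged: constructor arguments assume SageMaker Python SDK v2, the role ARN and instance settings are placeholders, and the shared module would have to be copied into the project folder so it gets packaged too):
from sagemaker.sklearn.estimator import SKLearn

sklearn = SKLearn(
    entry_point="my_script.py",                  # now sits at the top level of source_dir
    source_dir="my_git_repo/RandomForest",       # copy module_imported_in_both_scripts.py in here too
    role="arn:aws:iam::111111111111:role/ExampleSageMakerRole",  # placeholder role
    framework_version="0.23-1",
    instance_type="ml.m5.large",                 # placeholder instance settings
    instance_count=1,
    hyperparameters={"n_estimators": 455},
)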
Otherwise, you can also use a build step (such as CodeBuild, but it could also be custom code in e.g. Lambda or Airflow) to send your script to S3 as a compressed artifact, since this is how lower-level SDKs such as boto3 expect your script anyway; this type of integration is shown in the boto3 section of the SageMaker Sklearn random forest demo.
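For reference, a rough sketch of that packaging step with plain tarfile and boto3 (bucket and key names are placeholders; the exact archive layout the training job expects is described in the linked demo):
import tarfile
import boto3

# Package the whole repo so the shared modules travel with the entry point script.
with tarfile.open("sourcedir.tar.gz", "w:gz") as tar:
    tar.add("my_git_repo", arcname=".")

# Upload the artifact to S3, where the lower-level APIs expect to find your code.
boto3.client("s3").upload_file("sourcedir.tar.gz", "my-example-bucket", "code/sourcedir.tar.gz")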
I'm trying to perform a simple swap from DAI to WETH with Uniswap in my own smart contract on the Kovan testnet. Unfortunately my transaction keeps getting reverted even after setting the gas limit manually.
I also discovered that I cannot verify the contract on Kovan, either via the Etherscan API or manually. Instead I keep getting this error for every library I import:
Source "#uniswap/v3-periphery/contracts/interfaces/ISwapRouter.sol" not found: File import callback not supported
Accordingly, I have the feeling something is going wrong during compilation, and I'm stuck without further ideas for working out the problem.
Here are a few details on what I've tried so far and how to reproduce the issue:
Brownie Version 1.16.4, Tested on Windows 10 and Ubuntu 21.04
I've tried:
Importing libraries with Brownie package manager
Importing libraries with npm and using relative paths
All kinds of different compiler remappings in the brownie-config.yaml
Adding all dependency files to project folders manually
Here's a link to my code for reproducing my error:
https://github.com/MjCage/swap-demo
It'd be fantastic if someone could help.
It's very unlikely that something is "going wrong during compilation". If your contract compiles but what it does does not match the sources, you have found a very serious codegen bug in the compiler and you should report it so that it can be fixed quickly. From experience I'd say that it's much more likely that you have a bug in your contract though.
As for the error during verification - the problem is that to properly compile a multi-file project, you have to provide all the source files and have them in the right directories. This applies to library code as well, so if your contract imports ISwapRouter.sol, you also need to submit that file and all the files it in turn imports.
The next hurdle is that as far as I can tell, the multi-file verification option at Etherscan only allows you to submit files from a single directory so it only gets their names, not the whole paths (not sure if it's different via the API). You need Etherscan to see the file as @uniswap/v3-periphery/contracts/interfaces/ISwapRouter.sol but it sees just ISwapRouter.sol instead and the compiler will not treat them as the same (both could exist after all).
The right solution is to use the Standard JSON verification option - this way you submit the whole JSON input that your framework passes to the compiler and that includes all files in the project (including libraries) and relevant compiler options. The issue is that Brownie does not give you this input directly. You might be able to recreate it from the JSON it stores on disk (Standard JSON input format is documented at Compiler Input and Output JSON Description) but that's a bit of manual work. Unfortunately Brownie does not provide any way to request this on the command line. The only other way to get it that I know of is to use Brownie's API and call compiler.generate_input_json().
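If you go down that route, here is a rough sketch of the idea (hedged: the function name comes from the description above, but its exact signature and module path vary between Brownie versions, so check your installed version; library sources would also have to be added under the same paths the imports use):
import json
from pathlib import Path
from brownie.project import compiler  # assumed module path, may differ by version

# Map source paths to their contents; dependency sources belong here too,
# under paths like @uniswap/... so they match the import statements.
sources = {str(p): p.read_text() for p in Path("contracts").rglob("*.sol")}

input_json = compiler.generate_input_json(sources)
Path("standard-input.json").write_text(json.dumps(input_json, indent=2))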
Since this is a simple project with just one contract and does not have deep dependencies, it might be easier for you to follow @Jacopo Mosconi's answer and just "flatten" the contract by replacing all imports by sources pasted directly into the main contract. You might also try copying the file to your project dir and altering the import so that it only contains the file name, without any path component - this might pass the multi-file verification. Flattening is ultimately how Brownie and many other frameworks currently do verification and Etherscan's check is lax enough to allow sources modified in such a way - it only checks bytecode so you can still verify even if you completely change the import structure, names, comments or even any code that gets removed by the optimizer.
The compiler can't find ISwapRouter.sol.
You can add the code of ISwapRouter.sol directly to your swap.sol and delete that import line from your code. The source is here: https://github.com/Uniswap/v3-periphery/blob/main/contracts/interfaces/ISwapRouter.sol
I'm trying to write a script which can download the outputs from an Azure ML experiment Run after the fact.
Essentially, I want to know how I can get a Run by its runId property (or some other identifier).
I am aware that I have access to the Run object when I create it for the purposes of training. What I want is a way to recreate this Run object later in a separate script, possibly from a completely different environment.
What I've found so far is a way to get a list of ScriptRun objects from an experiment via the get_runs() function. But I don't see a way to use one of these ScriptRun objects to create a Run object representing the original Run and allowing me to download the outputs.
Any help appreciated.
I agree that this could probably be better documented, but fortunately, it's a simple implementation.
This is how you get a run object for an already submitted run with azureml-sdk>=1.16.0 (for the older approach see my answer here):
from azureml.core import Workspace
ws = Workspace.from_config()
run = ws.get_run('YOUR_RUN_ID')
Once you have the run object, you can call methods like:
.get_file_names() to see what files are available (the logs in azureml-logs/ and logs/azureml/ will also be listed)
.download_file() to download an individual file
.download_files() to download all files that match a given prefix (or all the files)
See the Run object docs for more details.
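For example, a quick sketch that lists a run's files and pulls down everything under outputs/ (the run ID is a placeholder):
from azureml.core import Workspace

ws = Workspace.from_config()
run = ws.get_run("example-run-id")               # placeholder run ID

print(run.get_file_names())                      # everything the run logged or uploaded
run.download_files(prefix="outputs/",            # just the outputs folder
                   output_directory="./downloaded_outputs")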
I have to maintain some old code running with PySpark.
It uses an approach I've never seen before.
I have some reusable code zipped into a file ingestion.zip.
Then, this file is referenced from a pipeline.cfg file like this:
[spark]
master=spark://master
py-files=${HOME}/lib/ingestion.zip
spark-submit=${SPARK_HOME}/bin/spark-submit
When I try to import the library as shown below, I can't make PyCharm understand that the import should resolve to the zip file.
from ingestion.data import csv, storage
I've seen that the zip is the approach spark-submit supports via --py-files, but how can I make it work in my IDE?
I haven't used the method below with PyCharm, but it worked for us with spark-submit, and we could import these modules using normal import statements.
We only had a few files to import and needed something quick. So if you have the same use case, and if PyCharm allows it, maybe you can give it a try.
--py-files s3://bucket-name/module1.py,s3://bucket-name/module2.py,s3://bucket-name/module3.py,s3://bucket-name/module4.py
(Note - there shouldn't be any spaces.)
(Note - this suggestion is only an interim solution till someone replies with a better answer.)
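Since the question is also about the IDE: Python can import directly from a zip archive, so another low-tech option (a sketch, not PyCharm-specific, assuming the zip really contains an importable ingestion package) is to put the zip on sys.path in your local entry point and ship it to the executors with addPyFile:
import os
import sys

# Make the zipped package importable in the local driver / IDE run configuration.
sys.path.insert(0, os.path.expanduser("~/lib/ingestion.zip"))   # same path as in pipeline.cfg

from pyspark.sql import SparkSession
from ingestion.data import csv, storage         # resolves via zipimport

spark = SparkSession.builder.master("local[*]").appName("ide-debug").getOrCreate()
spark.sparkContext.addPyFile(os.path.expanduser("~/lib/ingestion.zip"))  # ship it to the executors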
Setting the root directory in a Python chunk with the following line of code results in an error, while in an ordinary R chunk it works just fine:
```{python}
knitr::opts_knit$set(root.dir ="..")
```
```{r}
knitr::opts_knit$set(root.dir ="..")
```
Ideally, the following options would exist for each knitr chunk:
- directory to find code to be imported / executed
- directory to find files / dependencies that are needed for code execution
- directory to save any code output
Does something similar exist?
What it looks like here is that you have told it to expect Python code:
```{python}
knitr::opts_knit$set(root.dir ="..")
```
When you run this in RStudio it will give you an error:
Error: invalid syntax (, line 1)
You fed it R code inside a Python chunk. This makes sense: the call knitr::opts_knit$set means "look in the knitr package for opts_knit$set and set it to…", and that doesn't exist in Python… yet. The Python interpreter does not recognize it as Python code and returns a syntax error, whereas when you run it as an R chunk, it knows to look into the knitr package. Error handling can be a huge issue to deal with; it makes more sense to handle error categories than to account for every type of error. If you want to control the settings for a code chunk, you would do so in the chunk header, i.e.:
```{python, args }
codeHere
```
I have not seen chunk arguments like these for any language other than R, but that does not mean they don't exist; I just have not seen them. I also do not believe this will fix your issue. You could try some of the following ideas:
Writing your Python in a separate file and linking to it. This would let you take advantage of the language and use things like the os module, since Python has its own ways of navigating the various operating systems. This may be helpful if you are just running quick scripts rather than loading or running full Python programs.
# OS module
import os
# Your os name
print(os.name)
# Gets PWD or present working directory
os.getcwd()
# change your directory
os.chdir("path")
You could try using the reticulate library within an R chunk and load your Python that way.
Another thought is that you could try:
library(reticulate)
use_python("path")
knitr looks in the same directory as your markdown file to find other files if needed, just like most other IDEs.
At one point in time knitr would not accept R’s setwd() command. Trying to call setwd() may not do you any good.
It may not be the best idea to compute paths relative to what's being executed. If possible, they should be determined relative to where the user calls the code from.
The author of the knitr package is very active and involved. Good luck with what you are doing!
I am writing a Python script that analyzes a piece of text and returns the data in JSON format. I am using NLTK to analyze the data. Basically, this is my flow:
Create an endpoint (API gateway) -> calls my lambda function -> returns JSON of required data.
I wrote my script and deployed it to Lambda, but I ran into this issue:
Resource punkt not found. Please use the NLTK
Downloader to obtain the resource:
>>> import nltk
>>> nltk.download('punkt')
Searched in:
- '/home/sbx_user1058/nltk_data'
- '/usr/share/nltk_data'
- '/usr/local/share/nltk_data'
- '/usr/lib/nltk_data'
- '/usr/local/lib/nltk_data'
- '/var/lang/nltk_data'
- '/var/lang/lib/nltk_data'
Even after downloading 'punkt', my script still gave me the same error. I tried the solutions here:
Optimizing python script extracting and processing large data files
but the issue is that the nltk_data folder is huge, while Lambda has a size restriction.
How can I fix this issue?
Or where else can I run my script and still integrate an API call?
I am using serverless to deploy my python scripts.
There are two things that you can do:
The error seems like the path is not being defined properly; maybe set it as an environment variable?
import os, sys
sys.path.append(os.path.abspath('/var/task/nltk_data/'))
Or this way:
Once you have run nltk.download(), copy the downloaded data to the root folder of your AWS Lambda application. (Name the directory "nltk_data".)
In the Lambda function dashboard (in the AWS console), add NLTK_DATA=./nltk_data as an environment variable.
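A related option, not spelled out above, is to append the bundled directory to NLTK's own search path inside the handler; a minimal sketch, assuming the data was packaged under nltk_data as described:
import os
import nltk

# LAMBDA_TASK_ROOT is set by Lambda and points at the unpacked code (/var/task).
nltk.data.path.append(os.path.join(os.environ.get("LAMBDA_TASK_ROOT", "/var/task"), "nltk_data"))

from nltk.tokenize import word_tokenize         # needs tokenizers/punkt from the bundle
print(word_tokenize("Quick check that punkt resolves."))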
Reduce the size of the NLTK downloads, since you won't be needing all of them.
Delete all the zip files and keep only the sections you need; for example, for stopwords keep nltk_data/corpora/stopwords and delete the rest.
Or, if you need tokenizers, keep nltk_data/tokenizers/punkt. Most of these can be downloaded separately (for example python -m nltk.downloader punkt), then copy over the files.
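If it helps, the downloader also takes a -d flag that writes the data straight into a target directory, which saves the manual copy step (a one-liner sketch):
python -m nltk.downloader -d ./nltk_data punkt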