NLTK data too large for Lambda deployment - Python

I am writing a Python script that analyzes a piece of text and returns the data in JSON format. I am using NLTK to analyze the data. Basically, this is my flow:
Create an endpoint (API gateway) -> calls my lambda function -> returns JSON of required data.
I wrote my script and deployed it to Lambda, but I ran into this issue:
Resource punkt not found. Please use the NLTK Downloader to obtain the resource:
>>> import nltk
>>> nltk.download('punkt')
Searched in:
- '/home/sbx_user1058/nltk_data'
- '/usr/share/nltk_data'
- '/usr/local/share/nltk_data'
- '/usr/lib/nltk_data'
- '/usr/local/lib/nltk_data'
- '/var/lang/nltk_data'
- '/var/lang/lib/nltk_data'
Even after downloading 'punkt', my script still gave me the same error. I tried the solutions here:
Optimizing python script extracting and processing large data files
but the issue is that the nltk_data folder is huge, while Lambda has a size restriction.
How can I fix this issue?
Or where else can I run my script and still integrate the API call?
I am using Serverless to deploy my Python scripts.

There are two things that you can do:
1. The error suggests the data path is not being resolved properly, so point NLTK at the bundled data explicitly, either in code:
nltk.data.path.append('/var/task/nltk_data')
or via an environment variable: run nltk.download() locally, copy the resulting directory into the root folder of your AWS Lambda application (name the directory "nltk_data"), then in the Lambda function dashboard (in the AWS console) add NLTK_DATA=./nltk_data as a key-value environment variable.
2. Reduce the size of the NLTK downloads, since you won't be needing all of them. Delete all the zip files and keep only the needed sections; for example, keep stopwords in nltk_data/corpora/stopwords and delete the rest. Or, if you need tokenizers, keep nltk_data/tokenizers/punkt. Most resources can be downloaded separately (python -m nltk.downloader punkt) and then copied over.
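Putting both points together, here is a minimal sketch of what the deployed function could look like, assuming the trimmed nltk_data directory sits at the root of the deployment package (the handler name and event shape are illustrative, not taken from the question):

import json
import nltk

# Lambda unpacks the deployment package under /var/task, so the bundled
# nltk_data directory ends up at /var/task/nltk_data.
nltk.data.path.append('/var/task/nltk_data')

def handler(event, context):
    # Hypothetical API Gateway proxy event with a JSON body
    # containing a "text" field.
    body = json.loads(event.get('body') or '{}')
    tokens = nltk.word_tokenize(body.get('text', ''))
    return {
        'statusCode': 200,
        'headers': {'Content-Type': 'application/json'},
        'body': json.dumps({'tokens': tokens}),
    }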

Related

ADO: how to read build information from a Python script started by an ADO agent

I am building a pipeline in Azure DevOps. I have a YAML file which starts a Python script on a self-hosted agent. Is there any way to read information about the pipeline that started the script from inside the Python script, without passing it as arguments?
The information I need includes:
build.definitionName,
system.teamProjectId,
system.pipelineStartTime
Python can access these variables using os.environ['Name']. As per the documentation, predefined variables are converted into uppercase and any '.' is replaced with '_'.
For example, to access these variables on a Windows OS it should be as simple as:
import os
os.environ['BUILD_DEFINITIONNAME']
os.environ['SYSTEM_TEAMPROJECTID']
os.environ['SYSTEM_PIPELINESTARTTIME']
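The same mapping can be applied programmatically if you want to look variables up by their pipeline names; a small illustrative sketch (the helper name is mine, not part of any SDK):

import os

def pipeline_var(name):
    # Apply the documented mapping: uppercase, '.' replaced with '_'
    return os.environ.get(name.upper().replace('.', '_'))

print(pipeline_var('Build.DefinitionName'))
print(pipeline_var('System.TeamProjectId'))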

Getting compiler error when trying to verify a contract importing from @uniswap/v3-periphery

I'm trying to perform a simple swap from DAI to WETH with Uniswap in my own smart contract on the Kovan testnet. Unfortunately my transaction keeps getting reverted, even after setting the gas limit manually.
I also discovered that I cannot verify the contract on Kovan via the Etherscan API, nor manually. Instead I keep getting this error for every library I import:
Source "@uniswap/v3-periphery/contracts/interfaces/ISwapRouter.sol" not found: File import callback not supported
Accordingly, I have the feeling something is going wrong during compilation, and I'm stuck without any further ideas for working out the problem.
Here are a few details on what I've tried so far and how to reproduce:
Brownie Version 1.16.4, Tested on Windows 10 and Ubuntu 21.04
I've tried:
Importing libraries with Brownie package manager
Importing libraries with npm and using relative paths
All kinds of different compiler remappings in the brownie-config.yaml
Adding all dependency files to project folders manually
Here's a link to my code for reproducing my error:
https://github.com/MjCage/swap-demo
It'd be fantastic if someone could help.
It's very unlikely that something is "going wrong during compilation". If your contract compiles but what it does does not match the sources, you have found a very serious codegen bug in the compiler and you should report it so that it can be fixed quickly. From experience I'd say that it's much more likely that you have a bug in your contract though.
As for the error during verification - the problem is that to properly compile a multi-file project, you have to provide all the source files and have them in the right directories. This applies to library code as well, so if your contract imports ISwapRouter.sol, you also need to submit that file and all the files it in turn imports.
The next hurdle is that as far as I can tell, the multi-file verification option at Etherscan only allows you to submit files from a single directory so it only gets their names, not the whole paths (not sure if it's different via the API). You need Etherscan to see the file as @uniswap/v3-periphery/contracts/interfaces/ISwapRouter.sol but it sees just ISwapRouter.sol instead and the compiler will not treat them as the same (both could exist after all).
The right solution is to use the Standard JSON verification option - this way you submit the whole JSON input that your framework passes to the compiler, which includes all files in the project (including libraries) and the relevant compiler options. The catch is that Brownie does not expose this input directly and provides no way to request it on the command line. You might be able to recreate it from the JSON it stores on disk (the Standard JSON input format is documented at Compiler Input and Output JSON Description), but that's a bit of manual work. The only other way to get it that I know of is to use Brownie's API and call compiler.generate_input_json().
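A minimal sketch of that last approach, assuming generate_input_json() accepts a mapping of source paths to source code (populate it with your contract and every file it imports; the paths below are placeholders):

import json
from brownie.project import compiler

contract_sources = {
    'contracts/Swap.sol': open('contracts/Swap.sol').read(),
    # '@uniswap/v3-periphery/contracts/interfaces/ISwapRouter.sol': ...,
}
input_json = compiler.generate_input_json(contract_sources)

# This is the Standard JSON you can submit through Etherscan's
# Standard JSON verification form.
print(json.dumps(input_json, indent=2))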
Since this is a simple project with just one contract and does not have deep dependencies, it might be easier for you to follow @Jacopo Mosconi's answer and just "flatten" the contract by replacing all imports by sources pasted directly into the main contract. You might also try copying the file to your project dir and altering the import so that it only contains the file name, without any path component - this might pass the multi-file verification. Flattening is ultimately how Brownie and many other frameworks currently do verification and Etherscan's check is lax enough to allow sources modified in such a way - it only checks bytecode so you can still verify even if you completely change the import structure, names, comments or even any code that gets removed by the optimizer.
The compiler can't find ISwapRouter.sol.
You can add the code of ISwapRouter.sol directly to your swap.sol and delete the import line from your code. The source is here: https://github.com/Uniswap/v3-periphery/blob/main/contracts/interfaces/ISwapRouter.sol

How can one download the outputs of historical Azure ML experiment Runs via the Python API

I'm trying to write a script which can download the outputs from an Azure ML experiment Run after the fact.
Essentially, I want to know how I can get a Run by its runId property (or some other identifier).
I am aware that I have access to the Run object when I create it for the purposes of training. What I want is a way to recreate this Run object later in a separate script, possibly from a completely different environment.
What I've found so far is a way to get a list of ScriptRun objects from an experiment via the get_runs() function. But I don't see a way to use one of these ScriptRun objects to create a Run object representing the original Run and allowing me to download the outputs.
Any help appreciated.
I agree that this could probably be better documented, but fortunately, it's a simple implementation.
This is how you get a Run object for an already-submitted run with azureml-sdk>=1.16.0 (for the older approach, see my answer here):
from azureml.core import Workspace
ws = Workspace.from_config()
run = ws.get_run('YOUR_RUN_ID')
Once you have the Run object, you can call methods like:
.get_file_names() to see what files are available (the logs in azureml-logs/ and logs/azureml/ will also be listed)
.download_file() to download an individual file
.download_files() to download all files that match a given prefix (or all the files)
See the Run object docs for more details.
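Putting it together, a minimal sketch (the run ID, prefix, and output directory are placeholders):

from azureml.core import Workspace

ws = Workspace.from_config()
run = ws.get_run('YOUR_RUN_ID')

# See everything the run produced, including logs
print(run.get_file_names())

# Download all files under outputs/ into a local folder
run.download_files(prefix='outputs/', output_directory='./run_outputs')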

AWS SageMaker SKLearn entry point in a subdirectory?

Can I specify a SageMaker estimator's entry point script to be in a subdirectory? So far, it fails for me. Here is what I want to do:
sklearn = SKLearn(
entry_point="RandomForest/my_script.py",
source_dir="../",
hyperparameters={...
I want to do this so I don't have to break my directory structure. I have some modules which I use in several SageMaker projects, and each project lives in its own directory:
my_git_repo/
RandomForest/
my_script.py
my_sagemaker_notebook.ipynb
TensorFlow/
my_script.py
my_other_sagemaker_notebook.ipynb
module_imported_in_both_scripts.py
If I try to run this, SageMaker fails because it seems to parse the name of the entry point script to make a module name out of it, and it does not do a good job:
/usr/bin/python3 -m RandomForest/my_script --bootstrap True --case nf_2 --max_features 0.5 --min_impurity_decrease 5.323785009485933e-06 --model_name model --n_estimators 455 --oob_score True
...
/usr/bin/python3: No module named RandomForest/my_script
Does anyone know a way around this, other than putting my_script.py in the source_dir?
Related to this question
Unfortunately, this is a gap in functionality. There is some related work in https://github.com/aws/sagemaker-python-sdk/pull/941 which should also solve this issue, but for now, you do need to put my_script.py in source_dir.
What if you set source_dir = my_git_repo/RandomForest? (See the sketch below.)
Otherwise, you can also use build functionality (such as CodeBuild - but it could also be some custom code, e.g. in Lambda or Airflow) to send your script as a compressed artifact to S3, as this is how lower-level SDKs such as boto3 expect your script anyway; this type of integration is shown in the boto3 section of the SageMaker Sklearn random forest demo.
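A minimal sketch of that first suggestion, assuming the notebook is launched from above my_git_repo (the role, instance type, framework version, and hyperparameters are placeholders, and the shared module would need to be copied into RandomForest/ so the training job can import it):

from sagemaker.sklearn.estimator import SKLearn

# With source_dir pointing at the subdirectory, the entry point has no
# path component, so SageMaker derives a valid module name from it.
sklearn = SKLearn(
    entry_point='my_script.py',
    source_dir='my_git_repo/RandomForest',
    role='YOUR_SAGEMAKER_ROLE_ARN',
    instance_type='ml.m5.large',
    framework_version='0.23-1',
    hyperparameters={'n_estimators': 455},
)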

How to iterate a list of strings using Sikuli

I am using Sikuli to automate an application; it processes a file and saves the output.
I am taking a snapshot of the file itself so Sikuli can find it, but I have to process 30 files, and taking a snapshot of each one is really not that logical. Is there a way to loop through a list of file names as strings, so Sikuli can read each name and retrieve the file from a folder, instead of me taking snapshots of everything?
I did try to pass the file name as text, but I get an error from Sikuli, since it can't find the file.
I call findText("myfile.txt") when the file prompt is on screen, but I get an error:
[error] TextRecognizer: init: export tessdata not possible - run setup with option 3
[error] TextRecognizer not working: tessdata stuff not available at:
/User/test/Library/Application Support/Sikulix/SikulixTesseract/tessdata
[error] FindFailed ( null )
I checked with Google and found not much. I am aware that Sikuli is mainly for snapshot-based automation, but it has Python bindings (via Jython), so it can use Python logic like conditionals and loops; I assume there should be a way to process multiple files via code.
I still don't completely understand what you are trying to do, but the findText() function you are using actually attempts to find text on the screen by OCR extraction of text in the region. Are you sure that's what you want to do? If yes, you have to:
Set up Sikuli properly to include the tesseract libraries. There are detailed instructions on the SikuliX website.
Be aware that the OCR feature is rather flaky and usually unreliable unless you do some work on tweaking the OCR engine, which is outside SikuliX's scope.
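If OCR is set up and does work for you, the looping part is plain Python; a minimal sketch (the file names and the surrounding actions are illustrative):

file_names = ['file01.txt', 'file02.txt', 'file03.txt']

for name in file_names:
    # Locate the entry by its name via OCR and open it.
    # findText() requires a working tesseract setup (see above).
    match = findText(name)
    doubleClick(match)
    # ... drive the processing and save the output here

An alternative that avoids OCR entirely is to type each name into the dialog's file-name field with type() or paste() instead of locating it visually.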
