How to include the current timestamp in spaCy's project.yml scripts?

I'm using spaCy projects to build a training pipeline for a dependency parser in a custom Bulgarian language model. I want to retrain and re-evaluate multiple times with different datasets and tokenizer rulesets, so I need to put a timestamp on the evaluation metrics.
I do this by running spacy evaluate ./models/model-best test.spacy -dp ./visualizations/ --output metrics-$(date +"%FT%H%M").json to get a metrics file with the current time included in the name.
This script, however, does not work when it's part of the project.yml file. I tried different ways to escape the " and % characters, but nothing worked. Any help will be much appreciated.

In theory, you could run this command from the project with sh -c if you can figure out the YAML escaping (I couldn't quickly). If I wanted to do this, I'd just move the command into a shell script and call that script from the project instead.
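A minimal sketch of what that could look like (the script name and paths are illustrative):
scripts/evaluate.sh:
#!/bin/bash
# Run the evaluation and write a timestamped metrics file
spacy evaluate ./models/model-best test.spacy -dp ./visualizations/ --output metrics-$(date +"%FT%H%M").json
And in project.yml:
commands:
  - name: "evaluate-timestamped"
    help: "Evaluate the trained model and write timestamped metrics"
    script:
      - "bash scripts/evaluate.sh"
The project command then only contains a plain string, so no escaping of " or % is needed in the YAML.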

Related

Run ner.manual in Prodigy on csv file

I am new to Prodigy and haven't fully figured out the paradigm.
For a project, I would like to manually annotate names in texts. My team has developed our own model to recognize the names, so I only want to use the annotated texts (produced with Prodigy) as a gold standard for our model.
To do so, I have a csv file texts.csv with the text in one of its columns. Do I need to convert this file to JSON, or can I also run Prodigy on the csv file directly?
Also, what is the command I need to run to start ner.manual with this dataset?
I suppose, I have to start with:
!python -m prodigy ner.manual
However, it is unclear to me how I should run the rest. Can someone help me with this?
File Format
I believe the recipes that list "Text Source" accept jsonl, json, csv, or txt (see the "Text Source" section here: https://prodi.gy/docs/api-loaders). ner.manual takes a "Text Source", so it should work with your csv (reference: https://prodi.gy/docs/recipes#ner-manual).
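So something along these lines should work (a sketch; the dataset name and label are illustrative, and I believe the csv loader expects the text to be in a column named "text" or "Text"):
python -m prodigy ner.manual names_dataset blank:en ./texts.csv --label PERSON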
ner.manual
In regards to running ner.manual try taking a look at this documentation https://prodi.gy/docs/
The documentation contains a good example:
python -m prodigy ner.manual ner_news_headlines blank:en ./news_headlines.jsonl --label PERSON,ORG,PRODUCT,LOCATION
ner_news_headlines is the name of the dataset (it could be named anything)
blank:en is a blank English model
./news_headlines.jsonl is the file you will be annotating (use whatever your file is called)
PERSON,ORG,PRODUCT,LOCATION are the labels you will annotate your data with (change these to whatever labels you want to use, and be sure to separate them with commas, not spaces)
I'm also pretty new to Prodigy, so someone else may have a better answer.

Save jupyter cell computation work and load from where I left

I am currently reading the book "Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems" and I am experimenting with some housing analysis example.
My question is: every time I open Jupyter Notebook I have to run all the cells from the beginning, one by one, in order to continue my work. I am new to Jupyter and I haven't found a solution to this.
I have imported packages like pandas, a function that downloads tar files and extracts them, also another one loading the data and many others.
Is there any way to save and load everything (packages, functions, variables, etc.) so I can continue my work from the last checkpoint?
I have tried Kernel → Restart & Run All, but it is not working.
I have also tried Cell → Run All, but that is not working either.
I am using Jupyter Notebook 6.1.4, installed through the latest version of Anaconda.
Any help would be appreciated. Thank you
You can pickle the variables and store them on disk. Next time, you can just unpickle them and get going.
I am assuming you do not want to repeat expensive operations on the same data each time you work on the notebook.
# For storing
import pickle

with open('pickle_file_name.pkl', 'wb') as file_object:  # binary mode is required for pickle
    pickle.dump(your_data_variable, file_object)

# For loading
with open('pickle_file_name.pkl', 'rb') as file_object:
    your_data_variable = pickle.load(file_object)
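Applied to the book's housing example (the variable names below are just illustrative), you can bundle several objects into one dict and checkpoint them together:
import pickle

# Save a checkpoint of several variables at once
checkpoint = {'housing': housing, 'housing_labels': housing_labels}
with open('checkpoint.pkl', 'wb') as file_object:
    pickle.dump(checkpoint, file_object)

# In a fresh session, restore them and continue from there
with open('checkpoint.pkl', 'rb') as file_object:
    checkpoint = pickle.load(file_object)
housing = checkpoint['housing']
housing_labels = checkpoint['housing_labels']
Keep in mind that imports and function definitions can't be restored this way; those cells still need to be re-run, but that is usually quick.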

AWS SageMaker SKLearn entry point in a subdirectory?

Can I specify SageMaker estimator's entry point script to be in a subdirectory? So far, it fails for me. Here is what I want to do:
sklearn = SKLearn(
    entry_point="RandomForest/my_script.py",
    source_dir="../",
    hyperparameters={...
I want to do this so I don't have to break my directory structure. I have some modules, which I use in several sagemaker projects, and each project lives in its own directory:
my_git_repo/
    RandomForest/
        my_script.py
        my_sagemaker_notebook.ipynb
    TensorFlow/
        my_script.py
        my_other_sagemaker_notebook.ipynb
    module_imported_in_both_scripts.py
If I try to run this, SageMaker fails because it seems to parse the name of the entry point script to make a module name out of it, and it does not do a good job:
/usr/bin/python3 -m RandomForest/my_script --bootstrap True --case nf_2 --max_features 0.5 --min_impurity_decrease 5.323785009485933e-06 --model_name model --n_estimators 455 --oob_score True
...
/usr/bin/python3: No module named RandomForest/my_script
Anyone knows a way around this other than putting my_script.py in the source_dir?
Related to this question
Unfortunately, this is a gap in functionality. There is some related work in https://github.com/aws/sagemaker-python-sdk/pull/941 which should also solve this issue, but for now, you do need to put my_script.py in source_dir.
What if you do source_dir = my_git_repo/RandomForest ?
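If you go that route, the entry point becomes relative to source_dir, and the shared module could be shipped via the dependencies parameter. A rough sketch, not tested (framework_version, role, and instance settings are placeholders, and role is assumed to be an existing IAM role variable):
from sagemaker.sklearn.estimator import SKLearn

sklearn = SKLearn(
    entry_point="my_script.py",                            # now relative to source_dir
    source_dir="RandomForest",                             # only this directory is uploaded
    dependencies=["module_imported_in_both_scripts.py"],   # extra files copied next to the script
    framework_version="0.23-1",                            # placeholder
    role=role,                                             # existing IAM role
    instance_type="ml.m5.large",                           # placeholder
    instance_count=1,
    hyperparameters={"n_estimators": 455},
)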
Otherwise, you can also use a build step (such as CodeBuild, but it could also be some custom code, e.g. in Lambda or Airflow) to send your script to S3 as a compressed artifact, since that is how lower-level SDKs such as boto3 expect your script anyway; this type of integration is shown in the boto3 section of the SageMaker Sklearn random forest demo.

Training Tesseract OCR for ambiguities

I am pretty new to data scraping and I am facing a minor issue.
I am trying to extract text from a Hindi pdf using textract and Tesseract OCR.
Following is the code in Python:
import textract
text = textract.parsers.process("test.pdf", encoding='utf_8', method='tesseract', language = 'hin')
Now, many of the words from the PDF are extracted correctly. However, some are messed up. I read in the documentation how ambiguities can be overridden using a lang.unicharambigs file. However, I need to run combine_tessdata in order to actually bring it into effect and override certain trained data.
However, when I try to run the command I get the following:
-bash: combine_tessdata: command not found
I have installed Tesseract from source and I can't seem to understand why this is happening. Any ideas on how to troubleshoot this?
Thanks in advance!
Tesseract's training executables (which include combine_tessdata) are built separately from the main binary:
https://github.com/tesseract-ocr/tesseract/wiki/Compiling
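If you built with the usual autotools flow, the training tools are typically an extra make target, roughly like this (prerequisites and paths may differ, see the wiki above):
make training
sudo make training-install
# combine_tessdata should now be on your PATH
which combine_tessdata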

Monitoring mongoDB in the cluster

I'm trying to monitor and analyze the results of a sharded MongoDB instance in the cluster. There's a nice monitoring tool provided by Mongo, MMS. However, I need to analyze and plot CPU/disk IO and shard-load graphs on my own. The question: is it possible to get the data from MMS (i.e. timestamps, opcounts, CPU utilization) as CSV or in some other form that can be loaded into R/Python?
You can build your own tool, although I highly doubt it will be better than MMS. As Asya suggested, you can use db.serverStatus() to read some of the data. You can check here for more commands and tools for collecting data.
You can do a quick-and-dirty test with the mongostat command. The fields it outputs are slightly different from what you put in the brackets, but it is easy to build on. All you need is to redirect the output of the command to a text file.
On Windows you would do this with mongostat > stats.txt, and the same redirection works on Linux. Then just parse the file with R/Python and plot whatever you want.
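If you end up building your own collector, a minimal sketch with pymongo (the connection string and the fields picked out are just examples) could poll serverStatus and append rows to a CSV for R/Python:
import csv
import time
from datetime import datetime

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # point this at your mongos or mongod

with open("mongo_stats.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["timestamp", "insert", "query", "update", "delete"])
    for _ in range(60):  # one sample per second for a minute
        status = client.admin.command("serverStatus")
        op = status["opcounters"]
        writer.writerow([datetime.utcnow().isoformat(),
                         op["insert"], op["query"], op["update"], op["delete"]])
        time.sleep(1)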
