Sagemaker: How to debug Model monitoring(data quality and model quality)?

I created a Data Quality monitoring schedule from the SageMaker Studio UI and also with the SageMaker Python SDK, following the guide for creating a model Data Quality monitoring job.
Errors:
When there is no captured data (this is expected):
Monitoring job failure reason:
Job inputs had no data
From the logs, I can see that it uses Java in the background, and I am not sure how to debug this error:
org.json4s.package$MappingException: Do not know how to convert
JObject(List(0,JDouble(38.0))) into class java.lang.String.
Once the Data Quality monitoring job is created, whether through the SageMaker Studio UI or the SageMaker Python SDK, it takes an hour to start. Is there a way to debug the monitoring job without waiting an hour every time we get an error?

For development, it might be easier to trigger execution of the monitoring job manually. Take a look at this Python code.
If you want to see how it's used, open the lab 5 notebook of the workshop and scroll almost to the end, to the cells right after the "Triggering execution manually" title.
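As a rough sketch of what triggering the job manually can look like, the snippet below builds a request for the SageMaker `create_processing_job` API, which runs the monitoring container directly as a processing job instead of waiting for the hourly schedule. This is not the workshop's helper code. The `build_monitor_job_request` and `start_monitor_job` names, the image URI, the role ARN, and the S3 paths are hypothetical placeholders. The environment variable names are assumed from the Model Monitor container contract, so check them against the current AWS documentation before relying on them.

```python
import json
import time


def build_monitor_job_request(image_uri, role_arn, capture_s3_uri, reports_s3_uri,
                              instance_type="ml.m5.xlarge"):
    """Build kwargs for boto3's sagemaker create_processing_job (hypothetical helper)."""
    input_dir = "/opt/ml/processing/input/endpoint"
    output_dir = "/opt/ml/processing/output"
    return {
        "ProcessingJobName": f"manual-data-quality-{int(time.time())}",
        "RoleArn": role_arn,
        "AppSpecification": {"ImageUri": image_uri},
        "ProcessingResources": {
            "ClusterConfig": {
                "InstanceCount": 1,
                "InstanceType": instance_type,
                "VolumeSizeInGB": 20,
            }
        },
        # Captured endpoint data is copied from S3 into the container.
        "ProcessingInputs": [{
            "InputName": "endpoint_input",
            "S3Input": {
                "S3Uri": capture_s3_uri,
                "LocalPath": input_dir,
                "S3DataType": "S3Prefix",
                "S3InputMode": "File",
            },
        }],
        # Reports written by the container are uploaded when the job ends.
        "ProcessingOutputConfig": {
            "Outputs": [{
                "OutputName": "result",
                "S3Output": {
                    "S3Uri": reports_s3_uri,
                    "LocalPath": output_dir,
                    "S3UploadMode": "EndOfJob",
                },
            }]
        },
        # Assumed container-contract variables; verify against the AWS docs.
        "Environment": {
            "dataset_format": json.dumps(
                {"sagemakerCaptureJson": {"captureIndexNames": ["endpointInput", "endpointOutput"]}}
            ),
            "dataset_source": input_dir,
            "output_path": output_dir,
            "publish_cloudwatch_metrics": "Disabled",
        },
    }


def start_monitor_job(request):
    """Submit the request; needs boto3 and AWS credentials."""
    import boto3  # imported lazily so the builder works without AWS access
    return boto3.client("sagemaker").create_processing_job(**request)
```

A real data quality check would probably also mount the baseline statistics and constraints files as extra inputs; they are left out here to keep the sketch short. Because the job starts immediately, a failure such as the json4s MappingException shows up in the job logs within minutes rather than after the next scheduled run.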

Related

Real Time Cluster Log Delivery in a Databricks Cluster

I have some Python code that I am running on a Databricks Job Cluster. My Python code will be generating a whole bunch of logs and I want to be able to monitor these logs in real time (or near real time), say through something like a dashboard.
So far, I have configured my cluster log delivery location, and my logs are delivered to the specified destination every 5 minutes.
This is explained here:
https://learn.microsoft.com/en-us/azure/databricks/clusters/configure
Here is an extract from the same article,
When you create a cluster, you can specify a location to deliver the
logs for the Spark driver node, worker nodes, and events. Logs are
delivered every five minutes to your chosen destination. When a
cluster is terminated, Azure Databricks guarantees to deliver all logs
generated up until the cluster was terminated.
Is there some way I can have these logs delivered somewhere in near real time, rather than every 5 minutes? It does not have to be through the same method either, I am open to other possibilities.

By default, logs are delivered every 5 minutes. Unfortunately, this interval cannot be changed, and the official documentation does not describe any way to configure it.
However, you can raise a feature request.

Kivy app takes 30 seconds to open in android device

After my Kivy app is pushed to my Android device through buildozer, I first see the Kivy loading symbol, then the screen stays blank for 30 seconds, and only after that does my app open.
This happens on the first run as well as on subsequent runs.
I have read some answers and learned that "we can avoid this problem by starting with minimal GUI
and loading the rest more lazily".
Could anyone please let me know how to load the app like this when it opens?

For example, if you use the on_pre_enter function and give it many things to do, it is normal to wait a long time. But there is no code in the question, so I can't analyze it and give specific tips. Processing time on your computer and on Android also depends on background applications, hardware, and many other things. So try to share minimal code, such as your startup functions, or create a minimal application that contains these functions so you can test your code piece by piece.

How do I run Python code on a cloud service to automate its run for 5 days

I am working on a web scraping project using Python and an API.
I want the Python script to run as a job every day for 5 days, for 12 hours each day.
I don't want to keep my own system running to do this from CMD or Jupyter, so I am looking for a cloud service that would help me automate the process.

One way to do this would be to write a web scraper in Python, and run it on an AWS Lambda, which is essentially a serverless function with no underlying ops to manage. Depending on your use case, you could either perform some action based on the contents of that page data, or you could write the result out to S3 as a file.
To have your function execute in a recurring fashion, you can then set your AWS Lambda event trigger to be a CloudWatch event (in this case, some recurring timer at whatever frequencies/times you'd like, such as once each hour for a 12 hour window during Mon-Fri).
This is typically going to be an easier approach when compared to spinning up a virtual server (EC2 instance), and managing a persistent process that could error during waits/operation for any number of reasons.
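A minimal sketch of such a function is shown below. A single Lambda invocation can run for at most 15 minutes, so a 12-hour window is covered by many short scheduled invocations rather than one long process. For example, a schedule expression such as `cron(0 8-19 ? * MON-FRI *)` would fire once an hour for 12 hours on weekdays (hours are in UTC). The `TARGET_URL` and `BUCKET` values and the helper names are hypothetical, and the function's role is assumed to be allowed to write to the bucket.

```python
import datetime
import urllib.request

TARGET_URL = "https://example.com/"   # hypothetical page to scrape
BUCKET = "my-scrape-results"          # hypothetical S3 bucket name


def fetch_page(url, timeout=10):
    """Download the raw page body with the standard library."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def build_key(prefix, now):
    """Give every run its own timestamped object key."""
    return f"{prefix}/{now:%Y/%m/%d/%H%M%S}.html"


def store_in_s3(bucket, key, body):
    """Write the result to S3; boto3 ships with the Lambda Python runtime."""
    import boto3
    boto3.client("s3").put_object(Bucket=bucket, Key=key, Body=body)


def lambda_handler(event, context, fetch=fetch_page, store=store_in_s3):
    """Entry point; fetch and store are injectable so the handler can be tested locally."""
    now = datetime.datetime.now(datetime.timezone.utc)
    body = fetch(TARGET_URL)
    key = build_key("scrapes", now)
    store(BUCKET, key, body)
    return {"key": key, "bytes": len(body)}
```

Lambda only passes `event` and `context`, so the two extra parameters keep their defaults in production and exist purely to make local testing possible without network or AWS access.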

How to reconnect to the ongoing process on GoogleColab

I recently started using Google Colab to train my CNN model. A single training run always takes 10+ hours. I cannot stay in the same place for that long, so I always power off my laptop and let the process keep going.
My code saves models automatically. I found that after I disconnect from Colab, the process keeps saving models.
Here are my questions:
When I try to reconnect to the Colab notebook, it always gets stuck at the "INITIALIZING" stage and can't connect. I'm sure that the process is running. How do I know when the process is over?
Is there any way to reconnect to the ongoing process? It would be nice to observe the training losses during training.
Sorry for my poor English, and thanks a lot.

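Output your loss results to a log file saved in your Drive, and periodically check this file.
You can run your training process like this (assuming Google Drive is already mounted at /content/drive):
log_file = "/content/drive/My Drive/path/log.log"
!python train.py > "{log_file}"
Here log_file is an ordinary Python variable, and the notebook substitutes {log_file} into the ! command. Setting it with a separate ! line would not work, because each ! line runs in its own shell.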
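If you would rather write the log from inside the training script than redirect its output, a minimal sketch could look like the following. The `log_loss` helper and the log path are hypothetical. The point is to append and flush after every epoch, so the file on Drive stays current even when the browser disconnects.

```python
import os
import time


def log_loss(log_path, epoch, loss):
    """Append one line per epoch and push it to disk immediately."""
    os.makedirs(os.path.dirname(log_path) or ".", exist_ok=True)
    timestamp = time.strftime("%Y-%m-%d %H:%M:%S")
    with open(log_path, "a", encoding="utf-8") as handle:
        handle.write(f"{timestamp} epoch={epoch} loss={loss:.6f}\n")
        handle.flush()
        os.fsync(handle.fileno())


# Inside the training loop (hypothetical path):
# log_loss("/content/drive/My Drive/path/log.log", epoch, loss)
```

Opening the file in append mode on every call is slower than keeping it open, but it means a crash or disconnect never leaves buffered lines unwritten.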

First question: restart the runtime from the Runtime menu.
Second question: I think you can use TensorBoard to monitor your work.

It seems there's no normal way to do this. But you can save your model to Google Drive with current training epoch number, so when you see something like "my_model_epoch_1000" on your google drive, you will know that the process is over.
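The check described above can be sketched as a small script that you run from any session with access to the same Drive folder. The `my_model_epoch_` prefix follows the example name above; the function names and the directory are hypothetical.

```python
import pathlib
import re

# Matches names such as "my_model_epoch_1000" or "my_model_epoch_1000.h5".
CHECKPOINT_PATTERN = re.compile(r"my_model_epoch_(\d+)")


def latest_saved_epoch(checkpoint_dir):
    """Return the highest epoch number found in checkpoint names, or None."""
    epochs = []
    for path in pathlib.Path(checkpoint_dir).iterdir():
        match = CHECKPOINT_PATTERN.match(path.name)
        if match:
            epochs.append(int(match.group(1)))
    return max(epochs) if epochs else None


def training_finished(checkpoint_dir, final_epoch):
    """True once a checkpoint for the final epoch (or later) exists."""
    latest = latest_saved_epoch(checkpoint_dir)
    return latest is not None and latest >= final_epoch
```

This only tells you that the last checkpoint was written, so it assumes the training script saves a checkpoint at its final epoch.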

Airflow "This DAG isnt available in the webserver DagBag object "

When I put a new DAG Python script in the dags folder, I can see a new entry for the DAG in the DAG UI, but it is not enabled automatically. On top of that, it does not seem to be loaded properly either. I have to click the Refresh button on the right side of the list a few times and toggle the on/off button on the left side of the list before I can schedule the DAG. These are manual steps, as I need to trigger something even though the DAG script was put inside the dags folder.
Can anyone help me with this? Did I miss something? Or is this the correct behavior in Airflow?
By the way, as mentioned in the post title, there is an indicator with the message "This DAG isn't available in the webserver DagBag object. It shows up in this list because the scheduler marked it as active in the metadata database" next to the DAG title before I go through this manual process.

It is not you, nor is it correct or expected behavior.
It is a current 'bug' in Airflow.
The web server is caching the DagBag in a way that you cannot really use it as expected.
"Attempt removing DagBag caching for the web server" remains on the official TODO as part of the roadmap, indicating that this bug may not yet be fully resolved, but here are some suggestions on how to proceed:
Only use builders in airflow v1.9+
Prior to airflow v1.9, this occurs when a DAG is instantiated by a function that is imported into the file where instantiation happens, that is, when a builder or factory pattern is used. Some reports of this issue on GitHub and JIRA led to a fix released in airflow v1.9.
If you are using an older version of airflow, don't use builder functions.
Use airflow backfill to reload the cache
As Dmitri suggests, running airflow backfill '<dag_id>' -s '<date>' -e '<date>' with the same start and end date can sometimes help. Afterwards you may end up with the (non-)issue that Priyank points out, but that is expected behavior (state: paused or not), depending on the configuration of your installation.

Restarting the airflow webserver solved my issue.

This error can be misleading. If hitting the refresh button or restarting the airflow webserver doesn't fix the issue, check the DAG (Python script) for errors.
Running airflow list_dags can display the DAG errors (in addition to listing the DAGs). You can also try running or testing your DAG as a normal Python script.
After fixing the error, this indicator should go away.
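The "run it as a normal Python script" check can be sketched for a whole folder at once. The snippet below is not an Airflow API; `find_broken_dag_files` is a hypothetical helper that executes each file with the standard library and collects the traceback of any file that fails. It should be run in the same Python environment as Airflow, so that the DAG files' own imports resolve.

```python
import pathlib
import runpy
import traceback


def find_broken_dag_files(dags_folder):
    """Execute every .py file in the folder and map failing file names to tracebacks."""
    errors = {}
    for path in sorted(pathlib.Path(dags_folder).glob("*.py")):
        try:
            # Runs the file top to bottom, much like `python my_dag.py` would.
            runpy.run_path(str(path))
        except Exception:
            errors[path.name] = traceback.format_exc()
    return errors


if __name__ == "__main__":
    # "dags" is a placeholder for your own dags folder.
    for name, trace in find_broken_dag_files("dags").items():
        print(f"--- {name} ---\n{trace}")
```

Because each file is actually executed, any heavy work done at module level also runs here, which makes slow parse-time code easy to notice.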

The issue occurs because, by default, a DAG is put in the DagBag in a paused state, so that the scheduler is not overwhelmed with lots of backfill activity on start or restart.
To work around this, change the setting below in your airflow.cfg file:
# Are DAGs paused by default at creation
dags_are_paused_at_creation = False
Hope this helps. Cheers!

I have a theory about a possible cause of this issue in Google Composer. There is a section about DAG failures on the webserver in the troubleshooting documentation for Composer, which says:
Avoid running heavyweight computation at DAG parse time. Unlike the
worker and scheduler nodes, whose machine types can be customized to
have greater CPU and memory capacity, the webserver uses a fixed
machine type, which can lead to DAG parsing failures if the parse-time
computation is too heavyweight.
I was trying to load configuration from an external source. This took a negligible amount of time compared to the other operations needed to create the DAG, but it still broke something, because the Airflow webserver in Composer runs on App Engine, which has strange behaviours.
I found the workaround in the discussion of this Google issue: create a separate DAG with a task that loads all the data needed and stores it in an Airflow variable:
Variable.set("pipeline_config", config, serialize_json=True)
Then I could do
Variable.get("pipeline_config", deserialize_json=True)
And successfully generate the pipeline from that. An additional benefit is that I get logs from that task, which I can't get from the web server because of this issue.
