Real Time Cluster Log Delivery in a Databricks Cluster - python

I have some Python code that I am running on a Databricks Job Cluster. My Python code will be generating a whole bunch of logs and I want to be able to monitor these logs in real time (or near real time), say through something like a dashboard.
What I have done so far is, I have configured my cluster log delivery location and my logs are delivered to the specified destination every 5 minutes.
This is explained here:
https://learn.microsoft.com/en-us/azure/databricks/clusters/configure
Here is an extract from the same article:
When you create a cluster, you can specify a location to deliver the
logs for the Spark driver node, worker nodes, and events. Logs are
delivered every five minutes to your chosen destination. When a
cluster is terminated, Azure Databricks guarantees to deliver all logs
generated up until the cluster was terminated.
Is there some way I can have these logs delivered somewhere in near real time, rather than every 5 minutes? It does not have to be through the same method either, I am open to other possibilities.
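For context, this is roughly how the log delivery location is set on my job cluster spec (a minimal sketch; only cluster_log_conf matters here and the DBFS destination path is just an example):
# Part of the new_cluster block submitted to the Jobs API; the other values are placeholders.
new_cluster = {
    "spark_version": "11.3.x-scala2.12",
    "node_type_id": "Standard_DS3_v2",
    "num_workers": 2,
    "cluster_log_conf": {
        "dbfs": {"destination": "dbfs:/cluster-logs/my-job"}  # logs delivered here every ~5 minutes
    },
}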

By default, the delivery interval is 5 minutes. Unfortunately, it cannot be changed; the official documentation does not describe any way to configure it.
However, you can raise a feature request with Azure Databricks.

Related

Sagemaker: How to debug Model monitoring (data quality and model quality)?

I have created a Data Quality monitoring schedule from the SageMaker Studio UI, and also created one using SageMaker SDK code, following the reference for creating a model Data Quality monitoring job.
Errors:
When there is no captured data (which is expected), the monitoring job failure reason is:
Job inputs had no data
From the logs, I can see that it is using Java in the background, and I am not sure how to debug this:
org.json4s.package$MappingException: Do not know how to convert
JObject(List(0,JDouble(38.0))) into class java.lang.String.
Once we create the Data Quality monitoring job using the SageMaker Studio UI or the SageMaker Python SDK, it takes an hour to start. Is there a way to debug the monitoring job without waiting an hour every time we get an error?
For development, it might be easier to trigger execution of the monitoring job manually; there is Python code for this in the workshop materials.
If you want to see how it's used, open the lab 5 notebook of the workshop and scroll almost to the end, to the cells right after the "Triggering execution manually" title.
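Separately, while debugging, a small boto3 sketch like the one below (the schedule name is a placeholder) can pull the failure reason and the underlying processing job of the latest execution, instead of digging through the Studio UI:
import boto3

sm = boto3.client("sagemaker")

# Latest execution of the monitoring schedule (schedule name is a placeholder).
summaries = sm.list_monitoring_executions(
    MonitoringScheduleName="my-data-quality-schedule",
    SortBy="ScheduledTime",
    SortOrder="Descending",
    MaxResults=1,
)["MonitoringExecutionSummaries"]

if summaries:
    latest = summaries[0]
    print("Status:", latest["MonitoringExecutionStatus"])
    print("Failure reason:", latest.get("FailureReason"))
    # Each execution runs as a processing job; describing it gives the container
    # image and job status, and its CloudWatch logs hold the Java stack trace.
    job_arn = latest.get("ProcessingJobArn")
    if job_arn:
        job = sm.describe_processing_job(ProcessingJobName=job_arn.split("/")[-1])
        print("Image:", job["AppSpecification"]["ImageUri"])
        print("Job status:", job["ProcessingJobStatus"])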

Configure logging retention policy for Apache airflow

I could not find in the Airflow docs how to set up the retention policy I need.
At the moment we keep all Airflow logs forever on our servers, which is not the best way to go.
I wish to create global logs configurations for all the different logs I have.
How and where do I configure:
Number of days to keep
Max file size
I ran into the same situation yesterday, the solution for me was to use a DAG that handles all the log cleanup and schedule it as any other DAG.
Check this repo; you will find a step-by-step guide on how to set it up. Basically, what you will achieve is to delete files located in airflow-home/log/ and airflow-home/log/scheduler based on a period defined in an Airflow Variable. The DAG dynamically creates one task for each directory targeted for deletion, based on your previous definition.
In my case, the only modification I made to the original DAG was to allow deletion only in the scheduler folder, by replacing the initial value of DIRECTORIES_TO_DELETE. All credit to the creators! It works very well out of the box and is easy to customize.
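If you only need a simple retention rule (e.g. delete anything older than N days), a minimal DAG of your own along these lines also works; the 30-day retention, the daily schedule, and the Airflow 2 import path are assumptions to adjust:
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator  # Airflow 2 import path

# Deletes log files older than 30 days, then removes empty directories.
with DAG(
    dag_id="airflow_log_cleanup",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    BashOperator(
        task_id="delete_old_logs",
        bash_command=(
            "find $AIRFLOW_HOME/logs -type f -mtime +30 -delete && "
            "find $AIRFLOW_HOME/logs -type d -empty -delete"
        ),
    )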

Persist Completed Pipeline in Luigi Visualiser

I'm starting to port a nightly data pipeline from a visual ETL tool to Luigi, and I really enjoy that there is a visualiser to see the status of jobs. However, I've noticed that a few minutes after the last job (named MasterEnd) completes, all of the nodes disappear from the graph except for MasterEnd. This is a little inconvenient, as I'd like to see that everything is complete for the day/past days.
Further, if in the visualiser I go directly to the last job's URL, it can't find any history that it ran: Couldn't find task MasterEnd(date=2015-09-17, base_url=http://aws.east.com/, log_dir=/home/ubuntu/logs/). I have verified that it ran successfully this morning.
One thing to note is that I have a cron that runs this pipeline every 15 minutes to check for a file on S3. If it exists, it runs, otherwise it stops. I'm not sure if that is causing the removal of tasks from the visualiser or not. I've noticed it generates a new PID every run, but I couldn't find a way to persist one PID/day in the docs.
So, my questions: Is it possible to persist the completed graph for the current day in the visualiser? And is there a way to see what has happened in the past?
Appreciate all the help
I'm not 100% positive if this is correct, but this is what I would try first. When you call luigi.run, pass it --scheduler-remove-delay. I'm guessing this is how long the scheduler waits before forgetting a task after all of its dependents have completed. If you look through luigi's source, the default is 600 seconds. For example:
luigi.run(["--workers", "8", "--scheduler-remove-delay", "86400"], main_task_cls=task_name)
If you configure the remove_delay setting in your luigi.cfg then it will keep the tasks around for longer.
[scheduler]
record_task_history = True
state_path = /x/s/hadoop/luigi/var/luigi-state.pickle
remove_delay = 86400
Note, there is a typo in the documentation ("remove-delay" instead of "remove_delay"), which is being fixed under https://github.com/spotify/luigi/issues/2133

Amazon Elastic MapReduce - SIGTERM

I have an EMR streaming job (Python) which normally works fine (e.g. 10 machines processing 200 inputs). However, when I run it against large data sets (12 machines processing a total of 6000 inputs, at about 20 seconds per input), after 2.5 hours of crunching I get the following error:
java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 143
at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:372)
at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:586)
at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:135)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:57)
at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:36)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:441)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:377)
at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1132)
at org.apache.hadoop.mapred.Child.main(Child.java:249)
If I am reading this correctly, the subprocess failed with code 143 because someone sent a SIGTERM signal to the streaming job.
Is my understanding correct? If so: When would the EMR infrastructure send a SIGTERM?
I figured out what was happening, so here's some information if anyone else experiences similar problems.
The key to me was to look at the "jobtracker" logs. These live in your task's logs/ folder on S3, under:
<logs folder>/daemons/<id of node running jobtracker>/hadoop-hadoop-jobtracker-XXX.log.
There were multiple lines of the following kind:
2012-08-21 08:07:13,830 INFO org.apache.hadoop.mapred.TaskInProgress
(IPC Server handler 29 on 9001): Error from attempt_201208210612_0001_m_000015_0:
Task attempt_201208210612_0001_m_000015_0 failed to report status
for 601 seconds. Killing!
So my code was timing out and being killed (it was exceeding the 10-minute task timeout). For 10 minutes I wasn't doing any I/O, which was certainly not expected (I would typically do an I/O every 20 seconds).
I then discovered this article:
http://devblog.factual.com/practical-hadoop-streaming-dealing-with-brittle-code
"In one of our science projects, we have a few Hadoop Streaming jobs that run over ruby and rely on libxml to parse documents. This creates a perfect storm of badness – the web is full of really bad html and libxml occasionally goes into infinite loops or outright segfaults. On some documents, it always segfaults."
It nailed it. I must be experiencing one of these "libxml going into infinite loop" situations (I am using libxml heavily -- only with Python, not Ruby).
The final step for me was to trigger skip mode (instructions here: Setting hadoop parameters with boto?).
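For illustration, the relevant knobs can be passed as Hadoop -D options on the streaming step; below is a rough sketch using the old boto EMR API (bucket paths, region, and the exact values are placeholders, and the -D options need to end up before the streaming options):
import boto.emr
from boto.emr.step import StreamingStep

conn = boto.emr.connect_to_region("us-east-1")

# Raise the task timeout and let Hadoop skip records that repeatedly crash
# the mapper (classic MRv1 property names; values are placeholders).
step = StreamingStep(
    name="streaming step with skip mode",
    mapper="s3://my-bucket/mapper.py",
    reducer="s3://my-bucket/reducer.py",
    input="s3://my-bucket/input/",
    output="s3://my-bucket/output/",
    step_args=[
        "-D", "mapred.task.timeout=1800000",              # 30 minutes instead of 10
        "-D", "mapred.map.max.attempts=10",               # allow more retries per task
        "-D", "mapred.skip.attempts.to.start.skipping=2",
        "-D", "mapred.skip.map.max.skip.records=1",       # turns on record skipping
    ],
)

conn.run_jobflow(name="streaming job with skip mode", steps=[step], num_instances=12)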
I ran into this output from Amazon EMR ("subprocess failed with code 143"). My streaming job was using PHP curl to send data to a server that didn't have the MapReduce job servers in its security group. Therefore the reducer was timing out and being killed. Ideally I'd like to add my jobs to the same security group, but I opted to simply add a URL security token param in front of my API.

Weblogic domain and cluster creation with WLST

I want to create a cluster with 2 managed servers on 2 different physical machines.
I have the following tasks to be performed (please correct me if I missed something):
Domain creation
Set admin server properties and create the AdminServer under SSL
Create logical machines for the physical ones
Create managed servers
Create a cluster with the managed servers
I have the following questions:
Which of the above-mentioned tasks, if any, can be done offline?
Which of the above-mentioned tasks must also be performed on the 2nd physical machine?
I eventually found the answer. I am posting here for reference.
Out of the 5 mentioned tasks, all can be performed with an offline WLST script. All of them have to be performed on the node where the AdminServer is supposed to live.
Now, for updating the domain information on the second node, there is an nmEnroll command in WLST which has to be performed online.
So, to summarize,
Execute an offline WLST script to perform all the 5 tasks mentioned in the question. This has to be done on the node (physical computer) where we want our AdminServer to run.
Start the NodeManager on all the nodes to be used in the cluster.
Start the AdminServer on the node where we executed the domain creation script.
On all the other nodes, execute a script that looks like the following:
connect('user','password','t3://adminhost:adminport')
nmEnroll('path_to_the_domain_dir')
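For reference, a rough offline WLST sketch of step 1 could look like the following; the template path, hostnames, ports, and names are placeholders, and attribute paths can differ between WebLogic versions:
# Rough offline WLST sketch: domain, AdminServer with SSL, machines,
# managed servers and a cluster. All names, hosts and ports are placeholders.
readTemplate('/u01/oracle/wlserver/common/templates/wls/wls.jar')

cd('Servers/AdminServer')
set('ListenAddress', 'adminhost')
set('ListenPort', 7001)
create('AdminServer', 'SSL')
cd('SSL/AdminServer')
set('Enabled', 'True')
set('ListenPort', 7002)

cd('/')
create('machine1', 'UnixMachine')
create('machine2', 'UnixMachine')

create('ms1', 'Server')
cd('Servers/ms1')
set('ListenAddress', 'host1')
set('ListenPort', 8001)
cd('/')
create('ms2', 'Server')
cd('Servers/ms2')
set('ListenAddress', 'host2')
set('ListenPort', 8001)

cd('/')
create('my_cluster', 'Cluster')
assign('Server', 'ms1,ms2', 'Cluster', 'my_cluster')
assign('Server', 'ms1', 'Machine', 'machine1')
assign('Server', 'ms2', 'Machine', 'machine2')

setOption('OverwriteDomain', 'true')
writeDomain('/home/oracle/config/domains/my_domain')
closeTemplate()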
There are two steps missing after step 1: you need to copy the configuration from the machine where the AdminServer is running to the other machines in the cluster, using the pack command included in the WebLogic installation:
1.1 On the machine where the AdminServer is running, run ./pack.sh -domain=/home/oracle/config/domains/my_domain -template=/home/oracle/my_domain.jar -template_name=remote_managed -managed=true
1.2 Go to the other machines, copy the jar file produced in the previous step, and run ./unpack.sh -domain=/home/oracle/config/domains/my_domain -template=/home/oracle/my_domain.jar
Now you have copied all the files you need to start the NodeManager and the Managed Servers on the other machines.
