Azure Batch JobPreparationTask fails with "UserError" - python

I am trying to mount a file share (not blob storage) during the JobPreparationTask. My node OS is Ubuntu 16.04.
To do this, I am doing the following:
job_user = batchmodels.AutoUserSpecification(
    scope=batchmodels.AutoUserScope.pool,
    elevation_level=batchmodels.ElevationLevel.admin)
start_task = batch.models.JobPreparationTask(
    command_line=start_commands,
    user_identity=batchmodels.UserIdentity(auto_user=job_user))
end_task = batch.models.JobReleaseTask(
    command_line=end_commands,
    user_identity=batchmodels.UserIdentity(auto_user=job_user))
job = batch.models.JobAddParameter(
    job_id,
    batch.models.PoolInformation(pool_id=pool_id),
    job_preparation_task=start_task,
    job_release_task=end_task)
My start_commands and end_commands are fine, but there is something wrong with the user permissions...
I get no output in the stderr.txt or in the stdout.txt file.
I do not see any logs whatsoever (where are they?). All I am able to find is a message showing this:
Exit code: 1
Retry count: 0
Failure info
    Category: UserError
    Code: FailureExitCode
    Message: The task exited with an exit code representing a failure
Details
    Message: The task exited with an exit code representing a failure
Very detailed error message!
Anyway, I have also tried changing AutoUserScope.pool to AutoUserScope.task, but there is no change.
Anyone have any ideas?
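(For reference, mounting an Azure File share on an Ubuntu node from an elevated JobPreparationTask usually looks something like the sketch below; the storage account, key and share names are placeholders, not the actual commands from the post.)
# Illustrative only: typical cifs mount of an Azure File share on Ubuntu 16.04.
# <account>, <key> and myshare are placeholders for your own storage details.
start_commands = (
    "/bin/bash -c 'apt-get install -y cifs-utils && "
    "mkdir -p /mnt/myshare && "
    "mount -t cifs //<account>.file.core.windows.net/myshare /mnt/myshare "
    "-o vers=3.0,username=<account>,password=<key>,dir_mode=0777,file_mode=0777'"
)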

I had this issue, which was frustrating me because I couldn't get any logs from my application.
What I ended up doing was RDP'ing into the node my job ran on, going to %AZ_BATCH_TASK_WORKING_DIR% (as specified in the Azure Batch compute environment variables) and then checking the stdout.txt and stderr.txt of my job.
The error was that I had formulated my CloudTask's command line incorrectly, so it could not launch my application in the first place.
To RDP into your machine, in Azure Portal:
Batch Account
Pools (select your pool)
Nodes
Select the node that ran your job
Select "Connect" link at the top.

Related

Windows Task Scheduler unable to run Python Script, error value: 2147944320

I have created a new task with one action: Start a program. I followed the instructions from https://www.jcchouinard.com/python-automation-using-task-scheduler/
My Python script looks like this:
if __name__ == '__main__':
    num = 1
Under general properties, I have 'run whether user is logged on or not' checked, 'run with highest privileges' checked, and myself as the user running the task. I am an administrator.
Action Parameters...
Program/script: C:\Users\myuser\AppData\Local\Microsoft\WindowsApps\python.exe (pasted from command where python)
Add arguments: py_test.py
Start in: C:\Users\myuser\Desktop
I tested from the command line that I can run this command successfully:
C:\Users\myuser\AppData\Local\Microsoft\WindowsApps\python.exe C:\Users\myuser\Desktop\py_test.py
When I click 'Run' from the Task Scheduler library, I get the error 'The file cannot be accessed by the system. (0x80070780)'.
When I go into the history for the task, I see this error:
Task Scheduler failed to launch action "C:\Users\myuser\AppData\Local\Microsoft\WindowsApps\python.exe" in instance "{6204cea7-bedc-40f9-bc10-ac95b9e02460}" of task "\TestPythonJob". Additional Data: Error Value: 2147944320.
I confirmed under the executable file's properties that I and SYSTEM have access to it. I tried researching this error value but could not find anything. What could be the issue?
Maybe try adding Python to your PATH environment variable. Then make a .bat to run it. Here's how to use a .bat with Task Scheduler: https://www.python.org/ftp/python/3.9.5/python-3.9.5-amd64.exe.
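A rough sketch of such a wrapper, saved next to the script and pointed to from the task's Program/script field (paths taken from the question; adjust to your setup):
@echo off
rem Rough sketch of a Task Scheduler wrapper; assumes python is on PATH.
cd /d C:\Users\myuser\Desktop
python py_test.py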

Heroku how to see logs of clock process

I recently implemented a clock process in my Heroku app (Python) to scrape data into a database for me every X hours. In the script, I have a line that is supposed to send me an email when the scraping begins and again when it ends. Today was the first day that it was supposed to run at 8AM UTC, and it seems to have run perfectly fine, as the data on my site has been updated.
However, I didn't receive any emails from the scraper, so I was trying to find the logs for that specific dyno to see if they hinted at why the email wasn't sent. However, I am unable to see anything that even shows the process ran this morning.
With the command below, all I see is that the dyno process is up as of my last Heroku deploy. But there is nothing that seems to suggest it ran successfully today... even though I know it did.
heroku logs --tail --dyno clock
yields the following output, which corresponds to the last time I deployed my app to Heroku.
2021-04-10T19:25:54.411972+00:00 heroku[clock.1]: State changed from up to starting
2021-04-10T19:25:55.283661+00:00 heroku[clock.1]: Stopping all processes with SIGTERM
2021-04-10T19:25:55.402083+00:00 heroku[clock.1]: Process exited with status 143
2021-04-10T19:26:07.132470+00:00 heroku[clock.1]: Starting process with command `python clock.py --log-file -`
2021-04-10T19:26:07.859629+00:00 heroku[clock.1]: State changed from starting to up
My question is: is there any command or place to check on Heroku to see any output from my logs? For example, any exceptions that were thrown? If I had any print statements in my clock process, where would those be printed to?
Thanks!
Although this is not the full answer, the Ruby gem 'ruby-clock' gives us an insight from its developer:
Because STDOUT does not flush until a certain amount of data has gone
into it, you might not immediately see the ruby-clock startup message
or job output if viewing logs in a deployed environment such as Heroku
where the logs are redirected to another process or file. To change
this behavior and have logs flush immediately, add $stdout.sync = true
to the top of your Clockfile.
So I'm guessing that it has something to do with flushing STDOUT when logging, although I am not sure how to do that in Python.
I did a quick search and found this Stack Overflow post.
Namely:
In Python 3, print can take an optional flush argument:
print("Hello, World!", flush=True)
In Python 2, after calling print, do:
import sys
sys.stdout.flush()
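Putting that together, a minimal clock.py sketch (assuming APScheduler, which Heroku's clock-process docs use; the scrape/email call is a hypothetical placeholder) where every print flushes immediately so it shows up in heroku logs:
# clock.py -- minimal sketch; the key point is flush=True (or running python -u /
# setting PYTHONUNBUFFERED=1) so output reaches the Heroku log router right away.
from apscheduler.schedulers.blocking import BlockingScheduler

sched = BlockingScheduler()

@sched.scheduled_job('cron', hour=8)  # 8AM UTC, as in the question
def scheduled_scrape():
    print('Scrape starting', flush=True)
    # run_scrape_and_send_emails()  # hypothetical placeholder for the real work
    print('Scrape finished', flush=True)

print('Clock process booted', flush=True)
sched.start()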

Live ECS logging into Cloudwatch

I am using an ECS task which runs a Docker container to execute some terraform commands.
I would like to log the results of the terraform commands into CloudWatch, live if possible. I am using the logging package of Python 3.
The function I use to output the result of the command is the following:
def execute_command(command):
    """
    This method is used to execute the several commands
    :param command: The command to be executed
    :return decoded: The result of the command execution
    """
    logging.info('Executing: {}'.format(command))
    process = subprocess.Popen(command, stdout=subprocess.PIPE, stderr=subprocess.PIPE, shell=True)
    communicate = process.communicate()
    decoded = (communicate[0].decode('utf-8'), communicate[1].decode('utf-8'))
    for stdout in decoded[0].split('\n'):
        if stdout != '':
            logging.info(stdout)
    for stderr in decoded[1].split('\n'):
        if stderr != '':
            logging.warning(stderr)
    return decoded
It is called the following way:
apply_command = 'terraform apply -input=false -auto-approve -no-color {}'.format(plan_path)
terraform_apply_output = utils.execute_command(apply_command)
if terraform_apply_output[1] != '':
    logging.info('Apply has failed. See above logs')
    aws_utils.remove_message_from_queue(metadata['receipt_handle'])
    utils.exit_pipeline(1)
When the terraform command succeeds, I can see its output after the command has been executed (i.e. the result of the apply command after the resources have been applied), which is what the code is expected to do.
When the terraform command fails (let's say because some resources were already deployed and not saved in a .tfstate), I cannot see the logging, and the ECS task quits without an error message.
I can see 2 reasons for it:
The result of the failed terraform command returns a non-zero code, which means the ECS task exits before outputting the logs into stdout (and so, into CloudWatch).
The result of the failed terraform command is sent to stderr, which is not correctly logged.
What is my error here, and how could I fix it? Any help greatly appreciated :)
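As a side note, communicate() only returns once the process has exited, so nothing is logged while terraform is still running. A line-by-line variant of execute_command (a sketch, not the code from the post) streams output to the logger, and therefore to CloudWatch, as it is produced:
import logging
import subprocess

def execute_command_streaming(command):
    # Sketch: merge stderr into stdout and log each line as soon as it appears.
    process = subprocess.Popen(command, stdout=subprocess.PIPE,
                               stderr=subprocess.STDOUT, shell=True)
    lines = []
    for raw in iter(process.stdout.readline, b''):
        line = raw.decode('utf-8').rstrip()
        if line:
            logging.info(line)
            lines.append(line)
    return_code = process.wait()
    return return_code, '\n'.join(lines)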
This question sounds suspiciously familiar to me. Anyway.
Adding a sleep(10) just before exiting the task will fix the issue.
From AWS support:
I’ve been investigating the issue further and I noticed an internal
ticket regarding CloudWatch logs sometimes being truncated for Fargate
tasks. The problem was reported as a known issue in the latest Fargate
platform version (1.3.0). [1] Looking at our internal tickets for the
same, as you mentioned in the case description, the current workaround
to avoid this situation is extending the lifetime of the existing
container by adding a delay (~>10 seconds) between the logging output
of the application and the exit of the process (exit of the
container). I can confirm that our service team are still working to
get a permanent resolution for this reported issue. Unfortunately,
there is no ETA shared for when the fix will be deployed. However,
I've taken this opportunity to add this case to the internal ticket to
inform the team of the similar and try to expedite the process. In
addition, I'd recommend keeping an eye on the ECS release notes for
updates to the Fargate platform version which address this behaviour:
-- https://aws.amazon.com/new/
-- https://docs.aws.amazon.com/AmazonECS/latest/developerguide/document_history.html
"

Ambari - Trouble Checking Status for Custom Application

I've been playing around with creating a custom application inside of our Ambari installation. After a little bit of toying, I've successfully got this configured to do the installation and startup actions with the appropriate log creation/output and pid creation. The piece that I'm struggling with now is having Ambari maintain the status of this newly installed application. After following some of the instructions here: http://mozartanalytics.com/how-to-create-a-software-stack-for-ambari/ (specifically the Component Status section), I've been able to make some progress -- however, it's not exactly what I want.
When including the following in master.py, Ambari will see the service as momentarily active after initial startup, but then the application will appear as red (offline). It marks it as offline even though, when I check the server, I see the appropriate process running.
def status(self, env):
    import params
    print 'Checking status of pid file'
    check = format("{params.pid}/Application.pid")
    check_process_status(check)
However, when I modify it to look like the following, Ambari has no problem tracking the status and monitors it appropriately
def status(self, env):
    import params
    print 'Checking status of pid file'
    dummy_master_pid_file = "/var/run/Application/Application.pid"
    check_process_status(dummy_master_pid_file)
Has anyone else run into this issue? Is there something that I'm missing with regard to creating this custom application inside of Ambari? Any help or a pointer in the right direction would be appreciated.
FYI, this is Ambari 2.1 running on CentOS 6.7.
Recently, I solved a similar problem. The solution is to put the string {"securityState": "UNKNOWN"} into the file /var/lib/ambari-agent/data/structured-out-status.json.
The way I found this solution was by watching the ambari-agent log: PythonExecutor.py:149 - {'msg': 'Unable to read structured output from /var/lib/ambari-agent/data/structured-out-status.json'}. Hope it helps.
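In code form, that workaround is just writing the expected JSON before the status check runs; a rough sketch (the path comes from the agent log message above):
import json

# Write the structured output ambari-agent looks for when it reports status.
with open('/var/lib/ambari-agent/data/structured-out-status.json', 'w') as f:
    json.dump({'securityState': 'UNKNOWN'}, f)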
Maybe this is your parameter problem:
def status(self, env):
    import params
    print 'Checking status of pid file'
    pid_path = params.pid
    check = format("{pid_path}/Application.pid")
    check_process_status(check)

Azure HDInsights Issue With Hive/Python Map-Reduce

Running a very simple test example using Azure HDInsight and Hive/Python. Hive does not appear to be loading the Python script.
Hive contains a small test table with a field called 'dob' that I'm trying to transform using a Python script via map-reduce.
The Python script is blank and located at asv:///mapper_test.py. I made the script blank because I wanted to first isolate the issue of Hive accessing the script.
Hive Code:
ADD FILE asv:///mapper_test.py;
SELECT
TRANSFORM (dob)
USING 'python asv:///mapper_test.py' AS (dob)
FROM test_table;
Error:
Hive history file=c:\apps\dist\hive-0.9.0\logs/hive_job_log_RD00155DD090CC$_201308202117_1738335083.txt
Logging initialized using configuration in file:/C:/apps/dist/hive-0.9.0/conf/hive-log4j.properties
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_201308201542_0025, Tracking URL = http://jobtrackerhost:50030/jobdetails.jsp?jobid=job_201308201542_0025
Kill Command = c:\apps\dist\hadoop-1.1.0-SNAPSHOT\bin\hadoop.cmd job -Dmapred.job.tracker=jobtrackerhost:9010 -kill job_201308201542_0025
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2013-08-20 21:18:04,911 Stage-1 map = 0%, reduce = 0%
2013-08-20 21:19:05,175 Stage-1 map = 0%, reduce = 0%
2013-08-20 21:19:32,292 Stage-1 map = 100%, reduce = 100%
Ended Job = job_201308201542_0025 with errors
Error during job, obtaining debugging information...
Examining task ID: task_201308201542_0025_m_000002 (and more) from job job_201308201542_0025
Exception in thread "Thread-24" java.lang.RuntimeException: Error while reading from task log url
at org.apache.hadoop.hive.ql.exec.errors.TaskLogProcessor.getStackTraces(TaskLogProcessor.java:242)
at org.apache.hadoop.hive.ql.exec.JobDebugger.showJobFailDebugInfo(JobDebugger.java:227)
at org.apache.hadoop.hive.ql.exec.JobDebugger.run(JobDebugger.java:92)
at java.lang.Thread.run(Thread.java:722)
Caused by: java.io.IOException: Server returned HTTP response code: 400 for URL: http://workernode1:50060/tasklog?taskid=attempt_201308201542_0025_m_000000_7&start=-8193
at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1616)
at java.net.URL.openStream(URL.java:1035)
at org.apache.hadoop.hive.ql.exec.errors.TaskLogProcessor.getStackTraces(TaskLogProcessor.java:193)
... 3 more
FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.MapRedTask
MapReduce Jobs Launched:
Job 0: Map: 1 HDFS Read: 0 HDFS Write: 0 FAIL
Total MapReduce CPU Time Spent: 0 msec
I had a similar experience while using Pig + Python in Azure HDInsight Version 2.0. One thing I found out was that Python was available only on the head node and not on all the nodes in the cluster. You can see a similar question here.
You can remote login to the head node of the cluster and, from the head node, find out the IP of a Task Tracker node and remote login to any of the task tracker nodes to check whether Python is installed on the node.
This issue is fixed in HDInsight Version 2.1 clusters, but Python is still not added to the PATH. You may need to do this yourself.
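For what it's worth, once Hive can reach the script, a TRANSFORM mapper is just a stdin-to-stdout filter over tab-separated rows; a minimal mapper_test.py sketch that passes dob through unchanged would be:
#!/usr/bin/env python
# Minimal Hive TRANSFORM mapper sketch: read tab-separated rows from stdin
# and write the dob column back to stdout, one row per line.
import sys

for line in sys.stdin:
    dob = line.strip().split('\t')[0]
    print(dob)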
