Python: Increasing timeout value in EMR using Yelp's mrjob

I am using Yelp's mrjob to write some MapReduce programs, which I run on EMR. My reducer takes a long time to execute, and because of the default task timeout on EMR I am getting this error:
Task attempt_201301171501_0001_r_000000_0 failed to report status for 600 seconds.Killing!
I want a way to increase the timeout on EMR. I read the official mrjob documentation on this, but I was not able to understand the procedure. Can someone suggest a way to solve this issue?

I've dealt with a similar issue on EMR in the past. The property you are looking for is mapred.task.timeout, which is the number of milliseconds before a task is terminated if it neither reads input, writes output, nor updates its status string. The 600 seconds in your error message corresponds to a value of 600000.
With mrjob, you could add the following option (1800000 ms is 30 minutes):
--jobconf mapred.task.timeout=1800000
EDIT: It appears that some EMR AMIs do not support setting parameters such as the timeout with jobconf at run time. Instead, you must use bootstrap-time configuration like this:
--bootstrap-action="s3://elasticmapreduce/bootstrap-actions/configure-hadoop -m mapred.task.timeout=1800000"
I would still try the first option to start with and see if you can get it to work; otherwise try the bootstrap action.
To use any of these parameters, create your job by extending MRJob. This class has a jobconf method that reads your --jobconf parameters, so you can specify them as regular options on the command line:
python job.py --num-ec2-instances 42 --python-archive t.tar.gz -r emr --jobconf mapred.task.timeout=1800000 /path/to/input.txt
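Since the timeout only fires when a task neither reads, writes, nor updates its status, another option is to have the long-running reducer report status while it works. The sketch below assumes the job runs under Hadoop Streaming, where a line of the form reporter:status:<message> written to stderr counts as a status update. StatusReporter and slow_sum are hypothetical names for illustration, not part of mrjob; mrjob itself documents set_status and increment_counter helpers on MRJob that serve the same purpose.

```python
import sys
import time


class StatusReporter:
    """Emit Hadoop Streaming status lines at most once per `interval` seconds."""

    def __init__(self, interval=60.0, stream=None, clock=time.monotonic):
        self.interval = interval
        self.stream = stream if stream is not None else sys.stderr
        self.clock = clock
        self._last = None  # time of the last status line, None before the first

    def tick(self, message):
        """Write a status line if enough time has passed; return True if written."""
        now = self.clock()
        if self._last is not None and now - self._last < self.interval:
            return False
        # Hadoop Streaming treats stderr lines of this form as status updates.
        self.stream.write("reporter:status:%s\n" % message)
        self.stream.flush()
        self._last = now
        return True


def slow_sum(values, reporter):
    """Stand-in for a long-running reducer body (hypothetical workload)."""
    total = 0
    for i, value in enumerate(values):
        total += value
        reporter.tick("processed %d values" % (i + 1))
    return total
```

Calling reporter.tick(...) inside the reducer's inner loop keeps the task alive without raising the timeout for the whole cluster, as long as the loop itself keeps making progress.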

Related

AWS Python lambda times out forever after first timeout

Problem:
I have a Python Lambda that constantly receives data every second and puts it into DynamoDB. I noticed that after the first time DynamoDB takes a little longer and the function times out, all the following calls also time out and it never recovers.
The way to bring the lambda back to normal is to redeploy it.
When it starts timing out, it does not display any logs. It times out without executing any of the code.
Below is a picture of our console that represents the issue.
In order to reproduce the issue faster with this function I did the following:
Redeploy it and see it is working fine.
Reduce the memory available to the lambda to the minimum and timeout to 1 second. This will cause the first timeout
Increase back the memory of the lambda to normal and even increase the timeout. However, the timeouts persist
Is there a way to resolve this issue without having to redeploy?
I have seen the same description of issue but with nodejs in this post: https://forums.aws.amazon.com/thread.jspa?threadID=234417.
I haven't seen any report of this for the Python Lambda environment.
More information about the setup:
Lambda environments tested: python3.6 and python3.7
Tool to deploy lambda: serverless 1.57.0
serverless plugins used: serverless-python-requirements, serverless-wsgi
I am not using any VPC for the lambda
Thank you for the help,

Figured out the trigger for the bug.
When the lambda function zip uploaded is too large, after the first time it times out, it never recovers!
My solution was to carefully strip out the unnecessary dependencies to make the package smaller.
I created a repository using a docker container for people to reproduce the issue more easily:
https://github.com/pedrohbtp/bug-aws-lambda-infinite-timeout
Thanks for the messages in the comments. I appreciate whoever takes time to try to help here in SO.
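To decide which dependencies to strip, it can help to list the largest files inside the deployment zip. This is a minimal sketch using only the standard zipfile module; the function names are hypothetical, and it makes no claim about the size at which the problem starts.

```python
import zipfile


def largest_entries(zip_file, top=10):
    """Return (uncompressed_bytes, name) pairs for the biggest files in a zip."""
    with zipfile.ZipFile(zip_file) as archive:
        sizes = [(info.file_size, info.filename) for info in archive.infolist()]
    return sorted(sizes, reverse=True)[:top]


def total_uncompressed_mb(zip_file):
    """Return the total uncompressed size of the archive in megabytes."""
    with zipfile.ZipFile(zip_file) as archive:
        total = sum(info.file_size for info in archive.infolist())
    return total / (1024 * 1024)
```

For example, largest_entries("package.zip") on the zip that the deployment tool produces (the file name here is a placeholder) shows which packages dominate the upload.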

What is the best way to run python scripts in AWS?

I have three python scripts, 1.py, 2.py, and 3.py, each having 3 runtime arguments to be passed.
All three Python programs are independent of each other. All three may run sequentially in a batch, or only two of them may run, depending on some configuration.
Manual approach:
Create EC2 instance, run python script, shut it down.
Repeat the above step for the next python script.
The automated way would be to trigger the above process through Lambda and replicate it using some combination of services.
What is the best way to implement this in AWS?

AWS Batch has a DAG scheduler, so technically you could define job1, job2, job3 and tell AWS Batch to run them in that order. But I wouldn't recommend that route.
For the above to work you would basically need to create 3 Docker images (image1, image2, image3) and then put these in ECR (Docker Hub can also work if not using the Fargate launch type).
I don't think that makes sense unless each job is bulky and has its own runtime that's different from the others.
Instead I would write a Python program that calls 1.py, 2.py and 3.py, put that in a Docker image and run an AWS Batch job or just an ECS Fargate task.
main.py:
import subprocess
exit_code = subprocess.call("python3 /path/to/1.py", shell=True)
# decide here whether you want to call 2.py and so on ...
# 1.py will see the same stdout and stderr as main.py
# with Batch and Fargate you can retrieve these from CloudWatch Logs ...
Now you have a Docker image that just needs to run somewhere. Fargate is fast to start up, a bit pricey, and has a 10GB maximum on temporary storage. AWS Batch is slow on a cold start, but can use spot instances in your account. You might need to make a custom AMI for AWS Batch to work, e.g. if you want more storage.
Note: for anyone who wants to scream at shell=True, both main.py and 1.py came from the same codebase. It's a batch job, not an internet facing API that took that from user request.
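A slightly fuller sketch of such a main.py, assuming each script takes its three arguments on the command line and signals failure with a non-zero exit code. It passes argument lists instead of using shell=True, which is a swapped-in choice rather than what the answer above does; run_in_sequence and the paths are hypothetical.

```python
import subprocess
import sys


def run_in_sequence(commands):
    """Run each command in order; stop at the first non-zero exit code."""
    for command in commands:
        # A list of arguments bypasses the shell, so no quoting is needed.
        exit_code = subprocess.call(command)
        if exit_code != 0:
            return exit_code
    return 0


# Hypothetical usage: pick the scripts to run from configuration.
# commands = [
#     [sys.executable, "/path/to/1.py", "a", "b", "c"],
#     [sys.executable, "/path/to/3.py", "x", "y", "z"],
# ]
# sys.exit(run_in_sequence(commands))
```

Returning the failing exit code from main.py lets Batch or ECS mark the job as failed.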

You can start your EC2 instance from a Python script, using the AWS boto3 library (https://aws.amazon.com/sdk-for-python/). So, a possible solution would be to trigger a Lambda function periodically (you can use Amazon CloudWatch for periodic events), and inside that function boot up your EC2 instance from Python.
In your instance you can configure the OS to run a Python script every time it boots up. I would suggest using crontab (see this link: https://www.instructables.com/id/Raspberry-Pi-Launch-Python-script-on-startup/).
At the end of your script, you can send an Amazon SQS message to a function that shuts down your first instance and then calls another function that starts the second script.

You could use meadowrun - disclaimer I am one of the maintainers so obviously biased.
Meadowrun is a python library/tool that manages EC2 instances for you, moves python code + environment dependencies to them, and runs a function without any hassle.
For example, you could put your scripts in a Git repo and run them like so:
import asyncio
from meadowrun import AllocCloudInstance, Deployment, run_function
from script_1 import run_1

async def main():
    results = await run_function(
        # the function to run on the EC2 instance
        lambda: run_1(arguments),
        # properties of the VM that runs the function
        AllocCloudInstance(
            logical_cpu_required=2,
            memory_gb_required=16,
            interruption_probability_threshold=15,
            cloud_provider="EC2"),
        # code+env to deploy on the VM, there's other options here
        Deployment.git_repo(
            "https://github.com/someuser/somerepo",
            conda_yml_file="env.yml",
        )
    )
It will then create an EC2 instance with the given requirements for you (or reuse one if it's already there, which could be useful for running your scripts in sequence), create the Python code and environment there, run the function, and return any results and output.

For 2022, depending on your infrastructure constraints, I'd say the easiest way would be to set the scripts up on Lambda and then call them from CloudWatch with the required parameters (create a rule):
https://docs.aws.amazon.com/AmazonCloudWatch/latest/events/RunLambdaSchedule.html
That way you can configure them to run independently or sequentially without having to worry about setting up the infrastructure and turning it on and off.
This applies to scripts that are not too resource intensive and that don't run for more than 15 minutes at a time (the Lambda time limit).
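A minimal sketch of what such a Lambda handler could look like, assuming the rule passes a constant JSON input that names the scripts to run and their arguments. The event shape, SCRIPTS, and the run_1/run_2/run_3 functions are all hypothetical stand-ins for the bodies of 1.py, 2.py and 3.py.

```python
def run_1(a, b, c):
    # Placeholder for the work done by 1.py
    return "1:%s,%s,%s" % (a, b, c)


def run_2(a, b, c):
    # Placeholder for the work done by 2.py
    return "2:%s,%s,%s" % (a, b, c)


def run_3(a, b, c):
    # Placeholder for the work done by 3.py
    return "3:%s,%s,%s" % (a, b, c)


SCRIPTS = {"1": run_1, "2": run_2, "3": run_3}


def handler(event, context):
    """Run the scripts listed in event["scripts"], in order.

    Assumed event shape:
    {"scripts": ["1", "3"], "args": {"1": [..3 values..], "3": [..3 values..]}}
    """
    results = {}
    for name in event.get("scripts", []):
        results[name] = SCRIPTS[name](*event["args"][name])
    return results
```

With this layout, running only two of the three scripts is a matter of changing the rule's input, not the code.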

How to toggle process state via command line?

We are trying to create a simple command line utility that tracks some metrics over an unknown period of time (start/stop triggered via command line externally).
For example,
python metrics-tool.py --start-collect
... run some additional commands external to metrics-tool ...
python metrics-tool.py --stop-collect
Does anyone have any ideas or suggestions on how an application can receive a command for a "second time"? Is this even possible, or a good way to do this?
It almost sounds like this should be a service, configurable at runtime by an endpoint?

This does sound more like a service which can be started and stopped, via Systemd (or Supervisor) for example.
Using Systemd means there's no need to daemonise your Python process yourself: https://stackoverflow.com/a/30189540/736221
You can of course do that if you want to, if you're using something other than Systemd: https://pagure.io/python-daemon
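If the metrics can be derived from a snapshot taken at start and another at stop (elapsed time, for example), a lighter alternative to a service is to persist state in a file between the two invocations. This is a different technique from the Systemd approach above, and it does not cover continuous sampling. The sketch below assumes a hypothetical state file name and measures only elapsed seconds.

```python
import argparse
import json
import os
import time


def main(argv, state_path="metrics-state.json", clock=time.time):
    """Handle --start-collect / --stop-collect using a state file.

    Returns None on start and the elapsed seconds on stop.
    """
    parser = argparse.ArgumentParser()
    group = parser.add_mutually_exclusive_group(required=True)
    group.add_argument("--start-collect", action="store_true")
    group.add_argument("--stop-collect", action="store_true")
    args = parser.parse_args(argv)

    if args.start_collect:
        # First invocation: remember when collection began.
        with open(state_path, "w") as handle:
            json.dump({"started_at": clock()}, handle)
        return None

    # Second invocation: read the saved state and compute the metric.
    with open(state_path) as handle:
        started_at = json.load(handle)["started_at"]
    os.remove(state_path)
    return clock() - started_at
```

In a script this would be called as main(sys.argv[1:]); the second run "receives the command" simply by finding the file the first run left behind.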

Can Jenkins handle a GUI/non-GUI interactive Python or Java program?

I want to create a build pipeline, and developers need to set up a few things into a properties file which gets populated using a front end GUI.
I tried running a sample interactive CLI script in Python that just asks for a name and prints it out afterwards, but Jenkins waited for ages and then hung. I can see that it asked for the input, but there was no way for the user to enter the data.
EDIT: I am currently running Jenkins as a service. Is there a good plugin anyone recommends, or is the problem the way I wrote the Python script?
Preference:
I would prefer to use Python because it is lightweight, but if people have had success with other languages I can compromise.
Using a GUI menu to populate the data, would be cool because I can use option boxes, drop down menus and make it fancy but it isn't a necessity, a CLI is considerably better than our current deployment.
BTW, running all this on Windows 7 laptop running Python 2.7 and Java 1.7
Sorry for the essay! Hopefully people can help me!

Sorry, but Jenkins is not an interactive application. It is designed for automated execution.
The only viable way to get input to a Jenkins job (and everything that is executed from that job) is with the job parameters that are populated before the job is started. Granted, Jenkins GUI for parameter entry is not the greatest, but it does the job. Once the Jenkins job collected the job parameters at the start of the job, it can pass those parameters to anything it executes (Python, shell, whatever) at any time during the job. Two things have to be true for that to happen:
You need to collect all the input data before the job starts
Whatever your job calls (Python, shell, etc.) needs to be able to receive its input not interactively, but through the command line.
How to get input into program
A well designed script should be able to simply accept parameters on the command line:
./goodscript.sh MyName will be the simplest way of doing it, where value MyName will be stored in $1 first parameter of the script. Subsequent command line parameters will be available in variables $2, $3 and so on.
./goodscript.sh -name MyName -age 30 will be a better way of doing it, where the script can take multiple parameters regardless of their order by specifying a parameter name before parameter value. You can read about using getopt for this method of parameter passing
Both examples above assume that the goodscript.sh is written well enough to be able to process those command line parameters. If the script does not explicitly process command line parameters, doing the above will be useless.
You can "pipe" some output to an interactive script that is not designed to handle command line parameters explicitly:
echo MyName | ./interactivescript.sh will pass value MyName to the first interactive prompt that interactivescript.sh provides to the user. Problem with this is that you can only pass a value to the first interactive prompt.
Jenkins job parameters GUI
Like I said above, you can use the Jenkins GUI to gather all sorts of job parameters (dropdown lists, checkboxes, text entry). I assume you know how to set up a Jenkins job with parameters. If not, in the job configuration click the "This build is parameterized" checkbox. If you can't figure out how to set this up, that's a different question and will need to be explained separately.
However, once your Jenkins job collected all the parameters up front, you can reference them in your "execute shell" step. If you are using Windows, you will reference them as %PARAM_NAME%, and for Linux as $PARAM_NAME.
Explain what you need help with: getting your script to accept command line parameters, or passing those command line parameters from jenkins job GUI, and I will expand this answer further
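Since the question mentions Python, here is a minimal sketch of a non-interactive script in the spirit of goodscript.sh above. It takes named parameters on the command line and falls back to environment variables, which is how job parameters such as %PARAM_NAME% reach the build step. The parameter names NAME and AGE are assumptions for illustration.

```python
import argparse
import os


def parse_args(argv, environ=os.environ):
    """Read --name/--age from argv, falling back to NAME/AGE in the environment."""
    parser = argparse.ArgumentParser(description="Non-interactive greeting")
    parser.add_argument("--name", default=environ.get("NAME"))
    # argparse applies `type` to string defaults, so "30" becomes 30
    parser.add_argument("--age", type=int, default=environ.get("AGE"))
    args = parser.parse_args(argv)
    if args.name is None:
        parser.error("--name (or the NAME job parameter) is required")
    return args


def greet(args):
    """Build the output line without ever prompting the user."""
    if args.age is None:
        return "Hello, %s" % args.name
    return "Hello, %s (%d)" % (args.name, args.age)
```

In a Windows build step this could be invoked as python goodscript.py --name %NAME% --age %AGE%, where goodscript.py is a hypothetical file name; the script itself would end with print(greet(parse_args(sys.argv[1:]))).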

Automate Python Script

I'm running a Python script manually that fetches data in JSON format. How do I automate this script to run on an hourly basis?
I'm working on Windows 7. Can I use tools like Task Scheduler? If I can use it, what do I need to put in the batch file?

Can I use tools like Task scheduler?
Yes. Any tool that can run arbitrary programs can run your Python script. Pick the one you like best.
If I can use it, what do I need to put in the batch file?
What batch file? Task Scheduler takes anything that can be run, with arguments—a C program, a .NET program, even a document with a default app associated with it. So, there's no reason you need a batch file. Use C:\Python33\python.exe (or whatever the appropriate path is) as your executable, and your script's path (and its arguments, if any) as the arguments. Just as you do when running the script from the command line.
See Using the Task Scheduler in MSDN for some simple examples, and Task Scheduler Schema Elements or Task Scheduler Scripting Objects for reference (depending on whether you want to create the schedule in XML, or via the scripting interface).
You want to create an ExecAction with Path set to "C:\Python33\python.exe" and Arguments set to "C:\MyStuff\myscript.py", and a RepetitionPattern with Interval set to "PT1H". You should be able to figure out the rest from there.
As sr2222 points out in the comments, often you end up scheduling tasks frequently, and needing to programmatically control their scheduling. If you need this, you can control Task Scheduler's scripting interface from Python, or build something on top of Task Scheduler, or use a different tool that's a bit easier to get at from Python and has more helpful examples online, etc.—but when you get to that point, take a step back and look at whether you're over-using OS task scheduling. (If you start adding delays or tweaking times to make sure the daily foo1.py job never runs until 5 minutes after the most recent hourly foo0.py has finished its job, you're over-using OS task scheduling—but it's not always that obvious.)
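As a third route besides the XML schema and the scripting interface, the same hourly task can be registered with the schtasks command-line tool. The sketch below only builds the argument list so it can be inspected before running; hourly_task_command and the task name are hypothetical, and paths containing spaces would need extra quoting inside the /TR value.

```python
def hourly_task_command(task_name, python_exe, script_path):
    """Build a schtasks invocation that runs a script once every hour."""
    # /TR takes the whole command line to run as a single string.
    run = "%s %s" % (python_exe, script_path)
    return [
        "schtasks", "/Create",
        "/SC", "HOURLY",   # schedule type
        "/MO", "1",        # modifier: every 1 hour
        "/TN", task_name,  # task name shown in Task Scheduler
        "/TR", run,
    ]


command = hourly_task_command(
    "FetchJson",  # hypothetical task name
    r"C:\Python33\python.exe",
    r"C:\MyStuff\myscript.py",
)
# On Windows, subprocess.check_call(command) would register the task.
```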

May I suggest WinAutomation or AutoMate. These two do the exact same thing, except the UI is a little different. I prefer WinAutomation, because the scripts are a little easier to build.

Yes, you can use the Task Scheduler to run the script on an hourly basis.
To execute a python script via a Batch File, use the following code:
start path_to_python_exe path_to_python_file
Example:
start C:\Users\harshgoyal\AppData\Local\Continuum\Anaconda3\python.exe %UserProfile%\Documents\test_script.py
If python is on the Windows PATH environment variable, you can shorten this to:
start python %UserProfile%\Documents\test_script.py
What I generally do is run the batch file once via Task Scheduler and within the python script I call a thread/timer every hour.
class threading.Timer(interval, function, args=None, kwargs=None)
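threading.Timer fires only once, so an hourly job has to re-arm the timer after each run. A minimal sketch of that pattern follows; RepeatingJob is a hypothetical helper, and the interval is measured from the end of one run to the start of the next.

```python
import threading


class RepeatingJob:
    """Call `function` every `interval` seconds using threading.Timer."""

    def __init__(self, interval, function):
        self.interval = interval
        self.function = function
        self._timer = None
        self._running = False
        self._lock = threading.Lock()

    def _schedule(self):
        # Arm a fresh one-shot timer unless stop() has been called.
        with self._lock:
            if self._running:
                self._timer = threading.Timer(self.interval, self._tick)
                self._timer.daemon = True
                self._timer.start()

    def _tick(self):
        self.function()
        self._schedule()

    def start(self):
        with self._lock:
            self._running = True
        self._schedule()

    def stop(self):
        with self._lock:
            self._running = False
            if self._timer is not None:
                self._timer.cancel()
```

For the hourly fetch this would be RepeatingJob(3600, fetch_data).start(), where fetch_data is a placeholder for the function that fetches the JSON, with the main thread kept alive for as long as the job should run.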
