How to add a cron job/scheduler for Python scripts on AWS EC2? - python

I have a question regarding one of my React apps that I recently developed. It's basically a landing page with a React frontend and a Node+Express backend, and it scrapes data from various pages (the scrapers are written in Python).
Right now the React app itself is hosted on Heroku, and the scrapers run correctly, but their execution is not scheduled automatically.
The current setup is the following:
EC2 for the Python scrapers
AWS RDS MySQL for the database that the EC2 scrapers write their data to
I have created a separate file that executes all the other scrapers.
main.py
import time
import schedule
import os
from pathlib import Path

print('python script executed')

# check the current working directory, to add the right paths to the scrapers
path = os.getcwd()
print(path)

# exec(open("/home/ec2-user/testing_python/lhvscraper.py").read())

filenames = [
    # output table: fundsdata
    Path("/home/ec2-user/testing_python/lhvscraper.py"),
    Path("/home/ec2-user/testing_python/luminorscrapertest.py"),
    Path("/home/ec2-user/testing_python/sebscraper.py"),
    Path("/home/ec2-user/testing_python/swedscraper.py"),
    Path("/home/ec2-user/testing_python/tulevascraper.py"),
    # output table: feesdata
    Path("/home/ec2-user/testing_python/feesscraper.py"),
    # output table: yield_y1_data
    Path("/home/ec2-user/testing_python/yield_1y_scraper.py"),
    # output table: navdata
    # Path("/home/ec2-user/testing_python/navscraper.py"),
]

def main_scraper_scheduler():
    print("scheduler is working")
    for filename in filenames:
        print(filename)
        with open(filename) as infile:
            exec(infile.read())
        time.sleep(11)

schedule.every(10).seconds.do(main_scraper_scheduler)

while True:
    schedule.run_pending()
    time.sleep(1)
I have successfully established a connection between MySQL and EC2 and tested it over PuTTY -
which means that if I execute main.py, all of the scrapers work, insert new data into the MySQL database tables, and then repeat (see the code above). The only problem is that when I close PuTTY (kill the connection), main.py stops running.
So my question is: how do I set things up so that main.py runs on a schedule (let's say, once a day at 12 PM) without me executing it manually?
I understand this is about setting up a cron job or a scheduler (or something like that), but I haven't managed to set it up yet, so your help is very much needed.
Thanks in advance!

To avoid making crontab files overly long, Linux has canned entries that run things hourly, daily, weekly, or monthly. You don't have to modify any crontab to use this. ANY executable script that is located in /etc/cron.hourly will automatically be run once an hour. ANY executable script that is located in /etc/cron.daily will automatically be run once per day (usually at 6:30 AM), and so on. Just make sure to include a #! line for Python, and chmod +x to make it executable. Remember that it will run as root, and you can't necessarily predict which directory it will start in. Make no assumptions.
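For example, a minimal daily script could look like the sketch below (the scraper path is taken from the question; the file name is made up). Save it as /etc/cron.daily/run-scrapers and make it executable with chmod +x:
#!/usr/bin/env python3
# /etc/cron.daily/run-scrapers  (hypothetical name; needs chmod +x)
import subprocess

# cron.daily already provides the once-a-day schedule,
# so no schedule/while-True loop is needed here.
subprocess.run(["python3", "/home/ec2-user/testing_python/lhvscraper.py"], check=True)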
The alternative is to add a line to your own personal crontab. You can list your crontab with crontab -l, and you can edit it with crontab -e. To run something once a day at noon, you might add:
0 12 * * * /home/user/src/my_daily_script.py
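For the setup in the question, the entry could invoke the interpreter explicitly and capture output in a log file, for example (paths assumed from the question):
0 12 * * * /usr/bin/python3 /home/ec2-user/testing_python/main.py >> /home/ec2-user/testing_python/main.log 2>&1
If cron drives the schedule, remove the schedule/while-True loop from main.py so the script runs the scrapers once and exits.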

Related

Steam browser protocol failing silently when run over ssh

I am trying to launch a steam game on my computer through an ssh connection (into a Win10 machine). When run locally, the following python call works.
subprocess.run("start steam://rungameid/[gameid]", shell=True)
However, whenever I run this over an SSH connection, either in an interactive interpreter or by invoking a script on the target machine, my Steam client suddenly exits.
I haven't noticed anything in the Steam logs, except that Steam\logs\connection_log.txt contains a logoff and a new session start each time. This is not the case when I run the command locally on my machine. Why is Steam aware of the different source of this command, and why is this causing the Steam connection to drop? Can anyone suggest a workaround?
Thanks.
Steam is likely failing to launch the application because Windows services, including OpenSSH server, cannot access the desktop, and, hence, cannot launch GUI applications. Presumably, Steam does not expect to run an application in an environment in which it cannot interact with the desktop, and this is what eventually causes Steam to crash. (Admittedly, this is just a guess—it's hard to be sure exactly what is happening when the crash does not seem to appear in the logs or crash dumps.)
You can see a somewhat more detailed explanation of why starting GUI applications over SSH fails when the server is run as a Windows service in this answer by domih to this question about running GUI applications over SSH on Windows.
domih also suggests some workarounds. If it is an option for you, the simplest one is probably to download and run OpenSSH server manually instead of running the server as a service. You can find the latest release of Win32-OpenSSH/Windows for OpenSSH here.
The other workaround that still seems to work is to use schtasks. The idea is to create a scheduled task that runs your command—the Task Scheduler can access the desktop. Unfortunately, this is only an acceptable solution if you don't mind waiting until the next minute at least; schtasks can only schedule tasks to occur exactly on the minute. Moreover, to be safe to run at any time, code should probably schedule the task for at least one minute into the future, meaning that wait times could be anywhere between 1–2 minutes.
There are also other drawbacks to this approach. For example, it's probably harder to monitor the running process this way. However, it might be an acceptable solution in some circumstances, so I've written some Python code that can be used to run a program with schtasks, along with an example. The code depends on the shortuuid package; you will need to install it before trying the example.
import subprocess
import tempfile
import shortuuid
import datetime

def run_with_schtasks_soon(s, delay=2):
    """
    Run a program with schtasks with a delay of no more than
    delay minutes and no less than delay - 1 minutes.
    """
    # delay needs to be no less than 2 since, at best, we
    # could be calling subprocess at the end of the minute.
    assert delay >= 2
    task_name = shortuuid.uuid()
    temp_file = tempfile.NamedTemporaryFile(mode="w", suffix=".bat", delete=False)
    temp_file.write('{}\nschtasks /delete /tn {} /f\ndel "{}"'.format(s, task_name, temp_file.name))
    temp_file.close()
    run_time = datetime.datetime.now() + datetime.timedelta(minutes=delay)
    time_string = run_time.strftime("%H:%M")
    # This is locale-specific. You will need to change this to
    # match your locale. (locale.setlocale and the "%x" format
    # does not seem to work here)
    date_string = run_time.strftime("%m/%d/%Y")
    return subprocess.run("schtasks /create /tn {} /tr {} /sc once /st {} /sd {}".format(task_name,
                                                                                         temp_file.name,
                                                                                         time_string,
                                                                                         date_string),
                          shell=True)

if __name__ == "__main__":
    # Runs The Witness (if you have it)
    run_with_schtasks_soon("start steam://rungameid/210970")

How to watch a Directory in HDFS for incoming Files using Python? (Python Script is executed by Docker Container; No Cronjob in HDFS)

Scenario: My Python script runs in a Docker container that is deployed in Rancher (a Kubernetes cluster), so the container is always running. I want to implement a method that watches a directory in my HDFS for incoming files. If new files are there, I want the script to execute further actions (preprocessing steps to wrangle the data). When the new files have been processed they should be deleted, and after that the script should wait for new incoming files and process them as well.
It should therefore not be a cron job in HDFS; I need the code inside the script that is executed by the Docker container. Currently I am using the HDFS CLI to connect to my HDFS. For Java I found INotify, but I need to do this with Python.
Does anybody know a Python library or some other way to get this going?
# Schedule the script below in crontab at a 1 min or 5 min interval, based on your requirement
# Update the parameters (HDFSLocation, file name, etc.) as per the requirement
# Update the script to trigger an alert (send mail / trigger another script) if newHDFSFileCount > previousHDFSFileCount
import subprocess
import os

# Parameters
cwd = os.getcwd()
file = 'HDFSFileCount.txt'
fileWithPath = os.path.join(cwd, file)
HDFSLocation = "/tmp"
previousHDFSFileCount = 0
newHDFSFileCount = 0

# Calculate the new HDFS file count; the first line of "hadoop fs -ls"
# output looks like "Found N items".
out = subprocess.Popen(['hadoop', 'fs', '-ls', HDFSLocation],
                       stdout=subprocess.PIPE).communicate()
listing = out[0].decode()
if not listing:
    newHDFSFileCount = 0
else:
    newHDFSFileCount = int(listing.splitlines()[0].split()[1])

# Read the previous count from the state file, or create it on the first run.
if os.path.exists(fileWithPath):
    with open(fileWithPath, "r") as f:
        previousHDFSFileCount = int(f.read().strip() or 0)
else:
    with open(fileWithPath, "w+") as f:
        f.write(str(newHDFSFileCount))
    previousHDFSFileCount = newHDFSFileCount

# Persist the new count when new files have arrived.
if newHDFSFileCount > previousHDFSFileCount:
    with open(fileWithPath, "w") as f:
        f.write(str(newHDFSFileCount))
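Since the script already runs continuously inside the container, a cron-free alternative is simply to poll the directory in a loop from within the script. Below is a minimal sketch assuming the hdfs (HdfsCLI) package and a WebHDFS endpoint at http://namenode:9870 (the endpoint, user and watch directory are assumptions; adjust them to your cluster):
import time
from hdfs import InsecureClient  # pip install hdfs (HdfsCLI)

# Assumed WebHDFS endpoint, user and watch directory.
client = InsecureClient('http://namenode:9870', user='hdfs')
WATCH_DIR = '/tmp/incoming'

def process(path):
    # Placeholder for the preprocessing / data-wrangling steps.
    print('processing', path)

while True:
    for name in client.list(WATCH_DIR):
        full_path = WATCH_DIR + '/' + name
        process(full_path)
        client.delete(full_path)  # remove the file once it has been processed
    time.sleep(30)                # poll every 30 seconds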

django __init__.py file not executing pygame code

I'm currently developing a website using django that stores the IP address of people who access the site in a database. The idea is, that every week at midnight the website completes a traceroute on every item in the database and then maps the geographical location of each hop to the destination onto a map using pygame. This map is then stored as an image and shown on the website.
Currently, everything works individually as it should.
My __init__.py file currently looks something like this:
import subprocess, sys
print "Starting scripts..."
P = subprocess.Popen([sys.executable, "backgroundProcess.py"], stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
print "...Scripts started"
When running this from the command prompt, or even from the GUI, it works fine, and the map is drawn correctly if the time is right. However, when the script is run as the website starts, the text is printed correctly (Starting scripts... and ...Scripts started) but the map is not drawn. In short, my question is: does Django limit what you can do in the __init__.py files?
Code at a module's top level (whether it's a package's __init__.py or just any regular module) is only executed once per process, on the first import of that module. Since Django is served by long-running processes, this code will only get executed when a new process is started.
The simplest solution for a scheduled background task is to write a Django management command and call it from a cron job (or whatever scheduler is available on your system). You really don't need that subprocess stuff here...
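As an illustration only, a minimal management command sketch (the app and command names are made up; the file would live at yourapp/management/commands/draw_map.py):
# yourapp/management/commands/draw_map.py  (hypothetical app and command names)
from django.core.management.base import BaseCommand

class Command(BaseCommand):
    help = "Run the traceroutes and redraw the map image."

    def handle(self, *args, **options):
        # Call your existing traceroute/pygame drawing code here,
        # e.g. the logic currently living in backgroundProcess.py.
        self.stdout.write("map regenerated")
A weekly crontab entry could then invoke it with the project's interpreter, for example: 0 0 * * 0 /path/to/venv/bin/python /path/to/project/manage.py draw_map (paths are assumptions).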

Running a Python Script every n Days in Django App

I am looking to run a Python script every n days as part of a Django app.
I need this script to add new rows to the database for each attribute that a user has.
For example a User has many Sites each that have multiple Metric data points. I need to add a new Metric data point for each Site every n days by running a .py script.
I have already written the script itself, but it just works locally.
Is it the right approach to:
1) Get something like Celery or just a simple cron task running to run the python script every n days.
2) Have the python script run through all the Metrics for each Site and add a new data point by executing a SQL command from the python script itself?
You can write a Django management command, then use crontab to run that command periodically.
The first step is to write the script - which you have already done. This script should run the task which is to be repeated.
The second step is to schedule the execution.
The de-facto way of doing that is to write a cron entry for the script. This will let the system's cron daemon trigger the script at the interval you select.
To do so, create an entry in the crontab file for the user that owns the script; the command to do so is crontab -e, which will load a plain text file in your preferred editor.
Next, you need to add an entry to this file. The format is:
minutes hours day-of-month month day-of-week script-to-execute
So if you want to run your script every 5 days:
0 0 */5 * * /usr/bin/python3 /home/user/path/to/script.py
(The entry invokes the Python interpreter directly; also note that */5 in the day-of-month field restarts at the beginning of each month, so the interval is only approximately every 5 days.)
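If you go with option 1 (a cron task driving a Django management command), the command body can use the ORM directly instead of issuing raw SQL. A short sketch using the hypothetical Site and Metric models from the question:
# Body of a management command's handle(), assuming models named Site and Metric.
from yourapp.models import Site, Metric  # hypothetical import path

def add_data_points():
    for site in Site.objects.all():
        # Replace value=0 with the real measurement for this site.
        Metric.objects.create(site=site, value=0)
The crontab entry would then call manage.py with the project's interpreter, e.g. 0 0 */5 * * /path/to/venv/bin/python /path/to/project/manage.py add_metrics (paths and the command name are assumptions).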

Python crontab stops without error

I have a Python/Django application running on an Ubuntu Server. The application updates the stock for a webshop.
I have made a job in crontab to update the stock every night.
# daily stock update, starts at 23:30
30 23 * * * ./Cronjobs/stockUpdate.sh
stockUpdate.sh:
#!/bin/bash
cd Projecten
source booksVenv/bin/activate
cd Books/books
cat stockUpdate.py | python manage.py shell
stockUpdate.py:
from core.models import Stock
Stock().getAllStock()
Running Stock().getAllStock() by hand works fine.
For example: I log in to the server via SSH, activate the virtual environment, start the Django shell and run the getAllStock method.
However, the cron job just seems to stop while running getAllStock, without an error. A log is placed in /var/mail/.
When I open the file with nano, here is what I get.
# more than 500 pages of prints
Go to next page.
>>>
Here is what I think could be going wrong:
* I use too many print statements in my code and they mess up the job.
* The job starts at 23:30 but takes a few hours, so it is stopped once the next day begins (i.e., after about half an hour of running).
Can someone please tell me why this is occurring and give me some details as to how to debug and fix the issue.
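One way to narrow this down is to capture everything the job prints into a log file instead of relying on the truncated mail in /var/mail, for example by changing the crontab entry to something like this (the log path is an assumption):
30 23 * * * $HOME/Cronjobs/stockUpdate.sh >> $HOME/Cronjobs/stockUpdate.log 2>&1
The end of that log should show whether the process was killed or exited on its own.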
