I was wondering: is there any easy way to pull all the dependencies of my Python script into a file (or files) that I could include with my script in order to run it on Airflow? This would be a huge time saver for me.
To be clear, it CANNOT be an exe; the Airflow scheduler runs Python scripts, not exes.
(My company uses the Airflow scheduler, and the team that supports it is very slow. Every time I need a new package installed on Airflow because a script depends on it, it takes months of confusion, multiple tickets, and wasted time. I don't have the access level to fix it myself, and they never grant that access level to developers.)
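What I have in mind is something along the lines of vendoring the packages into a folder that ships alongside the script, so nothing extra has to be installed on the Airflow side. A rough sketch of that idea (the folder name and paths here are made up, not anything our Airflow team has blessed):

    # locally, before shipping the script (one-off):
    #   pip install -r requirements.txt --target ./vendor
    # then at the top of the script, before any third-party imports:
    import os
    import sys

    sys.path.insert(0, os.path.join(os.path.dirname(__file__), "vendor"))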
Related
I have a python script that runs locally via a scheduled task each day. Most of the time, this is fine -- except when I'm on vacation and the computer it runs on needs to be manually restarted. Or when my internet/power is down.
I am interested in putting it on some kind of rented server time. I'm a total newbie at this (having never had a production-type process like this). I was unable to find any tutorials that seemed to address this type of use case. How would I install my Python environment and any config, data files, or programs that the script needs (e.g., it does some web scraping and uses headless Chrome w/a defined user profile)?
Given the nature of the program, is it possible to do this, or would I need to get a dedicated server whose environment can be better set up for my specific needs? The process runs for about 20 seconds a day.
Setting up a whole dedicated server for 20 seconds' worth of work is really suboptimal. I see a few options:
Get a cloud-based VM that gets spun up and down only to run your process. That's relatively easy to automate on Azure, GCP and AWS.
Dockerize the application along with its whole environment and run it as an image in the cloud - e.g. on a service like Beanstalk (AWS) or App Service (Azure). This is more complex, but should be cheaper as it consumes fewer resources.
Get a dedicated VM (droplet?) on a service like Digital Ocean, Heroku or pythonanywhere.com - depending on the specifics of your script, it may be quite easy and cheap to set up. This is the easiest and most flexible solution for a newbie I think, but it really depends on your script - you might hit some limitations.
In terms of setting up your environment, there are multiple options, with the most commonly used being:
pyenv (my preferred option)
anaconda (quite easy to use)
virtualenv / venv
To efficiently recreate your environment, you'll need to come up with a list of dependencies (libraries your script uses).
A summary of the steps:
run $pip freeze > requirements.txt locally
manually edit the requirements.txt file by removing all packages that are not used by your script
create a new virtual environment via pyenv, anaconda or venv and activate it wherever you want to run the script
copy your script & requirements.txt to the new location
run $pip install -r requirements.txt to install the libraries
ensure the script works as expected in its new location
set up the cron job (a sample entry is shown after this list)
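For that last step, the crontab entry could look something like this (the paths are placeholders for wherever your virtual environment and script end up; edit with crontab -e):

    # run the script every day at 06:00 with the virtual environment's interpreter
    0 6 * * * /home/me/.venvs/myscript/bin/python /home/me/myscript/main.py >> /home/me/myscript/cron.log 2>&1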
If the script only runs for 20 seconds and you are not worried about scalability, running it directly on a NAS or Raspberry Pi could be a solution for a private environment, if you have the hardware on hand.
If you don’t have the necessary hardware available, you may want to have a look at PythonAnywhere, which offers a free version.
https://help.pythonanywhere.com/pages/ScheduledTasks/
https://www.pythonanywhere.com/
However, in any professional environment I would opt for a tool like Apache Airflow. Your process of “it does some web scraping and uses headless chrome w/a defined user profile” describes an ETL workflow.
https://airflow.apache.org/
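For orientation, a daily job in Airflow is just a small DAG file. A minimal sketch, assuming Airflow 2.x (the DAG id, task id and the body of scrape() are placeholders, not anything from your project):

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python import PythonOperator

    def scrape():
        # your existing scraping / headless-Chrome logic would be called here
        ...

    with DAG(
        dag_id="daily_scrape",            # placeholder name
        start_date=datetime(2024, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        PythonOperator(task_id="scrape", python_callable=scrape)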
So far when dealing with web scraping projects, I've used GAppsScript, meaning that I can easily trigger the script to be run once a day.
Is there an equivalent service when dealing with python scripts? I have a RaspberryPi, so I guess I can keep it on 24/7 and use cronjobs to trigger the script daily. But that seems rather wasteful, since I'm talking about a few small scripts that take only a few seconds to run.
Is there any service that allows me to trigger a python script once a day? (without a need to keep a local machine on 24/7) The simpler the solution the better, wouldn't want to overengineer such a basic use case if a ready-made system already exists.
The only service I've found so far to do this with is WayScript; here's a Python example running in the cloud. The free tier should be enough for most simple/hobby-tier use cases.
I have a python folder that contains the main file and its supporting files.
I run the Main file and it calls the other files.
I now need to package this and schedule it on Azure so it runs every hour.
Can someone guide me on how to do this?
A great way to run Python code in Azure without the need for any supporting infrastructure to maintain is Azure Functions--you only pay for the execution of your code. Functions can be triggered using a timer, such as every hour.
To get started, see: https://learn.microsoft.com/en-us/azure/azure-functions/functions-reference-python
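As a rough illustration of the shape of a timer-triggered function (using the v1 Python programming model, where the function folder holds an __init__.py plus a function.json whose timer binding carries the schedule, e.g. "0 0 * * * *" for hourly; run() below stands in for your existing Main file):

    # __init__.py of the function (paired with a function.json timer binding)
    import logging

    import azure.functions as func

    def main(mytimer: func.TimerRequest) -> None:
        # runs on the schedule defined in function.json (e.g. "0 0 * * * *" = every hour)
        if mytimer.past_due:
            logging.info("The timer is past due")
        logging.info("Hourly job starting")
        # import and call your existing entry point here, e.g.:
        # from . import my_main
        # my_main.run()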
I currently have a handful of small Python scripts on my laptop that are set to run every 1-15 minutes, depending on the script in question. They perform various tasks for me like checking for new data on a certain API, manipulating it, and then posting it to another service, etc.
I have a NAS/personal server (unRAID) and was thinking about moving the scripts to there via Docker, but since I'm relatively new to Docker I wasn't sure about the best approach.
Would it be correct to take something like the Phusion Baseimage which includes Cron, package my scripts and crontab as dependencies to the image, and write the Dockerfile to initialize all of this? Or would it be a more canonical approach to modify the scripts so that they are threaded with recursive timers and just run each script individually in its own official Python image?
No, dude, just install Python on the Docker container/image, move your scripts over, and run them as normal.
You may have to expose some ports or add a firewall exception, but your container can behave like a native Linux environment.
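If you do want to avoid cron inside the container entirely, a single long-running Python process can act as the scheduler for the others. A minimal sketch (the script names and intervals are made up):

    # scheduler.py - a stand-in for cron inside one Python container
    import subprocess
    import time

    TASKS = [
        ("check_api.py", 60),        # (script, interval in seconds) - placeholders
        ("post_results.py", 900),
    ]

    last_run = {script: 0.0 for script, _ in TASKS}
    while True:
        now = time.time()
        for script, interval in TASKS:
            if now - last_run[script] >= interval:
                subprocess.run(["python", script], check=False)
                last_run[script] = now
        time.sleep(1)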
I have a Python script that runs in several modes. One of those modes monitors certain files, and if those files have been modified, the script restores them. The way I do this is to run the script every minute via cron.
Another cron job exists (actually the same script called with a different argument) to remove the script from the crontab when the scheduled time has elapsed. Initially, I was attempting to work with a crontab in /etc/cron.d. The script behaves as expected if executed on the command line, but does not edit the crontab when it is run from cron.
I then switched to writing a temporary file and executing crontab tempfile (via subprocess.Popen) from the script. This doesn't work either, as the crontab is simply not created. Executing crontab tempfile from the command line and using the temporary file created by the script works as expected.
I can't use the python-crontab library as this is a commercial project and that library is GPLed.
Are there any inherent limitations in cron that prevent either approach from working?
The GPL is not anti-commercial. python-crontab can be used in commercial products and services. You must only follow the copyleft rules, which state that the actual code itself can't be made proprietary. You can sell it as much as you like, and as the author I encourage you to make money from my work.
Besides that misunderstanding, it doesn't look like your problem requires python-crontab anyway. You could just open the files yourself, and if that doesn't work, it was never going to work with python-crontab either.
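For the "write a temp file and load it with crontab" route, the usual gotchas are relative paths and cron's stripped-down environment. A hedged sketch with absolute paths and error capture (the crontab binary path and the example entry are assumptions, not taken from your setup):

    import subprocess
    import tempfile

    def install_crontab(lines):
        # most cron implementations reject a crontab file that lacks a trailing newline
        with tempfile.NamedTemporaryFile("w", suffix=".cron", delete=False) as f:
            f.write("\n".join(lines) + "\n")
            path = f.name
        result = subprocess.run(
            ["/usr/bin/crontab", path],   # absolute path; jobs run under cron get a minimal PATH
            capture_output=True, text=True,
        )
        if result.returncode != 0:
            # when called from a job that itself runs under cron, stderr is the only clue you get
            raise RuntimeError("crontab failed: " + result.stderr)

    install_crontab(["* * * * * /usr/bin/python3 /opt/monitor/restore.py"])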