Job scheduling for data scraping on Python

I'm scraping (extracting) data from a website. The data contains two values that I need: the (grid) frequency value and the time.
The data on the website is updated every second. I'd like to continuously save these values (append them) to a list or a tuple using Python. To do that I tried the schedule library. The following commands run the data scraping function (socket_freq) every second.
import schedule
schedule.every(1).seconds.do(socket_freq)
while True:
    schedule.run_pending()
I'm facing two problems:
1. I don't know how to restrict the schedule to a chosen time interval. For example, I'd like to run it for 5 or 10 minutes. How do I tell the schedule to stop after a certain time?
2. If I run this code and stop it after a few seconds (using break), I often get multiple entries. For example, here is one result, where the first list in the tuple holds the time values and the second list holds the frequency values:
out:
(['19:27:02','19:27:02','19:27:02','19:27:03','19:27:03','19:27:03','19:27:03','19:27:03','19:27:03','19:27:03','19:27:04','19:27:04','19:27:04', ...],
['50.020','50.020','50.020','50.018','50.018','50.018','50.018','50.018','50.018','50.018','50.017','50.017','50.017'...])
As you can see, each time value is appended multiple times, although I used a schedule that runs every second. What I would actually expect to retrieve is:
out:
(['19:27:02','19:27:03','19:27:04'],['50.020','50.018','50.017'])
Does anybody know how to solve these problems?
Thanks!
(I'm using python 2.7.9)

OK, here's how I would tackle these problems:
Obtain a timestamp at the start of your program, then check whether it has been running long enough each time you execute the code you are scheduling.
Use time.sleep() to put your program to sleep for a period of time between checks.
Check my example below:
import schedule
import datetime
import time
# Obtain current time
start = datetime.datetime.now()
# Simple callable for example
class DummyClock:
    def __call__(self):
        print(datetime.datetime.now())
schedule.every(1).seconds.do(DummyClock())
while True:
    schedule.run_pending()
    # 5 minutes == 300 seconds
    # total_seconds() covers the whole elapsed time, not just the seconds field
    if (datetime.datetime.now() - start).total_seconds() >= 300:
        break
    # And here we halt execution for a second
    time.sleep(1)
All refactoring is welcome
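For the duplicated entries, one option is to skip any reading whose timestamp matches the last one stored. The sketch below is not tied to schedule and uses only the standard library. It assumes a hypothetical read_sample() callable, standing in for socket_freq, that returns one (time_string, frequency_string) pair per call. The clock and sleep parameters are also illustrative and exist only to make the loop easy to test.

```python
import time


def collect(read_sample, duration, interval=1.0, clock=time.time, sleep=time.sleep):
    """Poll read_sample() for `duration` seconds and keep one entry per timestamp.

    read_sample is a hypothetical stand-in for the scraping function; it is
    assumed to return a (time_string, frequency_string) pair.
    """
    times, freqs = [], []
    deadline = clock() + duration
    while clock() < deadline:
        stamp, freq = read_sample()
        # Skip readings that carry the same timestamp as the last stored one.
        if not times or times[-1] != stamp:
            times.append(stamp)
            freqs.append(freq)
        sleep(interval)
    return times, freqs
```

Under those assumptions, collect(socket_freq, duration=300) would run for about five minutes and return the two lists as a tuple.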

Related

Schedule a Job every minute on the exact minute during specific times using Python Schedule Library?

I am using the Python Schedule Library, and I have been using the following line of code to schedule a job to run every minute, on the exact minute, regardless of when the program is started. For instance, if the program is run at 13:51:30, rather than starting one minute after that time, which would be 13:52:30, it will start at 13:52:00.
This is the line of code used to achieve this:
schedule.every(1).minutes.at(":00").do(job)
Now, how can I get schedule to do this only between specific times? For instance, what if I want this schedule to run from 10:00 until 11:00?

I hope I understood the question correctly. I use this:
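import time
from datetime import datetime
import schedule
def func():
    # Called every minute, but only acts between 10:00 and 10:59.
    now_datetime = datetime.now()
    if 10 <= now_datetime.hour < 11:
        print(now_datetime)
def main():
    # Register the job once, then keep polling the scheduler.
    schedule.every().minute.at(':00').do(func)
    while True:
        schedule.run_pending()
        time.sleep(1)
if __name__ == '__main__':
    main()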
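Comparing the hour works for whole-hour windows. If the window has to start or end at something like 10:30, comparing datetime.time values is one way to do it. The following is a minimal sketch; in_window and job_if_in_window are hypothetical helper names, not part of schedule.

```python
from datetime import datetime, time as dtime


def in_window(now, start=dtime(10, 0), end=dtime(11, 0)):
    """Return True when now falls in the half-open window [start, end)."""
    return start <= now.time() < end


def job_if_in_window(job, now=None):
    """Run job() only inside the window; return whether it ran."""
    now = now or datetime.now()
    if in_window(now):
        job()
        return True
    return False
```

A wrapper like this could then be registered with schedule.every().minute.at(':00').do(job_if_in_window, job), since do() passes extra arguments on to the job function. If the job should also fire at 11:00:00 itself, the upper comparison would need to be <= instead of <.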

Python schedule not running as scheduled

I am using the code below to execute a Python script every 5 minutes, but after the first run it does not execute at the exact time.
For example, if I first execute it at exactly 9:00:00 AM, the next run happens at 9:05:25 AM and the one after that at 9:10:45 AM. Because I run the script every 5 minutes for a long time, it cannot record at the exact time.
import schedule
import time
from datetime import datetime
# Functions setup
def geeks():
    print("Shaurya says Geeksforgeeks")
    now = datetime.now()
    current_time = now.strftime("%H:%M:%S")
    print("Current Time =", current_time)
# Task scheduling
# geeks() is called at a fixed interval (2 minutes in this snippet).
schedule.every(2).minutes.do(geeks)
# Loop so that the scheduling task
# keeps on running all time.
while True:
    # Checks whether a scheduled task
    # is pending to run or not
    schedule.run_pending()
    time.sleep(1)
Is there an easy fix for this so that the script runs at exactly 5-minute intervals?
Please don't suggest crontab; I have tried it, but it does not work for me.
I run the Python script on different operating systems.

Your geeks function takes time to execute, and schedule only starts counting the interval for the next run after geeks has finished. That is why the runs drift away from the exact time when the script runs for a long time.
If you want your function to run at exact times, you can try this:
# Interval-based schedule from the question, disabled:
#schedule.every(2).minutes.do(geeks)
# Schedule geeks() at fixed minute marks instead: :00, :05, ..., :55
for minute in range(0, 60, 5):
    schedule.every().hour.at(":" + str(minute).zfill(2)).do(geeks)
# Loop so that the scheduling task

It's because schedule does not account for the time it takes for the job function to execute. Use ischedule instead. The following would work for your task.
import ischedule
ischedule.schedule(geeks, interval=2*60)
ischedule.run_loop()
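The same idea can be sketched with the standard library alone: compute every deadline from the start time instead of from the moment the previous call finished, so the job's own run time does not accumulate. This is only an illustration, assuming Python 3 (for time.monotonic) and a finite number of repeats. run_every and its clock and sleep parameters are hypothetical names chosen for the example.

```python
import time


def run_every(job, interval, repeats, clock=time.monotonic, sleep=time.sleep):
    """Call job() `repeats` times, aiming at start + k * interval for each call.

    Deadlines are derived from the start time, not from when the previous
    call finished, so the time spent inside job() does not add up.
    """
    start = clock()
    for k in range(1, repeats + 1):
        delay = start + k * interval - clock()
        if delay > 0:
            sleep(delay)
        job()
```

This keeps the spacing between calls steady, but it does not align them to wall-clock marks such as 9:05:00. For that, the hourly :MM scheduling shown in the other answer is the closer fit.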

Run a function every minute python?

I want to run a function at the start of every minute, without it lagging over time. Using time.sleep(60) eventually lags.
import datetime
while True:
    now = datetime.datetime.now().second
    if now == 0:
        print(datetime.datetime.now())
The function doesn't take a minute to run, so as long as it runs at the beginning of the minute it should be fine. I'm not sure this code is resource-efficient, as it checks every millisecond or so; even if it drifts, the if check should correct it.

Repeat scheduling shouldn't really be done in Python, especially by using time.sleep. The best way would be to get your OS to schedule running the script, using something like cron if you're on Linux or Task Scheduler if you're on Windows.

Assuming that you've examined and discarded operating-system-based solutions such as cron or Windows Scheduled Tasks, what you suggest will work, but you're right that it's CPU intensive. You would be better off sleeping for one second after each check so that:
It's less resource intensive; and, more importantly,
It doesn't execute multiple times at the start of each minute if the job takes less than a second.
In fact, you could sleep for even longer immediately after the payload by checking how long it is until the next minute, and use the minute to decide whether to run, in case the sleep takes you into a second that isn't zero. Something like this may be a good start:
import datetime
import time
# Ensure we do it quickly first time.
lastMinute = datetime.datetime.now().minute - 1
# Loop forever.
while True:
    # Get current time, do payload if new minute.
    thisTime = datetime.datetime.now()
    if thisTime.minute != lastMinute:
        doPayload()  # your own function
        lastMinute = thisTime.minute
        # Try to get close to hh:mm:55 (slow mode).
        # If payload took more than 55s, just go
        # straight to fast mode.
        afterTime = datetime.datetime.now()
        if afterTime.minute == thisTime.minute:
            if afterTime.second < 55:
                time.sleep(55 - afterTime.second)
    # After hh:mm:55, check every second (fast mode).
    time.sleep(1)
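A simpler variation, if one long sleep per minute is acceptable, is to compute the time remaining until the next minute boundary and sleep for exactly that long. This is a rough sketch rather than a replacement for the loop above. seconds_until_next_minute and run_on_the_minute are hypothetical names, and the payload is assumed to finish in well under a minute.

```python
import datetime
import time


def seconds_until_next_minute(now):
    """Return the number of seconds from now to the next hh:mm:00 boundary."""
    return 60.0 - (now.second + now.microsecond / 1e6)


def run_on_the_minute(payload, runs):
    """Call payload() at the start of each of the next `runs` minutes."""
    for _ in range(runs):
        time.sleep(seconds_until_next_minute(datetime.datetime.now()))
        payload()
```

Because the remaining time is recomputed from the current clock before every sleep, small delays do not add up from one minute to the next.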

Timed method in Python

How do I make part of a Python script (only one method; the whole script runs 24/7) run every day at set times, exactly every 20 minutes? For example 12:20, 12:40, 13:00, and so on in every hour.
I cannot use cron. I tried periodic execution, but that is not as accurate as I would like, because it depends on the script's starting time.

The schedule module may be useful for this. See the answers to "How do I get a Cron like scheduler in Python?" for details.

You can either call this method in a loop that sleeps for some time between calls:
from time import sleep
while True:
    sleep(1200)
    my_function()
Or, so that it is triggered at set times, you could use datetime to compare the current timestamp with the next execution time:
import datetime
import time
def next_trigger_time():
    # The next run is due 20 minutes from now.
    return datetime.datetime.now() + datetime.timedelta(minutes=20)
trigger_time = datetime.datetime.now()
while True:
    # Use >= rather than ==: an exact timestamp match would almost never occur.
    if datetime.datetime.now() >= trigger_time:
        my_function()
        trigger_time = next_trigger_time()
    time.sleep(1)
I think, however, that having the system call the script would be a nicer solution.
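If the runs have to land on the clock marks from the question (hh:00, hh:20, hh:40) regardless of when the script started, the next trigger can be derived from the current time. Here is a small sketch, where next_aligned is a hypothetical helper and step_minutes is assumed to divide 60 evenly.

```python
import datetime


def next_aligned(now, step_minutes=20):
    """Return the next datetime whose minute is a multiple of step_minutes.

    With step_minutes=20 the results fall on hh:00, hh:20 and hh:40.
    step_minutes is assumed to divide 60 evenly.
    """
    floored = now.replace(minute=now.minute - now.minute % step_minutes,
                          second=0, microsecond=0)
    return floored + datetime.timedelta(minutes=step_minutes)
```

A loop could then sleep for (next_aligned(now) - now).total_seconds() before each call to my_function(), so every run is scheduled against the wall clock rather than against the previous run.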

For example, use Redis and the rq-scheduler package, which lets you schedule tasks for a specific time. Run the first script, save its starting time in a variable, and calculate the starting time + 20 minutes; when the current script finishes, push the same task again with the proper time.

Python: Restrict the code to be run for an hour

I have written a scraper that does HTML scraping and then uses an API to get some data; since the code is very long, I haven't included it here. I have implemented a random sleep method and use it within my code to stay under the throttle limit. But I want to make sure I don't overrun this code, so my idea is to run it for 3-4 hours, take a breather, and then run it again. I haven't done anything like this in Python. I tried searching but am not sure where to start, so some guidance would be great. If Python has a specific module for this, a link to it would be a great help.
Also, is this relevant? I don't think I need this level of complication.
Suggestions for a Cron like scheduler in Python?
I have functions for every single scraping task, and I have main method calling all those functions.

You can use a threading.Timer object to schedule an interrupt signal to the main thread after the time is exceeded:
import threading
try:
    import thread  # Python 2
except ImportError:
    import _thread as thread  # Python 3 renamed the module
def longjob():
    try:
        # do your job
        while True:
            print('*')
    except KeyboardInterrupt:
        # do your cleanup
        print('ok, giving up')
def terminate():
    print('sorry, pal')
    # Raises KeyboardInterrupt in the main thread.
    thread.interrupt_main()
time_limit = 5  # terminate in 5 seconds
threading.Timer(time_limit, terminate).start()
longjob()
Put this in your crontab and run every time_limit + 2 minutes.
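interrupt_main() raises KeyboardInterrupt in the main thread, which may arrive in the middle of an operation. If the work can be split into small steps, a gentler variant is to let the timer set a flag that the loop checks between steps. The sketch below assumes a hypothetical step() callable that performs one small unit of the job. run_with_time_limit is likewise an illustrative name.

```python
import threading


def run_with_time_limit(step, time_limit):
    """Call step() repeatedly until time_limit seconds have passed.

    Returns the number of completed calls. step is a hypothetical callable
    that performs one small unit of the scraping work.
    """
    stop = threading.Event()
    timer = threading.Timer(time_limit, stop.set)
    timer.start()
    calls = 0
    try:
        while not stop.is_set():
            step()
            calls += 1
    finally:
        timer.cancel()  # Harmless if the timer has already fired.
    return calls
```

The trade-off is that a single slow step can overrun the limit, because the flag is only checked between steps.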

You could just note the time you have started and each time you want to run something make sure you haven't exceeded the given maximum. Something like this should get you started:
from datetime import datetime
MAX_SECONDS = 3600
# note the time you have started
start = datetime.now()
while True:
    current = datetime.now()
    diff = current - start
    # total_seconds() counts the whole duration, including any full days
    if diff.total_seconds() >= MAX_SECONDS:
        # break the loop after MAX_SECONDS
        break
    # MAX_SECONDS not exceeded, run more tasks
    scrape_some_more()
Here's the link to the datetime module documentation.
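To cover the "run for a few hours, take a breather, then run again" part of the question, the same deadline check can be wrapped in an outer loop. This is a minimal sketch under stated assumptions: task is a hypothetical callable that does one unit of work, and run_in_bursts, clock and sleep are illustrative names, with the last two injectable only so the loop can be tested without waiting.

```python
import time


def run_in_bursts(task, run_seconds, rest_seconds, cycles,
                  clock=time.time, sleep=time.sleep):
    """Alternate between calling task() for run_seconds and resting.

    task is a hypothetical callable doing one unit of work (for example one
    scraping function call). Returns the total number of task() calls.
    """
    calls = 0
    for cycle in range(cycles):
        deadline = clock() + run_seconds
        while clock() < deadline:
            task()
            calls += 1
        if cycle < cycles - 1:
            sleep(rest_seconds)  # Take a breather between bursts.
    return calls
```

For instance, run_in_bursts(scrape_some_more, 3 * 3600, 1800, cycles=4) would run four three-hour bursts with half-hour rests in between; those numbers are placeholders, not recommendations.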
