I have jobs scheduled through APScheduler. I have 3 jobs so far, but will soon have many more, so I'm looking for a way to scale my code.
Currently, each job is its own .py file, and in each file I have turned the script into a function named run(). Here is my code:
from apscheduler.scheduler import Scheduler
import logging
import job1
import job2
import job3
logging.basicConfig()
sched = Scheduler()
@sched.cron_schedule(day_of_week='mon-sun', hour=7)
def runjobs():
    job1.run()
    job2.run()
    job3.run()
sched.start()
This works. Right now the code is just stupid, but it gets the job done. But when I have 50 jobs, the code will be stupidly long. How do I scale it?
Note: the actual names of the jobs are arbitrary and don't follow a pattern. The file is named scheduler.py and I run it using execfile('scheduler.py') in the Python shell.
import urllib
import threading
import datetime
pages = ['http://google.com', 'http://yahoo.com', 'http://msn.com']
#------------------------------------------------------------------------------
# Getting the pages WITHOUT threads
#------------------------------------------------------------------------------
def job(url):
    response = urllib.urlopen(url)
    html = response.read()
def runjobs():
    for page in pages:
        job(page)
start = datetime.datetime.now()
runjobs()
end = datetime.datetime.now()
print "jobs run in {} microseconds WITHOUT threads" \
    .format((end - start).microseconds)
#------------------------------------------------------------------------------
# Getting the pages WITH threads
#------------------------------------------------------------------------------
def job(url):
    response = urllib.urlopen(url)
    html = response.read()
def runjobs():
    threads = []
    for page in pages:
        t = threading.Thread(target=job, args=(page,))
        t.start()
        threads.append(t)
    for t in threads:
        t.join()
start = datetime.datetime.now()
runjobs()
end = datetime.datetime.now()
print "jobs run in {} microseconds WITH threads" \
    .format((end - start).microseconds)
Look at
http://furius.ca/pubcode/pub/conf/bin/python-recursive-import-test
This will help you import all Python (.py) files.
While importing, you can build a list that keeps a reference to each job function, for example:
[job1.run, job2.run]
Then iterate through the list and call each function :)
Thanks Arjun
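A minimal sketch of that idea, assuming each job module exposes a run() function as in the question. JOB_MODULES, load_jobs and run_all are hypothetical names chosen for illustration; they are not part of APScheduler or of the linked script.

```python
import importlib

# Hypothetical list of job module names; in the question these would be the
# arbitrary file names (without .py) that each define run().
JOB_MODULES = ["job1", "job2", "job3"]


def load_jobs(module_names):
    """Import each module by name and collect its run function (not its result)."""
    jobs = []
    for name in module_names:
        module = importlib.import_module(name)
        jobs.append(module.run)  # store the callable itself, without calling it
    return jobs


def run_all(jobs):
    """Call every collected job in order and return their results."""
    return [job() for job in jobs]
```

With this in place, the body of runjobs() in scheduler.py could shrink to run_all(load_jobs(JOB_MODULES)), and adding a job only means adding one name to the list.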
Background: I'm trying to run hundreds of Dymola simulations with the python-dymola interface. I managed to run them in a for-loop. Now I want to run them with multi-threading so that multiple models run in parallel (which will be much faster). Since probably nobody uses the interface, I wrote some simple code that also shows my problem:
1: Turn a for-loop into a function definition that is run inside another for-loop, where both the def and the for-loop share the same variable 'i'.
2: Turn a for-loop into a function definition and use multi-threading to execute it. A for-loop runs the commands one by one. I want to run them in parallel with a maximum of x threads at the same time. The result should be the same as when executing the for-loop.
Example-code:
import os
nSim = 100
ndig='{:01d}'
for i in range(nSim):
    os.makedirs(str(ndig.format(i)))
Note that the names of the created directories are just the numbers from the for-loop (this is important). Now, instead of using the for-loop, I would love to create the directories with multi-threading (note: probably not interesting for this short code, but when calling and executing hundreds of simulation models it definitely is interesting to use multi-threading).
So I started with something I thought was simple: turning the for-loop into a function that is then run inside another for-loop. I hoped to get the same result as with the for-loop code above, but got this error:
AttributeError: 'NoneType' object has no attribute 'start'
(note: I just started with this, because I did not use the def-statement before and the thread package is also new. After this I would evolve towards the multi-threading.)
1:
import os
nSim = 100
ndig='{:01d}'
def simulation(i):
    os.makedirs(str(ndig.format(i)))
for i in range(nSim):
    simulation(i=i).start
After that failed, I tried to evolve to multi-threading (converting the for-loop into something that does the same but with multi-threading and by that running the code parallel instead of one by one and with a maximum number of threads):
2:
import os
import threading
nSim = 100
ndig='{:01d}'
def simulation(i):
    os.makedirs(str(ndig.format(i)))
if __name__ == '__main__':
    i in range(nSim)
    simulation_thread[i] = threading.Thread(target=simulation(i=i))
    simulation_thread[i].daemon = True
    simulation_thread[i].start()
Unfortunately that attempt failed as well and now I got the error:
NameError: name 'i' is not defined
Does anybody have suggestions for issues 1 or 2?
Both examples are incomplete. Here's a complete example. Note that target gets passed the name of the function target=simulation and a tuple of its arguments args=(i,). Don't call the function target=simulation(i=i) because that just passes the result of the function, which is equivalent to target=None in this case.
import threading
nSim = 100
def simulation(i):
    print(f'{threading.current_thread().name}: {i}')
if __name__ == '__main__':
    threads = [threading.Thread(target=simulation,args=(i,)) for i in range(nSim)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
Output:
Thread-1: 0
Thread-2: 1
Thread-3: 2
.
.
Thread-98: 97
Thread-99: 98
Thread-100: 99
Note you usually don't want more threads than CPUs, which you can get from multiprocessing.cpu_count(). You can create a thread pool and use queue.Queue to post work that the threads execute. An example is in the Python Queue documentation.
You cannot call .start like this
simulation(i=i).start
on a non-threading object: simulation() returns None, which has no start attribute. Also, you have to import the threading module.
It seems you forgot to add 'for' and to indent the code in your loop. Change
i in range(nSim)
simulation_thread[i] = threading.Thread(target=simulation(i=i))
simulation_thread[i].daemon = True
simulation_thread[i].start()
to
simulation_thread = {}  # must exist before it is indexed
for i in range(nSim):
    # pass the function and its arguments; target=simulation(i=i) would call it right here
    simulation_thread[i] = threading.Thread(target=simulation, args=(i,))
    simulation_thread[i].daemon = True
    simulation_thread[i].start()
If you would like a maximum number of threads in a pool that works through all items in a queue, you can build on @mark-tolonen's answer like this:
import threading
import queue
import time
def main():
    size_of_threads_pool = 10
    num_of_tasks = 30
    task_seconds = 1
    q = queue.Queue()
    def worker():
        while True:
            item = q.get()
            print(my_st)
            print(f'{threading.current_thread().name}: Working on {item}')
            time.sleep(task_seconds)
            print(f'Finished {item}')
            q.task_done()
    my_st = "MY string"
    threads = [threading.Thread(target=worker, daemon=True) for i in range(size_of_threads_pool)]
    for t in threads:
        t.start()
    # send the tasks requests to the worker
    for item in range(num_of_tasks):
        q.put(item)
    # block until all tasks are done
    q.join()
    print('All work completed')
    # NO need this, as threads are while True, so never will stop..
    # for t in threads:
    #     t.join()
if __name__ == '__main__':
    main()
This will run 30 tasks of 1 second each, using 10 threads.
So the total time is about 3 seconds (30 tasks / 10 threads x 1 second).
$ time python3 q_test.py
...
All work completed
real 0m3.064s
user 0m0.033s
sys 0m0.016s
EDIT: I found another higher-level interface for asynchronously executing callables.
Use concurrent.futures, see the example in the docs:
import concurrent.futures
import urllib.request
URLS = ['http://www.foxnews.com/',
        'http://www.cnn.com/',
        'http://europe.wsj.com/',
        'http://www.bbc.co.uk/',
        'http://some-made-up-domain.com/']
# Retrieve a single page and report the URL and contents
def load_url(url, timeout):
    with urllib.request.urlopen(url, timeout=timeout) as conn:
        return conn.read()
# We can use a with statement to ensure threads are cleaned up promptly
with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
    # Start the load operations and mark each future with its URL
    future_to_url = {executor.submit(load_url, url, 60): url for url in URLS}
    for future in concurrent.futures.as_completed(future_to_url):
        url = future_to_url[future]
        try:
            data = future.result()
        except Exception as exc:
            print('%r generated an exception: %s' % (url, exc))
        else:
            print('%r page is %d bytes' % (url, len(data)))
Note max_workers=5, which sets the maximum number of threads, and
note the for url in URLS loop, which you can adapt to submit one task per item.
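Applied to the directory-creation example from the question, this could look roughly like the sketch below. It is only an illustration: simulation() here is a stand-in that creates a directory inside a temporary folder, and N_SIM, MAX_THREADS and run_simulations are hypothetical names.

```python
import concurrent.futures
import os
import tempfile

N_SIM = 20       # number of simulations (the question uses 100)
MAX_THREADS = 4  # hypothetical cap on concurrently running threads


def simulation(base_dir, i):
    """Stand-in for one simulation: create a directory named after i."""
    path = os.path.join(base_dir, str(i))
    os.makedirs(path)
    return path


def run_simulations(base_dir, n_sim=N_SIM, max_threads=MAX_THREADS):
    """Run simulation(base_dir, i) for i in range(n_sim) on a bounded pool."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_threads) as executor:
        futures = [executor.submit(simulation, base_dir, i) for i in range(n_sim)]
        # result() re-raises any exception raised in the worker thread
        return [f.result() for f in futures]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        created = run_simulations(tmp)
        print(len(created), "directories created")
```

The executor never runs more than max_threads simulations at once, and the set of directories is the same as with the plain for-loop.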
I've tried to create a scraper using Python in combination with Thread to make the execution faster. The scraper is supposed to parse all the shop names along with their phone numbers, traversing multiple pages.
The script runs without any issues. As I'm very new to working with Thread, I can't tell whether I'm doing it the right way.
This is what I've tried so far:
import requests
from lxml import html
import threading
from urllib.parse import urljoin
link = "https://www.yellowpages.com/search?search_terms=coffee&geo_location_terms=Los%20Angeles%2C%20CA&page={}"
def get_information(url):
    for pagelink in [url.format(page) for page in range(20)]:
        response = requests.get(pagelink).text
        tree = html.fromstring(response)
        for title in tree.cssselect("div.info"):
            name = title.cssselect("a.business-name span[itemprop=name]")[0].text
            try:
                phone = title.cssselect("div[itemprop=telephone]")[0].text
            except Exception: phone = ""
            print(f'{name} {phone}')
thread = threading.Thread(target=get_information, args=(link,))
thread.start()
thread.join()
The problem is that I can't find any difference in time or performance whether I run the above script with or without Thread. If I'm going about it wrong, how should I run the above script using Thread?
EDIT: I've tried to change the logic to use multiple links. Is it possible now? Thanks in advance.
You can use threading to scrape several pages in parallel, as below:
import requests
from lxml import html
import threading
from urllib.parse import urljoin
link = "https://www.yellowpages.com/search?search_terms=coffee&geo_location_terms=Los%20Angeles%2C%20CA&page={}"
def get_information(url):
    response = requests.get(url).text
    tree = html.fromstring(response)
    for title in tree.cssselect("div.info"):
        name = title.cssselect("a.business-name span[itemprop=name]")[0].text
        try:
            phone = title.cssselect("div[itemprop=telephone]")[0].text
        except Exception: phone = ""
        print(f'{name} {phone}')
threads = []
for url in [link.format(page) for page in range(20)]:
    thread = threading.Thread(target=get_information, args=(url,))
    threads.append(thread)
    thread.start()
for thread in threads:
    thread.join()
Note that the order of the data will not be preserved. If you scrape the pages one by one, the extracted data will come out in this order:
page_1_name_1
page_1_name_2
page_1_name_3
page_2_name_1
page_2_name_2
page_2_name_3
page_3_name_1
page_3_name_2
page_3_name_3
while with Threading data will be mixed:
page_1_name_1
page_2_name_1
page_1_name_2
page_2_name_2
page_3_name_1
page_2_name_3
page_1_name_3
page_3_name_2
page_3_name_3
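If the page order matters, one option is to have each thread return its results instead of printing them, and to collect them with ThreadPoolExecutor.map, which yields results in input order. A minimal sketch, assuming a hypothetical scrape_page() that stands in for the real request and parsing (it only sleeps and returns made-up names):

```python
import concurrent.futures
import random
import time


def scrape_page(page):
    """Hypothetical stand-in for get_information: returns names instead of printing."""
    time.sleep(random.uniform(0, 0.05))  # simulate variable network latency
    return [f"page_{page}_name_{n}" for n in range(1, 4)]


def scrape_in_order(pages, max_workers=5):
    """Fetch pages concurrently, but return results in the order of `pages`."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        # executor.map yields results in input order, whichever thread finishes first
        per_page = list(executor.map(scrape_page, pages))
    return [name for names in per_page for name in names]


if __name__ == "__main__":
    for name in scrape_in_order(range(1, 4)):
        print(name)
```

The pages are still fetched in parallel; only the collection step waits for earlier pages before handing out later ones.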
I've created a simple python program that scrapes my favorite recipe website and returns the individual recipe URLs from the main site. While this is a relatively quick and simple process, I've tried scaling this out to scrape multiple webpages within the site. When I do this, it takes about 45 seconds to scrape all of the recipe URLs from the whole site. I'd like this process to be much quicker so I tried implementing threads into my program.
I realize there is something wrong here, as each thread returns the whole URL list over and over again instead of 'splitting up' the work. Does anyone have any suggestions on how to better implement the threads? I've included my work below. Using Python 3.
from bs4 import BeautifulSoup
import urllib.request
from urllib.request import urlopen
from datetime import datetime
import threading
startTime = datetime.now()
quote_page='http://thepioneerwoman.com/cooking_cat/all-pw-recipes/'
page = urllib.request.urlopen(quote_page)
soup = BeautifulSoup(page, 'html.parser')
all_recipe_links = []
#get all recipe links on current page
def get_recipe_links():
    for link in soup.find_all('a', attrs={'post-card-permalink'}):
        if link.has_attr('href'):
            if 'cooking/' in link.attrs['href']:
                all_recipe_links.append(link.attrs['href'])
    print(datetime.now() - startTime)
    return all_recipe_links
def worker():
    """thread worker function"""
    print(get_recipe_links())
    return
threads = []
for i in range(5):
    t = threading.Thread(target=worker)
    threads.append(t)
    t.start()
I was able to distribute the work to the workers by having the workers all process data from a single list, instead of having them all run the whole method individually. Below are the parts that I changed. The method get_recipe_links is no longer needed, since its tasks have been moved to other methods.
all_recipe_links = []
links_to_process = []
def worker():
    """thread worker function"""
    while True:
        try:
            link = links_to_process.pop()
        except IndexError:
            return  # another worker took the last item; nothing left to do
        if link.has_attr('href'):
            if 'cooking/' in link.attrs['href']:
                all_recipe_links.append(link.attrs['href'])
threads = []
links_to_process = soup.find_all('a', attrs={'post-card-permalink'})
for i in range(5):
    t = threading.Thread(target=worker)
    threads.append(t)
    t.start()
# wait for the workers to finish instead of busy-waiting on the list
for t in threads:
    t.join()
print(all_recipe_links)
I ran the new methods several times, and on average it takes .02 seconds to run this.
import threading
import urllib2
import time
import webapp2
import main
start = time.time()
url = "http://exmple.com?phone="
class BatchSuscriber(webapp2.RequestHandler):
    def get(self):
        template = main.JINJA_ENVIRONMENT.get_template('batch.html')
        self.response.out.write(template.render())
    def post(self):
        address = self.request.get('address')
        numbers = str(self.request.get('numbers')).split(',')
        threads = [threading.Thread(target=self.fetch_url, args=(phone,)) for phone in numbers]
        for thread in threads:
            thread.start()
        for thread in threads:
            thread.join()
        self.response.write("Elapsed Time: %s" % (time.time() - start))
        self.response.write("<br>")
    def fetch_url(self,phone):
        urlHandler = urllib2.urlopen(url+phone)
        html = urlHandler.read()
        self.response.write(html)
        self.response.write("<br>")
        self.response.write("'%s\' fetched in %ss" % (url+phone, (time.time() - start)))
        self.response.write("<br>")
I'm trying to use the above code to make urlfetch calls asynchronously. From my log, it seems the calls are actually made serially instead of in parallel. How can I achieve this in GAE? Thanks.
Trying to use threads is entirely the wrong approach here. GAE already includes an asynchronous requests service in google.appengine.api.urlfetch; you should use that.
I have Python code that I am developing for a website that, among other things, creates an Excel sheet and then converts it into a JSON file. I need this code to run continuously unless it is killed by the website administrator.
To this end, I am using APScheduler.
The code runs perfectly without APScheduler, but when I attempt to add the rest of the code one of two things happens: 1) it runs forever and will not stop despite Ctrl+C, so I need to stop it using Task Manager, or 2) it only runs once and then stops.
Code that doesn't stop:
from apscheduler.scheduler import Scheduler
import logging
import time
logging.basicConfig()
sched = Scheduler()
sched.start()
(...)
code to make excel sheet and json file
(...)
@sched.interval_schedule(seconds = 15)
def job():
    excelapi_final()
while True:
    time.sleep(10)
sched.shutdown(wait=False)
Code that stops running after one time:
from apscheduler.scheduler import Scheduler
import logging
import time
logging.basicConfig()
sched = Scheduler()
(...)
#create excel sheet and json file
(...)
@sched.interval_schedule(seconds = 15)
def job():
    excelapi_final()
sched.start()
while True:
    time.sleep(10)
sched.shutdown(wait=False)
I understand from other questions, a few tutorials and the documentation that sched.shutdown should allow for the code to be killed by ctrl+C - however that is not working. Any ideas? Thanks in advance!
You could use the standalone mode:
sched = Scheduler(standalone=True)
and then start the scheduler like this:
try:
    sched.start()
except (KeyboardInterrupt):
    logger.debug('Got SIGINT! Terminating...')
Your corrected code should look like this:
from apscheduler.scheduler import Scheduler
import logging
import time
logging.basicConfig()
logger = logging.getLogger(__name__)  # the handler below needs a logger object
sched = Scheduler(standalone=True)
(...)
code to make excel sheet and json file
(...)
@sched.interval_schedule(seconds = 15)
def job():
    excelapi_final()
try:
    sched.start()
except (KeyboardInterrupt):
    logger.debug('Got SIGINT! Terminating...')
This way the program will stop when Ctrl-C is pressed.
You can gracefully shut it down:
import signal
from apscheduler.scheduler import Scheduler
import logging
import time
logging.basicConfig()
sched = Scheduler()
(...)
#create excel sheet and json file
(...)
@sched.interval_schedule(seconds = 15)
def job():
    excelapi_final()
sched.start()
def gracefully_exit(signum, frame):
    print('Stopping...')
    sched.shutdown()
signal.signal(signal.SIGINT, gracefully_exit)
signal.signal(signal.SIGTERM, gracefully_exit)
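Signal handlers like these can only run while the process is still alive, so the main thread needs something to wait on. Below is a minimal stdlib-only sketch of that pattern. It deliberately does not use APScheduler, and stop_event, request_stop, install_handlers and run_until_stopped are hypothetical names. A threading.Event takes the place of the while True / time.sleep loop so that the wait ends as soon as a signal arrives.

```python
import signal
import threading

# Set by the signal handler; the main loop checks it between runs.
stop_event = threading.Event()


def request_stop(signum, frame):
    """Signal handler: ask the main loop to finish."""
    stop_event.set()


def install_handlers():
    """Register the handler for Ctrl-C (SIGINT) and SIGTERM. Call from the main thread."""
    signal.signal(signal.SIGINT, request_stop)
    signal.signal(signal.SIGTERM, request_stop)


def run_until_stopped(job, interval_seconds):
    """Call job() every interval_seconds until a stop is requested; return the run count."""
    runs = 0
    while not stop_event.is_set():
        job()
        runs += 1
        # wait() returns early once the event is set, unlike time.sleep()
        stop_event.wait(interval_seconds)
    return runs
```

A script would call install_handlers() once and then, for example, run_until_stopped(excelapi_final, 15); pressing Ctrl-C sets the event and the loop returns after the current run.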