I've written a couple of twitter scrapers in python, and am writing another script to keep them running even if they suffer a timeout, disconnection, etc.
My current solution is as follows:
Each scraper file has a doScrape/1 function in it, which will start up a scraper and run it once, eg:
def doScrape(logger):
try:
with DBWriter(logger=logger) as db:
logger.log_info("starting", __name__)
s = PastScraper(db.getKeywords(), TwitterAuth(), db, logger)
s.run()
finally:
logger.log_info("Done", __name__)
Where run is a near-infinite loop, which won't break unless there is an exception.
In order to run one of each kind of scraper at once, I'm using this code (with a few extra imports):
from threading import Thread
class ScraperThread(Thread):
def __init__(self, module, logger):
super(ScraperThread, self).__init__()
self.module = module # Module should contain a doScrape(logger) function
self.logger = logger
def run(self):
while True:
try:
print "Starting!"
print self.module.doScrape
self.module.doScrape(self.logger)
except: # if for any reason we get disconnected, reconnect
self.logger.log_debug("Restarting scraper", __name__)
if __name__ == "__main__":
with Logger(level="all", handle=open(sys.argv[1], "a")) as l:
past = ScraperThread(PastScraper, l)
stream = ScraperThread(StreamScraper, l)
past.start()
stream.start()
past.join()
stream.join()
However, it appears that my call of doScrape from above is returning immediately, hence "Starting!" is printed in the console repeatedly, and that "Done" message in the finally block is not written to the log, whereas when run individually like so:
if __name__ == "__main__":
# Example instantiation
from Scrapers.Logging import Logger
with Logger(level="all", handle=open(sys.argv[1], "a")) as l:
doScrape(l)
The code runs forever, as expected. I'm a bit stumped.
Is there anything silly that I might have missed?
get rid of the diaper pattern in your run() method, as in: get rid of that catch-all exception handler. You'll probably get the error printed there then. I think there may be something wrong in the DBWriter or other code you're calling from your doScrape function. Perhaps it is not thread-safe. That would explain why running it from the main program directly works, but calling it from a thread fails.
Aha, solved it! It was actually that I didn't realise that a default argument (here in TwitterAuth()) is evaluated at definition time. TwitterAuth reads the API key settings from a file handle, and the default argument opens up the default config file. Since this file handle is generated at definition time, both threads had the same handle, and once one had read it, the other one tried to read from the end of the file, throwing an exception. This is remedied by resetting the file before use, and using a mutex.
Cheers to Irmen de Jong for pointing me in the right direction.
Related
I am solving this question from LeetCode: 1116. Print Zero Even Odd
I am running this solution in VS Code with my own main function to understand the issue in depth.
After reading this question and the suggested solutions. In addition to reading this explanation.
I added this code to the code from the solution:
from threading import Semaphore
import threading
def threaded(fn):
def wrapper(*args, **kwargs):
threading.Thread(target=fn, args=args, kwargs=kwargs).start()
return wrapper
and before those functions from the question I added: #threaded
I added a printNumber function and main function to run it on VS Code.
def printNumber(num):
print(num, end="")
if __name__ == "__main__":
a = ZeroEvenOdd(7)
handle = a.zero(printNumber)
handle = a.even(printNumber)
handle = a.odd(printNumber)
Running this code gives me a correct answer but I do not get a new line printed in the terminal after that, I mean for input 7 in my main function, the output is: 01020304050607hostname and not what I want it to be:
01020304050607
hostname
So, I added print("\n") in the main and I saw that I get a random output like:
0102
0304050607
or
0
1020304050607
still without a new line in the end.
When I try to use the join function handle.join() then I get the error:
Exception has occurred: AttributeError 'NoneType' object has no
attribute 'join'
I tried to do this:
handle1 = a.zero(printNumber)
handle2 = a.even(printNumber)
handle3 = a.odd(printNumber)
handle1.join()
handle2.join()
handle3.join()
Still got the same error.
Where in the code should I do the waiting until the threads will terminate?
Thanks.
When I try to use...handle.join()...I get the error: "...'NoneType' object has no attribute, 'join'
The error message means that the value of handle was None at the point in your program where your code tried to call handle.join(). There is no join() operation available on the None value.
You probably wanted to join() a thread (i.e., the object returned by Threading.thread(...). For a single thread, you could do this:
t = Threading.thread(...)
t.start()
...
t.join()
Your program creates three threads, so you won't be able to just use a single variable t. You could use three separate variables, or you could create a list, or... I'll leave that up to you.
When i run the following code (using "sudo python servers.py") the process seem to just finish immediately with just printing "test".
why doesn't the functions "proxy_server" won't run ? or maybe they do but i do not realize that. (because the first line in proxy function doesn't print anything)
this is an impotent code, i didn't want to put unnecessary content, yet it still demonstrate my problem:
import os,sys,thread,socket,select,struct,time
HTTP_PORT = 80
FTP_PORT=21
FTP_DATA_PORT = 20
IP_IN = '10.0.1.3'
IP_OUT = '10.0.3.3'
sys_http = 'http_proxy'
sys_ftp = 'ftp_proxy'
sys_ftp_data = 'ftp_data_proxy'
def main():
try:
thread.start_new_thread(proxy_server, (HTTP_PORT, IP_IN,sys_http,http_handler))
thread.start_new_thread(proxy_server, (FTP_PORT, IP_IN,sys_ftp,http_handler))
thread.start_new_thread(proxy_server, (FTP_DATA_PORT, IP_OUT,sys_ftp_data,http_handler))
print "test"
except e:
print 'Error!'
sys.exit(1)
def proxy_server(host,port,fileName,handler):
print "Proxy Server Running on ",host,":",port
def http_handler(src,sock):
return ''
if __name__ == '__main__':
main()
What am i missing or doing wrong ?
First, you have indentation problems related to using mixed tabs and spaces for indentation. While they didn't cause your code to misbehave in this particular case, they will cause you problems later if you don't stick to consistently using one or the other. They've already broken the displayed indentation in your question; see the print "test" line in main, which looks misaligned.
Second, instead of the low-level thread module, you should be using threading. Your problem is occurring because, as documented in the thread module documentation,
When the main thread exits, it is system defined whether the other threads survive. On SGI IRIX using the native thread implementation, they survive. On most other systems, they are killed without executing try ... finally clauses or executing object destructors.
threading threads let you explicitly define whether other threads should survive the death of the main thread, and default to surviving. In general, threading is much easier to use correctly.
I currently have a process running that should call a method every 10 seconds. I see that it actually calls the method at that interval, but it seems to not execute something in the code. Weird thing is, is that when I cancel the loop, and start it new it does actually do it the first time. Then when I keep it running it does not do anything.
def main():
try:
while True:
read()
time.sleep(10)
except KeyboardInterrupt:
pass
Above is the loop, and the code here is actually the beginning of the method that is being called, and I found out that it does not actually get results in the results, while the file has changed. In this case it gets data from a .json file
def read():
message = Query()
results = DB.search(message.pushed == False)
Am I overlooking something?
Solved. I had the DB declared globally and that did not go so well. It is being fixed by declaring it just before the statement.
I have a python script I've written which uses atexit.register() to run a function to persist a list of dictionaries when the program exits. However, this code is also running when the script exits due to a crash or runtime error. Usually, this results in the data becoming corrupted.
Is there any way to block it from running when the program exits abnormally?
EDIT: To clarify, this involves a program using flask, and I'm trying to prevent the data persistence code from running on an exit that results from an error being raised.
You don't want to use atexit with Flask. You want to use Flask signals. It sounds like you are specifically looking for the request_finished signal.
from flask import request_finished
def request_finished_handler(sender, response, **extra):
sender.logger.debug('Request context is about to close down. '
'Response: %s', response)
# do some fancy storage stuff.
request_finished.connect(request_finished_handler, app)
The benefit of request_finished is that it only fires after a successful response. That means that so long as there isn't an error in another signal, you should be good.
One way: at global level in main program:
abormal_termination = False
def your_cleanup_function():
# Add next two lines at the top
if abnormal_termination:
return
# ...
# At end of main program:
try:
# your original code goes here
except Exception: # replace according to what *you* consider "abnormal"
abnormal_termination = True # stop atexit handler
Not pretty, but straightforward ;-)
I'm writing a program that adds normal UNIX accounts (i.e. modifying /etc/passwd, /etc/group, and /etc/shadow) according to our corp's policy. It also does some slightly fancy stuff like sending an email to the user.
I've got all the code working, but there are three pieces of code that are very critical, which update the three files above. The code is already fairly robust because it locks those files (ex. /etc/passwd.lock), writes to to a temporary files (ex. /etc/passwd.tmp), and then, overwrites the original file with the temporary. I'm fairly pleased that it won't interefere with other running versions of my program or the system useradd, usermod, passwd, etc. programs.
The thing that I'm most worried about is a stray ctrl+c, ctrl+d, or kill command in the middle of these sections. This has led me to the signal module, which seems to do precisely what I want: ignore certain signals during the "critical" region.
I'm using an older version of Python, which doesn't have signal.SIG_IGN, so I have an awesome "pass" function:
def passer(*a):
pass
The problem that I'm seeing is that signal handlers don't work the way that I expect.
Given the following test code:
def passer(a=None, b=None):
pass
def signalhander(enable):
signallist = (signal.SIGINT, signal.SIGQUIT, signal.SIGABRT, signal.SIGPIPE, signal.SIGALRM, signal.SIGTERM, signal.SIGKILL)
if enable:
for i in signallist:
signal.signal(i, passer)
else:
for i in signallist:
signal.signal(i, abort)
return
def abort(a=None, b=None):
sys.exit('\nAccount was not created.\n')
return
signalhander(True)
print('Enabled')
time.sleep(10) # ^C during this sleep
The problem with this code is that a ^C (SIGINT) during the time.sleep(10) call causes that function to stop, and then, my signal handler takes over as desired. However, that doesn't solve my "critical" region problem above because I can't tolerate whatever statement encounters the signal to fail.
I need some sort of signal handler that will just completely ignore SIGINT and SIGQUIT.
The Fedora/RH command "yum" is written is Python and does basically exactly what I want. If you do a ^C while it's installing anything, it will print a message like "Press ^C within two seconds to force kill." Otherwise, the ^C is ignored. I don't really care about the two second warning since my program completes in a fraction of a second.
Could someone help me implement a signal handler for CPython 2.3 that doesn't cause the current statement/function to cancel before the signal is ignored?
As always, thanks in advance.
Edit: After S.Lott's answer, I've decided to abandon the signal module.
I'm just going to go back to try: except: blocks. Looking at my code there are two things that happen for each critical region that cannot be aborted: overwriting file with file.tmp and removing the lock once finished (or other tools will be unable to modify the file, until it is manually removed). I've put each of those in their own function inside a try: block, and the except: simply calls the function again. That way the function will just re-call itself in the event of KeyBoardInterrupt or EOFError, until the critical code is completed.
I don't think that I can get into too much trouble since I'm only catching user provided exit commands, and even then, only for two to three lines of code. Theoretically, if those exceptions could be raised fast enough, I suppose I could get the "maximum reccurrsion depth exceded" error, but that would seem far out.
Any other concerns?
Pesudo-code:
def criticalRemoveLock(file):
try:
if os.path.isFile(file):
os.remove(file)
else:
return True
except (KeyboardInterrupt, EOFError):
return criticalRemoveLock(file)
def criticalOverwrite(tmp, file):
try:
if os.path.isFile(tmp):
shutil.copy2(tmp, file)
os.remove(tmp)
else:
return True
except (KeyboardInterrupt, EOFError):
return criticalOverwrite(tmp, file)
There is no real way to make your script really save. Of course you can ignore signals and catch a keyboard interrupt using try: except: but it is up to your application to be idempotent against such interrupts and it must be able to resume operations after dealing with an interrupt at some kind of savepoint.
The only thing that you can really to is to work on temporary files (and not original files) and move them after doing the work into the final destination. I think such file operations are supposed to be "atomic" from the filesystem prospective. Otherwise in case of an interrupt: restart your processing from start with clean data.