I have a function that extracts data, like this:
@controller.route('/scrap_all_categories')
def scrap_all_categories():
    result = ETL_jobs.extract_all_category(url, conn)
    return result
extract_all_category is a function that extracts data from a website using BeautifulSoup. It takes about 30 minutes to finish and runs until the job is done, with no way to interrupt it.
I need to create a function to cancel this process. Which functions in Python can I use to interrupt this job?
Thanks.
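One approach (a minimal sketch, not from the original code: the cancel route and the module-level job handle are assumptions) is to run the extraction in a separate process with multiprocessing and terminate it on demand:

from multiprocessing import Process

running_job = None  # handle to the currently running extraction, if any

@controller.route('/scrap_all_categories')
def scrap_all_categories():
    global running_job
    # run the long extraction in its own process so this request returns immediately
    running_job = Process(target=ETL_jobs.extract_all_category, args=(url, conn))
    running_job.start()
    return "job started"

@controller.route('/cancel_scrap')  # hypothetical cancel endpoint
def cancel_scrap():
    global running_job
    if running_job is not None and running_job.is_alive():
        running_job.terminate()  # kills the worker process mid-job
        return "job cancelled"
    return "no job running"

Note that objects such as a database connection may not survive being passed to a child process, so it can be safer to open the connection inside extract_all_category itself.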
I have an AWS Lambda function that gets invoked from another function. The first function processes the data and, when it is finished, invokes the second function n times; those n invocations should run at the same time.
For example, the second function takes about 5 seconds per invocation; I want all of the invocations to run at the same time, for a total run time of about 5 seconds.
Instead, each invocation runs one at a time, waiting until the prior one has finished, so the whole process takes 5*n seconds.
I see that AWS says I can scale the function up to 1,000 concurrent executions in my region. How can I make this run concurrently? I don't need a code example, just a general process I can look into to fix the problem.
The first function's handler looks like this (I have left out the other code that builds json_file):
def lambda_handler(event=None, context=None):
    for n in range(len(json_file)):
        response = client.invoke(
            FunctionName='docker-selenium-lambda-prod-demo',
            InvocationType='RequestResponse',
            Payload=json.dumps(json_file[n])
        )
        responseJson = json.load(response['Payload'])
where json_file[n] is being sent to the other function to run.
As you can see in the boto3 docs for the invoke function:
Invokes a Lambda function. You can invoke a function synchronously (and wait for the response), or asynchronously. To invoke a function asynchronously, set InvocationType to Event.
If you are using RequestResponse, your code will wait until the Lambda you called has finished.
You can either change InvocationType to Event, or use something like ThreadPoolExecutor and wait until all executions are finished.
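For instance, a minimal sketch of the ThreadPoolExecutor approach, reusing the client, function name, and json_file from the question (invoke_one is a helper introduced here for illustration):

import json
from concurrent.futures import ThreadPoolExecutor

def invoke_one(payload):
    # each call is still a synchronous RequestResponse invoke,
    # but many of them run in parallel threads
    response = client.invoke(
        FunctionName='docker-selenium-lambda-prod-demo',
        InvocationType='RequestResponse',
        Payload=json.dumps(payload)
    )
    return json.load(response['Payload'])

def lambda_handler(event=None, context=None):
    # one thread per payload, capped at 50 to stay well under the account concurrency limit
    with ThreadPoolExecutor(max_workers=min(len(json_file), 50)) as executor:
        results = list(executor.map(invoke_one, json_file))
    return results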
I'm creating a script that scrapes data from sites. I have at least 10 sites to scrape. Each site is one .ipynb file (which I then convert to .py to execute). It can happen that a site changes, so the scraping code for that site would need to be changed.
I have the following:
def ex_scrape_site1():
    %run "scrape\\scrape_site1.py"
def ex_scrape_site2():
    %run "scrape\\scrape_site2.py"
def ex_scrape_site3():
    %run "scrape\\scrape_site3.py"
.
.
.
(10 so far)
I'm currently using a list of all the functions and then doing a for loop over the list to generate one thread per function, like this:
funcs = [ex_scrape_site1, ex_scrape_site2, ex_scrape_site3]
Then I'm executing them with the following:
from threading import Thread

while True:
    threads = []
    for func in funcs:
        threads.append(Thread(target=func))
    [thread.start() for thread in threads]  # start threads
    [thread.join() for thread in threads]   # wait for all to complete
So here it's executing all the functions in parallel, which is OK. However, if one crashes I have to stop everything and fix the error.
Is there a way to:
If something breaks in one of the scraping functions, I want to be able to fix that function while all the others keep running.
Since I'm using join(), I have to wait until all the scrapes finish before the loop iterates again. How could I run each function on its own cycle, without waiting for all of the others to finish before it starts again?
I thought of using Airflow; do you think it could make sense to implement it here?
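One way to decouple the scrapers (a sketch, not from the original post; run_forever and the retry delay are illustrative) is to give each function its own endless loop in its own thread, with a try/except so a crash in one scraper is logged while the others keep running:

import time
import traceback
from threading import Thread

def run_forever(func, delay=60):
    # each scraper cycles on its own; a crash is logged and retried later,
    # and the other threads are completely unaffected
    while True:
        try:
            func()
        except Exception:
            traceback.print_exc()
        time.sleep(delay)

threads = [Thread(target=run_forever, args=(func,), daemon=True) for func in funcs]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()  # the main thread just keeps the program alive

A scheduler such as Airflow would give you similar isolation (one task per site, with retries and alerting) at the cost of running extra infrastructure.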
I am using the threading module and have 3 different functions that return the same value but use different methods to obtain it.
I want an ID, which I will call my_id.
For example:
Function #1: Scrape the website using a mobile endpoint and parse the JSON for my_id
Function #2: Scrape the website using a desktop endpoint and parse the JSON for my_id
Function #3: Scrape the desktop website HTML and find my_id
What I would like to do is run all the functions at the same time, and whichever one returns my_id first, take that value and continue with my code.
What is the best way to go about this?
You can make use of concurrent.futures.
Create three threads and launch them using concurrent.futures.Executor.submit().
This returns a Future object for each of the threads.
Then you can
concurrent.futures.wait(fs, timeout=None, return_when=concurrent.futures.FIRST_COMPLETED)
which will block the main thread until one of the 3 child threads completes.
Then you can go ahead and use your result.
concurrent.futures.wait returns a named 2-tuple of sets: the first set, named done, contains the futures that completed, and the second, named not_done, contains those that did not.
You can get your result from the completed Future using its result() method, and you can safely shut down the executor using Executor.shutdown().
You can add your objects to a list, and start them like:
futures = []
for task in task_list:
    futures.append(executor.submit(task.run))
concurrent.futures.wait(futures, timeout=None, return_when=concurrent.futures.FIRST_COMPLETED)
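Put together for the original question, a minimal sketch might look like this (the three scraper functions are placeholders standing in for the real ones):

import concurrent.futures

def from_mobile_api():
    return "id-from-mobile-json"    # placeholder: parse the mobile endpoint's JSON

def from_desktop_api():
    return "id-from-desktop-json"   # placeholder: parse the desktop endpoint's JSON

def from_desktop_html():
    return "id-from-desktop-html"   # placeholder: scrape the desktop HTML

executor = concurrent.futures.ThreadPoolExecutor(max_workers=3)
futures = [executor.submit(f) for f in (from_mobile_api, from_desktop_api, from_desktop_html)]

done, not_done = concurrent.futures.wait(
    futures, return_when=concurrent.futures.FIRST_COMPLETED
)
my_id = done.pop().result()     # whichever function finished first
executor.shutdown(wait=False)   # don't wait for the slower threads
print(my_id)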
I am currently running a piece of Python code on a cluster. Part of the rules enforced on me by Slurm is that there is a time limit on the wall-clock run time of my code. This isn't really a problem most of the time, as I can simply checkpoint my code using pickle and then restart it.
At the end of the code, however, I need to write out all my data (I can't write until all calculations have finished), which can take some time because very large amounts of data may have been gathered.
My problem is that in some cases the code gets terminated by Slurm because it exceeded its run-time allowance.
Is there some way of interrupting a write operation, stopping the code and then restarting where I left off?
Assuming you put your data in a list or tuple, perhaps a generator function?
# Create a generator function
def Generator():
    data = ['line1', 'line2', 'line3', 'line4']
    for i in data:
        yield i

output = Generator()  # create the generator
.......
......
if [time condition is true]:
    file = open("myfile", "a")
    file.write(str(next(output)))
else:
    [do something]
You can also use try/except to capture the exception and restart your main function:
try:
    MainFunction()  # main function with generator next() calls
except [your expected Error]:
    MainFunction()  # restart the main function
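To make the restart genuinely resume where the write left off, one option (a sketch under the assumption that the results are an indexable list; the checkpoint file name is made up here) is to persist the index of the last line written, so a relaunched job skips what is already on disk:

import os
import pickle

CHECKPOINT = "write_progress.pkl"  # hypothetical file recording how far the write got

def write_results(results, outfile="myfile"):
    # resume from the last checkpointed position, if any
    start = 0
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT, "rb") as f:
            start = pickle.load(f)

    with open(outfile, "a") as out:
        for i in range(start, len(results)):
            out.write(str(results[i]) + "\n")
            with open(CHECKPOINT, "wb") as f:
                pickle.dump(i + 1, f)  # record progress after every line

If Slurm kills the job in the middle of the loop, the next run starts from the recorded index instead of rewriting everything.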
I'm trying to accomplish something without using threading.
I'd like to execute a function within another function, but I don't want the first function's flow to stop. It's just a procedure: I don't expect any return value, and I need the caller to keep executing for other reasons.
Here is a code snippet of what I'd like to do:
def foo():
    a = 5
    dosomething()
    # I don't want to wait until dosomething finishes. Just call it and move on.
    return a
Is there any way to do this?
Thanks in advance.
You can use concurrent.futures (https://docs.python.org/3/library/concurrent.futures.html) to achieve fire-and-forget behavior.
from concurrent.futures import ThreadPoolExecutor

# keep the executor outside foo(): a `with` block would call shutdown(wait=True)
# on exit and make foo() wait for dosomething() after all
executor = ThreadPoolExecutor(max_workers=1)

def foo():
    a = 5
    future = executor.submit(dosomething)
    future.add_done_callback(on_something_done)
    # print(future.result())  # would block until dosomething() is done
    # future.cancel()         # cancels dosomething (only if it has not started yet)
    # future.done()           # returns True if done
    # continue without waiting for dosomething()
    return a

def on_something_done(future):
    print(future.result())
[updates]
concurrent.futures has been built in since Python 3.2.
For Python 2.x you can install the futures 2.1.6 backport from PyPI.
Python executes synchronously by default; you'll have to use asynchronous processing to accomplish this.
While there are many ways to execute a function asynchronously, one way is to use python-rq. Python-rq lets you queue jobs for background processing by workers. It is backed by Redis and is designed to have a low barrier to entry, so it should integrate easily into your web stack.
For example:
from rq import Queue, use_connection

def foo():
    use_connection()
    q = Queue()
    # do some things
    a = 5
    # now process something else asynchronously
    q.enqueue(do_something)
    # do more here
    return a
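For the enqueued job to actually run, an rq worker process has to be started separately. A sketch of the worker side, where do_something lives in a hypothetical tasks.py importable by both the web app and the worker:

# tasks.py (hypothetical module shared by the web app and the worker)
def do_something():
    # the slow background work; this runs inside the rq worker process,
    # not inside the web request, so foo() returns immediately
    print("doing the slow work")

# In a separate shell, start a worker that pulls jobs off the Redis queue:
#   $ rq worker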