Scraping a website using Celery - Python

Currently, my stack is Flask, Redis, RabbitMQ, and Celery. For the scraping I am using requests and BeautifulSoup.
My Flask app runs in production on Apache with WSGI, with app.run(threaded=True).
I have 25 API calls: 10 scrape the URL itself (headers, etc.), and the rest use a third-party API for that URL.
I am using a chord to process these calls and fetch the data with requests.
For the chord header I have 3 workers, while the callback has only 1.
I am hitting a bottleneck with ConnectTimeoutError and MaxRetryError. A thread I read suggested setting a timeout for every process, because these errors mean you are overloading the remote server.
The problem is that since I am using a chord, a time.sleep makes no sense: all 25 API calls run at the same time. Has anyone encountered this, or am I doing this wrong?
The thread I read seems to suggest switching from requests to pycurl or using Scrapy, but I don't think that's the issue, since ConnectTimeoutError means my host is overloading a specific URL's server.
My chord process:
callback = create_document.s(url, company_logo, api_list)
header = [api_request.s(key) for key in api_list.keys()]
result = chord(header)(callback)
In the api_request task, requests is used.
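For illustration, here is a minimal sketch of how api_request could set an explicit timeout on the HTTP call and retry with a backoff instead of sleeping inside the chord. The broker URL, rate limit, and retry settings below are assumptions, not the code from the question:

import requests
from celery import Celery

app = Celery("scraper", broker="amqp://localhost")  # assumed broker URL

@app.task(bind=True, max_retries=3, rate_limit="5/s")  # rate_limit caps calls per second per worker
def api_request(self, key):
    # 'key' is assumed to resolve to a URL here
    try:
        # an explicit (connect, read) timeout keeps the task from hanging forever
        response = requests.get(key, timeout=(5, 30))
        response.raise_for_status()
        return response.text
    except requests.exceptions.RequestException as exc:
        # back off before retrying so the remote server is not hammered
        raise self.retry(exc=exc, countdown=2 ** self.request.retries)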

If you want to limit the number of scrapes running at the same time, you can create an enqueue task that checks whether another task with the same properties as the one you want to run is already running. If one is running, have the enqueue task sleep for a few seconds and check again; when it sees that none is running, queue the task you actually want. This lets you use sleeps with asynchronous tasks. You can also count the running tasks and only allow a certain number at a time: run 5 at once, see if that is throttled enough, then queue another whenever one finishes.
EDIT: Documentation for Celery Inspect
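A rough sketch of that gatekeeper pattern, using Celery's inspect API; the task module, broker URL, and limit of 5 are illustrative rather than taken from the question:

import time
from celery import Celery
from tasks import api_request  # the scraping task from the question (module name assumed)

app = Celery("scraper", broker="amqp://localhost")  # assumed broker URL
LIMIT = 5  # illustrative cap on concurrent scrapes

@app.task
def enqueue_scrape(url):
    # Ask all workers what they are executing right now.
    active = app.control.inspect().active() or {}
    while True:
        running = [
            task for tasks in active.values() for task in tasks
            if task["name"].endswith("api_request")
        ]
        if len(running) < LIMIT:
            break
        time.sleep(2)  # wait a few seconds, then look again
        active = app.control.inspect().active() or {}
    # Below the limit: queue the real scrape.
    api_request.delay(url)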

Related

Listening for Multiple Requests at the Same Time (Android to Python Script)

Just point me in the right direction, please. I have no idea how to deal with this.
I want to connect my scraping script, written in Python, to a front-end Android app. I have already written the script, and the front-end is ready as well. However, I don't know how these two things should communicate with each other, with the script constantly listening for requests from the Android app (through Firebase, maybe?).
There is one more thing: since multiple users would use the app at the same time, there will be parallel requests coming from the app as well. How do I let the script process requests concurrently without waiting for the first one to be completed? All the scraping is done through the requests library. I researched a bit and found some hints related to threading, queues, async, etc.
Kindly tell me which way to go.
In the given scenario, you can put the script in an Azure Python Function and call it whenever required. However, as you mentioned, there will be multiple parallel requests, which can be a problem because of the single-threaded architecture of the Python worker.
How to handle such scenarios is documented in our Python Functions developer reference: functions-reference-python (also check asyncio-eventloop).
Here are methods to handle this:
Use async calls.
Add more language worker processes per host; this can be done with the application setting FUNCTIONS_WORKER_PROCESS_COUNT, up to a maximum value of 10.
(Please note that each new language worker is spawned every 10 seconds until they are warm.)
Here is a GitHub issue that discusses this in detail: https://github.com/Azure/azure-functions-python-worker/issues/236
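As a small illustration of the "use async calls" option, here is a sketch of an async HTTP-triggered function. It assumes the scrape itself can be awaited (aiohttp stands in for requests here), and the function layout and parameter names are illustrative:

# __init__.py of a hypothetical HTTP-triggered function
import aiohttp
import azure.functions as func

async def scrape(url: str) -> str:
    # awaiting the HTTP call frees the single worker thread for other requests
    async with aiohttp.ClientSession() as session:
        async with session.get(url, timeout=aiohttp.ClientTimeout(total=30)) as resp:
            return await resp.text()

async def main(req: func.HttpRequest) -> func.HttpResponse:
    url = req.params.get("url")
    if not url:
        return func.HttpResponse("missing url parameter", status_code=400)
    html = await scrape(url)
    return func.HttpResponse(html[:200])  # return a snippet just for illustration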

Creating queue for Flask backend that can handle multiple users

I am creating a robot with a Flask and React based interface (running on a Raspberry Pi Zero) that lets users request it to perform tasks. When a user requests a task, I want the backend to put it in a queue and have the backend constantly watch the queue, processing tasks one by one. Each task can take anywhere from 15 to 60 seconds, so they are pretty lengthy.
Currently I just do the task immediately in the same Python process that runs the Flask server. From testing locally, it seems I can open the React app in two different browsers, request tasks at the same time, and the Raspberry Pi appears to try to run them in parallel (judging from the printed logs).
What is the best way to allow multiple users to go to the front-end and queue up tasks? When multiple users open the React app, I assume they all connect to the same instance of the back-end. So is it enough to add a deque to the back-end and protect it with a mutex lock (and what is the Pythonic way to use mutexes)? Or is this too simple? Do I need some other process or method to implement the task queue (such as reading/writing an external file to act as the queue)?
In general, the most popular way to run background tasks in Python is Celery. It is a Python framework that runs in a separate process, continuously checking a queue (like Redis or AMQP) for tasks. When it finds one, it executes it and logs the result to a "result backend" (like a database, or Redis again). The Flask servers then just push tasks onto the queue.
To notify the users, you could use polling from the React app: request an update every 5 seconds until the result backend shows that the task has completed successfully. As soon as you see that, stop polling and show the user the notification.
You can easily have multiple worker processes run in parallel if the app becomes large enough to need it. In general, just remember to have every process do what it is meant to do: Flask servers should answer web requests, and Celery workers should process tasks, not the other way around.
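A minimal sketch of that layout, assuming Redis as both broker and result backend; the task name, routes, and URLs below are illustrative rather than taken from the question:

from celery import Celery
from flask import Flask, jsonify

celery_app = Celery("robot", broker="redis://localhost:6379/0",
                    backend="redis://localhost:6379/0")  # assumed broker/backend
flask_app = Flask(__name__)

@celery_app.task
def perform_robot_task(task_name):
    # ... the 15-60 second job runs here, inside the Celery worker process ...
    return task_name + " done"

@flask_app.route("/tasks/<task_name>", methods=["POST"])
def enqueue(task_name):
    result = perform_robot_task.delay(task_name)  # push onto the queue and return at once
    return jsonify({"task_id": result.id}), 202

@flask_app.route("/tasks/status/<task_id>")
def status(task_id):
    result = celery_app.AsyncResult(task_id)  # what the React app polls
    return jsonify({"state": result.state,
                    "result": result.result if result.ready() else None})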

Flask and long running tasks

I'm writing a small web server using Flask that needs to do the following things:
On the first request, serve the basic page and kick off a long (15-60 second) data processing task. The data processing task queries a second server which I do not control, updates a local database, and then performs some calculations on the results to show in the web page.
The page issues several AJAX requests that all depend on parts of the result from the long task, so I need to wait until the processing is done.
Subsequent requests for the first page would ideally re-use the previous request's result if they come in while the processing task is ongoing (or even shortly thereafter)
I tried using flask-cache (specifically SimpleCache), but ran into an issue as it seems the cache pickles the result, when I'd really rather keep the exact object.
I suppose I could re-write what I'm caching to be pickle-able, and then implement a single worker thread to do the processing.
Is there a better way of handling this kind of workflow?
I think the best way to handle long data processing is something like Celery.
Send a request to run the task and receive a task ID.
Periodically send AJAX requests to check the task's progress and receive the result once execution has finished.
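A rough sketch of that kick-off-and-poll pattern, focusing on progress reporting; Celery with a Redis broker and result backend is assumed, and the endpoint names and PROGRESS state are illustrative:

from celery import Celery
from flask import Flask, jsonify

celery_app = Celery("processing", broker="redis://localhost:6379/0",
                    backend="redis://localhost:6379/0")  # assumed broker/backend
app = Flask(__name__)

@celery_app.task(bind=True)
def process_data(self):
    for step in range(5):
        # ... query the second server, update the local database ...
        self.update_state(state="PROGRESS", meta={"step": step})
    return {"summary": "calculated results"}

@app.route("/start")
def start():
    task = process_data.delay()
    return jsonify({"task_id": task.id})  # the page keeps this ID for its AJAX calls

@app.route("/progress/<task_id>")
def progress(task_id):
    result = celery_app.AsyncResult(task_id)
    payload = {"state": result.state}
    if result.state == "PROGRESS":
        payload.update(result.info or {})  # the meta dict passed to update_state
    elif result.successful():
        payload["result"] = result.result
    return jsonify(payload)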

Tornado Subprocess confusion

I am trying to build a Tornado web server which takes requests from multiple clients. The request consists of:
a. For a given directory name passed through a URL, zip the files, etc., and FTP the result out.
b. Provide a status of sorts on whether the task has completed.
So, rather than making this a synchronous, linear process, I wanted to break it down into multiple subtasks. The client submits the URL request and simply receives a response along the lines of 'job submitted'. A bit later, the client can come back and ask for the status of this job. In the meantime the job obviously has to finish its work.
I am confused about which modules to use: Tornado Subprocess, the Popen constructor, subprocess.call, etc. I've read the Python docs but can't find anything covering a task that runs for a long time and that Tornado is not supposed to wait for. So I need a mechanism to start a job, let it run its course while releasing the client, and then report its status when the client asks.
Any help is appreciated. Thanks.
Python programmers widely use Celery to manage a queue of tasks across a set of worker processes. Set up Celery with RabbitMQ and write a Celery worker (perhaps using Celery Canvas) that does the work you need: zips a directory, FTPs it somewhere, etc.
The Tornado-Celery integration package provides something close to what you need to integrate your Tornado application with Celery.
This is all a lot of moving parts to install and configure at first, of course, but it will prepare you for a maintainable application architecture.
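For example, the worker side could look roughly like the sketch below. The broker URL, FTP host, credentials, and task name are placeholders, not part of the original answer:

import ftplib
import os
import shutil
from celery import Celery

app = Celery("jobs", broker="amqp://localhost")  # assumed RabbitMQ broker URL

@app.task
def zip_and_ftp(directory):
    # Create <directory>.zip next to the directory being archived.
    archive = shutil.make_archive(directory, "zip", directory)
    with ftplib.FTP("ftp.example.com", "user", "password") as ftp:  # placeholder host/credentials
        with open(archive, "rb") as fh:
            ftp.storbinary("STOR " + os.path.basename(archive), fh)
    return os.path.basename(archive)

The Tornado handler would then only call zip_and_ftp.delay(directory), return the task ID as the 'job submitted' response, and a second handler could report the job's status by looking that ID up in the result backend.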

Can AppEngine python threads last longer than the original request?

We're trying to use the new Python 2.7 threading ability in Google App Engine, and it seems like the created thread is getting killed before it finishes running. Our scenario:
User sends a message to the server
We update the user's data
We spawn a thread to do some more heavy duty processing
We return a response to the user before waiting for the heavy duty processing to finish
My assumption was that the thread would continue to run after the request had returned, as long as it did not exceed the total request time limit. What we're seeing, though, is that the thread is randomly killed partway through its execution. No exceptions, no errors, nothing. It just stops running.
Are threads allowed to exist after the response has been returned? This does not repro on the dev server, only on live servers.
We could of course use a task queue instead, but that's a real pain since we'd have to set up a URL for the action and serialize/deserialize the data.
The 'Sandboxing' section of this page:
http://code.google.com/appengine/docs/python/python27/using27.html#Sandboxing
indicates that threads cannot run past the end of the request.
Deferred tasks are the way to do this. You don't need a URL or serialization to use them:
from google.appengine.ext import deferred
deferred.defer(myfunction, arg1, arg2)
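As a sketch of how the thread spawn could be replaced, assuming a webapp2 handler and that the deferred builtin is enabled in app.yaml (the handler and function names below are illustrative):

from google.appengine.ext import deferred
import webapp2

def heavy_duty_processing(user_id, message):
    # Runs later on the task queue, after the original request has returned.
    pass

class MessageHandler(webapp2.RequestHandler):
    def post(self):
        user_id = self.request.get("user_id")
        message = self.request.get("message")
        # ... update the user's data here ...
        deferred.defer(heavy_duty_processing, user_id, message)
        self.response.write("ok")  # respond immediately; processing continues on the queue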
