How to restart a session after n requests have been made - python

I have a script that scrapes data from a website. The website blocks incoming requests after roughly 75 have been made to it. I found that resetting the session after 50 requests and sleeping for 30s gets around the block. Now I would like to subclass requests.Session and modify its behaviour so that it automatically resets the session when it needs to. Here is my code so far:
import time
import requests

class Session(requests.Session):
    request_count_limit = 50

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.request_count = 0

    def get(self, url, **kwargs):
        if self.request_count == self.request_count_limit:
            self = Session.restart_session()
        response = super().get(url, **kwargs)
        self.request_count += 1
        return response

    @classmethod
    def restart_session(cls):
        print('Restarting Session, Sleeping For 20 seconds...')
        time.sleep(20)
        return cls()
However, the code above doesn't work: although I am reassigning self, the object itself doesn't change, and so the request_count never resets either. Any help would be appreciated.

Assigning to self just rebinds a local variable; it has absolutely no effect outside of the method. You could try implementing __new__() instead.
Look here: Python Class: overwrite `self`
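One way around this, instead of reassigning self, is to reset the session's state in place: close the old connections, sleep, and re-run the base initialiser on the same object. A minimal sketch of that idea (the class name AutoResetSession is illustrative; the 50-request limit and 20-second sleep are taken from the question):

import time
import requests

class AutoResetSession(requests.Session):
    request_count_limit = 50

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.request_count = 0

    def _reset(self):
        print('Restarting session, sleeping for 20 seconds...')
        self.close()            # close the pooled connections
        time.sleep(20)
        super().__init__()      # re-initialise this same object in place
        self.request_count = 0

    def get(self, url, **kwargs):
        if self.request_count >= self.request_count_limit:
            self._reset()
        response = super().get(url, **kwargs)
        self.request_count += 1
        return response

Because the object identity never changes, every caller holding a reference to the session sees the reset counter.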

Related

Django with Gunicorn losing thread-local storage during request

I've set up a request-scope cache using middleware and tried to make it available from anywhere via a threading.local() variable. However, long-running request processes sometimes drop with the following error:
File "label.py", line 29, in get_from_request_cache
label = cache.labels[url]
AttributeError: '_thread._local' object has no attribute 'labels'
However, some of the items get processed correctly, and they all depend on the existence of that cache.
The cache object is initialized once and is cleared at the end of the request using the following middleware:
request_cache.py
import threading

_request_cache = threading.local()

def get_request_cache():
    return _request_cache

class RequestCacheMiddleware:
    def __init__(self, get_response):
        self.get_response = get_response

    def __call__(self, request):
        global _request_cache
        response = self.get_response(request)
        _request_cache.__dict__.clear()
        return response

    def process_exception(self, request, exception):
        _request_cache.__dict__.clear()
        return None
And the only code that accesses the cache object directly is this part:
label.py
import django.db.models
from request_cache import get_request_cache
from base import Model

class Label(Model):
    **model fields**

    @staticmethod
    def create_for_request_cache(urls):
        cache = get_request_cache()
        urls = set(urls)
        if not hasattr(cache, 'labels'):
            cache.labels = {}
        new_urls = urls.symmetric_difference(set(cache.labels.keys()))
        entries = [Label(url=url) for url in new_urls]
        if entries:
            Label.objects.bulk_create(entries)
        for entry in entries:
            cache.labels[entry.url] = entry

    @staticmethod
    def get_from_request_cache(url):
        cache = get_request_cache()
        label = cache.labels[url]
        return label
The request that crashes is split into batches in code; before each batch, new unique urls are added to the cache with the following code, and that's where the process crashes:
fill_labels.py
class LabelView(django.views.generic.base.View):
    def _fill_labels_batch(items_batch):
        VersionLabel.create_for_request_cache([item.get('url', '') for item in items_batch])
        for item in items_batch:
            **process items** - CRASHES HERE

    @transaction.atomic
    def post(self, request, subcategory):
        item_batches = **split items into batches**
        for item_batch in item_batches:
            _fill_labels_batch(item_batch)
If I understand the way Django and Gunicorn work correctly, the thread-local object should be local either to a thread (if there has been no monkey patching) or to a greenlet (if Gunicorn monkey-patches internally), and Django uses the same thread for the entire duration of a request, so thread-local storage should not change mid-request in either case. However, several requests can be processed at the same time, each request can have input data of around 200MB, and processing a request can take several hours - the last crash happened after 4 hours of processing.
What could cause the request process to lose this cache? If the cache had never been created at all, the request would crash much sooner, and I can't think of a reason for Django to change or lose a threading.local() object mid-request.
Adding an explicit monkey.patch_all() call somehow fixed the issue. The root cause remains unknown.
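For reference: with gevent workers the patch must run before anything else creates threads or thread-locals, which is why its placement matters. A minimal sketch, assuming a gevent worker class (gunicorn_config.py is an illustrative file name):

# gunicorn_config.py (illustrative file name)
from gevent import monkey
monkey.patch_all()  # must run before other imports create threads/locals

bind = '0.0.0.0:8000'
worker_class = 'gevent'

Started with gunicorn -c gunicorn_config.py, this makes threading.local() greenlet-local consistently, instead of a mix of patched and unpatched behaviour.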

How to access the instance of a class from an inner decorator class?

I have a class that handles the API calls to a server. Certain methods within the class require the user to be logged in. Since it is possible for the session to run out, I need some functionality that logs the user in again once the session has timed out. My idea was to use a decorator. If I try it like this
class Outer_Class():
    class login_required():
        def __init__(self, decorated_func):
            self.decorated_func = decorated_func

        def __call__(self, *args, **kwargs):
            try:
                response = self.decorated_func(*args, **kwargs)
            except:
                print('Session probably timed out. Logging in again ...')
                args[0]._login()
                response = self.decorated_func(*args, **kwargs)
            return response

    def __init__(self):
        self.logged_in = False
        self.url = 'something'
        self._login()

    def _login(self):
        print(f'Logging in on {self.url}!')
        self.logged_in = True

    # this method requires the user to be logged in
    @login_required
    def do_something(self, param_1):
        print('Doing something important with param_1')
        if ():  # ..this fails
            raise Exception()
I get an error: AttributeError: 'str' object has no attribute '_login'.
Why do I not get a reference to the Outer_Class instance handed over via *args? Is there another way to get a reference to the instance?
Found this answer: How to get instance given a method of the instance?, but the decorated function doesn't seem to have a reference to its own instance.
It works fine when I'm using a decorator function defined outside of the class. That solves the problem, but I'd like to know whether it is possible to solve it this way.
The problem is that the magic of passing the instance as the first hidden parameter only works for functions, not for arbitrary callables. Because your decorator replaces the method with a custom callable object that is not a function, the calling instance is never bound and is simply lost in the call. So when you call the decorated method, param_1 ends up in the position of self. You get a first exception, do_something() missing 1 required positional argument: 'param_1', fall into the except block, and there args[0]._login() raises your AttributeError because args[0] is the string you passed as param_1.
You can still tie the decorator to the class, but it must be a function to have self magic work:
class Outer_Class():
    def login_required(decorated_func):
        def inner(self, *args, **kwargs):
            print("decorated called")
            try:
                response = decorated_func(self, *args, **kwargs)
            except:
                print('Session probably timed out. Logging in again ...')
                self._login()
                response = decorated_func(self, *args, **kwargs)
            return response
        return inner

    ...

    # this method requires the user to be logged in
    @login_required
    def do_something(self, param_1):
        print('Doing something important with param_1', param_1)
        if (False):  # ..this fails
            raise Exception()
...
#this method requires the user to be logged in
#login_required
def do_something(self, param_1):
print('Doing something important with param_1', param_1)
if (False): #..this fails
raise Exception()
You can then successfully do:
>>> a = Outer_Class()
Logging in on something!
>>> a.do_something("foo")
decorated called
Doing something important with param_1 foo
You have the call
args[0]._login()
in the except block. Since args[0] is a string and strings don't have a _login method, you get the error message mentioned in the question.
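For completeness: the class-based decorator can be made to work by implementing the descriptor protocol, which is the same mechanism by which ordinary functions get self bound. A sketch of that approach (not taken from the answers above):

import functools

class login_required():
    def __init__(self, decorated_func):
        self.decorated_func = decorated_func

    def __get__(self, instance, owner):
        # Attribute access on an instance lands here; return a wrapper
        # with the instance already bound, like a normal bound method.
        return functools.partial(self.__call__, instance)

    def __call__(self, instance, *args, **kwargs):
        try:
            return self.decorated_func(instance, *args, **kwargs)
        except Exception:
            print('Session probably timed out. Logging in again ...')
            instance._login()
            return self.decorated_func(instance, *args, **kwargs)

With __get__ defined, accessing a.do_something returns a wrapper that already carries the Outer_Class instance, so args[0] is no longer the first real argument.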

block read of instance variable when trying to set it

class A(object):
    def __init__(self, cookie):
        self.__cookie = cookie

    def refresh_cookie(self):
        """This method refreshes the cookie every 10 min."""
        self.__cookie = <newcookie>

    @property
    def cookie(self):
        return self.__cookie
The problem is that the cookie value changes every 10 minutes; if some method is still holding the older cookie, its request fails. This happens when multiple threads use the same A object.
I am looking for a solution where, while the cookie value is being refreshed, no one can read it - in other words, a lock around the cookie value.
This is a job for a condition variable.
from threading import Condition

class A(object):
    def __init__(self, cookie):
        self.__cookie = cookie
        self.refreshing = Condition()

    def refresh_cookie(self):
        """This method refreshes the cookie every 10 min."""
        with self.refreshing:
            self.__cookie = <newcookie>
            self.refreshing.notify_all()

    @property
    def cookie(self):
        with self.refreshing:
            return self.__cookie
Only one thread can enter a with block governed by self.refreshing at a time. The first thread to try will succeed; the others will block until the first leaves its with block.
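A standalone sketch of that blocking behaviour (separate from the class above, with made-up values), showing that readers wait while the refresher holds the condition's lock:

import threading
import time

cond = threading.Condition()
value = 'old-cookie'

def refresher():
    global value
    with cond:               # holds the condition's underlying lock
        time.sleep(1)        # simulate a slow refresh
        value = 'new-cookie'
        cond.notify_all()

def reader(n):
    with cond:               # blocks until the refresher releases the lock
        print(f'reader {n} sees {value}')

threading.Thread(target=refresher).start()
time.sleep(0.1)              # let the refresher grab the lock first
for i in range(3):
    threading.Thread(target=reader, args=(i,)).start()

All readers print 'new-cookie': none of them could read the value mid-refresh.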

How does python inheritance work with GAE?

So, I have now wasted about two hours on a bug that I assume has to do with inheritance problems in Python and GAE.
I have two classes: BlogHandler, and its child, LoginHandler:
class BlogHandler(webapp2.RequestHandler):
    def __init__(self, request=None, response=None):
        super(BlogHandler, self).__init__(request, response)
        self.is_logged_in = False

    def initialize(self, *args, **kwargs):
        webapp2.RequestHandler.initialize(self, *args, **kwargs)
        logging.warning('Checking for cookie')
        if True:
            self.is_logged_in = True
            logging.warning('We are logged in!')
        else:
            logging.warning('We seem to be logged out')

class LoginHandler(BlogHandler):
    def get(self):
        logging.warning('Choose wisely!: + %s', self.is_logged_in)
        if self.is_logged_in:
            self.redirect(MAIN_URL)
        else:
            self.render("login.html")
Every time I get a GET request from a client, the initialize(self, *args, **kwargs) method runs in the parent, and then the get(self) method runs on the child.
Now I want the parent to share a variable with the child: the is_logged_in variable. I have to give the variable a default value, so I initialize it to False in the parent's constructor.
Then, when initialize(self, *args, **kwargs) runs, I check a condition which is always True, so it is guaranteed to set is_logged_in to True.
Back in the child, I check the value of said variable and ... it is always False. I cannot understand this bug, especially because I know I am changing the value of the variable. Here is a log:
WARNING 2014-05-09 22:50:52,062 blog.py:47] Checking for cookie
WARNING 2014-05-09 22:50:52,062 blog.py:51] We are logged in!
WARNING 2014-05-09 22:50:52,063 blog.py:116] Choose wisely!: + False
INFO 2014-05-09 22:50:52,071 module.py:639] default: "GET /blog/login HTTP/1.1" 200 795
Why is this happening? What am I not understanding?
The other answer provided fixes it for you, but doesn't explain why.
This has nothing to do with GAE, or failure in python inheritance.
In your case the __init__ method is inherited by LoginHandler and always sets is_logged_in to False after the super call. This is expected inheritance behaviour.
Your problem is that you call super in your __init__ before you set is_logged_in. Since webapp2's __init__ is what invokes your initialize method, whatever initialize sets is immediately and unconditionally overridden by the assignment that follows the super call.
Try changing:
class BlogHandler(webapp2.RequestHandler):
    def __init__(self, request=None, response=None):
        self.is_logged_in = False
        super(BlogHandler, self).__init__(request, response)

This puts the default assignment before the call to super, so the value set in initialize is no longer overwritten.
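The trap is easy to reproduce without GAE; a minimal standalone sketch (names are illustrative) of the same ordering problem:

class Handler:
    def __init__(self):
        self.initialize()          # framework-style hook, like webapp2's
        self.is_logged_in = False  # runs after the hook and clobbers it

    def initialize(self):
        self.is_logged_in = True

print(Handler().is_logged_in)      # False: the later assignment wins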

Celery creates several instances of Task

I'm creating a task (by subclassing celery.task.Task) that creates a connection to Twitter's streaming API. For the Twitter API calls, I am using tweepy. As I've read in the Celery documentation, 'a task is not instantiated for every request, but is registered in the task registry as a global instance.' I was expecting that whenever I call apply_async (or delay) for the task, I would be accessing the task that was originally instantiated, but that doesn't happen. Instead, a new instance of the custom task class is created. I need to be able to access the original custom task, since this is the only way I can terminate the original connection created by the tweepy API call.
Here's some piece of code if this would help:
from celery import registry
from celery.task import Task

class FollowAllTwitterIDs(Task):
    def __init__(self):
        # requirements for creation of the customstream
        # go here. The CustomStream class is a subclass
        # of the tweepy.streaming.Stream class
        self._customstream = CustomStream(*args, **kwargs)

    @property
    def customstream(self):
        if self._customstream:
            # terminate existing connection to Twitter
            self._customstream.running = False
            self._customstream = CustomStream(*args, **kwargs)

    def run(self):
        self._to_follow_ids = function_that_gets_list_of_ids_to_be_followed()
        self.customstream.filter(follow=self._to_follow_ids, async=False)

follow_all_twitterids = registry.tasks[FollowAllTwitterIDs.name]
And for the Django view
def connect_to_twitter(request):
    if request.method == 'POST':
        do_stuff_here()
        .
        .
        .
        follow_all_twitterids.apply_async(args=[], kwargs={})
    return
Any help would be appreciated. :D
EDIT:
For additional context for the question, the CustomStream object creates an httplib.HTTPSConnection instance whenever the filter() method is called. This connection needs to be closed whenever there is another attempt to create one. The connection is closed by setting customstream.running to False.
The task should only be instantiated once. If you think it is not, for some reason, I suggest you add

print("INSTANTIATE")
import traceback
traceback.print_stack()

to the Task.__init__ method, so you can tell where that is happening.
I think your task could be better expressed like this:
from celery.task import Task, task

class TwitterTask(Task):
    _stream = None
    abstract = True

    def __call__(self, *args, **kwargs):
        try:
            return super(TwitterTask, self).__call__(*args, **kwargs)
        finally:
            if self._stream:
                self._stream.running = False

    @property
    def stream(self):
        if self._stream is None:
            self._stream = CustomStream()
        return self._stream

@task(base=TwitterTask)
def follow_all_ids():
    ids = get_list_of_ids_to_follow()
    follow_all_ids.stream.filter(follow=ids, async=False)
