Using ETag in feedparser - python

I'm writing a Django view that gets the latest blog posts of a wordpress system.
def __get_latest_blog_posts(rss_url, limit=4):
    feed = feedparser.parse(rss_url)
    return something
I tried in a terminal to use ETags:
>>> import feedparser
>>> d = feedparser.parse("http://a real url")
>>> d.etag
u'"2ca34419a999eae486b5e9fddaa2b2b9"'
>>> d2 = feedparser.parse("http://a real url", d.etag)
I'd like to avoid requesting the feed anew for every user of the web app. Maybe ETags aren't the best option?
Once the first user hits this view, can I store the ETag and reuse it for all the other users? Or does each user get their own thread, so I can't share the value of a variable this way?

An ETag marks a unique state of a web resource, so when you ask for the resource again you can tell the server which state you already have.
But to have some version at your client at all, you have to fetch it the first time, so for the first request the ETag is irrelevant.
See HTTP ETag on Wikipedia; it explains it all.
The typical scenario is:
fetch your page the first time and keep the value of the ETag header for future use
next time you ask for the same page, add the header If-None-Match with the ETag value from your last fetch. The server will check whether there is something new: if the ETag you provide and the ETag of the current version of the resource are the same, it will not return the complete page, but rather HTTP status code 304 Not Modified. If the page has changed on the server, you get the page with HTTP status code 200 and a new ETag value in the response header.
If you want to optimize your app so it does not make the initial request for the same feed for each user, you will have to share the ETag value for the given resource globally across your application.
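A minimal sketch of that idea in the Django view, using Django's cache framework to share the ETag and the parsed entries across requests (the cache keys blog_feed_entries and blog_feed_etag are made-up names for illustration):
import feedparser
from django.core.cache import cache


def __get_latest_blog_posts(rss_url, limit=4):
    cached_entries = cache.get('blog_feed_entries')
    cached_etag = cache.get('blog_feed_etag')

    # feedparser sends If-None-Match for us when an etag is supplied
    feed = feedparser.parse(rss_url, etag=cached_etag)

    if feed.get('status') == 304 and cached_entries is not None:
        # Nothing changed on the server; reuse what we already have.
        return cached_entries[:limit]

    entries = feed.entries[:limit]
    cache.set('blog_feed_entries', entries)
    if 'etag' in feed:
        cache.set('blog_feed_etag', feed.etag)
    return entries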

On the first request the client can never use a local cache, so for the first request an ETag isn't necessary. Remember that the ETag needs to be passed in the conditional request headers (If-None-Match, If-Match, etc.); the semantics of non-conditional requests are clear.
If your feed is a public feed, then an intermediate caching proxy is also allowed to return an ETagged result for a non-conditional request, although it will always have to contact the origin server if the conditional header doesn't match.
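To make the conditional-request mechanics concrete, here is a small illustrative sketch using the requests library (the URL is a placeholder):
import requests

url = 'http://example.com/feed'  # placeholder

first = requests.get(url)
etag = first.headers.get('ETag')

# Replay the request conditionally; 304 means our copy is still current.
headers = {'If-None-Match': etag} if etag else {}
second = requests.get(url, headers=headers)

if second.status_code == 304:
    body = first.text   # reuse the content we already have
else:
    body = second.text  # a newer version, with a new ETag in second.headers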

Session object in python

Does a Session object maintain the same TCP connection with a client?
In the code below, a request from the client is submitted to a handler, and the handler creates a session object. Why does session["count"] on an object behave like a dictionary lookup?
A response is then given back to the client; upon another request, is the code re-executed, so that another session object is created?
How does the session store the previous count information if it did not return a cookie to the client?
import os

from appengine_utilities import sessions
from google.appengine.ext import webapp
from google.appengine.ext.webapp import template


class SubmitHandler(webapp.RequestHandler):
    def get(self):
        session = sessions.Session()
        if "count" in session:
            session["count"] = session["count"] + 1
        else:
            session["count"] = 1
        template_values = {'message': "You have clicked:" + str(session["count"])}
        # render the page using the template engine
        path = os.path.join(os.path.dirname(__file__), 'index.html')
        self.response.out.write(template.render(path, template_values))
You made several questions so let's go one by one:
Sessions are not related to TCP connections. A TCP connection is maintained when both client and server agreed upon that using the HTTP Header keep-alive. (Quoted from Pablo Santa Cruz in this answer).
Looking at the module session.py, at line 1010 under the __getitem__ definition, I found the following TODO: "It's broke here, but I'm not sure why, it's returning a model object." It could be something along these lines; I haven't debugged it myself.
From appengine_utilities documentation sessions are stored in Datastore and Memcache or kept entirely as cookies. The first option also involves sending a token to the client to identify it in subsequent requests. Choosing one or another depends on your actual settings or the default ones if you haven't configured your own. Default settings are defined to use the Datastore option.
About code re-execution you could check that yourself adding some logging code to count how many times is the function executed.
Something important: I have noticed that this library had its latest update on 2 January 2016, so it has gone unmaintained for 4 years. It would be best to switch to an up-to-date library, for example the webapp2 sessions module. Furthermore, Python 2 is sunsetting this year (1 January 2020), so you might consider switching to Python 3 instead.
PS: I found the exact code you posted on this website. In case you took it from there, consider including a reference/citation to its origin next time.
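If you do switch, a minimal sketch of the same counter with webapp2_extras.sessions might look like this (the route and secret_key are placeholders; sessions are backed by a secure cookie by default):
import webapp2
from webapp2_extras import sessions


class SubmitHandler(webapp2.RequestHandler):
    def dispatch(self):
        # Load the session store before the handler runs and save it afterwards.
        self.session_store = sessions.get_store(request=self.request)
        try:
            webapp2.RequestHandler.dispatch(self)
        finally:
            self.session_store.save_sessions(self.response)

    @webapp2.cached_property
    def session(self):
        return self.session_store.get_session()

    def get(self):
        self.session['count'] = self.session.get('count', 0) + 1
        self.response.write('You have clicked: %d' % self.session['count'])


config = {'webapp2_extras.sessions': {'secret_key': 'replace-with-your-own-secret'}}
app = webapp2.WSGIApplication([('/', SubmitHandler)], config=config)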

In requests_cache, does the sqlite file contain the cached request time and age?

Consider the following code:
import requests
import requests_cache

requests_cache.install_cache(expire_after=7200)

url = 'http://www.example.com'
with requests.Session() as sess:
    response = sess.get(url)
    print response.text
First run
When I first run this code, I am sure that the GET request is sent out to www.example.com, since no cache has been set up yet. I will then see a file named cache.sqlite in the working directory, which contains the cached response.
The first process will then exit, erasing all traces of it from RAM.
Second run, maybe 2000 seconds later
What else does requests_cache.install_cache do? Aside from "installing" a cache, does it also tell the present Python session that "Hey, there's a cache present right now, you might want to look into it before sending out new requests".
So, my question is, does the new instance of my script process respect the existing cache.sqlite or does it create an entirely new one from scratch?
If not, how do I make sure that it will look up the existing cache first before sending out new requests, and also consider the age of the cached requests?
Here's what's going on under the hood:
requests_cache.install_cache() globally patches out requests.Session with caching behavior.
install_cache() takes a number of optional arguments to tell it where and how to cache responses, but by default it will create a SQLite database in the current directory, as you noticed.
A cached response will be stored along with its expiration time, in response.expires
The next time you run your script, install_cache() will load the existing database instead of making a new one
The next time you make that request, the expiration time will be checked against the current time. If it's expired, a new request will be sent and the old cached response will be overwritten with the new one.
Here's an example that makes it more obvious what's going on:
from requests_cache import CachedSession

session = CachedSession('~/my_project/requests_cache.db', expire_after=7200)
session.get('http://www.example.com')
response = session.get('http://www.example.com')

# Show if the response came from the cache, when it was created, and when it expires
print(f'From cache: {response.from_cache}')
print(f'Created: {response.created_at}')
print(f'Expires: {response.expires}')

# You can also get a summary from just printing the cached response object
print(response)

# Show where the cache is stored, and currently cached request URLs
print(session.cache.db_path)
for url in session.cache.urls:
    print(url)
And for reference, there is now more thorough user documentation that should answer most questions about how requests-cache works and how to make it behave the way you want: https://requests-cache.readthedocs.io

Google App Engine: Determine whether Current Request is a Taskqueue

Is there a way to dynamically determine whether the currently executing task is a standard http request or a TaskQueue?
In some parts of my request handler, I make a few urlfetches. I would like the timeout delay of the url fetch to be short if the request is a standard http request and long if it is a TaskQueue.
Pick any one of the following HTTP headers:
X-AppEngine-QueueName, the name of the queue (possibly default)
X-AppEngine-TaskName, the name of the task, or a system-generated unique ID if no name was specified
X-AppEngine-TaskRetryCount, the number of times this task has been retried; for the first attempt, this value is 0
X-AppEngine-TaskETA, the target execution time of the task, specified in microseconds since January 1st 1970.
Standard HTTP requests won't have these headers.
Task requests always include a specific set of HTTP headers, which you can check.
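A minimal sketch of how that check could look in a handler, using the first header (the handler name, target URL, and timeout values are illustrative):
from google.appengine.api import urlfetch
from google.appengine.ext import webapp


class FetchHandler(webapp.RequestHandler):
    def post(self):
        # App Engine strips X-AppEngine-* headers from external requests,
        # so this header is only present when the task queue invokes us.
        is_task = 'X-AppEngine-QueueName' in self.request.headers
        deadline = 30 if is_task else 5  # seconds
        result = urlfetch.fetch('http://example.com/api', deadline=deadline)
        self.response.out.write(result.content)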

image download problem (python)

I was trying to download images with multiple threads (with a limited max_count) in Python.
Each time a download thread is started, I leave it alone and start another one. I want each download to finish within 5 seconds, meaning the download has failed if opening the URL takes more than 5 seconds.
But how can I detect that and stop the failed thread?
Can you tell us which version of Python you are using?
Maybe you could have posted a snippet too.
Since Python 2.6, urllib2.urlopen accepts a timeout argument.
Hope this will help you. The following is from the Python docs:
urllib2.urlopen(url[, data][, timeout])
Open the URL url, which can be either a string or a Request object.
Warning: HTTPS requests do not do any verification of the server's certificate.
data may be a string specifying additional data to send to the server, or None if no such data is needed. Currently HTTP requests are the only ones that use data; the HTTP request will be a POST instead of a GET when the data parameter is provided. data should be a buffer in the standard application/x-www-form-urlencoded format. The urllib.urlencode() function takes a mapping or sequence of 2-tuples and returns a string in this format. The urllib2 module sends HTTP/1.1 requests with the Connection: close header included.
The optional timeout parameter specifies a timeout in seconds for blocking operations like the connection attempt (if not specified, the global default timeout setting will be used). This actually only works for HTTP, HTTPS and FTP connections.
This function returns a file-like object with two additional methods:
geturl() — return the URL of the resource retrieved, commonly used to determine if a redirect was followed
info() — return the meta-information of the page, such as headers, in the form of a mimetools.Message instance (see Quick Reference to HTTP Headers)
Raises URLError on errors.
Note that None may be returned if no handler handles the request (though the default installed global OpenerDirector uses UnknownHandler to ensure this never happens).
In addition, the default installed ProxyHandler makes sure the requests are handled through the proxy when proxies are set.
Changed in version 2.6: timeout was added.
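A minimal sketch of how the timeout could be used inside a download worker, assuming Python 2.6+ (the function and parameter names are illustrative):
import socket
import urllib2


def download_image(image_url, timeout=5):
    """Fetch one image, treating anything slower than `timeout` seconds as a failure."""
    try:
        response = urllib2.urlopen(image_url, timeout=timeout)
        return response.read()
    except (urllib2.URLError, socket.timeout) as e:
        # The connection or read took longer than `timeout` seconds, or the URL is bad.
        print 'download failed for %s: %s' % (image_url, e)
        return None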

Why are CherryPy object attributes persistent between requests?

I was writing debugging methods for my CherryPy application. The code in question was (very) basically equivalent to this:
import cherrypy

class Page:
    def index(self):
        try:
            self.body += 'okay'
        except AttributeError:
            self.body = 'okay'
        return self.body
    index.exposed = True

cherrypy.quickstart(Page(), config='root.conf')
I was surprised to notice that from request to request, the output of self.body grew. When I visited the page from one client, and then from another concurrently-open client, and then refreshed the browsers for both, the output was an ever-increasing string of "okay"s. In my debugging method, I was also recording user-specific information (i.e. session data) and that, too, showed up in both users' output.
I'm assuming that's because the python module is loaded into working memory instead of being re-run for every request.
My question is this: How does that work? How is it that self.debug is preserved from request to request, but cherrypy.session and cherrypy.response aren't?
And is there any way to set an object attribute that will only be used for the current request? I know I can overwrite self.body per every request, but it seems a little ad-hoc. Is there a standard or built-in way of doing it in CherryPy?
(second question moved to How does CherryPy caching work?)
synthesizerpatel's analysis is correct, but if you really want to store some data per request, then store it as an attribute on cherrypy.request, not in the session. The cherrypy.request and .response objects are new for each request, so there's no fear that any of their attributes will persist across requests. That is the canonical way to do it. Just make sure you're not overwriting any of cherrypy's internal attributes! cherrypy.request.body, for example, is already reserved for handing you, say, a POSTed JSON request body.
For all the details of exactly how the scoping works, the best source is the source code.
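For illustration, here is a minimal sketch of the per-request approach described above (debug_notes is a made-up attribute name; it disappears with the request object once the response is sent):
import cherrypy

class Page:
    def index(self):
        # cherrypy.request is a fresh object for every request, so nothing
        # stored on it can leak to other requests or other clients.
        if not hasattr(cherrypy.request, 'debug_notes'):
            cherrypy.request.debug_notes = []
        cherrypy.request.debug_notes.append('okay')
        return ' '.join(cherrypy.request.debug_notes)
    index.exposed = True

cherrypy.quickstart(Page(), config='root.conf')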
You hit the nail on the head with the observation that you're getting the same data from self.body because it's the same object in the memory of the Python process running CherryPy.
self.debug maintains 'state' for this reason: it's an attribute of the running server.
To set data for the current session, use cherrypy.session['fieldname'] = 'fieldvalue'; to get data, use cherrypy.session.get('fieldname').
You (the programmer) do not need to know the session ID; cherrypy.session handles that for you. The session ID is automatically generated on the fly by CherryPy and is persisted by exchanging a cookie between the browser and server on subsequent request/response interactions.
If you don't specify a storage_type for cherrypy.session in your config, sessions are stored in memory (accessible to the server and you), but you can also store the session files on disk, which can be a handy way to debug without having to write a bunch of code to dig out session IDs or key/value pairs from the running server.
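For example, with the CherryPy 3.x-style sessions tool, a file-backed configuration could look roughly like this (the storage path and timeout are made-up values):
import cherrypy

config = {
    '/': {
        'tools.sessions.on': True,
        'tools.sessions.storage_type': 'file',        # keep session files on disk
        'tools.sessions.storage_path': '/tmp/cp_sessions',
        'tools.sessions.timeout': 60,                 # minutes
    },
}
cherrypy.quickstart(Page(), '/', config)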
For more info check out http://www.cherrypy.org/wiki/CherryPySessions
