I created a new Pylons project and would like to use Cassandra as my database server. I plan on using Pycassa so I can work with Cassandra 0.7 beta.
Unfortunately, I don't know where to instantiate the connection to make it available in my application.
The goal would be to:
Create a pool when the application is launched
Get a connection from the pool for each request, and make it available to my controllers and libraries (in the context of the request). Ideally, the connection would be fetched from the pool "lazily", i.e. only if needed
If a connection has been used, release it when the request has been processed
Additionally, is there something important I should know about it? When I see comments like "Be careful when using a QueuePool with use_threadlocal=True, especially with retries enabled. Synchronization may be required to prevent the connection from changing while another thread is using it.", what does that mean exactly?
Thanks.
--
Pierre
Well, I worked a little more. In fact, using a connection manager was probably not a good idea, as that kind of thing belongs in the template context. Additionally, opening a connection per thread is not really a big deal; opening a connection per request would be.
I ended up with just pycassa.connect_thread_local() in app_globals, and there I go.
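For reference, that ends up being just a couple of lines in app_globals.py (a sketch; depending on your pycassa version you may want to pass the keyspace and server list to connect_thread_local() instead of relying on the defaults):

import pycassa

class Globals(object):
    """Container for objects available throughout the life of the application."""

    def __init__(self):
        """One instance of Globals is created during application
        initialization and is available during requests via the 'g' variable."""
        # One thread-local Cassandra client for the whole application;
        # each thread transparently gets its own underlying connection.
        self.cass = pycassa.connect_thread_local()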
Okay.
I worked a little, I learned a lot, and I found a possible answer.
Creating the pool
The best place to create the pool seems to be in the app_globals.py file, which is basically a container for objects which will be accessible "throughout the life of the application". Exactly what I want for a pool, in fact.
I just added my init code at the end of the file; it takes its settings from the Pylons configuration file:

"""Creating an instance of the Pycassa Pool"""
kwargs = {}

# Parsing servers
if 'cassandra.servers' in config['app_conf']:
    servers = config['app_conf']['cassandra.servers'].split(',')
    if len(servers):
        kwargs['server_list'] = servers

# Parsing timeout
if 'cassandra.timeout' in config['app_conf']:
    try:
        kwargs['timeout'] = float(config['app_conf']['cassandra.timeout'])
    except ValueError:
        pass

# Finally creating the pool
self.cass_pool = pycassa.QueuePool(keyspace='Keyspace1', **kwargs)
I could have done better, like moving that into a function, or supporting more parameters (pool size, ...), which I'll do.
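For what it's worth, that refactoring could look roughly like this (the cassandra.pool_size key is a hypothetical extra setting, and I'm assuming pycassa's QueuePool accepts a pool_size argument):

def make_cassandra_pool(app_conf, keyspace='Keyspace1'):
    """Build a pycassa QueuePool from the Pylons app configuration."""
    kwargs = {}
    if 'cassandra.servers' in app_conf:
        servers = [s.strip() for s in app_conf['cassandra.servers'].split(',')]
        if servers:
            kwargs['server_list'] = servers
    if 'cassandra.timeout' in app_conf:
        try:
            kwargs['timeout'] = float(app_conf['cassandra.timeout'])
        except ValueError:
            pass
    if 'cassandra.pool_size' in app_conf:  # hypothetical extra setting
        try:
            kwargs['pool_size'] = int(app_conf['cassandra.pool_size'])
        except ValueError:
            pass
    return pycassa.QueuePool(keyspace=keyspace, **kwargs)

app_globals.py then shrinks to self.cass_pool = make_cassandra_pool(config['app_conf']).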
Getting a connection at each request
Well, there seems to be a simple way: in the file base.py, add something like c.conn = g.cass_pool.get() before calling WSGIController, and something like c.conn.return_to_pool() after. This is simple, and it works. But it gets a connection from the pool even when the controller doesn't need one. I have to dig a little deeper.
Creating a connection manager
I had the simple idea of creating a class which would be instantiated at each request in the base.py file, and which would automatically grab a connection from the pool when requested (and release it afterwards). This is a really simple class:
class LocalManager:
    '''Requests a connection from a Pycassa Pool when needed, and releases it at the end of the object's life'''

    def __init__(self, pool):
        '''Class constructor'''
        assert isinstance(pool, Pool)
        self._pool = pool
        self._conn = None

    def get(self):
        '''Grabs a connection from the pool if not already done, and returns it'''
        if self._conn is None:
            self._conn = self._pool.get()
        return self._conn

    def __getattr__(self, key):
        '''It's cooler to write "c.conn" than "c.get()" in the code, isn't it?'''
        if key == 'conn':
            return self.get()
        # __getattr__ is only called when normal lookup fails, so anything
        # else really is missing
        raise AttributeError(key)

    def __del__(self):
        '''Releases the connection, if needed'''
        if self._conn is not None:
            self._conn.return_to_pool()
I just added c.cass = LocalManager(g.cass_pool) before calling WSGIController in base.py and del c.cass after, and I'm all done.
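For completeness, this is roughly what the base.py wiring looks like (a sketch based on the stock Pylons BaseController, using the LocalManager class from above):

from pylons import app_globals as g, tmpl_context as c
from pylons.controllers import WSGIController

class BaseController(WSGIController):

    def __call__(self, environ, start_response):
        """Invoke the Controller"""
        # The manager only grabs a connection if a controller actually touches c.cass.conn
        c.cass = LocalManager(g.cass_pool)
        try:
            return WSGIController.__call__(self, environ, start_response)
        finally:
            # Dropping the reference triggers __del__, which returns the
            # connection to the pool if one was taken
            del c.cass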
And it works:
conn = c.cass.conn
cf = pycassa.ColumnFamily(conn, 'TestCF')
print cf.get('foo')
\o/
I don't know if this is the best way to do this. If not, please let me know =)
Plus, I still haven't understood the "synchronization" part of the Pycassa source code: whether it is needed in my case, and what I should do to avoid problems.
Thanks.
Related
I am working on a Python Flask app, and the main method start() calls an external API (third_party_api_wrapper()). That external API has an associated webhook (webhook()) that receives the output of that external API call (note that the output webhook() receives is actually different from the response returned by third_party_api_wrapper()).
The main method start() needs the result of webhook(). How do I make start() wait for webhook() to be executed? And how do we pass the returned value of webhook() back to start()?
Here is a minimal code snippet capturing the scenario.
import requests
from flask import Flask

app = Flask(__name__)

@app.route('/webhook', methods=['POST'])
def webhook():
    return "webhook method has executed"

# this method has a webhook that calls webhook() after this method has executed
def third_party_api_wrapper():
    url = 'https://api.thirdparty.com'
    response = requests.post(url)
    return response

# this is the main entry point
@app.route('/start', methods=['POST'])
def start():
    third_party_api_wrapper()
    # The rest of this code depends on the output of webhook().
    # How do we wait until webhook() is called, and how do we access the returned value?
The answer to this question really depends on how you plan on running your app in production. It's much simpler if we make the assumption that you only plan to have a single instance of your app running at once (as opposed to multiple behind a load balancer, for example), so I'll make that assumption first to give you a place to start, and comment on a more "production-ready" solution afterwards.
A big thing to keep in mind when writing a web application is that you have to understand how you want the outside world to interact with your app. Do you expect to have the /start endpoint called only once at the beginning of your app's lifetime, or is this a generic endpoint that may start any number of background processes that you want the caller of each to wait for? Or, do you want the behavior where any caller after the first one will wait for the same process to complete as the first one? I can't answer these questions for you, it depends on the use-case you're trying to implement. I'll give you a relatively simple solution that you should be able to modify to fulfill any of the ones I mentioned though.
This solution will use the Event class from the threading standard library module; I added some comments to clarify which parts you may have to change depending on the specifics of the API you're calling and stuff like that.
import threading
import uuid
from typing import Any

import requests
from flask import Flask, Response, request

# The base URL for your app; if you're running it locally this should be fine,
# however external providers can't communicate with your `localhost` so you'll
# need to change this for your app to work end-to-end.
BASE_URL = "http://localhost:5000"

app = Flask(__name__)


class ThirdPartyProcessManager:
    def __init__(self) -> None:
        self.events = {}
        self.values = {}

    def wait_for_request(self, request_id: str) -> Any:
        event = threading.Event()
        actual_event = self.events.setdefault(request_id, event)
        if actual_event is not event:
            raise ValueError(f"Request {request_id} already exists.")

        event.wait()
        return self.values.pop(request_id)

    def finish_request(self, request_id: str, value: Any) -> None:
        event = self.events.pop(request_id, None)
        if event is None:
            raise ValueError(f"Request {request_id} does not exist.")

        self.values[request_id] = value
        event.set()


MANAGER = ThirdPartyProcessManager()


# This is assuming that you can specify the callback URL per-request, otherwise
# you may have to get the request ID from the body of the request or something
@app.route('/webhook/<request_id>', methods=['POST'])
def webhook(request_id: str) -> Response:
    MANAGER.finish_request(request_id, request.json)
    return "webhook method has executed"


# Somehow in here you need to create or generate a unique identifier for this
# request--this may come from the third-party provider, or you can generate one
# yourself. There are two main paths I see here:
# - If you can specify the callback/webhook URL in each call, you can just pass them
#   <base>/webhook/<request_id> and use that to identify which request is being
#   responded to in the webhook.
# - If the provider gives you a request ID, you can return it from this function
#   then retrieve it from the request body in the webhook route.
# For now, I'll assume the first situation, but you should be able to implement the second
# with minimal changes.
def third_party_api_wrapper() -> str:
    request_id = uuid.uuid4().hex
    url = 'https://api.thirdparty.com'

    # Just an example, I don't know how the third party API you're working with works
    response = requests.post(
        url,
        json={"callback_url": f"{BASE_URL}/webhook/{request_id}"}
    )

    # NOTE: unrelated to the problem at hand, you should always check for errors
    # in HTTP responses. This method is an easy way provided by requests to raise
    # for non-success status codes.
    response.raise_for_status()

    return request_id


@app.route('/start', methods=['POST'])
def start() -> Response:
    request_id = third_party_api_wrapper()
    result = MANAGER.wait_for_request(request_id)
    return result
If you want to run the example fully locally to test it, do the following:
Comment out the part of third_party_api_wrapper() that actually makes the external API call (the requests.post(...) call and response.raise_for_status())
Add a print statement after the request ID is generated in third_party_api_wrapper(), so that you can get the ID of the "in flight" request. E.g. print("Request ID", request_id)
In one terminal, run the app by pasting the above code into an app.py file and running flask run in that directory.
In another terminal, start the process via:
curl -XPOST http://localhost:5000/start
Copy the request ID that will be logged in the first terminal that's running the server.
In a third terminal, complete the process by calling the webhook:
curl -XPOST http://localhost:5000/webhook/<your_request_id> -H Content-Type:application/json -d '{"foo":"bar"}'
You should see {"foo":"bar"} as the response in the second terminal that made the /start request.
I hope that's enough to help you get started w/ whatever problem you're trying to solve.
There are a couple of design-y comments I have based on the information provided as well:
As I mentioned before, this will not work if you have more than one instance of the app running at once. This works by storing the state of in-flight requests in a global state inside your python process, so if you have more than one process, they won't all be working and modifying the same state. If you need to run more than one instance of your process, I would use a similar approach with some database backend to store the shared state (assuming your requests are pretty short-lived, Redis might be a good choice here, but once again it'll depend on exactly what you're trying to do).
Even if you do only have one instance of the app running, flask is capable of being run in a variety of different server contexts--for example, the server might be using threads (the default), greenlets via gevent or a similar library, or multiple processes, or maybe some other approach entirely in order to handle multiple requests concurrently. If you're using an approach that creates multiple processes, you should be able to use the utilities provided by the multiprocessing module to implement the same approach as I've given above.
This approach probably will work just fine for something where the difference in time between the API call and the webhook response is small (on the order of a couple of seconds at most I'd say), but you should be wary of using this approach for something where the difference in time can be quite large. If the connection between the client and your server fails, they'll have to make another request and run the long-running process that your third party is completing for you again. Some proxies and load balancers may also have time out behavior that could terminate the request after a certain amount of time even if nothing goes wrong in the connection between your server and the client making a request to it. An alternative approach would be for your /start endpoint to return quickly and give the client a request_id that they could poll for updates. As an example, AWS Athena's API is structured like this--there is a StartQueryExecution method, and separate GetQueryExecution and GetQueryResults methods that the client makes requests to check the status of a query and retrieve the results respectively (there are also other methods like StopQueryExecution and GetQueryRuntimeStatistics available as well). You can check out the documentation here.
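If that polling style fits your use-case better, here is a rough sketch of how the same example could be reshaped (this replaces the /start and /webhook routes from the code above and drops the Event machinery; the /result endpoint, the "pending" status convention, and the in-memory RESULTS dict are all just illustrative choices, with the same single-process caveat as before):

RESULTS = {}  # request_id -> webhook payload; in-memory, single-process only

@app.route('/start', methods=['POST'])
def start() -> Response:
    # Kick off the third-party call and return immediately with an ID the
    # client can poll with.
    request_id = third_party_api_wrapper()
    return {"request_id": request_id}

@app.route('/webhook/<request_id>', methods=['POST'])
def webhook(request_id: str) -> Response:
    RESULTS[request_id] = request.json
    return "webhook method has executed"

@app.route('/result/<request_id>', methods=['GET'])
def get_result(request_id: str) -> Response:
    # 202 tells the client the work has been accepted but isn't finished yet.
    if request_id not in RESULTS:
        return {"status": "pending"}, 202
    return {"status": "done", "result": RESULTS.pop(request_id)}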
I know that's a lot of info, but I hope it helps. Happy to update the answer w/ more specific info if you'll provide some more details about your use-case.
Our application was working fine until we ported it to Azure Database for PostgreSQL. Since then, our application periodically fails for no apparent reason and we get SSL SYSCALL errors all over the place, on DELETEs and so on. We have tried everything described on the internet: keepalive args, RAM, memory, and everything else. We want to try automatically re-establishing the connection, but we have a threaded connection pool. We have looked at this thread: Psycopg2 auto reconnect inside a class
But our functions that read the database are in another class. So we have two questions:
1) What is the cause of the SSL SYSCALL errors? I have searched all the threads and the usual suspects have been ruled out.
2) How do I reconnect on failure inside a threaded connection pool class? This is being used in a Flask app.
Here is how our app is structured:

import psycopg2
import psycopg2.pool

class DBClass(object):
    _instance = None
    conn = None

    def __new__(cls):
        if cls._instance is None:
            cls._instance = object.__new__(cls)
            try:
                max_conn = 12
                keepalive_args = {
                    "keepalives": 1,
                    "keepalives_idle": 25,
                    "keepalives_interval": 4,
                    "keepalives_count": 9,
                }
                cls._instance.pool = psycopg2.pool.ThreadedConnectionPool(
                    3, max_conn,
                    dbname=, host=, user=, password=, port=,
                    **keepalive_args)
            except Exception as ex:
                cls._instance = None
                raise ex
        return cls._instance

    def __enter__(self):
        self.conn = self._instance.pool.getconn()
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        self._instance.pool.putconn(self.conn)

    def __del__(self):
        self._instance.pool.closeall()

# Another Python module has a class called clsEmployee. We have dozens of
# functions using the above-mentioned database class. Something like this:
with DBClass() as db:
    pg_conn = db.conn
    cur = pg_conn.cursor()
    cur.execute("SELECT * from emp")
    row = cur.fetchone()[0]
There are many ways you could handle this.
The solution proposed by Psycopg2 auto reconnect inside a class will still work if the calls that execute DB work live outside of the DBClass. You just need functions that call the database, and you wrap them with a decorator. All the decorator is doing is adding a loop so the function can be called multiple times, wrapping the actual call in a try/except and reconnecting on an exception. This is actually a pretty standard way of handling this type of problem, as it works for databases, APIs, or anything else that can fail. The one thing you may want to do is add an exponential backoff to your retry (where the sleep call is).
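To make that concrete, here is a hedged sketch of such a decorator layered on top of your DBClass (the exception types, retry count, and backoff values are just examples; psycopg2.OperationalError and InterfaceError are what typically surface as SSL SYSCALL failures):

import functools
import time

import psycopg2

def with_db_retry(retries=3, base_delay=0.5):
    """Retry a DB-calling function when the connection appears to have died."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(retries):
                try:
                    return func(*args, **kwargs)
                except (psycopg2.OperationalError, psycopg2.InterfaceError):
                    if attempt == retries - 1:
                        raise
                    # Exponential backoff before trying again with a fresh
                    # connection checked out of the pool.
                    time.sleep(base_delay * (2 ** attempt))
        return wrapper
    return decorator

@with_db_retry()
def get_first_employee():
    # Each attempt enters the context manager again, so it checks a
    # connection out of the pool rather than reusing the broken one.
    with DBClass() as db:
        cur = db.conn.cursor()
        cur.execute("SELECT * from emp")
        return cur.fetchone()[0]

One extra thing to consider: when an attempt fails, DBClass.__exit__ will still hand the (possibly broken) connection back to the pool; you may want it to call putconn(self.conn, close=True) when exc_type is not None, so dead connections get dropped instead of recycled.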
The other option you have is to create your own subclass of cursor that has the same retry logic inside an overridden version of execute. This accomplishes the same thing; it's just a case of which you find easier to work with.
Since this is being used in a Flask app, you could also modify the first approach and, instead of retrying at the model-code level, do the retry at the Flask route level.
Is there a simple example of working with the neo4j python driver?
How do I just pass cypher query to the driver to run and return a cursor?
If I'm reading, for example, this, it seems the demo has a class wrapper, with a private member function that I pass to session.write_transaction,
session.write_transaction(self._create_and_return_greeting, ...
That then gets called with a transaction as the first parameter...
def _create_and_return_greeting(tx, message):
that in turn runs the cypher
result = tx.run("CREATE (a:Greeting) "
This seems 10X more complicated than it needs to be.
I did just try something simpler:
def raw_query(query, **kwargs):
    neodriver = neo_connect()  # cached dbconn
    with neodriver.session() as session:
        try:
            result = session.run(query, **kwargs)
            return result.data()
But this results in a socket error on the query, probably because the session goes out of scope?
[dfcx/__init__] ERROR | Underlying socket connection gone (_ssl.c:2396)
[dfcx/__init__] ERROR | Failed to write data to connection IPv4Address(('neo4j-core-8afc8558-3.production-orch-0042.neo4j.io', 7687)) (IPv4Address(('34.82.120.138', 7687)))
Also I can't return a cursor/iterator, just the data()
When the session goes out of scope, the query result seems to die with it.
If I manually open and close a session, then I'd have the same problems?
Python must be the most popular language this DB is used with, does everyone use a different driver?
Py2neo seems cute, but it's completely lacking ORM wrapper functions for most of the Cypher language features, so you have to drop down to raw Cypher anyway. And I'm not sure it supports **kwargs argument interpolation in the same way.
I guess that big raise should help iron out some kinks :D
Slightly longer version trying to get a working DB wrapper:
from typing import Union

import neo4j

raw_driver = None  # module-level cached driver
# NEOCONFIG is defined elsewhere in the app's configuration

def neo_connect() -> Union[neo4j.BoltDriver, neo4j.Neo4jDriver]:
    global raw_driver
    if raw_driver:
        # print('reuse driver')
        return raw_driver
    neoconfig = NEOCONFIG
    raw_driver = neo4j.GraphDatabase.driver(
        neoconfig['url'], auth=(
            neoconfig['user'], neoconfig['pass']))
    if raw_driver is None:
        raise BaseException("cannot connect to neo4j")
    else:
        return raw_driver
def raw_query(query, **kwargs):
    # just get data, no cursor
    neodriver = neo_connect()
    session = neodriver.session()
    # logging.info('neoquery %s', query)
    # with neodriver.session() as session:
    try:
        result = session.run(query, **kwargs)
        data = result.data()
        return data
    except neo4j.exceptions.CypherSyntaxError as err:
        logging.error('neo error %s', err)
        logging.error('failed query: %s', query)
        raise err
    # finally:
    #     logging.info('close session')
    #     session.close()
Update: someone pointed me to this example, which is another way to use the tx wrapper.
https://github.com/neo4j-graph-examples/northwind/blob/main/code/python/example.py#L16-L21
def raw_query(query, **kwargs):
    neodriver = neo_connect()  # cached dbconn
    with neodriver.session() as session:
        try:
            result = session.run(query, **kwargs)
            return result.data()
This is perfectly fine and works as intended on my end.
The error you're seeing is stating that there is a connection problem. So there must be something going on between the server and the driver that's outside of its influence.
Also, please note, that there is a difference between all of these ways to run a query:
# 1. Auto-commit transaction
with driver.session() as session:
    result = session.run("<SOME CYPHER>")

# 2. Managed transaction (transaction function)
def work(tx):
    result = tx.run("<SOME CYPHER>")

with driver.session() as session:
    session.write_transaction(work)
The latter one might be 3 lines longer, and the team working on the drivers has collected some feedback regarding this. However, there are more things to consider here. Firstly, changing the API surface is something that needs careful planning and cannot be done in, say, a patch release. Secondly, there are technical hurdles to overcome. Here are the semantics, anyway:
Auto-commit transaction. Runs only that query as one unit of work.
If you run a new auto-commit transaction within the same session, the previous result will buffer all available records for you (depending on the query, this will consume a lot of memory). This can be avoided by calling result.consume(). However, if the session goes out of scope, the result will be consumed automatically. This means you cannot extract further records from it. Lastly, any error will be raised and needs handling in the application code.
Managed transaction. Runs whatever unit of work you want within that function. A transaction is implicitly started and committed (unless you rollback explicitly) around the function.
If the transaction ends (end of function or rollback), the result will be consumed and become invalid. You'll have to extract all records you need before that.
This is the recommended way of using the driver because it will not raise all errors but handle some internally (where appropriate) and retry the work function (e.g. if the server is only temporarily unavailable). Since the function might be executed multiple times, you must make sure it's idempotent.
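To make the "extract everything before the transaction ends" point concrete, here is a small sketch of a managed transaction function (the URI, credentials, and the query itself are placeholders):

from neo4j import GraphDatabase

def get_people(tx, min_age):
    result = tx.run(
        "MATCH (p:Person) WHERE p.age >= $min_age RETURN p.name AS name",
        min_age=min_age,
    )
    # Materialize the records inside the transaction function; the result
    # object becomes invalid once the transaction is over.
    return result.data()

driver = GraphDatabase.driver("neo4j://localhost:7687", auth=("neo4j", "password"))
with driver.session() as session:
    # The driver may call get_people more than once (retries), so keep it idempotent.
    people = session.read_transaction(get_people, 21)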
Closing thoughts:
Please remember that Stack Overflow is monitored on a best-effort basis, and what can be perceived as hasty comments may get in the way of getting helpful answers to your questions.
I have a simple web server which serves content over HTTPS:
sslContext = ssl.DefaultOpenSSLContextFactory(
    '/home/user/certs/letsencrypt-privkey.pem',
    '/home/user/certs/letsencrypt-fullchain.pem',
)

reactor.listenSSL(
    port=https_server_port,
    factory=website_factory,
    contextFactory=sslContext,
    interface=https_server_interface
)

do_print(bcolors.YELLOW + 'server.py | running https server on ' + https_server_interface + ':' + str(https_server_port) + bcolors.END)
Is it possible to reload the certificates on the fly (for example by calling a path like https://example.com/server/reload-certificates and having it execute some code) or what do I need to do in order to get it done?
I want to avoid restarting the Python process.
It is possible in several ways. Daniel F's answer is pretty good and shows a good, general technique for reconfiguring your server on the fly.
Here are a couple more techniques that are more specific to TLS support in Twisted.
First, you could reload the OpenSSL "context" object from the DefaultOpenSSLContextFactory instance. When it comes time to reload the certificates, run:
sslContext._context = None
sslContext.cacheContext()
The cacheContext call will create a new OpenSSL context, re-reading the certificate files in the process. This does have the downside of relying on a private interface (_context) and its interaction with a not-really-that-public interface (cacheContext).
You could also implement your own version of DefaultOpenSSLContextFactory so that you don't have to rely on these things. DefaultOpenSSLContextFactory doesn't really do much. Here's a copy/paste/edit that removes the caching behavior entirely:
from OpenSSL import SSL
from twisted.internet.ssl import ContextFactory

class DefaultOpenSSLContextFactory(ContextFactory):
    """
    L{DefaultOpenSSLContextFactory} is a factory for server-side SSL context
    objects.  These objects define certain parameters related to SSL
    handshakes and the subsequent connection.
    """
    _context = None

    def __init__(self, privateKeyFileName, certificateFileName,
                 sslmethod=SSL.SSLv23_METHOD, _contextFactory=SSL.Context):
        """
        @param privateKeyFileName: Name of a file containing a private key
        @param certificateFileName: Name of a file containing a certificate
        @param sslmethod: The SSL method to use
        """
        self.privateKeyFileName = privateKeyFileName
        self.certificateFileName = certificateFileName
        self.sslmethod = sslmethod
        self._contextFactory = _contextFactory

    def getContext(self):
        """
        Return an SSL context.
        """
        ctx = self._contextFactory(self.sslmethod)
        # Disallow SSLv2!  It's insecure!  SSLv3 has been around since
        # 1996.  It's time to move on.
        ctx.set_options(SSL.OP_NO_SSLv2)
        ctx.use_certificate_file(self.certificateFileName)
        ctx.use_privatekey_file(self.privateKeyFileName)
        return ctx
Of course, this reloads the certificate files for every single connection which may be undesirable. You could add your own caching logic back in, with a control interface that fits into your certificate refresh system. This also has the downside that DefaultOpenSSLContextFactory is not really a very good SSL context factory to begin with. It doesn't follow current best practices for TLS configuration.
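As a rough illustration, caching plus an explicit invalidation hook could look something like this (the class and the refreshContext() method name are my own invention, not part of Twisted):

from OpenSSL import SSL
from twisted.internet.ssl import ContextFactory

class ReloadableOpenSSLContextFactory(ContextFactory):
    """Caches the OpenSSL context, but can be told to rebuild it on demand."""

    def __init__(self, privateKeyFileName, certificateFileName,
                 sslmethod=SSL.SSLv23_METHOD, _contextFactory=SSL.Context):
        self.privateKeyFileName = privateKeyFileName
        self.certificateFileName = certificateFileName
        self.sslmethod = sslmethod
        self._contextFactory = _contextFactory
        self._context = None

    def refreshContext(self):
        """Drop the cached context; the next handshake re-reads the cert files."""
        self._context = None

    def getContext(self):
        if self._context is None:
            ctx = self._contextFactory(self.sslmethod)
            ctx.set_options(SSL.OP_NO_SSLv2)
            ctx.use_certificate_file(self.certificateFileName)
            ctx.use_privatekey_file(self.privateKeyFileName)
            self._context = ctx
        return self._context

Your certificate refresh code (for example a /server/reload-certificates handler) would then just call refreshContext() on the factory instance.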
So you probably really want to use twisted.internet.ssl.CertificateOptions instead. This has a similar _context cache that you could clear out:
sslContext = CertificateOptions(...) # Or PrivateCertificate(...).options(...)
...
sslContext._context = None
It will regenerate the context automatically when it finds that it is None so at least you don't have to call cacheContext this way. But again you're relying on a private interface.
Another technique that's more similar to Daniel F's suggestion is to provide a new factory for the already listening socket. This avoids the brief interruption in service that comes between stopListening and listenSSL. This would be something like:
from socket import AF_INET

from twisted.protocols.tls import TLSMemoryBIOFactory

# DefaultOpenSSLContextFactory or CertificateOptions or whatever
newContextFactory = ...

tlsWebsiteFactory = TLSMemoryBIOFactory(
    newContextFactory,
    isClient=False,
    wrappedFactory=websiteFactory,
)

listeningPortFileno = sslPort.fileno()
websiteFactory.sslPort.stopReading()
websiteFactory.sslPort = reactor.adoptStreamPort(
    listeningPortFileno,
    AF_INET,
    tlsWebsiteFactory,
)
This basically just has the reactor stop servicing the old sslPort with its outdated configuration and tells it to start servicing events for that port's underlying socket on a new factory. In this approach, you have to drop down to the slightly lower level TLS interface since you can't adopt a "TLS port" since there is no such thing. Instead, you adopt the TCP port and apply the necessary TLS wrapping yourself (this is just what listenSSL is doing for you under the hood).
Note this approach is a little more limited than the others since not all reactors provide the fileno or adoptStreamPort methods. You can test for the interfaces the various objects provide if you want to use this where it's supported and degrade gracefully elsewhere.
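A sketch of such a capability check, under the assumption that IReactorSocket is the interface declaring adoptStreamPort (the helper name is made up):

from twisted.internet import interfaces

def can_adopt(reactor, port):
    """Return True if the adoptStreamPort() approach above should work."""
    # adoptStreamPort() is declared by IReactorSocket; fileno() lives on the
    # concrete port object, so just check that it's there.
    return interfaces.IReactorSocket.providedBy(reactor) and hasattr(port, "fileno")

If it returns False, fall back to the stopListening() / listenSSL() approach from the other answer.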
Also note that since TLSMemoryBIOFactory is how it always works under the hood anyway, you could also twiddle its private interface, if you have a reference to it:
tlsMemoryBIOFactory._connectionCreator = IOpenSSLServerConnectionCreator(
    newContextFactory,
)
and it will begin using that for new connections. But, again, private...
It is possible.
reactor.listenSSL returns a twisted.internet.tcp.Port instance which you can store somewhere accessible like in the website resource of your server, so that you can later access it:
website_resource = Website()
website_factory = server.Site(website_resource)
website_resource.sslPort = reactor.listenSSL(  # <---
    port=https_server_port,
    factory=website_factory,
    contextFactory=sslContext,
    interface=https_server_interface
)
then later in your http handler (render function) you can execute the following:
if request.path == b'/server/reload-certificates':
request.setHeader("connection", "close")
self.sslPort.connectionLost(reason=None)
self.sslPort.stopListening()
self.sslListen()
return b'ok'
where self.sslListen is the initial setup code:
website_resource = Website()
website_factory = server.Site(website_resource)

def sslListen():
    sslContext = ssl.DefaultOpenSSLContextFactory(
        '/home/user/certs/letsencrypt-privkey.pem',
        '/home/user/certs/letsencrypt-fullchain.pem',
    )
    website_resource.sslPort = reactor.listenSSL(
        port=https_server_port,
        factory=website_factory,
        contextFactory=sslContext,
        interface=https_server_interface
    )

website_resource.sslListen = sslListen  # <---

sslListen()  # invoke once initially

# ...

reactor.run()
Notice that request.setHeader("connection", "close") is optional. It tells the browser that it should close the connection and not reuse it for the next fetch from the server (HTTP/1.1 connections are usually kept open for at least 30 seconds so they can be reused).
If the connection: close header is not sent, then everything will still work, the connection will still be active and usable, but it will still be using the old certificate, which should be no problem if you're just reloading the certificates to refresh them after certbot updated them. New connections from other browsers will start using the new certificates immediately.
I am constructing a model that does large parts of its calculations in a Postgresql database (for performance reasons). It looks somewhat like this:
import psycopg2

def sql_func1(conn):
    # prepare some data, crunch some numbers, etc.
    curs = conn.cursor()
    curs.execute("SOME SQL COMMAND")
    conn.commit()  # commit() lives on the connection, not the cursor
    curs.close()

if __name__ == "__main__":
    connection = psycopg2.connect(dbname='name', user='user', password='pass',
                                  host='localhost', port=1234)
    sql_func1(connection)
    sql_func2(connection)
    sql_func3(connection)
    connection.close()
The script uses around 30 individual functions like sql_func1. Obviously it is a little awkward to manage the connection and cursor in each function all the time. Thus I started using a decorator as described here. Now I can simply wrap sql_func1 with a decorator @db_connect and pass in the connection from there. However, that means I am opening and closing the connection all the time, which is not good practice either. The psycopg2 FAQ says:
Creating a connection can be slow (think of SSL over TCP) so the best practice is to create a single connection and keep it open as long as required. It is also good practice to rollback or commit frequently (even after a single SELECT statement) to make sure the backend is never left "idle in transaction". See also psycopg2.pool for lightweight connection pooling.
Could you please give me some insight into what would be the ideal practice in my case? Should I rather use a decorator that passes a cursor object instead of the connection? If so, please provide a code sample for the decorator. As I am rather new to programming, please let me also know if you think my overall approach is wrong.
What about storing the connection in a global variable without closing it in the finally block? Something like this (according to the example you linked):
cnn = None

def with_connection(f):
    def with_connection_(*args, **kwargs):
        global cnn
        if not cnn:
            cnn = psycopg.connect(DSN)
        try:
            rv = f(cnn, *args, **kwargs)
        except Exception, e:
            cnn.rollback()
            raise
        else:
            cnn.commit()  # or maybe not
        return rv
    return with_connection_
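Usage would then look something like this (the table name is just an example; the decorator injects the connection as the first argument):

@with_connection
def count_employees(cnn):
    curs = cnn.cursor()
    try:
        curs.execute("SELECT count(*) FROM emp")
        return curs.fetchone()[0]
    finally:
        curs.close()

total = count_employees()  # cnn is supplied by the decorator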