How do I make PySolr drop a connection?

I'm working on time series charts for 300+ clients.
It is beneficial for us to pull each client separately, as the combined data is huge and in some cases a client's data is resampled or manipulated in a slightly different fashion.
My problem is that the function I loop through to get each client's data opens 3 new threads but never closes them (I'm assuming the connections stay open) when the request completes and the function returns the data.
Once I have the results for a client, I'd like to close that connection. I just can't figure out how to do that and haven't been able to find anything in my searches.
def solr_data_pull(submitterId):
    zookeeper = pysolr.ZooKeeper('ndhhadr1dnp11,ndhhadr1dnp12,ndhhadr1dnp13:2181/solr')
    solr = pysolr.SolrCloud(zookeeper, collection='tran_timings', timeout=60)
    query = ('SubmitterId:' + str(submitterId) + ' AND Tier:' + tier + ' AND Mode:' + mode + ' '
             'AND Timestamp:[' + str(start_period) + ' TO ' + str(end_period) + '] ')
    results = solr.search(rows=50000, q=query, fl=fl_list)
    return pd.DataFrame(list(results))

PySolr uses the Session object from requests as its underlying HTTP library (which in turn uses urllib3's connection pooling), so calling solr.get_session().close() should close all connections and drain the pool:
def close(self):
    """Closes all adapters and as such the session"""
(SolrCloud is an extension of Solr, which has the get_session() method.)
For disconnecting from ZooKeeper - which you probably shouldn't do if it's a long-running session, as it'll have to set up watches etc. again - you can use the .zk object directly on your SolrCloud instance. zk is a KazooClient:
stop()
    Gracefully stop this Zookeeper session.
close()
    Free any resources held by the client.
    This method should be called on a stopped client before
    it is discarded. Not doing so may result in filehandles
    being leaked.
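Putting the two quotes together, the open-search-close lifecycle can be sketched as follows. This is only an illustration: a stub class stands in for pysolr.SolrCloud so the pattern is runnable on its own; the real class exposes the get_session() method per the docs quoted above, and the try/finally guarantees the pool is drained even if the search raises.

```python
class StubSolrCloud:
    """Hypothetical stand-in for pysolr.SolrCloud, used only to show the
    open -> search -> close lifecycle."""

    def __init__(self):
        self.session_closed = False

    def search(self, q, **kwargs):
        # The real client would query Solr here; we return a canned row.
        return [{'SubmitterId': q}]

    def get_session(self):
        return self

    def close(self):
        # Mimics requests.Session.close(): closes all adapters, draining the pool.
        self.session_closed = True


def solr_data_pull(solr, submitter_id):
    try:
        return list(solr.search(q='SubmitterId:' + str(submitter_id), rows=50000))
    finally:
        # Runs even if search() raises, so the connection pool is always drained.
        solr.get_session().close()


solr = StubSolrCloud()
rows = solr_data_pull(solr, 42)
print(solr.session_closed)   # True
```

With the real objects, you would build the SolrCloud instance inside the function and close its session in the same finally block, keeping the ZooKeeper connection alive across calls.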


How to keep an inactive connection open with PycURL?

Pseudo-code to better explain question:
#!/usr/bin/env python2.7
import pycurl, threading

def threaded_work():
    conn = pycurl.Curl()
    conn.setopt(pycurl.TIMEOUT, 10)
    # Make a request to host #1 just to open the connection to it.
    conn.setopt(pycurl.URL, 'https://host1.example.com/')
    conn.perform_rs()
    while not condition_that_may_take_very_long:
        conn.setopt(pycurl.URL, 'https://host2.example.com/')
        print 'Response from host #2: ' + conn.perform_rs()
    # Now, after what may be a very long time, we must request host #1 again
    # with a (hopefully) already established connection.
    conn.setopt(pycurl.URL, 'https://host1.example.com/')
    print 'Response from host #1, hopefully with an already established connection from above: ' + conn.perform_rs()
    conn.close()

for _ in xrange(30):
    # Multiple threads must work with host #1 and host #2 individually.
    threading.Thread(target=threaded_work).start()
I am omitting extra, unnecessary details for brevity so that the main problem stays in focus.
As you can see, I have multiple threads that must work with two different hosts, host #1 and host #2. Mostly, the threads will be working with host #2 until a certain condition is met. That condition may take hours or even longer to be met, and will be met at different times in different threads. Once the condition (condition_that_may_take_very_long in the example) is met, I would like host #1 to be requested as fast as possible with the connection that I have already established at the start of the threaded_work method. Is there an efficient way to accomplish this (I'm open to the suggestion of using two PycURL handles, too)?
Pycurl uses libcurl. libcurl keeps connections alive by default after use, so as long as you keep the handle alive and use it for the subsequent transfer, it will keep the connection alive and ready for reuse.
However, due to modern networks and network equipment (NATs, firewalls, web servers), connections without traffic are often killed off relatively soon, so the chance of an idle connection actually still working after "hours" is very slim; such reuse is a rare occurrence. Typically, libcurl will then discover that the connection has been killed in the meantime and create a new one at the next use.
Additionally, and in line with the above, since libcurl 7.65.0 it defaults to not reusing connections that are older than 118 seconds. This is changeable with the CURLOPT_MAXAGE_CONN option. The reason is that such old connections barely ever work, so skipping the cycle of keeping them around, detecting them as dead and reissuing the request is an optimization.
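The keep-the-handle-alive idea is not specific to pycurl. As a sketch in stdlib terms (a swapped-in illustration, not the pycurl API): http.client.HTTPConnection likewise reuses one TCP connection for as long as you keep the object around, shown here against a throwaway local HTTP/1.1 server.

```python
import http.client
import http.server
import threading

class Handler(http.server.BaseHTTPRequestHandler):
    protocol_version = 'HTTP/1.1'          # HTTP/1.1 enables keep-alive between requests

    def do_GET(self):
        body = b'ok'
        self.send_response(200)
        self.send_header('Content-Length', str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):          # silence per-request logging
        pass

server = http.server.ThreadingHTTPServer(('localhost', 0), Handler)
threading.Thread(target=server.serve_forever, daemon=True).start()

# One long-lived handle: both requests travel over the same TCP connection.
conn = http.client.HTTPConnection('localhost', server.server_address[1])
conn.request('GET', '/')
first = conn.getresponse().read()
conn.request('GET', '/')
second = conn.getresponse().read()
conn.close()
server.shutdown()
```

The same caveat from the answer applies: if the handle sits idle for hours, the connection underneath may be gone, and the client has to detect that and reconnect.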

Websocket Threading

Below is the code to receive live ticks using WebSocket. Each time a tick is received, the callback function on_ticks() is called and it prints the ticks.
Can I spawn a single thread in the on_ticks() function and call the store_ticks() function to store the ticks in the database? If yes, can someone please show how it can be done? Or is there any other way to call store_ticks() and store the ticks each time they are received?
from kiteconnect import KiteTicker

kws = KiteTicker("your_api_key", "your_access_token")

def on_ticks(ws, ticks):
    print(ticks)

def on_connect(ws, response):
    # Callback on successful connect.
    # Subscribe to a list of instrument_tokens
    ws.subscribe([738561, 5633])

def store_ticks():
    # Store ticks here
    pass

def on_close(ws, code, reason):
    # On connection close stop the main loop
    # Reconnection will not happen after executing `ws.stop()`
    ws.stop()

# Assign the callbacks.
kws.on_ticks = on_ticks
kws.on_connect = on_connect
kws.on_close = on_close
kws.connect()
If the reason you want to spawn a new thread is to avoid delays, I'd say don't bother.
I have been using mysql-client (the MySQLdb connector) with a MariaDB server, subscribed to 100+ instruments in Full mode, for the past 2 months, and there have been no delays in writing the ticks to the DB.
Also, we do not know when and how many ticks we'd receive once we start the ticker. This makes it hard to time/count and close the thread and DB connection. You could end up exhausting the connection limit and the thread count really fast. (DB connection pooling is overkill here.)
The reason I use the MySQLdb connector and not pymysql: I've seen an approx. 20% increase in write times while using pymysql. This wouldn't be obvious with live ticks. I had cloned a medium-sized DB (1M+ rows), dumped it to a DataFrame in Python, wrote it row by row to another DB, and benchmarked the result over 10 iterations.
The reason I use MariaDB: all the features of MySQL Enterprise Edition, without the Oracle fuss.
Just make sure that you set a decent amount of memory for the DB server you use.
This creates breathing space for the DB's buffer, just in case.
Avoiding a remote server and sticking to a local server also helps to a great extent.
If you want to back up the data from local to the cloud, you can set up a daily job to dump locally, export to the cloud, and load into the DB there.
If you are looking for a walkthrough, this page has an example already, along with a code walk through video.
Edit:
I just made my code public here
You could modify your store_ticks() function to
def store_ticks(ticks):
    # code to store tick into database
    pass
and then modify your on_ticks function to:
def on_ticks(ws, ticks):
    print(ticks)
    store_ticks(ticks)
What goes inside store_ticks(ticks) is dependent on what database you want to use and what info exactly you wish to store in there.
EDIT:
To spawn a new thread for store_ticks(), use the _thread module:
import _thread

def store_ticks(ticks):
    # code to store tick into database
    pass

def on_ticks(ws, ticks):
    print(ticks)
    try:
        _thread.start_new_thread(store_ticks, (ticks,))
    except Exception:
        # unable to start the thread, probably want some logging here
        pass
Alternatively:
import a Queue and threading
have on_ticks() insert the data into the Queue
have store_ticks() contain the code to save to the database and drain the Queue
start another daemon thread sharing the Queue with store_ticks()
PS: very lazy to open the editor and write code
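A minimal sketch of that queue-plus-daemon-thread idea, stdlib only. The database write is replaced by an in-memory list, and the tick shape is made up for illustration; real code would open a DB connection in the consumer thread and INSERT there.

```python
import queue
import threading

tick_queue = queue.Queue()
stored = []                      # stand-in for the database table

def store_ticks():
    # Daemon consumer: blocks on the queue and persists each batch of ticks.
    while True:
        ticks = tick_queue.get()
        stored.append(ticks)     # real code would INSERT into the DB here
        tick_queue.task_done()

def on_ticks(ws, ticks):
    # The websocket callback only enqueues, so it returns immediately
    # and never delays the ticker.
    tick_queue.put(ticks)

threading.Thread(target=store_ticks, daemon=True).start()

# Simulate one incoming tick batch:
on_ticks(None, [{'instrument_token': 738561, 'last_price': 100.0}])
tick_queue.join()                # wait until the consumer has drained the queue
```

Because the consumer owns the single DB connection, there is no risk of exhausting the connection limit no matter how many tick batches arrive.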

JIRA API - Python: Listen to jira.search_issues and execute when it changes

I need to listen to the return of jira.search_issues(jql_str="status = 'WAITING FOR SUPPORT'") and, when it changes, execute an os.system() call.
I was not willing to use a while loop, to avoid generating too many connections.
Using a while loop does not generate any more connections than you already have. Once the jira object exists, the connection has been made and will persist regardless of what calls you make. However, if you mean that you are not willing to make that many calls to the API, then there is no way in which you can "listen": by definition a listener is continuously waiting and asking whether anything has changed. If you are not willing to listen, then you have a few options:
Perform jira.search_issues(jql_str="status = 'WAITING FOR SUPPORT'") in a while loop once every 30 minutes.
Have some other service listen for you; for instance the Automation for Jira plugin is a good option in this case.
Run the while loop in a separate thread.
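Options 1 and 3 combined can be sketched as a small polling helper. This is a generic sketch: jira.search_issues is replaced by any zero-argument callable, the 30-minute interval is a parameter, and the `iterations` argument exists only so the demo terminates.

```python
import threading
import time

def watch(poll, on_change, interval=1800, iterations=None):
    """Call poll() every `interval` seconds; fire on_change(result) when the result differs."""
    last = poll()
    n = 0
    while iterations is None or n < iterations:
        time.sleep(interval)
        current = poll()
        if current != last:
            on_change(current)
            last = current
        n += 1

# Demo with a fake "search" whose result changes on the third call:
results = iter([['ISSUE-1'], ['ISSUE-1'], ['ISSUE-1', 'ISSUE-2']])
changes = []
watch(lambda: next(results), changes.append, interval=0, iterations=2)
print(changes)   # [['ISSUE-1', 'ISSUE-2']]
```

In real use you would run it in a background thread, something like: threading.Thread(target=watch, args=(lambda: jira.search_issues(jql_str="status = 'WAITING FOR SUPPORT'"), handle_change), daemon=True).start().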

In this Python 3 client-server example, client can't send more than one message

This is a simple client-server example where the server returns whatever the client sends, but reversed.
Server:
import socketserver

class MyTCPHandler(socketserver.BaseRequestHandler):
    def handle(self):
        self.data = self.request.recv(1024)
        print('RECEIVED: ' + str(self.data))
        self.request.sendall(str(self.data)[::-1].encode('utf-8'))

server = socketserver.TCPServer(('localhost', 9999), MyTCPHandler)
server.serve_forever()
Client:
import socket
import threading
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect(('localhost', 9999))

def readData():
    while True:
        data = s.recv(1024)
        if data:
            print('Received: ' + data.decode('utf-8'))

t1 = threading.Thread(target=readData)
t1.start()

def sendData():
    while True:
        intxt = input()
        s.send(intxt.encode('utf-8'))

t2 = threading.Thread(target=sendData)
t2.start()
I took the server from an example I found on Google, but the client was written from scratch. The idea was to have a client that can keep sending and receiving data from the server indefinitely.
Sending the first message with the client works. But when I try to send a second message, I get this error:
ConnectionAbortedError: [WinError 10053] An established connection was
aborted by the software in your host machine
What am I doing wrong?
For TCPServer, the handle method of the handler gets called once to handle the entire session. This may not be entirely clear from the documentation, but socketserver is, like many libraries in the stdlib, meant to serve as clear sample code as well as to be used directly, which is why the docs link to the source, where you can clearly see that it's only going to call handle once per connection (TCPServer.get_request is defined as just calling accept on the socket).
So, your server receives one buffer, sends back a response, and then quits, closing the connection.
To fix this, you need to use a loop:
def handle(self):
    while True:
        self.data = self.request.recv(1024)
        if not self.data:
            print('DISCONNECTED')
            break
        print('RECEIVED: ' + str(self.data))
        self.request.sendall(str(self.data)[::-1].encode('utf-8'))
A few side notes:
First, using BaseRequestHandler on its own only allows you to handle one client connection at a time. As the introduction in the docs says:
These four classes process requests synchronously; each request must be completed before the next request can be started. This isn’t suitable if each request takes a long time to complete, because it requires a lot of computation, or because it returns a lot of data which the client is slow to process. The solution is to create a separate process or thread to handle each request; the ForkingMixIn and ThreadingMixIn mix-in classes can be used to support asynchronous behaviour.
Those mixin classes are described further in the rest of the introduction, and farther down the page, and at the bottom, with a nice example at the end. The docs don't make it clear, but if you need to do any CPU-intensive work in your handler, you want ForkingMixIn; if you need to share data between handlers, you want ThreadingMixIn; otherwise it doesn't matter much which you choose.
Note that if you're trying to handle a large number of simultaneous clients (more than a couple dozen), neither forking nor threading is really appropriate—which means TCPServer isn't really appropriate. For that case, you probably want asyncio, or a third-party library (Twisted, gevent, etc.).
Calling str(self.data) is a bad idea. You're just going to get the source-code-compatible representation of the byte string, like b'spam\n'. What you want is to decode the byte string into the equivalent Unicode string: self.data.decode('utf8').
There's no guarantee that each sendall on one side will match up with a single recv on the other side. TCP is a stream of bytes, not a stream of messages; it's perfectly possible to get half a message in one recv, and two and a half messages in the next one. When testing with a single connection on localhost with the system under light load, it will probably appear to "work", but as soon as you try to deploy any code that assumes that each recv gets exactly one message, your code will break. See Sockets are byte streams, not message streams for more details. Note that if your messages are just lines of text (as they are in your example), using StreamRequestHandler and its rfile attribute, instead of BaseRequestHandler and its request attribute, solves this problem trivially.
You probably want to set server.allow_reuse_address = True. Otherwise, if you quit the server and re-launch it again too quickly, it'll fail with an error like OSError: [Errno 48] Address already in use.
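The side notes above can be combined into one runnable sketch: ThreadingMixIn for concurrent clients, StreamRequestHandler for line framing, allow_reuse_address for quick restarts, and the decode fix from the third note (this is an illustrative composite, not the asker's exact code; port 0 asks the OS for any free port).

```python
import socket
import socketserver
import threading

class ReverseHandler(socketserver.StreamRequestHandler):
    def handle(self):
        # rfile yields one complete line per iteration, which sidesteps the
        # "recv is not message-framed" problem for line-based protocols.
        for raw in self.rfile:
            line = raw.decode('utf-8').rstrip('\n')
            # Decode before reversing, so we reverse the text, not its repr().
            self.wfile.write((line[::-1] + '\n').encode('utf-8'))

class ThreadedTCPServer(socketserver.ThreadingMixIn, socketserver.TCPServer):
    allow_reuse_address = True   # avoid "Address already in use" on restart
    daemon_threads = True        # handler threads won't block interpreter exit

server = ThreadedTCPServer(('localhost', 0), ReverseHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

# Quick self-test: two messages over a single connection.
replies = []
with socket.create_connection(server.server_address) as s:
    f = s.makefile('rw', encoding='utf-8', newline='\n')
    for msg in ('spam', 'hello'):
        f.write(msg + '\n')
        f.flush()
        replies.append(f.readline().strip())
server.shutdown()
print(replies)   # ['maps', 'olleh']
```

Each connected client gets its own thread and its own session loop, and each line comes back reversed, for as many messages as the client cares to send.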

Python Twisted, SSL Timeout Error

from twisted.web.resource import Resource
from twisted.web.server import Site, Session
from twisted.internet import ssl
from twisted.internet import reactor

class Echo(Resource):
    def render_GET(self, request):
        return "GET"

class WebSite(Resource):
    def start(self):
        factory = Site(self, timeout=5)
        factory.sessionFactory = Session
        self.putChild("echo", Echo())
        reactor.listenSSL(443, factory, ssl.DefaultOpenSSLContextFactory('privkey.pem', 'cacert.pem'))
        #reactor.listenTCP(8080, factory)
        self.sessions = factory.sessions

if __name__ == '__main__':
    ws = WebSite()
    ws.start()
    reactor.run()
With the code above, when I enter the URL https://localhost/echo in the web browser, it gets the page. If I try to reload the page 5 seconds later, the reload gets stuck and the page does not refresh. On a second reload attempt, it gets the page instantly.
When I run the code with reactor.listenTCP(8080, factory) instead, no such problem occurs (I can reload without getting stuck and get the page instantly).
The problem can be reproduced with Chrome and Firefox, but with Ubuntu's Epiphany browser it does not occur.
I could not understand why this happens.
Any comment that helps understand or solve the problem will be appreciated.
Extra info:
When I use listenSSL, the file descriptor associated with the connection is not closed after the timeout. While reloading, it stays open, and only on the second reload is it closed and a new file descriptor opened (and I get the page instantly).
When I use listenTCP, the file descriptor closes after the timeout, and when I reload the page a new file descriptor is opened and the page returns instantly.
Also, with a Telnet connection, connections time out as expected in both cases.
A Twisted client that connects to this server also sees timeouts as expected.
The class that times out connections is the TimeoutMixin class, and it uses the transport.loseConnection() method to time out connections.
Somehow, DefaultOpenSSLContextFactory keeps using the connection(?), so the loseConnection method waits for the transport to finish, and during that time it doesn't accept any activity on the connection.
According to the Twisted documentation:
In the code above, loseConnection is called immediately after writing to the transport. The loseConnection call will close the connection only when all the data has been written by Twisted out to the operating system, so it is safe to use in this case without worrying about transport writes being lost. If a producer is being used with the transport, loseConnection will only close the connection once the producer is unregistered.
In some cases, waiting until all the data is written out is not what
we want. Due to network failures, or bugs or maliciousness in the
other side of the connection, data written to the transport may not be
deliverable, and so even though loseConnection was called the
connection will not be lost. In these cases, abortConnection can be
used: it closes the connection immediately, regardless of buffered
data that is still unwritten in the transport, or producers that are
still registered. Note that abortConnection is only available in
Twisted 11.1 and newer.
As a result, when I replace loseConnection() with abortConnection() in the TimeoutMixin class (by overriding it), the problem no longer occurs.
When I figure out why loseConnection is not enough to close the connection in these specific situations, I'll note it here. (Any comment about it will be appreciated.)
