MongoDB : extremely huge log file?

I set up MongoDB (on Windows) a while ago to store data received from WebSockets. Recently, I checked the size of the dataset I had collected so far. The total size of my data folder where the collections are stored is about 7GB now. Then I discovered, to my surprise, that the log file is 175GB (!!!). For now, I just used the rotate command to generate a new log file and deleted the old one. But of course, I am curious as to why such an extremely large log file is generated. I checked the content of the file. Amongst the millions (or billions?) of lines of messages, here are a few examples of what the messages typically look like:
{"t":{"$date":"2022-01-27T01:52:07.723+01:00"},"s":"I", "c":"-", "id":20883, "ctx":"conn298926","msg":"Interrupted operation as its client disconnected","attr":{"opId":33104664}}
{"t":{"$date":"2022-01-27T01:52:07.723+01:00"},"s":"I", "c":"NETWORK", "id":22944, "ctx":"conn298926","msg":"Connection ended","attr":{"remote":"127.0.0.1:61724","uuid":"88da63c3-37dc-416f-a044-dcc3eae36d8b","connectionId":298926,"connectionCount":39}}
{"t":{"$date":"2022-01-27T01:52:07.726+01:00"},"s":"I", "c":"-", "id":20883, "ctx":"conn298929","msg":"Interrupted operation as its client disconnected","attr":{"opId":33104670}}
{"t":{"$date":"2022-01-27T01:52:07.727+01:00"},"s":"I", "c":"NETWORK", "id":22944, "ctx":"conn298929","msg":"Connection ended","attr":{"remote":"127.0.0.1:61727","uuid":"7fc3e5c7-5687-474f-aa0e-c83f501c16cc","connectionId":298929,"connectionCount":38}}
{"t":{"$date":"2022-01-27T01:52:07.737+01:00"},"s":"I", "c":"-", "id":20883, "ctx":"conn298932","msg":"Interrupted operation as its client disconnected","attr":{"opId":33104677}}
{"t":{"$date":"2022-01-27T01:52:07.737+01:00"},"s":"I", "c":"NETWORK", "id":22944, "ctx":"conn298932","msg":"Connection ended","attr":{"remote":"127.0.0.1:61730","uuid":"0c8c0ecf-7aef-4bab-980b-267a3a0c78b7","connectionId":298932,"connectionCount":37}}
...
It seems there is some connect-disconnect cycle (on various ports?) going on, but I have no idea where it comes from or what causes it.
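To see which message types account for the bulk of such a log, one option is to tally the structured (JSON) log lines by component, id and message text. The following is a minimal sketch, not part of the original setup: the function name tally_log_messages is made up for illustration, and it assumes one JSON document per line, as in the excerpt above.

```python
import json
from collections import Counter


def tally_log_messages(lines):
    """Count (component, id, msg) triples in MongoDB structured log lines."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip truncated or non-JSON lines
        counts[(entry.get("c"), entry.get("id"), entry.get("msg"))] += 1
    return counts


# Three lines copied from the excerpt above.
sample = [
    '{"t":{"$date":"2022-01-27T01:52:07.723+01:00"},"s":"I", "c":"-", "id":20883, "ctx":"conn298926","msg":"Interrupted operation as its client disconnected","attr":{"opId":33104664}}',
    '{"t":{"$date":"2022-01-27T01:52:07.723+01:00"},"s":"I", "c":"NETWORK", "id":22944, "ctx":"conn298926","msg":"Connection ended","attr":{"remote":"127.0.0.1:61724","uuid":"88da63c3-37dc-416f-a044-dcc3eae36d8b","connectionId":298926,"connectionCount":39}}',
    '{"t":{"$date":"2022-01-27T01:52:07.726+01:00"},"s":"I", "c":"-", "id":20883, "ctx":"conn298929","msg":"Interrupted operation as its client disconnected","attr":{"opId":33104670}}',
]

counts = tally_log_messages(sample)
for key, n in counts.most_common():
    print(n, key)
```

For a real log, the same function can be fed an open file object (for example `with open(path) as f: tally_log_messages(f)`), since it reads the input line by line and never loads the whole file into memory.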
BEFORE YOU READ ON: a simple question first that may not require going through all the code below; if these messages are not "critical" (meaning they indicate that something is going wrong), is it possible to simply turn them off in logging? That would be a simple solution. From the dataset I have, I see that my script is functioning correctly and the data is accurately collected, so I assume, it could be sufficient to simply turn off the logging messages of this type. READ FURTHER FOR MORE DETAILS...
My basic async python code for connecting to the WebSockets and receiving messages looks (shortened) like this:
async def run_socket(manager,subscription=None):
    async with websockets.connect(manager.url) as socket:
        # receive messages and process
        while True:
            # receive message
            message = await socket.recv()
            # process data and store
            manager.process_message(message)
    return
def run_stream(manager,subscription):
    while True:
        try:
            # create a new event loop
            loop = asyncio.new_event_loop()
            loop.run_until_complete(run_socket(manager,subscription))
        except KeyboardInterrupt as e:
            print(e)
            loop.close()
            return
        except Exception as e:
            # reconnect in case of exception
            print(e)
            loop.close()
            sleep(5)
    return
# MAIN LOOP
if __name__ == "__main__":
    # manager
    manager = create_websocket_manager(stream)
    threads = []
    for subscription in manager.generate_subscriptions():
        print(subscription)
        # create a new event loop
        loop = asyncio.new_event_loop()
        # create thread for the stream
        thread = threading.Thread(target=run_stream,args=(manager,subscription))
        thread.daemon = False
        # collect threads
        threads.append(thread)
        # run
        thread.start()
    for thread in threads:
        thread.join()
The function create_websocket_manager (not shown here) creates a custom "WebSocket manager" object that parses and stores the messages that arrive from different streams. All objects share the same base class, which (shortened) looks like this:
class WebsocketManager:
    def __init__(self,stream):
        # ... set_collection(self)
        return
    def set_collection(self):
        # ... uri_mongo, db_name, col_name
        self.col = MongoClient(uri_mongo)[db_name][col_name]
    def parse_message(self,message): # -> list
        # ... message -> results
        return results
    def process_message(self,message):
        results = self.parse_message(message)
        if len(results)>0:
            self.col.insert_many(results)
        return
The manager sets a collection for the stream (self.col = MongoClient(uri_mongo)[db_name][col_name]), then parses and stores the arriving messages to the collection.
The main loop shown further above runs various threads for several streams that are processed by the same manager object. Additionally, I run several python instances of the main loop file for different managers. Maybe the different instances of cmd running the WebSockets are what cause the connect-disconnect thing?
If you read till here, thanks already for going through this long question, and I appreciate your feedback :)
Best, JZ

Related

Websocket buffer delayed Python

I'm setting up a WebSocket that receives market data for 33 pairs, processes the data and inserts it into a local MySQL database.
What I've tried so far:
Setting up the WebSocket works fine. I first processed the data in the on_message function for each new message and inserted it directly into the database.
--> The problem was that with 33 pairs the WebSocket buffer was filling up with market data, and after a few minutes the data in the database lagged by at least 10 seconds.
Then I tried processing the data through a thread: the on_message function would start a thread that simply puts the market data into an array, like below
datas=[]
def add_queue(symbol,t,a,b,r_n):
    global datas
    datas.append([symbol,t,a,b,r_n])
if json_msg['ev']=="C":
    symbol=json_msg['p'].replace("/","-")
    round_number=pairs_dict_new[symbol]
    t = Thread(target=add_queue, args=(symbol,json_msg['t'],json_msg['a'],json_msg['b'],round_number,))
    t.start()
and then another function, running in a looping thread, would pick it up and insert it into the database
def add_db():
    global datas
    try:
        # db = mysql.connector.connect(
        #     host="104.168.157.164",
        #     user="bvnwurux_noe_dev",
        #     password="Tickprofile333",
        #     database="bvnwurux_tick_values"
        # )
        while True:
            for x in datas:
                database.add_db(x[0],x[1],x[2],x[3],x[4])
                if x in datas:
                    datas.remove(x)
    except KeyboardInterrupt:
        print("program ending..")
t2 = Thread(target=add_db)
t2.start()
This still gave a delay. The threaded process wasn't using a lot of CPU but rather a lot of RAM, and the result was even worse.
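For comparison, the producer/consumer idea described above can be expressed with the standard library's thread-safe queue.Queue, so that no thread is started per message and the consumer blocks instead of spinning over a shared list. This is only a sketch under assumptions: db_writer and write_batch are hypothetical names, and write_batch stands in for whatever batched insert the database layer offers; a plain list plays the database here.

```python
import queue
import threading


def db_writer(q, write_batch, max_batch=500):
    """Drain q in batches and pass each batch to write_batch; None stops it."""
    while True:
        item = q.get()              # blocks without burning CPU
        if item is None:            # sentinel: shut down
            return
        batch = [item]
        stop = False
        # Grab whatever else is already waiting, up to max_batch items.
        while len(batch) < max_batch:
            try:
                nxt = q.get_nowait()
            except queue.Empty:
                break
            if nxt is None:
                stop = True
                break
            batch.append(nxt)
        write_batch(batch)
        if stop:
            return


ticks = queue.Queue()
written = []                        # stands in for the database table
writer = threading.Thread(target=db_writer, args=(ticks, written.extend))
writer.start()

# on_message would only do this: enqueue and return immediately.
for i in range(1000):
    ticks.put(("EUR-USD", i, 1.1, 1.2, 5))

ticks.put(None)                     # ask the writer to finish
writer.join()
print(len(written))
```

Writing in batches matters because one round trip per tick is usually what makes the consumer fall behind; a single writer thread also avoids sharing one database connection between threads.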
Instead of using a WebSocket with a thread, I tried simple web requests to the API, with one thread per symbol: each thread would loop over a web request and send the result to the database. My issues here were that MySQL connections don't like threads (sometimes two threads would make a request on the same connection at the same time and crash), or the data would still be delayed by the time taken to process it, even without a buffer. The code took so long to process each response that it couldn't keep the delay under 10s.
Here is a little example of the basic code I used to get the data.
pairs={'AUDCAD':5,'AUDCHF':5,'AUDJPY':3,'AUDNZD':5,'AUDSGD':2,'AUDUSD':5,'CADCHF':5,'CADJPY':3,'CHFJPY':3,'EURAUD':5,'EURCAD':5,'EURCHF':5,'EURGBP':5,'EURJPY':3,'EURNZD':5,'EURSGD':5,'EURUSD':5,'GBPAUD':5,'GBPCAD':5,'GBPCHF':5,'GBPJPY':3,'GBPNZD':5,'GBPSGD':5,'GBPUSD':5,'NZDCAD':5,'NZDCHF':5,'NZDJPY':3,'NZDUSD':5,'USDCAD':5,'USDCHF':5,'USDJPY':3,'USDSGD':5,'SGDJPY':3}
def on_open(ws):
    print("Opened connection")
    ws.send('{"action":"auth","params":"<API KEY>"}') #connecting with secret api key
def on_message(ws, message):
    print("msg",message)
    json_msg = json.loads(message)[0]
    if json_msg['status'] == "auth_success": # successfully authenticated
        r = ws.send('{"action":"subscribe","params":"C.*"}') # subscribing to currencies
        print("should subscribe to " + str(pairs))
    #once the websocket is connected to all the pairs, process the data
    # --> process json_msg
if __name__ == "__main__":
    # websocket.enableTrace(True) # just to show all the requests made (debug mode)
    ws = websocket.WebSocketApp("wss://socket.polygon.io/forex",
                                on_open=on_open,
                                on_message=on_message)
    ws.run_forever(dispatcher=rel) # Set dispatcher to automatic reconnection
    rel.signal(2, rel.abort) # Keyboard Interrupt
    rel.dispatch()
I also tried multiprocessing, but this crashed my server because it would use 100% CPU, and requests to the Apache server would then fail to arrive or take a long time to load. It's really a balancing problem.
I'm using an Ubuntu server with 32 CPUs, based in London; the Polygon API is based in NYC.
I also tried with 4 CPUs in Seattle connecting to NYC, but still no luck.
Even with 4 pairs and 32 CPUs, it would eventually reach a 10s delay. I think this is more of a code structure problem.

How to restart a coroutine after a websocket stream stops receiving data?

I'm writing an asyncio application to monitor prices of crypto markets and trade/order events, but for an unknown reason some streams stop receiving data after a few hours. I'm not familiar with the asyncio package and I would appreciate help in finding a solution.
Basically, the code below establishes websocket connections with a crypto exchange to listen to streams of six symbols (ETH/USD, BTC/USD, BNB/USD,...) and trade events from two accounts (user1, user2). The application uses the library ccxtpro. The public method watch_ohlcv gets price streams, while the private methods watchMyTrades and watchOrders get new order and trade events at account level.
The problem is that one or several streams are interrupted after a few hours, and the response object becomes empty or None. I would like to detect and restart these streams after they stop working. How can I do that?
# tasks.py
@app.task(bind=True, name='Start websocket loops')
def start_ws_loops(self):
    ws_loops()
# methods.py
def ws_loops():
    async def method_loop(client, exid, wallet, method, private, args):
        exchange = Exchange.objects.get(exid=exid)
        if private:
            account = args['account']
        else:
            symbol = args['symbol']
        while True:
            try:
                if private:
                    response = await getattr(client, method)()
                    if method == 'watchMyTrades':
                        do_stuff(response)
                    elif method == 'watchOrders':
                        do_stuff(response)
                else:
                    response = await getattr(client, method)(**args)
                    if method == 'watch_ohlcv':
                        do_stuff(response)
                # await asyncio.sleep(3)
            except Exception as e:
                print(str(e))
                break
        await client.close()
    async def clients_loop(loop, dic):
        exid = dic['exid']
        wallet = dic['wallet']
        method = dic['method']
        private = dic['private']
        args = dic['args']
        exchange = Exchange.objects.get(exid=exid)
        parameters = {'enableRateLimit': True, 'asyncio_loop': loop, 'newUpdates': True}
        if private:
            log.info('Initialize private instance')
            account = args['account']
            client = exchange.get_ccxt_client_pro(parameters, wallet=wallet, account=account)
        else:
            log.info('Initialize public instance')
            client = exchange.get_ccxt_client_pro(parameters, wallet=wallet)
        mloop = method_loop(client, exid, wallet, method, private, args)
        await gather(mloop)
        await client.close()
    async def main(loop):
        lst = []
        private = ['watchMyTrades', 'watchOrders']
        public = ['watch_ohlcv']
        for exid in ['binance']:
            for wallet in ['spot', 'future']:
                # Private
                for method in private:
                    for account in ['user1', 'user2']:
                        lst.append(dict(exid=exid,
                                        wallet=wallet,
                                        method=method,
                                        private=True,
                                        args=dict(account=account)
                                        ))
                # Public
                for method in public:
                    for symbol in ['ETH/USD', 'BTC/USD', 'BNB/USD']:
                        lst.append(dict(exid=exid,
                                        wallet=wallet,
                                        method=method,
                                        private=False,
                                        args=dict(symbol=symbol,
                                                  timeframe='5m',
                                                  limit=1
                                                  )
                                        ))
        loops = [clients_loop(loop, dic) for dic in lst]
        await gather(*loops)
    loop = asyncio.new_event_loop()
    loop.run_until_complete(main(loop))

Let me share my experience with you, since I am dealing with the same problem.
In theory, CCXT streams are not expected to stall after running for some time.
Unfortunately, practice and theory differ, and error 1006 happens quite often. I am using Binance, OKX, Bitmex and BTSE (BTSE is not supported by CCXT), and my code runs on an AWS server, so I should not have any connection issues. Binance and OKX are the worst as far as error 1006 is concerned. Honestly, after researching it on Google, I have only understood that 1006 is a NetworkError, and I know CCXT tries to resubscribe to the channel automatically. None of the other explanations I found online convinced me. If somebody could give me more info about this error, I would appreciate it.
In any case, every time an exception is raised, I put it in an exception_list as a dictionary containing info such as the time in ms, method, exchange, description, etc. The exception_list is then passed to a handle_exception method. If the list contains two 1006 exceptions within X time, handle_exception reports that we are out of sync with market data and trading must stop. I cancel all my limit orders and emit a beep (calling for human intervention).
As for your second question:
restart these streams after they stops working, how can I do that
remember that you are Running Tasks Concurrently
If return_exceptions is False (default), the first raised exception is
immediately propagated to the task that awaits on gather(). Other
awaitables in the aws sequence won’t be cancelled and will continue to
run.
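The quoted behaviour can be observed with a small, self-contained experiment. This is an illustrative sketch only; failing and steady are made-up stand-ins for two streams, not CCXT calls.

```python
import asyncio


async def failing():
    await asyncio.sleep(0.01)
    raise ConnectionError("stream dropped")


async def steady(log):
    await asyncio.sleep(0.05)
    log.append("steady finished")


async def demo():
    log = []
    other = asyncio.create_task(steady(log))
    try:
        # return_exceptions defaults to False
        await asyncio.gather(failing(), other)
    except ConnectionError as exc:
        log.append(f"gather raised: {exc}")
    # The sibling task was not cancelled and is still running.
    log.append(f"cancelled: {other.cancelled()}")
    await other
    return log


result = asyncio.run(demo())
print(result)
```

The exception reaches the caller of gather() as soon as the first stream fails, while the other task keeps running in the background. This is why a failed stream has to be detected and restarted explicitly.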
here you can find info about restarting individual tasks in a gather()
In your case, since you are using a single exchange (Binance) and unsubscribe is not implemented in CCXT, you will have to close the connection and restart all the tasks. You can still use the example in the link above to automate it. If you are using more than one exchange, you can design your code in a way that lets you close and restart only the exchange that failed.
Another option would be to define the tasks with more granularity in main, so that every task relates to a single, well-defined exchange/user/method/symbol and subscribes to a single channel. This results in more verbose and less elegant code, but it helps you catch the exception and restart only a specific coroutine.
I am obviously assuming that after error 1006 the channel status is unsubscribed.
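One possible shape for the "one task per stream, restart only what failed" idea is a small supervising wrapper around each coroutine. This is a sketch under assumptions, not CCXT code: supervise and flaky_stream are hypothetical names, and flaky_stream merely simulates a stream that fails twice before running cleanly.

```python
import asyncio


async def supervise(name, make_stream, max_restarts=3, delay=0.0):
    """Run make_stream(); if it raises, wait and start a fresh one."""
    restarts = 0
    while True:
        try:
            await make_stream()
            return restarts             # stream ended normally
        except Exception as exc:
            if restarts >= max_restarts:
                raise                   # give up and let the caller decide
            restarts += 1
            print(f"{name}: {exc!r}, restart {restarts}")
            await asyncio.sleep(delay)


attempts = {"n": 0}


async def flaky_stream():
    # Simulated stream: fails on the first two attempts.
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise ConnectionError("1006")
    await asyncio.sleep(0)


async def main():
    # Each stream gets its own supervisor, so one failure does not
    # force the other streams to restart.
    return await asyncio.gather(
        supervise("ETH/USD", flaky_stream),
        return_exceptions=True,
    )


results = asyncio.run(main())
print(results)
```

In a real application make_stream would open the connection and run the watch loop for exactly one exchange/user/method/symbol, and the restart limit would feed into whatever "stop trading" logic is in place.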
Final thought:
never leave a robot unattended
Professional market makers with a team of engineers working in London do not go to the pub while their algos (usually co-located within the exchange) execute thousands of trades.
I hope this helps you or, at least, points you in the right direction for handling exceptions and restarting tasks.

You need to use callbacks.
For example:
ws = self.ws = await websockets.connect(END_POINTS, compression=None) # step 1
await self.ws.send(SEND_YOUR_SUBSCRIPTION_MESSAGES) # step 2
while True:
    response = await self.ws.recv()
    if response:
        await handler(response)
In the last line, await handler(response), you are sending the response to handler().
This handler() is the callback: it is the function that actually consumes the data you receive from the exchange server.
In handler(), you can check whether the response is your desired data (bid/ask price etc.). If an exception like ConnectionClosedError is thrown instead, you restart the websocket by doing STEP 1 and STEP 2 from within your handler.
So basically, in the callback method, you need to either process the data
or restart the websocket and pass the handler to it again to receive the responses.
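A variant of this idea keeps the restart logic in the receive loop rather than in the handler. The sketch below is an assumption-laden illustration: listen and FakeSocket are hypothetical names, the connection is injected as a connect() coroutine so the example runs without a network, and the built-in ConnectionError stands in for the library's own closed-connection exception (with the websockets package that would presumably be websockets.exceptions.ConnectionClosed).

```python
import asyncio


async def listen(connect, subscribe_msg, handler, max_reconnects=5):
    """Repeat steps 1 and 2 whenever the connection drops."""
    reconnects = 0
    while True:
        ws = await connect()                # step 1
        await ws.send(subscribe_msg)        # step 2
        try:
            while True:
                response = await ws.recv()
                if response:
                    await handler(response)
        except ConnectionError:
            reconnects += 1
            if reconnects > max_reconnects:
                raise


class FakeSocket:
    """Stand-in connection: delivers its messages, then drops."""

    def __init__(self, messages):
        self._messages = list(messages)
        self.sent = []

    async def send(self, msg):
        self.sent.append(msg)

    async def recv(self):
        if not self._messages:
            raise ConnectionError("connection closed")
        return self._messages.pop(0)


async def demo():
    batches = [["a", "b"], ["c"]]
    received = []

    async def connect():
        return FakeSocket(batches.pop(0))

    async def handler(msg):
        received.append(msg)

    try:
        await listen(connect, "subscribe", handler, max_reconnects=1)
    except ConnectionError:
        pass    # gave up after the allowed number of reconnects
    return received


received = asyncio.run(demo())
print(received)
```

The handler stays a plain data consumer here, and the subscription message is re-sent automatically after every reconnect.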
Hope this helps. I could not share the complete code as I would first need to clean out sensitive business logic.

Mixing Synchronous and A-sync code in Python

I'm trying to convert a synchronous flow in Python code, which is based on callbacks, to an asynchronous flow using asyncio.
Basically the code interacts a lot with TCP/UNIX sockets. It reads data from the sockets, manipulates it to make decisions and writes stuff back to the other side. This is going on over multiple sockets at once and data is shared between the contexts to make decisions sometimes.
EDIT :: The code currently is mostly based on registering a callback to a central entity for a specific socket, and having that entity run the callback when the relevant socket is readable (something like "call this function when that socket has data to be read"). Once the callback is called - a bunch of stuff happens, and eventually a new callback is registered for when new data is available. The central entity runs a select over all sockets registered to figure out which callbacks should be called.
I'm trying to do this without refactoring my entire code and making this as seamless as possible to the programmer - so I was trying to think about it like so - all code should run the same way as it does today - but whenever the current code does a socket.recv() to get new data - the process would yield execution to other tasks. When the read returns, it should go back to handling the data from the same point using the new data it got.
To do this, I wrote a new class called AsyncSocket, which interacts with the IO streams of asyncio, and placed the async/await statements almost solely in there, thinking that I would implement the recv method in my class to make it look like a "regular IO socket" to the rest of my code.
So far, this is my understanding of what async programming should allow.
Now to the problem:
My code waits for clients to connect. When one does, each client's context is allowed to read from and write to its own connection.
I've simplified the flow to the following to clarify the problem:
class AsyncSocket():
    def __init__(self,reader,writer):
        self.reader = reader
        self.writer = writer
    def recv(self,numBytes):
        print("called recv!")
        data = self.read_mitigator(numBytes)
        return data
    async def read_mitigator(self,numBytes):
        print("Awaiting of AsyncSocket.reader.read")
        data = await self.reader.read(numBytes)
        print("Done Awaiting of AsyncSocket.reader.read data is %s " % data)
        return data
def mit2(aSock):
    return mit3(aSock)
def mit3(aSock):
    return aSock.recv(100)
async def echo_server(reader, writer):
    print ("New Connection!")
    aSock = AsyncSocket(reader,writer) # create a new A-sync socket class and pass it on the to regular code
    while True:
        data = await some_func(aSock) # this would eventually read from the socket
        print ("Data read is %s" % (data))
        if not data:
            break
        writer.write(data) # echo everything back
async def main(host, port):
    server = await asyncio.start_server(echo_server, host, port)
    await server.serve_forever()
asyncio.run(main('127.0.0.1', 5000))
mit2() and mit3() are synchronous functions that do stuff with the data on the way back before returning to the main client's loop - but here I'm just using them as empty functions.
The problem starts when I play with the implementation of some_func().
A pass through implementation (edit: kind-of-works) - but still has issues :
def some_func(aSock):
    try:
        return (mit2(aSock)) # works
    except:
        print("Error!!!!")
While an implementation which reads the data and does something with it - like adding a suffix before returning, throws an error:
def some_func(aSock):
    try:
        return (mit2(aSock) + "something") # doesn't work
    except:
        print("Error!!!!")
The error (as far as I understand it) means it's not really doing what it should:
New Connection!
called recv!
/Users/user/scripts/asyncServer.py:36: RuntimeWarning: coroutine 'AsyncSocket.read_mitigator' was never awaited
  return (mit2(aSock) + "something") # doesn't work
RuntimeWarning: Enable tracemalloc to get the object allocation traceback
Error!!!!
Data read is None
And the echo server obviously doesn't work.
Obviously my code looks more like option #2, with a lot more stuff in some_func(), mit2() and mit3(), but I can't get this to work. I'm fairly new to asyncio/async/await, so what (rather basic concept, I guess) am I missing?

This code won't work as envisioned:
def recv(self,numBytes):
    print("called recv!")
    data = self.read_mitigator(numBytes)
    return data
async def read_mitigator(self,numBytes):
    ...
You cannot call an async function from a sync function and get the result directly. You must await it, which ensures that you return to the event loop in case the data is not yet ready. This mismatch between async and sync code is sometimes referred to as the issue of function color.
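The effect is easy to reproduce in isolation. In this small sketch (read_data, sync_caller and async_caller are illustrative names, not from the question), the synchronous caller gets back a coroutine object rather than data, which is also why appending a suffix to it fails.

```python
import asyncio


async def read_data():
    await asyncio.sleep(0)
    return b"payload"


def sync_caller():
    # Calling an async function only creates a coroutine object;
    # nothing inside read_data has run yet.
    return read_data()


async def async_caller():
    # Awaiting actually runs read_data and yields its return value.
    return await read_data()


pending = sync_caller()
print(type(pending).__name__)

try:
    pending + b"something"
    error = None
except TypeError as exc:
    error = str(exc)
print(error)

pending.close()     # discard it cleanly to avoid a "never awaited" warning

value = asyncio.run(async_caller())
print(value)
```

This matches the two versions of some_func in the question: the pass-through variant merely hands the coroutine object up to a caller that awaits it, while the variant that adds a suffix tries to operate on the coroutine object itself.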
Since your code is already using non-blocking sockets and an event loop, a good approach to porting it to asyncio might be to first switch to the asyncio event loop. You can use event loop methods like sock_recv to request data:
def start():
    loop = asyncio.get_event_loop()
    sock = make_socket() # make sure it's non-blocking
    # sock_recv is a coroutine method on Python 3.7+, so wrap it to get a future
    future_data = asyncio.ensure_future(loop.sock_recv(sock, 1024))
    future_data.add_done_callback(continue_read)
    # return to the event loop - when some data is ready
    # continue_read will be invoked
def continue_read(future):
    data = future.result()
    print('got', data)
    # ... do something with data, e.g. process it
    # and call sock_sendall with the response
asyncio.get_event_loop().call_soon(start)
asyncio.get_event_loop().run_forever()
Once you have the program working in that mode, you can start moving to coroutines, which allow the code to look like sync code, but work in exactly the same way:
async def start():
    loop = asyncio.get_event_loop()
    sock = make_socket() # make sure it's non-blocking
    data = await loop.sock_recv(sock, 1024)
    # data is available "immediately", meaning the coroutine gets
    # automatically suspended when awaiting data that is not yet
    # ready, and automatically re-scheduled when the data is ready
    print('got', data)
asyncio.run(start())
The next step can be eliminating make_socket and switching to asyncio streams.
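As a rough idea of where that last step leads, here is a minimal streams-based echo handler that appends a suffix, as the question's second some_func tried to do. It is a sketch under assumptions: handle_client and demo are illustrative names, the suffix is a bytes literal because streams deal in bytes, and the server binds to port 0 so the operating system picks a free port for the self-test.

```python
import asyncio


async def handle_client(reader, writer):
    while True:
        data = await reader.read(100)       # suspends until data arrives
        if not data:
            break                           # client closed the connection
        writer.write(data + b"something")   # echo back with a suffix
        await writer.drain()
    writer.close()
    await writer.wait_closed()


async def demo():
    server = await asyncio.start_server(handle_client, "127.0.0.1", 0)
    port = server.sockets[0].getsockname()[1]
    async with server:
        reader, writer = await asyncio.open_connection("127.0.0.1", port)
        writer.write(b"hello")
        await writer.drain()
        reply = await reader.readexactly(len(b"hellosomething"))
        writer.close()
        await writer.wait_closed()
    return reply


reply = asyncio.run(demo())
print(reply)
```

Every function between the socket read and the event loop is a coroutine here, so there is no sync layer that would need to "get the result" of an async call.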

How to receive multiple request in a Tornado application

I have a Tornado web application that can receive GET and POST requests from the client.
Each POST request puts the information it receives in a Tornado Queue. I then pop this information from the queue and use it to perform an operation on the database. This operation can be very slow; it can take several seconds to complete!
While this database operation is running, I want to be able to receive other POSTs (which put other information in the queue) and GETs. The GETs, by contrast, are very fast and must return their result to the client immediately.
The problem is that when I pop from the queue and the slow operation begins, the server doesn't accept other requests from the client. How can I resolve this?
This is the simplified code I have written so far (imports are omitted to avoid a wall of text):
# URLs are defined in a config file
application = tornado.web.Application([
    (BASE_URL, Variazioni),
    (ARTICLE_URL, Variazioni),
    (PROMO_URL, Variazioni),
    (GET_FEEDBACK_URL, Feedback)
])
class Server:
    def __init__(self):
        http_server = tornado.httpserver.HTTPServer(application, decompress_request=True)
        http_server.bind(8889)
        http_server.start(0)
        transactions = TransactionsQueue() # contains the queue and the functions which interact with it
        IOLoop.instance().add_callback(transactions.process)
    def start(self):
        try:
            IOLoop.instance().start()
        except KeyboardInterrupt:
            IOLoop.instance().stop()
if __name__ == "__main__":
    server = Server()
    server.start()
class Variazioni(tornado.web.RequestHandler):
    ''' Handle the POST request. Put the data received in the queue '''
    @gen.coroutine
    def post(self):
        TransactionsQueue.put(self.request.body)
        self.set_header("Location", FEEDBACK_URL)
class TransactionsQueue:
    ''' Handle the queue that contains the data
    When a new request arrives, the generated uuid is put in the queue
    When the data is popped out, the operation on the database begins
    '''
    queue = Queue(maxsize=3)
    @staticmethod
    def put(request_uuid):
        ''' Insert in the queue the uuid in postgres format '''
        TransactionsQueue.queue.put(request_uuid)
    @gen.coroutine
    def process(self):
        ''' Loop over the queue and load the data in the database '''
        while True:
            # request_uuid is in postgres format
            transaction = yield TransactionsQueue.queue.get()
            try:
                # this is the slow operation on the database
                yield self._load_json_in_db(transaction )
            finally:
                TransactionsQueue.queue.task_done()
Moreover, I don't understand why, if I do 5 POSTs in a row, all five items are put in the queue even though the maximum size is 3.

I'm going to guess that you use a synchronous database driver, so _load_json_in_db, although it is a coroutine, is not actually async. Therefore it blocks the entire event loop until the long operation completes. That's why the server doesn't accept more requests until the operation is finished.
Since _load_json_in_db blocks the event loop, Tornado can't accept more requests while it's running, so your queue never grows to its max size.
You need two fixes.
First, use an async database driver written specifically for Tornado, or run database operations on threads using Tornado's ThreadPoolExecutor.
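The thread-based option can be illustrated with the standard library alone. The sketch below uses plain asyncio (on which Tornado 5 and later run their IOLoop) rather than Tornado's own API; load_json_in_db and fast_get are stand-ins invented for the example, with time.sleep playing the part of the blocking database call.

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor


def load_json_in_db(transaction):
    """Stand-in for the slow, blocking database call."""
    time.sleep(0.5)
    return f"stored {transaction}"


async def fast_get(ticks):
    # Represents quick GET handlers that must stay responsive.
    for _ in range(5):
        await asyncio.sleep(0.01)
        ticks.append(time.monotonic())


async def demo():
    loop = asyncio.get_running_loop()
    ticks = []
    with ThreadPoolExecutor(max_workers=4) as pool:
        # The blocking call runs on a worker thread; awaiting the
        # returned future does not block the event loop.
        slow = loop.run_in_executor(pool, load_json_in_db, "txn-1")
        await fast_get(ticks)
        served_during_db_call = not slow.done()
        result = await slow
    return result, len(ticks), served_during_db_call


result, n_ticks, served_during_db_call = asyncio.run(demo())
print(result, n_ticks, served_during_db_call)
```

The five quick "requests" complete while the slow call is still in flight, which is the behaviour the question is after. In a Tornado coroutine the same future can be yielded or awaited in place of the direct call to the database function.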
Once that's done your application will be able to fill the queue, so second, TransactionsQueue.put must do:
TransactionsQueue.queue.put_nowait(request_uuid)
This throws an exception if there are already 3 items in the queue, which I think is what you intend.
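The bounded-queue behaviour can be tried out with asyncio.Queue, used here as a stand-in because it has the same put_nowait semantics; with Tornado's queue the exception to catch would be tornado.queues.QueueFull instead of asyncio.QueueFull. The uuid strings are made up for the example.

```python
import asyncio


async def demo():
    q = asyncio.Queue(maxsize=3)
    accepted, rejected = [], []
    for i in range(5):
        try:
            q.put_nowait(f"uuid-{i}")   # never waits for free space
            accepted.append(i)
        except asyncio.QueueFull:
            rejected.append(i)          # e.g. answer with an error status
    return accepted, rejected, q.qsize()


accepted, rejected, size = asyncio.run(demo())
print(accepted, rejected, size)
```

A request handler could catch the exception and tell the client to retry later, instead of silently accepting more work than the queue is meant to hold.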

How to wait for messages on multiple queues using py-amqplib

I'm using py-amqplib to access RabbitMQ in Python. The application receives requests to listen on certain MQ topics from time to time.
The first time it receives such a request it creates an AMQP connection and a channel and starts a new thread to listen for messages:
connection = amqp.Connection(host = host, userid = "guest", password = "guest", virtual_host = "/", insist = False)
channel = connection.channel()
listener = AMQPListener(channel)
listener.start()
AMQPListener is very simple:
class AMQPListener(threading.Thread):
    def __init__(self, channel):
        threading.Thread.__init__(self)
        self.__channel = channel
    def run(self):
        while True:
            self.__channel.wait()
After creating the connection it subscribes to the topic of interest, like this:
channel.queue_declare(queue = queueName, exclusive = False)
channel.exchange_declare(exchange = MQ_EXCHANGE_NAME, type = "direct", durable = False, auto_delete = True)
channel.queue_bind(queue = queueName, exchange = MQ_EXCHANGE_NAME, routing_key = destination)
def receive_callback(msg):
    self.queue.put(msg.body)
channel.basic_consume(queue = queueName, no_ack = True, callback = receive_callback)
The first time this all works fine. However, it fails on a subsequent request to subscribe to another topic. On subsequent requests I re-use the AMQP connection and AMQPListener thread (since I don't want to start a new thread for each topic) and when I call the code block above the channel.queue_declare() method call never returns. I've also tried creating a new channel at that point and the connection.channel() call never returns, either.
The only way I've been able to get it to work is to create a new connection, channel and listener thread per topic (ie. routing_key), but this is really not ideal. I suspect it's the wait() method that's somehow blocking the entire connection, but I'm not sure what to do about it. Surely I should be able to receive messages with several routing keys (or even on several channels) using a single listener thread?
A related question is: how do I stop the listener thread when that topic is no longer of interest? The channel.wait() call appears to block forever if there are no messages. The only way I can think of is to send a dummy message to the queue that would "poison" it, ie. be interpreted by the listener as a signal to stop.

If you want more than one consumer per channel, just attach another one using basic_consume() and call channel.wait() afterwards. It will listen to all queues attached via basic_consume(). Make sure you define a different consumer tag for each basic_consume().
Use channel.basic_cancel(consumer_tag) if you want to cancel a specific consumer on a queue (i.e. stop listening to a specific topic).
