Retrieve full deleted document of delete operation using pymongo change stream - python

I've just started getting my hands dirty with MongoDB and Python, so bear with me on this one.
The scenario is as follows:
I have a MongoDB collection and using pymongo's watch I listen to changes that occur.
For the purposes of explaining my problem, let's say that I can only react to anything that happens after the change in the collection.
The problem comes when a delete operation happens in the collection. The change stream only returns the _id of the deleted document, while I am looking for a way to get the full document (much like how it's returned when you insert a new document).
Is this even possible and if yes, could you provide an example?

The simple answer is no, it's not possible to do that in the current version of MongoDB (4.4).
Change streams are very useful, but they can only tell you what happened post-event. For those from a SQL background used to triggers where you can get the "before" and "after" view, this might be frustrating; but it's just the way it is.
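To illustrate what the change stream actually hands you, here is a minimal pymongo sketch (assumptions: a replica set, since change streams are not available on a standalone mongod, plus placeholder connection, database, and collection names):

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
collection = client["mydb"]["mycollection"]

with collection.watch() as stream:
    for change in stream:
        if change["operationType"] == "insert":
            # Inserts carry the whole document in fullDocument.
            print("inserted:", change["fullDocument"])
        elif change["operationType"] == "delete":
            # Deletes only carry the key; the document body is already gone.
            print("deleted _id:", change["documentKey"]["_id"])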

Related

How to check for available space in Elasticsearch with Python

I am completely new to Python, but I need to make a script that will check the Elasticsearch disk space on AWS and return a warning if it is below a certain threshold. I imagine I would have to create an Elasticsearch client instance and build a dictionary for a search query to check the available space, but I'm honestly not sure. I'm really hoping someone can provide some starter code to push me in the right direction.
You use CloudWatch metrics for that, e.g.:
FreeStorageSpace
ClusterUsedSpace
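For example, a rough boto3 sketch that reads FreeStorageSpace and warns below a threshold (the region, domain name, account ID, and threshold are all placeholders; AWS credentials are assumed to be configured):

import boto3
from datetime import datetime, timedelta, timezone

THRESHOLD_MB = 10_000  # FreeStorageSpace is reported in megabytes
now = datetime.now(timezone.utc)

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")
response = cloudwatch.get_metric_statistics(
    Namespace="AWS/ES",
    MetricName="FreeStorageSpace",
    Dimensions=[
        {"Name": "DomainName", "Value": "my-domain"},
        {"Name": "ClientId", "Value": "123456789012"},
    ],
    StartTime=now - timedelta(minutes=15),
    EndTime=now,
    Period=300,
    Statistics=["Minimum"],
)

datapoints = response.get("Datapoints", [])
if datapoints and min(dp["Minimum"] for dp in datapoints) < THRESHOLD_MB:
    print("WARNING: free storage space is below the threshold")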

login to obiee and execute SQL using python

I've tried various ways of extracting reports from Oracle Business Intelligence (not hosted locally, version 11g), and the best I've come up with so far is the pyobiee library here, which is pretty good: https://github.com/kazei92/pyobiee. I've managed to log in and extract reports that I've already written, but in an ideal world I would be able to interrogate the SQL directly. I've tried this using the executeSQL function in pyobiee, but I can only manage to extract a column or two before it can't do any more. I think I'm limited by my understanding of the SQL syntax, which is not a familiar one (it's more logical, with no GROUP BY requirement), and I can't find a decent summary of how to use it. Where I have found summaries, I've followed them and they don't work (https://docs.oracle.com/middleware/12212/biee/BIESQ/toc.htm#BIESQ102). Please can you advise where I can find a better summary of the Logical SQL syntax? The other possibility is that there is something wrong with the pyobiee library (it hasn't been maintained since August). I would be open to using pyodbc or cx_Oracle instead, but I can't work out how to log in using these routes. Please can you advise?
The reason I'm taking this route is because my organisation has mapping tables that are not held in obiee and there is no prospect of getting them in there. So I'm working on extracting using python so that I can add the mapping tables in SQL server.
I advise you to rethink what you are doing. First of all, the Python library is a wrapper around the OBI web services, which in itself isn't wrong, but it's an additional layer of abstraction that hides most of the web services and their functionality. There are way more than three...
Second, the real question is "What exactly are you trying to achieve?". If you simply want data from the OBI server, then you can just as well get it over ODBC. No need for 50 additional technologies in the middle.
As far as LSQL is concerned: Yes, there is a reference: https://docs.oracle.com/middleware/12212/biee/BIESQ/BIESQ.pdf
BUT you will definitely need to know what you want to access, since what governs everything is the RPD: a metadata layer, not a database.
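If you do go the ODBC route, a hedged pyodbc sketch might look like this (the DSN name, credentials, and the subject area/column names are placeholders; the Oracle BI Server ODBC driver has to be installed and configured as a DSN first):

import pyodbc

# Connect to the BI Server through its ODBC DSN; Logical SQL is issued against
# a Subject Area in the RPD's presentation layer, not against physical tables.
conn = pyodbc.connect("DSN=OBIEE;UID=my_user;PWD=my_password", autocommit=True)
cursor = conn.cursor()

# No GROUP BY is required; the BI Server works out the aggregation itself.
cursor.execute('SELECT "Time"."Per Name Year", "Base Facts"."Revenue" FROM "A - Sample Sales"')
for row in cursor.fetchall():
    print(row)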

Best practice to update a column of all documents in Elasticsearch

I'm developing a log analysis system. The input is a set of log files. I have an external Python program that reads the log files and decides whether a record (line) of a log file is "normal" or "malicious". I want to use the Elasticsearch Update API to append my Python program's result ("normal" or "malicious") to Elasticsearch by adding a new column called result, so I can see my program's results clearly via the Kibana UI.
Simply speaking, my Python code and Elasticsearch both consume the same log files independently. Now I want to push the result from my Python code into Elasticsearch. What's the best way to do it?
I can think of several ways:
1. Elasticsearch automatically assigns an ID (_id) to each document. If I could find out how Elasticsearch calculates _id, my Python code could compute it by itself and then update the corresponding Elasticsearch document via _id. The problem is that the official Elasticsearch documentation doesn't say what algorithm it uses to generate _id.
2. Add an ID (like a line number) to the log files myself. Both my program and Elasticsearch would know this ID, so my program could use it to update the document. The downside is that my program would have to search for this ID every time, because it's only a normal field instead of the built-in _id, so the performance would be very bad.
3. Have my Python code get the logs from Elasticsearch instead of reading the log files directly. But this makes the system fragile, as Elasticsearch becomes a critical point. I only want Elasticsearch to be a log viewer for now.
So the first solution seems ideal from my current point of view, but I'm not sure whether there are better ways to do it?
If possible, re-structure your application so that instead of dumping plain-text to a log file you're directly writing structured log information to something like Elasticsearch. Thank me later.
That isn't always feasible (e.g. if you don't control the log source). I have a few opinions on your solutions.
1. This feels super brittle. Elasticsearch does not base _id on the properties of a particular document; it's selected based on existing _id values it has stored (and, I think, also on a random seed). Even if it could work, relying on an undocumented property is a good way to shoot yourself in the foot when dealing with a team that makes breaking changes even to its documented behavior as often as Elasticsearch does.
2. This one actually isn't so bad. Elasticsearch supports manually choosing the id of a document. Even if it didn't, it performs quite well for bulk terms queries and wouldn't be as much of a bottleneck as you might think. If you really have so much data that this could break your application, then Elasticsearch might not be the best tool.
3. This solution is great. It's super extensible and doesn't depend on how the log file is constructed, how you've chosen to index that log in Elasticsearch, or how you're choosing to read it with Python. You just get a document, and if you need to update it, you update it.
Elasticsearch isn't really a worse point of failure here than before (if ES goes down, your app goes down in any of these solutions) -- you're just doing twice as many queries (read and write). If a factor of 2 kills your application, you either need a better solution to the problem (i.e. avoid Elasticsearch), or you need to throw more hardware at it. ES supports all kinds of sharding configurations, and you can make a robust server on the cheap.
One question though, why do you have logs in Elasticsearch that need to be updated with this particular normal/malicious property? If you're the one putting them into ES then just tag them appropriately before you ever store them to prevent the extra read that's bothering you. If that's not an option then you'll still probably be wanting to read ES directly to pull the logs into Python anyway to avoid the enormous overhead of parsing the original log file again.
If this is a one-time hotfix to existing ES data while you're rolling out normal/malicious, then don't worry about a 2x speed improvement. Just throttle the query if you're concerned about bringing down the cluster. The hotfix will execute eventually, and probably faster than if we keep deliberating about the best option.
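To make option 2 concrete, here is a small sketch with the Elasticsearch Python client (8.x-style API; the localhost cluster, the "logs" index, and the file-name-plus-line-number id scheme are all assumptions):

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Index each log line under an _id you control, e.g. "<file>:<line number>".
doc_id = "app.log:42"
es.index(index="logs", id=doc_id, document={"message": "failed login from 10.0.0.5"})

# Later, once the analyzer has classified the line, append its verdict.
es.update(index="logs", id=doc_id, doc={"result": "malicious"})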

sqlalchemy query duration logging

My apologies if this has already been asked, but I couldn't find exactly what I was looking for.
I'm looking for the best way to log the duration of queries from sqlalchemy to syslog.
I've read this: http://docs.sqlalchemy.org/en/rel_0_9/core/engines.html#configuring-logging and I've read about the sqlalchemy signals, but I'm not sure they're fit for what I'm trying to do. The echo and echo_pool flags seem to just log to stdout? I'd like more information than what they provide, too. In my perfect world, for each query, I could get the query string, the duration of the query, and even the python stack trace.
We use sqlalchemy in a couple different ways: We use the ORM, we use the query builder, and we even use text queries passed to execute. If there's a signal that would trip a function no matter which way sqlalchemy is used, that would be perfect.
I was also thinking of finding where in the code it actually sends the query over the wire, and if someone had some insight there, that would be great too.
The more I think about it, the more I think it might not be worth the effort if it's a pain, but if someone out there knew of a hook that could save the day, I thought I'd ask. For the record, I'm using Flask with the Flask-Sqlalchemy extension.
Please let me know if anything is unclear here.
Thanks!
Flask-SQLAlchemy already records information about each query when in debug mode (or when the option is set otherwise). You can get at it with get_debug_queries().
If you are using Flask-DebugToolbar, it provides an overlay to let you explore this data for each request.
If you want to log this, you could add a function to trigger after every request that will record this info.
from flask_sqlalchemy import get_debug_queries

@app.after_request
def record_queries(response):
    for info in get_debug_queries():
        # write to syslog here
        pass
    return response
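For the syslog part, the standard library's SysLogHandler can do the writing; a minimal sketch (the /dev/log socket is a Linux assumption):

import logging
import logging.handlers

# Route a dedicated logger to the local syslog daemon.
query_log = logging.getLogger("query-timing")
query_log.addHandler(logging.handlers.SysLogHandler(address="/dev/log"))
query_log.setLevel(logging.INFO)

# Inside record_queries(), each entry from get_debug_queries() exposes
# .statement, .parameters, .duration and .context, so the loop body could be:
#     query_log.info("duration=%.4fs query=%s", info.duration, info.statement)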

MySql bulk import without writing a file to disk

I have a django site running with a mysql database backend. I accept fairly large uploads from one of the admin users to bulk import some data. The data comes in a format that is slightly different than the form it needs to be in the database so I need to do a little parsing.
I'd like to be able to pivot this data into CSV, write it into a cStringIO object, and then simply use MySQL's bulk import command to load that file. I'd prefer to skip writing the file to disk first, but I can't seem to find a way around it. I've done basically this exact thing with PostgreSQL in the past, but unfortunately this project is on MySQL.
The short version: can I take an in-memory file-like object and somehow use MySQL's bulk import operation with it?
There is an excellent tutorial called Generator Tricks for Systems Programmers that addresses processing large log files, which is similar, but not identical, to your situation. As long as you can perform the needed transform with access to only the current (and possibly previous) data in the stream, this may work for you.
I have mentioned this gem in a number of answers because I think that it introduces a different way of thinking that can be quite valuable. There is a companion piece, A Curious Course on Coroutines and Concurrency, that can seriously twist your head around.
If by "bulk import" you mean LOAD DATA [LOCAL] INFILE then, no, there's no way around first writing the data to some file, damn it all. You (and I) would really like to write the table directly from an array.
But some OSs, like Linux, allow a RAM-resident filesystem that eases some of the hurt. I'm not enough of a sysadmin to know how to set up one of these guys; I had to get my ISP's tech support to do it for me. I found an article that might have useful info.
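If you do end up writing a temporary file anyway, a tmpfs mount keeps it off the physical disk. A rough sketch with pymysql (the connection details, table name, rows, and the /dev/shm path are all assumptions, and local_infile has to be enabled on both client and server):

import csv
import tempfile

import pymysql

conn = pymysql.connect(host="localhost", user="user", password="secret",
                       database="mydb", local_infile=True)
rows = [("alice", 1), ("bob", 2)]  # the already-parsed upload data

# /dev/shm is tmpfs on most Linux systems, so the temporary "file" lives in RAM.
with tempfile.NamedTemporaryFile("w", dir="/dev/shm", suffix=".csv", newline="") as f:
    csv.writer(f).writerows(rows)
    f.flush()
    with conn.cursor() as cur:
        cur.execute(
            "LOAD DATA LOCAL INFILE %s INTO TABLE mytable "
            "FIELDS TERMINATED BY ',' LINES TERMINATED BY '\\n'",
            (f.name,),
        )
    conn.commit()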
HTH
