sqlalchemy query duration logging - python

My apologies if this has already been asked, but I couldn't find exactly what I was looking for.
I'm looking for the best way to log the duration of queries from sqlalchemy to syslog.
I've read this: http://docs.sqlalchemy.org/en/rel_0_9/core/engines.html#configuring-logging and I've read about the sqlalchemy signals, but I'm not sure they're fit for what I'm trying to do. The echo and echo_pool flags seem to just log to stdout? I'd like more information than what they provide, too. In my perfect world, for each query, I could get the query string, the duration of the query, and even the python stack trace.
We use sqlalchemy in a couple different ways: We use the ORM, we use the query builder, and we even use text queries passed to execute. If there's a signal that would trip a function no matter which way sqlalchemy is used, that would be perfect.
I was also thinking of finding where in the code it actually sends the query over the wire, and if someone had some insight there, that would be great too.
The more I think about it, the more I think it might not be worth the effort if it's a pain, but if someone out there knew of a hook that could save the day, I thought I'd ask. For the record, I'm using Flask with the Flask-Sqlalchemy extension.
Please let me know if anything is unclear here.
Thanks!

Flask-SQLAlchemy already records information about each query when the app is in debug mode (or when the SQLALCHEMY_RECORD_QUERIES option is set). You can get at it with get_debug_queries().
If you are using Flask-DebugToolbar, it provides an overlay to let you explore this data for each request.
If you want to log this, you can register a function to run after every request and write that information out.
import syslog
from flask_sqlalchemy import get_debug_queries

@app.after_request
def record_queries(response):
    for info in get_debug_queries():
        # each entry has .statement, .parameters, .duration and .context
        syslog.syslog("%.4fs: %s" % (info.duration, info.statement))
    return response
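If you want a single hook that fires no matter how the query was issued (ORM, query builder, or a text string passed to execute()), SQLAlchemy's engine events cover that. A minimal sketch, assuming plain SQLAlchemy 0.9+; the logger name and message format are my own, and you would attach a logging.handlers.SysLogHandler to the logger to ship it to syslog:

import logging
import time
import traceback

from sqlalchemy import event
from sqlalchemy.engine import Engine

logger = logging.getLogger("sqltime")

@event.listens_for(Engine, "before_cursor_execute")
def before_cursor_execute(conn, cursor, statement, parameters, context, executemany):
    # remember when this statement started, on the connection itself
    conn.info.setdefault("query_start_time", []).append(time.time())

@event.listens_for(Engine, "after_cursor_execute")
def after_cursor_execute(conn, cursor, statement, parameters, context, executemany):
    total = time.time() - conn.info["query_start_time"].pop(-1)
    # statement is the raw SQL string; format_stack() gives the Python call stack
    logger.info("%.4fs: %s\n%s", total, statement, "".join(traceback.format_stack()))

The listeners are registered on the Engine class itself, so they also fire on the engine Flask-SQLAlchemy creates for you.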

Related

Retrieve full deleted document of delete operation using pymongo change stream

I've just started getting my hands dirty with MongoDB and Python, so bear with me on this one.
The scenario is as follows:
I have a MongoDB collection and using pymongo's watch I listen to changes that occur.
For the purposes of explaining my problem, let's say that I can only react to anything that happens after the change in the collection.
The problem comes when there is a delete operation happening in the collection. The change stream only returns the _id of the deleted document, while I am looking for a way to get the full detailed document (much like how it's returned when you insert a new document).
Is this even possible and if yes, could you provide an example?
The simple answer is no, it's not possible to do that in the current version of MongoDB (4.4).
Change streams are very useful, but they can only tell you what happened post-event. For those from a SQL background used to triggers where you can get the "before" and "after" view, this might be frustrating; but it's just the way it is.
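For illustration, a minimal sketch of what a pymongo change-stream loop actually receives for a delete (the database and collection names here are made up); the delete event carries only the documentKey, not the old document body:

from pymongo import MongoClient

client = MongoClient()
coll = client["mydb"]["mycoll"]  # hypothetical database and collection

with coll.watch() as stream:
    for change in stream:
        if change["operationType"] == "delete":
            # only the _id survives; the document itself is already gone
            print("deleted:", change["documentKey"]["_id"])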

login to obiee and execute SQL using python

I've tried various ways of extracting reports from Oracle Business Intelligence (not hosted locally, version 11g), and the best I've come up with so far is the pyobiee library, which is pretty good: https://github.com/kazei92/pyobiee. I've managed to log in and extract reports that I've already written, but in an ideal world I would be able to interrogate the data with SQL directly. I've tried this using the executeSQL function in pyobiee, but I can only manage to extract a column or two before it can't do any more. I think I'm limited by my understanding of the SQL dialect, which is not a familiar one (it's Logical SQL, with no GROUP BY requirement), and I can't find a decent summary of how to use it. Where I have found summaries, I've followed them and it doesn't work (https://docs.oracle.com/middleware/12212/biee/BIESQ/toc.htm#BIESQ102). Please can you advise where I can find a better summary of the Logical SQL syntax? The other possibility is that there is something wrong with the pyobiee library (it hasn't been maintained since August). I would be open to using pyodbc or cx_Oracle instead, but I can't work out how to log in using these routes. Please can you advise?
The reason I'm taking this route is that my organisation has mapping tables that are not held in OBIEE and there is no prospect of getting them in there. So I'm working on extracting with Python so that I can add in the mapping tables in SQL Server.
I advise you to rethink what you are doing. First of all, the Python library is a wrapper around the OBI web services, which in itself isn't wrong, but it is an additional layer of abstraction that hides most of the web services and their functionality. There are way more than three of them...
Second, the real question is "What exactly are you trying to achieve?". If you simply want data from the OBI server, then you can just as well get that over ODBC. No need for 50 additional technologies in the middle.
As far as LSQL is concerned: Yes, there is a reference: https://docs.oracle.com/middleware/12212/biee/BIESQ/BIESQ.pdf
BUT you will definitely need to know what you want to access since what's governing things is the RPD. A metadata layer. Not a database.
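If you do go the ODBC route, the connection itself is unremarkable. A rough sketch with pyodbc, where the DSN name, credentials, and the subject area/column names are all placeholders and the DSN has to be configured against the BI Server's ODBC driver first:

import pyodbc

# the DSN must point at the OBIEE BI Server ODBC driver; everything here is a placeholder
conn = pyodbc.connect("DSN=AnalyticsWeb;UID=weblogic;PWD=secret")
cursor = conn.cursor()

# Logical SQL: presentation columns selected from a subject area, no GROUP BY needed
cursor.execute('SELECT "Sales"."Region", "Sales"."Revenue" FROM "Sales Subject Area"')
for row in cursor.fetchall():
    print(row)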

Best practice to update a column of all documents in Elasticsearch

I'm developing a log analysis system. The input is log files. I have an external Python program that reads the log files and decides whether a record (line) of the log files is "normal" or "malicious". I want to use the Elasticsearch Update API to append my Python program's result ("normal" or "malicious") to Elasticsearch by adding a new column called result, so I can see my program's results clearly via the Kibana UI.
Simply speaking, my Python code and Elasticsearch both use log files as input respectively. Now I want to update the result from Python code to Elasticsearch. What's the best way to do it?
I can think of several ways:
1. Elasticsearch automatically assigns an ID (_id) to each document. If I can find out how Elasticsearch calculates _id, then my Python code can calculate it by itself and update the corresponding Elasticsearch document via _id. The problem is that the official Elasticsearch documentation doesn't say what algorithm it uses to generate _id.
2. Add an ID (like a line number) to the log files myself. Both my program and Elasticsearch will know this ID, so my program can use it to update. However, the downside is that my program has to search for this ID every time, because it's only a normal field instead of the built-in _id. The performance will be very bad.
3. My Python code gets the logs from Elasticsearch instead of reading the log files directly. But this makes the system fragile, as Elasticsearch becomes a critical point. I only want Elasticsearch to be a log viewer currently.
So the first solution looks ideal from where I stand, but I'm not sure whether there are better ways to do it?
If possible, re-structure your application so that instead of dumping plain-text to a log file you're directly writing structured log information to something like Elasticsearch. Thank me later.
That isn't always feasible (e.g. if you don't control the log source). I have a few opinions on your solutions.
1. This feels super brittle. Elasticsearch does not base _id on the properties of a particular document; it's selected based off the existing _id fields it has stored (and, I think, also off a random seed). Even if it could work, relying on an undocumented property is a good way to shoot yourself in the foot when dealing with a team that makes breaking changes even to its documented code as often as Elasticsearch does.
2. This one actually isn't so bad. Elasticsearch supports manually choosing the id of a document (see the sketch after this list). Even if it didn't, it performs quite well for bulk terms queries and wouldn't be as much of a bottleneck as you might think. If you really have so much data that this could break your application, then Elasticsearch might not be the best tool.
3. This solution is great. It's super extensible and doesn't rely on a complicated dependence on how the log file is constructed, how you've chosen to index that log in Elasticsearch, and how you're choosing to read it with Python. Rather, you just get a document, and if you need to update it then you do that updating.
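To make option 2 concrete, a minimal sketch with the official elasticsearch Python client; the index name, id scheme, and field name are hypothetical, and exact parameter names vary a bit between client versions:

from elasticsearch import Elasticsearch

es = Elasticsearch()

# index each log line under an _id you control, e.g. "<file>:<line number>"
doc_id = "auth.log:1042"
es.index(index="logs", id=doc_id, body={"message": "Failed password for root ..."})

# later, once the analyzer has made a decision, update only the result field
es.update(index="logs", id=doc_id, body={"doc": {"result": "malicious"}})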
Elasticsearch isn't really a worse point of failure here than before (if ES goes down, your app goes down in any of these solutions) -- you're just doing twice as many queries (read and write). If a factor of 2 kills your application, you either need a better solution to the problem (i.e. avoid Elasticsearch), or you need to throw more hardware at it. ES supports all kinds of sharding configurations, and you can make a robust server on the cheap.
One question though, why do you have logs in Elasticsearch that need to be updated with this particular normal/malicious property? If you're the one putting them into ES then just tag them appropriately before you ever store them to prevent the extra read that's bothering you. If that's not an option then you'll still probably be wanting to read ES directly to pull the logs into Python anyway to avoid the enormous overhead of parsing the original log file again.
If this is a one-time hotfix to existing ES data while you're rolling out normal/malicious, then don't worry about a 2x speed improvement. Just throttle the query if you're concerned about bringing down the cluster. The hotfix will execute eventually, and probably faster than if we keep deliberating about the best option.

Way to display number of queries on a page? (Google App Engine)

I used to program procedural PHP with direct MySQL queries. So it was real easy to see what was going on in terms of hitting the DB. Now that I'm using the MVC pattern with Python on GAE, it's all a little mysterious to me :) I generally think I know where all the DB activity is. But I was wondering if there was a way to figure out the number of times we hit the DB (App Engine datastore) on a given page (view). Just in case I program something the wrong way expecting 4 hits and I'm actually in some strange loops that hit it 200 times. And I think it would just be good to have so I can get a rough idea of what's going on.
Anyone have any ideas?
p.s. I'm using Flask, if that matters.
Try appstats. Pretty easy to set up, and you'll be able to see all major RPC calls.
https://developers.google.com/appengine/docs/python/tools/appstats
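Wiring it up is a couple of lines in appengine_config.py. A rough sketch for the Python 2.7 runtime; it wraps the WSGI app, so it works with Flask as well:

# appengine_config.py
from google.appengine.ext.appstats import recording

def webapp_add_wsgi_middleware(app):
    # record datastore and other RPC calls for every request
    return recording.appstats_wsgi_middleware(app)

You can then browse the recordings at /_ah/stats (with the appstats builtin enabled in app.yaml).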
An in-line alternative that we use with a lot of success is https://github.com/kamens/gae_mini_profiler.
You can view the traceable stack and get a lot of information not visible to you with just appstats.

most efficient lookup for url caching system for REST API

What's the best way to store, index, and lookup text strings (URLs in this case)?
I'm creating a caching system for one of my sites. It's actually a bit more complex than that, which is why I'm rolling my own. I'm looking for the quickest, most efficient way to resolve lookups on URLs, which are obviously text strings.
I'm currently using MySQL for a lot of my backend, and obviously I could just throw this in a table with a text field for the URL and its contents and turn on full-text indexing, but that just feels slow and fundamentally wrong. Is there something else I should be looking at, whether it's MySQL or some other tool? Should I MD5 the URL? Does that give me anything?
I've heard interesting things about mongodb too, but not sure if that buys me anything.
Memcached - easy, quick, found everywhere. I use it a lot.
MongoDB is a database, not a caching system. The speed difference between it and MySQL is not likely to be huge.
As D Mac mentioned, memcached is an excellent choice for this. You do need to be aware that memcached is a true caching system and can throw your data away at any moment; you must be able to handle that.
A good compromise is redis, which is an in-memory database, so it won't throw your data away like memcached will, but it will still be an order of magnitude quicker than MySQL or MongoDB. The only downside to redis is that your whole dataset must fit into memory.
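To make the memcached/redis suggestion concrete, a rough sketch with redis-py, keyed by an MD5 of the URL; the key prefix and TTL are arbitrary choices:

import hashlib
import redis

r = redis.Redis()

def cache_url(url, content, ttl=3600):
    # a fixed-length key derived from the URL keeps keys short and uniform
    key = "urlcache:" + hashlib.md5(url.encode("utf-8")).hexdigest()
    r.setex(key, ttl, content)

def get_cached(url):
    key = "urlcache:" + hashlib.md5(url.encode("utf-8")).hexdigest()
    return r.get(key)  # None on a cache miss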
Your question contains lots of subquestions but not a lot of detail on what you're actually doing so it's hard to give a good answer.
