How to check for available space in Elasticsearch with Python

I am completely new to Python, but I need to make a script that will check the Elasticsearch disk space on AWS and return a warning if it is below a certain threshold. I imagine I would have to create a client instance of Elasticsearch and then build a dictionary for a search query that asks for the available space? But I'm honestly not sure. I'm really hoping someone can provide some starter code to push me in the right direction.

You use CloudWatch metrics for that, e.g.:
FreeStorageSpace
ClusterUsedSpace
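A minimal sketch of reading one of those metrics with boto3, assuming an Amazon Elasticsearch Service domain; the domain name, account id (the ClientId dimension), region, and threshold below are placeholders you would replace:

# Minimal sketch (not production code): read the FreeStorageSpace metric for an
# Amazon Elasticsearch domain from CloudWatch and warn if it drops below a threshold.
# The domain name, account id, region, and threshold are placeholders.
import datetime

import boto3

DOMAIN_NAME = "my-es-domain"      # placeholder
CLIENT_ID = "123456789012"        # your AWS account id (placeholder)
THRESHOLD_MB = 10_000             # warn below ~10 GB free (placeholder)

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

now = datetime.datetime.utcnow()
resp = cloudwatch.get_metric_statistics(
    Namespace="AWS/ES",
    MetricName="FreeStorageSpace",
    Dimensions=[
        {"Name": "DomainName", "Value": DOMAIN_NAME},
        {"Name": "ClientId", "Value": CLIENT_ID},
    ],
    StartTime=now - datetime.timedelta(minutes=10),
    EndTime=now,
    Period=300,
    Statistics=["Minimum"],  # FreeStorageSpace is reported in megabytes
)

datapoints = resp.get("Datapoints", [])
if datapoints:
    free_mb = min(dp["Minimum"] for dp in datapoints)
    if free_mb < THRESHOLD_MB:
        print(f"WARNING: only {free_mb:.0f} MB of free storage left")
    else:
        print(f"OK: {free_mb:.0f} MB free")
else:
    print("No datapoints returned - check the domain name / dimensions")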

Related

Retrieve full deleted document of delete operation using pymongo change stream

I've just started getting my hands dirty with MongoDB and Python, so bear with me on this one.
The scenario is as follows:
I have a MongoDB collection and using pymongo's watch I listen to changes that occur.
For the purposes of explaining my problem, let's say that I can only react to anything that happens after the change in the collection.
The problem comes when there is a delete operation happening in the collection. The change stream only returns the _id of the deleted document, while I am looking for a way of getting the full detailed document (much like how it's returned when you insert a new document).
Is this even possible and if yes, could you provide an example?
The simple answer is no, it's not possible to do that in the current version of MongoDB (4.4).
Change streams are very useful, but they can only tell you what happened post-event. For those from a SQL background used to triggers where you can get the "before" and "after" view, this might be frustrating; but it's just the way it is.
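To make that concrete, here is a small sketch of what pymongo delivers for each event type (the connection string, database, and collection names are placeholders; change streams also require a replica set). Note how a delete event only carries the document key, not the fields:

# Sketch: for deletes there is no "fullDocument" key - only the _id.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # placeholder connection string
collection = client["mydb"]["mycoll"]              # placeholder names

with collection.watch() as stream:
    for change in stream:
        if change["operationType"] == "delete":
            # Only the document key is available, e.g. {'_id': ObjectId('...')}.
            # The deleted fields themselves are already gone.
            print("deleted:", change["documentKey"])
        elif change["operationType"] == "insert":
            # Inserts do carry the full document.
            print("inserted:", change["fullDocument"])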

Safety/Encryption for python script (bot)

I built a Python script (bot) for a game and I plan on building a GUI for it to make it more user friendly. However, I also want to add some sort of security to it, something that would only give access to whoever I want. Maybe some kind of encryption key that unlocks the files, with limited use (a few days, for example). I am new to this specific security topic, so I need help understanding what my options are and what I can do or search for. Thank you for reading.
After days of searching and trying, I found the easiest way was to use a web API to check license requests. You can use, for example, the Cryptolens web API or any other similar API, and your encrypted file will work just fine.
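For illustration only, here is a rough sketch of the idea using a hypothetical licensing endpoint; the URL, parameters, and response fields are made up, and a real service such as Cryptolens defines its own API and client library:

# Hypothetical sketch: validate a license key against a licensing web API before
# unlocking the bot. The URL and response format are invented for illustration.
import sys

import requests

LICENSE_SERVER = "https://licensing.example.com/api/verify"  # hypothetical endpoint
LICENSE_KEY = "ABCDE-12345-FGHIJ-67890"                      # entered by the user

def license_is_valid(key: str) -> bool:
    try:
        resp = requests.post(LICENSE_SERVER, json={"key": key}, timeout=10)
        resp.raise_for_status()
    except requests.RequestException:
        return False  # no network / server error -> treat as invalid
    data = resp.json()
    # Hypothetical response: {"valid": true, "expires": "2024-01-31"}
    return bool(data.get("valid"))

if not license_is_valid(LICENSE_KEY):
    sys.exit("License invalid or expired - exiting.")

# ... start the bot / GUI here ...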

login to obiee and execute SQL using python

I've tried various ways of extracting reports from Oracle Business Intelligence (not hosted locally, version 11g), and the best I've come up with so far is the pyobiee library here, which is pretty good: https://github.com/kazei92/pyobiee. I've managed to log in and extract reports that I've already written, but in an ideal world I would be able to run SQL against it directly. I've tried this using the executeSQL function in pyobiee, but I can only manage to extract a column or two before it can't do any more.
I think I'm limited by my understanding of the SQL syntax, which is not a familiar one (it's more "logical", with no GROUP BY requirement), and I can't find a decent summary of how to use it. Where I have found summaries, I've followed them and it doesn't work (https://docs.oracle.com/middleware/12212/biee/BIESQ/toc.htm#BIESQ102). Please can you advise where I can find a better summary of the Logical SQL syntax?
The other possibility is that there is something wrong with the pyobiee library (it hasn't been maintained since August). I would be open to using pyodbc or cx_Oracle instead, but I can't work out how to log in using these routes. Please can you advise?
The reason I'm taking this route is because my organisation has mapping tables that are not held in obiee and there is no prospect of getting them in there. So I'm working on extracting using python so that I can add the mapping tables in SQL server.
I advise you to rethink what you are doing. First of all, the Python library is a wrapper around the OBI web services, which in itself isn't wrong, but it is an additional layer of abstraction that hides most of the web services and their functionality. There are way more than three of them...
Second - the real question is "What exactly are you trying to achieve?". If you simply want data from the OBI server, then you can just as well get it over ODBC. No need for 50 additional technologies in the middle.
As far as LSQL is concerned: Yes, there is a reference: https://docs.oracle.com/middleware/12212/biee/BIESQ/BIESQ.pdf
BUT you will definitely need to know what you want to access since what's governing things is the RPD. A metadata layer. Not a database.
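As a rough sketch of the ODBC route, assuming an "Oracle BI Server" ODBC DSN has already been configured on the machine; the DSN name, credentials, and subject area / column names below are placeholders, not real presentation-layer objects:

# Sketch: query the BI Server directly over ODBC with Logical SQL.
import pyodbc

conn = pyodbc.connect("DSN=OBIEE;UID=weblogic;PWD=secret", autocommit=True)
cursor = conn.cursor()

# Logical SQL runs against presentation-layer names defined in the RPD,
# not physical tables - no GROUP BY is needed for aggregations.
logical_sql = '''
SELECT "Sales"."Region", "Sales"."Revenue"
FROM "Sales Subject Area"
'''
cursor.execute(logical_sql)
for row in cursor.fetchall():
    print(row)

conn.close()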

Best practice to update a column of all documents in Elasticsearch

I'm developing a log analysis system. The input is log files. I have an external Python program that reads the log files and decides whether a record (line) or the whole log file is "normal" or "malicious". I want to use the Elasticsearch Update API to append my Python program's result ("normal" or "malicious") to Elasticsearch by adding a new column called result, so I can see my program's result clearly in the Kibana UI.
Simply speaking, both my Python code and Elasticsearch use the log files as input. Now I want to push the results from my Python code into Elasticsearch. What's the best way to do it?
I can think of several ways:
Elasticsearch automatically assigns an ID (_id) to a document. If I can find out how Elasticsearch calculates _id, then my Python code can calculate it by itself and update the corresponding Elasticsearch document via _id. But the problem is that the official Elasticsearch documentation doesn't say what algorithm it uses to generate _id.
Add an ID (like line number) to the log files by myself. Both my program and Elasticsearch will know this ID. My program can use this ID to update. However, the downside is that my program has to search for this ID every time because it's only a normal field instead of a built-in _id. The performance will be very bad.
My Python code gets the logs from Elasticsearch instead of reading the log files directly. But this makes the system fragile, as Elasticsearch becomes a critical point. I only want Elasticsearch to be a log viewer currently.
So the first solution seems ideal from my current point of view. But I'm not sure if there are better ways to do it?
If possible, re-structure your application so that instead of dumping plain-text to a log file you're directly writing structured log information to something like Elasticsearch. Thank me later.
That isn't always feasible (e.g. if you don't control the log source). I have a few opinions on your solutions.
On your first option: this feels super brittle. Elasticsearch does not base _id on the properties of a particular document. It's selected based off of existing _id fields that it has stored (and I think also off of a random seed). Even if it could work, relying on an undocumented property is a good way to shoot yourself in the foot when dealing with a team that makes breaking changes even for its documented code as often as Elasticsearch does.
Your second option actually isn't so bad. Elasticsearch supports manually choosing the id of a document. Even if it didn't, it performs quite well for bulk terms queries and wouldn't be as much of a bottleneck as you might think. If you really have so much data that this could break your application then Elasticsearch might not be the best tool.
Your third option is great. It's super extensible and doesn't rely on a complicated dependence on how the log file is constructed, how you've chosen to index that log in Elasticsearch, and how you're choosing to read it with Python. Rather you just get a document, and if you need to update it then you do that updating.
Elasticsearch isn't really a worse point of failure here than before (if ES goes down, your app goes down in any of these solutions) -- you're just doing twice as many queries (read and write). If a factor of 2 kills your application, you either need a better solution to the problem (i.e. avoid Elasticsearch), or you need to throw more hardware at it. ES supports all kinds of sharding configurations, and you can make a robust server on the cheap.
One question though, why do you have logs in Elasticsearch that need to be updated with this particular normal/malicious property? If you're the one putting them into ES then just tag them appropriately before you ever store them to prevent the extra read that's bothering you. If that's not an option then you'll still probably be wanting to read ES directly to pull the logs into Python anyway to avoid the enormous overhead of parsing the original log file again.
If this is a one-time hotfix to existing ES data while you're rolling out normal/malicious, then don't worry about a 2x speed improvement. Just throttle the query if you're concerned about bringing down the cluster. The hotfix will execute eventually, and probably faster than if we keep deliberating about the best option.
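For reference, here is a minimal sketch of the "choose your own id" approach, assuming the elasticsearch-py client (7.x style keyword arguments) and a hypothetical index, file, and field layout:

# Sketch: index each log line with an id you control (file name + line number),
# then update that document later with the classifier's verdict.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder endpoint

INDEX = "logs"  # placeholder index name

# 1) Ingest: choose the _id yourself instead of letting ES generate one.
with open("app.log") as f:
    for lineno, line in enumerate(f, start=1):
        doc_id = f"app.log:{lineno}"
        es.index(index=INDEX, id=doc_id, body={"message": line.rstrip("\n")})

# 2) Later, the classifier appends its verdict as a new field via the Update API.
def tag_line(lineno: int, verdict: str) -> None:
    es.update(
        index=INDEX,
        id=f"app.log:{lineno}",
        body={"doc": {"result": verdict}},  # "normal" or "malicious"
    )

tag_line(42, "malicious")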

Amazon Autoscaling trigger not working, how do I debug it?

I'm trying to use autoscaling to create new EC2 instances whenever average CPU load on existing instances goes high. Here's the situation:
I'm setting up autoscaling using this boto script (with keys and image names removed). http://balti.ukcod.org.uk/~francis/tmp/start_scaling_ptdaemon.nokeys.py
I've got min_size set to 2, and the AutoScalingGroup correctly creates an initial 2 instances, which both work fine. I'm pretty sure this means the LaunchConfiguration is right.
When load goes up to nearly 100% on both those two instances, nothing happens.
Some questions / thoughts:
Is there any way of debugging this? I can't find any API calls that give me details of what Autoscaling is doing, or thinks it is doing. Are there any tools that give feedback either on what it is doing, or on whether it has set things up correctly?
It would be awesome if Autoscaling appeared in the AWS Console.
I'm using EU west availability zone. Is there any reason that should cause trouble with Autoscaling?
Is there any documentation of the "dimensions" parameter when creating a trigger? I have no idea what it means, and have just copied its fields from an example. I can't find any documentation about it that doesn't self-referentially say it is a "dimension", without explaining what that means or what the possible values are.
Thanks for any help!
I'm sure you've already found these, but it would be good to use the AWS tools first, before the Python tool, to get the idea. :)
http://ec2-downloads.s3.amazonaws.com/AutoScaling-2009-05-15.zip
http://docs.amazonwebservices.com/AutoScaling/latest/DeveloperGuide/
Cheers,
Rodney
Also, take a look at something like http://alestic.com/2011/11/ec2-schedule-instance for a simple example of how to use the tools with a demo script provided.
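For debugging, the scaling activity history and the trigger's CloudWatch alarm are the two things worth inspecting. A rough sketch using the modern boto3 names (the group name and region are placeholders; the question's original script uses the older boto library, but the underlying API calls are the same idea):

# Sketch: print the autoscaling activity history and the alarms attached to the
# group. The "Dimensions" on a CPU-based trigger are just the group name, i.e.
# {"Name": "AutoScalingGroupName", "Value": <your group>}.
import boto3

GROUP = "my-asg"        # placeholder
REGION = "eu-west-1"

autoscaling = boto3.client("autoscaling", region_name=REGION)
cloudwatch = boto3.client("cloudwatch", region_name=REGION)

# 1) Scaling activity history - the closest thing to an autoscaling "log".
activities = autoscaling.describe_scaling_activities(AutoScalingGroupName=GROUP)
for act in activities["Activities"]:
    print(act["StatusCode"], act["Description"], act.get("Cause", ""))

# 2) Alarms whose dimensions point at this group - check state and thresholds.
alarms = cloudwatch.describe_alarms()
for alarm in alarms["MetricAlarms"]:
    dims = {d["Name"]: d["Value"] for d in alarm["Dimensions"]}
    if dims.get("AutoScalingGroupName") == GROUP:
        print(alarm["AlarmName"], alarm["StateValue"], alarm["Threshold"])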
