Best practice to update a column of all documents to Elasticsearch - python

I'm developing a log analysis system. The input is log files. I have an external Python program that reads the log files and decides whether a record (line) of a log file is "normal" or "malicious". I want to use the Elasticsearch Update API to append my Python program's result ("normal" or "malicious") to Elasticsearch by adding a new field called result, so I can see my program's results clearly via the Kibana UI.
Simply speaking, my Python code and Elasticsearch both consume the same log files as input, each on their own. Now I want to push the results from my Python code into Elasticsearch. What's the best way to do it?
I can think of several ways:
Elasticsearch automatically assigns an ID (_id) to each document. If I could find out how Elasticsearch calculates _id, then my Python code could compute it by itself and update the corresponding Elasticsearch document via _id. The problem is that the official Elasticsearch documentation doesn't say what algorithm it uses to generate _id.
Add an ID (like a line number) to the log files myself. Both my program and Elasticsearch will know this ID, and my program can use it to update the matching document. The downside is that my program has to search for this ID every time, because it's only an ordinary field rather than the built-in _id, so the performance will be very bad.
My Python code gets the logs from Elasticsearch instead of reading the log files directly. But this makes the system fragile, as Elasticsearch becomes a single point of failure. For now I only want Elasticsearch to be a log viewer.
So the first solution seems ideal from my current point of view, but I'm not sure whether there are better ways to do it.

If possible, re-structure your application so that instead of dumping plain-text to a log file you're directly writing structured log information to something like Elasticsearch. Thank me later.
That isn't always feasible (e.g. if you don't control the log source). I have a few opinions on your solutions.
This feels super brittle. Elasticsearch does not base _id on the properties of a particular document. It's chosen based on existing _id values it has stored (and, I think, a random seed as well). Even if it could work, relying on an undocumented property is a good way to shoot yourself in the foot when dealing with a team that makes breaking changes even to its documented behaviour as often as Elasticsearch does.
This one actually isn't so bad. Elasticsearch supports manually choosing the id of a document. Even if it didn't, it performs quite well for bulk terms queries and wouldn't be as much of a bottleneck as you might think. If you really have so much data that this could break your application then Elasticsearch might not be the best tool.
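To make option 2 concrete, here is a minimal sketch using recent versions of the official elasticsearch-py client (the exact keyword arguments vary between client versions); the index name, field names and id scheme are only illustrative, not anything the question prescribes:

    # Sketch of option 2: choose the _id yourself (file name + line number
    # here), so the classifier can update by _id with no search at all.
    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")

    # At ingest time, index each log line under a deterministic _id.
    doc_id = "auth.log:1042"
    es.index(index="logs", id=doc_id, document={"message": "Failed password for root"})

    # Later, the Python classifier tags the same document by _id.
    es.update(index="logs", id=doc_id, doc={"result": "malicious"})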
This solution is great. It's super extensible and doesn't create a complicated dependence on how the log file is constructed, how you've chosen to index that log in Elasticsearch, and how you're choosing to read it with Python. Rather, you just get a document, and if you need to update it then you do that updating.
Elasticsearch isn't really a worse point of failure here than before (if ES goes down, your app goes down in any of these solutions) -- you're just doing twice as many queries (read and write). If a factor of 2 kills your application, you either need a better solution to the problem (i.e. avoid Elasticsearch), or you need to throw more hardware at it. ES supports all kinds of sharding configurations, and you can make a robust server on the cheap.
One question though, why do you have logs in Elasticsearch that need to be updated with this particular normal/malicious property? If you're the one putting them into ES then just tag them appropriately before you ever store them to prevent the extra read that's bothering you. If that's not an option then you'll still probably be wanting to read ES directly to pull the logs into Python anyway to avoid the enormous overhead of parsing the original log file again.
If this is a one-time hotfix to existing ES data while you're rolling out normal/malicious, then don't worry about a 2x speed improvement. Just throttle the query if you're concerned about bringing down the cluster. The hotfix will execute eventually, and probably faster than if we keep deliberating about the best option.
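For the one-time backfill case, a throttled update_by_query is a reasonable sketch (again with elasticsearch-py; the query, script and index name are placeholders):

    # Backfill an existing index with a result field, throttled via
    # requests_per_second so the hotfix doesn't swamp the cluster.
    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")

    es.update_by_query(
        index="logs",
        query={"match": {"source": "auth.log"}},
        script={"source": "ctx._source.result = 'normal'", "lang": "painless"},
        conflicts="proceed",
        requests_per_second=100,    # throttle the backfill
        wait_for_completion=False,  # run it as a background task
    )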

How to share sensitive data among programs while keeping the possibility of comparing them with other local data?

Context
As part of my studies, I am creating a bot capable of detecting scam messages, in Python 3. One of the problems I am facing is the detection of fraudulent websites.
Currently, I have a list of domain names saved in a CSV file, containing both known domains considered safe (discord.com, google.com, etc.), and known fraudulent domains (free-nitro.ru etc.)
To share this list between my personal computer and my server, I regularly "deploy" it over FTP. But since my bot also uses GitHub and a MySQL database, I'm looking for a better system to synchronize this list of domain names without allowing anyone else to access it.
I feel like I'm looking for a miracle solution that doesn't exist, but I don't want to overestimate my knowledge so I'm coming to you for advice, thanks in advance!
My considered solutions:
Put the domain names in a MySQL table
Advantages: no public access, live synchronization
Disadvantages: my scam detection script needs to be able to work offline, which a remote database doesn't allow
Hash the domain names before putting them on git
Advantages: no public access, easy to do, supports equality comparison
Disadvantages: does not support similarity comparison, which is an important part of the program
Hash domain names with locality-sensitive hashing
Advantages: no easy public access, supports equality and similarity comparison
Disadvantages: similarities are less precise than with the plaintext domains, and it is impossible to hash a new string on the server without knowing at least the seed of the random generator, so the public-access problem comes back
My opinion
It seems to me that the last solution, with the LSH, is the one that causes the fewest problems. But it is far from satisfying, and I hope to find something better.
For the LSH algorithm, I have reproduced it here (from this notebook). I get similarity coefficients between 10% and 40% lower than those obtained with the current plaintext method.
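For reference, a minimal MinHash-over-trigrams sketch (assuming the datasketch package rather than the notebook's own implementation) looks like this; only the signatures, not the plaintext domains, would need to be shared:

    # MinHash over character trigrams, assuming the datasketch package.
    from datasketch import MinHash

    def domain_signature(domain, num_perm=128):
        m = MinHash(num_perm=num_perm)
        for i in range(len(domain) - 2):
            m.update(domain[i:i + 3].encode("utf-8"))
        return m

    known = domain_signature("discord.com")
    candidate = domain_signature("disc0rd.com")
    print(candidate.jaccard(known))  # estimated trigram similarity, 0.0 - 1.0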
EDIT: for clarification, maybe my intentions weren't clear enough (I'm sorry, English is not my native language and I'm bad at explaining things lol). The database and GitHub are just convenient ways to share information between my different bot instances. I could have one running locally on my PC, one on my VPS, another one god knows where… which is why I don't want FTP or any other synchronisation process that involves a fixed IP and/or destination folder. Ideally I'd like to be able to take my program at any time, download it wherever I want (by git clone) and just run it.
Please tell me if this isn’t clear enough, thanks :)
In the end I think I'll use yet another solution: storing the domain names in the MySQL database, but only using it in my script as something to synchronize against, while keeping a local CSV copy.
In short, the workflow I'm imagining:
I edit my SQL table when I want to add/remove items to it
When the bot is launched, the script connects to the DB and retrieves all the information from the table
Once the information is retrieved, it saves it in a CSV file and finishes running the rest of the script
If at launch no internet connection is available, the synchronization to the DB is not done and only the CSV file is used.
This way I get the advantages of no public access, automatic synchronization, and offline access after the first start, and I keep support for similarity comparison since no hashing is done (roughly as sketched below).
If you think you can improve my idea, I'm interested!
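A rough sketch of that launch-time synchronization, assuming mysql-connector-python and with placeholder credentials, table and file names:

    # Sync the domain list from MySQL at launch, falling back to the local
    # CSV cache when the database is unreachable. Names are placeholders.
    import csv

    import mysql.connector
    from mysql.connector import Error

    CACHE = "domains.csv"

    def load_domains():
        try:
            conn = mysql.connector.connect(
                host="db.example.com", user="bot", password="secret", database="scamdb"
            )
            cur = conn.cursor()
            cur.execute("SELECT domain, is_fraudulent FROM domains")
            rows = cur.fetchall()
            conn.close()
            with open(CACHE, "w", newline="") as f:   # refresh the local cache
                csv.writer(f).writerows(rows)
        except Error:
            with open(CACHE, newline="") as f:        # offline: use the last cache
                rows = [tuple(r) for r in csv.reader(f)]
        return rows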

MySql bulk import without writing a file to disk

I have a Django site running with a MySQL database backend. I accept fairly large uploads from one of the admin users to bulk import some data. The data comes in a format that is slightly different from the form it needs to be in the database, so I need to do a little parsing.
I'd like to be able to pivot this data into CSV, write it into a cStringIO object, and then simply use MySQL's bulk import command to load that file. I'd prefer to skip writing the file to disk first, but I can't seem to find a way around it. I've done basically this exact thing with PostgreSQL in the past, but unfortunately this project is on MySQL.
In short: can I take an in-memory file-like object and somehow use MySQL's bulk import operation on it?
There is an excellent tutorial called Generator Tricks for Systems Programmers that addresses processing large log files, which is similar, but not identical, to your situation. As long as you can perform the needed transform with access to only the current (and possibly previous) data in the stream, this may work for you.
I have mentioned this gem in a number of answers because I think that it introduces a different way of thinking that can be quite valuable. There is a companion piece, A Curious Course on Coroutines and Concurrency, that can seriously twist your head around.
If by "bulk import" you mean LOAD DATA [LOCAL] INFILE then, no, there's no way around first writing the data to some file, damn it all. You (and I) would really like to write the table directly from an array.
But some OSs, like Linux, offer a RAM-resident filesystem (tmpfs, often already mounted at /dev/shm) that eases some of the hurt. I'm not enough of a sysadmin to know how to set up one of these guys; I had to get my ISP's tech support to do it for me. I found an article that might have useful info.
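As a sketch of that approach (assuming pymysql, a placeholder staging_table, and a server that permits LOCAL INFILE), you can stage the CSV under /dev/shm so the "file" never hits a physical disk:

    # Stage the CSV on a RAM-backed filesystem, then LOAD DATA from it.
    # Connection details, table name and row data are placeholders.
    import csv
    import os
    import tempfile

    import pymysql

    rows = [("2009-02-02", 100), ("2009-02-03", 192)]  # already-parsed upload data

    fd, path = tempfile.mkstemp(suffix=".csv", dir="/dev/shm")
    try:
        with os.fdopen(fd, "w", newline="") as f:
            csv.writer(f).writerows(rows)

        conn = pymysql.connect(host="localhost", user="app", password="secret",
                               database="mydb", local_infile=True)
        with conn.cursor() as cur:
            cur.execute(
                "LOAD DATA LOCAL INFILE %s INTO TABLE staging_table "
                "FIELDS TERMINATED BY ',' LINES TERMINATED BY '\\n'",
                (path,),
            )
        conn.commit()
    finally:
        os.remove(path)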
HTH

Using Python to generate JSON Data

I'm working on a little project, the aim being to generate a report from a database for a server. The database is SQLite and contains tables like 'connections', 'downloads', etc.
The report I produce will ultimately contain a number of graphs displaying things like 'connections per day', 'top downloads this month', etc.
I plan to use flot for the graphs because the graphs it makes look very nice.
This is my current plan for how my reports will work:
A static .html file, which is the report. This will contain headings, embedded flot graphs, etc.
JSON data files. These will be generated by my report-generation Python script; each will basically contain a JSON variable per graph representing the dataset that graph should plot, e.g. [[100, "2009-2-2"], [192, "2009-2-3"], ...].
A report-generation Python script. This will open the SQLite database, run a set list of SQL queries, and spit out the JSON data files.
Does this sound like a sensible set up? I can't help but feel it could be improved but I don't see how. I want the reports to be static. The server they run on cannot take heavy loads so a dynamically generated report is out of the question and also unnecessary for this application.
My concerns are:
I feel that the Python script is largely pointless, all of the processing performed is done by SQLite, my script is basically going to be used to store SQL queries and package up the output. With a bit more work SQLite could probably do this for me.
It seems I'm solving a problem that must have been solved many times before 'take sql queries, spit out pretty graphs in a daily report' must have been done hundreds of times. I'm just having trouble tracking down any broad implementations.
It sounds sensible to me.
You need some programming language to talk to SQLite. You could do it in C, but if you can write the necessary glue code easily in Python, why not? You'll almost certainly save more time writing it than you'll lose from not having the most efficient possible program.
There are definitely programs to analyse logs for you - I've heard of Piwik, for instance. That's for dynamic reports, but no doubt there are projects to do static reports too. But they may not fit the data and output you're after. Writing it yourself means you know precisely what you're getting, so if it's not too much work, carry on.
I feel that the Python script is largely pointless, all of the processing performed is done by SQLite, my script is basically going to be used to store SQL queries and package up the output. With a bit more work SQLite could probably do this for me.
Maybe so, but even then, Python is a great glue language. Also, if you need to do some processing SQLite isn't good at, Python is already there.
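To make the glue-code point concrete, the whole report-generation step can be as small as this sketch (table, column and file names are made up):

    # Run a canned query against SQLite and emit a flot-ready JSON dataset.
    import json
    import sqlite3

    conn = sqlite3.connect("server.db")
    rows = conn.execute(
        "SELECT date(connected_at) AS day, COUNT(*) AS hits "
        "FROM connections GROUP BY day ORDER BY day"
    ).fetchall()

    # flot expects a list of [x, y] pairs; write one JSON file per graph.
    dataset = [[day, hits] for day, hits in rows]
    with open("connections_per_day.json", "w") as f:
        json.dump(dataset, f)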
It seems I'm solving a problem that must have been solved many times before 'take sql queries, spit out pretty graphs in a daily report' must have been done hundreds of times. I'm just having trouble tracking down any broad implementations.
I think you're leaning towards the general class of HTTP-served reporting. One thing out there that overlaps your problem set is Django, which provides a Python interface between database (SQLite is supported) and web server, along with a templating system for your outputs.
If you just want one or two pieces of a solution, then I recommend looking at SQLAlchemy for interfacing with the database, Jinja2 for templating, and/or Werkzeug for HTTP server interface.

What's the best way to implement web service for ajax autocomplete

I'm implementing a "Google Suggest" like autocomplete feature for tag searching using jQuery's autocomplete.
I need to provide a web service to jQuery giving it a list of suggestions based on what the user has typed. I see 2 ways of implementing the web service:
1) just store all the tags in a database and search the DB using user input as prefix. This is simple, but I'm concerned about latency.
2) Use an in-process trie to store all the tags and search it for matching results. As everything will be in-process, I expect this to have much lower latency. But there are several difficulties:
-What's a good way to initialize the trie on process start-up? Presumably I'll store the tag data in a DB, then retrieve it and turn it into a trie when I first start up the process. But I'm not sure how. I'm using Python/Django.
-When a new tag is created by a user, I need to insert the new tag into the trie. But let's say I have 5 Django processes and hence 5 tries, how do I tell the other 4 tries that they need to insert a new tag too?
-How do I make sure the trie is thread-safe, as my Django processes will be threaded (I'm using mod_wsgi)? Or do I not have to worry about thread safety because of Python's GIL?
-Any way I can store the tag's frequency of use within the trie as well? How do I tell where the tag's string ends and the frequency starts - e.g. if I store apple213 in the trie, is it "apple" with frequency 213, or is it "apple2" with frequency 13?
Any help on the issues above or any suggestions on a different approach would be really appreciated.
Don't be concerned about latency before you measure things -- make up a bunch of pseudo-tags, stick them in the DB, and measure latencies for typical queries. Depending on your DB setup, your latency may be just fine and you're spared wasted worries.
Do always worry about threading, though - the GIL doesn't make race conditions go away (control might switch among threads at any pseudocode instruction boundary, as well as when C code in an underlying extension or builtin is executing). You need first to check the threadsafety attribute of the DB API module you're using (see PEP 249), and then use locking appropriately or spawn a small pool of dedicated threads that perform DB interactions (receiving requests on a Queue.Queue and returning results on another, the normal architecture for sound and easy threading in Python).
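A minimal sketch of that dedicated-DB-thread arrangement (the module is Queue on Python 2, queue on Python 3; the tags table and SQL are placeholders):

    # Request threads put (term, reply_queue) on one queue; a single worker
    # owns the DB connection and answers on the per-request reply queue.
    import queue
    import sqlite3
    import threading

    requests = queue.Queue()

    def db_worker():
        conn = sqlite3.connect("tags.db")  # connection lives in this thread only
        while True:
            term, reply_q = requests.get()
            rows = conn.execute(
                "SELECT name FROM tags WHERE name LIKE ? ORDER BY freq DESC LIMIT 10",
                (term + "%",),
            ).fetchall()
            reply_q.put([r[0] for r in rows])

    threading.Thread(target=db_worker, daemon=True).start()

    def suggest(term):
        """Called from any request thread; blocks until the worker replies."""
        reply_q = queue.Queue()
        requests.put((term, reply_q))
        return reply_q.get()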
I would use the first option. 'KISS' - (Keep It Simple Stupid).
For small amounts of data there shouldn't be much latency. We run the same kind of thing for a name search and results appear pretty quickly on a few thousand rows.
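In Django terms, option 1 is essentially a one-line query; a sketch, assuming a Tag model with a name field:

    # Prefix search straight from the DB; jQuery autocomplete just needs the
    # matching strings back. Model and field names are assumptions.
    import json

    from django.http import HttpResponse
    from myapp.models import Tag  # hypothetical app/model

    def suggest(request):
        term = request.GET.get("term", "")
        names = list(
            Tag.objects.filter(name__istartswith=term)
            .values_list("name", flat=True)[:10]
        )
        return HttpResponse(json.dumps(names), content_type="application/json")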
Hope that helps,
Josh

Is it reasonable to save data as python modules?

This is what I've done for a project. I have a few data structures that are basically dictionaries with some methods that operate on the data. When I save them to disk, I write them out to .py files as code that, when imported as a module, will load the same data back into such a data structure.
Is this reasonable? Are there any big disadvantages? The advantage I see is that when I want to operate on the saved data, I can quickly import the modules I need. Also, the modules can be used separately from the rest of the application because you don't need a separate parser or loader.
By operating this way, you may gain some modicum of convenience, but you pay many kinds of price for that. The space it takes to save your data, and the time it takes to both save and reload it, go up substantially; and your security exposure is unbounded -- you must ferociously guard the paths from which you reload modules, as it would provide an easy avenue for any attacker to inject code of their choice to be executed under your userid (pickle itself is not rock-solid, security-wise, but, compared to this arrangement, it shines;-).
All in all, I prefer a simpler and more traditional arrangement: executable code lives in one module (on a typical code-loading path, that does not need to be R/W once the module's compiled) -- it gets loaded just once and from an already-compiled form. Data live in their own files (or portions of DB, etc) in any of the many suitable formats, mostly standard ones (possibly including multi-language ones such as JSON, CSV, XML, ... &c, if I want to keep the option open to easily load those data from other languages in the future).
It's reasonable, and I do it all the time. Obviously it's not a format you use to exchange data, so it's not a good format for anything like a save file.
But for example, when I do migrations of websites to Plone, I often get data about the site (such as a list of which pages should be migrated, a list of how old URLs should be mapped to new ones, or lists of tags). These you typically get in Word or Excel format. The data often needs massaging a bit, and I end up with what are, for all intents and purposes, dictionaries mapping one URL to some other information.
Sure, I could save that as CSV and parse it into a dictionary. But instead I typically save it as a Python file with a dictionary. Saves code.
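Concretely, such a data module is nothing more than a .py file containing literals (the names here are made up for illustration):

    # url_map.py -- the "saved data" is just importable Python literals.
    URL_MAP = {
        "/old/about.html": "/about",
        "/old/contact.html": "/contact",
    }
    TAGS = ["news", "events"]

    # Elsewhere in the migration script:
    #     from url_map import URL_MAP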
So, yes, it's reasonable; no, it's not a format you should use for any sort of save file. It is, however, often used for data that straddles the border with configuration, like the example above.
The biggest drawback is that it's a potential security problem, since it's hard to guarantee that the files won't contain arbitrary code, which could be really bad. So don't use this approach if anyone other than you has write access to the files.
A reasonable option might be to use the Pickle module, which is specifically designed to save and restore python structures to disk.
Alex Martelli's answer is absolutely insightful and I agree with him. However, I'll go one step further and make a specific recommendation: use JSON.
JSON is simple, and Python's data structures map well into it; and there are several standard libraries and tools for working with JSON. The json module in the standard library (Python 2.6 and newer) is based on simplejson, so I would use simplejson on older Pythons and the built-in json otherwise.
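A minimal sketch of the JSON route, using the same kind of mapping as data rather than code (the file name is illustrative):

    import json

    data = {"/old/about.html": "/about", "/old/contact.html": "/contact"}

    # Save...
    with open("url_map.json", "w") as f:
        json.dump(data, f, indent=2)

    # ...and load it back later (or from another language entirely).
    with open("url_map.json") as f:
        data = json.load(f)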
Second choice is XML. XML is more complicated, and harder to just look at (or just edit with a text editor) but there is a vast wealth of tools to validate it, filter it, edit it, etc.
Also, if your data storage and retrieval needs become at all nontrivial, consider using an actual database. SQLite is terrific: it's small, and for small databases runs very fast, but it is a real actual SQL database. I would definitely use a Python ORM instead of learning SQL to interact with the database; my favorite ORM for SQLite would be Autumn (small and simple), or the ORM from Django (you don't even need to learn how to create tables in SQL!) Then if you ever outgrow SQLite, you can move up to a real database such as PostgreSQL. If you find yourself writing lots of loops that search through your saved data, and especially if you need to enforce dependencies (such as if foo is deleted, bar must be deleted too) consider going to a database.