I'm currently writing a Python application that takes a directory of text files and parses them into custom Python objects based on the attributes specified in each file. As part of the application, I compare the currently loaded object set to a previous dataset (same format) and scan it for possible duplicates, conflicts, updates, etc. However, since there can be 10,000+ objects at a time, I'm not really sure how to approach this.
I'm currently storing the previous dataset in a DB, as it's being used by another web app. As of now, my Python application loads the 'proposed' dataset into memory (creating the rule objects) and stores those objects in a dictionary (problem #1). Then, when it comes time to compare, I use a combination of SQL queries and failed inserts to determine new, existing, and existing-but-updated entries (problem #2).
This is hackish and terrible at best. I'm looking for some advice on restructuring the application and handling the object storage/comparisons.
You can fake what Git does: load the entire set as essentially a single file and parse from there. The biggest issue is that dictionaries are not ordered (at least prior to Python 3.7), so your comparisons will not always be 1:1. A list of tuples will give you 1:1 comparisons. If a lot has changed, this will be more difficult.
Here is a basic flow for how you can do this.
Start with both tuple lists at index 0.
Compare a hash of each tuple: hashlib.sha1(str(tuple1).encode()).hexdigest() == hashlib.sha1(str(tuple2).encode()).hexdigest() (compare the digests; two hash objects never compare equal directly).
If they are equal, record the matching indexes, add 1 to each index, and compare again.
If they are unequal, search each side for a match and record the matching indexes.
If there are no matches, you can assume an insert/update/delete is happening and come back to it later.
You can map your matching items as reference points to do further investigation into the ones that did not match. This technique can be applied at each level you drill down. You will end up with a map of what is different down to the individual values.
The nice thing is that each of the slices you create can be compared in parallel, since they don't depend on each other... unless you are moving things from one file to another.
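A minimal sketch of the flow above, assuming both datasets have already been flattened into lists of tuples (the function names are illustrative, and only the new side is searched on a mismatch to keep it short):

    import hashlib

    def row_hash(row):
        # Hash the string form of a tuple; enough for change detection.
        return hashlib.sha1(str(row).encode("utf-8")).hexdigest()

    def align(old_rows, new_rows):
        """Walk both tuple lists, recording index pairs whose hashes match."""
        matches, unmatched_old, unmatched_new = [], [], []
        i = j = 0
        while i < len(old_rows) and j < len(new_rows):
            if row_hash(old_rows[i]) == row_hash(new_rows[j]):
                matches.append((i, j))
                i, j = i + 1, j + 1
                continue
            # Mismatch: look ahead on the new side for the current old row.
            # (Rebuilt each time for clarity; cache it for large inputs.)
            lookahead = {row_hash(r): k for k, r in enumerate(new_rows[j:], start=j)}
            h = row_hash(old_rows[i])
            if h in lookahead:
                unmatched_new.extend(range(j, lookahead[h]))   # new rows skipped over
                matches.append((i, lookahead[h]))
                i, j = i + 1, lookahead[h] + 1
            else:
                unmatched_old.append(i)   # probable insert/update/delete; revisit later
                i += 1
        unmatched_old.extend(range(i, len(old_rows)))
        unmatched_new.extend(range(j, len(new_rows)))
        return matches, unmatched_old, unmatched_new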
Then again, it may be easier to use a diff library to compare the two data sets. Might as well not reinvent the wheel; even if it might be a really shiny wheel.
Check out http://docs.python.org/library/difflib.html
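For example, if each object can be serialized to a stable line of text, a sketch along these lines gives you a readable diff (the example lines are made up):

    import difflib

    # Illustrative: replace these with stable string forms of your parsed objects.
    old_lines = ["rule-1: allow tcp 80", "rule-2: deny udp 53"]
    new_lines = ["rule-1: allow tcp 80", "rule-2: deny udp 5353"]

    for line in difflib.unified_diff(old_lines, new_lines,
                                     fromfile="previous", tofile="proposed", lineterm=""):
        print(line)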
I am writing a small python program that tries to find images similar enough to some already in a database (to detect duplicates that have been resized/recompressed/etc). I am using the imagehash library and average hashing, and want to know if there is a hash in a known database that has a hamming distance lower than, say, 3 or 4.
I am currently just using a dictionary that matches hashes to filenames and use brute force for every new image. However, with tens or hundreds of thousands of images to compare to, performance is starting to suffer.
I believe there must be data structures and algorithms that can allow me to search a lot more efficiently but wasn’t able to find much that would match my particular use case. Would anyone be able to suggest where to look?
Thanks!
Here's a suggestion. You mention a database, so initially I will assume we can use that (and don't have to read it all into memory first). If your new image has a hash of 3a6c6565498da525, think of it as 4 parts: 3a6c 6565 498d a525. For a hamming distance of 3 or less any matching image must have a hash where at least one of these parts is identical. So you can start with a database query to find all images whose hash contains the substring 3a6c or 6565 or 498d or a525. This should be a tiny subset of the full dataset, so you can then run your comparison on that.
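As a sketch of that first query, assuming sqlite3 and a placeholder table images(hash, filename):

    import sqlite3

    new_hash = "3a6c6565498da525"
    parts = [new_hash[i:i + 4] for i in range(0, len(new_hash), 4)]  # ['3a6c', '6565', '498d', 'a525']

    conn = sqlite3.connect("images.db")
    query = "SELECT hash, filename FROM images WHERE " + " OR ".join("hash LIKE ?" for _ in parts)
    candidates = conn.execute(query, [f"%{p}%" for p in parts]).fetchall()
    # Run the full hamming-distance comparison only on `candidates`.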
To improve further you could pre-compute all the parts and store them separately as additional columns in the database. This will allow more efficient queries.
For a bigger hamming distance you would need to split the hash into more parts (either smaller, or you could even use parts that overlap).
If you want to do it all in a dictionary, rather than using the database you could use the parts as keys that each point to a list of images. Either a single dictionary for simplicity, or for more accurate matching, a dictionary for each "position".
Again, this would be used to get a much smaller set of candidate matches on which to run the full comparison.
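A rough sketch of that in-memory version, with one dictionary per position (all names are illustrative):

    from collections import defaultdict

    def split_parts(h, n_parts=4):
        size = len(h) // n_parts
        return [h[k * size:(k + 1) * size] for k in range(n_parts)]

    def hamming(a, b):
        return bin(int(a, 16) ^ int(b, 16)).count("1")

    # index[position][part] -> list of (hash, filename) candidates
    index = defaultdict(lambda: defaultdict(list))

    def add_image(h, filename):
        for pos, part in enumerate(split_parts(h)):
            index[pos][part].append((h, filename))

    def find_similar(h, max_distance=3):
        candidates = set()
        for pos, part in enumerate(split_parts(h)):
            candidates.update(index[pos][part])
        return [(other, name) for other, name in candidates
                if hamming(h, other) <= max_distance]

    add_image("3a6c6565498da525", "cat.jpg")
    print(find_similar("3a6c6565498da524"))   # one bit away, so it is returned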
I want to be able to do two things:
Store a hash of a dataset's contents (so I can decide whether it has updated). To date, I have done this via a second output dataset with a single row that stores the hash and row count (sketched below). In my Transform I can read that output and compare it to the current build's hash and row count to decide whether the data has updated. This works fine, but I'd like to avoid having a second dataset if possible.
Pass through timestamps from upstream dependencies so that in downstream workflows I can answer "when did dependency X last update?"
It seems like both of these could be solved by some sort of key-value metadata store on the dataset.
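For reference, a minimal sketch of the fingerprint from the first point, assuming a Spark DataFrame (the hashing scheme is illustrative, not Foundry-specific):

    from pyspark.sql import functions as F

    def dataset_fingerprint(df):
        """Order-independent change detector: row count plus the sum of per-row
        hashes (F.hash is only 32-bit, so treat this as a heuristic, not a
        cryptographic guarantee)."""
        agg = df.agg(
            F.count(F.lit(1)).alias("row_count"),
            F.sum(F.hash(*df.columns).cast("long")).alias("hash_sum"),
        ).collect()[0]
        return agg["row_count"], agg["hash_sum"]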
You're correct that one of the most straightforward ways to do this is to decorate the rows with a timestamp value, and in fact with Foundry's Parquet storage system, this will be encoded using Dictionary Encoding, a highly efficient mechanism to store repeated values.
The problem with this approach is you'll have to stack a new column for each phase of updating you want to keep track of. This might prove annoying to maintain in practice.
However, if you don't want to add this data to your rows and instead simply want to store your metadata, you have two options, one of which you've already found:
1. Store metadata in a separate dataset
2. Write an 'unused' file (probably .csv or .txt) to your output that keeps track of this information
Foundry won't consider the extra .csv or .txt file on the output if you're writing a standard DataFrame to it, since your schema by default will only read Parquet files. This means you can store this little snippet of information without affecting your output. If you check the platform documentation, you can confirm that it's possible to write both a DataFrame and a file of your own to the same output.
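If I remember the transforms-python API correctly, the combination looks roughly like the sketch below; treat the exact paths and calls as assumptions and check them against the documentation:

    from transforms.api import transform, Input, Output

    @transform(
        out=Output("/project/my_output"),      # placeholder paths
        source=Input("/project/my_input"),
    )
    def compute(out, source):
        df = source.dataframe()
        out.write_dataframe(df)
        # Sidecar file next to the Parquet output; the schema only reads
        # Parquet, so downstream reads ignore it.
        with out.filesystem().open("metadata.txt", "w") as fh:
            fh.write("row_count=%d\n" % df.count())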
However, it may be simpler to interact with a second output, since the mechanics of incremental transforms and schema handling will be taken care of for you, so I'd recommend proceeding with option 1 as you are right now.
I recently needed to store large array-like data (sometimes numpy, sometimes key-value indexed) whose values would be changed over time (t=1 one element changes, t=2 another element changes, etc.). This history needed to be accessible (some time in the future, I want to be able to see what t=2’s array looked like).
An easy solution was to keep a list of arrays for all timesteps, but this became too memory intensive. I ended up writing a small class that handled this by keeping all data “elements” in a dict, with each element represented by a list of (this_value, timestamp_for_this_value) pairs. That let me recreate things for arbitrary timestamps by looking for the last change before some time t, but it was surely not as efficient as it could have been.
Are there data structures available for python that have these properties natively? Or some sort of class of data structure meant for this kind of thing?
Have you considered writing a log file? A good use of memory would be to have the arrays contain only the current values, but build in a procedure where each update statement triggers a logging function. That function could write to a text file, a database, or an array/dictionary of some sort. These kinds of audit trails are pretty common in the database world.
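A minimal sketch of that idea (names are illustrative): keep only the current values, and append every change to an audit log that can be replayed up to any timestamp; the same update hook could just as well write to a text file or a database table:

    class LoggedDict:
        """Keeps only the current values; every update is appended to an audit
        log so any past state can be reconstructed by replaying the log."""

        def __init__(self, initial, timestamp=0):
            self.current = dict(initial)
            # Audit trail of (timestamp, key, value), including the initial values.
            self.log = [(timestamp, k, v) for k, v in initial.items()]

        def update(self, timestamp, key, value):
            self.current[key] = value
            self.log.append((timestamp, key, value))

        def state_at(self, timestamp):
            """Replay every change made at or before `timestamp`."""
            state = {}
            for t, key, value in self.log:
                if t <= timestamp:
                    state[key] = value
            return state

    d = LoggedDict({"a": 0, "b": 0})
    d.update(1, "a", 5)
    d.update(2, "b", 7)
    print(d.state_at(1))   # {'a': 5, 'b': 0}
    print(d.current)       # {'a': 5, 'b': 7}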
Just started out with Tinkerpop and Janusgraph, and I'm trying to figure this out based on the documentation.
I have three datasets, each containing about 20 million rows (CSV files).
There is a specific model in which the variables and rows need to be connected, e.g. what are vertices, what are labels, what are edges, etc.
After having everything in a graph, I'd like to of course use some basic Gremlin to see how well the model works.
But first I need a way to get the data into Janusgraph.
Possibly there exist scripts for this.
But otherwise, is it perhaps something to be written in Python: open a CSV file, get each row of a variable X, and add it as a vertex/edge/etc.?
Or am I completely misinterpreting Janusgraph/Tinkerpop?
Thanks for any help in advance.
EDIT:
Say I have a few files, each of which contains a few million rows representing people, and several variables representing different metrics. A first example could look like this:
metric_1 metric_2 metric_3 ..
person_1 a e i
person_2 b f j
person_3 c g k
person_4 d h l
..
Should I translate this to files with nodes that are, in the first place, made up of just the values [a, ..., l]?
(and later perhaps more elaborate sets of properties)
And are [a,..., l] then indexed?
The 'Modern' graph here seems to have an index (numbers 1, ..., 12 for all the nodes and edges, independent of their overlapping label/category). Should each measurement be indexed separately and then linked to the person_x it belongs to?
Apologies for these probably straightforward questions, but I'm fairly new to this.
Well, the truth is that bulk loading real user data into JanusGraph is a real pain. I've been using JanusGraph since its very first version about 2 years ago, and it's still a pain to bulk load data. A lot of it is not necessarily down to JanusGraph: different users have very different data, different formats, and different graph models (some mostly need one vertex with one edge, e.g. child-mother, while others deal with one vertex with many edges, e.g. user followers). Last but definitely not least, the very nature of the tool deals with large datasets, and the underlying storage and index databases mostly come preconfigured to replicate heavily (i.e. you might be thinking 20m rows, but you actually end up inserting 60m or 80m entries).
All said, I've had moderate success bulk loading some tens of millions of rows in decent timeframes (again, it will be painful, but here are the general steps):
Provide IDs when creating graph elements. If importing from, e.g., MySQL, consider combining the table name with the ID value to create unique IDs, e.g. users1, tweets2.
Don't specify a schema up front; otherwise JanusGraph will need to ensure the data conforms on every insert.
Don't specify indexes up front. This is related to the above but really deserves its own entry: bulk insert first, index later.
Please, please, please be aware of the underlying database's bulk-insert features and activate them, i.e. read up on the Cassandra, ScyllaDB, or Bigtable docs, especially on replication and indexing.
After all of the above, configure JanusGraph for bulk loading, ensure your data integrity is correct (i.e. no duplicate IDs), and consider some form of parallelizing the insert requests, e.g. some kind of map-reduce system.
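A rough sketch of a chunked, parallel loader using gremlinpython (the endpoint, label, and column names are placeholders, and any JanusGraph bulk-loading settings still have to be enabled on the server side):

    import csv
    from concurrent.futures import ThreadPoolExecutor

    from gremlin_python.process.anonymous_traversal import traversal
    from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection

    def load_chunk(rows):
        conn = DriverRemoteConnection("ws://localhost:8182/gremlin", "g")
        g = traversal().withRemote(conn)
        for row in rows:
            # Keep the source system's ID as a property so duplicates can be
            # detected later; assigning native vertex IDs depends on JanusGraph
            # configuration, so it is not attempted here.
            (g.addV("person")
              .property("source_id", "users" + row["id"])
              .property("name", row["name"])
              .iterate())
        conn.close()

    with open("people.csv", newline="") as f:
        rows = list(csv.DictReader(f))

    chunks = [rows[k:k + 1000] for k in range(0, len(rows), 1000)]
    with ThreadPoolExecutor(max_workers=4) as pool:
        list(pool.map(load_chunk, chunks))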
I think I've covered the major points. Again, there's no silver bullet here, and the process normally involves quite a bit of trial and error, for example around the bulk insert rate: too low is bad (e.g. 10 per second) while too high is equally bad (e.g. 10k per second), and it almost always depends on your data, so it's a case-by-case basis and I can't recommend where you should start.
All said and done, give it a real go; bulk loading is the hardest part in my opinion, and the struggles are well worth the new dimension it gives your application.
All the best!
JanusGraph uses pluggable storage backends and indexes. For testing purposes, a script called bin/janusgraph.sh is packaged with the distribution. It allows you to get up and running quickly by starting Cassandra and Elasticsearch (it also starts a Gremlin Server, but we won't use it):
cd /path/to/janus
bin/janusgraph.sh start
Then I would recommend loading your data using a Groovy script. Groovy scripts can be executed with the Gremlin console:
bin/gremlin.sh -e scripts/load_data.script
An efficient way to load the data is to split it into two files:
nodes.csv: one line per node with all attributes
links.csv: one line per link with source_id, target_id, and all the link's attributes
This might require some data preparation steps.
Here is an example script
The trick to speed up the process is to keep a mapping between your IDs and the IDs created by JanusGraph during the creation of the nodes.
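The same idea expressed with gremlinpython rather than Groovy (file and column names are placeholders; the Groovy version is analogous):

    import csv

    from gremlin_python.process.anonymous_traversal import traversal
    from gremlin_python.process.graph_traversal import __
    from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection

    conn = DriverRemoteConnection("ws://localhost:8182/gremlin", "g")
    g = traversal().withRemote(conn)

    # Pass 1: create the nodes and remember csv_id -> JanusGraph vertex id.
    id_map = {}
    with open("nodes.csv", newline="") as f:
        for row in csv.DictReader(f):
            v = g.addV(row["label"]).property("name", row["name"]).next()
            id_map[row["id"]] = v.id

    # Pass 2: create the links, resolving both endpoints through the mapping
    # instead of looking each vertex up again.
    with open("links.csv", newline="") as f:
        for row in csv.DictReader(f):
            (g.V(id_map[row["source_id"]])
              .addE(row["type"])
              .to(__.V(id_map[row["target_id"]]))
              .iterate())

    conn.close()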
Even if it is not mandatory, I strongly recommend creating an explicit schema for your graph before loading any data. Here is an example script
I have approximately 12 GB of tab-separated data in a very simple format:
mainIdentifier, altIdentifierType, altIdentifierText
MainIdentifier is not a unique row identifier - only the whole combination of the 3 columns is unique. My main use case is looking up corresponding entries, going either from mainIdentifier or from the two different types of alternative identifiers.
From what I can glean, I would need to construct a lookup index for each entry direction to make it fast. However, given the simplicity of the task, I do not really need the index pointing to the record - the index itself is the answer.
I've tried sqlite3 in Python but, as expected, the result is not as fast as I would have liked. I am now considering just storing the two lists and moving through them in a binary-search fashion; however, I do not want to reinvent the wheel - is there an existing solution for this?
Also, I intend to run this as a REST-enabled service, so it's not feasible for the lookup table to be stored in memory in any fashion.
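For what it's worth, a covering index per lookup direction is the sqlite3 version of that idea and stays on disk rather than in memory (a sketch; table and column names just follow the format above):

    import sqlite3

    conn = sqlite3.connect("identifiers.db")
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS ids (
            main_identifier      TEXT NOT NULL,
            alt_identifier_type  TEXT NOT NULL,
            alt_identifier_text  TEXT NOT NULL
        );
        -- One covering index per lookup direction: each query can be answered
        -- from the index alone, without touching the table rows.
        CREATE INDEX IF NOT EXISTS idx_main ON ids (main_identifier, alt_identifier_type, alt_identifier_text);
        CREATE INDEX IF NOT EXISTS idx_alt  ON ids (alt_identifier_type, alt_identifier_text, main_identifier);
    """)

    rows = conn.execute(
        "SELECT alt_identifier_type, alt_identifier_text FROM ids WHERE main_identifier = ?",
        ("some_id",),
    ).fetchall()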