Neo4j: using a numeric index for nodes

Neo4j: using a numeric index for nodes - python

This is py2neo 1.6.
My question is how to generate the unique_identifier for each idea (see commented lines) in order to have a distinct filename for the image.
For the moment we are using python’s uuid.
I wonder if there is some utility in neo4j that can associate a distinct number to each node when the node is added to the index, and so that we can use this number as our unique_identifier
def create_idea_node(idea_text):
#basepath = 'http://www.example.com/ideas/img/'
#filename= str(unique_identifier)+'.png'
#idea_image_url = basepath + filename
newidea_node, = getGraph().create({"idea": idea_text, "idea_image_url": idea_image_url})
_getIdeasIndex().add("idea", idea_text, new_idea_node)
return OK
def _getIdeasIndex():
return getGraph().get_or_create_index(neo4j.Node, "Ideas")

Neo4j nodes have ids, they are integers, however if a node is destroyed and recreated, the integer may be reused. id(n) is the node n’s id. Is there something wrong with the UUID? Integer solutions can become problematic when you are multi-threading or distributing your computing project across multiple servers as you scale. So unless there is something wrong with the UUID solution, I’d just stick with that.
In spite of being hard to read, and perhaps requiring slightly more storage, UUID's have many advantages over trying to enforce uniqueness with integers (in general). I encourage you to read up on the nature of UUIDs on Wikipedia.
Integer uniqueness has many pitfalls when trying to scale across independent systems (for fault tolerance and performance reasons). If you can start out working with UUID's, you can grow with your solution for the long term with many fewer headaches down the road.
FWIW, if you end up storing UUID's in PostgreSQL sometime down the road, be sure to take advantage of the 'uuid' datatype. It will make storing and indexing those values almost as efficient as plain integers. (It will be hard to tell the difference.)

Related

What is an optimal method for isolating unrelated sets of data in neo4j?

I'm using neo4j to contain temporary datasets from different source systems. My data consists of a few parent objects which each contain ~4-7 layers of child objects of varying types. Total object count per dataset varies between 2,000 and 1.5 million. I'm using the python py2neo library, which has had good performance both during the data creation phase, and for passing through cypher queries for reporting.
I'd like to isolate datasets from unrelated systems for querying and purging purposes, but I'm worried about performance. I have a few ideas, but it's not clear to me which are the most likely to be viable.
The easiest to implement (for my code) would be a top-level "project" object. That project object would then have a few direct children (via a relationship) and many indirect children. I'm worried that when I want to filter by project, I'll have to use a relationship wildcard MATCH (pr:project)<-[:IN_PROJECT*7]-(c:child_object) distance, which seems to very expensive query-wise.
I could also make a direct relationship between the project object and every other object in the project. MATCH (pr:project)<-[:IN_PROJECT]-(c:child_object)This should be easier for writing queries, but I don't know what might happen when I have a single object with potentially millions of relationships.
Finally, I could set a project-id property on every single object in the dataset. MATCH (c:child_object {project-id:"A1B2C3"}) It seems to be a wasteful solution, but I think it might be better performance wise in the graph DB model.
Apologies if I mangled the sample Cypher queries / neo4j terminology. I set aside this project for 6 weeks, and I'm a little rusty.

If you have a finite set of datasets, you should consider using a dedicated label to specify the data source. In Neo4j's property graph data model, a node is allowed to have multiple labels.
MATCH (c:child_object:DataSourceA)
Labels are always indexed, so performance should be better than that of your proposals 1-3. I also think this is a more elegant solution -- however, it will get tricky if you do not know the number of data sets up front. In the latter case, you might use something like
MATCH (c:child_object)
WHERE 'DataSourceA' IN labels(c)
But this is more like a "full table scan", so performance-wise, you'll be better off using your approach 3 and building an index on project-id.

Hashing strings for load balancing

There's a great deal of information I can find on hashing strings for obfuscation or lookup tables, where collision avoidance is a primary concern. I'm trying to put together a hashing function for the purpose of load balancing, where I want to fit an unknown set of strings into an arbitrarily small number of buckets with a relatively even distribution. Collisions are expected (desired, even).
My immediate use case is load distribution in an application, where I want each instance of the application to fire at a different time of the half-hour, without needing any state information about other instances. So I'm trying to hash strings into integer values from 0 to 29. However, the general approach has wider application with different int ranges for different purposes.
Can anyone make suggestions, or point me to docs that would cover this little corner of hash generation?
My language of choice for this is python, but I can read most common langues so anything should be applicable.

Your might consider something simple, like the adler32() algo, and just mod for bucket size.
import zlib
buf = 'arbitrary and unknown string'
bucket = zlib.adler32(buf) % 30
# at this point bucket is in the range 0 - 29

Efficient large dicts of dicts to represent M:M relationships in Python

I have a very large dataset - millions of records - that I want to store in Python. I might be running on 32-bit machines so I want to keep the dataset down in the hundreds-of-MB range and not ballooning much larger than that.
These records - represent a M:M relationship - two IDs (foo and bar) and some simple metadata like timestamps (baz).
Some foo have too nearly all bar in them, and some bar have nearly all foo. But there are many bar that have almost no foos and many foos that have almost no bar.
If this were a relational database, a M:M relationship would be modelled as a table with a compound key. You can of course search on either component key individually comfortably.
If you store the rows in a hashtable, however, you need to maintain three hashtables as the compound key is hashed and you can't search on the component keys with it.
If you have some kind of sorted index, you can abuse lexical sorting to iterate the first key in the compound key, and need a second index for the other key; but its less obvious to me what actual data-structure in the standard Python collections this equates to.
I am considering a dict of foo where each value is automatically moved from tuple (a single row) to list (of row tuples) to dict depending on some thresholds, and another dict of bar where each is a single foo, or a list of foo.
Are there more efficient - speedwise and spacewise - ways of doing this? Any kind of numpy for indices or something?
(I want to store them in Python because I am having performance problems with databases - both SQL and NoSQL varieties. You end up being IPC memcpy and serialisation-bound. That is another story; however the key point is that I want to move the data into the application rather than get recommendations to move it out of the application ;) )

Have you considered using a NoSQL database that runs in memory such at Redis? Redis supports a decent amount of familiar data structures.
I realize you don't want to move outside of the application, but not reinventing the wheel can save time and quite frankly it may be more efficient.

If you need to query the data in a flexible way, and maintain various relationships, I would suggest looking further into using a database, of which there are many options. How about using an in-memory databse, like sqlite (using ":memory:" as the file)? You're not really moving the data "outside" of your program, and you will have much more flexibility than with multi-layered dicts.
Redis is also an interesting alternative, as it has other data-structures to play with, rather than using a relational model with SQL.

What you describe sounds like a sparse matrix, where the foos are along one axis and the bars along the other one. Each non-empty cell represents a relationship between one foo and one bar, and contains the "simple metadata" you describe.
There are efficient sparse matrix packages for Python (scipy.sparse, PySparse) you should look at. I found these two just by Googling "python sparse matrix".
As to using a database, you claim that you've had performance problems. I'd like to suggest that you may not have chosen an optimal representation, but without more details on what your access patterns look like, and what database schema you used, it's awfully hard for anybody to contribute useful help. You might consider editing your post to provide more information.

NoSQL systems like redis don't provide MM tables.
In the end, a python dict keyed by pairs holding the values, and a dict of the set of pairings for each term was the best I could come up with.
class MM:
def __init__(self):
self._a = {} # Bs for each A
self._b = {} # As for each B
self._ab = {}

Are there any radix/patricia/critbit trees for Python?

I have about 10,000 words used as a set of inverted indices to about 500,000 documents. Both are normalized so the index is a mapping of integers (word id) to a set of integers (ids of documents which contain the word).
My prototype uses Python's set as the obvious data type.
When I do a search for a document I find the list of N search words and their corresponding N sets. I want to return the set of documents in the intersection of those N sets.
Python's "intersect" method is implemented as a pairwise reduction. I think I can do better with a parallel search of sorted sets, so long as the library offers a fast way to get the next entry after i.
I've been looking for something like that for some time. Years ago I wrote PyJudy but I no longer maintain it and I know how much work it would take to get it to a stage where I'm comfortable with it again. I would rather use someone else's well-tested code, and I would like one which supports fast serialization/deserialization.
I can't find any, or at least not any with Python bindings. There is avltree which does what I want, but since even the pair-wise set merge take longer than I want, I suspect I want to have all my operations done in C/C++.
Do you know of any radix/patricia/critbit tree libraries written as C/C++ extensions for Python?
Failing that, what is the most appropriate library which I should wrap? The Judy Array site hasn't been updated in 6 years, with 1.0.5 released in May 2007. (Although it does build cleanly so perhaps It Just Works.)
(Edit: to clarify what I'm looking for from an API, I want something like:
def merge(document_sets):
probe_i = 0
probe_set = document_sets[probe_i]
document_id = GET_FIRST(probe_set)
while IS_VALID(document_id):
# See if the document is present in all sets
for i in range(1, len(document_sets)):
# dynamically adapt to favor the least matching set
target_i = (i + probe_i) % len(document_sets)
target = document_sets[target_i]
if document_id not in target_set:
probe_i = target_id
probe_set = document_sets[probe_i]
document_id = GET_NEXT(probe_set, document_id)
break
else:
yield document_id
I'm looking for something which implements GET_NEXT() to return the next entry which occurs after the given entry. This corresponds to Judy1N and the similar entries for other Judy arrays.
This algorithm dynamically adapts to the data should preferentially favor sets with low hits. For the type of data I work with this has given a 5-10% increase in performance.)
)

Yes, there are some, though I'm not sure if they're suitable for your use case: but it seems none of them are what you asked for.
BioPython has a Trie implementation in C.
Ah, here's a nice discussion including benchmarks: http://bugs.python.org/issue9520
Other (some very stale) implementations:
http://pypi.python.org/pypi/radix
py-radix is an implementation of a
radix tree data structure for the
storage and retrieval of IPv4 and IPv6
network prefixes.
https://bitbucket.org/markon/patricia-tree/src
A Python implementation of
patricia-tree
http://pypi.python.org/pypi/trie
A prefix tree (trie) implementation.
http://pypi.python.org/pypi/logilab-common/0.50.3
patricia.py : A Python implementation
of PATRICIA trie (Practical Algorithm
to Retrieve Information Coded in
Alphanumeric).

I've recently added iteration support to datrie, you may give it a try.

python solutions for managing scientific data dependency graph by specification values

I have a scientific data management problem which seems general, but I can't find an existing solution or even a description of it, which I have long puzzled over. I am about to embark on a major rewrite (python) but I thought I'd cast about one last time for existing solutions, so I can scrap my own and get back to the biology, or at least learn some appropriate language for better googling.
The problem:
I have expensive (hours to days to calculate) and big (GB's) data attributes that are typically built as transformations of one or more other data attributes. I need to keep track of exactly how this data is built so I can reuse it as input for another transformation if it fits the problem (built with right specification values) or construct new data as needed. Although it shouldn't matter, I typically I start with 'value-added' somewhat heterogeneous molecular biology info, for example, genomes with genes and proteins annotated by other processes by other researchers. I need to combine and compare these data to make my own inferences. A number of intermediate steps are often required, and these can be expensive. In addition, the end results can become the input for additional transformations. All of these transformations can be done in multiple ways: restricting with different initial data (eg using different organisms), by using different parameter values in the same inferences, or by using different inference models, etc. The analyses change frequently and build on others in unplanned ways. I need to know what data I have (what parameters or specifications fully define it), both so I can reuse it if appropriate, as well as for general scientific integrity.
My efforts in general:
I design my python classes with the problem of description in mind. All data attributes built by a class object are described by a single set of parameter values. I call these defining parameters or specifications the 'def_specs', and these def_specs with their values the 'shape' of the data atts. The entire global parameter state for the process might be quite large (eg a hundred parameters), but the data atts provided by any one class require only a small number of these, at least directly. The goal is to check whether previously built data atts are appropriate by testing if their shape is a subset of the global parameter state.
Within a class it is easy to find the needed def_specs that define the shape by examining the code. The rub arises when a module needs a data att from another module. These data atts will have their own shape, perhaps passed as args by the calling object, but more often filtered from the global parameter state. The calling class should be augmented with the shape of its dependencies in order to maintain a complete description of its data atts.
In theory this could be done manually by examining the dependency graph, but this graph can get deep, and there are many modules, which I am constantly changing and adding, and ... I'm too lazy and careless to do it by hand.
So, the program dynamically discovers the complete shape of the data atts by tracking calls to other classes attributes and pushing their shape back up to the caller(s) through a managed stack of __get__ calls. As I rewrite I find that I need to strictly control attribute access to my builder classes to prevent arbitrary info from influencing the data atts. Fortunately python is making this easy with descriptors.
I store the shape of the data atts in a db so that I can query whether appropriate data (i.e. its shape is a subset of the current parameter state) already exists. In my rewrite I am moving from mysql via the great SQLAlchemy to an object db (ZODB or couchdb?) as the table for each class has to be altered when additional def_specs are discovered, which is a pain, and because some of the def_specs are python lists or dicts, which are a pain to translate to sql.
I don't think this data management can be separated from my data transformation code because of the need for strict attribute control, though I am trying to do so as much as possible. I can use existing classes by wrapping them with a class that provides their def_specs as class attributes, and db management via descriptors, but these classes are terminal in that no further discovery of additional dependency shape can take place.
If the data management cannot easily be separated from the data construction, I guess it is unlikely that there is an out of the box solution but a thousand specific ones. Perhaps there is an applicable pattern? I'd appreciate any hints at how to go about looking or better describing the problem. To me it seems a general issue, though managing deeply layered data is perhaps at odds with the prevailing winds of the web.

I don't have specific python-related suggestions for you, but here are a few thoughts:
You're encountering a common challenge in bioinformatics. The data is large, heterogeneous, and comes in constantly changing formats as new technologies are introduced. My advice is to not overthink your pipelines, as they're likely to be changing tomorrow. Choose a few well defined file formats, and massage incoming data into those formats as often as possible. In my experience, it's also usually best to have loosely coupled tools that do one thing well, so that you can chain them together for different analyses quickly.
You might also consider taking a version of this question over to the bioinformatics stack exchange at http://biostar.stackexchange.com/

ZODB has not been designed to handle massive data, it is just for web-based applications and in any case it is a flat-file based database.
I recommend you to try PyTables, a python library to handle HDF5 files, which is a format used in astronomy and physics to store results from big calculations and simulations. It can be used as an hierarchical-like database and has also an efficient way to pickle python objects. By the way, the author of pytables explained that ZOdb was too slow for what he needed to do, and I can confirm you that. If you are interested in HDF5, there is also another library, h5py.
As a tool for managing the versioning of the different calculations you have, you can have a try at sumatra, which is something like an extension to git/trac but designed for simulations.
You should ask this question on biostar, you will find better answers there.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.