I am working on a project where I need to perform many merge operations of subgraphs onto a remote graph. Some elements of the subgraph may already exist in the remote graph. I am using py2neo v3 and Neo4j.
I tried using both the create and merge functions (via py2neo), and with both I get surprisingly bad performance. Even more surprising, the time taken to merge the subgraph seems to grow quadratically with both the number of nodes and the number of relationships! When the subgraph is too big, the transaction hangs. One thing I should note: I checked, and it is not py2neo that generates a number of Cypher statements growing quadratically with the size of the subgraph. So if something is wrong, it is either with how I am using those technologies, or with Neo4j's implementation. I also tried looking at the query plans for the queries generated by py2neo, and did not find any answer there as to why the query times grow so dramatically; don't take my word for it, though, since I am relatively uninitiated.
I could hardly find any relevant information online, so I tried conducting a proper benchmark comparing performance as a function of the number of nodes and the topology of the subgraph, depending on whether I use the merge or create operation and whether or not I use unique constraints. I include below some of the results I got for graphs with a "linear" topology, meaning that the number of relationships is roughly the same as the number of nodes (it doesn't grow quadratically).
In my benchmark, I use 5 different types of labels for nodes and relationships, which I assign randomly, and I reuse 30% of the nodes that already exist in the remote graph. The nodes I create have only one property, which acts as an identifier, and I report the performance depending on whether or not I add a unique constraint on this property. All the merging operations are run within a single transaction.
Query times for graphs with a linear topology, as a function of the number of nodes, using the py2neo create function
Query times for graphs with a linear topology, as a function of the number of nodes, using the py2neo merge function
As you can see, the time taken seems to grow quadratically with the number of nodes (and relationships).
The question I am having a hard time answering is whether I am doing something wrong, or failing to do something that I should, or whether this is the kind of performance we should expect from Neo4j for this kind of operation. Regardless, it seems that what I could do to alleviate this performance issue is to never try merging big subgraphs all at once, but rather merge the nodes batch by batch first and then the relationships. This could and would work, but I want to get to the bottom of this, if someone has any recommendation or insight to share.
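For illustration, this is roughly the workaround I have in mind; it is only a sketch, not my benchmark code, and the Thing label, uid property and REL relationship type are placeholders. It merges the nodes in batches first, then the relationships, using parameterized UNWIND ... MERGE queries through py2neo:

from py2neo import Graph

graph = Graph("bolt://localhost:7687")
graph.run("CREATE CONSTRAINT ON (n:Thing) ASSERT n.uid IS UNIQUE")

def merge_nodes(nodes, batch_size=1000):
    # nodes: list of dicts like {"uid": ...}
    for i in range(0, len(nodes), batch_size):
        graph.run("UNWIND $rows AS row MERGE (n:Thing {uid: row.uid})",
                  rows=nodes[i:i + batch_size])

def merge_relationships(rels, batch_size=1000):
    # rels: list of dicts like {"src": ..., "dst": ...}
    for i in range(0, len(rels), batch_size):
        graph.run("UNWIND $rows AS row "
                  "MATCH (a:Thing {uid: row.src}), (b:Thing {uid: row.dst}) "
                  "MERGE (a)-[:REL]->(b)",
                  rows=rels[i:i + batch_size])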
Edit
Here is a link to a gist to reproduce the results above, and others.
https://gist.github.com/alreadytaikeune/6be006f0a338502524552a9765e79af6
Edit 2
Following Michael Hunger's questions:
In the code I shared, I tried to write a formatter for neo4j.bolt logs, in order to capture the queries that are sent to the server. I don't have a systematic way to generate query plans for them however.
I did not try without docker and I don't have an SSD. However, considering the size I allocate for the jvm and size of the graph I am handling, everything should fit in RAM.
I use the latest docker image for neo4j, so the corresponding version seems to be 3.3.5
Unfortunately, the merge routine (and a few others) in v3 are a little naive and don't scale well. I have alternatives planned for py2neo v4 that build much more efficient queries instead of (in the case of merge) arbitrarily long sequences of MERGE statements. Version 4 should be released at some point next month (May 2018).
I have been reading the docs and searching online a bit, but I am still confused about the difference between persist and scatter.
I have been working with datasets about half a TB large, and have been using scatter to generate futures and then send them to workers. This has been working fine, but recently I started scaling up: now that I am dealing with datasets a few TB large, this method stops working. On the dashboard, I see workers not being triggered, and I am quite certain that this is a scheduler issue.
I saw this video by Matt Rocklin. When he deals with a large dataset, the first thing he does is persist it to (distributed) memory. I will give this a try with my large datasets, but meanwhile I am wondering: what is the difference between persist and scatter? Which specific situations is each best suited for? Do I still need to scatter after I persist?
Thanks.
First, persist. Imagine you have table A, which is used to make table B, and then you use B to generate two tables, C and D. You have two chains of lineage, A->B->C and A->B->D. Because of Dask's lazy evaluation, the A->B sequence can be computed twice, once to generate C and once for D. Persisting B computes it once and keeps the result in distributed memory, so both C and D can reuse it.
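A minimal sketch of the idea (file and column names here are made up):

from dask.distributed import Client
import dask.dataframe as dd

client = Client()                      # connect to the distributed scheduler

a = dd.read_parquet("a/*.parquet")     # table A (lazy)
b = a[a.value > 0]                     # table B, derived from A (still lazy)

b = b.persist()                        # compute B once, keep it in distributed memory

c = b.groupby("key").sum().compute()   # C reuses the persisted B ...
d = b.value.mean().compute()           # ... and so does D; A -> B is not recomputed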
Scatter is also called broadcast in other distributed frameworks. Basically, you have a sizeable object that you want to send to the workers ahead of time to minimize the transfer. Think of something like a machine learning model: you can scatter it ahead of time so it's available on all workers.
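Again a rough sketch, where the lookup array is just a stand-in for any large object:

from dask.distributed import Client
import numpy as np

client = Client()

lookup = np.random.rand(50_000_000)                      # sizeable object built on the client
lookup_future = client.scatter(lookup, broadcast=True)   # ship one copy to every worker up front

def use_lookup(i, table):
    return table[i]

# the future is resolved to the real array on the workers
futures = client.map(use_lookup, range(100), table=lookup_future)
results = client.gather(futures)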
I just started out with TinkerPop and JanusGraph, and I'm trying to figure this out based on the documentation.
I have three datasets, each containing about 20 million rows (csv files).
There is a specific model in which the variables and rows need to be connected, e.g. what are vertices, what are labels, what are edges, etc.
After having everything in a graph, I'd like to of course use some basic Gremlin to see how well the model works.
But first I need a way to get the data into Janusgraph.
Possibly there exist scripts for this.
But otherwise, is it perhaps something to be written in python, to open a csv file, get each row of a variable X, and add this as a vertex/edge/etc. ...?
Or am I completely misinterpreting Janusgraph/Tinkerpop?
Thanks for any help in advance.
EDIT:
Say I have a few files, each of which contains a few million rows representing people, and several variables representing different metrics. A first example could look like this:
          metric_1  metric_2  metric_3  ..
person_1  a         e         i
person_2  b         f         j
person_3  c         g         k
person_4  d         h         l
..
Should I translate this to files with nodes that are, in the first place, made up of just the values [a, ..., l]
(and later perhaps more elaborate sets of properties)?
And are [a,..., l] then indexed?
The 'Modern' graph here seems to have an index (numbers 1, ..., 12 for all the nodes and edges, independent of their overlapping label/category). Should each measurement be indexed separately and then linked to the person_x it belongs to?
Apologies for these probably straightforward questions, but I'm fairly new to this.
Well, the truth is that bulk loading real user data into JanusGraph is a real pain. I've been using JanusGraph since its very first version, about two years ago, and it's still a pain to bulk load data. A lot of that is not necessarily down to JanusGraph, because different users have very different data, different formats, and different graph models (i.e. some mostly need one vertex with one edge (e.g. child-mother), while others deal with one vertex with many edges (e.g. user-followers)). Last but definitely not least, the very nature of the tool deals with large datasets, not to mention that the underlying storage and index databases mostly come preconfigured to replicate massively (i.e. you might be thinking 20m rows, but you actually end up inserting 60m or 80m entries).
That said, I've had moderate success bulk loading some tens of millions of entries in decent timeframes (again, it will be painful, but here are the general steps):
Provide IDs when creating graph elements. If importing from e.g. MySQL, think of perhaps combining the table name with the id value to create unique IDs, e.g. users1, tweets2.
Don't specify the schema up front. This is because JanusGraph will need to ensure the data conforms on each insert.
Don't specify indexes up front. This is related to the above, but it really deserves its own entry: bulk insert first, index later.
Please, please, please, be aware of the underlying database's features for bulk inserts and activate them, i.e. read up on the Cassandra, ScyllaDB, or Bigtable docs, especially on replication and indexing.
After all of the above, configure JanusGraph for bulk loading, ensure your data integrity is correct (i.e. no duplicate ids), and consider some form of parallelising the insert requests, e.g. some kind of map-reduce system (a rough batching sketch follows below).
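To give a flavour of the batching idea, here is a sketch only, assuming a running Gremlin Server and made-up labels/properties; the same thing can be done from Groovy or any other Gremlin language variant:

from gremlin_python.structure.graph import Graph
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection

conn = DriverRemoteConnection("ws://localhost:8182/gremlin", "g")
g = Graph().traversal().withRemote(conn)

def flush(batch):
    # chain many addV steps into a single traversal: one round trip per batch
    t = g
    for row in batch:
        t = t.addV("person").property("uid", row["uid"]).property("name", row["name"])
    t.iterate()

def load(rows, batch_size=100):
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) == batch_size:
            flush(batch)
            batch = []
    if batch:
        flush(batch)

load([{"uid": "users1", "name": "alice"}, {"uid": "users2", "name": "bob"}])
conn.close()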
I think I've covered the major points. Again, there's no silver bullet here, and the process normally involves quite some trial and error, for example with the bulk insert rate: too low is bad (e.g. 10 per second), while too high is equally bad (e.g. 10k per second), and it almost always depends on your data, so it's a case-by-case basis; I can't recommend where you should start.
All said and done, give it a real go. Bulk loading is the hardest part in my opinion, and the struggles are well worth the new dimension it gives your application.
All the best!
JanusGraph uses pluggable storage backends and indexes. For testing purposes, a script called bin/janusgraph.sh is packaged with the distribution. It allows you to quickly get up and running by starting Cassandra and Elasticsearch (it also starts a Gremlin Server, but we won't use it):
cd /path/to/janus
bin/janusgraph.sh start
Then I would recommend loading your data using a Groovy script. Groovy scripts can be executed with the Gremlin Console:
bin/gremlin.sh -e scripts/load_data.script
An efficient way to load the data is to split it into two files:
nodes.csv: one line per node with all attributes
links.csv: one line per link with source_id and target_id and all the link attributes
This might require some data preparation steps.
Here is an example script
The trick to speed up the process is to keep a mapping between your id and the id created by JanusGraph during the creation of the nodes.
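For illustration, here is the same trick sketched in Python with gremlin-python rather than Groovy; the file, column and label names are assumptions, and the Groovy version follows the same pattern:

import csv
from gremlin_python.structure.graph import Graph
from gremlin_python.process.graph_traversal import __
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection

conn = DriverRemoteConnection("ws://localhost:8182/gremlin", "g")
g = Graph().traversal().withRemote(conn)

id_map = {}   # your id -> id assigned by JanusGraph

with open("nodes.csv") as f:
    for row in csv.DictReader(f):
        v = g.addV("person").property("uid", row["uid"]).next()
        id_map[row["uid"]] = v.id

with open("links.csv") as f:
    for row in csv.DictReader(f):
        (g.V(id_map[row["source_id"]])
          .addE("knows")
          .to(__.V(id_map[row["target_id"]]))
          .iterate())

conn.close()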
Even if it is not mandatory, I strongly recommend that you create an explicit schema for your graph before loading any data. Here is an example script
I'm using neo4j to contain temporary datasets from different source systems. My data consists of a few parent objects which each contain ~4-7 layers of child objects of varying types. Total object count per dataset varies between 2,000 and 1.5 million. I'm using the python py2neo library, which has had good performance both during the data creation phase, and for passing through cypher queries for reporting.
I'd like to isolate datasets from unrelated systems for querying and purging purposes, but I'm worried about performance. I have a few ideas, but it's not clear to me which are the most likely to be viable.
The easiest to implement (for my code) would be a top-level "project" object. That project object would then have a few direct children (via a relationship) and many indirect children. I'm worried that when I want to filter by project, I'll have to use a relationship wildcard, MATCH (pr:project)<-[:IN_PROJECT*7]-(c:child_object), which seems very expensive query-wise.
I could also make a direct relationship between the project object and every other object in the project: MATCH (pr:project)<-[:IN_PROJECT]-(c:child_object). This should be easier for writing queries, but I don't know what might happen when I have a single object with potentially millions of relationships.
Finally, I could set a project-id property on every single object in the dataset: MATCH (c:child_object {`project-id`: "A1B2C3"}). It seems to be a wasteful solution, but I think it might be better performance-wise in the graph DB model.
Apologies if I mangled the sample Cypher queries / neo4j terminology. I set aside this project for 6 weeks, and I'm a little rusty.
If you have a finite set of datasets, you should consider using a dedicated label to specify the data source. In Neo4j's property graph data model, a node is allowed to have multiple labels.
MATCH (c:child_object:DataSourceA)
Labels are always indexed, so performance should be better than that of your proposals 1-3. I also think this is a more elegant solution -- however, it will get tricky if you do not know the number of data sets up front. In the latter case, you might use something like
MATCH (c:child_object)
WHERE 'DataSourceA' IN labels(c)
But this is more like a "full table scan", so performance-wise, you'll be better off using your approach 3 and building an index on project-id.
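For example, with py2neo (label and property names below are placeholders, and note that a hyphenated property like project-id would need backticks in Cypher, so renaming it to project_id may be simpler):

from py2neo import Graph, Node

graph = Graph("bolt://localhost:7687")

# create each object with its type label plus a data-source label
graph.create(Node("child_object", "DataSourceA", name="example"))

# filtering by data source is then a plain label match
count = graph.run("MATCH (c:child_object:DataSourceA) RETURN count(c)").evaluate()

# fallback (your approach 3): index the project id property instead
graph.run("CREATE INDEX ON :child_object(project_id)")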
I have a Neo4j instance running with the Neo4j Spatial plugin. In it, I have a graph with around 3.5k nodes, each with the same label, which we'll call Basket. Each Basket relates to a physical location in the same city, and the density of these Baskets is very variable. I have calculated walking times between each Basket and all of its neighbours within 600 m, and stored these as non-spatial (directed) relationships between nodes. Thus, some Baskets exist as what seems to be part of a big cluster, and others exist almost on their own, with only one or almost no relationships to other Baskets.
My users have a problem: they wish to begin in one place, and end in another place, visiting an arbitrary, user-defined number of Baskets along the way. My program aims to provide a few route options for the user (as a sequence of nodes; I'll sort out the actual how-to-walk-there part later), calculating some number n of shortest paths.
I've written a cypher query to do this, below.
start a = node(5955), b=node(6497)
WITH a,b
MATCH p=((a)-[r:IS_WALKABLE_TO*4..5]->(b))
RETURN p
N.B. - nodes 5955 and 6497 are two nodes I picked about 2 miles apart, in this instance I decided to opt for between 4 and 5 baskets along the way.
However, I keep running into an out of memory exception, and so would like advice on how to reduce the memory demand for this problem to make it perform on an affordable server in an acceptable time of 1 to 6 seconds.
My understanding is that Neo4j would not perform a Cartesian Product to find the solution, but kind of "pick each node and sniff around from each one until it finds a suitable-sized connection" (please, forgive my phrasing!), so I'm confused about the heap memory error.
My thoughts for improving the program are to:
Somehow restrict the path-finding part of the query to nodes within a bounding box, determined by the placing of the start and end node (i.e., add 500 m in each direction, then limit the query to these nodes). However, I can't find any documentation on how to do this - is it possible without having to create another spatial layer for each query?
Re-write the query in a way which doesn't create a memory error - is this doable easily?
Stop using Neo4J for this entirely and write an algorithm to do it manually using an alternative language. If so, what language would you recommend? C? C++ / C#? Or could I stick with Python / Ruby / Java / Go? (or, I was even thinking I might be able to do it in PHP quite effectively but I'm not sure if that was a moment of madness).
Any help and advice about how to tackle this much appreciated!
You might be better off refactoring this Cypher query into Java code in an unmanaged extension. Your Java code might then use either the Traversal API or GraphAlgoFactory.pathsWithLength().
I think that due to the densely connected shape of your graph, you easily end up with hundreds of millions of possible paths due to duplicate intermediate nodes.
You should add a LIMIT 100 to your query so that it stops searching for more paths once it has found that many.
One other idea is to rewrite your query to first find distinct starting points around a (and potentially b).
start a = node(5955), b=node(6497)
MATCH (a)-[:IS_WALKABLE_TO]->(a1)-[:IS_WALKABLE_TO]->(a2)
WITH a, b, a2, collect(a1) as first
MATCH p = shortestPath((a2)-[:IS_WALKABLE_TO*..2]->(b))
RETURN count(*)
// or, instead of the RETURN count(*) above:
UNWIND first as a1
RETURN [a,a1] + nodes(p) as path
I have a collection that is potentially going to be very large. Now I know MongoDB doesn't really have a problem with this, but I don't really know how to go about designing a schema that can handle a very large dataset comfortably. So I'm going to give an outline of the problem.
We are collecting large amounts of data for our customers. Basically, when we gather this data it is represented as a 3-tuple, let's say (a, b, c), where b and c are members of sets B and C respectively. In this particular case we know that the B and C sets will not grow very much over time. For our current customers we are talking about ~200,000 members. However, the A set is the one that keeps growing over time. Currently we are at about ~2,000,000 members per customer, but this is going to grow (possibly rapidly). Also, there are 1->n relations between b and a, and between c and a.
The workload on this dataset is basically split into 3 use cases. First, the collections will be periodically updated, where A will get the most writes, and B and C will get some, but not many. The second use case is random access into B, then aggregating over some number of documents in C that pertain to a given b in B. And the last use case is basically streaming a large subset from A and B to generate some new data.
The problem that we are facing is that the indexes are getting quite big. Currently we have a test setup with about 8 small customers, the total dataset is about 15GB in size at the moment, and indexes are running at about 3GB to 4GB. The problem here is that we don't really have any hot zones in our dataset. It's basically going to get an evenly distributed load amongst all documents.
Basically we've come up with 2 options to do this. The one that I described above, where all data for all customers is piled into one collection. This means we'd have to create an index on some field that links the documents in that collection to a particular customer.
The other option is to throw all b's and c's together (these sets are relatively small) but divide up the large A collection, one per customer. I can imagine this last solution being a bit harder to manage, but since we rarely access data for multiple customers at the same time, it would prevent memory problems: MongoDB would be able to load the customer's index into memory and just run from there.
What are your thoughts on this?
P.S.: I hope this wasn't too vague, if anything is unclear I'll go into some more details.
It sounds like the larger set (A, if I followed along correctly) could reasonably be put into its own database. I say database rather than collection because, now that 2.2 is released, you would want to minimize lock contention between the busier database and the others, and to do that a separate database would be best (2.2 introduced database-level locking). That is looking at this from a single replica set model, of course.
Also the index sizes sound a bit out of proportion to your data size - are you sure they are all necessary? Pruning unneeded indexes, combining and using compound indexes may well significantly reduce the pain you are hitting in terms of index growth (it would potentially make updates and inserts more efficient too). This really does need specifics and probably belongs in another question, or possibly a thread in the mongodb-user group so multiple eyes can take a look and make suggestions.
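Just to illustrate what I mean by combining indexes (the field names here are invented, not from your schema), two single-field indexes can often be replaced by one compound index whose prefix still serves queries on the first field:

from pymongo import MongoClient, ASCENDING

coll = MongoClient()["mydb"]["measurements"]

# one compound index instead of separate indexes on customer_id and b_id;
# queries filtering on customer_id alone can still use the prefix
coll.create_index([("customer_id", ASCENDING), ("b_id", ASCENDING)])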
If we look at it with the possibility of sharding thrown in, then the truly important piece is to pick a shard key that allows you to make sure locality is preserved on the shards for the pieces you will frequently need to access together. That would lend itself more toward a single sharded collection (preserving locality across multiple related sharded collections is going to be very tricky unless you manually split and balance the chunks in some way). Sharding gives you the ability to scale out horizontally as your indexes hit the single instance limit etc. but it is going to make the shard key decision very important.
Again, specifics for picking that shard key are beyond the scope of this more general discussion, similar to the potential index review I mentioned above.
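Purely to show the mechanics involved (the database, collection and key fields below are made up, not a recommendation for your actual shard key):

from pymongo import MongoClient

client = MongoClient("mongodb://mongos-host:27017")   # connect through a mongos router

client.admin.command("enableSharding", "mydb")
client.admin.command(
    "shardCollection", "mydb.measurements",
    key={"customer_id": 1, "_id": 1},   # customer first for locality, _id for cardinality
)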