I'm wondering about the optimal batch size when doing a bulk create in Django. I don't know the exact amount of data in advance; it might be thousands or millions of rows. How can I decide on the optimal batch size?
According to the documentation, by default it tries to do everything in one query, which in most cases is the fastest way to insert. The "optimal" batch size depends on many factors: the type of database and its limits, server performance, network latency, and your own code (for example, how many changes you make to each object).
If it's a new project, or you haven't noticed any issues with these inserts, I'd stick to the default.
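If you do hit driver or database limits and want to experiment, here's a minimal sketch of capping the per-query size (the model name is hypothetical):

    from myapp.models import MyModel  # hypothetical model

    # Build the objects in memory first; nothing is saved yet.
    objs = [MyModel(value=i) for i in range(1_000_000)]

    # With batch_size set, Django issues one INSERT per 10,000 objects
    # instead of trying to fit everything into a single query.
    MyModel.objects.bulk_create(objs, batch_size=10_000)

Timing that call with a few different batch_size values against your own database is the most reliable way to pick a number.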
I'm using ArangoDB 3.9.2 for a search task. The dataset contains 100,000 items. When I pass the entire dataset to the engine as one input list, the execution time is around ~10 s, which is pretty quick. But if I pass the dataset in small batches of 100 items, one by one, the execution time grows rapidly: processing the full dataset then takes about ~2 min. Could you explain why this happens? The dataset is the same.
I'm using the Python driver ArangoClient from the python-arango library, version 0.2.1.
PS: I had a similar problem with Neo4j, but it was solved by committing transactions via the HTTP API. Does ArangoDB have something similar?
Every time you make a call to a remote system (Neo4j, ArangoDB, or any database) there is overhead in making the connection, sending the data, and then, after executing your command, tearing down the connection.
What you're doing is trying to find the 'sweet spot' for your implementation as to the most efficient batch size for the type of data you are sending, the complexity of your query, the performance of your hardware, etc.
What I recommend doing is writing a test script that sends the data in varying batch sizes to help you determine the optimal settings for your use case.
I have taken this approach with many systems that I've designed and the optimal batch sizes are unique to each implementation. It totally depends on what you are doing.
See what results you get for the overall load time if you use batch sizes of 100, 1000, 2000, 5000, and 10000.
This way you'll work out the best answer for you.
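A rough sketch of such a test script, assuming a recent python-arango API (the 0.2.1 driver you mention may differ) and invented connection details and collection name:

    import time
    from arango import ArangoClient

    client = ArangoClient(hosts="http://localhost:8529")
    db = client.db("test", username="root", password="secret")
    col = db.collection("items")

    docs = [{"_key": str(i), "value": i} for i in range(100_000)]  # stand-in dataset

    for batch_size in (100, 1_000, 2_000, 5_000, 10_000):
        col.truncate()
        start = time.perf_counter()
        for i in range(0, len(docs), batch_size):
            col.insert_many(docs[i:i + batch_size])  # one round-trip per batch
        print(batch_size, round(time.perf_counter() - start, 2), "s")

The shape of the curve you get (total time vs. batch size) shows where the per-call overhead stops dominating for your data and hardware.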
I'm using neo4j to contain temporary datasets from different source systems. My data consists of a few parent objects which each contain ~4-7 layers of child objects of varying types. Total object count per dataset varies between 2,000 and 1.5 million. I'm using the python py2neo library, which has had good performance both during the data creation phase, and for passing through cypher queries for reporting.
I'd like to isolate datasets from unrelated systems for querying and purging purposes, but I'm worried about performance. I have a few ideas, but it's not clear to me which are the most likely to be viable.
The easiest to implement (for my code) would be a top-level "project" object. That project object would then have a few direct children (via a relationship) and many indirect children. I'm worried that when I want to filter by project, I'll have to match across a variable relationship distance, MATCH (pr:project)<-[:IN_PROJECT*7]-(c:child_object), which seems very expensive query-wise.
I could also make a direct relationship between the project object and every other object in the project: MATCH (pr:project)<-[:IN_PROJECT]-(c:child_object). This would make queries easier to write, but I don't know what might happen when a single object has potentially millions of relationships.
Finally, I could set a project-id property on every single object in the dataset: MATCH (c:child_object {project-id:"A1B2C3"}). That seems like a wasteful solution, but I think it might be better performance-wise in the graph DB model.
Apologies if I mangled the sample Cypher queries / neo4j terminology. I set aside this project for 6 weeks, and I'm a little rusty.
If you have a finite set of datasets, you should consider using a dedicated label to specify the data source. In Neo4j's property graph data model, a node is allowed to have multiple labels.
MATCH (c:child_object:DataSourceA)
Labels are always indexed, so performance should be better than that of your proposals 1-3. I also think this is a more elegant solution -- however, it will get tricky if you do not know the number of data sets up front. In the latter case, you might use something like
MATCH (c:child_object)
WHERE 'DataSourceA' IN labels(c)
But this is more like a "full table scan", so performance-wise, you'll be better off using your approach 3 and building an index on project-id.
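As a rough illustration with py2neo (which you already use) – the connection details and label names below are placeholders:

    from py2neo import Graph, Node

    graph = Graph("bolt://localhost:7687", auth=("neo4j", "password"))

    # Tag every node with both its type label and a per-dataset label.
    item = Node("child_object", "DataSourceA", name="item-1")
    graph.create(item)

    # Filtering on the extra label avoids both deep traversals and a
    # many-million-relationship "project" hub node.
    count = graph.run(
        "MATCH (c:child_object:DataSourceA) RETURN count(c)"
    ).evaluate()

    # Purging a dataset is equally direct; DETACH DELETE also removes relationships.
    graph.run("MATCH (c:DataSourceA) DETACH DELETE c")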
I want to use Word2vec in a web server (production) in two different variants where I fetch two sentences from the web and compare them in real time. For now, I am testing it on a local machine which has 16GB of RAM.
Scenario:
w2v = load w2v model
if condition 1 is true:
    if normalized:
        reverse the normalization via w2v.init_sims(replace=False) (not sure if it will work)
    loop through some items:
        calculate their vectors using w2v
else if condition 2 is true:
    if not normalized:
        w2v.init_sims(replace=True)
    loop through some items:
        calculate their vectors using w2v
I have already read the solution about shrinking the vocabulary to a small size, but I would like to use the whole vocabulary.
Are there any new workarounds for handling this? Is there a way to initially load a small portion of the vocabulary for the first 1-2 minutes and, in parallel, keep loading the whole vocabulary?
As a one-time delay that you should be able to schedule before any service requests are handled, I would recommend against worrying too much about the first load() time. (It's inherently going to take a lot of time to load a lot of data from disk into RAM – but once it's there, if it's kept around and shared between processes well, the cost isn't paid again for an arbitrarily long service uptime.)
It doesn't make much sense to "load a small portion of the vocabulary for the first 1-2 minutes and in parallel keep loading the whole vocabulary" – as soon as any similarity calculation is needed, the whole set of vectors needs to be accessed to produce any top-N results. (So the "half-loaded" state isn't very useful.)
Note that if you do init_sims(replace=True), the model's original raw vector magnitudes are clobbered with the new unit-normed (all-same-magnitude) vectors. So, looking at your pseudocode, the only difference between the two paths is the explicit init_sims(replace=True). But if you're truly keeping the same shared model in memory between requests, then as soon as condition 2 occurs the model is normalized, and thereafter calls under condition 1 also operate on normalized vectors. Further, additional calls under condition 2 just redundantly (and expensively) re-normalize the vectors in place. So if normalized comparisons are your only focus, it's best to do a single in-place init_sims(replace=True) at service startup rather than leaving it at the mercy of request ordering.
If you've saved the model using gensim's native save() (rather than save_word2vec_format()), and as uncompressed files, there's the option to 'memory-map' the files on a future re-load. This means that rather than immediately copying the full vector array into RAM, the file on disk is simply marked as providing the addressing space. There are two potential benefits to this: (1) if you only ever access some limited ranges of the array, only those are loaded, on demand; (2) many separate processes all using the same mapped files will automatically reuse any shared ranges loaded into RAM, rather than potentially duplicating the same data.
Benefit (1) evaporates as soon as you need a full sweep over the whole vocabulary – all the vectors get brought into RAM anyway, and at the moment of access rather than up front (which means more service lag than if you'd simply pre-loaded them). But (2) is still an advantage in multi-process webserver scenarios. There's a lot more detail on how you might use memory-mapped word2vec models efficiently in a prior answer of mine, at How to speed up Gensim Word2vec model load time?
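Putting those pieces together, a rough sketch assuming the gensim 3.x API (init_sims was deprecated in later versions) and placeholder file names:

    from gensim.models import KeyedVectors

    # One-time, offline preparation: normalize in place and re-save in gensim's
    # native format (uncompressed) so later loads can be memory-mapped.
    kv = KeyedVectors.load_word2vec_format("vectors.bin", binary=True)  # hypothetical path
    kv.init_sims(replace=True)        # replaces raw vectors with unit-normed ones
    kv.save("vectors_normed.kv")

    # At web-server startup, in each worker process: memory-map the saved arrays.
    # The OS page cache then shares the vector data between worker processes.
    kv = KeyedVectors.load("vectors_normed.kv", mmap="r")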
I have a collection that is potentially going to be very large. Now I know MongoDB doesn't really have a problem with this, but I don't really know how to go about designing a schema that can handle a very large dataset comfortably. So I'm going to give an outline of the problem.
We are collecting large amounts of data for our customers. Basically, when we gather this data it is represented as a 3-tuple, let's say (a, b, c), where b and c are members of sets B and C respectively. In this particular case we know that the B and C sets will not grow very much over time; for our current customers we are talking about ~200,000 members. However, the A set is the one that keeps growing over time. Currently we are at about ~2,000,000 members per customer, but this is going to grow (possibly rapidly). Also, there are 1->n relations from b->a and from c->a.
The workload on this data set is basically split up into 3 use cases. The collections will be periodically updated, where A will get the most writes, and B and C will get some, but not many. The second use case is random access into B, then aggregating over some number of documents in C that pertain to b ∈ B. And the last use case is basically streaming a large subset from A and B to generate some new data.
The problem that we are facing is that the indexes are getting quite big. Currently we have a test setup with about 8 small customers, the total dataset is about 15GB in size at the moment, and indexes are running at about 3GB to 4GB. The problem here is that we don't really have any hot zones in our dataset. It's basically going to get an evenly distributed load amongst all documents.
Basically we've come up with 2 options to do this. The one that I described above, where all data for all customers is piled into one collection. This means we'd have to create an index on some field that links the documents in that collection to a particular customer.
The other option is to throw all b's and c's together (these sets are relatively small) but divide up the large collection, one per customer. I can imagine this last solution being a bit harder to manage, but since we rarely access data for multiple customers at the same time, it would prevent memory problems. MongoDB would be able to load the customer's index into memory and just run from there.
What are your thoughts on this?
P.S.: I hope this wasn't too vague, if anything is unclear I'll go into some more details.
It sounds like the larger set (A if I followed along correctly), could reasonably be put into its own database. I say database rather than collection, because now that 2.2 is released you would want to minimize lock contention between the busier database and the others, and to do that a separate database would be best (2.2 introduced database level locking). That is looking at this from a single replica set model, of course.
Also the index sizes sound a bit out of proportion to your data size - are you sure they are all necessary? Pruning unneeded indexes, combining and using compound indexes may well significantly reduce the pain you are hitting in terms of index growth (it would potentially make updates and inserts more efficient too). This really does need specifics and probably belongs in another question, or possibly a thread in the mongodb-user group so multiple eyes can take a look and make suggestions.
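To make the index review concrete, a small pymongo sketch – the collection, field, and index names here are made up:

    from pymongo import MongoClient, ASCENDING

    db = MongoClient()["mydb"]                       # hypothetical database name

    # See how much space each collection's indexes actually take.
    stats = db.command("collStats", "measurements")  # hypothetical collection name
    print(stats["totalIndexSize"], stats["indexSizes"])

    # If queries always filter on customer first, one compound index can replace
    # separate single-field indexes on customer_id and b_id.
    db.measurements.drop_index("b_id_1")             # assumed existing index name
    db.measurements.create_index(
        [("customer_id", ASCENDING), ("b_id", ASCENDING)]
    )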
If we look at it with the possibility of sharding thrown in, then the truly important piece is to pick a shard key that allows you to make sure locality is preserved on the shards for the pieces you will frequently need to access together. That would lend itself more toward a single sharded collection (preserving locality across multiple related sharded collections is going to be very tricky unless you manually split and balance the chunks in some way). Sharding gives you the ability to scale out horizontally as your indexes hit the single instance limit etc. but it is going to make the shard key decision very important.
Again, specifics for picking that shard key are beyond the scope of this more general discussion, similar to the potential index review I mentioned above.
I just took my first baby step into real scientific computing today when I was shown a data set where the smallest file is 48000 fields by 1600 rows (haplotypes for several people, for chromosome 22). And this is considered tiny.
I write Python, so I've spent the last few hours reading about HDF5, NumPy, and PyTables, but I still feel like I'm not really grokking what a terabyte-sized data set actually means for me as a programmer.
For example, someone pointed out that with larger data sets, it becomes impossible to read the whole thing into memory, not because the machine has insufficient RAM, but because the architecture has insufficient address space! It blew my mind.
What other assumptions have I been relying on in the classroom that just don't work with input this big? What kinds of things do I need to start doing or thinking about differently? (This doesn't have to be Python specific.)
I'm currently engaged in high-performance computing in a small corner of the oil industry and regularly work with datasets of the orders of magnitude you are concerned about. Here are some points to consider:
Databases don't have a lot of traction in this domain. Almost all our data is kept in files; some of those files are based on tape file formats designed in the 70s. I think that part of the reason for the non-use of databases is historic: 10, even 5, years ago I think that Oracle and its kin just weren't up to the task of managing single datasets of O(TB), let alone a database of 1000s of such datasets.
Another reason is a conceptual mismatch between the normalisation rules for effective database analysis and design and the nature of scientific data sets.
I think (though I'm not sure) that the performance reason(s) are much less persuasive today. And the concept-mismatch reason is probably also less pressing now that most of the major databases available can cope with spatial data sets which are generally a much closer conceptual fit to other scientific datasets. I have seen an increasing use of databases for storing meta-data, with some sort of reference, then, to the file(s) containing the sensor data.
However, I'd still be looking at, in fact am looking at, HDF5. It has a couple of attractions for me: (a) it's just another file format, so I don't have to install a DBMS and wrestle with its complexities, and (b) with the right hardware I can read/write an HDF5 file in parallel. (Yes, I know that I can read and write databases in parallel too.)
Which takes me to the second point: when dealing with very large datasets you really need to be thinking of using parallel computation. I work mostly in Fortran, one of its strengths is its array syntax which fits very well onto a lot of scientific computing; another is the good support for parallelisation available. I believe that Python has all sorts of parallelisation support too so it's probably not a bad choice for you.
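As one trivial Python sketch of that kind of parallelism – an embarrassingly parallel reduction over files that are assumed to be pre-split into chunks (the file layout and per-chunk work are invented):

    from multiprocessing import Pool

    def summarize(path):
        # Each worker reads and reduces one file independently, so no shared
        # state is needed until the final merge step.
        total, count = 0.0, 0
        with open(path) as fh:
            for line in fh:
                total += float(line.split()[0])
                count += 1
        return total, count

    if __name__ == "__main__":
        chunks = [f"chunk_{i:04d}.txt" for i in range(64)]  # assumed pre-split input
        with Pool() as pool:
            partials = pool.map(summarize, chunks)
        grand_total = sum(t for t, _ in partials)
        n = sum(c for _, c in partials)
        print(grand_total / n)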
Sure you can add parallelism on to sequential systems, but it's much better to start out designing for parallelism. To take just one example: the best sequential algorithm for a problem is very often not the best candidate for parallelisation. You might be better off using a different algorithm, one which scales better on multiple processors. Which leads neatly to the next point.
I think also that you may have to come to terms with surrendering any attachments you have (if you have them) to lots of clever algorithms and data structures which work well when all your data is resident in memory. Very often, trying to adapt them to the situation where you can't get the data into memory all at once is much harder (and less performant) than taking a brute-force approach and regarding the entire file as one large array.
Performance starts to matter in a serious way, both the execution performance of programs and developer performance. It's not that a 1TB dataset requires 10 times as much code as a 1GB dataset so you have to work faster; it's that some of the ideas that you will need to implement will be crazily complex, and probably have to be written by domain specialists, i.e. the scientists you are working with. Here the domain specialists write in Matlab.
But this is going on too long, I'd better get back to work
In a nutshell, the main differences IMO:
- You should know beforehand what your likely bottleneck will be (I/O or CPU) and focus on the best algorithm and infrastructure to address this. I/O is quite frequently the bottleneck.
- Choice and fine-tuning of an algorithm often dominates any other choice made. Even modest changes to algorithms and access patterns can impact performance by orders of magnitude. You will be micro-optimizing a lot. The "best" solution will be system-dependent.
- Talk to your colleagues and other scientists to profit from their experiences with these data sets. A lot of tricks cannot be found in textbooks.
- Pre-computing and storing can be extremely successful.
Bandwidth and I/O
Initially, bandwidth and I/O often is the bottleneck. To give you a perspective: at the theoretical limit for SATA 3, it takes about 30 minutes to read 1 TB. If you need random access, read several times or write, you want to do this in memory most of the time or need something substantially faster (e.g. iSCSI with InfiniBand). Your system should ideally be able to do parallel I/O to get as close as possible to the theoretical limit of whichever interface you are using. For example, simply accessing different files in parallel in different processes, or HDF5 on top of MPI-2 I/O is pretty common. Ideally, you also do computation and I/O in parallel so that one of the two is "for free".
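As one hedged example of slab-wise access (using h5py here, though PyTables works along the same lines; the file and dataset names are invented):

    import h5py
    import numpy as np

    # Stream a dataset far larger than RAM in fixed-size slabs instead of
    # reading it whole.
    with h5py.File("haplotypes.h5", "r") as f:
        dset = f["chr22"]                              # e.g. shape (1600, 48000)
        col_totals = np.zeros(dset.shape[1])
        rows_per_slab = 200
        for start in range(0, dset.shape[0], rows_per_slab):
            slab = dset[start:start + rows_per_slab]   # only this slab is read from disk
            col_totals += slab.sum(axis=0)             # stand-in for the real per-slab work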
Clusters
Depending on your case, either I/O or CPU might then be the bottleneck. No matter which one it is, huge performance increases can be achieved with clusters if you can effectively distribute your tasks (for example, with MapReduce). This might require totally different algorithms than the typical textbook examples. Spending development time here is often the best time spent.
Algorithms
In choosing between algorithms, big O of an algorithm is very important, but algorithms with similar big O can be dramatically different in performance depending on locality. The less local an algorithm is (i.e. the more cache misses and main memory misses), the worse the performance will be - access to storage is usually an order of magnitude slower than main memory. Classical examples for improvements would be tiling for matrix multiplications or loop interchange.
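A small NumPy illustration of the locality point – the same number of additions, ordered two ways (the exact ratio obviously depends on the machine):

    import time
    import numpy as np

    a = np.random.rand(4000, 4000)      # C-ordered: rows are contiguous in memory

    t0 = time.perf_counter()
    row_total = sum(float(a[i].sum()) for i in range(a.shape[0]))     # contiguous reads
    t1 = time.perf_counter()
    col_total = sum(float(a[:, j].sum()) for j in range(a.shape[1]))  # strided reads
    t2 = time.perf_counter()

    # Same arithmetic, same result, very different cache behaviour.
    print(f"rows: {t1 - t0:.3f}s  cols: {t2 - t1:.3f}s")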
Computer, Language, Specialized Tools
If your bottleneck is I/O, this means that algorithms for large data sets can benefit from more main memory (e.g. 64 bit) or programming languages / data structures with less memory consumption (e.g., in Python __slots__ might be useful), because more memory might mean less I/O per CPU time. BTW, systems with TBs of main memory are not unheard of (e.g. HP Superdomes).
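A quick sketch of the __slots__ point – dropping the per-instance __dict__ saves memory on every object, which adds up when you hold millions of records:

    import sys

    class Plain:
        def __init__(self, a, b, c):
            self.a, self.b, self.c = a, b, c

    class Slotted:
        __slots__ = ("a", "b", "c")   # no per-instance __dict__
        def __init__(self, a, b, c):
            self.a, self.b, self.c = a, b, c

    p, s = Plain(1, 2, 3), Slotted(1, 2, 3)
    print(sys.getsizeof(p) + sys.getsizeof(p.__dict__), sys.getsizeof(s))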
Similarly, if your bottleneck is the CPU, faster machines, languages and compilers that allow you to use special features of an architecture (e.g. SIMD like SSE) might increase performance by an order of magnitude.
The way you find and access data, and store meta information can be very important for performance. You will often use flat files or domain-specific non-standard packages to store data (e.g. not a relational db directly) that enable you to access data more efficiently. For example, kdb+ is a specialized database for large time series, and ROOT uses a TTree object to access data efficiently. The pyTables you mention would be another example.
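For instance, a hedged PyTables sketch of querying a large table on disk rather than loading it whole – the file, table, and column names are invented:

    import tables  # PyTables

    with tables.open_file("measurements.h5", mode="r") as f:
        table = f.root.readings
        # The selection is evaluated chunk by chunk on disk (and can use an
        # index created with table.cols.value.create_index()).
        hits = table.read_where("(value > 10.0) & (sensor_id == 42)")
        print(len(hits))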
While some languages have naturally lower memory overhead in their types than others, that really doesn't matter for data this size - you're not holding your entire data set in memory regardless of the language you're using, so the "expense" of Python is irrelevant here. As you pointed out, there simply isn't enough address space to even reference all this data, let alone hold onto it.
What this normally means is either a) storing your data in a database, or b) adding resources in the form of additional computers, thus adding to your available address space and memory. Realistically you're going to end up doing both of these things. One key thing to keep in mind when using a database is that a database isn't just a place to put your data while you're not using it - you can do WORK in the database, and you should try to do so. The database technology you use has a large impact on the kind of work you can do, but an SQL database, for example, is well suited to do a lot of set math and do it efficiently (of course, this means that schema design becomes a very important part of your overall architecture). Don't just suck data out and manipulate it only in memory - try to leverage the computational query capabilities of your database to do as much work as possible before you ever put the data in memory in your process.
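A toy illustration of doing the work in the database, using the standard-library sqlite3 module as a stand-in for whatever SQL database you actually run – the table and column names are hypothetical:

    import sqlite3

    conn = sqlite3.connect("measurements.db")   # hypothetical database file

    # Instead of fetching millions of raw rows and grouping them in Python,
    # let the database do the set math and return only the small result.
    rows = conn.execute(
        """
        SELECT sensor_id, COUNT(*), AVG(value)
        FROM readings
        GROUP BY sensor_id
        """
    ).fetchall()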
The main assumptions are about the amount of CPU/cache/RAM/storage/bandwidth you can have in a single machine at an acceptable price. There are lots of answers here on Stack Overflow still based on the old assumptions of a 32-bit machine with 4 GB of RAM, about a terabyte of storage, and a 1 Gb network. With 16 GB DDR3 RAM modules at 220 EUR, machines with 512 GB of RAM and 48 cores can be built at reasonable prices. The switch from hard disks to SSDs is another important change.