I am trying to optimize my Python training script (I need to run it multiple times, so it makes sense to try to speed it up). I have a dataset composed of 9 months of data. The validation setup is a kind of "temporal validation": I leave one month out, train on the remaining months (with different sampling methods), and make predictions over the "test month".
months  # set of months
for test_month in months:
    sample_list = generate_different_samples(months - {test_month})
    for sample in sample_list:
        xgb.train(sample)
        xgb.predict(test_month)
        # evaluation afterwards
In practice, I have almost 100 different training samples for each month. I am running my code on a machine with 16 cores and 64 GB of RAM. Memory is not a problem (the dataset contains millions of instances but does not fill the memory). I am currently parallelizing at the "test_month" level, creating a ProcessPool that runs all 9 months together; however, I am struggling to set the nthread parameter of xgboost. At the moment it is 2, so that each thread runs on roughly one core, but I am reading different opinions online (https://github.com/dmlc/xgboost/issues/3042). Should I increase this number? I know the question may be a little vague, but I was looking for a systematic way to select the best value based on the dataset structure.
It will not come as a surprise, but there is no single golden-goose strategy for this. At least I have never bumped into one so far. If you establish one, please share it here; I will be interested to learn.
There is a piece of advice in the documentation of lightgbm, a competitor GBM tool, which says:
for the best speed, set this to the number of real CPU cores, not the number of threads (most CPUs use hyper-threading to generate 2 threads per CPU core)
I'm not aware of a similar recommendation from the xgboost authors, but to a zeroth-order approximation I do not see a reason why the two implementations would scale differently.
The most in-depth benchmarking of GBM tools that I have seen is this one by Laurae. It shows, among other things, performance scaling as a function of the number of threads. Beware that it is really advanced, and the conclusions there might not apply directly unless one implements the same preparatory steps at the OS level.
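If you want a rough, dataset-specific answer rather than relying on external benchmarks, a hedged approach is to time one representative xgb.train call at a few nthread values before committing to the full run. In the sketch below, dtrain stands for one of your real training DMatrix objects and the other training parameters are left at their defaults:

import time
import xgboost as xgb

# dtrain is assumed to be one representative training DMatrix already built
params = {"nthread": None}          # filled in below; other params as in your real runs
for nthread in (1, 2, 4, 8, 16):
    params["nthread"] = nthread
    t0 = time.perf_counter()
    xgb.train(params, dtrain, num_boost_round=50)
    print(nthread, "threads:", round(time.perf_counter() - t0, 1), "s")

Keep in mind that whatever you measure in isolation has to be shared among the 9 concurrent processes, so the sweet spot on a 16-core machine is often where nthread times the number of processes stays close to the number of physical cores.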
Related
I'm training word2vec from scratch on a 34 GB pre-processed MS_MARCO corpus (the raw corpus is 22 GB; the pre-processed corpus is sentencepiece-tokenized, which is why it is larger). I'm training my word2vec model using the following code:
from gensim.test.utils import common_texts, get_tmpfile
from gensim.models import Word2Vec

class Corpus():
    """Iterate over sentences from the corpus."""
    def __init__(self):
        self.files = [
            "sp_cor1.txt",
            "sp_cor2.txt",
            "sp_cor3.txt",
            "sp_cor4.txt",
            "sp_cor5.txt",
            "sp_cor6.txt",
            "sp_cor7.txt",
            "sp_cor8.txt"
        ]

    def __iter__(self):
        for fname in self.files:
            for line in open(fname):
                words = line.split()
                yield words

sentences = Corpus()

model = Word2Vec(sentences, size=300, window=5, min_count=1, workers=8, sg=1, hs=1, negative=10)
model.save("word2vec.model")
My model has now been running for more than 30 hours. This is doubtful, since on my i5 laptop with 8 cores I'm using all 8 cores at 100% at every moment. Plus, my program seems to have read more than 100 GB of data from the disk by now. I don't know if there is anything wrong here, but the main reason for my doubt about the training is this 100 GB read from disk. The whole corpus is 34 GB, so why has my code read 100 GB of data from the disk? Does anyone know how much time it should take to train word2vec on 34 GB of text, with 8 cores of an i5 CPU running in parallel? Thank you. For more information, I'm also attaching a photo of my process from the system monitor.
I want to know why my model has read 112 GB from disk, even though my corpus is only 34 GB in total. Will my training ever finish? I'm also a bit worried about the health of my laptop, since it has been running constantly at peak capacity for the last 30 hours. It is really hot now.
Should I add any additional parameters to Word2Vec for quicker training without much performance loss?
Completing a model requires one pass over all the data to discover the vocabulary, then multiple passes, with a default of 5, to perform vector training. So, you should expect to see about 6x your data size in disk-reads, just from the model training.
(If your machine winds up needing to use virtual-memory swapping during the process, there could be more disk activity – but you absolutely do not want that to happen, as the random-access pattern of word2vec training is nearly a worst-case for virtual memory usage, which will slow training immensely.)
If you'd like to understand the code's progress, and be able to estimate its completion time, you should enable Python logging to at least the INFO level. Various steps of the process will report interim results (such as the discovered and surviving vocabulary size) and estimated progress. You can often tell if something is going wrong before the end of a run by studying the logging outputs for sensible values, and once the 'training' phase has begun the completion time will be a simple projection from the training completed so far.
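For example, a minimal way to turn that logging on before building the model (this is the standard Python logging setup, not anything gensim-specific):

import logging

logging.basicConfig(
    format="%(asctime)s : %(levelname)s : %(message)s",
    level=logging.INFO,
)
# gensim then reports the vocabulary-survey results and training progress,
# including an estimated words/sec rate and percent complete, to the console.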
I believe most laptops should throttle their own CPU if it becomes so hot as to be unsafe or to risk extreme wear on the CPU/components, but whether yours does, I can't say; definitely make sure its fans work and its vents are unobstructed.
I'd suggest you choose some small random subset of your data – maybe 1GB? – to be able to run all your steps to completion, becoming familiar with the Word2Vec logging output, resource usage, and results, and tinkering with settings to observe changes, before trying to run on your full dataset, which might require days of training time.
Some of your shown parameters aren't optimal for speedy training. In particular:
min_count=1 retains every word seen in the corpus survey, including those with only a single occurrence. This results in a much, much larger model, potentially risking a model that doesn't fit into RAM and forcing disastrous swapping. But also, words with just a few usage examples can't possibly get good word vectors, as the process requires seeing many subtly-varied alternate uses. Still, via typical 'Zipfian' word frequencies, the number of such words with just a few uses may be very large in total, so retaining all those words takes a lot of training time/effort, and even serves a bit like 'noise' making the training of other words, with plenty of usage examples, less effective. So for model size, training speed, and quality of the remaining vectors, a larger min_count is desirable. The default of min_count=5 is better for most projects than min_count=1; this is a parameter that should only really be changed if you're sure you know the effects. And when you have plentiful data, as with your 34 GB, the min_count can go much higher to keep the model size manageable.
hs=1 should only be enabled if you want to use the 'hierarchical-softmax' training mode instead of 'negative-sampling', and in that case negative=0 should also be set to disable negative-sampling. You probably don't want to use hierarchical-softmax: it's not the default for a reason, and it doesn't scale as well to larger datasets. But here you've enabled it in addition to negative-sampling, likely more than doubling the required training time.
Did you choose negative=10 because you had problems with the default negative=5? Because this non-default choice, again, would slow training noticeably. (But also, again, a non-default choice here would be more common with smaller datasets, while larger datasets like yours are more likely to experiment with a smaller negative value.)
The theme of the above observations is: "only change the defaults if you've already got something working, and you have a good theory (or way of testing) how that change might help".
With a large-enough dataset, there's another default parameter to consider changing to speed up training (& often improve word-vector quality, as well): sample, which controls how-aggressively highly-frequent words (with many redundant usage-examples) may be downsampled (randomly skipped).
The default value, sample=0.001 (aka 1e-03), is very conservative. A smaller value, such as sample=1e-05, will discard many-more of the most-frequent-words' redundant usage examples, speeding overall training considerably. (And, for a corpus of your size, you could eventually experiment with even smaller, more-aggressive values.)
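Putting those suggestions together, a hedged sketch of a revised call might look like the following; the exact min_count and sample values are guesses to tune for your corpus, not firm recommendations:

model = Word2Vec(
    sentences,
    size=300,
    window=5,
    min_count=50,      # guess: with 34 GB of text, rare words can be dropped aggressively
    sample=1e-5,       # more aggressive frequent-word downsampling than the 1e-3 default
    workers=8,
    sg=1,
    hs=0,              # stick with negative sampling only
    negative=5,        # back to the default
)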
Finally, to the extent all your data (for either a full run or a subset run) can be in an already-space-delimited text file, you can use the corpus_file alternate method of specifying the corpus. Then the Word2Vec class will use an optimized multithreaded IO approach to assign sections of the file to alternate worker threads, which, if you weren't previously seeing full saturation of all threads/CPU-cores, could increase your throughput. (I'd put this off until after trying other things, then check whether your best setup still leaves some of your 8 threads often idle.)
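As a sketch, assuming the eight sp_cor*.txt files have been concatenated into a single space-delimited file with one sentence per line (the file name here is made up), the corpus_file path is passed instead of the iterable:

model = Word2Vec(
    corpus_file="sp_cor_all.txt",  # hypothetical concatenated corpus file
    size=300,
    window=5,
    min_count=50,
    sample=1e-5,
    workers=8,
    sg=1,
)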
For the past few weeks I've been attempting to perform a fairly large clustering analysis using the HDBSCAN algorithm in Python 3.7. The data in question is roughly 4 million rows by 40 columns, around 1.5 GB in CSV format. It's a mixture of ints, bools, and floats of up to 9 digits.
During this period, each time I've been able to get the data to cluster it has taken 3+ days, which seems weird given that HDBSCAN is revered for its speed and I'm running this on a Google Cloud Compute instance with 96 CPUs. I've spent days trying to get it to utilize the cloud instance's processing power, but to no avail.
Using the auto algorithm detection in HDBSCAN, it selects boruvka_kdtree as the best algorithm to use. And I've tried passing all sorts of values to the core_dist_n_jobs parameter: -2, -1, 1, 96, multiprocessing.cpu_count(), up to 300. All seem to have a similar effect of causing 4 main Python processes to each utilize a full core while spawning many more sleeping processes.
I refuse to believe I'm doing this right and this is truly how long this takes on this hardware. I'm convinced I must be missing something like an issue where using JupyterHub on the same machine causes some sort of GIL lock, or I'm missing some parameter for HDBSCAN.
Here is my current call to HDBSCAN:
hdbscan.HDBSCAN(min_cluster_size = 100000, min_samples = 500, algorithm='best', alpha=1.0, memory=mem, core_dist_n_jobs = multiprocessing.cpu_count())
I've followed all existing issues and posts related to this issue that I could find and nothing has worked so far, but I'm always down to try even radical ideas, because this isn't even the full data I want to cluster, and at this rate it would take 4 years to cluster the full data!
According to the author:
Only the core distance computation can use all the cores, sadly that is apparently the first few seconds. The rest of the computation is quite challenging to parallelise unfortunately and will run on a single thread.
You can read the related issues at the links below:
Not using all available CPUs?
core_dist_n_jobs =1 or -1 -> no difference at all and computation time extremely high
I'm doing some Monte Carlo for a model and figured that Dask could be quite useful for this purpose. For the first 35 hours or so, things were running quite "smoothly" (apart from the fan noise giving a sense that the computer was taking off). Each model run would take about 2 seconds and there were 8 partitions running it in parallel. Activity monitor was showing 8 python3.6 instances.
However, the computer has become "silent" and CPU usage (as displayed in Spyder) hardly exceeds 20%. Model runs are happening sequentially (not in parallel) and taking about 4 seconds each. This happened today at some point while I was working on other things. I understand that, depending on the sequence of actions, Dask won't use all cores at the same time. However, in this case there is really just one task to be performed (see further below), so one could expect all partitions to run and finish more or less simultaneously. Edit: the whole setup has run successfully for 10,000 simulations in the past; the difference now is that there are nearly 500,000 simulations to run.
Edit 2: now it has shifted to doing 2 partitions in parallel (instead of the previous 1 and original 8). It appears that something is making it change how many partitions are simultaneously processed.
Edit 3: Following recommendations, I have used a dask.distributed.Client to track what is happening, and ran it for the first 400 rows. An illustration of what it looks like after completing is included below. I am struggling to understand the x-axis labels, hovering over the rectangles shows about 143 s.
Some questions therefore are:
Is there any relationship between running other software (Chrome, MS Word) and having the computer "take back" some CPU from python?
Or instead, could it be related to the fact that at some point I ran a second Spyder instance?
Or even, could the computer have somehow run out of memory? But then wouldn't the command have stopped running?
... any other possible explanation?
Is it possible to "tell" Dask to keep up the hard work and go back to using all CPU power while it is still running the original command?
Is it possible to interrupt an execution and keep whichever calculations have already been performed? I have noticed that stopping the current command doesn't seem to do much.
Is it possible to inquire on the overall progress of the computation while it is running? I would like to know how many model runs are left to have an idea of how long it would take to complete in this slow pace. I have tried using the ProgressBar in the past but it hangs on 0% until a few seconds before the end of the computations.
To be clear, uploading the model and the necessary data would be very complex. I haven't created a reproducible example either, out of fear of making the issue worse (for now the model is still running, at least...) and because, as you can probably tell by now, I have very little idea of what could be causing it, and I am not expecting anyone to be able to reproduce it. I'm aware this is not best practice and apologise in advance. However, I would really appreciate some thoughts on what could be going on and possible ways to go about it, if anyone has been through something similar before and/or has experience with Dask.
Running:
- macOS 10.13.6 (Memory: 16 GB | Processor: 2.5 GHz Intel Core i7 | 4 cores)
- Spyder 3.3.1
- dask 0.19.2
- pandas 0.23.4
Please let me know if anything needs to be made clearer.
If you believe it can be relevant, the main idea of the script is:
# Create a pandas DataFrame where each column is a parameter and each row is a possible parameter combination (cartesian product). At the end of each row, some columns to store the respective values of some objective functions are pre-allocated too.
# Generate a dask dataframe that is the DataFrame above split into 8 partitions.
# Define a function that takes a partition and, for each row:
#     - runs the model with the coefficient values defined in the row,
#     - retrieves the values of the objective functions,
#     - assigns these values to the respective (pre-allocated) columns of the current row in the partition,
#   and then returns the partition with the objective-function columns populated with the calculated values.
# map_partitions() this function over the dask dataframe.
Any thoughts?
(Screenshots of the script, showing how simple it is, and of the dashboard were included here.)
Update: The approach I took was to:
Set a large number of partitions (npartitions=nCores*200). This made it much easier to visualise the progress. I'm not sure if setting so many partitions is good practice but it worked without much of a slowdown.
Instead of trying to get a single huge pandas DataFrame at the end via .compute(), I got the dask dataframe to be written to Parquet (this way each partition was written to a separate file). Later, reading all the files into a dask dataframe and compute()-ing it to a pandas DataFrame wasn't difficult, and if something went wrong in the middle, at least I wouldn't lose the partitions that had been successfully processed and written.
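A rough sketch of that approach, where param_df, n_cores and run_model_on_partition stand in for the actual parameter DataFrame, core count and model-running function:

import dask.dataframe as dd

ddf = dd.from_pandas(param_df, npartitions=n_cores * 200)

# map_partitions applies the model-running function to each partition; each
# partition then lands in its own Parquet file, so an interrupted run keeps
# whatever has already been written.
result = ddf.map_partitions(run_model_on_partition, meta=param_df.head(0))
result.to_parquet("results_parquet/")

# Afterwards, read everything back and materialise one pandas DataFrame.
final_df = dd.read_parquet("results_parquet/").compute()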
This is what the dashboard looked like at a given point (screenshot omitted):
Dask has many diagnostic tools to help you understand what is going on inside your computation. See http://docs.dask.org/en/latest/understanding-performance.html
In particular I recommend using the distributed scheduler locally and watching the Dask dashboard to get a sense of what is going on in your computation. See http://docs.dask.org/en/latest/diagnostics-distributed.html#dashboard
This is a webpage that you can visit that will tell you exactly what is going on in all of your processors.
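Concretely, switching to the distributed scheduler on a single machine only takes a couple of lines; the dashboard is then served locally (by default at http://localhost:8787/status):

from dask.distributed import Client

client = Client()   # starts a local scheduler plus worker processes
print(client)       # the repr shows the scheduler address (and, in recent versions, the dashboard link)
# ... then run the existing map_partitions / compute code as before ...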
I've completed writing a multiclass classification algorithm that uses boosted classifiers. One of the main calculations consists of weighted least squares regression.
The main libraries I've used include:
statsmodels (for regression)
numpy (pretty much everywhere)
scikit-image (for extracting HoG features of images)
I've developed the algorithm in Python, using Anaconda's Spyder.
I now need to use the algorithm to start training classification models. So I'll be passing approximately 7000-10000 images to this algorithm, each about 50x100, all in gray scale.
Now I've been told that a powerful machine is available to speed up the training process, and they asked me, "Are you using the GPU?", along with a few other questions.
To be honest I have no experience in CUDA/GPU, etc. I've only ever heard of them. I didn't develop my code with any such thing in mind. In fact I had the (ignorant) impression that a good machine will automatically run my code faster than a mediocre one, without my having to do anything about it. (Apart from obviously writing regular code efficiently in terms of loops, O(n), etc).
Is it still possible for my code to be sped up simply by virtue of being run on a high-performance computer? Or do I need to modify it to make use of a parallel-processing machine?
The comments and Moj's answer give a lot of good advice. I have some experience on signal/image processing with python, and have banged my head against the performance wall repeatedly, and I just want to share a few thoughts about making things faster in general. Maybe these help figuring out possible solutions with slow algorithms.
Where is the time spent?
Let us assume that you have a great algorithm which is just too slow. The first step is to profile it to see where the time is spent. Sometimes the time is spent doing trivial things in a stupid way. It may be in your own code, or it may even be in the library code. For example, if you want to run a 2D Gaussian filter with a largish kernel, direct convolution is very slow, and even FFT may be slow. Approximating the filter with computationally cheap successive sliding averages may speed things up by a factor of 10 or 100 in some cases and give results which are close enough.
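A minimal way to get that first profile in Python, assuming the slow pipeline is wrapped in a function called main() (the function name is just a placeholder):

import cProfile
import pstats

cProfile.run("main()", "profile.out")           # run once, saving timing data
stats = pstats.Stats("profile.out")
stats.sort_stats("cumulative").print_stats(20)  # show the 20 most expensive call sites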
If a lot of time is spent in some module/library code, you should check if the algorithm is just a slow algorithm, or if there is something slow with the library. Python is a great programming language, but for pure number crunching operations it is not good, which means most great libraries have some binary libraries doing the heavy lifting. On the other hand, if you can find suitable libraries, the penalty for using python in signal/image processing is often negligible. Thus, rewriting the whole program in C does not usually help much.
Writing a good algorithm even in C is not always trivial, and sometimes the performance may vary a lot depending on things like CPU cache. If the data is in the CPU cache, it can be fetched very fast, if it is not, then the algorithm is much slower. This may introduce non-linear steps into the processing time depending on the data size. (Most people know this from the virtual memory swapping, where it is more visible.) Due to this it may be faster to solve 100 problems with 100 000 points than 1 problem with 10 000 000 points.
One thing to check is the precision used in the calculation. In some cases float32 is as good as float64 but much faster. In many cases there is no difference.
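A quick way to check whether precision matters for your workload is to time the same operation in both dtypes; for example, with a matrix multiply as a stand-in workload:

import time
import numpy as np

a64 = np.random.rand(4000, 4000)    # float64 by default
a32 = a64.astype(np.float32)        # half the memory footprint

for name, a in (("float64", a64), ("float32", a32)):
    t0 = time.perf_counter()
    np.dot(a, a)                    # stand-in for the real number crunching
    print(name, round(time.perf_counter() - t0, 3), "s")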
Multi-threading
Python - did I mention? - is a great programming language, but one of its shortcomings is that in its basic form it runs a single thread. So, no matter how many cores you have in your system, the wall clock time is always the same. The result is that one of the cores is at 100 %, and the others spend their time idling. Making things parallel and having multiple threads may improve your performance by a factor of, e.g., 3 in a 4-core machine.
It is usually a very good idea if you can split your problem into small independent parts. It helps with many performance bottlenecks.
And do not expect technology to come to the rescue. If the code is not written to be parallel, it is very difficult for a machine to make it parallel.
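In Python, the usual way around the single-thread limitation is to use multiple processes. A minimal sketch with the standard library, where the per-chunk work is just a placeholder for the real computation:

from multiprocessing import Pool

def process_chunk(chunk):
    # placeholder for the real per-chunk computation
    return sum(x * x for x in chunk)

if __name__ == "__main__":
    chunks = [range(i, i + 1000000) for i in range(0, 4000000, 1000000)]
    with Pool(processes=4) as pool:          # e.g. one worker per core
        results = pool.map(process_chunk, chunks)
    print(sum(results))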
GPUs
Your machine may have a great GPU with maybe 1536 number-hungry cores ready to crunch everything you toss at them. The bad news is that making GPU code is a bit different from writing CPU code. There are some slightly generic APIs around (CUDA, OpenCL), but if you are not accustomed to writing parallel code for GPUs, prepare for a steepish learning curve. On the other hand, it is likely someone has already written the library you need, and then you only need to hook to that.
With GPUs the sheer number-crunching power is impressive, almost frightening. We may talk about 3 TFLOPS (3 x 10^12 single-precision floating-point ops per second). The problem there is how to get the data to the GPU cores, because the memory bandwidth will become the limiting factor. This means that even though using GPUs is a good idea in many cases, there are a lot of cases where there is no gain.
Typically, if you are performing a lot of local operations on the image, the operations are easy to make parallel and they fit a GPU well. If you are doing global operations, the situation is a bit more complicated. An FFT requires information from all over the image, so the standard algorithm does not work well with GPUs. (There are GPU-based algorithms for FFTs, and they sometimes make things much faster.)
Also, beware that making your algorithms run on a GPU binds you to that GPU. The portability of your code across OSes or machines suffers.
Buy some performance
Also, one important thing to consider is if you need to run your algorithm once, once in a while, or in real time. Sometimes the solution is as easy as buying time from a larger computer. For a dollar or two an hour you can buy time from quite fast machines with a lot of resources. It is simpler and often cheaper than you would think. Also GPU capacity can be bought easily for a similar price.
One possibly slightly under-advertised property of some cloud services is that in some cases the IO speed of the virtual machines is extremely good compared to physical machines. The difference comes from the fact that there are no spinning platters with the average penalty of half-revolution per data seek. This may be important with data-intensive applications, especially if you work with a large number of files and access them in a non-linear way.
I am afraid you cannot speed up your program just by running it on a powerful computer. I had this issue a while back. I first used Python (very slow), then moved to C (slow), and then had to use other tricks and techniques. For example, it is sometimes possible to apply some dimensionality reduction to speed things up while still getting reasonably accurate results, or, as you mentioned, to use multiprocessing techniques.
Since you are dealing with an image processing problem, you do a lot of matrix operations, and a GPU would surely be a great help. There are some nice and active CUDA wrappers in Python that you can easily use without knowing much CUDA. I tried Theano, PyCUDA and scikit-cuda (there should be more than that by now).
I just took my first baby step into real scientific computing today when I was shown a data set where the smallest file is 48000 fields by 1600 rows (haplotypes for several people, for chromosome 22). And this is considered tiny.
I write Python, so I've spent the last few hours reading about HDF5, and NumPy, and PyTables, but I still feel like I'm not really grokking what a terabyte-sized data set actually means for me as a programmer.
For example, someone pointed out that with larger data sets, it becomes impossible to read the whole thing into memory, not because the machine has insufficient RAM, but because the architecture has insufficient address space! It blew my mind.
What other assumptions have I been relying on in the classroom that just don't work with input this big? What kinds of things do I need to start doing or thinking about differently? (This doesn't have to be Python specific.)
I'm currently engaged in high-performance computing in a small corner of the oil industry and regularly work with datasets of the orders of magnitude you are concerned about. Here are some points to consider:
Databases don't have a lot of traction in this domain. Almost all our data is kept in files, some of those files are based on tape file formats designed in the 70s. I think that part of the reason for the non-use of databases is historic; 10, even 5, years ago I think that Oracle and its kin just weren't up to the task of managing single datasets of O(TB) let alone a database of 1000s of such datasets.
Another reason is a conceptual mismatch between the normalisation rules for effective database analysis and design and the nature of scientific data sets.
I think (though I'm not sure) that the performance reason(s) are much less persuasive today. And the concept-mismatch reason is probably also less pressing now that most of the major databases available can cope with spatial data sets which are generally a much closer conceptual fit to other scientific datasets. I have seen an increasing use of databases for storing meta-data, with some sort of reference, then, to the file(s) containing the sensor data.
However, I'd still be looking at, in fact am looking at, HDF5. It has a couple of attractions for me (a) it's just another file format so I don't have to install a DBMS and wrestle with its complexities, and (b) with the right hardware I can read/write an HDF5 file in parallel. (Yes, I know that I can read and write databases in parallel too).
Which takes me to the second point: when dealing with very large datasets you really need to be thinking of using parallel computation. I work mostly in Fortran, one of its strengths is its array syntax which fits very well onto a lot of scientific computing; another is the good support for parallelisation available. I believe that Python has all sorts of parallelisation support too so it's probably not a bad choice for you.
Sure you can add parallelism on to sequential systems, but it's much better to start out designing for parallelism. To take just one example: the best sequential algorithm for a problem is very often not the best candidate for parallelisation. You might be better off using a different algorithm, one which scales better on multiple processors. Which leads neatly to the next point.
I think also that you may have to come to terms with surrendering any attachments you have (if you have them) to lots of clever algorithms and data structures which work well when all your data is resident in memory. Very often trying to adapt them to the situation where you can't get the data into memory all at once, is much harder (and less performant) than brute-force and regarding the entire file as one large array.
Performance starts to matter in a serious way, both the execution performance of programs and developer performance. It's not that a 1 TB dataset requires 10 times as much code as a 1 GB dataset so you have to work faster; it's that some of the ideas you will need to implement will be crazily complex, and probably have to be written by domain specialists, i.e. the scientists you are working with. Here the domain specialists write in Matlab.
But this is going on too long; I'd better get back to work.
In a nutshell, the main differences IMO:
- You should know beforehand what your likely bottleneck will be (I/O or CPU) and focus on the best algorithm and infrastructure to address this. I/O quite frequently is the bottleneck.
- Choice and fine-tuning of an algorithm often dominates any other choice made. Even modest changes to algorithms and access patterns can impact performance by orders of magnitude. You will be micro-optimizing a lot. The "best" solution will be system-dependent.
- Talk to your colleagues and other scientists to profit from their experiences with these data sets. A lot of tricks cannot be found in textbooks.
- Pre-computing and storing can be extremely successful.
Bandwidth and I/O
Initially, bandwidth and I/O often is the bottleneck. To give you a perspective: at the theoretical limit for SATA 3, it takes about 30 minutes to read 1 TB. If you need random access, read several times or write, you want to do this in memory most of the time or need something substantially faster (e.g. iSCSI with InfiniBand). Your system should ideally be able to do parallel I/O to get as close as possible to the theoretical limit of whichever interface you are using. For example, simply accessing different files in parallel in different processes, or HDF5 on top of MPI-2 I/O is pretty common. Ideally, you also do computation and I/O in parallel so that one of the two is "for free".
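As one concrete, hedged illustration of avoiding whole-file reads, an HDF5 dataset can be streamed in slices with h5py; the file and dataset names here are made up:

import h5py

chunk_rows = 1000000
total = 0.0
with h5py.File("measurements.h5", "r") as f:
    dset = f["sensor_data"]                      # e.g. shape (n_rows, n_channels)
    for start in range(0, dset.shape[0], chunk_rows):
        block = dset[start:start + chunk_rows]   # only this slice is read from disk
        total += block.sum()
print(total)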
Clusters
Depending on your case, either I/O or CPU might then be the bottleneck. No matter which one it is, huge performance increases can be achieved with clusters if you can effectively distribute your tasks (for example with MapReduce). This might require totally different algorithms than the typical textbook examples. Spending development time here is often the best time spent.
Algorithms
In choosing between algorithms, big O of an algorithm is very important, but algorithms with similar big O can be dramatically different in performance depending on locality. The less local an algorithm is (i.e. the more cache misses and main memory misses), the worse the performance will be - access to storage is usually an order of magnitude slower than main memory. Classical examples for improvements would be tiling for matrix multiplications or loop interchange.
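A small Python illustration of the locality point: summing a large C-ordered NumPy array row by row (contiguous access) versus column by column (strided access) can differ noticeably in wall time, even though the big-O cost is identical:

import time
import numpy as np

a = np.random.rand(5000, 5000)   # C order: rows are contiguous in memory

def sum_by_rows(x):
    return sum(x[i, :].sum() for i in range(x.shape[0]))   # cache-friendly

def sum_by_cols(x):
    return sum(x[:, j].sum() for j in range(x.shape[1]))   # strided, more cache misses

for fn in (sum_by_rows, sum_by_cols):
    t0 = time.perf_counter()
    fn(a)
    print(fn.__name__, round(time.perf_counter() - t0, 3), "s")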
Computer, Language, Specialized Tools
If your bottleneck is I/O, this means that algorithms for large data sets can benefit from more main memory (e.g. 64 bit) or programming languages / data structures with less memory consumption (e.g., in Python __slots__ might be useful), because more memory might mean less I/O per CPU time. BTW, systems with TBs of main memory are not unheard of (e.g. HP Superdomes).
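For instance, a hedged sketch of the __slots__ idea mentioned above; the class is hypothetical, but with millions of instances the per-object saving from dropping the instance __dict__ adds up:

class Sample(object):
    __slots__ = ("position", "value", "quality")  # no per-instance __dict__

    def __init__(self, position, value, quality):
        self.position = position
        self.value = value
        self.quality = quality

samples = [Sample(i, i * 0.1, 1) for i in range(1000000)]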
Similarly, if your bottleneck is the CPU, faster machines, languages and compilers that allow you to use special features of an architecture (e.g. SIMD like SSE) might increase performance by an order of magnitude.
The way you find and access data, and store meta information, can be very important for performance. You will often use flat files or domain-specific non-standard packages to store data (e.g. not a relational db directly) that enable you to access data more efficiently. For example, kdb+ is a specialized database for large time series, and ROOT uses a TTree object to access data efficiently. The PyTables you mention would be another example.
While some languages have naturally lower memory overhead in their types than others, that really doesn't matter for data this size - you're not holding your entire data set in memory regardless of the language you're using, so the "expense" of Python is irrelevant here. As you pointed out, there simply isn't enough address space to even reference all this data, let alone hold onto it.
What this normally means is either a) storing your data in a database, or b) adding resources in the form of additional computers, thus adding to your available address space and memory. Realistically you're going to end up doing both of these things. One key thing to keep in mind when using a database is that a database isn't just a place to put your data while you're not using it - you can do WORK in the database, and you should try to do so. The database technology you use has a large impact on the kind of work you can do, but an SQL database, for example, is well suited to do a lot of set math and do it efficiently (of course, this means that schema design becomes a very important part of your overall architecture). Don't just suck data out and manipulate it only in memory - try to leverage the computational query capabilities of your database to do as much work as possible before you ever put the data in memory in your process.
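A tiny sketch of that "do the work in the database" idea, using sqlite3 purely as an illustration; the table and column names are hypothetical:

import sqlite3

conn = sqlite3.connect("measurements.db")
# Let the database aggregate millions of rows; only the small summary
# ever comes back into Python's memory.
rows = conn.execute(
    "SELECT sensor_id, AVG(value), COUNT(*) FROM readings GROUP BY sensor_id"
).fetchall()
conn.close()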
The main assumptions are about the amount of CPU/cache/RAM/storage/bandwidth you can have in a single machine at an acceptable price. There are lots of answers here on Stack Overflow still based on the old assumptions of a 32-bit machine with 4 GB of RAM, about a terabyte of storage, and a 1 Gb network. With 16 GB DDR3 RAM modules at 220 EUR, machines with 512 GB of RAM and 48 cores can be built at reasonable prices. The switch from hard disks to SSDs is another important change.