.persist() line sometimes leads to Java Out of Heap Space error - python

As far as I know, when you use .persist(), writing the line persist sets only the persistence level, and then the next action in the script will cause the actual persistence work to be invoked.
However, sometimes, seemingly depending on the dataframe, persist() will lead to a Java out of heap space error.
What is the intended behavior of persist, and why could this simple line actually lead to this memory error?

The whole point of RDD Persistence is to store intermediate results in memory, allowing faster access on subsequent use. There are several different levels of persistence, ranging from MEMORY_ONLY (the default), over MEMORY_AND_DISK, up to DISK_ONLY. Persisting purely to memory means that there has to be enough heap space for the persist to work. If you run out of heap memory you can
go for a lower persistence level,
reduce the size of your partitions,
reduce the total number of persisted RDD stages, e.g., by unpersisting them.
Finding the right balance is one of the key challenges in Spark to achieve a good tradeoff between memory and CPU usage.

Related

Numpy's memmap acting strangely?

I am dealing with large numpy arrays and I am trying out memmap as it could help.
big_matrix = np.memmap(parameters.big_matrix_path, dtype=np.float16, mode='w+', shape=(1000000, 1000000)
The above works fine and it creates a file on my hard drive of about 140GB.
1000000 is just a random number I used - not the one I am actually using.
I want to fill the matrix with values. Currently it is just set to zero.
for i in tqdm(range(len(big_matrix))):
modified_row = get_row(i)
big_matrix[i, :] = modified_row
At this point now, I have a big_matrix filled with the values I want.
The problem is that from this point on I can't operate on this memmap.
For example I want to multiply column wise (broadcast).
I run this:
big_matrix * weights[:, np.newaxis]
Where weights has the same length.
It just hangs and throws and out of memory error as my RAM and SWAP is all used.
My understanding was that the memmap will keep everything on the hard drive.
For example save the results directly there.
So I tried this then:
for i in tqdm(range(big_matrix.shape[1])):
temp = big_matrix[:, i].tolist()
temp = np.array(temp) * weights
The above loads only 1 column in memory, and multiply that with the weights.
Then I will save that column back in big_matrix.
But even with 1 column my program hangs. The only difference here is that the RAM is not used up.
At this point I am thinking of switching to sqlite.
I wanted to get some insights why my code is not working?
Do I need to flush the memmap everytime I change it ?
np.memmap map a part of the virtual memory to the storage device space here. The OS is free to preload pages and cache them for a fast reuse. The memory is generally not flushed unless it is reclaimed (eg. by another process or the same process). When this happen, the OS typically (partially) flush data to the storage device and (partially) free the physical memory used for the mapping. That being said, this behaviour is dependent of the actual OS. It work that way on Windows. On Linux, you can use madvise to tune this behaviour but madvise is a low-level C function not yet supported by Numpy (though it is apparently supported for Python, see this issue for more information). Actually, Numpy does not even support closing the memmaped space (which is leaky). The solution is generally to flush data manually not to lose it. There are alternative solutions but none of them is great yet.
big_matrix * weights[:, np.newaxis]
It just hangs and throws and out of memory error as my RAM and SWAP is all used
This is normal since Numpy creates a new temporary array stored in RAM. There is no way to tell to Numpy to store temporary array in on the storage device. That being said, you can tell to Numpy where the output data is stored using the out parameter on some function (eg. np.multiply supports it). The output array can be created using memmap so not to use too much memory (regarding the behaviour of the OS).
But even with 1 column my program hangs. The only difference here is that the RAM is not used up.
This is also expected, especially if you use a HDD and not and SSD. Indeed, the array is stored (virtually) contiguously on the storage device. big_matrix[:, i] has to fetch data with a huge stride. For each item, with a size of only 2 bytes, the OS will perform an IO request to the storage device. Storage devices are optimized for contiguous reads so fetches are buffered and each IO request has a pretty significant latency. In practice, the OS will generally at least fetch a page (typically 4096 bytes, that is 512 times more than what is actually needed). Moreover, there is a limit of the number of IO requests that can be completed per second. HDDs can typically do about 20-200 IO requests per seconds while the fastest Nvme SSDs reach 100_000-600_000 UI requests per seconds. Note that the cache help not not reload data for the next column unless there are too many loaded pages and the OS has to flush them. Reading a matrix of size (1_000_000,1_000_000) causes up to 1_000_000*1_000_000=1_000_000_000_000 fetch, which is horribly inefficient. The cache could reduce this by a large margin, but operating simultaneously on 1_000_000 pages is also horribly inefficient since the processor cannot do that (due to a limited number of entries in the TLB). This will typically results in TLB misses, that is expensive kernel calls for each item to be read. Because a kernel call typically take (at least) about ~1 us on mainstream PC, this means more than a week to to the whole computation.
If you want to efficiently read columns, then you need to read large chunk of columns. For example, you certainly need at least several hundred of columns to be read even on a fast Nvme SSD. For HDD, it is at least several dozens of thousand columns to get a proper throughput. This means you certainly cannot read the full columns efficiently due to the high amount of requested RAM. Using another data layout (tile + transposed data) is critical in this case.

Over-high memory usage during reading parquet in Python

I have a parquet file at around 10+GB, with columns are mainly strings. When loading it into the memory, the memory usage can peak to 110G, while after it's finished the memory usage is reduced back to around 40G.
I'm working on a high-performance computer with allocated memory so I do have access to large memory. However, it seems a waste to me that I have to apply for a 128G memory just for loading data, after that 64G is sufficient for me. Also, 128G memory is more often to be out of order.
My naive conjecture is that the Python interpreter mistreated the 512G physical memory on the HPC as the total available memory, so it does not do garbage collection as often as actually needed. For example, when I load the data with 64G memory, it never threw me a MemoryError but the kernel is directly killed and restarted.
I was wondering whether the over-high usage of memory when loading is a regular behavior of pyarrow, or it is due to the special setting of my environment. If the latter, then is it possible to somehow limit the available memory during loading?
We fixed a memory use bug that's present in 0.14.0/0.14.1 (which is probably what you're using right now).
https://issues.apache.org/jira/browse/ARROW-6060
We also are introducing an option to read string columns as categorical (aka DictionaryArray in Arrow parlance) which also will reduce memory usage. See https://issues.apache.org/jira/browse/ARROW-3325 and discussion in
https://ursalabs.org/blog/2019-06-07-monthly-report/

Will a Python list ever shrink? (with pop() operations)

I have been using .pop() and .append() extensively for Leetcode-style programming problems, especially in cases where you have to accumulate palindromes, subsets, permutations, etc.
Would I get a substantial performance gain from migrating to using a fixed size list instead? My concern is that internally the python list reallocates to a smaller internal array when I execute a bunch of pops, and then has to "allocate up" again when I append.
I know that the amortized time complexity of append and pop is O(1), but I want to get better performance if I can.
Yes.
Python (at least the CPython implementation) uses magic under the hood to make lists as efficient as possible. According to this blog post (2011), calls to append and pop will dynamically allocate and deallocate memory in chunks (overallocating where necessary) for efficiency. The list will only deallocate memory if it shrinks below the chunk size. So, for most cases if you are doing a lot of appends and pops, no memory allocation/deallocation will be performed.
Basically the idea with these high level languages is that you should be able to use the data structure most suited to your use case and the interpreter will ensure that you don't have to worry about the background workings. (eg. avoid micro-optimisation and instead focus on the efficiency of the algorithms in general) If you're that worried about performance, I'd suggest using a language where you have more control over the memory, like C/C++ or Rust.
Python guarantees O(1) complexity for append and pops as you noted, so it sounds like it will be perfectly suited for your case. If you wanted to use it like a queue and using things like list.pop(1) or list.insert(0, obj) which are slower, then you could look into a dedicated queue data structure, for example.

faster numpy array copy; multi-threaded memcpy?

Suppose we have two large numpy arrays of the same data type and shape, of size on the order of GB's. What is the fastest way to copy all the values from one into the other?
When I do this using normal notation, e.g. A[:] = B, I see exactly one core on the computer at maximum effort doing the copy for several seconds, while the others are idle. When I launch multiple workers using multiprocessing and have them each copy a distinct slice into the destination array, such that all the data is copied, using multiple workers is faster. This is true regardless of whether the destination array is a shared memory array or one that becomes local to the worker. I can get a 5-10x speedup in some tests on a machine with many cores. As I add more workers, the speed does eventually level off and even slow down, so I think this achieves being memory-performance bound.
I'm not suggesting using multiprocessing for this problem; it was merely to demonstrate the possibility of better hardware utilization.
Does there exist a python interface to some multi-threaded C/C++ memcpy tool?
Update (03 May 2017)
When it is possible, using multiple python processes to move data can give major speedup. I have a scenario in which I already have several small shared memory buffers getting written to by worker processes. Whenever one fills up, the master process collects this data and copies it into a master buffer. But it is much faster to have the master only select the location in the master buffer, and assign a recording worker to actually do the copying (from a large set of recording processes standing by). On my particular computer, several GB can be moved in a small fraction of a second by concurrent workers, as opposed to several seconds by a single process.
Still, this sort of setup is not always (or even usually?) possible, so it would be great to have a single python process able to drop into a multi-threaded memcpy routine...
If you are certain that the types/memory layout of both arrays are identical, this might give you a speedup: memoryview(A)[:] = memoryview(B) This should be using memcpy directly and skips any checks for numpy broadcasting or type conversion rules.

what changes when your input is giga/terabyte sized?

I just took my first baby step today into real scientific computing today when I was shown a data set where the smallest file is 48000 fields by 1600 rows (haplotypes for several people, for chromosome 22). And this is considered tiny.
I write Python, so I've spent the last few hours reading about HDF5, and Numpy, and PyTable, but I still feel like I'm not really grokking what a terabyte-sized data set actually means for me as a programmer.
For example, someone pointed out that with larger data sets, it becomes impossible to read the whole thing into memory, not because the machine has insufficient RAM, but because the architecture has insufficient address space! It blew my mind.
What other assumptions have I been relying in the classroom that just don't work with input this big? What kinds of things do I need to start doing or thinking about differently? (This doesn't have to be Python specific.)
I'm currently engaged in high-performance computing in a small corner of the oil industry and regularly work with datasets of the orders of magnitude you are concerned about. Here are some points to consider:
Databases don't have a lot of traction in this domain. Almost all our data is kept in files, some of those files are based on tape file formats designed in the 70s. I think that part of the reason for the non-use of databases is historic; 10, even 5, years ago I think that Oracle and its kin just weren't up to the task of managing single datasets of O(TB) let alone a database of 1000s of such datasets.
Another reason is a conceptual mismatch between the normalisation rules for effective database analysis and design and the nature of scientific data sets.
I think (though I'm not sure) that the performance reason(s) are much less persuasive today. And the concept-mismatch reason is probably also less pressing now that most of the major databases available can cope with spatial data sets which are generally a much closer conceptual fit to other scientific datasets. I have seen an increasing use of databases for storing meta-data, with some sort of reference, then, to the file(s) containing the sensor data.
However, I'd still be looking at, in fact am looking at, HDF5. It has a couple of attractions for me (a) it's just another file format so I don't have to install a DBMS and wrestle with its complexities, and (b) with the right hardware I can read/write an HDF5 file in parallel. (Yes, I know that I can read and write databases in parallel too).
Which takes me to the second point: when dealing with very large datasets you really need to be thinking of using parallel computation. I work mostly in Fortran, one of its strengths is its array syntax which fits very well onto a lot of scientific computing; another is the good support for parallelisation available. I believe that Python has all sorts of parallelisation support too so it's probably not a bad choice for you.
Sure you can add parallelism on to sequential systems, but it's much better to start out designing for parallelism. To take just one example: the best sequential algorithm for a problem is very often not the best candidate for parallelisation. You might be better off using a different algorithm, one which scales better on multiple processors. Which leads neatly to the next point.
I think also that you may have to come to terms with surrendering any attachments you have (if you have them) to lots of clever algorithms and data structures which work well when all your data is resident in memory. Very often trying to adapt them to the situation where you can't get the data into memory all at once, is much harder (and less performant) than brute-force and regarding the entire file as one large array.
Performance starts to matter in a serious way, both the execution performance of programs, and developer performance. It's not that a 1TB dataset requires 10 times as much code as a 1GB dataset so you have to work faster, it's that some of the ideas that you will need to implement will be crazily complex, and probably have to be written by domain specialists, ie the scientists you are working with. Here the domain specialists write in Matlab.
But this is going on too long, I'd better get back to work
In a nutshell, the main differences IMO:
You should know beforehand what your likely
bottleneck will be (I/O or CPU) and focus on the best algorithm and infrastructure
to address this. I/O quite frequently is the bottleneck.
Choice and fine-tuning of an algorithm often dominates any other choice made.
Even modest changes to algorithms and access patterns can impact performance by
orders of magnitude. You will be micro-optimizing a lot. The "best" solution will be
system-dependent.
Talk to your colleagues and other scientists to profit from their experiences with these
data sets. A lot of tricks cannot be found in textbooks.
Pre-computing and storing can be extremely successful.
Bandwidth and I/O
Initially, bandwidth and I/O often is the bottleneck. To give you a perspective: at the theoretical limit for SATA 3, it takes about 30 minutes to read 1 TB. If you need random access, read several times or write, you want to do this in memory most of the time or need something substantially faster (e.g. iSCSI with InfiniBand). Your system should ideally be able to do parallel I/O to get as close as possible to the theoretical limit of whichever interface you are using. For example, simply accessing different files in parallel in different processes, or HDF5 on top of MPI-2 I/O is pretty common. Ideally, you also do computation and I/O in parallel so that one of the two is "for free".
Clusters
Depending on your case, either I/O or CPU might than be the bottleneck. No matter which one it is, huge performance increases can be achieved with clusters if you can effectively distribute your tasks (example MapReduce). This might require totally different algorithms than the typical textbook examples. Spending development time here is often the best time spent.
Algorithms
In choosing between algorithms, big O of an algorithm is very important, but algorithms with similar big O can be dramatically different in performance depending on locality. The less local an algorithm is (i.e. the more cache misses and main memory misses), the worse the performance will be - access to storage is usually an order of magnitude slower than main memory. Classical examples for improvements would be tiling for matrix multiplications or loop interchange.
Computer, Language, Specialized Tools
If your bottleneck is I/O, this means that algorithms for large data sets can benefit from more main memory (e.g. 64 bit) or programming languages / data structures with less memory consumption (e.g., in Python __slots__ might be useful), because more memory might mean less I/O per CPU time. BTW, systems with TBs of main memory are not unheard of (e.g. HP Superdomes).
Similarly, if your bottleneck is the CPU, faster machines, languages and compilers that allow you to use special features of an architecture (e.g. SIMD like SSE) might increase performance by an order of magnitude.
The way you find and access data, and store meta information can be very important for performance. You will often use flat files or domain-specific non-standard packages to store data (e.g. not a relational db directly) that enable you to access data more efficiently. For example, kdb+ is a specialized database for large time series, and ROOT uses a TTree object to access data efficiently. The pyTables you mention would be another example.
While some languages have naturally lower memory overhead in their types than others, that really doesn't matter for data this size - you're not holding your entire data set in memory regardless of the language you're using, so the "expense" of Python is irrelevant here. As you pointed out, there simply isn't enough address space to even reference all this data, let alone hold onto it.
What this normally means is either a) storing your data in a database, or b) adding resources in the form of additional computers, thus adding to your available address space and memory. Realistically you're going to end up doing both of these things. One key thing to keep in mind when using a database is that a database isn't just a place to put your data while you're not using it - you can do WORK in the database, and you should try to do so. The database technology you use has a large impact on the kind of work you can do, but an SQL database, for example, is well suited to do a lot of set math and do it efficiently (of course, this means that schema design becomes a very important part of your overall architecture). Don't just suck data out and manipulate it only in memory - try to leverage the computational query capabilities of your database to do as much work as possible before you ever put the data in memory in your process.
The main assumptions are about the amount of cpu/cache/ram/storage/bandwidth you can have in a single machine at an acceptable price. There are lots of answers here at stackoverflow still based on the old assumptions of a 32 bit machine with 4G ram and about a terabyte of storage and 1Gb network. With 16GB DDR-3 ram modules at 220 Eur, 512 GB ram, 48 core machines can be build at reasonable prices. The switch from hard disks to SSD is another important change.

Categories