I'm currently working on refactoring some legacy analytics into Python/Dask to show the efficacy of this as a solution going forward.
I'm trying to set up a demo scenario and am having problems with memory and would like some advice.
My scenario: my data is split into 52 gzip-compressed Parquet files on S3. Each one, uncompressed in memory, is around 100MB, giving a total dataset size of ~5.5GB and exactly 100,000,000 rows.
My scheduler is on a T2.Medium (4GB/2vCPUs) as are my 4 workers.
Each worker is being run with 1 process, 1 thread and a memory limit of 4GB, i.e. dask-worker MYADDRESS --nprocs 1 --nthreads=1 --memory-limit=4GB.
Now, I'm pulling the parquet files and immediately repartitioning on a column in such a way that I end up with roughly 480 partitions each of ~11MB.
Then I'm using map_partitions to do the main body of work.
This works fine for small datasets, however for the 100 mil dataset, my workers keep crashing due to not having enough memory.
What am I doing wrong here?
For implementation-specific info, the function I'm passing to map_partitions can sometimes need roughly 1GB, due to what is essentially a cross join on the partition dataframe.
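Roughly, the flow looks like this (an illustrative sketch; the column name, the exact repartitioning call and the cross-join body are placeholders rather than my actual code):

import dask.dataframe as dd

# Read the 52 Parquet files from S3 and repartition on a key column so that
# we end up with ~480 partitions of ~11MB each.
ddf = dd.read_parquet("s3://my-bucket/dataset/*.parquet")
ddf = ddf.set_index("key_column").repartition(npartitions=480)

def per_partition_work(partition):
    # Placeholder for the real work: essentially a cross join involving the
    # partition dataframe, which can temporarily need ~1GB of memory.
    return partition.merge(partition, how="cross")

result = ddf.map_partitions(per_partition_work)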
Am I not understanding something about Dask's architecture? Between my scheduler and my 4 workers there is 20GB of memory to work with, yet this is proving not to be enough.
What I've read in the Dask documentation is that, as long as each partition, and what you do with that partition, fits in the worker's memory, you should be OK?
Is 4GB just not enough? Does it need way more to handle scheduler/inter-process communication overheads?
Thanks for reading.
See https://docs.dask.org/en/latest/best-practices.html#avoid-very-large-partitions
I'll copy the text here for convenience
Your chunks of data should be small enough so that many of them fit in a worker’s available memory at once. You often control this when you select partition size in Dask DataFrame or chunk size in Dask Array.
Dask will likely manipulate as many chunks in parallel on one machine as you have cores on that machine. So if you have 1 GB chunks and ten cores, then Dask is likely to use at least 10 GB of memory. Additionally, it’s common for Dask to have 2-3 times as many chunks available to work on so that it always has something to work on.
If you have a machine with 100 GB and 10 cores, then you might want to choose chunks in the 1GB range. You have space for ten chunks per core which gives Dask a healthy margin, without having tasks that are too small.
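Applied to your setup (a sketch, not part of the quoted docs): with 4GB single-threaded workers, partitions of roughly 100MB leave room for the 2-3 chunks Dask keeps in flight per worker plus the task's own working memory. Partition size can be set explicitly when repartitioning, for example:

import dask.dataframe as dd

# Target ~100MB partitions so a 4GB, single-threaded worker can hold a few of
# them at once alongside the task's own working memory.
ddf = dd.read_parquet("s3://my-bucket/dataset/*.parquet")
ddf = ddf.repartition(partition_size="100MB")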
Related
I am dealing with large numpy arrays and I am trying out memmap as it could help.
import numpy as np

big_matrix = np.memmap(parameters.big_matrix_path, dtype=np.float16, mode='w+', shape=(1000000, 1000000))
The above works fine and it creates a file on my hard drive of about 140GB.
1000000 is just a random number I used - not the one I am actually using.
I want to fill the matrix with values. Currently it is just set to zero.
from tqdm import tqdm

# Fill the memmapped matrix one row at a time.
for i in tqdm(range(len(big_matrix))):
    modified_row = get_row(i)   # get_row builds one row of values
    big_matrix[i, :] = modified_row
At this point now, I have a big_matrix filled with the values I want.
The problem is that from this point on I can't operate on this memmap.
For example I want to multiply column wise (broadcast).
I run this:
big_matrix * weights[:, np.newaxis]
Where weights has the same length.
It just hangs and throws an out-of-memory error as my RAM and swap are all used.
My understanding was that the memmap will keep everything on the hard drive.
For example save the results directly there.
So I tried this then:
for i in tqdm(range(big_matrix.shape[1])):
    temp = big_matrix[:, i].tolist()
    temp = np.array(temp) * weights
The above loads only one column into memory and multiplies it by the weights.
Then I will save that column back in big_matrix.
But even with 1 column my program hangs. The only difference here is that the RAM is not used up.
At this point I am thinking of switching to sqlite.
I wanted to get some insight into why my code is not working.
Do I need to flush the memmap every time I change it?
np.memmap maps a part of the virtual memory to the storage device space here. The OS is free to preload pages and cache them for fast reuse. The memory is generally not flushed unless it is reclaimed (e.g. by another process or the same process). When this happens, the OS typically (partially) flushes data to the storage device and (partially) frees the physical memory used for the mapping. That being said, this behaviour depends on the actual OS: it works that way on Windows; on Linux, you can use madvise to tune this behaviour, but madvise is a low-level C function not yet supported by Numpy (though it is apparently supported for Python, see this issue for more information). In fact, Numpy does not even support closing the memmapped space (which is leaky). The solution is generally to flush data manually so as not to lose it. There are alternative solutions, but none of them is great yet.
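For example (a minimal sketch), after filling the array you can force the pending changes to be written back to the backing file:

# Explicitly write dirty pages back to the file backing the memmap.
big_matrix.flush()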
big_matrix * weights[:, np.newaxis]
It just hangs and throws an out-of-memory error as my RAM and swap are all used.
This is normal, since Numpy creates a new temporary array stored in RAM. There is no way to tell Numpy to store temporary arrays on the storage device. That being said, you can tell Numpy where the output data is stored using the out parameter of some functions (e.g. np.multiply supports it). The output array can be created with memmap so as not to use too much memory (subject to the behaviour of the OS).
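A minimal sketch of that approach (the output file name is a placeholder):

# Write the product directly into another memmapped array instead of letting
# Numpy allocate a temporary result in RAM.
out = np.memmap("product.dat", dtype=np.float16, mode="w+", shape=big_matrix.shape)
np.multiply(big_matrix, weights[:, np.newaxis], out=out)
out.flush()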
But even with 1 column my program hangs. The only difference here is that the RAM is not used up.
This is also expected, especially if you use an HDD and not an SSD. The array is stored (virtually) contiguously on the storage device, so big_matrix[:, i] has to fetch data with a huge stride. For each item, which is only 2 bytes, the OS performs an IO request to the storage device. Storage devices are optimized for contiguous reads, so fetches are buffered and each IO request has a pretty significant latency. In practice, the OS will generally fetch at least a page (typically 4096 bytes, that is 512 times more than what is actually needed). Moreover, there is a limit on the number of IO requests that can be completed per second: HDDs can typically do about 20-200 IO requests per second, while the fastest NVMe SSDs reach 100_000-600_000 IO requests per second.

Note that the cache helps avoid reloading data for the next column, unless there are too many loaded pages and the OS has to flush them. Reading a matrix of size (1_000_000, 1_000_000) causes up to 1_000_000*1_000_000 = 1_000_000_000_000 fetches, which is horribly inefficient. The cache could reduce this by a large margin, but operating simultaneously on 1_000_000 pages is also horribly inefficient, since the processor cannot do that (due to the limited number of entries in the TLB). This typically results in TLB misses, that is, expensive kernel calls for each item to be read. Because a kernel call typically takes (at least) about ~1 us on a mainstream PC, the whole computation would take more than a week.
If you want to read columns efficiently, then you need to read large chunks of columns. For example, you certainly need at least several hundred columns to be read at once, even on a fast NVMe SSD. For an HDD, it is at least several tens of thousands of columns to get a proper throughput. This means you certainly cannot read full columns efficiently due to the high amount of RAM required. Using another data layout (tiling + transposed data) is critical in this case.
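Since the operation here broadcasts weights along the rows, one workable sketch (the block size is an arbitrary placeholder, not a recommendation) is to process contiguous blocks of rows instead of individual columns:

block = 4096  # arbitrary block size; tune it to the available RAM
for start in range(0, big_matrix.shape[0], block):
    stop = start + block
    # In-place multiply on a contiguous block of rows: only `block` rows are
    # resident at a time, and the file accesses are sequential.
    big_matrix[start:stop] *= weights[start:stop, np.newaxis]
big_matrix.flush()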
I have a large set of Parquet files that I am trying to sort on a column. Uncompressed, the data is around ~14GB, so Dask seemed like the right tool for the job. All I'm doing with Dask is:
Reading the parquet files
Sorting on one of the columns (called "friend")
Writing as parquet files in a separate directory
I can't do this without the Dask process (there's just one; I'm using the synchronous scheduler) running out of memory and getting killed. This surprises me, because no one partition is more than ~300MB uncompressed.
I've written a little script to profile Dask with progressively larger portions of my dataset, and I've noticed that Dask's memory consumption scales with the size of the input. Here's the script:
import os
import dask
import dask.dataframe as dd
from dask.diagnostics import ResourceProfiler, ProgressBar

def run(input_path, output_path, input_limit):
    dask.config.set(scheduler="synchronous")
    filenames = os.listdir(input_path)
    full_filenames = [os.path.join(input_path, f) for f in filenames]
    rprof = ResourceProfiler()
    with rprof, ProgressBar():
        df = dd.read_parquet(full_filenames[:input_limit])
        df = df.set_index("friend")
        df.to_parquet(output_path)
    rprof.visualize(file_path=f"profiles/input-limit-{input_limit}.html")
Here are the charts produced by the visualize() call:
[Charts for input limits of 2, 4, 8, and 16]
The full dataset is ~50 input files, so at this rate of growth I'm not surprised that the job eats up all of the memory on my 32GB machine.
My understanding is that the whole point of Dask is to allow you to operate on larger-than-memory datasets. I get the impression that people are using Dask to process datasets far larger than my ~14GB one. How do they avoid this issue with scaling memory consumption? What am I doing wrong here?
I'm not interested in using a different scheduler or in parallelism at this point. I'd just like to know why Dask is consuming so much more memory than I would have thought necessary.
This turns out to have been a performance regression in Dask that was fixed in the 2021.03.0 release.
See this Github issue for more info.
I have a data source, around 100GB, and I'm trying to write it partitioned using a date column.
In order to avoid small chunks inside the partitions, I've added repartition(5) to have at most 5 files inside each partition:
df.repartition(5).write.orc("path")
My problem here is that only 5 executors out of the 30 I'm allocating are actually running. In the end I have what I want (5 files inside each partition), but since only 5 executors are running, the execution time is extremely high.
Do you have any suggestions on how I can make it faster?
I fixed it simply by using:
df.repartition($"dateColumn").write.partitionBy("dateColumn").orc(path)
And allocating the same number of executors as the number of partitions I'll have in the output.
Thanks all
You can use repartition along with partitionBy to resolve the issue.
There are two ways to solve this.
Suppose you need to partition by dateColumn
df.repartition(5, 'dateColumn').write.partitionBy('dateColumn').parquet(path)
In this case the number of executors used will be equal to 5 * distinct(dateColumn), and all your dates will contain 5 files each.
Another approach is to repartition your data to 3 times the number of executors, then use maxRecordsPerFile to save the data. This will create equally sized files, but you will lose control over the number of files created.
df.repartition(60).write.option('maxRecordsPerFile',200000).partitionBy('dateColumn').parquet(path)
Spark can run 1 concurrent task for every partition of an RDD or data frame (up to the number of cores in the cluster). If your cluster has 30 cores, you should have at least 30 partitions. On the other hand, a single partition typically shouldn’t contain more than 128MB and a single shuffle block cannot be larger than 2GB (see SPARK-6235).
Since you want to reduce your execution time, it is better to increase the number of partitions and, at the end of your job, reduce the number of partitions for the final output.
For better (more even) distribution of your data among partitions, it is better to use the hash partitioner.
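A minimal PySpark sketch of that idea (the partition count of 30 is only illustrative, matching the number of executors in the question): passing a column to repartition hash-partitions the data on that column before the write.

# Hash-partition on the date column with enough partitions to keep all
# executors busy, then write one directory per date value.
df.repartition(30, 'dateColumn') \
  .write \
  .partitionBy('dateColumn') \
  .orc(path)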
I have a function that does the following:
Takes a file as input and does basic cleaning.
Extracts the required items from the file and writes them to a pandas dataframe.
The dataframe is finally converted into CSV and written into a folder.
This is the sample code:
import os
import multiprocessing

def extract_function(filename):
    with open(filename, 'r') as f:
        input_data = f.readlines()
    try:
        # some basic searching, pattern matching, extracting
        # dataframe creation with 10 columns, then the extracted values are
        # filled into the empty dataframe
        # finally df.to_csv()
        pass
    except Exception:
        pass  # exception handling not shown in the question

if __name__ == '__main__':
    pool_size = multiprocessing.cpu_count()
    filenames = os.listdir("/home/Desktop/input")
    pool = multiprocessing.Pool(pool_size)
    pool.map(extract_function, filenames)
    pool.close()
    pool.join()
The total number of files in the input folder is 4000. I used multiprocessing, as running the program normally with a for loop was taking some time. Below are the execution times of both approaches:
Normal CPU processing = 139.22 seconds
Multiprocessing = 18.72 seconds
My system specifications are:
Intel i5 7th gen, 12GB RAM, 1TB HDD, Ubuntu 16.04
While running the program for the 4000 files, all the cores were fully used (averaging around 90% each). So I decided to increase the number of files and repeat the process; this time the number of input files was increased from 4000 to 120,000. But this time, while running the code, the CPU usage was erratic at the start, and after some time the utilization went down (average usage around 10% per core). RAM utilization is also low, averaging 4GB max (the remaining 8GB is free). With 4000 files as input, the writing of CSV files was fast: at any instant I could see a jump of around 1000 files or more. But with 120,000 files as input, the writing slowed to some 300 files, and this slowdown continued linearly; after some time the writing dropped to around 50-70 files per instant. All this time the majority of the RAM is free. I restarted the machine and tried again, to clear any unwanted zombie processes, but the result is still the same.
What is the reason for this? How can I achieve the same multiprocessing performance for a large number of files?
Note :
* Each file averages around 300KB in size.
* Each output file written is around 200 bytes.
* The total number of files is 4080, hence the total size is ~1.2GB.
* These same 4080 files were copied to obtain the 120,000 files.
* This program is an experiment to check multiprocessing for a large number of files.
Update 1
I have tried the same code on a much more powerful machine: Intel i7 8th gen 8700, 1TB SSHD & 60GB RAM. File writing was much faster than on a normal HDD. The program took:
For 4000 files - 3.7 sec
For 120,000 files - 2 min
At some point during the experiment I got the fastest completion time, 84 sec. At that point it was giving me consistent results across two consecutive runs. Thinking that it might be because I had correctly set the thread factor for the pool size, I restarted and tried again. But this time it was much slower. To give some perspective, during normal runs around 3000-4000 files would be written in a second or two, but this time it was writing below 600 files a second. In this case, too, the RAM was not being used at all, and even though the multiprocessing module is being used, all the cores average only around 3-7% utilization.
Reading from and writing to disk is slow, compared to running code and data from RAM. It is extremely slow compared to running code and data from the internal cache in the CPU.
In an attempt to make this faster, several caches are used.
1. A harddisk generally has a built-in cache. In 2012 I did some write testing on this; with the harddisk's write cache disabled, writing speed dropped from 72 MiB/s to 12 MiB/s.
2. Most operating systems today use otherwise unoccupied RAM as a disk cache.
3. The CPU has several levels of built-in caches as well.
(Usually there is a way to disable caches 1 and 2. If you try that you'll see read and write speed drop like a rock.)
So my guess is that once you pass a certain number of files, you exhaust one or more of the caches, and disk I/O becomes the bottleneck.
To verify, you would have to add code to extract_function to measure 3 things:
How long it takes to read the data from disk.
How long it takes to do the calculations.
How long it takes to write the CSV.
Have extract_function return a tuple of those three numbers, and analyse them. Instead of map, I would advise using imap_unordered, so you can start evaluating the numbers as soon as they become available.
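A rough sketch of that instrumentation (the real extraction is replaced by a trivial placeholder; the input path mirrors the question):

import os
import time
import multiprocessing

import pandas as pd

def extract_function(filename):
    t0 = time.perf_counter()
    with open(filename, 'r') as f:
        input_data = f.readlines()                 # 1. reading the data from disk
    t1 = time.perf_counter()
    df = pd.DataFrame({'line': input_data})        # 2. placeholder for the real extraction work
    t2 = time.perf_counter()
    df.to_csv(filename + '.csv', index=False)      # 3. writing the CSV
    t3 = time.perf_counter()
    return filename, t1 - t0, t2 - t1, t3 - t2

if __name__ == '__main__':
    input_dir = "/home/Desktop/input"
    filenames = [os.path.join(input_dir, f) for f in os.listdir(input_dir)]
    with multiprocessing.Pool() as pool:
        # imap_unordered yields each timing tuple as soon as a worker finishes
        # a file, so the numbers can be analysed while the run is in progress.
        for name, read_s, calc_s, write_s in pool.imap_unordered(extract_function, filenames):
            print(name, read_s, calc_s, write_s)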
If disk I/O turns out to be the problem, consider using an SSD.
As #RolandSmith & #selbie suggested, I avoided the continuous IO of writing to CSV files by replacing it with dataframes and appending to them. This, I think, cleared up the inconsistencies. I checked the "feather" and "parquet" high-performance IO modules as suggested by #CoMartel, but I think they are for compressing large files into a smaller dataframe structure; the appending options were not there.
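Roughly, the change looks like this (a sketch, assuming extract_function is modified to return its dataframe instead of writing a CSV, and filenames is the list of input paths):

import multiprocessing
import pandas as pd

if __name__ == '__main__':
    with multiprocessing.Pool() as pool:
        # Each worker returns a small DataFrame instead of writing its own tiny CSV.
        results = pool.map(extract_function, filenames)
    # Append everything into one dataframe and write it once at the end.
    combined = pd.concat(results, ignore_index=True)
    combined.to_csv("combined_output.csv", index=False)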
Observations:
The program runs slowly on the first run; successive runs are faster. This behavior was consistent.
I checked for trailing Python processes left running after the program completed but couldn't find any. So there is some kind of caching within the CPU/RAM which makes execution faster on successive runs.
The program for 4000 input files took 72 sec on the first execution and then an average of 14-15 sec for all successive runs after that.
Restarting the system clears those caches and causes the program to run slower again on the first run.
The average fresh run time is 72 sec. But killing the program as soon as it starts and then running it again took 40 sec for the first run after termination, and an average of 14 sec for all successive runs.
During a fresh run, core utilization is around 10-13%. But after the successive runs, core utilization is 100%.
I checked with the 120,000 files and it follows the same pattern. So, for now, the inconsistency is solved. If such code needs to be used as a server, a dry run should be made for the CPU/RAM caches to warm up before it starts accepting API queries, for faster results.
I am using Dask to parallelize time series satellite imagery analysis on a cluster with a substantial amount of computational resources.
I have set up a distributed scheduler with many workers (--nprocs = 56) each managing one thread (--nthreads = 1) and 4GB of memory due to the embarrassingly parallel nature of the work.
My data comes in as an xarray that is chunked into a dask array and map_blocks is used to map a function across each chunk in order to generate an output array that will be saved to an image file.
data = inputArray.chunk(chunks={'y':1})
client.persist(data)
future = data.data.map_blocks(timeSeriesTrends.timeSeriesTrends, jd, drop_axis=[1])
future = client.persist(future)
dask.distributed.wait(future)
outputArray = future.compute()
My problem is that Dask does not make use of all the resources I have allocated to it. Instead, it begins with very few parallelized tasks and slowly adds more as processes finish, without ever reaching capacity.
This dramatically restricts the capabilities of the hardware I have access to as many of my resources spend most of their time sitting idle.
Is my approach appropriate for generating an output array from an input array? How can I best make use of the hardware I have access to in this situation?