Reading multiple files with Dask - python

I'm trying out Dask on a simple, embarrassingly parallel read of 24 scientific data files, each ~250 MB, so ~6 GB in total. The data is in a 2D array format. It's stored on a parallel file system and read from a cluster, though right now I'm reading from only a single node. The data is in a format similar to HDF5 (called Adios) and is read similarly to the h5py package. Each file takes about 4 seconds to read. I'm following the skimage read example here (http://docs.dask.org/en/latest/array-creation.html). However, I never get a speed-up, no matter how many workers I use. I thought perhaps I was using it wrong and was still only using 1 worker, but when I profile it there do appear to be 24 workers. How can I get a speed-up when reading this data?
import adios as ad
import numpy as np
import dask.array as da
import dask

# delayed reader: open one Adios file and pull out its 'data' array
bpread = dask.delayed(lambda f: ad.file(f)['data'][...], pure=True)
lazy_datas = [bpread(path) for path in paths]
sample = lazy_datas[0].compute()  # read one file eagerly to get dtype/shape

# read in data: build lazy dask arrays from the delayed reads, then stack them
arrays = [da.from_delayed(lazy_data, dtype=sample.dtype, shape=sample.shape)
          for lazy_data in lazy_datas]
datas = da.stack(arrays, axis=0)
datas2 = datas.compute(scheduler='processes', num_workers=24)

I recommend looking at the /profile tab of the scheduler's dashboard. This will tell you which lines of code are taking the most time.
My first guess is that you are already maxing out your disk's ability to serve data to you. You aren't CPU bound, so adding more cores won't help. That's just a guess, though; as always, you'll have to profile and investigate your situation further to know for sure.
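A quick way to test that guess without the dashboard is to time the same computation with different worker counts: if the wall time barely changes as workers are added, the disk (not the CPU) is the bottleneck. This is only a hedged diagnostic sketch, reusing the datas array from the question's code:

import time

for n in (1, 4, 12, 24):
    start = time.time()
    datas.compute(scheduler='processes', num_workers=n)
    print(f"{n:>2} workers: {time.time() - start:.1f} s")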

Related

Using set_index() on a Dask Dataframe and writing to parquet causes memory explosion

I have a large set of Parquet files that I am trying to sort on a column. Uncompressed, the data is around ~14 GB, so Dask seemed like the right tool for the job. All I'm doing with Dask is:
Reading the parquet files
Sorting on one of the columns (called "friend")
Writing as parquet files in a separate directory
I can't do this without the Dask process (there's just one; I'm using the synchronous scheduler) running out of memory and getting killed. This surprises me, because no single partition is more than ~300 MB uncompressed.
I've written a little script to profile Dask with progressively larger portions of my dataset, and I've noticed that Dask's memory consumption scales with the size of the input. Here's the script:
import os

import dask
import dask.dataframe as dd
from dask.diagnostics import ResourceProfiler, ProgressBar

def run(input_path, output_path, input_limit):
    dask.config.set(scheduler="synchronous")
    filenames = os.listdir(input_path)
    full_filenames = [os.path.join(input_path, f) for f in filenames]
    rprof = ResourceProfiler()
    with rprof, ProgressBar():
        df = dd.read_parquet(full_filenames[:input_limit])
        df = df.set_index("friend")
        df.to_parquet(output_path)
    rprof.visualize(file_path=f"profiles/input-limit-{input_limit}.html")
Here are the charts produced by the visualize() call:
[Resource profile charts for input limits of 2, 4, 8, and 16 files]
The full dataset is ~50 input files, so at this rate of growth I'm not surprised that the job eats up all of the memory on my 32 GB machine.
My understanding is that the whole point of Dask is to let you operate on larger-than-memory datasets. I get the impression that people use Dask to process datasets far larger than my ~14 GB one. How do they avoid this issue of memory consumption scaling with input size? What am I doing wrong here?
I'm not interested in using a different scheduler or in parallelism at this point. I'd just like to know why Dask is consuming so much more memory than I would have thought necessary.
This turns out to have been a performance regression in Dask that was fixed in the 2021.03.0 release.
See this GitHub issue for more info.
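If you want to confirm you are past that regression, a minimal check (assuming a pip-installed Dask) might look like:

import dask

print(dask.__version__)  # anything older than 2021.03.0 still has the regression
# if needed: pip install --upgrade "dask[dataframe]>=2021.03.0"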

Dataset Wrapper Class for Parallel Reads of HDF5 via Multiprocessing

I need to manage a large amount of physiological waveform data, like ECGs, and so far I have found HDF5 to be the best format for compatibility when doing machine learning/data science in Python (pandas, PyTorch, scikit-learn, etc.). The ability to slice/query/read only certain rows of a dataset is particularly appealing. However, I am struggling to develop a stable wrapper class that allows simple yet reliable parallel reads from many multiprocessing workers. I am fine with single-threaded writes, as I only have to ETL my source data into the HDF5 once, but the lack of parallel reads really hurts run times during data analysis, PyTorch model runs, etc. I have tried both PyTables (via pandas) and h5py, with the following scenarios:
One large HDF5 file
Could only get single-process reads stable; multiprocessing pretty quickly corrupted the entire file.
One HDF5 file per time series interval
i.e., hundreds of millions of files. Still had collisions leading to corrupted files, even when opening read-only with a context manager. It also means backing up or transferring so many files is very slow due to the high I/O cost.
Apologies for the lack of code samples, but I've tried dozens of approaches over the last few months (each was hundreds of lines) and eventually always ran into stability issues. Is there any good example code for building a dataset wrapper class backed by HDF5? AFAIK, h5py's MPI-based "Parallel HDF5" is not relevant since this is on one machine, and "Single Writer Multiple Reader" is not necessary since I am fine writing single-threaded before any reading. Ideally, whenever I need data in my project, I'd like to be able to do something like the following:
import dataset  # my HDF5 dataset wrapper class
import multiprocessing as mp

def dataloader(idxs):
    temp = []
    ds = dataset.Dataset()
    for _, idx in idxs.iterrows():
        df = ds.get_data(idx)
        # do some data wrangling in parallel
        temp.append(df)
    return temp

pool = mp.Pool(6)
data = pool.map(dataloader, indices)
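As a hedged sketch of one pattern that often keeps multiprocess reads stable: have the wrapper open its own read-only h5py handle lazily inside each worker process, rather than sharing a handle created before the fork. The file path and dataset name below are hypothetical placeholders, not anything from the original question:

import h5py
import pandas as pd

class Dataset:
    """Hypothetical wrapper: each worker process opens its own read-only handle."""

    def __init__(self, path="waveforms.h5"):  # path is a placeholder
        self.path = path
        self._h5 = None                        # do not open in the parent process

    def _file(self):
        if self._h5 is None:
            # first access inside the worker: open a handle owned by this process only
            self._h5 = h5py.File(self.path, "r")
        return self._h5

    def get_data(self, idx):
        # slice only the rows belonging to this record ('signals' is a placeholder name)
        return pd.DataFrame(self._file()["signals"][idx])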

Memory problems with pandas (understanding memory issues)

The issue is that I have 299 .csv files, each 150-200 MB on average (millions of rows, 12 columns), which together make up one year of data (roughly 52 GB/year). I have 6 years of data and would ultimately like to concatenate all of them into a single .csv file with Python. As you might expect, I ran into memory errors when trying the following code (my machine has 16 GB of RAM):
import os, gzip, pandas as pd, time

rootdir = "/home/eriz/Desktop/2012_try/01"

dataframe_total_list = []
counter = 0
start = time.time()
for subdir, dirs, files in os.walk(rootdir):
    dirs.sort()
    for files_gz in files:
        with gzip.open(os.path.join(subdir, files_gz)) as f:
            df = pd.read_csv(f)
            dataframe_total_list.append(df)
            counter += 1
            print(counter)

total_result = pd.concat(dataframe_total_list)
total_result.to_csv("/home/eriz/Desktop/2012_try/01/01.csv", encoding="utf-8", index=False)
My aim: get a single .csv file to then use to train DL models, etc.
My constraints: I'm very new to handling such huge amounts of data, but I have done part of the work:
I know that multiprocessing will not help much here; it is a sequential job in which I need each task to complete before I can start the following one. The main problem is running out of memory.
I know that pandas handles this well, even with a chunksize added. However, the memory problem is still there because the amount of data is huge.
I have tried to split the work into little tasks so that I don't run out of memory, but I will run into it anyway later when concatenating.
My questions:
Is it still possible to do this task with Python/pandas in some other way I'm not aware of, or do I need to switch to a database approach no matter what? If so, which would you advise?
If the database approach is the only path, will I have problems when I need to perform Python-based operations to train the DL model? I mean, if I need to use pandas/numpy functions to transform the data, is that possible, or will I have problems because of the file size?
Thanks a lot in advance; I would really appreciate a more in-depth explanation of the topic.
UPDATE 10/7/2018
After trying the code snippets that #mdurant pointed out below, I have learned a lot and corrected my perspective on Dask and memory issues.
Lessons:
Dask is there to be used AFTER the initial pre-processing (if you end up with huge files that pandas struggles to load/process). Once you have your "desired" mammoth file, you can load it into a dask.dataframe object without any issue and process it.
Memory related:
First lesson: come up with a procedure so that you don't need to concatenate all the files and run out of memory; just process them in a loop, reducing their content by changing dtypes, dropping columns, resampling, and so on (see the sketch below).
Second lesson: try to put into memory ONLY what you need, so that you don't run out.
Third lesson: if neither of the other lessons applies, look for an EC2 instance or big-data tools like Spark, SQL databases, etc.
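As a hedged illustration of the first two lessons, reusing rootdir and the os.walk loop from the code above. The dropped column, the downcast column, and the assumption that the reduced frames fit in memory are all placeholders; substitute your own schema:

import gzip
import os

import pandas as pd

reduced = []
for subdir, dirs, files in os.walk(rootdir):
    dirs.sort()
    for files_gz in files:
        with gzip.open(os.path.join(subdir, files_gz)) as f:
            df = pd.read_csv(f)
        df = df.drop(columns=["unused_col"])          # drop what you don't need
        df["value"] = df["value"].astype("float32")   # downcast wide dtypes
        reduced.append(df)                            # only the reduced frame stays in memory

total = pd.concat(reduced, ignore_index=True)         # far smaller than the raw concat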
Thanks #mdurant and #gyx-hh for your time and guidance.
First thing: taking the contents of each CSV and concatenating them into one huge CSV is simple enough; you don't need pandas or anything else for that (or even Python):
outfile = open('outpath.csv', 'wb')  # binary mode, since gzip.open yields bytes by default
for files_gz in files:
    with gzip.open(os.path.join(subdir, files_gz)) as f:
        for line in f:
            outfile.write(line)
outfile.close()
(You may want to skip the first line of each CSV if it has a header row with column names; see the sketch below.)
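A hedged sketch of that variation, reusing subdir and files from the os.walk loop in the question's code and keeping only the first file's header:

import gzip
import os

first = True
with open('outpath.csv', 'wb') as outfile:
    for files_gz in sorted(files):
        with gzip.open(os.path.join(subdir, files_gz)) as f:
            header = f.readline()       # header line of every file
            if first:
                outfile.write(header)   # write it only once
                first = False
            for line in f:
                outfile.write(line)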
Processing the data is harder. The reason is that, although Dask can read all the files and work on the set as a single dataframe, a gzipped file cannot be split into smaller chunks (random access doesn't mix with gzip compression), so each whole file becomes one in-memory partition; if any single file needs more memory than your system can handle, processing will fail.
The output file, however, is (presumably) uncompressed, so you can do:
import dask.dataframe as dd
df = dd.read_csv('outpath.csv') # automatically chunks input
df[filter].groupby(fields).mean().compute()
Here, only the reference to dd and the .compute() are specific to dask.

Dask Read Data from Binary File

I am trying to implement an out-of-core version of the k-means clustering algorithm in Python. I learned about Dask from this git project: K-Mean Parallel... Dask...
I use the same git project, but am trying to load my data, which is in the form of a binary file. The binary file contains data points with 1024 floating-point features each.
My problem is: how do I load the data if it is very huge, i.e. larger than the available memory itself? I tried to use numpy's fromfile function, but my kernel somehow dies. Some of my questions are:
Q. Is it possible to load data into numpy created from some other source (the file was not created by numpy but a c script)?
Q. Is there a module for dask that can load data directly from a binary file? I have seen csv files used but nothing related to binary files.
I only dabble in Dask, but calling np.fromfile in the code below via dask.delayed should allow you to work with it lazily. That said, I'm working on this myself, so this is currently a partial answer.
For your first question: I currently load .bin files created by a LabVIEW program with no issues, using code similar to this:
import os

import numpy as np

method = "rb"                # binary read mode
chunkSize = 1_000_000        # number of float32 values per chunk; adjust as needed
bytesPerValue = 4            # float32 is 4 bytes
fileSize = os.path.getsize(myfile)

data = []
with open(myfile, method) as file:
    for chunk in range(0, fileSize, chunkSize * bytesPerValue):
        data.append(np.fromfile(file, dtype=np.float32, count=chunkSize))
For the second question: I have not been able to find anything in Dask for dealing with binary files directly. I find that converting to something Dask can use is worth it.
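Building on the first part of this answer, here is a hedged sketch of wrapping those chunked np.fromfile reads in dask.delayed so the whole file becomes one lazy dask array. The myfile path, the float32 dtype, and the chunk size are assumptions; only the 1024-feature record length comes from the question:

import os

import numpy as np
import dask
import dask.array as da

n_features = 1024
bytes_per_record = n_features * 4          # assuming float32 records
records_per_chunk = 10_000                 # tune to your memory budget
n_records = os.path.getsize(myfile) // bytes_per_record

def _read_chunk(first_record, count):
    # read `count` records starting at `first_record`, straight from disk
    values = np.fromfile(myfile, dtype=np.float32,
                         count=count * n_features,
                         offset=first_record * bytes_per_record)
    return values.reshape(count, n_features)

read_chunk = dask.delayed(_read_chunk, pure=True)

chunks = []
for start in range(0, n_records, records_per_chunk):
    count = min(records_per_chunk, n_records - start)
    chunks.append(da.from_delayed(read_chunk(start, count),
                                  shape=(count, n_features),
                                  dtype=np.float32))

X = da.concatenate(chunks, axis=0)         # lazy (n_records, 1024) array for k-means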

How to save big array so that it will take less memory in python?

I am new to Python. I have a big array, a, with dimensions (43200, 4000), and I need to save it for future processing. When I try to save it with np.savetxt, the txt file is too large and my program runs into a memory error, as I need to process 5 files of the same size. Is there any way to save huge arrays so that they take less memory?
Thanks.
Saving your data to a text file is hugely inefficient. Numpy has built-in saving commands, save and savez/savez_compressed, which are much better suited to storing large arrays.
Depending on how you plan to use your data, you should also look into the HDF5 format (h5py or PyTables), which lets you store large datasets without having to load them all into memory.
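For example, a minimal sketch of the built-in binary savers (the file names are arbitrary, and the sample array only stands in for your own a):

import numpy as np

a = np.random.rand(43200, 4000).astype(np.float32)  # float32 halves the footprint vs float64

np.save("a.npy", a)                  # fast, uncompressed binary
np.savez_compressed("a.npz", a=a)    # slower to write, but smaller on disk

a_back = np.load("a.npy")            # or: np.load("a.npz")["a"]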
You can use PyTables to create a Hierarchical Data Format (HDF) file to store the data. This provides some interesting in-memory options that link the object you're working with to the file it's saved in.
Here is another StackOverflow question that demonstrates how to do this: "How to store a NumPy multidimensional array in PyTables."
If you are willing to work with your array as a Pandas DataFrame object, you can also use the Pandas interface to PyTables / HDF5, e.g.:
import pandas
import numpy as np
a = np.ones((43200, 4000)) # Not recommended.
x = pandas.HDFStore("some_file.hdf")
x.append("a", pandas.DataFrame(a)) # <-- This will take a while.
x.close()
# Then later on...
my_data = pandas.HDFStore("some_file.hdf") # might also take a while
usable_a_copy = my_data["a"] # Be careful of the way changes to
# `usable_a_copy` affect the saved data.
copy_as_nparray = usable_a_copy.values
With files of this size, you might also consider whether your application can use a parallel algorithm, potentially applied to only subsets of the large arrays rather than needing to consume the whole array before proceeding.
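If you go the np.save route, one hedged way to work on subsets is NumPy's memory mapping, which reads only the slices you touch (this requires the uncompressed .npy format; the file name matches the sketch above):

import numpy as np

a_mm = np.load("a.npy", mmap_mode="r")  # no data is read into RAM yet
block = np.asarray(a_mm[0:1000, :])     # only this slice is pulled from disk
print(block.mean())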
