Is it possible to read and run a data cleaning and transformation pipeline (all with Pandas) on 4 different CSV files in parallel, one thread per file? RAM would not be a problem because I would read and transform the data in chunks, then persist it to a DB. There would not be any shared state between the threads, and the threads would not block on the DB I/O of persisting DataFrame chunks because the inserts will be async using aiopg. I know how to write the code, but I am not very experienced with Python, and I just want to know whether the GIL would be relevant/problematic in this case, and whether it wouldn't be faster to just do it sequentially. Thanks!
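For reference, a minimal sketch of the layout described above; the file paths and the transform() and persist() helpers are hypothetical placeholders, and how much the threads actually overlap depends on how much of the chunk parsing and transformation releases the GIL.

from concurrent.futures import ThreadPoolExecutor

import pandas as pd

CSV_FILES = ["a.csv", "b.csv", "c.csv", "d.csv"]  # hypothetical paths

def transform(chunk):
    # placeholder for the cleaning/transformation logic
    return chunk.dropna()

def persist(chunk):
    # placeholder for the (async) DB insert of one chunk
    pass

def process_file(path):
    # read and transform in chunks so memory stays bounded
    for chunk in pd.read_csv(path, chunksize=100_000):
        persist(transform(chunk))

with ThreadPoolExecutor(max_workers=len(CSV_FILES)) as pool:
    list(pool.map(process_file, CSV_FILES))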
I have several parallel processes working on the same data. The data consists of 100,000+ arrays (all stored in an HDF5 file, 90,000 values per array).
For now, each process accesses the data individually, and it works well since the HDF5 file supports concurrent reading... but only up to a certain number of parallel processes. Above 14-16 processes accessing the data, I see a drop in efficiency. I was expecting it (too many I/O operations on the same file, I reckon), but I don't know how to correct the problem properly.
Since the processes all use the same data, the best approach would be for the main process to read the file, load an array (or a batch of arrays), and feed it to the running parallel processes without needing to stop them. A kind of dynamic shared memory, if you will.
Is there any way to do it properly and solve my scalability issue?
I use the native "multiprocessing" Python library.
Thanks,
Victor
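One way to sketch the single-reader idea from the question is a producer/consumer setup in which only the main process touches the HDF5 file and the workers receive arrays over a queue; the file name, dataset layout and worker count below are hypothetical, and note that a multiprocessing.Queue copies the data between processes rather than truly sharing it (multiprocessing.shared_memory would be the next step for zero-copy sharing).

import multiprocessing as mp

import h5py
import numpy as np

def worker(queue):
    while True:
        item = queue.get()
        if item is None:              # sentinel: no more work
            break
        name, array = item
        # ... per-array computation goes here ...

def main():
    queue = mp.Queue(maxsize=32)      # bounded, so the reader cannot race too far ahead
    workers = [mp.Process(target=worker, args=(queue,)) for _ in range(8)]
    for p in workers:
        p.start()

    # only the main process opens the HDF5 file
    with h5py.File("data.h5", "r") as f:
        for name in f:                # assumes one dataset per array
            queue.put((name, np.asarray(f[name])))

    for _ in workers:
        queue.put(None)               # one sentinel per worker
    for p in workers:
        p.join()

if __name__ == "__main__":
    main()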
Is there some way to lock a zarr store when using append?
I have already found out the hard way that using append with multiple processes is a bad idea (the batches being appended aren't aligned with the chunk size of the store). The reason I'd like to use multiple processes is that I need to transform the original arrays before appending them to the zarr store. It would be nice to be able to block other processes from writing concurrently, but still perform the transformations in parallel and then append their data in series.
Edit:
Thanks to jdehesa's suggestion, I became aware of the synchronization section of the documentation. I passed a ProcessSynchronizer pointing to a folder on disk to my array at creation time in the main thread, then spawned a bunch of worker processes with concurrent.futures and passed the array to all the workers for them to append their results. I could see that the ProcessSynchronizer did something, as the folder I pointed it to filled with files, but the array my workers wrote to ended up missing rows (compared to when it was written from a single process).
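One hedged work-around for the situation above is to keep append() in the main process only: the workers do the transformations and send their results back, and the parent appends them one at a time, so the chunk boundaries never see concurrent writers. The transform() function, shapes and store path below are illustrative.

from concurrent.futures import ProcessPoolExecutor

import numpy as np
import zarr

def transform(i):
    # placeholder for the per-array transformation done in a worker
    return np.full((100, 10), i, dtype="float64")

if __name__ == "__main__":
    with ProcessPoolExecutor() as pool:
        results = pool.map(transform, range(20))   # transformations run in parallel
        first = next(results)
        store = zarr.open("results.zarr", mode="w", shape=first.shape,
                          chunks=(1000, 10), dtype=first.dtype)
        store[:] = first
        for block in results:                      # appends happen serially in the parent
            store.append(block)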
I'm trying to load a ~67 GB dataframe (6,000,000 features by 2,300 rows) into Dask for machine learning. I'm using a 96-core machine on AWS that I wish to utilize for the actual machine learning bit. However, Dask loads CSVs in a single thread. It has already taken a full 24 hours and it hasn't loaded.
# I tried to display a progress bar, but it is not implemented for dask's read_csv
import dask.dataframe as dd
from dask.diagnostics import ProgressBar

pbar = ProgressBar()
pbar.register()
df = dd.read_csv('../Larger_than_the_average_CSV.csv')
Is there a faster way to load this into Dask and make it persistent? Should I switch to a different technology (Spark on Scala or PySpark?)
Dask is probably still loading it as I can see a steady 100% CPU utilization in top.
The code you show in the question probably takes no time at all, because you are not actually loading anything; you are just setting up the task graph that prescribes the job. How long the actual loading takes will depend on the blocksize you specify.
There are two main bottlenecks to consider for actual loading:
getting the data from disc into memory: raw data transfer over a single disc interface, and
parsing that data into in-memory structures.
There is not much you can do about the former if you are on a local disc, and you would expect it to be a small fraction of the total time.
The latter may suffer from the GIL, even though dask will execute in multiple threads by default (which is why it may appear that only one thread is being used). You would do well to read the dask documentation about the different schedulers, and you should try the distributed scheduler, with a mix of threads and processes, even though you are on a single machine.
Finally, you probably don't want to "load" the data at all, but to process it. Yes, you can persist into memory with Dask if you wish (dask.persist, funnily enough), but please do not use many workers to load the data just so you can then turn it into a Pandas dataframe in your client process's memory.
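A rough sketch of that advice, with assumed worker counts and blocksize: the distributed scheduler on a single machine sidesteps the GIL during parsing by using several processes, and persist() keeps the partitions on the workers instead of pulling a huge Pandas frame back into the client.

import dask.dataframe as dd
from dask.distributed import Client

if __name__ == "__main__":
    # a local "distributed" cluster: several processes, a couple of threads each
    client = Client(n_workers=12, threads_per_worker=2)

    df = dd.read_csv("../Larger_than_the_average_CSV.csv", blocksize="256MB")

    # optional: keep the parsed partitions in worker memory for the ML step
    df = df.persist()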
I have a file which is placed in HDFS. I would like to know what is an efficient way to read the file using Python. Can I use PySpark?
You can use PySpark, which is the Python API for Spark. It will allow you to leverage cluster resources through Spark. I would recommend taking a smaller-sized chunk of the 1 TB file and testing your code on that. If all looks good, then you can submit your job on the full dataset.
If using Spark: depending upon how much memory you have on the cluster, consider caching in memory the RDDs (or DataFrames) that you plan to reuse frequently. This will speed up your job executions.
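A small illustration of that suggestion, shown with the DataFrame API (caching works the same way on RDDs); the HDFS path and the CSV format are assumptions.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("read-hdfs-example").getOrCreate()

# read the file straight from HDFS (CSV assumed here)
df = spark.read.csv("hdfs:///data/big_file.csv", header=True)

# cache it if the same data is reused across several actions
df.cache()
print(df.count())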
Does hdf5 support parallel writes to the same file, from different threads or from different processes? Alternatively, does hdf5 support non-blocking writes?
If so then is this also supported by NetCDF4, and by the python bindings for either?
I am writing an application where I want different CPU cores to concurrently compute output intended for non-overlapping tiles of a very large output array. (Later I will want to read sections from it as a single array, without needing my own driver to manage indexing many separate files, and ideally without the additional IO task of rearranging it on disk.)
Not trivially, but there are various potential work-arounds.
The ordinary HDF5 library apparently does not even support concurrent reading of different files by multiple threads. Consequently NetCDF4, and the python bindings for either, will not support parallel writing.
If the output file is pre-initialised and has chunking and compression disabled, to avoid having a chunk index, then (in principle) concurrent non-overlapping writes to the same file by separate processes might work(?).
In more recent versions of HDF5, there should be support for virtual datasets. Each process would write output to a different file, and afterward a new container file would be created, consisting of references to the individual data files (but otherwise able to be read like a normal HDF5 file).
There exists a "Parallel HDF5" library for MPI. Although MPI might otherwise seem like overkill, it would have advantages if scaling up later to multiple machines.
If writing output is not a performance bottleneck, a multithreaded application could probably implement one output thread (utilising some form of queue data-structure).
[Edit:] Another option is to use the zarr format instead, which places each chunk in a separate file (an approach which future versions of HDF currently seem likely to adopt).
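A minimal sketch of that zarr option, with illustrative shapes and store path: because each chunk lives in its own file, separate processes can write non-overlapping, chunk-aligned tiles of the same array without any locking.

from concurrent.futures import ProcessPoolExecutor

import numpy as np
import zarr

SHAPE = (4000, 4000)
TILE = 1000          # tiles are aligned with the chunk size below

def write_tile(origin):
    i, j = origin
    z = zarr.open("output.zarr", mode="r+")     # each process re-opens the store
    z[i:i + TILE, j:j + TILE] = np.random.random((TILE, TILE))

if __name__ == "__main__":
    # create the store once, with chunks matching the tiles
    zarr.open("output.zarr", mode="w", shape=SHAPE, chunks=(TILE, TILE), dtype="float64")

    tiles = [(i, j) for i in range(0, SHAPE[0], TILE) for j in range(0, SHAPE[1], TILE)]
    with ProcessPoolExecutor() as pool:
        list(pool.map(write_tile, tiles))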
If you are running in AWS, check out HDF Cloud: https://www.hdfgroup.org/solutions/hdf-cloud.
This is a service that enables multiple reader/multiple writer workflows and is largely feature compatible with the HDF5 library.
The client SDK doesn't support non-blocking writes, but of course if you are using the REST API directly, you could do non-blocking I/O just as you would with any other HTTP-based service.