How can I read a large CSV file into Python with speed? - python

I'm trying to load a ~67 GB dataframe (6,000,000 features by 2,300 rows) into Dask for machine learning. I'm using a 96-core machine on AWS that I wish to utilize for the actual machine learning bit. However, Dask loads CSVs in a single thread. It has already been running for a full 24 hours and it still hasn't loaded.
# I tried to display a progress bar, but it is not implemented for dask's read_csv
import dask.dataframe as dd
from dask.diagnostics import ProgressBar

pbar = ProgressBar()
pbar.register()
df = dd.read_csv('../Larger_than_the_average_CSV.csv')
Is there a faster way to load this into Dask and make it persistent? Should I switch to a different technology (Spark on Scala, or PySpark)?
Dask is probably still loading it as I can see a steady 100% CPU utilization in top.

The code you show in the question probably takes no time at all, because you are not actually loading anything, just setting up the task graph for the job. How long the actual load takes will depend on the chunk size you specify.
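A minimal sketch of what that lazy setup looks like (blocksize is Dask's name for the per-partition chunk size; the value shown is just an illustration):
import dask.dataframe as dd

# This only builds the task graph; no data is read yet.
# blocksize controls how much of the file each partition covers.
df = dd.read_csv('../Larger_than_the_average_CSV.csv', blocksize='256MB')
# The real work happens only when you call .compute() or .persist().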
There are two main bottlenecks to consider for the actual loading:
getting the data from disk into memory, i.e. the raw data transfer over a single disk interface;
parsing that data into in-memory objects.
There is not much you can do about the former if you are on a local disk, and you would expect it to be a small fraction of the total time.
The latter may suffer from the GIL, even though Dask will execute in multiple threads by default (which is why it may appear that only one thread is being used). You would do well to read the Dask documentation about the different schedulers, and you should try the distributed scheduler, even though you are on a single machine, with a mix of threads and processes.
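A minimal sketch of running the distributed scheduler locally (the worker and thread counts are illustrative, not tuned for a 96-core machine):
import dask.dataframe as dd
from dask.distributed import Client, LocalCluster

# Multiple worker processes side-step the GIL for the CPU-heavy parsing;
# a few threads per worker still help with I/O.
cluster = LocalCluster(n_workers=8, threads_per_worker=4)
client = Client(cluster)

df = dd.read_csv('../Larger_than_the_average_CSV.csv')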
Finally, you probably don't want to "load" the data at all, but to process it. Yes, you can persist it into memory with Dask if you wish (dask.persist, funnily enough), but please do not use many workers to load the data only to then turn it into a Pandas dataframe in your client process's memory.
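For example, a hedged sketch of the difference, keeping the data distributed instead of pulling it back into the client:
# Keeps the partitions in the workers' memory for later computations.
df = df.persist()

# By contrast, this collects everything into a single Pandas dataframe
# in the client process, which defeats the point of the 96 cores.
# pdf = df.compute()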

Related

Is there a way to reduce resource usage when reading and writing large dataframes with polars?

For my specific problem I have been converting ".csv" files to ".parquet" files. The CSV files on disk are about 10-20 GB each.
A while back I was converting ".SAS7BDAT" files of similar size to ".parquet" files of similar data, but now I get the data as CSVs, so this might not be a good comparison. I used the pyreadstat library to read those files in (with the multi-threading parameter turned on, which didn't make a difference for some reason) and pandas to write. It was also a tiny bit faster, but I feel the code ran on a single thread, and it took a week to convert all my data.
This time, I tried the polars library and it was blazing fast. The CPU usage was near 100%, memory usage was also quite high. I tested this on a single file which would have taken hours, only to complete in minutes. The problem is that it uses too much of my computer's resources and my PC stalls. VSCode has crashed on some occasions. I have tried passing in the low memory parameter but it still uses a lot of resources. My suspicion is with the "reader.next_batches(500)" variable but I don't know for sure.
Regardless, is there a way to limit the CPU and memory usage while running this operation so I can at least browse the internet/listen to music while this runs in the background? With pandas the process is too slow, with polars the process is fast but my PC becomes unusable at times. See image for the code I used.
Thanks.
I tried the low memory parameter with polars but memory usage was still quite high. I was expecting to at least use my PC while this worked in the background. My hope is to use 50-80% of my PC's resources such that enough resources are free for other work while the files are being converted.
I see you're on Windows, so convert your notebook into a .py script, then from the command line run
start /low python yourscript.py
And/or use task manager to lower the priority of your python process once it's running.
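If you'd rather do it from Python itself, a minimal sketch using the psutil package (assuming it is installed; the priority constant shown is Windows-specific):
import psutil

# Lower the priority of the current process so other applications
# stay responsive while the conversion runs in the background.
p = psutil.Process()
p.nice(psutil.BELOW_NORMAL_PRIORITY_CLASS)  # Windows-only constant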

How to efficiently test code that loads a large dataset?

I'm writing a Python script to load, filter, and transform a large dataset using pandas. Iteratively changing and testing the script is very slow due to the load time of the dataset: loading the parquet files into memory takes 45 minutes while the transformation takes 5 minutes.
Is there a tool or development workflow that will let me test changes I make to the transformation without having to reload the dataset every time?
Here are some options I'm considering:
Develop in a jupyter-notebook: I use notebooks for prototyping and scratch work, but I find myself making mistakes or accidentally making my code un-reproducible when I develop in them. I'd like a solution that doesn't rely on a notebook if possible, as reproducibility is a priority.
Use Apache Airflow (or a similar tool): I know Airflow lets you define specific steps in a data pipeline that flow into one another, so I could break my script into separate "load" and "transform" steps. Is there a way to use Airflow to "freeze" the results of the load step in memory and iteratively run variations on the transformation step that follows?
Store the dataset in a proper Database on the cloud: I don't know much about databases, and I'm not sure how to evaluate if this would be more efficient. I imagine there is zero load time to interact with a remote database (because it's already loaded into memory on the remote machine), but there would likely be a delay in transmitting the results of each query from the remote database to my local machine?
Thanks in advance for your advice on this open ended question.
For a lot of work like that, I'll break it up into intermediate steps and pickle the results. I'll check if the pickle file exists before running the data load or transformation.
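A minimal sketch of that pattern (the file names and the load/transform steps are placeholders, not from the question):
import os
import pandas as pd

CACHE = 'loaded_dataset.pkl'

def load_dataset():
    # Expensive step: only runs when no cached pickle exists yet.
    if os.path.exists(CACHE):
        return pd.read_pickle(CACHE)
    df = pd.read_parquet('dataset/')  # placeholder path
    df.to_pickle(CACHE)
    return df

df = load_dataset()
# Now iterate freely on the cheap transformation step.
df_transformed = df.dropna()  # placeholder transformation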

How to dynamically share input data to parallel processes in Python

I have several parallel processes working on the same data. The data is composed of 100,000+ arrays (all stored in an HDF5 file, 90,000 values per array).
For now, each process accesses the data individually, and it works well since the HDF5 file supports concurrent reading... but only up to a certain number of parallel processes. Above 14-16 processes accessing the data, I see a drop in efficiency. I was expecting it (too many I/O operations on the same file, I reckon), but I don't know how to correct this problem properly.
Since the processes all use the same data, the best would be for the main process to read the file, load the array (or a batch of arrays), and feed it to the running parallel processes, without needing to stop them. A kind of dynamic shared memory if you will.
Is there any way to do it properly and solve my scalability issue?
I use the native "multiprocessing" Python library.
Thanks,
Victor
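A minimal sketch of the producer/consumer layout the question describes, where only the main process reads the HDF5 file and hands arrays to long-running workers through a queue (the h5py usage, file name, and worker count are assumptions, not from the question):
import multiprocessing as mp
import h5py

def handle(array):
    pass  # placeholder for the real per-array work

def worker(queue):
    # Long-running consumer: keeps pulling arrays until told to stop.
    while True:
        array = queue.get()
        if array is None:
            break
        handle(array)

if __name__ == '__main__':
    queue = mp.Queue(maxsize=32)  # bounded, so the reader cannot run too far ahead
    workers = [mp.Process(target=worker, args=(queue,)) for _ in range(16)]
    for w in workers:
        w.start()

    # Only the main process touches the HDF5 file.
    with h5py.File('data.h5', 'r') as f:  # placeholder file name
        for name in f:
            queue.put(f[name][:])  # load one array and hand it off

    for _ in workers:
        queue.put(None)  # sentinel to shut each worker down
    for w in workers:
        w.join()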

Does multiprocessing speed up file transfers compared to multithreading

I am writing a script to simultaneously accept many file transfers from many computers on a subnet using sockets (around 40 jpg files total). I want to use multithreading or multiprocessing to make the transfers occur as fast as possible.
I'm wondering if this type of image transfer is limited by the CPU - and therefore I should use multiprocessing - or if multithreading will be just as good here.
I would also be curious as to what types of activities are limited by the CPU and require multiprocessing, and which are better suited for multithreading.
If the following assumptions are true:
Your script is simply receiving data from the network and writing that data to disk (more or less) verbatim, i.e. it isn't doing any expensive processing on the data
Your script is running on a modern CPU with typical modern networking hardware (e.g. gigabit Ethernet or slower)
Your script's download routines are not grossly inefficient (e.g. you are receiving reasonably-sized chunks of data and not just 1 byte at a time or something silly like that)
... then it's unlikely that your download rate will be CPU-limited. More likely the bottleneck will be either network bandwidth or disk I/O bandwidth.
In any case, since AFAICT your use-case is embarrassingly parallel (i.e. the various downloads never have to communicate or interact with each other, they just each do their own thing independently), it's unlikely that using multithreading vs multiprocessing will make much difference in terms of performance. Of course, the only way to be certain is to try it both ways and measure the throughput each way.
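A minimal sketch of trying it both ways with concurrent.futures, so that only the executor class changes between the two runs (download_file and the source list are placeholders):
import time
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

def download_file(source):
    pass  # placeholder: receive one file over a socket and write it to disk

def run(executor_cls, sources):
    start = time.perf_counter()
    with executor_cls(max_workers=8) as pool:
        list(pool.map(download_file, sources))
    return time.perf_counter() - start

if __name__ == '__main__':
    sources = ['host%d' % i for i in range(40)]  # placeholder sources
    print('threads:  ', run(ThreadPoolExecutor, sources))
    print('processes:', run(ProcessPoolExecutor, sources))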
Short answer:
Generally, it really depends on your workload. If you're serious about performance, please provide details: for example, whether you store the images to disk, whether the image sizes are larger than 1 GB, etc.
Note: generally again, if it's not mission-critical, both ways are acceptable, since we can easily switch between multithreaded and multiprocess implementations using threading.Thread and multiprocessing.Process.
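For instance, a tiny sketch of that interchangeability (handle_transfer and the host name are placeholders):
import threading
import multiprocessing

def handle_transfer(source):
    pass  # placeholder: receive one file from `source`

if __name__ == '__main__':
    # The two classes share the same constructor signature,
    # so switching is a one-word change.
    worker = threading.Thread(target=handle_transfer, args=('host1',))
    # worker = multiprocessing.Process(target=handle_transfer, args=('host1',))
    worker.start()
    worker.join()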
Some more comments:
It seems that I/O, not the CPU, will be the bottleneck.
For multiprocessing vs. multithreading, you may see a performance difference due to the GIL and/or your implementation. You may implement both ways and try them. BTW, IMHO it won't differ much; I think async I/O vs. blocking I/O will have a greater impact.
If your file transfer isn't extremely slow (slower than writing the data to disk), multithreading/multiprocessing isn't going to help. By file transfer I mean downloading images and writing them to a local computer with a single HDD.
Using multithreading or multiprocessing when transferring data from several computers with separate disks can definitely improve overall download performance: data spread across several physical disks can simply be read in parallel. The problem arises when you try to save these images to your local drive.
You have just a single local HDD (if a disk array is not used), and a single HDD, like most hardware devices, can do only one I/O operation at a time. So trying to write several images to disk at the same time won't improve overall performance; it can even hamper it.
Just imagine 40 already-downloaded images being written to a single mechanical HDD, with a single head, at different locations (different physical files), especially if the disk is fragmented. This can even slow down the whole process, because the HDD wastes time moving its magnetic head from one position to another (drives can partially mitigate this by reordering I/O operations to limit head movement).
On the other hand, if you do some CPU-intensive preprocessing on these images and only then save them to disk, multithreading can be really helpful.
As to which is preferred: on modern OSes there is not a significant difference between multithreading and multiprocessing (spawning multiple processes). OSes like Linux or Windows schedule threads, not processes, based on process and thread priorities, so there is not a big difference between 40 single-threaded processes and a single process containing 40 threads. Using multiple processes normally consumes more memory, because the OS has to allocate some extra memory for every process (not much), but in terms of speed the difference between multithreading and multiprocessing is not significant. There are other important questions to consider when choosing: will the downloads share some data, such as a common GUI (multithreading is easier to use), and are the files so big that 40 transfers could exhaust the virtual address space of a single process (use multiprocessing)?
Generally:
Multithreading - easier to use in a single application because all threads share the virtual address space of a single process and can easily communicate with each other. On the other hand, a single process has a limited virtual address space (less than 4 GB on a 32-bit computer).
Multiprocessing - harder to use in a single application (it requires inter-process communication), but more scalable and more robust (if one file-transfer process crashes, only a single file transfer fails), plus there is more virtual address space to use.

Can someone explain parallelpython versus hadoop for distributing python process across various servers?

I'm new to using multiple CPUs to process jobs and was wondering if people could let me know the pros/cons of parallelpython (or any type of Python module) versus Hadoop streaming?
I have a very large cpu intensive process that I would like to spread across several servers.
Since moving data becomes harder and harder as it grows, data locality becomes very important in parallel computing. Hadoop, as a map/reduce framework, maximizes the locality of the data being processed. It also gives you a way to spread your data efficiently across your cluster (HDFS). So basically, even if you use other parallel modules, as long as your data is not local to the machines doing the processing, or as long as you have to move your data across the cluster all the time, you won't get the maximum benefit from parallel computing. That's one of the key ideas of Hadoop.
The main difference is that Hadoop is good at processing big data (dozens of gigabytes to terabytes). It provides a simple logical framework called MapReduce, which is well suited to data aggregation, and a distributed storage system called HDFS.
If your inputs are smaller than 1 gigabyte, you probably don't want to use Hadoop.
