Reading a .csv.gz file using Dask - python

I am trying to load a .csv.gz file into Dask.
Reading it this way will load it successfully, but into only one partition:
dd.read_csv(fp, compression="gzip")
My workaround for now is to unzip the file using gzip, load it into Dask, and then remove it when I am finished. Is there a better way?

There is a very similar question here:
The fundamental reason is that formats like bz2, gz or zip do not allow random access; the only way to read the data is from the start.
The recommendation there is:
The easiest solution is certainly to stream your large files into several compressed files each (remember to end each file on a newline!), and then load those with Dask as you suggest. Each smaller file will become one dataframe partition in memory, so as long as the files are small enough, you will not run out of memory as you process the data with Dask.
As an alternative, if disk space is a consideration, you can use .to_parquet instead of .to_csv upstream, since Parquet data is compressed by default.
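A minimal sketch of that splitting step, assuming the input is a single big.csv.gz with a header row and that the chunk size is chosen to taste:

import gzip

lines_per_file = 1_000_000  # assumed chunk size
with gzip.open('big.csv.gz', 'rt') as src:  # hypothetical input name
    header = src.readline()
    out, part = None, 0
    for i, line in enumerate(src):
        if i % lines_per_file == 0:
            # Start a new compressed part, repeating the header so each
            # file is a valid CSV on its own and ends on a newline.
            if out is not None:
                out.close()
            out = gzip.open(f'part-{part:04d}.csv.gz', 'wt')
            out.write(header)
            part += 1
        out.write(line)
    if out is not None:
        out.close()

The resulting parts can then be loaded with dd.read_csv('part-*.csv.gz', compression='gzip'), giving one partition per file.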

Related

Is there a way to make dask read_csv ignore empty files?

I have a dataset with 200k files per day. The files are rather small .txt.gz files, 99% of which are smaller than 60 KB. Some of them are empty files of size 20 bytes because of gzip compression.
When I try to load the whole directory with Dask I get a pandas.errors.EmptyDataError. Since I plan to load this directly from S3 each day, I wonder if I can ignore or skip those files via dd.read_csv(). I haven't found any option to control error handling in the documentation for Dask's read_csv() or pandas's read_csv().
Of course, I could copy all the files from S3 to the local disk, then scan and remove all offending files before loading them into Dask, but that would be slower (copying all 200k files).
In principle I just want to load these 200k CSV files into Dask to convert them to fewer Parquet files. I'm not even sure Dask is the best tool for this, but if there is an easy way to make it work, I would appreciate it.
A possible way to do this is through exceptions:
import pandas as pd
import pandas.errors

for path in file_paths:
    try:
        pd.read_csv(path)
    except pandas.errors.EmptyDataError:
        print(path, "is empty")
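If you want to stay inside Dask rather than looping with pandas, a minimal sketch along the same lines wraps the read in dask.delayed and substitutes an empty frame for the offending files (file_paths, the output path, and a parquet engine such as pyarrow are assumptions here):

import dask
import dask.dataframe as dd
import pandas as pd
import pandas.errors

@dask.delayed
def safe_read(path):
    try:
        return pd.read_csv(path)
    except pandas.errors.EmptyDataError:
        # Empty gzip members: return an empty frame; you may need to
        # give it the expected columns so all partitions share a schema.
        return pd.DataFrame()

ddf = dd.from_delayed([safe_read(p) for p in file_paths])
ddf.to_parquet('out/')  # the eventual goal: fewer parquet files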

How can I optimize file I/O in Python when I process GB-sized files via NFS?

I'm manipulating several files via NFS due to security concerns. Processing anything is very painful because of the slow file I/O. The following describes the issue.
I use pandas in Python to do simple processing on data. So I use read_csv() and to_csv() frequently.
Currently, writing a 10 GB CSV file takes nearly 30 minutes, whereas reading takes about 2 minutes.
I have enough CPU cores (> 20 cores) and memory (50G~100G).
It is hard to ask for more bandwidth.
I frequently need to access data in a column-oriented manner. For example, there would be 100M records with 20 columns (most of them numeric). For this data, I frequently read all 100M records, but only 3~4 columns' values.
I've tried HDF5, but it produces a larger file and takes a similar time to write, and it does not provide column-oriented I/O, so I've discarded this option.
I cannot store them locally; it would violate many security criteria. I'm actually working on a virtual machine and the file system is mounted via NFS.
Some columns I read repeatedly; others I don't. The task is something like data analysis.
Which approaches can I consider?
In several cases, I use sqlite3 to manipulate data in a simple way and export the results to CSV files. Can I accelerate I/O tasks by using sqlite3 in Python? If it provided column-wise operations, it would be a good solution, I reckon.
Two options: pandas HDF5 or Dask.
You can look at the HDF5 'table' format (format='table').
HDFStore supports another PyTables format on disk, the table format. Conceptually a table is shaped very much like a DataFrame, with rows and columns. A table may be appended to in the same or other sessions. In addition, delete and query type operations are supported. This format is specified by format='table' or format='t' to append or put or to_hdf.
You can also use Dask's read_csv; it reads the data lazily, only when you call .compute().
Purely to improve I/O performance, I think HDF with a compressed format is best.
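A minimal sketch of the HDF table route, assuming df already holds the data, the PyTables (tables) package is installed, and the column names are hypothetical; the point is that you can read back only the 3~4 columns you need:

import pandas as pd

# Write once with the queryable 'table' format and compression.
df.to_hdf('data.h5', key='records', format='table',
          data_columns=True, complib='blosc', complevel=9)

# Read back only the columns you actually need.
subset = pd.read_hdf('data.h5', key='records',
                     columns=['col_a', 'col_b', 'col_c'])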

How to read compressed(.gz) file faster using Pandas/Dask?

I have a couple of gzip files (each 3.5 GB). At the moment I am using Pandas to read them, but it is very slow. I have also tried Dask, but it seems it does not support splitting gzip files. Is there a better way to quickly load these massive gzip files?
Dask and Pandas code:
df = dd.read_csv(r'file', sample=200000000000, compression='gzip')
I expect it to read the whole file as quickly as possible.
gzip is, inherently, a pretty slow compression method, and (as you say) does not support random access. This means that the only way to get to position x is to scan through the file from the start, which is why Dask does not support trying to parallelise in this case.
Your best bet, if you want to make use of parallel parsing at least, is first to decompress the whole file, so that the chunking mechanism makes sense. You could also break it into several files and compress each one, so that the total space required is similar.
Note that, in theory, there are some compression mechanisms that support block-wise random access, but we have not found any with sufficient community support to implement them in Dask.
The best answer, though, is to store your data in parquet or orc format, which have internal compression and partitioning.
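A minimal sketch of that conversion, streaming the gzip once with pandas and writing one parquet file per chunk so the result is already partitioned for Dask (the paths, chunk size, and a parquet engine such as pyarrow are assumptions):

import pandas as pd

reader = pd.read_csv('file.csv.gz', compression='gzip', chunksize=1_000_000)
for i, chunk in enumerate(reader):
    # One parquet file per chunk; Dask can later read the whole directory in parallel.
    chunk.to_parquet(f'out/part-{i:05d}.parquet')

Afterwards, dd.read_parquet('out/') gives one partition per file and lets you read only the columns you need.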
One option is to use package datatable for python:
https://github.com/h2oai/datatable
It can read and write significantly faster than pandas (including gzipped files) using its fread function, for example:
import datatable as dt
df = dt.fread('file.csv.gz')
Later, one can convert it to a pandas dataframe:
df1 = df.to_pandas()
Currently datatable is only available on Linux/Mac.
You can try using the gzip library:
import gzip

with gzip.open('Your File', 'rt') as f:
    file_content = f.read()
print(file_content)
See also: python: read lines from compressed text files

Read snappy or lzo compressed files from DataFlow in Python

Is there a way to read snappy or lzo compressed files on DataFlow using Apache Beam's Python SDK?
Since I couldn't find an easier way, this is my current approach (which seems totally overkill and inefficient):
Start DataProc cluster
Uncompress such data using hive in the new cluster and place it in a temporary location
Stop DataProc cluster
Run the DataFlow job that reads from these temporary uncompressed data
Clean up temporary uncompressed data
I don't think there is any built-in way to do this today with Beam. The Python SDK supports gzip, bzip2 and deflate.
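For the formats that are supported, a minimal sketch using the built-in compression_type option might look like this (the bucket path is hypothetical):

import apache_beam as beam
from apache_beam.io.filesystem import CompressionTypes

with beam.Pipeline() as p:
    # Built-in support covers gzip, bzip2 and deflate; AUTO infers from the extension.
    lines = p | beam.io.ReadFromText(
        'gs://my-bucket/data/*.gz',  # hypothetical path
        compression_type=CompressionTypes.GZIP)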
Option 1: Read in the whole files and decompress manually
Create a custom source to produce a list of filenames (e.g. seeded from a pipeline option by listing a directory), and emit those as records.
In the following ParDo, read each file manually and decompress it. You will need to use a GCS library to read the GCS file, if you have stored your data there.
This solution will likely not perform as fast, and it will not be able to load large files into memory. But if your files are small, it might work well enough; a minimal sketch follows below.
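Here is that sketch of Option 1, using fileio.MatchFiles/ReadMatches in place of a hand-rolled custom source, and assuming plain (non-framed) snappy files with the python-snappy package installed on the workers (the file pattern is hypothetical):

import apache_beam as beam
import snappy  # python-snappy, assumed installed on the workers
from apache_beam.io import fileio

def decompress_and_split(readable_file):
    # Read the whole (small) file into memory, decompress, and emit one record per line.
    raw = readable_file.read()
    for line in snappy.decompress(raw).decode('utf-8').splitlines():
        yield line

with beam.Pipeline() as p:
    records = (
        p
        | fileio.MatchFiles('gs://my-bucket/data/*.snappy')  # hypothetical pattern
        | fileio.ReadMatches()
        | beam.FlatMap(decompress_and_split))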
Option 2: Add a new decompressor to Beam.
You may be able to contribute a decompressor to Beam. It looks like you would need to implement the decompressor logic and provide some constants to specify it when authoring a pipeline.
I think one of the constraints is that it must be possible to scan the file and decompress it a chunk at a time. If the compression format requires reading the whole file into memory, it will likely not work. This is because the TextIO libraries are designed to be record-based, which supports reading large files that don't fit into memory by breaking them up into small records for processing.

Reading gzip file that is currently being written to

My program does a lot of file processing, and as the files are large I prefer to write them as GZIP. One challenge is that I often need to read files as they are being written. This is not a problem without GZIP compression, but when compression is on, the reading complains about failed CRC, which I presume might have something to do with compression info not being flushed properly when writing. Is there any way to use GZIP with Python such that, when I write and flush to a file (but not necessarily close the file), that it can be read as well?
I think flushing compressed data just writes the data into the file, but the gzip trailer (CRC and size) is written only on close(), so you need to close the file first, and only then can you open it and read all the data you need. If you need to write large amounts of data, you can try using a database such as PostgreSQL or MySQL, where you can specify a table with compression (archive, compressed); you will be able to insert data into the table and read it, and the database software will handle the rest for you (compression on inserts, decompression on selects).
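If you do want to read before close(), a hedged workaround consistent with the point above is to have the writer call flush() (which performs a zlib sync flush) and have the reader tolerate the missing end-of-stream marker; a minimal sketch:

import gzip

# Writer side: flush() performs a zlib sync flush, so everything written
# so far is decodable even though the CRC/size trailer only appears on close().
w = gzip.open('log.gz', 'wt')
w.write('a line that must be visible to readers\n')
w.flush()

# Reader side (possibly another process): read what is there and treat the
# missing end-of-stream marker as "no more data yet".
with gzip.open('log.gz', 'rt') as r:
    try:
        for line in r:
            print(line, end='')
    except EOFError:
        pass  # the file is still being written; retry later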
