How to read compressed (.gz) files faster using Pandas/Dask? - python

I have a couple of gzip files (each 3.5 GB). At the moment I am using Pandas to read them, but it is very slow. I have also tried Dask, but it seems it cannot split gzip files into chunks. Is there a better way to quickly load these massive gzip files?
Dask and Pandas code:
df = dd.read_csv(r'file', sample=200000000000, compression='gzip')
I expect it to read the whole file as quickly as possible.

gzip is, inherently, a pretty slow compression method, and (as you say) does not support random access. This means that the only way to get to position x is to scan through the file from the start, which is why Dask does not support trying to parallelise in this case.
Your best bet, if you want to make use of parallel parsing at least, is first to decompress the whole file, so that the chunking mechanism makes sense. You could also break it into several files and compress each one, so that the total space required is similar.
Note that, in theory, there are some compression mechanisms that support block-wise random access, but we have not found any with sufficient community support to implement them in Dask.
The best answer, though, is to store your data in parquet or orc format, which have internal compression and partitioning.
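To make the parquet suggestion concrete, here is a minimal sketch assuming Dask and pyarrow are installed; the file names and partition count are placeholders, and note that the initial read still has to decompress the whole file in one go:
import dask.dataframe as dd

# gzip cannot be split, so this read produces a single large partition
df = dd.read_csv('file.csv.gz', compression='gzip', blocksize=None)

# split into smaller pieces, then write parquet (compressed and partitioned)
df = df.repartition(npartitions=32)
df.to_parquet('file.parquet', engine='pyarrow')

# later loads are parallel and much faster
df = dd.read_parquet('file.parquet')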

One option is to use package datatable for python:
https://github.com/h2oai/datatable
It can read and write significantly faster than pandas (including gzip-compressed files) using its fread function, for example:
import datatable as dt
df = dt.fread('file.csv.gz')   # fread handles the .gz compression itself
Later, one can convert it to a pandas DataFrame:
df1 = df.to_pandas()
Currently datatable is only available on Linux/Mac.

You can try using the gzip library directly:
import gzip
f = gzip.open('Your File', 'rt')   # open for reading in text mode, not 'wb'
file_content = f.read()
print(file_content)
See also: python: read lines from compressed text files
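For comparison, a small sketch of reading (rather than writing) a gzip file, either streaming it line by line or letting pandas decompress it itself; the file name is a placeholder:
import gzip
import pandas as pd

# stream the decompressed text without holding it all in memory
with gzip.open('file.csv.gz', 'rt') as f:
    for line in f:
        pass  # process each line here

# or hand the compressed file straight to pandas
df = pd.read_csv('file.csv.gz', compression='gzip')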

Related

Reading a .csv.gz file using Dask

I am trying to load a .csv.gz file into Dask.
Reading it this way will load it successfully, but into one partition only:
dd.read_csv(fp, compression="gzip")
My workaround now is to unzip the file using gzip, load it into Dask, then remove it after I am finished. Is there a better way?
There is a very similar question here:
The fundamental reason here is that a format like bz2, gz or zip does not allow random access; the only way to read the data is from the start.
The recommendation there is:
The easiest solution is certainly to stream your large file into several smaller compressed files (remember to end each file on a newline!), and then load those with Dask as you suggest. Each smaller file will become one dataframe partition in memory, so as long as the files are small enough, you will not run out of memory as you process the data with Dask.
As an alternative, if disk space is a consideration, you can use .to_parquet instead of .to_csv upstream, since parquet data is compressed by default.
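As a rough sketch of the "several smaller compressed files" suggestion (file names and the lines-per-part value are arbitrary): stream the big .csv.gz once, rewrite it as many smaller .gz parts that each end on a newline and repeat the header, then point Dask at the parts:
import gzip
import dask.dataframe as dd

lines_per_part = 1_000_000
with gzip.open('big.csv.gz', 'rt') as src:
    header = src.readline()
    part, out = 0, None
    for i, line in enumerate(src):
        if i % lines_per_part == 0:
            if out:
                out.close()
            out = gzip.open(f'part-{part:04d}.csv.gz', 'wt')
            out.write(header)
            part += 1
        out.write(line)
    if out:
        out.close()

# each small file becomes one Dask partition
df = dd.read_csv('part-*.csv.gz', compression='gzip', blocksize=None)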

Is there a way to make dask read_csv ignore empty files?

I have a dataset with 200k files per day; the files are rather small .txt.gz, and 99% of them are smaller than 60 KB. Some of these files are empty: they are 20 bytes on disk only because of the gzip overhead.
When I try to load the whole directory with Dask I get a pandas.errors.EmptyDataError. Since I plan to load this directly from S3 each day, I wonder if I can ignore or skip those files via dd.read_csv(). I haven't found any option to control the error handling in the documentation for Dask's read_csv() or pandas's read_csv().
Of course, I could copy all the files from S3 to the local hard disk, then scan for and remove all offending files prior to loading them in Dask, but that is going to be slower (copying all 200k files).
In principle I just want to load all these 200k CSV files into Dask to convert them to fewer parquet files. So I'm not even sure if Dask is the best tool for this, but if there is an easy way to make it work I would prefer to use it.
A possible way to do this is to catch the exception:
import pandas as pd
from pandas.errors import EmptyDataError

for path in file_paths:
    try:
        df = pd.read_csv(path)
    except EmptyDataError:
        print(path, "is empty")
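A more Dask-friendly variant, sketched under the assumption that the files live on S3 and that s3fs is installed (bucket and prefix names are made up): filter out the 20-byte empty files by size before handing the list to dd.read_csv, then write the result to parquet:
import fsspec
import dask.dataframe as dd

fs = fsspec.filesystem('s3')
paths = fs.glob('my-bucket/2019-01-01/*.txt.gz')

# an empty gzip file is 20 bytes, so keep only files larger than that
nonempty = ['s3://' + p for p in paths if fs.size(p) > 20]

df = dd.read_csv(nonempty, compression='gzip', blocksize=None)
df.to_parquet('s3://my-bucket/consolidated/')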

Loading large zipped data-set using dask

I am trying to load a large zipped data set into Python with the following structure:
year.zip
    year/
        month/
            a lot of .csv files
So far I have used the ZipFile library to iterate through each of the CSV files and load them with pandas:
zf = ZipFile("year.zip")
for file in zf.namelist():
    try:
        df = pd.read_csv(zf.open(file))
    except Exception:
        pass  # skip members that fail to parse
It takes ages and I am looking into optimizing the code. One option I ran into is the dask library. However, I can't figure out how best to implement it to access at least a whole month of CSV files in one command. Any suggestions? I am also open to other optimization approaches.
There are a few ways to do this. The most similar to your suggestion would be something like:
import dask
import dask.dataframe as dd

zf = ZipFile("year.zip")
files = zf.namelist()
# each member becomes one lazy read, i.e. one dataframe partition
parts = [dask.delayed(pd.read_csv)(zf.open(f)) for f in files]
df = dd.from_delayed(parts)
This works because a zipfile has an offset listing, so the component files can be read independently; however, performance may depend on how the archive was created, and remember: you only have one storage device, so throughput from that device may be your bottleneck anyway.
Perhaps a more daskian method is the following, taking advantage of fsspec, the file-system abstraction used by Dask:
df = dd.read_csv('zip://*.csv', storage_options={'fo': 'year.zip'})
(of course, pick the glob pattern appropriate for your files; you could also use a list of files here, if you prepend "zip://" to them)

Python sas7bdat module - iterator or memory intensive?

I'm wondering whether the sas7bdat module in Python creates an iterator-type object or loads the entire file into memory as a list. I'm interested in doing something line by line to a .sas7bdat file that is on the order of 750 GB, and I really don't want Python to attempt to load the whole thing into RAM.
Example script:
from sas7bdat import SAS7BDAT

count = 0
with SAS7BDAT('big_sas_file.sas7bdat') as f:
    for row in f:
        count += 1
I can also use
it = f.__iter__()
but I'm not sure if that will still go through a memory-intensive data load. Any knowledge of how sas7bdat works OR another way to deal with this issue would be greatly appreciated!
You can see the relevant code on bitbucket. The docstring describes iteration as a "generator", and looking at the code, it appears to be reading small pieces of the file rather than reading the whole thing at once. However, I don't know enough about the file format to know if there are situations that could cause it to read a lot of data at once.
If you really want to get a sense of its performance before trying it on a giant 750G file, you should test it by creating a few sample files of increasing size and seeing how its performance scales with the file size.
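If you do want to check the scaling empirically before committing to the 750 GB file, a minimal sketch (the sample file names are placeholders) is to time a full pass over a few files of increasing size and see whether runtime grows roughly linearly while memory stays flat:
import time
from sas7bdat import SAS7BDAT

for path in ['sample_1gb.sas7bdat', 'sample_5gb.sas7bdat']:
    start = time.time()
    with SAS7BDAT(path) as f:
        rows = sum(1 for _ in f)   # iterate without keeping rows in memory
    print(path, rows, 'rows in', time.time() - start, 'seconds')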

Big Data with Blaze and Pandas

I want to know if this approach would be overkill for a project.
I have a 4 GB file that my computer obviously can't handle in memory. Would using Blaze to split the file into more manageable sizes, opening them with pandas, and visualizing with Bokeh be overkill?
I know pandas has a "chunk" function, but the reason I want to split the file is that there are specific rows, related to specific names, that I need to analyze.
Is there a different approach you would take that won't crash my laptop and doesn't require setting up Hadoop or any AWS service?
Pandas chunking with pd.read_csv(..., chunksize=...) works well.
Alternatively, dask.dataframe mimics the pandas interface and handles the chunking for you.
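A short sketch of both suggestions, with a made-up file name and column; the idea is to keep only the rows for the names you care about while streaming, so the full 4 GB never sits in memory at once:
import pandas as pd
import dask.dataframe as dd

wanted = ['alice', 'bob']   # placeholder names to analyze

# pandas: stream the file in chunks and keep only the matching rows
pieces = [chunk[chunk['name'].isin(wanted)]
          for chunk in pd.read_csv('big_file.csv', chunksize=1_000_000)]
subset = pd.concat(pieces, ignore_index=True)

# dask: same idea, with the chunking handled for you
ddf = dd.read_csv('big_file.csv')
subset2 = ddf[ddf['name'].isin(wanted)].compute()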
