Loading large zipped data-set using dask - python

I am trying to load a large zipped data set into Python with the following structure:
year.zip
  year
    month
      a lot of .csv files
So far I have used the ZipFile library to iterate through each of the CSV files and load them using pandas.
from zipfile import ZipFile
import pandas as pd

zf = ZipFile("year.zip")
for file in zf.namelist():
    try:
        df = pd.read_csv(zf.open(file))
    except Exception:
        continue  # skip members that fail to parse
It takes ages and I am looking into optimizing the code. One option I ran into is to use the dask library. However, I can't figure out how best to implement it to access at least a whole month of CSV files in one command. Any suggestions? I am also open to other optimization approaches.

There are a few ways to do this. The most similar to your suggestion would be something like:
from zipfile import ZipFile
import dask
import dask.dataframe as dd
import pandas

zf = ZipFile("year.zip")
files = list(zf.namelist())
parts = [dask.delayed(pandas.read_csv)(zf.open(f)) for f in files]
df = dd.from_delayed(parts)
This works because a zipfile has an offset listing, so the component files can be read independently; however, performance may depend on how the archive was created, and remember: you only have one storage device, so throughput from that device may be your bottleneck anyway.
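If the tasks end up running in separate worker processes, it can also help to have each task open the archive itself rather than share open file handles. A minimal sketch of that variant, assuming the members of interest all end in .csv:
import zipfile
import pandas as pd
import dask
import dask.dataframe as dd

@dask.delayed
def read_member(zip_path, member):
    # Open the archive inside the task so nothing needs to be pickled across workers
    with zipfile.ZipFile(zip_path) as zf:
        with zf.open(member) as f:
            return pd.read_csv(f)

with zipfile.ZipFile("year.zip") as zf:
    members = [n for n in zf.namelist() if n.endswith(".csv")]

df = dd.from_delayed([read_member("year.zip", m) for m in members])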
Perhaps a more daskian method is the following, taking advantage of fsspec, the file-system abstraction used by dask:
df = dd.read_csv('zip://*.csv', storage_options={'fo': 'year.zip'})
(of course, pick the glob pattern appropriate for your files; you could also use a list of files here, if you prepend "zip://" to them)
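For example, a minimal sketch of the list form; the member names below are purely illustrative:
import dask.dataframe as dd

# Assumed archive layout: year/month/dayNN.csv members inside year.zip
names = ["zip://year/01/" + f for f in ["day01.csv", "day02.csv"]]
df = dd.read_csv(names, storage_options={'fo': 'year.zip'})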

Related

Is there a good way to store large amounts of similar scraped HTML files in Python?

I've written a web scraper in Python and I have a ton (thousands) of files that are extremely similar, but not quite identical. The disk space currently used by the files is 1.8 GB, but if I compress them into a tar.xz, they compress to 14.4 MB. I want to be closer to that 14.4 MB than to the 1.8 GB.
Here are some things I've considered:
I could just use tarfile in Python's standard library and store the files there. The problem with that is I wouldn't be able to modify the files within the tar without recompressing all of the files, which would take a while.
I could use difflib in Python's standard library, but I've found that this library doesn't offer any way of applying "patches" to recreate the new file.
I could use Google's diff-match-patch Python library, but when I was reading the documentation, they said "Attempting to feed HTML, XML or some other structured content through a fuzzy match or patch may result in problems." Considering I wanted to use this library to store HTML files more efficiently, that doesn't sound like it will help me.
So is there a way of saving disk space when storing a large amount of similar HTML files?
You can use a dictionary.
Python's zlib interface supports dictionaries. The compressobj and decompressobj functions both take an optional zdict argument, which is a "dictionary". A dictionary in this case is nothing more than 32K of data with sequences of bytes that you expect will appear in the data you are compressing.
Since your files are about 30K each, this works out quite well for your application. If indeed your files are "extremely similar", then you can take one of those files and use it as the dictionary to compress all of the other files.
Try it, and measure the improvement in compression over not using a dictionary.
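A minimal sketch of the idea, assuming one representative scraped page (reference.html here, purely illustrative) is used as the shared dictionary:
import zlib

# Assumption: reference.html is a typical scraped page; a zlib dictionary
# only uses up to 32 KB, so truncate it.
with open("reference.html", "rb") as f:
    zdict = f.read()[:32768]

def compress_page(data: bytes) -> bytes:
    c = zlib.compressobj(level=9, zdict=zdict)
    return c.compress(data) + c.flush()

def decompress_page(blob: bytes) -> bytes:
    d = zlib.decompressobj(zdict=zdict)
    return d.decompress(blob) + d.flush()

# Round-trip check against another page
with open("other_page.html", "rb") as f:
    raw = f.read()
packed = compress_page(raw)
assert decompress_page(packed) == raw
print(len(raw), "->", len(packed))
Note that the exact same dictionary is needed to decompress, so keep the reference file (or the 32 KB slice of it) alongside the compressed pages.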

Is there a way to make dask read_csv ignore empty files?

I have a dataset with 200k files per day. The files are rather small .txt.gz, where 99% are smaller than 60 KB. Some of these files are empty files of size 20 bytes because of the gzip overhead.
When I try to load the whole directory with dask I get a pandas.errors.EmptyDataError. Since I plan to load this directly from S3 each day, I wonder if I can ignore or skip those files via dd.read_csv(). I haven't found any option to control the error handling in the documentation for dask's read_csv() or pandas's read_csv().
Of course, I could copy all the files from S3 to the local hard disk, then scan and remove all offending files prior to loading them in Dask, but that is going to be slower (copying all 200k files).
In principle I just want to load all these 200k CSV files into Dask to convert them to fewer Parquet files. So I'm not even sure Dask is the best tool for this, but if there is an easy way to make it work, I would like to use it.
A possible way to do this is through exceptions:
import pandas as pd

for path in file_paths:
    try:
        pd.read_csv(path)
    except pd.errors.EmptyDataError:
        print(path, "is empty")
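To stay inside Dask and go straight to Parquet, one hedged sketch is to wrap each per-file read in dask.delayed and substitute an empty frame for the bad files; file_paths and the column list below are assumptions about your data:
import dask
import dask.dataframe as dd
import pandas as pd

COLUMNS = ["col_a", "col_b"]  # assumption: the schema shared by all files

@dask.delayed
def safe_read(path):
    try:
        # assumption: headerless gzipped CSVs
        return pd.read_csv(path, compression="gzip", names=COLUMNS)
    except pd.errors.EmptyDataError:
        # empty gzip member: contribute an empty frame with the same columns
        return pd.DataFrame(columns=COLUMNS)

parts = [safe_read(p) for p in file_paths]  # file_paths: your 200k S3 or local paths
df = dd.from_delayed(parts, meta=pd.DataFrame(columns=COLUMNS))
df.to_parquet("output_parquet/")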

How to read compressed(.gz) file faster using Pandas/Dask?

I have a couple of gzip files (each 3.5 GB). As of now I am using Pandas to read those files, but it is very slow. I have tried Dask as well, but it seems it does not support splitting gzip files. Is there any better way to quickly load these massive gzip files?
Dask and Pandas code:
df = dd.read_csv(r'file', sample = 200000000000,compression='gzip')
I expect it to read the whole file as quickly as possible.
gzip is, inherently, a pretty slow compression method and (as you say) does not support random access. This means that the only way to get to position x is to scan through the file from the start, which is why Dask does not attempt to parallelise in this case.
Your best bet, if you want to make use of parallel parsing at least, is first to decompress the whole file, so that the chunking mechanism makes sense. You could also break it into several files and compress each one, so that the total space required is similar.
Note that, in theory, some compression mechanisms support block-wise random access, but none has had sufficient community support to be implemented in Dask.
The best answer, though, is to store your data in parquet or orc format, which has internal compression and partitioning.
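A minimal sketch of the decompress-first route (file names are illustrative, and the blocksize is just a reasonable starting point):
import gzip
import shutil
import dask.dataframe as dd

# Decompress once so dask can split the plain-text file into byte ranges
with gzip.open("file.csv.gz", "rb") as src, open("file.csv", "wb") as dst:
    shutil.copyfileobj(src, dst)

df = dd.read_csv("file.csv", blocksize="256MB")  # parallel parsing now works
df.to_parquet("file_parquet/")  # optional: keep it in a splittable, compressed format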
One option is to use the datatable package for Python:
https://github.com/h2oai/datatable
It can read and write significantly faster than pandas (including gzipped files), using the function fread. For example:
import datatable as dt
df = dt.fread('file.csv.gz')
Later, one can convert it to a pandas DataFrame:
df1 = df.to_pandas()
Currently datatable is only available on Linux/Mac.
You can try using the gzip library:
import gzip

with gzip.open('Your File', 'rt') as f:
    file_content = f.read()
print(file_content)
See also: python: read lines from compressed text files

How to version-control a set of input data along with its processing scripts?

I am working with a set of Python scripts that take data from an Excel file that is set up to behave as a pseudo-database. Excel is used instead of SQL software due to compatibility and access requirements for other people I work with who aren't familiar with databases.
I have a set of about 10 tables with multiple records in each and relational keys linking them all (again in a pseudo-linking kind of way, using some flimsy data validation).
The scripts I am using are version controlled by Git, and I know the pitfalls of adding a .xlsx file to a repo, so I have kept it out of the repo. Since the data is a bit vulnerable, I want to make sure I have a way of keeping track of any changes we make to it. My thought was to have a script that breaks the Excel file into .csv tables and adds those to the repo, i.e.:
import pandas as pd
from pathlib import Path

excel_input_file = Path(r"<...>")
output_path = Path(r"<...>")

tables_dict = pd.read_excel(excel_input_file, sheet_name=None)
for name, table in tables_dict.items():
    table.to_csv(output_path / (name + '.csv'), index=False)
Would this be a typically good method for keeping track of the input files at each stage?
Git tends to work better with text files than with binary files, as you've noted, so this would be a better choice than just checking in an Excel file. Specifically, Git would be able to merge and diff these files, whereas it couldn't natively merge the Excel file.
Typically the way that people handle this sort of situation is to take one or more plain text input files (e.g., CSV or SQL) and then build them into the usable output format (e.g., Excel or database) as part of the build or test step, depending on where they're needed. I've done similar things by using a Git fast-export dump to create test Git repositories, and it generally works well.
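As a hedged sketch of that build step for this case, the CSVs in the repository could be the source of truth and the workbook regenerated on demand; the paths and the one-CSV-per-sheet naming are assumptions:
from pathlib import Path
import pandas as pd

csv_dir = Path("tables")              # assumption: one CSV per sheet, named after it
excel_output = Path("database.xlsx")  # hypothetical output workbook

with pd.ExcelWriter(excel_output) as writer:
    for csv_file in sorted(csv_dir.glob("*.csv")):
        # Excel caps sheet names at 31 characters
        pd.read_csv(csv_file).to_excel(writer, sheet_name=csv_file.stem[:31], index=False)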
If you had just one input file, which you don't in this case, you could also use smudge and clean filters to turn the source file in the repository into a different format in the checkout. You can read about this with man gitattributes.

Why does Spark output a set of csv's instead or just one?

I had a hard time last week getting data out of Spark; in the end I had to simply go with
df.toPandas().to_csv('mycsv.csv')
out of this answer.
I had tested the more native
df.write.csv('mycsv.csv')
for Spark 2.0+, but as per the comment underneath, it drops a set of CSV files instead of one, which need to be concatenated, whatever that means in this context. It also dropped an empty file called something like 'success' into the directory. The directory name was /mycsv/ but the CSV itself had an unintelligible name made up of a long string of characters.
This was the first I had heard of such a thing. Well, Excel has multiple tabs which must somehow be reflected in an .xls file, and NumPy arrays can be multidimensional, but I thought a CSV file was just a header plus rows of values separated into columns by commas.
Another answer suggested:
query.repartition(1).write.csv("cc_out.csv", sep='|')
So this drops just one file plus the blank 'success' file, but the file still does not have the name you want; the directory does.
Does anyone know why Spark is doing this, why it will not simply output a single CSV, how it names the CSV, what that success file is supposed to contain, and whether concatenating CSV files here means joining them vertically, head to tail?
There are a few reasons why Spark outputs multiple CSVs:
- Spark runs on a distributed cluster. For large datasets, all the data may not be able to fit on a single machine, but it can fit across a cluster of machines. To write one CSV, all the data would presumably have to be on one machine and written by one machine, which one machine may not be able to do.
- Spark is designed for speed. If data lives on 5 partitions across 5 executors, it makes sense to write 5 CSVs in parallel rather than move all data to a single executor and have one executor write the entire dataset.
If you need one CSV, my presumption is that your dataset is not super large. My recommendation is to download all the CSV files into a directory, and run cat *.csv > output.csv in the relevant directory. This will join your CSV files head-to-tail. You may need to do more work to strip headers from each part file if you're writing with headers.
Does anyone know why Spark is doing this, why will it not simply output a csv,
Because it is designed for distributed computing where each chunk of data (a.k.a. partition) is written independently of others.
how does it name the csv
Name depends on the partition number.
what is that success file supposed to contain
Nothing. It just indicates success.
This basically happens because Spark dumps files based on the number of partitions the data is divided into, so each partition simply dumps its own file separately. You can use coalesce to save them as a single file. Check this link for more info.
However, this method has the disadvantage that it needs to collect all the data on one node, which therefore needs enough memory. A workaround for this can be seen in this answer.
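For reference, a hedged PySpark sketch of that approach, with the part file copied out afterwards; the paths are illustrative and df is the DataFrame from the question:
from pathlib import Path
import shutil

# Collapse to one partition so Spark writes a single part file
df.coalesce(1).write.csv("out_dir", header=True, mode="overwrite")

# The directory holds one part-*.csv file plus a _SUCCESS marker;
# copy the part file out under the name you actually want
part = next(Path("out_dir").glob("part-*.csv"))
shutil.copy(part, "mycsv.csv")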
This link also sheds some more information about this behavior of Spark:
Spark is like Hadoop - uses Hadoop, in fact - for performing actions like outputting data to HDFS. You'll know what I mean the first time you try to save "all-the-data.csv" and are surprised to find a directory named all-the-data.csv/ containing a 0 byte _SUCCESS file and then several part-0000n files for each partition that took part in the job.
