Large number of csv files with variable column length using Dask

Large number of csv files with variable column length using Dask - python

I'm trying to use Dask to read a large number of csv, but I'm having issues since the number of columns varies between csv files, as does the order of the columns.
I know that packages like d6tstack (as detailed here), can help handle this, but is there a way to fix this without installing additional libraries and without taking up more disk space?

If you use from_delayed, then you can make a function which pre-processes each of your input files as you might wish. This is totally arbitrary, so you can choose to solve the issue using your own code or any package you want to install across the cluster.
#dask.delayed
def read_a_file(filename):
df = pd.read_csv(filename). # or remote file
do_something_with_columns
return df_out
df = dd.from_delayed([read_a_file(f) for f in filenames], meta=...)

Related

Efficiently read in large csv files using pandas or dask in python

I want to read in large csv files into python in the fastest way possible. I have a csv file of ~100 million rows. I came across this primer https://medium.com/casual-inference/the-most-time-efficient-ways-to-import-csv-data-in-python-cc159b44063d and they go through a few packages
csv
pandas
dask
datatable
paratext
For my purposes, "csv" is too raw and I want to leverage the type inference included in the other packages. I need it to work on both windows and linux machines and have also looked into datatable and paratext, but have had problems installing right package dependencies (neither are on the anaconda package repo). So that leaves pandas and dask. At first glance, dask seems much faster, but I only realized that it only does the computations if you call ".compute"
My specific use case is that even though the raw csv file is 100+ million rows, I only need a subset of it loaded into memory. For example, all rows with date >= T. Is there a more efficient way of doing this than just the below example? Both pandas and dask take similar time.
EDIT: The csv file updates daily and there is no pre-known order of the rows of the file. Ie it is not necessarily the case that the most recent dates are at the end of the file
import pandas as pd
import dask as dd
from datetime import datetime
s = datetime.now()
data1 = pd.read_csv("test.csv", parse_dates=["DATE"])
data1 = data1[data1.DATE>=datetime(2019,12,24)]
print(datetime.now()-s)
s = datetime.now()
data2 = dd.read_csv("test.csv", parse_dates=["DATE"])
data2 = data2[data2.DATE>=datetime(2019,12,24)].compute()
print(datetime.now()-s)

Your Dask solution looks good to me. For parsing CSV in particular you might want to use Dask's multiprocessing scheduler. Most Pandas operations are better with threads, but text-based processing like CSV, is an exception.
data2 = data2[data2.DATE>=datetime(2019,12,24)].compute(scheduler="processes")
See https://docs.dask.org/en/latest/scheduling.html for more information.

CSV is not an efficient file format for filtering, CSV files don't have indexes for data fields, don't have key based access. For each filter operation you always have to read all lines.
You can improve the performance marginally by using a library that's written in C or is doing some stuff a little smarter than another library, but don't expect miracles/ I'd expect an improvement of a few percent to perhaps 3x if you identify/implement an optimized C version reading your lines and performing the initial filtering.
If you read the CSV file more often then it might be useful to convert the file during the first read (multiple options exist: hand craftet helpers, indexes, sorting, data bases, ...)
and perform subsequent reads on the 'data base'.
If you know that the new CSV file is a the same as the previous version plus lines appended to the end of the file, you had to memorize the position of the last line of the previous version and just add the new lines to your optimized data file. (data base, ...)
Other file formats might be hundreds or thousand times more efficient, but creating these files the first time is probably at least as expensive as your searching (so if reading only once, there's nothing you can optimize)
If none of above conditions is true you cannot expect a huge performance increase
You might look at What is the fastest way to search the csv file?
for accelerations (assuming files can be sorted/indexed by a search / filter criteria)

Memory problems with pandas (understanding memory issues)

The issue is that I have 299 .csv files (each of 150-200 MB on average of millions of rows and 12 columns - that makes up a year of data (approx of 52 GB/year). I have 6 years and would like to finally concatenate all of them) which I want to concatenate into a single .csv file with python. As you may expect, I ran into memory errors when trying the following code (my machine has 16GB of RAM):
import os, gzip, pandas as pd, time
rootdir = "/home/eriz/Desktop/2012_try/01"
dataframe_total_list = []
counter = 0
start = time.time()
for subdir, dirs, files in os.walk(rootdir):
dirs.sort()
for files_gz in files:
with gzip.open(os.path.join(subdir, files_gz)) as f:
df = pd.read_csv(f)
dataframe_total_list.append(df)
counter += 1
print(counter)
total_result = pd.concat(dataframe_total_list)
total_result.to_csv("/home/eriz/Desktop/2012_try/01/01.csv", encoding="utf-8", index=False)
My aim: get a single .csv file to then use to train DL models and etc.
My constraint: I'm very new to this huge amounts of data, but I have done "part" of the work:
I know that multiprocessing will not help much in my development; it is a sequential job in which I need each task to complete so that I can start the following one. The main problem is memory run out.
I know that pandas works well with this, even with chunk size added. However, the memory problem is still there because the amount of data is huge.
I have tried to split the work into little tasks so that I don't run out of memory, but I will have it anyway later when concatenating.
My questions:
Is still possible to do this task with python/pandas in any other way I'm not aware of or I need to switch no matter what to a database approach? Could you advise which?
If the database approach is the only path, will I have problems when need to perform python-based operations to train the DL model? I mean, if I need to use pandas/numpy functions to transform the data, is that possible or will I have problems because of the file size?
Thanks a lot in advance and I would appreciate a lot a "more" in deep explanation of the topic.
UPDATE 10/7/2018
After trying and using the below code snippets that #mdurant pointed out I have learned a lot and corrected my perspective about dask and memory issues.
Lessons:
Dask is there to be used AFTER the first pre-processing task (if it is the case that finally you end up with huge files and pandas struggles to load/process them). Once you have your "desired" mammoth file, you can load it into dask.dataframe object without any issue and process it.
Memory related:
first lesson - come up with a procedure so that you don't need to concat all the files and you run out of memory; just process them looping and reducing their content by changing dtypes, dropping columns, resampling... second lesson - try to ONLY put into memory what you need, so that you don't run out. Third lesson - if any of the other lessons don't apply, just look for a EC2 instance, big data tools like Spark, SQL etc.
Thanks #mdurant and #gyx-hh for your time and guidance.

First thing: to take the contents of each CSV and concatenate into one huge CSV is simple enough, you do not need pandas or anything else for that (or even python)
outfile = open('outpath.csv', 'w')
for files_gz in files:
with gzip.open(os.path.join(subdir, files_gz)) as f:
for line in f:
outfile.write(line)
outfile.close()
(you may want to ignore the first line of each CSV, if it has a header with column names).
To do processing on the data is harder. The reason is, that although Dask can read all the files and work on the set as a single data-frame, if any file results in more memory than your system can handle, processing will fail. This is because random-access doesn't mix with gzip compression.
The output file, however, is (presumably) uncompressed, so you can do:
import dask.dataframe as dd
df = dd.read_csv('outpath.csv') # automatically chunks input
df[filter].groupby(fields).mean().compute()
Here, only the reference to dd and the .compute() are specific to dask.

Reading file with huge number of columns in python

I have a huge file csv file with around 4 million column and around 300 rows. File size is about 4.3G. I want to read this file and run some machine learning algorithm on the data.
I tried reading the file via pandas read_csv in python but it is taking long time for reading even a single row ( I suspect due to large number of columns ). I checked few other options like numpy fromfile, but nothing seems to be working.
Can someone please suggest some way to load file with many columns in python?

Pandas/numpy should be able to handle that volume of data no problem. I hope you have at least 8GB of RAM on that machine. To import a CSV file with Numpy, try something like
data = np.loadtxt('test.csv', dtype=np.uint8, delimiter=',')
If there is missing data, np.genfromtext might work instead. If none of these meet your needs and you have enough RAM to hold a duplicate of the data temporarily, you could first build a Python list of lists, one per row using readline and str.split. Then pass that to Pandas or numpy, assuming that's how you intend to operate on the data. You could then save it to disk in a format for easier ingestion later. hdf5 was already mentioned and is a good option. You can also save a numpy array to disk with numpy.savez or my favorite the speedy bloscpack.(un)pack_ndarray_file.

csv is very inefficient for storing large datasets. You should convert your csv file into a better suited format. Try hdf5 (h5py.org or pytables.org), it is very fast and allows you to read parts of the dataset without fully loading it into memory.

According to this answer, pandas (which you already tried) is the fastest library available to read a CSV in Python, or at least was in 2014.

Is there a faster way to append many XLS files into a single CSV file?

After the recommendation from Jeff's Answer to check out this Google Forum, I still didn't feel satisfied on what the conclusion was regarding the appendCSV method. Below, you can see my implementation of reading many XLS files. Is there a way to significantly increase the speed of this? It currently takes over 10 minutes for around 900,000 rows.
listOfFiles = glob.glob(file_location)
frame = pd.DataFrame()
for idx, a_file in enumerate(listOfFiles):
data = pd.read_excel(a_file, sheetname=0, skiprows=range(1,2), header=1)
data.rename(columns={'Alphabeta':'AlphaBeta'}, inplace=True)
frame = frame.append(data)
# Save to CSV..
frame.to_csv(output_dir, index=False, encoding='utf-8', date_format="%Y-%m-%d")

The very first important point
Optimize only code that is required to be optimized.
If you need to convert all you files just once then you have already made a great job, congrats! If you, however, need to reuse it really often (and by really I mean that there is a source that produce your Excel files with a speed at least of 900K rows per 10 minutes and you need to parse them in real-time) then what you need to do is to analyze your profiling results.
Profiling analysis
Sorting your profile in descending order by 'cumtime', which is cumulative execution time of function including its subcalls, you will discover that out of ~2000 seconds of runtime ~800 seconds are taken by 'read_excel' method and ~1200 seconds are taken by 'to_csv' method.
If then you will sort profile by 'tottime' which is total execution time of functions themselves you will find out that top time consumers are populated with functions that are connected with reading and writing lines and conversion between formats. So, the real problem is that either libraries you use are slow, or the amount of data you are parsing is really huge.
Possible solutions
For the first reason, please keep in mind that parsing Excel lines and converting them could be a really complex task. It is hard to advice you without having an example of your input data. But there could be a real time loss just because the library you are using is for everything and it does hard work parsing rows several times when you actually do not need it, because your rows have very simple structure. In this case you may try to switch to different libraries, that does not perform complex parsing of input data, for example use xlrd for reading data from Excel. But in title you mentioned that input files are also CSVs so if this is applicable in your case then load lines with just:
line.strip().split(sep)
instead of complex Excel format parsing. And of course if your rows are simple than you can always use
','.join(list_of_rows)
to write CSV instead of using complex DataFrames at all. However, if your files contain Unicode symbols, complex fields and so on then these libraries are probably the best choice.
For the second reason - 900K rows could contain from 900K to infinite bytes, so it is really hard to understand whether your data input is really so big, without an example again. If you have really a lot of data then probably there is not too much you could do and you just have to wait. And remember that disk is actually a very slow device. Usual disks could provide you with ~100Mb/s at its best so if you are copying (because ultimately that is what you are doing) 10Gb of data then you can see that at least 3-4 minutes will be required for just physically reading raw data and writing the result. But in case if you are not using your disk bandwidth for 100% (for example if parsing one row with library that you are using takes comparable time with just reading this row from disk) you might also try to increase speed of your code by asynchronous data reading with multiprocessing map_async instead of cycle.

If you are using pandas, you could do this:
dfs = [pd.read_excel(path.join(dir, name), sep='\t', encoding='cp1252', error_bad_lines=False ) for name in os.listdir(dir) if name.endswith(suffix)]
df = pd.concat(dfs, axis=0, ignore_index=True)
This is screaming fast compared to other methods of getting data into pandas. Other tips:
You can also speed this up by specifying dtype for all columns.
If you are doing read_csv, use the engine='c' to speed up the import.
Skip rows on error

Python: handling a large set of data. Scipy or Rpy? And how?

In my python environment, the Rpy and Scipy packages are already installed.
The problem I want to tackle is such:
1) A huge set of financial data are stored in a text file. Loading into Excel is not possible
2) I need to sum a certain fields and get the totals.
3) I need to show the top 10 rows based on the totals.
Which package (Scipy or Rpy) is best suited for this task?
If so, could you provide me some pointers (e.g. documentation or online example) that can help me to implement a solution?
Speed is a concern. Ideally scipy and Rpy can handle the large files when even when the files are so large that they cannot be fitted into memory

Neither Rpy or Scipy is necessary, although numpy may make it a bit easier.
This problem seems ideally suited to a line-by-line parser.
Simply open the file, read a row into a string, scan the row into an array (see numpy.fromstring), update your running sums and move to the next line.

Python's File I/O doesn't have bad performance, so you can just use the file module directly. You can see what functions are available in it by typing help (file) in the interactive interpreter. Creating a file is part of the core language functionality and doesn't require you to import file.
Something like:
f = open ("C:\BigScaryFinancialData.txt", "r");
for line in f.readlines():
#line is a string type
#do whatever you want to do on a per-line basis here, for example:
print len(line)
Disclaimer: This is a Python 2 answer. I'm not 100% sure this works in Python 3.
I'll leave it to you to figure out how to show the top 10 rows and find the row sums. This can be done with simple program logic that shouldn't be a problem without any special libraries. Of course, if the rows have some kind of complicated formatting that makes it difficult to parse out the values, you might want to use some kind of module for parsing, re for example (type help(re) into the interactive interpreter).

As #gsk3 noted, bigmemory is a great package for this, along with the packages biganalytics and bigtabulate (there are more, but these are worth checking out). There's also ff, though that isn't as easy to use.
Common to both R and Python is support for HDF5 (see the ncdf4 or NetCDF4 packages in R), which makes it very speedy and easy to access massive data sets on disk. Personally, I primarily use bigmemory, though that's R specific. As HDF5 is available in Python and is very, very fast, it's probably going to be your best bet in Python.

How huge is your data, is it larger than your PC's memory? If it can be loaded into memory, you can use numpy.loadtxt() to load text data into a numpy array. for example:
import numpy as np
with file("data.csv", "rb") as f:
title = f.readline() # if your data have a title line.
data = np.loadtxt(f, delimiter=",") # if your data splitted by ","
print np.sum(data, axis=0) # sum along 0 axis to get the sum of every column

I don't know anything about Rpy. I do know that SciPy is used to do serious number-crunching with truly large data sets, so it should work for your problem.
As zephyr noted, you may not need either one; if you just need to keep some running sums, you can probably do it in Python. If it is a CSV file or other common file format, check and see if there is a Python module that will parse it for you, and then write a loop that sums the appropriate values.
I'm not sure how to get the top ten rows. Can you gather them on the fly as you go, or do you need to compute the sums and then choose the rows? To gather them you might want to use a dictionary to keep track of the current 10 best rows, and use the keys to store the metric you used to rank them (to make it easy to find and toss out a row if another row supersedes it). If you need to find the rows after the computation is done, slurp all the data into a numpy.array, or else just take a second pass through the file to pull out the ten rows.

Since this has the R tag I'll give some R solutions:
Overview
http://www.r-bloggers.com/r-references-for-handling-big-data/
bigmemory package http://www.cybaea.net/Blogs/Data/Big-data-for-R.html
XDF format http://blog.revolutionanalytics.com/2011/03/analyzing-big-data-with-revolution-r-enterprise.html
Hadoop interfaces to R (RHIPE, etc.)

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.