I have a huge CSV file with around 4 million columns and around 300 rows. The file size is about 4.3 GB. I want to read this file and run some machine learning algorithms on the data.
I tried reading the file via pandas read_csv in Python, but it takes a long time to read even a single row (I suspect due to the large number of columns). I checked a few other options, like numpy fromfile, but nothing seems to be working.
Can someone please suggest a way to load a file with this many columns in Python?
Pandas/NumPy should be able to handle that volume of data with no problem. I hope you have at least 8 GB of RAM on that machine. To import a CSV file with NumPy, try something like
import numpy as np
data = np.loadtxt('test.csv', dtype=np.uint8, delimiter=',')
If there is missing data, np.genfromtxt might work instead. If none of these meet your needs and you have enough RAM to hold a duplicate of the data temporarily, you could first build a Python list of lists, one per row, using readline and str.split, as sketched below. Then pass that to pandas or NumPy, assuming that's how you intend to operate on the data. You could then save it to disk in a format that is easier to ingest later. HDF5 was already mentioned and is a good option. You can also save a NumPy array to disk with numpy.savez, or my favorite, the speedy bloscpack.(un)pack_ndarray_file.
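A minimal sketch of that list-of-lists route, assuming the same 'test.csv' of comma-separated small integers as above (adjust the conversion, delimiter and dtype to your actual data):

import numpy as np

rows = []
with open('test.csv') as f:
    for line in f:
        # one list of ints per row; str.split avoids csv-module overhead for simple files
        rows.append([int(x) for x in line.rstrip('\n').split(',')])

data = np.asarray(rows, dtype=np.uint8)  # the temporary list of lists can be dropped afterwards
np.savez('test.npz', data=data)          # reloading this is much faster than re-parsing the CSV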
csv is very inefficient for storing large datasets. You should convert your csv file into a better-suited format. Try HDF5 (h5py.org or pytables.org); it is very fast and allows you to read parts of the dataset without fully loading it into memory.
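For example, a rough h5py sketch (the file and dataset names are placeholders): a one-off conversion, after which you can slice out just the columns you need without touching the rest of the file:

import h5py
import numpy as np

data = np.loadtxt('test.csv', dtype=np.uint8, delimiter=',')  # or however you first get it into memory

# one-off: store the array as a chunked, compressed HDF5 dataset
with h5py.File('data.h5', 'w') as f:
    f.create_dataset('data', data=data, chunks=True, compression='gzip')

# later reads: only the requested slice is pulled off disk
with h5py.File('data.h5', 'r') as f:
    block = f['data'][:, :1000]  # all rows, first 1000 columns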
According to this answer, pandas (which you already tried) is the fastest library available to read a CSV in Python, or at least was in 2014.
Related
I have a massive dataset (text file) that is nearly 4GB and would like to work with the dataset using a pandas dataframe. I can read in the file but it takes a couple of minutes to read in all of the data.
So, I would like to leverage the speed of C using the Cython library.
I am having trouble finding out how to read a text file into a pandas dataframe using Cython.
Any guidance would be helpful.
Read it once and store it in another file format with faster I/O (e.g. HDF, pickle); see the sketch below. You'll most likely see a 10x-20x improvement.
There's a rough comparison on each file format I/O speed and disk space in the official documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#performance-considerations
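A minimal sketch of that pattern, assuming a CSV-like file called 'data.csv' (the to_hdf call needs the pytables package installed):

import pandas as pd

# pay the slow CSV parse once...
df = pd.read_csv('data.csv')

# ...then store it in faster formats for every later run
df.to_hdf('data.h5', key='df', mode='w')
df.to_pickle('data.pkl')

# subsequent loads are typically much faster than read_csv
df = pd.read_hdf('data.h5', 'df')
df = pd.read_pickle('data.pkl')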
I want to read large CSV files into Python in the fastest way possible. I have a CSV file of ~100 million rows. I came across this primer https://medium.com/casual-inference/the-most-time-efficient-ways-to-import-csv-data-in-python-cc159b44063d and it goes through a few packages:
csv
pandas
dask
datatable
paratext
For my purposes, "csv" is too raw and I want to leverage the type inference included in the other packages. I need it to work on both Windows and Linux machines. I have also looked into datatable and paratext, but have had problems installing the right package dependencies (neither is on the Anaconda package repo). So that leaves pandas and dask. At first glance, dask seems much faster, but I later realized it only does the computations when you call ".compute()".
My specific use case is that even though the raw CSV file is 100+ million rows, I only need a subset of it loaded into memory. For example, all rows with date >= T. Is there a more efficient way of doing this than the example below? Both pandas and dask take a similar amount of time.
EDIT: The CSV file updates daily and there is no pre-known order to the rows of the file, i.e. it is not necessarily the case that the most recent dates are at the end of the file.
import pandas as pd
import dask.dataframe as dd  # dask.dataframe provides read_csv, not the top-level dask module
from datetime import datetime

# pandas: parse the whole file, then filter
s = datetime.now()
data1 = pd.read_csv("test.csv", parse_dates=["DATE"])
data1 = data1[data1.DATE >= datetime(2019, 12, 24)]
print(datetime.now() - s)

# dask: build the lazy graph, then trigger it with .compute()
s = datetime.now()
data2 = dd.read_csv("test.csv", parse_dates=["DATE"])
data2 = data2[data2.DATE >= datetime(2019, 12, 24)].compute()
print(datetime.now() - s)
Your Dask solution looks good to me. For parsing CSV in particular, you might want to use Dask's multiprocessing scheduler. Most Pandas operations are better with threads, but text-based processing like CSV parsing is an exception.
data2 = data2[data2.DATE>=datetime(2019,12,24)].compute(scheduler="processes")
See https://docs.dask.org/en/latest/scheduling.html for more information.
CSV is not an efficient file format for filtering: CSV files don't have indexes for data fields and don't have key-based access. For each filter operation you always have to read all the lines.
You can improve performance marginally by using a library that's written in C or that does some things a little smarter than another library, but don't expect miracles. I'd expect an improvement of a few percent to perhaps 3x if you identify/implement an optimized C version that reads your lines and performs the initial filtering.
If you read the CSV file often, then it might be useful to convert the file during the first read (multiple options exist: hand-crafted helpers, indexes, sorting, databases, ...) and perform subsequent reads on that 'database'.
If you know that the new CSV file is the same as the previous version plus lines appended to the end of the file, you could record the position of the last line of the previous version and just add the new lines to your optimized data file (database, ...).
Other file formats might be hundreds or thousands of times more efficient, but creating these files the first time is probably at least as expensive as your searching (so if you read only once, there's nothing you can optimize).
If none of the above conditions is true, you cannot expect a huge performance increase.
You might look at What is the fastest way to search the csv file? for ideas on acceleration (assuming the files can be sorted/indexed by a search/filter criterion).
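If you have to stay with the raw CSV and memory is the main constraint, one option, sketched here with the same 'test.csv' and DATE column from the question, is to stream the file in chunks and keep only the matching rows. This still parses every line, so it mainly helps with memory rather than speed; the real win is the one-off conversion to a sorted/indexed format described above.

import pandas as pd
from datetime import datetime

cutoff = datetime(2019, 12, 24)
matches = []

# stream the file in 1-million-row chunks; only the filtered rows stay in memory
for chunk in pd.read_csv("test.csv", parse_dates=["DATE"], chunksize=1_000_000):
    matches.append(chunk[chunk.DATE >= cutoff])

data = pd.concat(matches, ignore_index=True)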
I have a big dbf file, and converting it to a pandas dataframe is taking a lot of time.
Is there a way to convert the file into a dask dataframe?
Dask does not have a dbf loading method.
As far as I can tell, dbf files do not support random access to the data, so it is not possible to read from sections of the file in separate workers, in parallel. I may be wrong about this, but certainly dbfreader makes no mention of jumping to an arbitrary record.
Therefore, the only way you could read from dbf in parallel, and hope to see a speed increase, would be to split your original data into multiple dbf files, and use dask.delayed to read each of them.
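A rough sketch of that pattern; load_one_dbf, the file names and the meta columns below are placeholders, since Dask itself has no dbf reader, so you would plug in whatever dbf-reading code you already use:

import dask
import dask.dataframe as dd
import pandas as pd

def load_one_dbf(path):
    # placeholder: read a single dbf file into a pandas DataFrame here,
    # e.g. with the dbf reader you already use
    ...

paths = ["part1.dbf", "part2.dbf", "part3.dbf"]
parts = [dask.delayed(load_one_dbf)(p) for p in paths]

# an empty frame describing the expected columns/dtypes (placeholder values),
# so dask does not have to read a file just to infer the schema
meta = pd.DataFrame({"id": pd.Series(dtype="int64"), "value": pd.Series(dtype="float64")})

ddf = dd.from_delayed(parts, meta=meta)  # nothing is read until you call .compute()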
It is worth mentioning that probably the reason dbfreader is slow (but please, do your own profiling!) is that it does byte-by-byte manipulations and makes Python objects for every record before passing the records to pandas. If you really wanted to speed things up, that code should be converted to Cython or maybe numba, writing into a pre-allocated dataframe.
I am trying to be really specific about my issue. I have a dataframe with 200+ columns and 1 million+ rows. I am reading it in and writing it out to an Excel file, which takes more than 45 minutes if I timed it right.
import pandas as pd

df = pd.read_csv("data_file.csv", low_memory=False, header=0, delimiter=',', na_values=('', 'nan'))
df.to_excel('data_file.xlsx', header=0, index=False)
My question: is there any way we can read from or write to a file faster with a pandas dataframe? This is just one example file; I have many more such files.
Two thoughts:
Investigate Dask, which provides a pandas-like DataFrame that can distribute processing of large datasets across multiple CPUs or a cluster. It's hard to say to what degree you will get a speed-up if your performance is purely IO-bound, but it's certainly worth investigating. Take a quick look at the Dask use cases to get an understanding of its capabilities.
If you are going to repeatedly read the same CSV input files, then I would suggest converting these to HDF, as reading HDF is orders of magnitude faster than reading the equivalent CSV file. It's as simple as reading the file into a DataFrame and then writing it back out using DataFrame.to_hdf(). Obviously this will only help if you can do this conversion as a one-off exercise, and then use the HDF files from that point forward whenever you run your code.
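A sketch of that one-off conversion for a folder of such files (the glob pattern and key name below are just assumptions):

import glob
import pandas as pd

# one-off: convert every CSV in the folder to HDF; later runs read the .h5 files instead
for path in glob.glob("*.csv"):
    df = pd.read_csv(path, low_memory=False)
    df.to_hdf(path.replace(".csv", ".h5"), key="data", mode="w")

# from then on, this is much faster than re-parsing the CSV
df = pd.read_hdf("data_file.h5", "data")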
Regards,
Ian
That is a big file you are working with. If you need to process the data then you can't really get around the long read and write times.
Do NOT write to xlsx; writing to xlsx takes a long time. Write to csv instead. It takes a minute on my cheap laptop with an SSD.
I'm having memory problems while using pandas on some big CSV files (more than 30 million rows). So, I'm wondering what the best solution for this is. I need to merge a couple of big tables. Thanks a lot!
Possible duplicate of Fastest way to parse large CSV files in Pandas.
The takeaway is that if you are loading the CSV data often, a better way would be to parse it once (with conventional read_csv) and store it in HDF5 format. Pandas (with the PyTables library) provides an efficient way to handle this [docs].
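For example, a sketch using a queryable HDF5 store (the file and column names here are made up): storing in 'table' format lets you pull back only the rows you need, which keeps memory use down before merging:

import pandas as pd

# one-off: parse the CSV and store it in table format so it can be queried on disk
df = pd.read_csv("big.csv", parse_dates=["date"])
with pd.HDFStore("big.h5") as store:
    store.put("table1", df, format="table", data_columns=["date"])

# later: read back only a slice instead of the whole table
with pd.HDFStore("big.h5") as store:
    subset = store.select("table1", where="date >= '2020-01-01'")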
Also, the answer to What is the fastest way to upload a big csv file in notebook to work with python pandas? shows you the timed execution (timeit) of sample dataset with csv vs csv.gz vs Pickle vs HDF5 comparison.