I am using Stata to process some data, export the data to a CSV file, and load it into Python using the pandas read_csv function.
The problem is that everything is so slow. Exporting from Stata to a CSV file takes ages (exporting to Stata's native dta format is much faster), and loading the data via read_csv is also very slow. Using the pandas read_stata function is even worse.
I wonder if there are any other options, like exporting to a format other than CSV? My CSV dataset is approximately 6-7 GB.
Any help appreciated
Thanks
pd.read_stata()/.to_stata() are pretty efficient, see here
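As a rough illustration of the comment above (not from the original thread): one option is to pay the Stata-parsing cost only once and keep a faster binary copy for all subsequent runs. The file names below are placeholders, and to_parquet requires pyarrow or fastparquet to be installed.

import pandas as pd

# One-time conversion: parse the Stata export once, then keep a fast binary copy.
# File names here are hypothetical.
df = pd.read_stata("dataset.dta")
df.to_parquet("dataset.parquet")  # or: df.to_hdf("dataset.h5", key="data", mode="w")

# Subsequent sessions skip the slow CSV/Stata parsing entirely:
df = pd.read_parquet("dataset.parquet")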
I have a large dataset of around 6 GB that I have processed and cleaned using PySpark, and I now want to save it so I can use it elsewhere for machine learning.
I am trying to find the fastest way of saving the dataset.
I followed this link, but it's taking very long to save it as CSV or Parquet.
How to export a table dataframe in PySpark to csv?
Can someone please provide some information on how I can do this?
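For context, here is a rough sketch (not from the original post) of what writing the cleaned DataFrame out of PySpark might look like; the Spark session, sample data, and output paths below are all placeholders.

from pyspark.sql import SparkSession

# Placeholder session and DataFrame standing in for the cleaned 6 GB dataset.
spark = SparkSession.builder.appName("save-cleaned-data").getOrCreate()
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "label"])

# Parquet is columnar and compressed, so it is usually much faster to write
# and re-read than CSV for data of this size.
df.write.mode("overwrite").parquet("cleaned_data.parquet")

# If a single CSV file is required, coalescing to one partition funnels all
# the data through a single writer, which is typically the slow part.
df.coalesce(1).write.mode("overwrite").option("header", True).csv("cleaned_data_csv")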
I have a massive dataset (a text file) that is nearly 4 GB, and I would like to work with it using a pandas DataFrame. I can read in the file, but it takes a couple of minutes to read in all of the data.
So, I would like to leverage the speed of C using the Cython library.
I am having trouble finding out how to read a text file into a pandas dataframe using Cython.
Any guidance would be helpful.
Read it once and store it back in another file format with faster I/O (e.g. HDF or pickle). You'll most likely see a 10x-20x improvement.
There's a rough comparison on each file format I/O speed and disk space in the official documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#performance-considerations
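A minimal sketch of that one-off conversion, assuming placeholder file names and that PyTables is installed for the HDF path:

import pandas as pd

# Pay the CSV parsing cost once...
df = pd.read_csv("big_file.csv")

# ...then store it in faster formats for later sessions.
df.to_hdf("big_file.h5", key="df", mode="w")  # requires the 'tables' (PyTables) package
df.to_pickle("big_file.pkl")

# Later loads are typically an order of magnitude faster than read_csv:
df = pd.read_hdf("big_file.h5", key="df")
# or
df = pd.read_pickle("big_file.pkl")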
I am trying to be really specific about my issue. I have a DataFrame with 200+ columns and 1 million+ rows. Reading it from CSV and writing it to an Excel file takes more than 45 minutes, if I recorded it right.
import pandas as pd

df = pd.read_csv("data_file.csv", low_memory=False, header=0, delimiter=',', na_values=('', 'nan'))
df.to_excel('data_file.xlsx', header=0, index=False)
My question: is there any way to read or write a file faster with a pandas DataFrame? This is just one example file; I have many more such files.
Two thoughts:
Investigate Dask, which provides a pandas-like DataFrame that can distribute processing of large datasets across multiple CPUs or a cluster. It's hard to say how much of a speed-up you will get if your performance is purely I/O bound, but it is certainly worth investigating. Take a quick look at the Dask use cases to get an understanding of its capabilities (a rough sketch follows below).
If you are going to repeatedly read the same CSV input files, then I would suggest converting them to HDF, as reading HDF is orders of magnitude faster than reading the equivalent CSV file. It's as simple as reading the file into a DataFrame and then writing it back out using DataFrame.to_hdf(). Obviously this will only help if you can do the conversion as a one-off exercise, and then use the HDF files from that point forward whenever you run your code.
Regards,
Ian
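As a rough illustration of the Dask suggestion above (not from the original answer), this reads many CSVs lazily and computes an aggregate in parallel; the file pattern and column name are hypothetical:

import dask.dataframe as dd

# Dask mirrors much of the pandas API but partitions the data and processes
# the partitions in parallel / out of core. Paths and column are placeholders.
ddf = dd.read_csv("data_files/*.csv")       # builds a lazy task graph over all files

result = ddf.groupby("some_column").size()  # still lazy at this point
print(result.compute())                     # .compute() triggers the parallel work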
That is a big file you are working with. If you need to process the data then you can't really get around the long read and write times.
Do NOT write to xlsx; use csv instead, because writing to xlsx is what is taking so long.
Write to csv. It takes about a minute on my cheap laptop with an SSD.
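A minimal sketch of that change, reusing the input file name from the question (the output name is a placeholder):

import pandas as pd

df = pd.read_csv("data_file.csv", low_memory=False, na_values=('', 'nan'))

# Writing CSV is dramatically faster than writing .xlsx for a frame this size.
df.to_csv("data_file_out.csv", index=False)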
I have a huge CSV file with around 4 million columns and around 300 rows. The file size is about 4.3 GB. I want to read this file and run some machine learning algorithms on the data.
I tried reading the file via pandas read_csv in Python, but it is taking a long time to read even a single row (I suspect due to the large number of columns). I checked a few other options like numpy fromfile, but nothing seems to be working.
Can someone please suggest a way to load a file with this many columns in Python?
Pandas/numpy should be able to handle that volume of data no problem. I hope you have at least 8GB of RAM on that machine. To import a CSV file with Numpy, try something like
data = np.loadtxt('test.csv', dtype=np.uint8, delimiter=',')
If there is missing data, np.genfromtxt might work instead. If none of these meet your needs and you have enough RAM to hold a duplicate of the data temporarily, you could first build a Python list of lists, one per row, using readline and str.split. Then pass that to pandas or numpy, assuming that's how you intend to operate on the data. You could then save it to disk in a format that is easier to ingest later. HDF5 was already mentioned and is a good option. You can also save a numpy array to disk with numpy.savez or, my favorite, the speedy bloscpack.(un)pack_ndarray_file.
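A hedged sketch of those options (the file names, dtypes, and fill value are assumptions, not from the original answer):

import numpy as np

# np.loadtxt works when every field is present and numeric; the uint8 dtype
# assumes all values fit in 0-255.
data = np.loadtxt("test.csv", dtype=np.uint8, delimiter=",")

# np.genfromtxt tolerates missing fields, filling them with a chosen default.
data = np.genfromtxt("test.csv", dtype=np.float32, delimiter=",", filling_values=0)

# Once parsed, save a binary copy so future runs skip text parsing entirely.
np.savez("test_arrays.npz", data=data)
with np.load("test_arrays.npz") as archive:
    data = archive["data"]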
CSV is very inefficient for storing large datasets. You should convert your CSV file into a better-suited format. Try HDF5 (h5py.org or pytables.org); it is very fast and allows you to read parts of the dataset without loading all of it into memory.
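As a rough sketch of the partial-read idea with h5py (the array, file name, and dataset name below are placeholders):

import h5py
import numpy as np

# Placeholder array standing in for the parsed CSV data.
my_array = np.random.rand(300, 1000)

# One-time conversion to HDF5.
with h5py.File("data.h5", "w") as f:
    f.create_dataset("matrix", data=my_array, compression="gzip")

# Later: read only the slice you need; h5py pulls just that part from disk.
with h5py.File("data.h5", "r") as f:
    first_rows = f["matrix"][:10, :]
    one_column = f["matrix"][:, 42]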
According to this answer, pandas (which you already tried) is the fastest library available to read a CSV in Python, or at least was in 2014.
I'm having memory problems while using pandas on some big CSV files (more than 30 million rows). So I'm wondering: what is the best solution for this? I need to merge a couple of big tables. Thanks a lot!
Possible duplicate of Fastest way to parse large CSV files in Pandas.
The takeaway is: if you are loading the CSV data often, then a better way would be to parse it once (with conventional read_csv) and store it in HDF5 format. pandas (with the PyTables library) provides an efficient way to handle this [docs].
Also, the answer to What is the fastest way to upload a big csv file in notebook to work with python pandas? shows the timed execution (timeit) of a sample dataset compared across csv, csv.gz, Pickle, and HDF5.
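A rough sketch of how such a timing comparison might be reproduced locally (the frame size, file paths, and repeat count are arbitrary, and the HDF path requires PyTables):

import timeit
import numpy as np
import pandas as pd

# Placeholder frame standing in for the real dataset.
df = pd.DataFrame(np.random.rand(1_000_000, 10), columns=[f"c{i}" for i in range(10)])

df.to_csv("sample.csv", index=False)
df.to_csv("sample.csv.gz", index=False, compression="gzip")
df.to_pickle("sample.pkl")
df.to_hdf("sample.h5", key="df", mode="w")

# Time three reads of each format and report the average.
for label, stmt in [
    ("csv",    'pd.read_csv("sample.csv")'),
    ("csv.gz", 'pd.read_csv("sample.csv.gz")'),
    ("pickle", 'pd.read_pickle("sample.pkl")'),
    ("hdf5",   'pd.read_hdf("sample.h5", key="df")'),
]:
    seconds = timeit.timeit(stmt, number=3, globals={"pd": pd})
    print(f"{label:7s} {seconds / 3:.2f} s per read")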