Read TXT file with Cython and Pandas - python

I have a massive dataset (a text file of nearly 4 GB) and would like to work with it using a pandas dataframe. I can read the file in, but it takes a couple of minutes to read in all of the data.
So, I would like to leverage the speed of C using the Cython library.
I am having trouble finding out how to read a text file into a pandas dataframe using Cython.
Any guidance would be helpful.

Read it once and store it back in a file format with faster I/O (e.g. HDF or pickle). You'll most likely see a 10x-20x improvement.
There's a rough comparison of I/O speed and disk space for each file format in the official documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#performance-considerations
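For example, a minimal sketch of that one-off conversion, assuming the text file is tab-separated; the file names, separator and HDF key are assumptions, and to_hdf needs the tables package installed:
import pandas as pd

df = pd.read_csv("data.txt", sep="\t")      # slow one-time parse
df.to_hdf("data.h5", key="df", mode="w")    # fast binary store
df.to_pickle("data.pkl")                    # or a pickle, if you prefer

# subsequent loads are then much faster
df = pd.read_hdf("data.h5", key="df")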

Related

Efficiently read in large csv files using pandas or dask in python

I want to read large csv files into Python in the fastest way possible. I have a csv file of ~100 million rows. I came across this primer https://medium.com/casual-inference/the-most-time-efficient-ways-to-import-csv-data-in-python-cc159b44063d and they go through a few packages:
csv
pandas
dask
datatable
paratext
For my purposes, "csv" is too raw and I want to leverage the type inference included in the other packages. I need it to work on both Windows and Linux machines. I have also looked into datatable and paratext, but have had problems installing the right package dependencies (neither is on the Anaconda package repo). So that leaves pandas and dask. At first glance, dask seems much faster, but then I realized that it only performs the computations when you call .compute().
My specific use case is that even though the raw csv file is 100+ million rows, I only need a subset of it loaded into memory. For example, all rows with date >= T. Is there a more efficient way of doing this than the example below? Both pandas and dask take a similar time.
EDIT: The csv file is updated daily and there is no pre-known order of the rows in the file, i.e. it is not necessarily the case that the most recent dates are at the end of the file.
import pandas as pd
import dask.dataframe as dd   # dask.dataframe, not the top-level dask package
from datetime import datetime

# pandas: parse the whole file, then filter
s = datetime.now()
data1 = pd.read_csv("test.csv", parse_dates=["DATE"])
data1 = data1[data1.DATE >= datetime(2019, 12, 24)]
print(datetime.now() - s)

# dask: build the graph lazily, then materialise with .compute()
s = datetime.now()
data2 = dd.read_csv("test.csv", parse_dates=["DATE"])
data2 = data2[data2.DATE >= datetime(2019, 12, 24)].compute()
print(datetime.now() - s)
Your Dask solution looks good to me. For parsing CSV in particular you might want to use Dask's multiprocessing scheduler. Most pandas operations are better with threads, but text-based processing such as CSV parsing is an exception.
data2 = data2[data2.DATE>=datetime(2019,12,24)].compute(scheduler="processes")
See https://docs.dask.org/en/latest/scheduling.html for more information.
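A minimal self-contained version of that suggestion, assuming the same test.csv and DATE column as in the question:
import dask.dataframe as dd
from datetime import datetime

ddf = dd.read_csv("test.csv", parse_dates=["DATE"])
# processes side-step the GIL for the text-heavy CSV parsing step
result = ddf[ddf.DATE >= datetime(2019, 12, 24)].compute(scheduler="processes")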
CSV is not an efficient file format for filtering: CSV files have no indexes on data fields and no key-based access, so every filter operation has to read all lines.
You can improve the performance marginally by using a library that's written in C or that does some things a little smarter than another library, but don't expect miracles. I'd expect an improvement of a few percent, to perhaps 3x if you identify or implement an optimized C routine that reads your lines and performs the initial filtering.
If you read the CSV file more than once, it can be worth converting it during the first read (multiple options exist: hand-crafted helpers, indexes, sorting, databases, ...) and performing subsequent reads on that 'database'.
If you know that the new CSV file is the same as the previous version plus lines appended at the end of the file, you could remember the position of the end of the previous version and just add the new lines to your optimized data file (database, ...); a sketch of this idea follows at the end of this answer.
Other file formats can be hundreds or thousands of times more efficient to query, but creating these files the first time is probably at least as expensive as your filtering, so if you only read the data once there is nothing to optimize.
If none of the above conditions holds, you cannot expect a huge performance increase.
You might look at What is the fastest way to search the csv file? for further acceleration ideas (assuming the files can be sorted/indexed by a search/filter criterion).
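A rough sketch of the append-only bookkeeping, assuming the daily file only ever grows at the end; the file names, the DATE column and the Parquet cache (which needs pyarrow or fastparquet) are illustrative assumptions:
import os
from io import StringIO
import pandas as pd

CSV, CACHE, OFFSET = "daily.csv", "cache.parquet", "offset.txt"

last = int(open(OFFSET).read()) if os.path.exists(OFFSET) else 0

with open(CSV, "rb") as f:                # binary mode => tell() is a byte offset
    header = f.readline().decode()        # column names from the first line
    f.seek(max(last, f.tell()))           # skip everything parsed last time
    new_rows = f.read().decode()          # only the freshly appended lines
    end = f.tell()

if new_rows:
    new_df = pd.read_csv(StringIO(header + new_rows), parse_dates=["DATE"])
    if os.path.exists(CACHE):
        new_df = pd.concat([pd.read_parquet(CACHE), new_df], ignore_index=True)
    new_df.to_parquet(CACHE)              # fast columnar cache for later filters
    open(OFFSET, "w").write(str(end))

# subsequent filters hit the Parquet cache instead of re-parsing the whole CSV
subset = pd.read_parquet(CACHE)
subset = subset[subset.DATE >= "2019-12-24"]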

Large python dictionary. Storing, loading, and writing to it

I have a large python dictionary of values (around 50 GB), and I've stored it as a JSON file. I am having efficiency issues when it comes to opening the file and writing to the file. I know you can use ijson to read the file efficiently, but how can I write to it efficiently?
Should I even be using a Python dictionary to store my data? Is there a limit to how large a python dictionary can be? (the dictionary will get larger).
The data basically stores the path length between nodes in a large graph. I can't store the data as a graph because searching for a connection between two nodes takes too long.
Any help would be much appreciated. Thank you!
Although it will truly depend on what operations you want to perform on your network dataset, you might want to consider storing this as a pandas DataFrame and then writing it to disk using Parquet or Arrow.
That data could then be loaded to networkx or even to Spark (GraphX) for any network related operations.
Parquet is compressed and columnar and makes reading and writing to files much faster especially for large datasets.
From the pandas docs:
Apache Parquet provides a partitioned binary columnar serialization for data frames. It is designed to make reading and writing data frames efficient, and to make sharing data across data analysis languages easy. Parquet can use a variety of compression techniques to shrink the file size as much as possible while still maintaining good read performance.
Parquet is designed to faithfully serialize and de-serialize DataFrames, supporting all of the pandas dtypes, including extension dtypes such as datetime with tz.
Read further here: Pandas Parquet
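A minimal sketch of that layout, assuming the dictionary maps (source, target) node pairs to path lengths; the column names and file name are illustrative, and writing Parquet requires pyarrow or fastparquet:
import pandas as pd

path_lengths = {("a", "b"): 3, ("a", "c"): 7, ("b", "c"): 2}   # toy stand-in for the 50 GB dict

df = pd.DataFrame(
    [(s, t, d) for (s, t), d in path_lengths.items()],
    columns=["source", "target", "length"],
)
df.to_parquet("path_lengths.parquet", index=False)

# later, load it back (or only the columns you need)
df = pd.read_parquet("path_lengths.parquet", columns=["source", "length"])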
Try reading it with pandas: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_json.html
pandas.read_json(path_or_buf=None, orient=None, typ='frame', dtype=True, convert_axes=True, convert_dates=True, keep_default_dates=True, numpy=False, precise_float=False, date_unit=None, encoding=None, lines=False, chunksize=None, compression='infer')
Convert a JSON string to pandas object
It is a very lightweight and useful library for working with large data.
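A minimal sketch of chunked reading with that function, assuming the 50 GB file is (or can be rewritten as) JSON Lines with one record per line; the file name and the process() step are hypothetical:
import pandas as pd

reader = pd.read_json("paths.jsonl", lines=True, chunksize=100_000)
for chunk in reader:          # each chunk is an ordinary DataFrame
    process(chunk)            # hypothetical per-chunk work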

Is there a faster way to read or write a pandas dataframe with about 1 million rows?

I am trying to be really specific about my issue. I have a dataframe with 200+ columns and 1 million+ rows. I am reading it from a CSV file and writing it to an Excel file, which takes more than 45 minutes if I recorded it right.
df = pd.read_csv("data_file.csv", low_memory=False, header=0, delimiter = ',', na_values = ('', 'nan'))
df.to_excel('data_file.xlsx', header=0, index=False)
My question: is there any way we can read or write faster with a pandas dataframe? This is just one example file; I have many more such files.
Two thoughts:
Investigate Dask, which provides a pandas-like DataFrame that can distribute processing of large datasets across multiple CPUs or clusters. It is hard to say to what degree you will get a speed-up if your performance is purely I/O bound, but it is certainly worth investigating. Take a quick look at the Dask use cases to get an understanding of its capabilities.
If you are going to repeatedly read the same CSV input files, then I would suggest converting these to HDF, as reading HDF is orders of magnitude faster than reading the equivalent CSV file. It's as simple as reading the file into a DataFrame and then writing it back out using DataFrame.to_hdf(). Obviously this will only help if you can do this conversion as a one-off exercise, and then use the HDF files from that point forward whenever you run your code.
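A minimal sketch of that one-off conversion (to_hdf needs the tables package); the file names and key are assumptions:
import pandas as pd

df = pd.read_csv("data_file.csv", low_memory=False, na_values=("", "nan"))
df.to_hdf("data_file.h5", key="data", mode="w")

# from then on, load the HDF file instead of re-parsing the CSV
df = pd.read_hdf("data_file.h5", key="data")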
That is a big file you are working with. If you need to process the data, you can't really get around the long read and write times.
Do not write to xlsx; writing to xlsx is what is taking so long. Write to csv instead. That takes about a minute on my cheap laptop with an SSD.

Reading file with huge number of columns in python

I have a huge csv file with around 4 million columns and around 300 rows. The file size is about 4.3 GB. I want to read this file and run some machine learning algorithms on the data.
I tried reading the file via pandas read_csv in Python, but it takes a long time to read even a single row (I suspect due to the large number of columns). I checked a few other options, like numpy fromfile, but nothing seems to be working.
Can someone please suggest a way to load a file with this many columns in Python?
Pandas/numpy should be able to handle that volume of data no problem. I hope you have at least 8GB of RAM on that machine. To import a CSV file with Numpy, try something like
data = np.loadtxt('test.csv', dtype=np.uint8, delimiter=',')
If there is missing data, np.genfromtxt might work instead. If none of these meet your needs and you have enough RAM to hold a duplicate of the data temporarily, you could first build a Python list of lists, one per row, using readline and str.split, then pass that to pandas or numpy, assuming that's how you intend to operate on the data. You could then save it to disk in a format that is easier to ingest later. hdf5 was already mentioned and is a good option. You can also save a numpy array to disk with numpy.savez or, my favourite, the speedy bloscpack.(un)pack_ndarray_file.
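A rough sketch of the list-of-lists fallback, assuming a comma-separated file with no header; the file name and dtype are illustrative:
import numpy as np

rows = []
with open("test.csv") as f:
    for line in f:
        rows.append(line.rstrip("\n").split(","))

data = np.asarray(rows, dtype=np.uint8)   # or hand the list of lists to pandas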
csv is very inefficient for storing large datasets. You should convert your csv file into a better-suited format. Try hdf5 (h5py.org or pytables.org); it is very fast and allows you to read parts of the dataset without fully loading it into memory.
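A minimal sketch of such partial reads with h5py; the file name, dataset name, shape and dtype are assumptions for illustration:
import h5py
import numpy as np

# one-off: create a dataset and fill it row by row to keep memory low
with h5py.File("data.h5", "w") as f:
    dset = f.create_dataset("features", shape=(300, 4_000_000), dtype="f4")
    for i in range(300):
        dset[i, :] = np.random.rand(4_000_000)   # stand-in for one parsed CSV row

# later: read only the slice you need, without loading the whole dataset
with h5py.File("data.h5", "r") as f:
    first_cols = f["features"][:, :1000]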
According to this answer, pandas (which you already tried) is the fastest library available to read a CSV in Python, or at least was in 2014.

python pandas memory errors while working with big CSV files

I'm having memory problems while using pandas on some big CSV files (more than 30 million rows). So I'm wondering what the best solution for this is. I need to merge a couple of big tables. Thanks a lot!
Possible duplicate of Fastest way to parse large CSV files in Pandas.
The inference is: if you are loading the csv file data often, then a better way would be to parse it once (with conventional read_csv) and store it in HDF5 format. Pandas (with the PyTables library) provides an efficient way to handle this issue [docs].
Also, the answer to What is the fastest way to upload a big csv file in notebook to work with python pandas? shows you the timed execution (timeit) of a sample dataset, comparing csv vs csv.gz vs Pickle vs HDF5.
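A rough sketch of a chunked CSV-to-HDF5 conversion that keeps memory bounded (requires the tables package); the file names, key and chunk size are assumptions:
import pandas as pd

with pd.HDFStore("big.h5", mode="w") as store:
    for chunk in pd.read_csv("big.csv", chunksize=1_000_000):
        store.append("table", chunk, data_columns=True)

# later, merge against the HDF store (or again chunk by chunk)
df = pd.read_hdf("big.h5", "table")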
