Read a large csv as a Pandas DataFrame faster - python

I have a csv that I am reading into a Pandas DataFrame but it takes about 35 minutes to read. The csv is approximately 120 GB. I found a module called cudf that allows a GPU DataFrame however it is only for Linux. Is there something similar for Windows?
chunk_list = []
combined_array = pd.DataFrame()
for chunk in tqdm(pd.read_csv('\\large_array.csv', header = None,
low_memory = False, error_bad_lines = False, chunksize = 10000)):
print(' --- Complete')
chunk_list.append(chunk)
array = pd.concat(chunk_list)
print(array)

You can also look at dask-dataframe if you really want to read it into a pandas api like dataframe.
For reading csvs , this will parallelize your io task across multiple cores plus nodes. This will probably alleviate memory pressures by scaling across nodes as with 120 GB csv you will probably be memory bound too.
Another good alternative might be using arrow.

Do you have GPU ? if yes, please look at BlazingSQL, the GPU SQL engine in a Python package.
In this article, describe Querying a Terabyte with BlazingSQL. And BlazingSQL support read from CSV.
After you get GPU dataframe convert to Pandas dataframe with
# from cuDF DataFrame to pandas DataFrame
df = gdf.to_pandas()

Related

Pandas / Dask - group by & aggregating a large CSV blows the memory and/or takes quite a while

I'm trying a small POC to try to group by & aggregate to reduce data from a large CSV in pandas and Dask, and I'm observing high memory usage and/or slower than I would expect processing times... does anyone have any tips for a python/pandas/dask noob to improve this?
Background
I have a request to build a file ingestion tool that would:
Be able to take in files of a few GBs where each row contains user id and some other info
do some transformations
reduce the data to { user -> [collection of info]}
send batches of this data to our web services
Based on my research, since files are only few GBs, I found that Spark, etc would be overkill, and Pandas/Dask may be a good fit, hence the POC.
Problem
Processing a 1GB csv takes ~1 min for both pandas and Dask, and consumes 1.5GB ram for pandas and 9GB ram for dask (!!!)
Processing a 2GB csv takes ~3 mins and 2.8GB ram for pandas, Dask crashes!
What am I doing wrong here?
for pandas, since I'm processing the CSV in small chunks, I did not expect the RAM usage to be so high
for Dask, everything I read online suggested that Dask processes the CSV in blocks indicated by blocksize, and as such the ram usage should expect to be blocksize * size per block, but I wouldn't expect total to be 9GB when the block size is only 6.4MB. I don't know why on earth its ram usage skyrockets to 9GB for a 1GB csv input
(Note: if I don't set the block size dask crashes even on input 1GB)
My code
I can't share the CSV, but it has 1 integer column followed by 8 text columns. Both user_id and order_id columns referenced below are text columns.
1GB csv has 14000001 lines
2GB csv has 28000001 lines
5GB csv has 70000001 lines
I generated these csvs with random data, and the user_id column I randomly picked from 10 pre-randomly-generated values, so I'd expect the final output to be 10 user ids each with a collection of who knows how many order ids.
Pandas
#!/usr/bin/env python3
from pandas import DataFrame, read_csv
import pandas as pd
import sys
test_csv_location = '1gb.csv'
chunk_size = 100000
pieces = list()
for chunk in pd.read_csv(test_csv_location, chunksize=chunk_size, delimiter='|', iterator=True):
df = chunk.groupby('user_id')['order_id'].agg(size= len,list= lambda x: list(x))
pieces.append(df)
final = pd.concat(pieces).groupby('user_id')['list'].agg(size= len,list=sum)
final.to_csv('pandastest.csv', index=False)
Dask
#!/usr/bin/env python3
from dask.distributed import Client
import dask.dataframe as ddf
import sys
test_csv_location = '1gb.csv'
df = ddf.read_csv(test_csv_location, blocksize=6400000, delimiter='|')
# For each user, reduce to a list of order ids
grouped = df.groupby('user_id')
collection = grouped['order_id'].apply(list, meta=('order_id', 'f8'))
collection.to_csv('./dasktest.csv', single_file=True)
The groupby operation is expensive because dask will try to shuffle data across workers to check who has which user_id values. If user_id has a lot of unique values (sounds like it), there is a lot of cross-checks to be done across workers/partitions.
There are at least two ways out of it:
set user_id as index. This will be expensive during the indexing stage, but subsequent operations will be faster because now dask doesn't have to check each partition for the values of user_id it has.
df = df.set_index('user_id')
collection = df.groupby('user_id')['order_id'].apply(list, meta=('order_id', 'f8'))
collection.to_csv('./dasktest.csv', single_file=True)
if your files have a structure that you know about, e.g. as an extreme example, if user_id are somewhat sorted, so that first the csv file contains only user_id values that start with 1 (or A, or whatever other symbols are used), then with 2, etc, then you could use that information to form partitions in 'blocks' (loose term) in a way that groupby would be needed only within those 'blocks'.

Melt a big data frame without pandas

I have a 3GB dataset with 40k rows and 60k columns which Pandas is unable to read and I would like to melt the file based on the current index.
The current file looks like this:
The first column is an index and I would like to melt all the file based on this index.
I tried pandas and dask, but all of them crush when reading the big file.
Do you have any suggestions?
thanks
You need to use the chunksize property of pandas. See for example How to read a 6 GB csv file with pandas.
You will process N rows at one time, without loading the whole dataframe. N will depend on your computer: if N is low, it will cost less memory but it will increase the run time and will cost more IO load.
# create an object reading your file 100 rows at a time
reader = pd.read_csv( 'bigfile.tsv', sep='\t', header=None, chunksize=100 )
# process each chunk at a time
for chunk in file:
result = chunk.melt()
# export the results into a new file
result.to_csv( 'bigfile_melted.tsv', header=None, sep='\t', mode='a' )
Furthermore, you can use the argument dtype=np.int32 for read_csv if you have integer or dtype=np.float32 to process data faster if you do not need precision.
NB: here you have examples of memory usage: Using Chunksize in Pandas.

parallelize conversion of a single 16M row csv to Parquet with dask

The following operation works, but takes nearly 2h:
from dask import dataframe as ddf
ddf.read_csv('data.csv').to_parquet('data.pq')
Is there a way to parallelize this?
The file data.csv is ~2G uncompressed with 16 million rows by 22 columns.
I'm not sure if it is a problem with data or not. I made a toy example on my machine and the same command takes ~9 seconds
import dask.dataframe as dd
import numpy as np
import pandas as pd
from dask.distributed import Client
import dask
client = Client()
# if you wish to connect to the dashboard
client
# fake df size ~2.1 GB
# takes ~180 seconds
N = int(5e6)
df = pd.DataFrame({i: np.random.rand(N)
for i in range(22)})
df.to_csv("data.csv", index=False)
# the following takes ~9 seconds on my machine
dd.read_csv("data.csv").to_parquet("data_pq")
read_csv has a blocksize parameter (docs) that you can use to control the size of the resulting partition and hence, the number of partitions. That, from what I understand, will result in reading the partitions in parallel - each worker will read block size at relevant offset.
You can set blocksize so that it yields the required number of partition to take advantage of the cores you have. For example
cores = 8
size = os.path.getsize('data.csv')
ddf = dd.read_csv("data.csv", blocksize=np.rint(size/cores))
print(ddf.npartitions)
Will output:
8
Better yet, you can try to modify the size so that the resulting parquet has partitions of recommended size (which I have seen different numbers in different places :-|).
Writing out multiple CSV files in parallel is also easy with the repartition method:
df = dd.read_csv('data.csv')
df = df.repartition(npartitions=20)
df.to_parquet('./data_pq', write_index=False, compression='snappy')
Dask likes working with partitions that are around 100 MB, so 20 partitions should be good for a 2GB dataset.
You could also speed this up, by splitting the CSV before reading it, so Dask can read the CSV files in parallel. You could use these tactics to break up the 2GB CSV into 20 different CSVs and then write them out without repartitioning:
df = dd.read_csv('folder_with_small_csvs/*.csv')
df.to_parquet('./data_pq', write_index=False, compression='snappy')

Pandas read_csv large file Performance improvement

Just was wondering if there is a way to improve the performance of reading large csv files into a pandas dataframe. I have 3 large (3.5MM records each) pipe delimited file which I want to load into dataframe and perform some task on it. Currently I am using pandas.read_csv() defining the cols and there datatypes in the parameter like below. I did see some improvement by defining the datatype of the columns but it still takes more than 3 minutes to load.
import pandas as pd
df = pd.read_csv(file_, index_col=None, usecols = sourceFields, sep='|', header=0, dtype={'date':'str', 'gwTimeUtc':'str', 'asset':'|str',
'instrumentId':'|str', 'askPrice':'float64', 'bidPrice':'float64',
'askQuantity':'float64', 'bidQuantity':'float64', 'currency':'|str',
'venue':'|str', 'owner':'|str', 'status':'|str', 'priceNotation':'|str', 'nominalQuantity':'float64'})
Depending on what you wish to do with the data, a good option is dask.dataframe. This library works out-of-memory, and allows you to perform a subset of pandas operations lazily. You can then bring the results in memory as a pandas dataframe. Below is example code you can try:
import dask.dataframe as dd, pandas as pd
# point to all files beginning with "file"
dask_df = dd.read_csv('file*.csv')
# define your calculations as you would in pandas
dask_df['col2'] = dask_df['col1'] * 2
# compute results & return to pandas
df = dask_df.compute()
Crucially, nothing significant is computed until the very last line.
The .feather file is significantly faster than .csv. Pandas has built-in support for feather files.
Read the csv in using pd.read_csv(path) and then export it to a feather file: pd.to_feather(path). Now, read the feather file instead of csv.
In my case, a 950 MB csv file was compressed to a 180 MB feather file. Instead of taking 30 seconds to read, it takes about 1 second. I know I am a bit late to the party, but feather files are seriously underrated.

Large, persistent DataFrame in pandas

I am exploring switching to python and pandas as a long-time SAS user.
However, when running some tests today, I was surprised that python ran out of memory when trying to pandas.read_csv() a 128mb csv file. It had about 200,000 rows and 200 columns of mostly numeric data.
With SAS, I can import a csv file into a SAS dataset and it can be as large as my hard drive.
Is there something analogous in pandas?
I regularly work with large files and do not have access to a distributed computing network.
Wes is of course right! I'm just chiming in to provide a little more complete example code. I had the same issue with a 129 Mb file, which was solved by:
import pandas as pd
tp = pd.read_csv('large_dataset.csv', iterator=True, chunksize=1000) # gives TextFileReader, which is iterable with chunks of 1000 rows.
df = pd.concat(tp, ignore_index=True) # df is DataFrame. If errors, do `list(tp)` instead of `tp`
In principle it shouldn't run out of memory, but there are currently memory problems with read_csv on large files caused by some complex Python internal issues (this is vague but it's been known for a long time: http://github.com/pydata/pandas/issues/407).
At the moment there isn't a perfect solution (here's a tedious one: you could transcribe the file row-by-row into a pre-allocated NumPy array or memory-mapped file--np.mmap), but it's one I'll be working on in the near future. Another solution is to read the file in smaller pieces (use iterator=True, chunksize=1000) then concatenate then with pd.concat. The problem comes in when you pull the entire text file into memory in one big slurp.
This is an older thread, but I just wanted to dump my workaround solution here. I initially tried the chunksize parameter (even with quite small values like 10000), but it didn't help much; had still technical issues with the memory size (my CSV was ~ 7.5 Gb).
Right now, I just read chunks of the CSV files in a for-loop approach and add them e.g., to an SQLite database step by step:
import pandas as pd
import sqlite3
from pandas.io import sql
import subprocess
# In and output file paths
in_csv = '../data/my_large.csv'
out_sqlite = '../data/my.sqlite'
table_name = 'my_table' # name for the SQLite database table
chunksize = 100000 # number of lines to process at each iteration
# columns that should be read from the CSV file
columns = ['molecule_id','charge','db','drugsnow','hba','hbd','loc','nrb','smiles']
# Get number of lines in the CSV file
nlines = subprocess.check_output('wc -l %s' % in_csv, shell=True)
nlines = int(nlines.split()[0])
# connect to database
cnx = sqlite3.connect(out_sqlite)
# Iteratively read CSV and dump lines into the SQLite table
for i in range(0, nlines, chunksize):
df = pd.read_csv(in_csv,
header=None, # no header, define column header manually later
nrows=chunksize, # number of rows to read at each iteration
skiprows=i) # skip rows that were already read
# columns to read
df.columns = columns
sql.to_sql(df,
name=table_name,
con=cnx,
index=False, # don't use CSV file index
index_label='molecule_id', # use a unique column from DataFrame as index
if_exists='append')
cnx.close()
Below is my working flow.
import sqlalchemy as sa
import pandas as pd
import psycopg2
count = 0
con = sa.create_engine('postgresql://postgres:pwd#localhost:00001/r')
#con = sa.create_engine('sqlite:///XXXXX.db') SQLite
chunks = pd.read_csv('..file', chunksize=10000, encoding="ISO-8859-1",
sep=',', error_bad_lines=False, index_col=False, dtype='unicode')
Base on your file size, you'd better optimized the chunksize.
for chunk in chunks:
chunk.to_sql(name='Table', if_exists='append', con=con)
count += 1
print(count)
After have all data in Database, You can query out those you need from database.
If you want to load huge csv files, dask might be a good option. It mimics the pandas api, so it feels quite similar to pandas
link to dask on github
You can use Pytable rather than pandas df.
It is designed for large data sets and the file format is in hdf5.
So the processing time is relatively fast.

Categories