Large, persistent DataFrame in pandas - python

I am exploring switching to python and pandas as a long-time SAS user.
However, when running some tests today, I was surprised that Python ran out of memory when trying to pandas.read_csv() a 128 MB CSV file. It had about 200,000 rows and 200 columns of mostly numeric data.
With SAS, I can import a csv file into a SAS dataset and it can be as large as my hard drive.
Is there something analogous in pandas?
I regularly work with large files and do not have access to a distributed computing network.

Wes is of course right! I'm just chiming in to provide slightly more complete example code. I had the same issue with a 129 MB file, which was solved by:
import pandas as pd
tp = pd.read_csv('large_dataset.csv', iterator=True, chunksize=1000) # gives TextFileReader, which is iterable with chunks of 1000 rows.
df = pd.concat(tp, ignore_index=True) # df is DataFrame. If errors, do `list(tp)` instead of `tp`

In principle it shouldn't run out of memory, but there are currently memory problems with read_csv on large files caused by some complex Python internal issues (this is vague, but it has been known for a long time: http://github.com/pydata/pandas/issues/407).
At the moment there isn't a perfect solution (here's a tedious one: you could transcribe the file row by row into a pre-allocated NumPy array or a memory-mapped file, np.memmap), but it's one I'll be working on in the near future. Another solution is to read the file in smaller pieces (use iterator=True, chunksize=1000) and then concatenate them with pd.concat. The problem comes in when you pull the entire text file into memory in one big slurp.
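As a rough sketch of the "smaller pieces" idea: each chunk can be shrunk before it is kept, so the peak memory is lower than reading everything as float64 at once. The file name, column names and sizes below are made up for illustration, and the float32 downcast is my own assumption about what precision is acceptable.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Build a small stand-in CSV so the sketch runs anywhere (hypothetical data).
tmp_dir = tempfile.mkdtemp()
csv_path = os.path.join(tmp_dir, "large_dataset.csv")
sample = pd.DataFrame(np.arange(5000 * 4).reshape(5000, 4) / 7.0,
                      columns=["a", "b", "c", "d"])
sample.to_csv(csv_path, index=False)

pieces = []
# With chunksize set, read_csv returns an iterator of DataFrames.
for chunk in pd.read_csv(csv_path, chunksize=1000):
    # Downcast each piece before keeping it (float64 -> float32 halves its size).
    pieces.append(chunk.astype(np.float32))

# Stitch the reduced pieces back into a single DataFrame.
df = pd.concat(pieces, ignore_index=True)
print(df.shape)
```

If you only need an aggregate (a sum, a count, a filtered subset), it is usually better to reduce each chunk inside the loop and skip the final concat entirely.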

This is an older thread, but I just wanted to dump my workaround solution here. I initially tried the chunksize parameter (even with quite small values like 10000), but it didn't help much; I still had technical issues with the memory size (my CSV was ~7.5 GB).
Right now, I just read chunks of the CSV files in a for-loop approach and add them e.g., to an SQLite database step by step:
import pandas as pd
import sqlite3
from pandas.io import sql
import subprocess
# In and output file paths
in_csv = '../data/my_large.csv'
out_sqlite = '../data/my.sqlite'
table_name = 'my_table' # name for the SQLite database table
chunksize = 100000 # number of lines to process at each iteration
# columns that should be read from the CSV file
columns = ['molecule_id','charge','db','drugsnow','hba','hbd','loc','nrb','smiles']
# Get number of lines in the CSV file
nlines = subprocess.check_output('wc -l %s' % in_csv, shell=True)
nlines = int(nlines.split()[0])
# connect to database
cnx = sqlite3.connect(out_sqlite)
# Iteratively read CSV and dump lines into the SQLite table
for i in range(0, nlines, chunksize):
    df = pd.read_csv(in_csv,
                     header=None,      # no header, define column header manually later
                     nrows=chunksize,  # number of rows to read at each iteration
                     skiprows=i)       # skip rows that were already read
    # columns to read
    df.columns = columns
    sql.to_sql(df,
               name=table_name,
               con=cnx,
               index=False,  # don't use CSV file index
               index_label='molecule_id',  # use a unique column from DataFrame as index
               if_exists='append')
cnx.close()
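One thing to be aware of with the loop above: every read_csv call with skiprows=i has to scan past the rows it skips, so later iterations get slower. A possible variant, sketched here with made-up data and only two of the columns, lets read_csv hand out the chunks itself so the file is read in a single pass. The paths and table name are placeholders, not the ones from my real setup.

```python
import os
import sqlite3
import tempfile

import pandas as pd

# Hypothetical header-less sample file standing in for the real CSV.
tmp_dir = tempfile.mkdtemp()
in_csv = os.path.join(tmp_dir, "my_large.csv")
out_sqlite = os.path.join(tmp_dir, "my.sqlite")
pd.DataFrame({"molecule_id": range(250), "charge": [0.5] * 250}).to_csv(
    in_csv, header=False, index=False)

columns = ["molecule_id", "charge"]
cnx = sqlite3.connect(out_sqlite)

# chunksize turns read_csv into an iterator, so the file is read once, front to back.
for chunk in pd.read_csv(in_csv, header=None, names=columns, chunksize=100):
    chunk.to_sql("my_table", cnx, index=False, if_exists="append")

# Sanity check: count the rows that ended up in the table.
n_rows = cnx.execute("SELECT COUNT(*) FROM my_table").fetchone()[0]
cnx.close()
print(n_rows)
```

This also removes the need for the wc -l call, since the iterator simply stops at the end of the file.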

Below is my workflow.
import sqlalchemy as sa
import pandas as pd
import psycopg2
count = 0
con = sa.create_engine('postgresql://postgres:pwd@localhost:00001/r')
# con = sa.create_engine('sqlite:///XXXXX.db')  # SQLite alternative
# Note: error_bad_lines was removed in pandas 2.0; use on_bad_lines='skip' there.
chunks = pd.read_csv('..file', chunksize=10000, encoding="ISO-8859-1",
                     sep=',', error_bad_lines=False, index_col=False, dtype='unicode')
Depending on your file size, you should tune the chunksize.
for chunk in chunks:
    chunk.to_sql(name='Table', if_exists='append', con=con)
    count += 1
    print(count)
Once all the data is in the database, you can query whatever you need from it.
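For the querying step, here is a minimal sketch using an in-memory SQLite database instead of PostgreSQL so that it runs without a server. The table name, columns and data are hypothetical; the point is only that the filter runs in the database and that results can again be pulled back in chunks.

```python
import sqlite3

import pandas as pd

# In-memory SQLite stands in for the real database (assumption for this sketch).
con = sqlite3.connect(":memory:")
pd.DataFrame({"id": range(1000),
              "value": [i % 10 for i in range(1000)]}).to_sql(
    "my_table", con, index=False)

# Pull back only the rows you need; the WHERE clause is evaluated by the database.
subset = pd.read_sql_query(
    "SELECT id, value FROM my_table WHERE value = ?", con, params=(3,))

# Larger result sets can be streamed back in pieces with chunksize.
total = 0
for part in pd.read_sql_query("SELECT value FROM my_table", con, chunksize=250):
    total += int(part["value"].sum())

con.close()
print(len(subset), total)
```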

If you want to load huge CSV files, dask might be a good option. It mimics the pandas API, so it feels quite similar to pandas.
(See the dask project on GitHub.)

You can use PyTables rather than a pandas DataFrame.
It is designed for large data sets and stores data in the HDF5 file format,
so processing is relatively fast.

Related

How to split a large .csv file using dask?

I am trying to use dask in order to split a huge tab-delimited file into smaller chunks on an AWS Batch array of 100,000 cores.
In AWS Batch each core has a unique environment variable AWS_BATCH_JOB_ARRAY_INDEX ranging from 0 to 99,999 (which is copied into the idx variable in the snippet below). Thus, I am trying to use the following code:
import os
import dask.dataframe as dd
idx = int(os.environ["AWS_BATCH_JOB_ARRAY_INDEX"])
df = dd.read_csv(f"s3://main-bucket/workdir/huge_file.tsv", sep='\t')
df = df.repartition(npartitions=100_000)
df = df.partitions[idx]
df = df.persist() # this call isn't needed before calling to df.to_csv (see comment by Sultan)
df = df.compute() # this call isn't needed before calling to df.to_csv (see comment by Sultan)
df.to_csv(f"/tmp/split_{idx}.tsv", sep="\t", index=False)
print(idx, df.shape, df.head(5))
Do I need to call persist and/or compute before calling df.to_csv?
When I have to split a big file into multiple smaller ones, I simply run the following code.
Read and repartition
import dask.dataframe as dd
df = dd.read_csv("file.csv")
df = df.repartition(npartitions=100)
Save to csv
o = df.to_csv("out_csv/part_*.csv", index=False)
Save to parquet
o = df.to_parquet("out_parquet/")
Here you can use write_metadata_file=False if you want to avoid metadata.
A few notes:
I don't think you really need persist and compute, as you can save directly to disk. When you run into problems such as memory errors, it is safer to save to disk than to compute.
I found writing Parquet to be at least 3x faster than writing CSV.
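If dask is not available, roughly the same split can be done with plain pandas. This is a swapped-in alternative and not the dask approach above: it reads the file in fixed-size chunks and writes each one to its own part file. File names and the chunk size are assumptions for the sketch.

```python
import glob
import os
import tempfile

import pandas as pd

# Hypothetical input file so the sketch is runnable as-is.
tmp_dir = tempfile.mkdtemp()
src = os.path.join(tmp_dir, "file.csv")
pd.DataFrame({"x": range(1050), "y": range(1050)}).to_csv(src, index=False)

out_dir = os.path.join(tmp_dir, "out_csv")
os.makedirs(out_dir, exist_ok=True)

# Only one chunk of 100 rows is held in memory at any time.
for i, chunk in enumerate(pd.read_csv(src, chunksize=100)):
    chunk.to_csv(os.path.join(out_dir, f"part_{i:03d}.csv"), index=False)

parts = sorted(glob.glob(os.path.join(out_dir, "part_*.csv")))
print(len(parts))
```

Unlike repartition(npartitions=...), this fixes the rows per part rather than the number of parts, so the part count depends on the file length.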

Melt a big data frame without pandas

I have a 3GB dataset with 40k rows and 60k columns which Pandas is unable to read and I would like to melt the file based on the current index.
The current file looks like this:
The first column is an index and I would like to melt the whole file based on this index.
I tried pandas and dask, but both crash when reading the big file.
Do you have any suggestions?
thanks
You need to use the chunksize parameter of pandas.read_csv. See for example How to read a 6 GB csv file with pandas.
You will process N rows at a time, without loading the whole dataframe. The right N depends on your computer: a low N uses less memory, but it increases the run time and the I/O load.
import pandas as pd

# create a reader that yields the file 100 rows at a time
reader = pd.read_csv('bigfile.tsv', sep='\t', header=None, chunksize=100)
# process one chunk at a time
for chunk in reader:
    # with header=None the columns are numbered, so column 0 is the index column
    result = chunk.melt(id_vars=[0])
    # append the results to a new file
    result.to_csv('bigfile_melted.tsv', header=None, sep='\t', mode='a')
Furthermore, you can pass dtype=np.int32 to read_csv if your data are integers, or dtype=np.float32 if you do not need full precision; the smaller types use less memory and are faster to process.
NB: here you have examples of memory usage: Using Chunksize in Pandas.
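To get a feel for the dtype effect, here is a small sketch that reads the same (randomly generated, hypothetical) data once with the default float64 and once with dtype=np.float32, then compares the memory used by the columns.

```python
import io

import numpy as np
import pandas as pd

# Generate some numeric CSV text in memory (made-up data for illustration).
rng = np.random.default_rng(0)
csv_text = pd.DataFrame(rng.random((2000, 20))).to_csv(index=False)

# Default parsing gives float64 columns; dtype=np.float32 applies to every column.
df64 = pd.read_csv(io.StringIO(csv_text))
df32 = pd.read_csv(io.StringIO(csv_text), dtype=np.float32)

# Bytes used by the column data only (index excluded).
bytes64 = int(df64.memory_usage(index=False).sum())
bytes32 = int(df32.memory_usage(index=False).sum())
print(bytes64, bytes32)
```

The same dtype argument can be combined with chunksize, in which case every chunk is parsed with the smaller type.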

Read CSV with PyArrow

I have large CSV files that I'd ultimately like to convert to parquet. Pandas won't help because of memory constraints and its difficulty handling NULL values (which are common in my data). I checked the PyArrow docs and there are tools for reading parquet files, but I didn't see anything about reading CSVs. Did I miss something, or is this feature somehow incompatible with PyArrow?
We're working on this feature, there is a pull request up now: https://github.com/apache/arrow/pull/2576. You can help by testing it out!
You can read the CSV in chunks with pd.read_csv(chunksize=...), then write a chunk at a time with PyArrow.
The one caveat is, as you mentioned, Pandas will give inconsistent dtypes if you have a column that is all nulls in one chunk, so you have to make sure the chunk size is larger than the longest run of nulls in your data.
This reads CSV from stdin and writes Parquet to stdout (Python 3).
#!/usr/bin/env python
import sys
import pandas as pd
import pyarrow.parquet
# This has to be big enough you don't get a chunk of all nulls: https://issues.apache.org/jira/browse/ARROW-2659
SPLIT_ROWS = 2 ** 16
def main():
    writer = None
    for split in pd.read_csv(sys.stdin.buffer, chunksize=SPLIT_ROWS):
        table = pyarrow.Table.from_pandas(split, preserve_index=False)
        # Timestamps have issues if you don't convert to ms. https://github.com/dask/fastparquet/issues/82
        writer = writer or pyarrow.parquet.ParquetWriter(sys.stdout.buffer, table.schema, coerce_timestamps='ms', compression='gzip')
        writer.write_table(table)
    writer.close()

if __name__ == "__main__":
    main()
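The all-nulls caveat is easy to reproduce with pandas alone. The sketch below uses a tiny made-up CSV where the first chunk of the name column is entirely empty; with inferred types the two chunks disagree, while an explicit dtype mapping (my suggestion here, not part of the script above) keeps them consistent regardless of chunk size.

```python
import io

import pandas as pd

# Tiny hypothetical CSV: "name" is empty in the first two rows.
csv_text = "id,name\n1,\n2,\n3,x\n4,y\n"

# Inferred dtypes: the all-null chunk is parsed as float64, the other as a string type.
inferred = [str(chunk["name"].dtype)
            for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2)]

# Explicit dtypes: every chunk gets the same column types.
explicit = [str(chunk["name"].dtype)
            for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2,
                                     dtype={"id": "int64", "name": "string"})]

print(inferred, explicit)
```

Declaring dtypes up front only works if you know the schema in advance; otherwise the large-chunk approach above is the fallback.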

Pandas read_csv large file Performance improvement

I was just wondering if there is a way to improve the performance of reading large CSV files into a pandas dataframe. I have 3 large (3.5MM records each) pipe-delimited files which I want to load into a dataframe and perform some tasks on. Currently I am using pandas.read_csv(), defining the columns and their datatypes in the parameters as shown below. I did see some improvement from defining the datatypes of the columns, but it still takes more than 3 minutes to load.
import pandas as pd
df = pd.read_csv(file_, index_col=None, usecols=sourceFields, sep='|', header=0,
                 dtype={'date':'str', 'gwTimeUtc':'str', 'asset':'|str',
                        'instrumentId':'|str', 'askPrice':'float64', 'bidPrice':'float64',
                        'askQuantity':'float64', 'bidQuantity':'float64', 'currency':'|str',
                        'venue':'|str', 'owner':'|str', 'status':'|str', 'priceNotation':'|str', 'nominalQuantity':'float64'})
Depending on what you wish to do with the data, a good option is dask.dataframe. This library works out-of-core (on data larger than memory), and allows you to perform a subset of pandas operations lazily. You can then bring the results into memory as a pandas dataframe. Below is example code you can try:
import dask.dataframe as dd, pandas as pd
# point to all files beginning with "file"
dask_df = dd.read_csv('file*.csv')
# define your calculations as you would in pandas
dask_df['col2'] = dask_df['col1'] * 2
# compute results & return to pandas
df = dask_df.compute()
Crucially, nothing significant is computed until the very last line.
Reading a .feather file is significantly faster than reading a .csv. Pandas has built-in support for feather files.
Read the CSV with df = pd.read_csv(path) and then export it to a feather file with df.to_feather(path). From then on, read the feather file with pd.read_feather(path) instead of the CSV.
In my case, a 950 MB csv file was compressed to a 180 MB feather file. Instead of taking 30 seconds to read, it takes about 1 second. I know I am a bit late to the party, but feather files are seriously underrated.

Pandas MemoryError when reading large CSV followed by `.iloc` slicing columns

I've been trying to process a 1.4 GB CSV file with pandas, but keep having memory problems. I have tried different things in an attempt to make pandas read_csv work, to no avail.
It didn't work when I used the iterator=True and chunksize=number parameters. Moreover, the smaller the chunksize, the slower it is to process the same amount of data.
(Simple per-chunk overhead doesn't explain it, because it was far too slow when the number of chunks was big. I suspect that when processing each chunk, pandas needs to go through all the chunks before it to "get to it", instead of jumping right to the start of the chunk. This seems to be the only way this can be explained.)
Then, as a last resort, I split the CSV file into 6 parts and tried to read them one by one, but I still get a MemoryError.
(I have monitored the memory usage of Python when running the code below, and found that each time Python finishes processing a file and moves on to the next, the memory usage goes up. It seemed quite obvious that pandas didn't release the memory for the previous file after it had finished processing it.)
The code may not make sense but that's because I removed the part where it writes into an SQL database to simplify it and isolate the problem.
import csv,pandas as pd
import glob
filenameStem = 'Crimes'
counter = 0
for filename in glob.glob(filenameStem + '_part*.csv'): # reading files Crimes_part1.csv through Crimes_part6.csv
    chunk = pd.read_csv(filename)
    df = chunk.iloc[:,[5,8,15,16]]
    df = df.dropna(how='any')
    counter += 1
    print(counter)
You may try to parse only the columns that you need (as @BrenBarn said in the comments):
import os
import glob
import pandas as pd
def get_merged_csv(flist, **kwargs):
    return pd.concat([pd.read_csv(f, **kwargs) for f in flist], ignore_index=True)
fmask = 'Crimes_part*.csv'
cols = [5,8,15,16]
df = get_merged_csv(glob.glob(fmask), index_col=None, usecols=cols).dropna(how='any')
print(df.head())
PS this will include only 4 out of at least 17 columns in your resulting data frame
Thanks for the reply.
After some debugging, I have located the problem. The "iloc" subsetting in pandas created a circular reference, which prevented garbage collection. Detailed discussion can be found here
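If the circular-reference explanation applies to your pandas version (I have not verified which versions are affected, so treat that as an assumption), one possible workaround is to drop the references yourself and ask the garbage collector to run after each file. The sketch below uses three tiny made-up part files and different column positions than the real data.

```python
import gc
import glob
import os
import tempfile

import pandas as pd

# Hypothetical stand-ins for Crimes_part1.csv ... Crimes_part3.csv.
tmp_dir = tempfile.mkdtemp()
for part in range(1, 4):
    pd.DataFrame({"a": [1, 2, None], "b": [4, 5, 6], "c": [7, 8, 9]}).to_csv(
        os.path.join(tmp_dir, f"Crimes_part{part}.csv"), index=False)

rows_kept = 0
for filename in sorted(glob.glob(os.path.join(tmp_dir, "Crimes_part*.csv"))):
    chunk = pd.read_csv(filename)
    df = chunk.iloc[:, [0, 2]].dropna(how="any")
    rows_kept += len(df)  # placeholder for the real work (e.g. writing to SQL)
    # Drop both names, then collect, so reference cycles cannot keep the data alive.
    del chunk, df
    gc.collect()

print(rows_kept)
```

Reading only the needed columns with usecols, as in the answer above, avoids the iloc step altogether and is probably the cleaner fix.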
I have hit the same issue with CSV files. Read the CSV in chunks with a fixed chunk size: use the chunksize or iterator parameter to return the data in chunks.
Syntax:
csv_onechunk = pandas.read_csv(filepath, sep=delimiter, skiprows=1, chunksize=10000)
Then concatenate the chunks (only valid with the C parser).
