How can I chunk through a CSV using Arrow? - python

What I am trying to do
I am using PyArrow to read some CSVs and convert them to Parquet. Some of the files I read have plenty of columns and have a high memory footprint (enough to crash the machine running the job). I am trying to chunk through the file while reading the CSV in a similar way to how Pandas read_csv with chunksize works.
For example this is how the chunking code would work in pandas:
chunks = pandas.read_csv(data, chunksize=100, iterator=True)
# Iterate through chunks
for chunk in chunks:
    do_stuff(chunk)
I want to port similar functionality to Arrow.
What I have tried to do
I noticed that Arrow has ReadOptions which include a block_size parameter, and I thought maybe I could use it like:
# Reading in-memory csv file
arrow_table = arrow_csv.read_csv(
    input_file=input_buffer,
    read_options=arrow_csv.ReadOptions(
        use_threads=True,
        block_size=4096
    )
)
# Iterate through batches
for batch in arrow_table.to_batches():
    do_stuff(batch)
As this (block_size) does not seem to return an iterator, I am under the impression that this will still make Arrow read the entire table in memory and thus recreate my problem.
Lastly, I am aware that I can first read the csv using Pandas and chunk through it then convert to Arrow tables. But I am trying to avoid using Pandas and only use Arrow.
I am happy to provide additional information if needed

The function you are looking for is pyarrow.csv.open_csv which returns a pyarrow.csv.CSVStreamingReader. The size of the batches will be controlled by the block_size option you noticed. For a complete example:
import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.csv
in_path = '/home/pace/dev/benchmarks-proj/benchmarks/data/nyctaxi_2010-01.csv.gz'
out_path = '/home/pace/dev/benchmarks-proj/benchmarks/data/temp/iterative.parquet'
convert_options = pyarrow.csv.ConvertOptions()
convert_options.column_types = {
    'rate_code': pa.utf8(),
    'store_and_fwd_flag': pa.utf8()
}
writer = None
with pyarrow.csv.open_csv(in_path, convert_options=convert_options) as reader:
    for next_chunk in reader:
        if next_chunk is None:
            break
        if writer is None:
            writer = pq.ParquetWriter(out_path, next_chunk.schema)
        next_table = pa.Table.from_batches([next_chunk])
        writer.write_table(next_table)
writer.close()
This example also highlights one of the challenges the streaming CSV reader introduces. It needs to return batches with consistent data types. However, when parsing CSV you typically need to infer the data type. In my example data the first few MB of the file have integral values for the rate_code column. Somewhere in the middle of the batch there is a non-integer value (* in this case) for that column. To work around this issue you can specify the types for columns up front as I am doing here.

Related

Read very huge csv file in chunks using generators and pandas in python

I have very large CSV files of around 40 GB. How can I read one chunk by chunk and add a column with the value "today's date"?
I first tried reading the file directly, and my system crashed.
Then I used chunks in pd.read_csv, which is one solution.
I was wondering whether someone could suggest a way to do this with generators and add the column to every chunk.
I think using pd.read_csv with chunksize is already quite like using a generator.
This will add a new column to the end, and assign a value of 1 to each row for the column.
with open('test.csv', 'r') as fin, open('test_output.csv', 'w') as fout:
    line = fin.readline()
    fout.write(line.rstrip('\n') + ',new_column\n')
    while True:
        line = fin.readline()
        if line:
            fout.write(line.rstrip('\n') + ',1\n')
        else:
            break
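Since the question asks about generators, here is a minimal standard-library sketch that yields fixed-size chunks of rows and appends today's date to each row. The function names, the chunk size, and the column name added_on are assumptions chosen for illustration.

```python
import csv
import io
from datetime import date
from itertools import islice


def iter_chunks(rows, chunk_size):
    """Yield lists of up to chunk_size items from any iterable."""
    iterator = iter(rows)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk


def add_date_column(fin, fout, chunk_size=100_000, column="added_on"):
    """Copy CSV rows from fin to fout chunk by chunk, appending today's date."""
    today = date.today().isoformat()
    reader = csv.reader(fin)
    writer = csv.writer(fout)
    header = next(reader, None)
    if header is None:
        return  # empty input: nothing to copy
    writer.writerow(header + [column])
    for chunk in iter_chunks(reader, chunk_size):
        writer.writerows(row + [today] for row in chunk)


# In-memory demonstration; with real files, open both with newline="".
src = io.StringIO("id,name\n1,a\n2,b\n3,c\n")
dst = io.StringIO()
add_date_column(src, dst, chunk_size=2)
print(dst.getvalue())
```

At most chunk_size rows are held in memory at any time, which is the same property that pd.read_csv with chunksize provides.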
Here is a convtools based example:
from datetime import date
from convtools import conversion as c
from convtools.contrib.tables import Table
Table.from_csv("input.csv", header=True).update(
    **{"name of the new column": date.today().strftime("%Y-%m-%d")}
).into_csv("output.csv")
It will process the stream from one file to another, applying the transformation in the middle.

Read CSV with PyArrow

I have large CSV files that I'd ultimately like to convert to parquet. Pandas won't help because of memory constraints and its difficulty handling NULL values (which are common in my data). I checked the PyArrow docs and there are tools for reading parquet files, but I didn't see anything about reading CSVs. Did I miss something, or is this feature somehow incompatible with PyArrow?
We're working on this feature, there is a pull request up now: https://github.com/apache/arrow/pull/2576. You can help by testing it out!
You can read the CSV in chunks with pd.read_csv(chunksize=...), then write a chunk at a time with Pyarrow.
The one caveat is, as you mentioned, Pandas will give inconsistent dtypes if you have a column that is all nulls in one chunk, so you have to make sure the chunk size is larger than the longest run of nulls in your data.
This reads CSV from stdin and writes Parquet to stdout (Python 3).
#!/usr/bin/env python
import sys
import pandas as pd
import pyarrow.parquet
# This has to be big enough you don't get a chunk of all nulls: https://issues.apache.org/jira/browse/ARROW-2659
SPLIT_ROWS = 2 ** 16
def main():
    writer = None
    for split in pd.read_csv(sys.stdin.buffer, chunksize=SPLIT_ROWS):
        table = pyarrow.Table.from_pandas(split, preserve_index=False)
        # Timestamps have issues if you don't convert to ms. https://github.com/dask/fastparquet/issues/82
        writer = writer or pyarrow.parquet.ParquetWriter(sys.stdout.buffer, table.schema, coerce_timestamps='ms', compression='gzip')
        writer.write_table(table)
    writer.close()
if __name__ == "__main__":
    main()

Reading a part of csv file

I have a really large CSV file, about 10 GB. Whenever I try to read it into an IPython notebook using
data = pd.read_csv("data.csv")
my laptop gets stuck. Is it possible to read just 10,000 rows or 500 MB of a CSV file?
It is possible. You can create an iterator yielding chunks of your csv of a certain size at a time as a DataFrame by passing iterator=True with your desired chunksize to read_csv.
df_iter = pd.read_csv('data.csv', chunksize=10000, iterator=True)
for iter_num, chunk in enumerate(df_iter, 1):
    print(f'Processing iteration {iter_num}')
    # do things with chunk
Or more briefly
for chunk in pd.read_csv('data.csv', chunksize=10000):
    pass  # do things with chunk
Alternatively if there was just a specific part of the csv you wanted to read, you could use the skiprows and nrows options to start at a particular line and subsequently read n rows, as the naming suggests.
Likely a memory issue. On read_csv you can set chunksize (where you can specify number of rows).
Alternatively, if you don't need all the columns, you can change usecols on read_csv to import only the columns you need.
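The skiprows, nrows, and usecols options mentioned above can be combined to read one window of a file. The following is a sketch on made-up in-memory data; the helper name read_window and the column names are assumptions for illustration only.

```python
import io

import pandas as pd

# Hypothetical data: a header line followed by data rows 0..99.
csv_text = "id,value,unused\n" + "".join(f"{i},{i * 10},x\n" for i in range(100))


def read_window(source, start, n_rows, columns=None):
    """Read n_rows data rows beginning at data row `start` (0-based)."""
    return pd.read_csv(
        source,
        skiprows=range(1, start + 1),  # skip file lines 1..start, keep header line 0
        nrows=n_rows,                  # stop after this many data rows
        usecols=columns,               # None means all columns
    )


window = read_window(io.StringIO(csv_text), start=20, n_rows=5, columns=["id", "value"])
print(window)
```

Passing a range to skiprows rather than a plain integer keeps the header line, so the column names survive even when the window starts deep in the file.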

With a gzip dataframe, how can I read/decompress this line by line?

I have an extremely large dataframe saved as a gzip file. The data also needs a good deal of manipulation before being saved.
One could try to convert this entire gzip dataframe into text format, save this to a variable, parse/clean the data, and then save as a .csv file via pandas.read_csv(). However, this is extremely memory intensive.
I would like to read/decompress this file line by line (as this would be the most memory-efficient solution, I think), parse this (e.g. with regex re or perhaps a pandas solution) and then save each line into a pandas dataframe.
Python has a gzip library for this:
import csv
import gzip
import pandas as pd
with gzip.open('filename.gzip', 'rt', newline='') as input_file:
    reader = csv.reader(input_file, delimiter="\t")
    data = [row for row in reader]  # materializes every row in memory
df = pd.DataFrame(data)
However, this seems to drop all information into the 'reader' variable, and then parses. How can one do this in a more (memory) efficient manner?
Should I be using a different library instead of gzip?
It's not quite clear what you want to do with your huge GZIP file. IIUC you can't read the whole data into memory, because your GZIP file is huge. So the only option you have is to process your data in chunks.
Assuming that you want to read your data from the GZIP file, process it and write it to compressed HDF5 file:
import pandas as pd
hdf_key = 'my_hdf_ID'
cols_to_index = ['colA','colZ'] # list of indexed columns, use `cols_to_index=True` if you want to index ALL columns
store = pd.HDFStore('/path/to/filename.h5')
chunksize = 10**5
# r'\s+' splits on runs of whitespace; the raw string avoids an invalid escape sequence
for chunk in pd.read_csv('filename.gz', sep=r'\s+', chunksize=chunksize):
    # process data in the `chunk` DF
    # don't index data columns in each iteration - we'll do it later
    store.append(hdf_key, chunk, data_columns=cols_to_index, index=False, complib='blosc', complevel=4)
# index data columns in HDFStore
store.create_table_index(hdf_key, columns=cols_to_index, optlevel=9, kind='full')
store.close()
Perhaps extract your data with gunzip -c, pipe it to your Python script and work with standard input there:
$ gunzip -c source.gz | python ./line_parser.py | gzip -c - > destination.gz
In the Python script line_parser.py:
#!/usr/bin/env python
import sys
for line in sys.stdin:
    sys.stdout.write(line)
Replace sys.stdout.write(line) with code to process each line in your custom way.
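The same line-by-line processing can also stay inside Python by opening the gzip file in text mode and parsing rows lazily. This is a minimal sketch; the function name iter_gzip_rows and the tab-separated sample data are assumptions for illustration.

```python
import csv
import gzip
import io


def iter_gzip_rows(source, delimiter="\t"):
    """Lazily yield parsed rows from a gzip-compressed delimited file.

    `source` may be a path or a binary file object.
    """
    with gzip.open(source, mode="rt", newline="") as handle:
        # csv.reader pulls one decompressed line at a time from the handle
        yield from csv.reader(handle, delimiter=delimiter)


# Hypothetical data, compressed in memory for the demonstration.
raw = "a\tb\n1\t2\n3\t4\n".encode()
compressed = io.BytesIO(gzip.compress(raw))
rows = list(iter_gzip_rows(compressed))  # list() only for the demo
print(rows)
```

On real data, iterate over the generator and clean each row as it arrives instead of calling list(), so that only a small buffer of decompressed text is in memory at once.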
Have you considered using HDFStore:
HDFStore is a dict-like object which reads and writes pandas using the high performance HDF5 format using the excellent PyTables library. See the cookbook for some advanced strategies
Create Store, save DataFrame and close store.
# Note compression.
store = pd.HDFStore('my_store.h5', mode='w', complevel=9, complib='blosc')
with store:
    store['my_dataframe'] = df
Reopen store, retrieve dataframe and close store.
with pd.HDFStore('my_store.h5', mode='r') as store:
    df = store.get('my_dataframe')

Large, persistent DataFrame in pandas

I am exploring switching to python and pandas as a long-time SAS user.
However, when running some tests today, I was surprised that python ran out of memory when trying to pandas.read_csv() a 128mb csv file. It had about 200,000 rows and 200 columns of mostly numeric data.
With SAS, I can import a csv file into a SAS dataset and it can be as large as my hard drive.
Is there something analogous in pandas?
I regularly work with large files and do not have access to a distributed computing network.
Wes is of course right! I'm just chiming in to provide a little more complete example code. I had the same issue with a 129 Mb file, which was solved by:
import pandas as pd
tp = pd.read_csv('large_dataset.csv', iterator=True, chunksize=1000) # gives TextFileReader, which is iterable with chunks of 1000 rows.
df = pd.concat(tp, ignore_index=True) # df is DataFrame. If errors, do `list(tp)` instead of `tp`
In principle it shouldn't run out of memory, but there are currently memory problems with read_csv on large files caused by some complex Python internal issues (this is vague but it's been known for a long time: http://github.com/pydata/pandas/issues/407).
At the moment there isn't a perfect solution (here's a tedious one: you could transcribe the file row-by-row into a pre-allocated NumPy array or a memory-mapped file, np.memmap), but it's one I'll be working on in the near future. Another solution is to read the file in smaller pieces (use iterator=True, chunksize=1000), then concatenate them with pd.concat. The problem comes in when you pull the entire text file into memory in one big slurp.
This is an older thread, but I just wanted to dump my workaround solution here. I initially tried the chunksize parameter (even with quite small values like 10000), but it didn't help much; I still had technical issues with the memory size (my CSV was ~7.5 GB).
Right now, I just read chunks of the CSV files in a for-loop approach and add them e.g., to an SQLite database step by step:
import pandas as pd
import sqlite3
from pandas.io import sql
import subprocess
# In and output file paths
in_csv = '../data/my_large.csv'
out_sqlite = '../data/my.sqlite'
table_name = 'my_table' # name for the SQLite database table
chunksize = 100000 # number of lines to process at each iteration
# columns that should be read from the CSV file
columns = ['molecule_id','charge','db','drugsnow','hba','hbd','loc','nrb','smiles']
# Get number of lines in the CSV file
nlines = subprocess.check_output('wc -l %s' % in_csv, shell=True)
nlines = int(nlines.split()[0])
# connect to database
cnx = sqlite3.connect(out_sqlite)
# Iteratively read CSV and dump lines into the SQLite table
for i in range(0, nlines, chunksize):
    df = pd.read_csv(in_csv,
                     header=None, # no header, define column header manually later
                     nrows=chunksize, # number of rows to read at each iteration
                     skiprows=i) # skip rows that were already read
    # columns to read
    df.columns = columns
    sql.to_sql(df,
               name=table_name,
               con=cnx,
               index=False, # don't use CSV file index
               index_label='molecule_id', # use a unique column from DataFrame as index
               if_exists='append')
cnx.close()
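Note that the loop above reopens the CSV on every iteration and has to skip past all previously read lines. A single-pass alternative is sketched below using only the standard library (csv and sqlite3 instead of pandas, which is a swapped-in technique rather than the approach of the answer above). The function name, the sample rows, and the untyped table schema are assumptions for illustration.

```python
import csv
import io
import sqlite3
from itertools import islice


def load_csv_into_sqlite(lines, connection, table, columns, chunk_size=100_000):
    """Insert header-less CSV rows into `table` in one pass, chunk by chunk.

    `table` and `columns` are interpolated into SQL, so they must come from
    trusted code, never from user input.
    """
    column_list = ", ".join(f'"{name}"' for name in columns)
    placeholders = ", ".join("?" for _ in columns)
    connection.execute(f'CREATE TABLE IF NOT EXISTS "{table}" ({column_list})')
    insert_sql = f'INSERT INTO "{table}" ({column_list}) VALUES ({placeholders})'
    reader = csv.reader(lines)
    total = 0
    while True:
        chunk = list(islice(reader, chunk_size))  # at most chunk_size rows in memory
        if not chunk:
            break
        connection.executemany(insert_sql, chunk)
        connection.commit()  # one transaction per chunk
        total += len(chunk)
    return total


# Hypothetical three-row input and an in-memory database.
cnx = sqlite3.connect(":memory:")
sample = io.StringIO("m1,0,zinc\nm2,1,zinc\nm3,-1,chembl\n")
inserted = load_csv_into_sqlite(
    sample, cnx, "my_table", ["molecule_id", "charge", "db"], chunk_size=2
)
print(inserted)
```

With a real file, pass an object opened with open(path, newline="") as the lines argument. Because the columns are created without declared types, every value is stored as text in this sketch.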
Below is my workflow.
import sqlalchemy as sa
import pandas as pd
import psycopg2  # PostgreSQL driver used by SQLAlchemy
count = 0
con = sa.create_engine('postgresql://postgres:pwd@localhost:00001/r')
#con = sa.create_engine('sqlite:///XXXXX.db') SQLite
# on_bad_lines='skip' requires pandas >= 1.3; older versions used error_bad_lines=False
chunks = pd.read_csv('..file', chunksize=10000, encoding="ISO-8859-1",
                     sep=',', on_bad_lines='skip', index_col=False, dtype='unicode')
Based on your file size, you should tune the chunksize.
for chunk in chunks:
    chunk.to_sql(name='Table', if_exists='append', con=con)
    count += 1
    print(count)
Once all the data is in the database, you can query out whatever you need from it.
If you want to load huge csv files, dask might be a good option. It mimics the pandas api, so it feels quite similar to pandas
link to dask on github
You can use PyTables rather than a pandas DataFrame.
It is designed for large data sets and the file format is in hdf5.
So the processing time is relatively fast.
