Speeding up read_csv in Python pandas

I am trying to parse a huge csv file (around 50 million rows) using Pandas 'read_csv' method.
Below is the code snippet I am using:
df_chunk = pd.read_csv(db_export_file, delimiter='~!#', engine='python', header=None, keep_default_na=False, na_values=[''], chunksize=10 ** 6, iterator=True)
Thereafter, using the pd.concat method, I get the whole dataframe, which is used for further processing.
Everything works fine except that the read operation on that csv file takes almost 6 minutes to create the dataframe.
My question is: is there any other way to make this process faster using the same module and method?
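For reference, this is roughly how I combine the chunks afterwards (simplified):
chunk_list = []
for chunk in df_chunk:
    # each chunk is a regular pandas DataFrame of up to 10**6 rows
    chunk_list.append(chunk)
df = pd.concat(chunk_list, ignore_index=True)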
Below is the sample data presented as a csv file
155487~!#-64721487465~!#A1_NUM~!#1.000
155487~!#-45875722734~!#A32_ENG~!#This is a good facility
458448~!#-14588001153~!#T12_Timing~!#23-02-2015 14:50:30
458448~!#-10741214586~!#Q2_56!#
Thanks in advance

I think your best choice is to split the csv
split -l LINES_PER_FILE YOUR.CSV OUTPUT_NAMES
and then read all chunks using multiprocessing. You have an example here:
import os
import pandas as pd
from multiprocessing import Pool

# wrap your csv importer in a function that can be mapped
def read_csv(filename):
    'converts a filename to a pandas dataframe'
    return pd.read_csv(filename)

def main():
    # set up your pool
    pool = Pool(processes=8)  # or whatever your hardware can support
    # get a list of file names
    files = os.listdir('.')
    file_list = [filename for filename in files if filename.split('.')[1] == 'csv']
    # have your pool map the file names to dataframes
    df_list = pool.map(read_csv, file_list)
    # reduce the list of dataframes to a single dataframe
    combined_df = pd.concat(df_list, ignore_index=True)

if __name__ == '__main__':
    main()
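If you need the same parsing options as in the original question (multi-character delimiter, no header row), the wrapper function can simply pass them through; a rough sketch reusing the arguments from the question:
def read_csv(filename):
    # same options as in the question; engine='python' is needed for the multi-character delimiter
    return pd.read_csv(filename, delimiter='~!#', engine='python',
                       header=None, keep_default_na=False, na_values=[''])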

My case and how it was solved
I had a similarly huge dataset and had to implement a custom converter. pandas.read_csv() was taking ages because of that converter.
The solution for me was to use modin. It was simple: I just had to change the import at the top and everything else was handled automatically.
Take a look at the page: https://github.com/modin-project/modin
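The change is just the import; a minimal sketch (the file name is a placeholder, and Modin needs one of its supported engines such as Ray or Dask installed):
import modin.pandas as pd

df = pd.read_csv("large_file.csv")  # same pandas API, parallelized under the hood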

Related

How to split a large .csv file using dask?

I am trying to use dask in order to split a huge tab-delimited file into smaller chunks on an AWS Batch array of 100,000 cores.
In AWS Batch each core has a unique environment variable AWS_BATCH_JOB_ARRAY_INDEX ranging from 0 to 99,999 (which is copied into the idx variable in the snippet below). Thus, I am trying to use the following code:
import os
import dask.dataframe as dd
idx = int(os.environ["AWS_BATCH_JOB_ARRAY_INDEX"])
df = dd.read_csv(f"s3://main-bucket/workdir/huge_file.tsv", sep='\t')
df = df.repartition(npartitions=100_000)
df = df.partitions[idx]
df = df.persist() # this call isn't needed before calling to df.to_csv (see comment by Sultan)
df = df.compute() # this call isn't needed before calling to df.to_csv (see comment by Sultan)
df.to_csv(f"/tmp/split_{idx}.tsv", sep="\t", index=False)
print(idx, df.shape, df.head(5))
Do I need to call persist and/or compute before calling df.to_csv?
When I have to split a big file into multiple smaller ones, I simply run the following code.
Read and repartition
import dask.dataframe as dd
df = dd.read_csv("file.csv")
df = df.repartition(npartitions=100)
Save to csv
o = df.to_csv("out_csv/part_*.csv", index=False)
Save to parquet
o = df.to_parquet("out_parquet/")
Here you can use write_metadata_file=False if you want to avoid metadata.
A few notes:
I don't think you really need persist and compute, since you can save directly to disk. When you run into problems like memory errors, it is safer to save to disk rather than compute.
I found the parquet format at least 3x faster than csv when it's time to write (a read-back sketch follows below).
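For completeness, the parquet output can be read back lazily with dask as well; a minimal sketch:
import dask.dataframe as dd

df = dd.read_parquet("out_parquet/")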

Method called twice instead of single call in Dask's multiprocessing

I am trying to download files from a Google Storage bucket and parse them. There are millions of such files that need to be downloaded, parsed, and processed (natural language processing, etc.).
I am trying the code below using dask's parallel processing and it works, but it calls extract_skill twice instead of once for each row in the pandas dataframe. Please help me understand why the extract_skill method is being called twice.
import pandas as pd
import numpy as np
import dask
import dask.dataframe as dd
# downloading file and extract skill sets and store in skill_sets column
chunk_size = 20
df_list = np.array_split(temp_df, temp_df.shape[0]/chunk_size)
temp_df["skill_sets"] = ""
result_df = pd.DataFrame(data={}, columns=temp_df.columns)
for df_ in df_list:
    df_["skill_sets"] = dd.from_pandas(df_, npartitions=4, sort=False, name='x').apply(extract_skill, axis=1, meta='object').compute()
    result_df = pd.concat([result_df, df_], axis=0)
extract_skill()
def extract_skill(row):
    # download file, parse and do some nlp stuff
    file_name = row['file_path']
    ......
    ......
    return skill_sets
Thanks in advance.
The DataFrame.apply method runs your function on a small sample of data in order to determine the datatypes and columns of the output. See the docstring of this function and look for the keyword "meta" for more information.
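As the docstring explains, that inference step can be avoided by supplying meta explicitly; a rough sketch, assuming extract_skill returns a string per row (meta as a (name, dtype) tuple describes the resulting Series):
ddf = dd.from_pandas(df_, npartitions=4)
skills = ddf.apply(extract_skill, axis=1, meta=('skill_sets', 'object'))
df_["skill_sets"] = skills.compute()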

Dask dataframes: reading multiple files & storing filename in column

I regularly use dask.dataframe to read multiple files, as so:
import dask.dataframe as dd
df = dd.read_csv('*.csv')
However, the origin of each row, i.e. which file the data was read from, seems to be forever lost.
Is there a way to add this as a column, e.g. df.loc[:100, 'partition'] = 'file1.csv' if file1.csv is the first file and contains 100 rows? This would be applied to each "partition" / file that is read into the dataframe when compute is triggered as part of a workflow.
The idea is that different logic can then be applied depending on the source.
Dask functions read_csv, read_table, and read_fwf now include a parameter include_path_column:
include_path_column:bool or str, optional
Whether or not to include the path to each particular file.
If True a new column is added to the dataframe called path.
If str, sets new column name. Default is False.
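So, with a reasonably recent dask version, the simplest route is roughly:
import dask.dataframe as dd

df = dd.read_csv('*.csv', include_path_column=True)   # adds a 'path' column
# or, to pick the column name yourself:
df = dd.read_csv('*.csv', include_path_column='partition')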
Assuming you have or can make a file_list list that has the file path of each csv file, and each individual file fits in RAM (you mentioned 100 rows), then this should work:
import pandas as pd
import dask.dataframe as dd
from dask import delayed
def read_and_label_csv(filename):
    # reads each csv file to a pandas.DataFrame
    df_csv = pd.read_csv(filename)
    df_csv['partition'] = filename.split('\\')[-1]
    return df_csv
# create a list of functions ready to return a pandas.DataFrame
dfs = [delayed(read_and_label_csv)(fname) for fname in file_list]
# using delayed, assemble the pandas.DataFrames into a dask.DataFrame
ddf = dd.from_delayed(dfs)
With some customization, of course. If your csv files are bigger than RAM, then a concatenation of dask.DataFrames is probably the way to go.
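In that case, a rough sketch (again assuming a file_list of paths) is to build one dask DataFrame per file, label it, and concatenate lazily:
import dask.dataframe as dd

ddfs = [dd.read_csv(fname).assign(partition=fname) for fname in file_list]
ddf = dd.concat(ddfs)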

How to use several input files and do parallel processing in python?

I have 30 csv files. I want to give them as input in a for loop in pandas.
Each file has a name such as fileaa, fileab, fileac, filead, ....
I have multiple input files and I would like to receive one output.
Usually I use read_csv, but due to a memory error, read_csv doesn't work.
f = "./file.csv"
df = pd.read_csv(f, sep="/", header=0, dtype=str)
So I would like to try parallel processing in Python 2.7.
You might want to have a look at dask.
Dask docs show a demo on how to read in many csv files and output a single dask dataframe:
import dask.dataframe as dd
df = dd.read_csv('*.csv')
And then MANY (but not all) of the pandas methods are available, e.g.:
df.head()
It would be useful to read more on dask dataframes to understand how they differ from pandas dataframes.
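Since dask is lazy, you eventually call compute() to get a single in-memory result (assuming the combined data fits in RAM once loaded); a minimal sketch using the read options from the question:
import dask.dataframe as dd

df = dd.read_csv('file*', sep="/", header=0, dtype=str)  # adjust the pattern to match fileaa, fileab, ...
result = df.compute()                                    # a regular pandas DataFrame
result.to_csv('combined_output.csv', index=False)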

Large, persistent DataFrame in pandas

I am exploring switching to python and pandas as a long-time SAS user.
However, when running some tests today, I was surprised that python ran out of memory when trying to pandas.read_csv() a 128mb csv file. It had about 200,000 rows and 200 columns of mostly numeric data.
With SAS, I can import a csv file into a SAS dataset and it can be as large as my hard drive.
Is there something analogous in pandas?
I regularly work with large files and do not have access to a distributed computing network.
Wes is of course right! I'm just chiming in to provide a little more complete example code. I had the same issue with a 129 Mb file, which was solved by:
import pandas as pd
tp = pd.read_csv('large_dataset.csv', iterator=True, chunksize=1000) # gives TextFileReader, which is iterable with chunks of 1000 rows.
df = pd.concat(tp, ignore_index=True) # df is DataFrame. If errors, do `list(tp)` instead of `tp`
In principle it shouldn't run out of memory, but there are currently memory problems with read_csv on large files caused by some complex Python internal issues (this is vague but it's been known for a long time: http://github.com/pydata/pandas/issues/407).
At the moment there isn't a perfect solution (here's a tedious one: you could transcribe the file row by row into a pre-allocated NumPy array or memory-mapped file (np.memmap)), but it's one I'll be working on in the near future. Another solution is to read the file in smaller pieces (use iterator=True, chunksize=1000), then concatenate them with pd.concat. The problem comes in when you pull the entire text file into memory in one big slurp.
This is an older thread, but I just wanted to dump my workaround solution here. I initially tried the chunksize parameter (even with quite small values like 10000), but it didn't help much; I still had technical issues with memory (my CSV was ~7.5 GB).
Right now, I just read chunks of the CSV file in a for loop and add them, e.g., to an SQLite database step by step:
import pandas as pd
import sqlite3
from pandas.io import sql
import subprocess
# In and output file paths
in_csv = '../data/my_large.csv'
out_sqlite = '../data/my.sqlite'
table_name = 'my_table' # name for the SQLite database table
chunksize = 100000 # number of lines to process at each iteration
# columns that should be read from the CSV file
columns = ['molecule_id','charge','db','drugsnow','hba','hbd','loc','nrb','smiles']
# Get number of lines in the CSV file
nlines = subprocess.check_output('wc -l %s' % in_csv, shell=True)
nlines = int(nlines.split()[0])
# connect to database
cnx = sqlite3.connect(out_sqlite)
# Iteratively read CSV and dump lines into the SQLite table
for i in range(0, nlines, chunksize):

    df = pd.read_csv(in_csv,
                     header=None,      # no header, define column header manually later
                     nrows=chunksize,  # number of rows to read at each iteration
                     skiprows=i)       # skip rows that were already read

    # columns to read
    df.columns = columns

    sql.to_sql(df,
               name=table_name,
               con=cnx,
               index=False,                # don't use CSV file index
               index_label='molecule_id',  # use a unique column from DataFrame as index
               if_exists='append')

cnx.close()
Below is my workflow.
import sqlalchemy as sa
import pandas as pd
import psycopg2
count = 0
con = sa.create_engine('postgresql://postgres:pwd@localhost:00001/r')
#con = sa.create_engine('sqlite:///XXXXX.db') SQLite
chunks = pd.read_csv('..file', chunksize=10000, encoding="ISO-8859-1",
                     sep=',', error_bad_lines=False, index_col=False, dtype='unicode')
Based on your file size, you'd better optimize the chunksize.
for chunk in chunks:
    chunk.to_sql(name='Table', if_exists='append', con=con)
    count += 1
    print(count)
After you have all the data in the database, you can query out what you need from it.
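A rough sketch of pulling back just what you need with pandas, reusing the engine from above (adapt the SQL to your table and columns):
subset = pd.read_sql('SELECT * FROM "Table" LIMIT 1000', con)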
If you want to load huge csv files, dask might be a good option. It mimics the pandas API, so it feels quite similar to pandas.
link to dask on github
You can use PyTables rather than a pandas DataFrame.
It is designed for large data sets and the file format is HDF5.
So the processing time is relatively fast.
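If you want to stay close to pandas, its HDF5 interface is built on top of PyTables; a rough sketch (file, table, and column names are placeholders):
import pandas as pd

# write the csv into an appendable, queryable HDF5 table, chunk by chunk
store = pd.HDFStore('my_data.h5')
for chunk in pd.read_csv('my_large.csv', chunksize=100000):
    store.append('my_table', chunk, data_columns=True)
store.close()

# later, read back only the rows you need instead of the whole file
df = pd.read_hdf('my_data.h5', 'my_table', where='some_column > 0')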
