Using pandas to efficiently read in a large CSV file without crashing - python

I am trying to read a .csv file called ratings.csv from http://grouplens.org/datasets/movielens/20m/. The file is 533.4 MB on my computer.
This is what I am writing in a Jupyter notebook:
import pandas as pd
ratings = pd.read_csv('./movielens/ratings.csv', sep=',')
The problem is that the kernel breaks or dies and asks me to restart, and this keeps repeating. There is no error message. Can you suggest an alternative way of solving this? It is as if my computer lacks the capability to run this.
This partly works, but it keeps overwriting:
chunksize = 20000
for ratings in pd.read_csv('./movielens/ratings.csv', chunksize=chunksize):
    ratings.append(ratings)
ratings.head()
Only the last chunk is kept; the others are discarded.

You should consider using the chunksize parameter in read_csv when reading in your DataFrame: it returns a TextFileReader object, which you can then pass to pd.concat to concatenate your chunks.
chunksize = 100000
tfr = pd.read_csv('./movielens/ratings.csv', chunksize=chunksize, iterator=True)
df = pd.concat(tfr, ignore_index=True)
If you just want to process each chunk individually, use:
chunksize = 20000
for chunk in pd.read_csv('./movielens/ratings.csv',
                         chunksize=chunksize,
                         iterator=True):
    do_something_with_chunk(chunk)
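As a minimal sketch of such per-chunk processing, assuming the standard MovieLens ratings.csv columns (userId, movieId, rating, timestamp) and a purely illustrative filter, you could keep only the rows you need from each chunk so that the concatenated result stays much smaller than the full file:
import pandas as pd

chunksize = 100000
high_ratings = []
for chunk in pd.read_csv('./movielens/ratings.csv', chunksize=chunksize):
    # keep only what you need from each chunk so memory use stays bounded
    high_ratings.append(chunk[chunk['rating'] >= 4.0])

result = pd.concat(high_ratings, ignore_index=True)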

Try it like this: 1) load with Dask, then 2) convert to pandas.
import pandas as pd
import dask.dataframe as dd
import time

t = time.perf_counter()  # time.clock() was removed in Python 3.8
df_train = dd.read_csv('../data/train.csv')
df_train = df_train.compute()
print("load train: ", time.perf_counter() - t)

Related

How to improve the speed of reading multiple csv files in python

It's my first time writing code for processing files with a lot of data, so I'm kind of stuck here.
What I'm trying to do is read a list of paths, list all of the csv files that need to be read, retrieve the HEAD and TAIL from each file, and put them inside a list.
I have 621 csv files in total, each consisting of 5800 rows and 251 columns.
This is a data sample:
[LOGGING],RD81DL96_1,3,4,5,2,,,,
LOG01,,,,,,,,,
DATETIME,INDEX,SHORT[DEC.0],SHORT[DEC.0],SHORT[DEC.0],SHORT[DEC.0],SHORT[DEC.0],SHORT[DEC.0],SHORT[DEC.0],SHORT[DEC.0]
TIME,INDEX,FF-1(1A) ,FF-1(1B) ,FF-1(1C) ,FF-1(2A),FF-2(1A) ,FF-2(1B) ,FF-2(1C),FF-2(2A)
47:29.6,1,172,0,139,1258,0,0,400,0
47:34.6,2,172,0,139,1258,0,0,400,0
47:39.6,3,172,0,139,1258,0,0,400,0
47:44.6,4,172,0,139,1263,0,0,400,0
47:49.6,5,172,0,139,1263,0,0,450,0
47:54.6,6,172,0,139,1263,0,0,450,0
The problem is that while it takes about 13 seconds to read all the files (still kind of slow, honestly), when I add a single append line, the process takes a lot longer to finish, about 4 minutes.
Below is the snippet of the code:
# CsvList: [File Path, Change Date, File Size, File Name]
for x, file in enumerate(CsvList):
    timeColumn = ['TIME']
    df = dd.read_csv(file[0], sep=',', skiprows=3, encoding='CP932', engine='python', usecols=timeColumn)
    # The process became long when this code is added
    startEndList.append(list(df.head(1)) + list(df.tail(1)))
Why did that happen? I'm using dask.dataframe.
Currently, your code isn't really leveraging Dask's parallelizing capabilities because:
df.head and df.tail calls will trigger a "compute" (i.e., convert your Dask DataFrame into a pandas DataFrame -- which is what we try to minimize in lazy evaluations with Dask), and
the for-loop is running sequentially because you're creating Dask DataFrames and converting them to pandas DataFrames, all inside the loop.
So, your current example is similar to just using pandas within the for-loop, but with the added Dask-to-pandas-conversion overhead.
Since you need to work on each of your files, I'd suggest checking out Dask Delayed, which might be more elegant and useful here. The following sketch (with list_of_files standing in for your file paths) will parallelize the pandas operation on each of your files:
import dask
import pandas as pd

results = []
for file in list_of_files:
    # lazy task: read one file with pandas
    df = dask.delayed(pd.read_csv)(file)
    # lazy task: combine the first and last row of that file
    results.append(dask.delayed(pd.concat)([df.head(1), df.tail(1)]))

# execute all the file reads in parallel
results = dask.compute(*results)
The output of dask.visualize(*results) when I used 4 CSV files confirms the parallelism.
If you really want to use Dask DataFrame here, you may try to (a rough sketch follows this list):
read all files into a single Dask DataFrame,
make sure each Dask "partition" corresponds to one file,
use Dask DataFrame apply to get the head and tail values and append them to a new list,
call compute on the new list.
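A rough sketch of that approach, assuming the same CP932-encoded files live in a hypothetical data/ directory; it uses map_partitions (a close relative of apply that here works file by file) and blocksize=None so that each partition corresponds to exactly one file:
import dask.dataframe as dd
import pandas as pd

# blocksize=None gives one partition per file
ddf = dd.read_csv('data/*.csv', skiprows=3, encoding='CP932',
                  usecols=['TIME'], blocksize=None)

def first_and_last(pdf):
    # pdf is the pandas DataFrame backing one partition (i.e., one file)
    return pd.concat([pdf.head(1), pdf.tail(1)])

start_end = ddf.map_partitions(first_and_last).compute()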
A first approach using only plain Python as a starting point:
import io
import pathlib
import pandas as pd

def read_first_and_last_lines(filename):
    with open(filename, 'rb') as fp:
        # skip the first 4 rows (headers)
        [next(fp) for _ in range(4)]
        # first data line
        first_line = fp.readline()
        # jump to 2x the length of the first line before the end of the file
        fp.seek(-2 * len(first_line), 2)
        # last line
        last_line = fp.readlines()[-1]
        return first_line + last_line

data = []
for filename in pathlib.Path('data').glob('*.csv'):
    data.append(read_first_and_last_lines(filename))

buf = io.BytesIO()
buf.writelines(data)
buf.seek(0)

df = pd.read_csv(buf, header=None, encoding='CP932')

How to export a single csv file to multiple Excel sheets through pandas in python

I have imported a large txt file into pandas. Now I want to export it to multiple Excel sheets, as the data is too large to fit in a single sheet.
I used the following commands:
import pandas as pd
df = pd.read_csv('basel.txt',delimiter='|')
df.to_excel('basel.txt')
Unfortunately I got the following error:
ValueError: This sheet is too large! Your sheet size is: 1158008, 18 Max sheet size is: 1048576, 16384
You may read the file in chunks and save each chunk to its own Excel file:
import pandas as pd

chunksize = 10 ** 6
for i, chunk in enumerate(pd.read_csv('basel.txt', delimiter='|', chunksize=chunksize)):
    chunk.to_excel('basel_' + str(i) + '.xlsx')
You can split the DataFrame into chunks and write each chunk to one sheet of the same workbook.
np.array_split splits into a given number of chunks even if the division is unequal;
np.split requires an equal division.
import numpy as np
import pandas as pd

nsheets = 10  # you may change it; each sheet must stay under 1,048,576 rows
with pd.ExcelWriter('basel.xlsx') as writer:
    for i, temp in enumerate(np.array_split(df, nsheets)):
        temp.to_excel(writer, sheet_name=f'sheet_{i}')
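nsheets = 10 is arbitrary; as a quick sketch, the minimum number of sheets can be derived from the row count and the limit quoted in the error message above:
import math

nrows = 1158008               # from the error message
max_rows_per_sheet = 1048576  # Excel's row limit (the header row also counts toward it)
nsheets = math.ceil(nrows / max_rows_per_sheet)
print(nsheets)  # -> 2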
You can write each half of the dataset into its own sheet of the same workbook:
import pandas as pd

df = pd.read_csv('basel.txt', delimiter='|')
with pd.ExcelWriter('basel.xlsx') as writer:
    df.iloc[:df.shape[0] // 2, :].to_excel(writer, sheet_name='First Sheet')
    df.iloc[df.shape[0] // 2:, :].to_excel(writer, sheet_name='Second Sheet')

How to create dataframes from chunks

I have a huge CSV file (630 million rows) and my computer can't read it into one DataFrame (out of memory). Afterwards I want to train a model on each DataFrame. I made 630 chunks and want to create a DataFrame from each chunk (that would be 630 DataFrames). I can't find or understand any solution to this situation. Can someone help me, please? Maybe my thinking is wrong in general and someone can offer a new opinion on this situation. Code:
import os
import pandas as pd

lol = 0

def load_csv():
    path = "D:\\mml\\"
    csv_path = os.path.join(path, "eartquaqe_train.csv")
    return pd.read_csv(csv_path, sep=',', chunksize=1000000)

dannie = load_csv()
for chunk in dannie:
    lol = lol + 1
print(lol)
This prints 630.
Use the pandas.read_csv() method and specify either the chunksize parameter, or create an iterator over all your csv rows using skiprows, like:
import pandas as pd

path = 'D:\...'
a = list(range(0, 6300))
for line in range(0, 6300, 630):
    df = pd.read_csv(path, skiprows=a[0:line] + a[line+630:])
    print(df)
OR
import pandas as pd

path = 'D:\...'
df = pd.read_csv(path, chunksize=6300)
for chunk in df:
    print(chunk)
Use:
for chunk in dannie:
    chunk.to_csv('{}.csv'.format(lol))
    lol += 1
Read here for more info
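As a follow-up sketch, once the chunks are saved you can work on one chunk file at a time instead of holding all 630 in memory (the 0.csv, 1.csv, ... names are assumed to come from the loop above; the per-chunk processing itself is left as a placeholder):
import pandas as pd

for i in range(630):
    df = pd.read_csv('{}.csv'.format(i))
    # ... fit/evaluate your model on this chunk ...
    # df is reassigned on the next iteration, so roughly one chunk is held in memory at a time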

Reading a very large single HDF file with Dask with sorted_index=True

I have a large .h5 file containing a pandas DataFrame, and it is too large to fit into RAM. I expected chunksize to be helpful, but whatever its value, npartitions is always equal to 1 whenever sorted_index=True (i.e., when I know in advance that the index is already sorted).
import dask.dataframe as dd
df = dd.read_hdf('large.h5', key='data', chunksize=10000, mode='r', sorted_index=True)
So when I run:
df.mean().compute()
a single partition is made and my RAM is quickly saturated.
Weirdly,
import dask.dataframe as dd
df = dd.read_hdf('large.h5', key='data', chunksize=10000, mode='r', sorted_index=False)
works perfectly...

How to import a gzip file larger than RAM limit into a Pandas DataFrame? "Kill 9" Use HDF5?

I have a gzipped file which is approximately 90 GB. This is well within my disk space, but far larger than RAM.
How can I import it into a pandas DataFrame? I tried the following on the command line:
# start with Python 3.4.5
import pandas as pd
filename = 'filename.gzip' # size 90 GB
df = pd.read_table(filename, compression='gzip')
However, after several minutes, Python shuts down with Kill 9.
After defining the DataFrame df, I was planning to save it into HDF5.
What is the correct way to do this? How can I use pandas.read_table() for it?
I'd do it this way:
import pandas as pd

filename = 'filename.gzip'  # size 90 GB
hdf_fn = 'result.h5'
hdf_key = 'my_huge_df'
cols = ['colA', 'colB', 'colC', 'colZ']  # put here a list of all your columns
cols_to_index = ['colA', 'colZ']         # put here the list of YOUR columns that you want to index
chunksize = 10**6                        # you may want to adjust it ...

store = pd.HDFStore(hdf_fn)

for chunk in pd.read_table(filename, compression='gzip', header=None, names=cols, chunksize=chunksize):
    # don't index data columns in each iteration - we'll do it later
    store.append(hdf_key, chunk, data_columns=cols_to_index, index=False)

# index data columns in HDFStore
store.create_table_index(hdf_key, columns=cols_to_index, optlevel=9, kind='full')
store.close()
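Once the data is in the HDF5 store, here is a sketch of reading it back selectively rather than all at once (same result.h5 file and my_huge_df key as above; the where clause on colA is only an illustration, valid because colA was declared as an indexed data column):
import pandas as pd

with pd.HDFStore('result.h5', mode='r') as store:
    # pull only the rows you need, filtering on an indexed data column
    subset = store.select('my_huge_df', where='colA > 0')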
