Opening a 20GB file for analysis with pandas

I am new to data science and data analytics, so I hope my question is not too naive. I am trying to open a file with pandas in Python for machine learning purposes, and ideally I would have all of it in a single DataFrame. The file is 18 GB and my machine has 32 GB of RAM, but I keep getting memory errors.
From your experience, is this possible?
If not, do you know of a better way to go about this (a Hive table, increasing my RAM to 64 GB, creating a database and accessing it from Python)?
Any input is welcome!
Thanks in advance.

You should try to read and process one fixed-size chunk of data at a time, using the chunksize parameter of read_csv:
for chunk in pd.read_csv(f, sep=' ', header=None, chunksize=512):
    # process your chunk here
    pass

Can you work with the data in chunks? If so you can use the iterator interface of pandas to go through the file.
df_iterator = pd.read_csv('test.csv', index_col=0, iterator=True, chunksize=5)
for df in df_iterator:
    print(df)
    # do something meaningful
    print('finished iteration on {} rows'.format(df.shape[0]))
    print()
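As an illustration of what "do something meaningful" per chunk can look like, here is a minimal sketch that computes the mean of one column while holding only one chunk in memory at a time. The function name chunked_column_mean and the column names are made up for the example; only pd.read_csv with chunksize comes from the answers above.

```python
import io

import pandas as pd


def chunked_column_mean(source, column, chunksize=100_000, **read_csv_kwargs):
    """Mean of one column, accumulated chunk by chunk (hypothetical helper)."""
    total = 0.0
    count = 0
    for chunk in pd.read_csv(source, chunksize=chunksize, **read_csv_kwargs):
        values = chunk[column].dropna()
        total += values.sum()   # running sum over all chunks seen so far
        count += len(values)    # running count of non-missing values
    return total / count if count else float("nan")


# Tiny in-memory stand-in for a large file
demo = io.StringIO("id,value\n1,10\n2,20\n3,30\n4,40\n")
print(chunked_column_mean(demo, "value", chunksize=2))
```

This only works for computations that can be combined across chunks (sums, counts, filters). If the model really needs every row at once, chunking alone does not help.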

Related

Python-pandas "read_csv" is not reading the whole .TXT file

First of all, I have found several questions with the same title/topic here and I have tried the solutions that have been suggested, but none has worked for me
Here is the issue:
I want to extract a sample of workers from a huge .txt file (> 50 GB)
I am using HPC cluster for this purpose.
Every row in the data represents a worker with many attributes (column variables). The idea is to extract a subsample of workers based on the first two letters of the ID variable:
df = pd.read_csv('path-to-my-txt-file', encoding= 'ISO-8859-1', sep = '\t', low_memory=False, error_bad_lines=False, dtype=str)
df = df.rename(columns = {'Worker ID' : 'worker_id'})
# extract subsample based on the first 2 letters of the worker id
new_df = df[df.worker_id.str.startswith('DK', na=False)]
new_df.to_csv('DK_worker.csv', index = False)
The problem is that the resulting CSV file has only 10-15% of the rows that should be there (I have another source of information on the approximate number of rows to expect).
I think the data has some encoding issues. I have tried other encodings such as 'utf-8' and 'latin_1', but nothing changed.
Do you see anything wrong in this code that may cause this problem? Have I missed some argument?
I am not a Python expert :)
Many thanks in advance.

A 50 GB file is unlikely to fit in your computer's RAM, so loading it in one go will not work. I also doubt the csv module can handle files of that size. What you need to do is read the file in small pieces and process each piece in turn.
def process_data(piece):
    # process the chunk ...
    pass
def read_in_chunks(file_object, chunk_size=1024):
    # yield successive pieces of at most chunk_size characters
    while True:
        data = file_object.read(chunk_size)
        if not data:
            break
        yield data
with open('path-to-my-txt-file.csv') as f:
    for piece in read_in_chunks(f):
        process_data(piece)
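One caveat with fixed-size pieces is that a piece can end in the middle of a row. A row-aligned variant, sketched here under the assumption that the file parses with pd.read_csv, is to let pandas do the chunking and append the matching rows to the output as you go. The helper name extract_by_prefix and its parameters are invented for this example.

```python
import pandas as pd


def extract_by_prefix(src, dst, id_column, prefix, chunksize=100_000, **read_csv_kwargs):
    """Stream src in row-aligned chunks; write rows whose id starts with prefix to dst.

    Hypothetical helper: returns the number of rows written.
    """
    rows_written = 0
    first = True
    for chunk in pd.read_csv(src, dtype=str, chunksize=chunksize, **read_csv_kwargs):
        matched = chunk[chunk[id_column].str.startswith(prefix, na=False)]
        # The first chunk creates the file with a header, later chunks append
        matched.to_csv(dst, mode="w" if first else "a", header=first, index=False)
        first = False
        rows_written += len(matched)
    return rows_written
```

For the question above, the call would look something like extract_by_prefix('path-to-my-txt-file', 'DK_worker.csv', 'Worker ID', 'DK', sep='\t', encoding='ISO-8859-1'). Comparing the returned count with the expected number of rows is a quick way to see whether rows are being lost during parsing.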

Read a large csv as a Pandas DataFrame faster

I have a csv that I am reading into a Pandas DataFrame but it takes about 35 minutes to read. The csv is approximately 120 GB. I found a module called cudf that allows a GPU DataFrame however it is only for Linux. Is there something similar for Windows?
chunk_list = []
combined_array = pd.DataFrame()
for chunk in tqdm(pd.read_csv('\\large_array.csv', header=None,
                              low_memory=False, error_bad_lines=False, chunksize=10000)):
    print(' --- Complete')
    chunk_list.append(chunk)
array = pd.concat(chunk_list)
print(array)

You can also look at dask.dataframe if you want to read the file into a DataFrame with a pandas-like API.
For reading CSVs, it parallelizes the I/O across multiple cores and, on a cluster, across nodes. Spreading the data across nodes will probably also relieve memory pressure, since with a 120 GB CSV you are likely to be memory-bound too.
Another good alternative might be Arrow (pyarrow).

Do you have a GPU? If so, take a look at BlazingSQL, a GPU SQL engine in a Python package.
The article "Querying a Terabyte with BlazingSQL" describes it, and BlazingSQL supports reading from CSV.
Once you have a GPU DataFrame, convert it to a pandas DataFrame with:
# from cuDF DataFrame to pandas DataFrame
df = gdf.to_pandas()

IPython in jupyter notebooks: reading a large datafile with pandas becomes very slow (high memory consumption?)

I have a huge data file I want to process in jupyter notebook.
I use pandas in a for loop for that and specify which lines I'm reading from the file:
import pandas as pd
import numpy as np
import gc
from tqdm import tqdm
# Create a training file with simple derived features
rowstoread = 150_000
chunks = 50
for segment in tqdm(range(chunks)):
    # skip the data rows read by earlier segments, keeping the header (line 0)
    rowstoskip = range(1, segment*rowstoread + 1) if segment > 0 else 0
    chunk = pd.read_csv("datafile.csv", dtype={'attribute_1': np.int16, 'attribute_2': np.float64}, skiprows=rowstoskip, nrows=rowstoread)
    x = chunk['attribute_1'].values
    y = chunk['attribute_2'].values[-1]
    # process data here and try to get rid of memory afterwards
    del chunk, x, y
    gc.collect()
Although I try to free the memory of the data I have read, the import starts fast and then becomes slower and slower as the chunk number grows.
Is there something I'm missing? Does anyone know the reason for this and how to fix it?
Thanks in advance,
smaica
Edit:
Thanks to @Wen-Ben I can circumvent this issue with the chunksize option of pandas read_csv. Nevertheless, I'm wondering why this happens.
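A likely explanation, offered as a sketch rather than a confirmed diagnosis: every read_csv call with skiprows reopens the file and has to scan past all the skipped lines before it reaches the rows it wants, so later segments cost more than earlier ones. A single reader created with chunksize walks the file once. The function name last_value_per_chunk is made up here; the column names and dtypes are the ones from the question.

```python
import io

import numpy as np
import pandas as pd


def last_value_per_chunk(source, rows_per_chunk):
    """Read source once, in consecutive chunks, and derive one value per chunk."""
    results = []
    reader = pd.read_csv(
        source,
        dtype={"attribute_1": np.int16, "attribute_2": np.float64},
        chunksize=rows_per_chunk,
    )
    for chunk in reader:
        x = chunk["attribute_1"].values             # features of this segment
        y = float(chunk["attribute_2"].values[-1])  # last target value of the segment
        results.append((len(x), y))
    return results


# Tiny in-memory stand-in for datafile.csv
demo = io.StringIO("attribute_1,attribute_2\n1,0.5\n2,1.5\n3,2.5\n4,3.5\n5,4.5\n")
print(last_value_per_chunk(demo, rows_per_chunk=2))
```

With this layout there is no need for del or gc.collect(): each chunk is released when the loop variable is rebound on the next iteration.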

From my experience gc.collect() doesn't do much.
If you have a large file that fits on disk but not in memory, you can use other libraries such as SFrame.
Here's an example to read a csv file:
sf = SFrame(data='~/mydata/foo.csv')
The API is very similar to Pandas.

Pandas read_stata() with large .dta files

I am working with a Stata .dta file that is around 3.3 gigabytes, so it is large but not excessively large. I am interested in using IPython and tried to import the .dta file using pandas, but something wonky is going on. My box has 32 gigabytes of RAM, and attempting to load the .dta file results in all the RAM being used (after ~30 minutes) and my computer stalling. This doesn't 'feel' right, in that I am able to open the file in R using read.dta() from the foreign package with no problem, and working with the file in Stata is fine. The code I am using is:
%time myfile = pd.read_stata(data_dir + 'my_dta_file.dta')
and I am using IPython in Enthought's Canopy program. The reason for the '%time' is because I am interested in benchmarking this against R's read.dta().
My questions are:
Is there something I am doing wrong that is resulting in Pandas having issues?
Is there a workaround to get the data into a Pandas dataframe?

Here is a little function that has been handy for me, using some pandas features that might not have been available when the question was originally posed:
def load_large_dta(fname):
    import sys
    reader = pd.read_stata(fname, iterator=True)
    chunks = []  # DataFrame.append was removed in pandas 2.0, so collect and concat
    try:
        chunk = reader.get_chunk(100*1000)
        while len(chunk) > 0:
            chunks.append(chunk)
            chunk = reader.get_chunk(100*1000)
            print('.', end='')
            sys.stdout.flush()
    except (StopIteration, KeyboardInterrupt):
        pass
    df = pd.concat(chunks, ignore_index=True) if chunks else pd.DataFrame()
    print('\nloaded {} rows'.format(len(df)))
    return df
I loaded an 11 GB Stata file in 100 minutes with this, and it's nice to have something to play with if I get tired of waiting and hit Ctrl-C.
This notebook shows it in action.

For everyone who ends up on this page: please upgrade pandas to the latest version. I had this exact problem with a stalled computer during load (a 300 MB Stata file with only 8 GB of system RAM), and upgrading from v0.14 to v0.16.2 solved the issue in a snap.
At the time of writing, the latest version is v0.16.2. There have been significant speed improvements, though I don't know the specifics. See: most efficient I/O setup between Stata and Python (Pandas)

There is a simpler way to solve it using Pandas' built-in function read_stata.
Assume your large file is named large.dta.
import pandas as pd
reader = pd.read_stata("large.dta", chunksize=100000)
chunks = []  # DataFrame.append was removed in pandas 2.0
for itm in reader:
    chunks.append(itm)
df = pd.concat(chunks)
df.to_csv("large.csv")
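If the goal is only the CSV file, the chunks do not need to be collected in memory at all. The following sketch writes each chunk to the output as soon as it is read. The helper name dta_to_csv is invented for this example, and it assumes a pandas version in which the reader returned by read_stata can be used as a context manager.

```python
import pandas as pd


def dta_to_csv(dta_path, csv_path, chunksize=100_000):
    """Convert a Stata file to CSV chunk by chunk (hypothetical helper).

    Returns the number of rows written.
    """
    rows = 0
    with pd.read_stata(dta_path, chunksize=chunksize) as reader:
        for i, chunk in enumerate(reader):
            # The first chunk creates the file with a header, later chunks append
            chunk.to_csv(csv_path, mode="w" if i == 0 else "a",
                         header=(i == 0), index=False)
            rows += len(chunk)
    return rows
```

Peak memory is then bounded by one chunk rather than by the whole file, at the cost of not having the full DataFrame available afterwards.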

Question 1.
There's not much I can say about this.
Question 2.
Consider exporting your .dta file to .csv using Stata command outsheet or export delimited and then using read_csv() in pandas. In fact, you could take the newly created .csv file, use it as input for R and compare with pandas (if that's of interest). read_csv is likely to have had more testing than read_stata.
Run help outsheet for details of the exporting.

You should not be reading a 3GB+ file into an in-memory data object, that's a recipe for disaster (and has nothing to do with pandas).
The right way to do this is to mem-map the file and access the data as needed.
You should consider converting your file to a more appropriate format (CSV or HDF5) and then use the Dask wrapper around the pandas DataFrame to load the data in chunks as needed:
from dask import dataframe as dd
# If you don't want to use all the columns, make a selection
columns = ['column1', 'column2']
# usecols is forwarded to pandas.read_csv
data = dd.read_csv('your_file.csv', usecols=columns)
This will transparently take care of chunk-loading, multicore data handling and all that stuff.

Large, persistent DataFrame in pandas

As a long-time SAS user, I am exploring switching to Python and pandas.
However, when running some tests today, I was surprised that Python ran out of memory when trying to pandas.read_csv() a 128 MB CSV file. It had about 200,000 rows and 200 columns of mostly numeric data.
With SAS, I can import a csv file into a SAS dataset and it can be as large as my hard drive.
Is there something analogous in pandas?
I regularly work with large files and do not have access to a distributed computing network.

Wes is of course right! I'm just chiming in to provide a slightly more complete code example. I had the same issue with a 129 MB file, which was solved by:
import pandas as pd
tp = pd.read_csv('large_dataset.csv', iterator=True, chunksize=1000) # gives TextFileReader, which is iterable with chunks of 1000 rows.
df = pd.concat(tp, ignore_index=True) # df is DataFrame. If errors, do `list(tp)` instead of `tp`

In principle it shouldn't run out of memory, but there are currently memory problems with read_csv on large files caused by some complex Python internal issues (this is vague, but it has been known for a long time: http://github.com/pydata/pandas/issues/407).
At the moment there isn't a perfect solution (here's a tedious one: you could transcribe the file row by row into a pre-allocated NumPy array or a memory-mapped file, np.memmap), but it's one I'll be working on in the near future. Another solution is to read the file in smaller pieces (use iterator=True, chunksize=1000) and then concatenate them with pd.concat. The problem comes in when you pull the entire text file into memory in one big slurp.
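The "tedious" option can be sketched roughly as follows. This is an illustration under strong assumptions, not the approach pandas itself takes: the file has no header, every field is numeric, and the number of rows and columns is known in advance. The helper name csv_to_memmap is hypothetical.

```python
import csv

import numpy as np


def csv_to_memmap(csv_path, mmap_path, n_rows, n_cols, dtype=np.float64):
    """Copy a headerless, all-numeric CSV row by row into a disk-backed array."""
    # mode="w+" creates (or overwrites) the backing file with the given shape
    out = np.memmap(mmap_path, dtype=dtype, mode="w+", shape=(n_rows, n_cols))
    with open(csv_path, newline="") as f:
        for i, row in enumerate(csv.reader(f)):
            out[i, :] = [float(v) for v in row]
    out.flush()  # make sure the data reaches the backing file
    return out
```

Only one text row is held in Python at a time, and the array itself lives in the backing file, so the operating system decides how much of it stays in RAM.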

This is an older thread, but I just wanted to leave my workaround here. I initially tried the chunksize parameter (even with quite small values like 10000), but it didn't help much; I still ran into memory problems (my CSV was ~7.5 GB).
Right now, I just read chunks of the CSV file in a for loop and add them step by step to, e.g., an SQLite database:
import pandas as pd
import sqlite3
from pandas.io import sql
import subprocess
# In and output file paths
in_csv = '../data/my_large.csv'
out_sqlite = '../data/my.sqlite'
table_name = 'my_table' # name for the SQLite database table
chunksize = 100000 # number of lines to process at each iteration
# columns that should be read from the CSV file
columns = ['molecule_id','charge','db','drugsnow','hba','hbd','loc','nrb','smiles']
# Get number of lines in the CSV file
nlines = subprocess.check_output('wc -l %s' % in_csv, shell=True)
nlines = int(nlines.split()[0])
# connect to database
cnx = sqlite3.connect(out_sqlite)
# Iteratively read CSV and dump lines into the SQLite table
for i in range(0, nlines, chunksize):
    df = pd.read_csv(in_csv,
                     header=None,  # no header, define column header manually later
                     nrows=chunksize,  # number of rows to read at each iteration
                     skiprows=i)  # skip rows that were already read
    # columns to read
    df.columns = columns
    sql.to_sql(df,
               name=table_name,
               con=cnx,
               index=False,  # don't use CSV file index
               index_label='molecule_id',  # use a unique column from DataFrame as index
               if_exists='append')
cnx.close()

Below is my workflow.
import sqlalchemy as sa
import pandas as pd
import psycopg2
count = 0
con = sa.create_engine('postgresql://postgres:pwd@localhost:00001/r')
# con = sa.create_engine('sqlite:///XXXXX.db')  # SQLite
chunks = pd.read_csv('..file', chunksize=10000, encoding="ISO-8859-1",
                     sep=',', error_bad_lines=False, index_col=False, dtype='unicode')
Depending on your file size, you should tune the chunksize.
for chunk in chunks:
    chunk.to_sql(name='Table', if_exists='append', con=con)
    count += 1
    print(count)
Once all the data is in the database, you can query out just the rows you need.
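To make the last step concrete, here is a small self-contained sketch of the load-then-query pattern using SQLite from the standard library. The table name, column names, and the helper csv_to_sqlite are placeholders chosen for the example, and an in-memory database stands in for a real file.

```python
import io
import sqlite3

import pandas as pd


def csv_to_sqlite(source, con, table, chunksize=100_000, **read_csv_kwargs):
    """Append a CSV to a SQLite table chunk by chunk; returns rows loaded."""
    rows = 0
    for chunk in pd.read_csv(source, chunksize=chunksize, **read_csv_kwargs):
        chunk.to_sql(table, con, if_exists="append", index=False)
        rows += len(chunk)
    return rows


con = sqlite3.connect(":memory:")
demo = io.StringIO("molecule_id,charge\nm1,0\nm2,1\nm3,-1\nm4,1\n")
csv_to_sqlite(demo, con, "my_table", chunksize=2)

# Pull back only the rows of interest instead of the whole table
subset = pd.read_sql_query(
    "SELECT molecule_id FROM my_table WHERE charge = ?", con, params=(1,)
)
print(subset)
```

The database does the filtering, so only the selected rows are ever materialized as a DataFrame.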

If you want to load huge CSV files, Dask might be a good option. It mimics the pandas API, so it feels quite similar to pandas.
See the Dask project on GitHub.

You can use PyTables rather than a pandas DataFrame.
It is designed for large data sets and stores data in the HDF5 file format,
so processing is relatively fast.
