Speed up pandas csv read and subsequent downcast - python

Straightforward question - I'm doing the following:
train_set = pd.read_csv('./input/train_1.csv').fillna(0)
for col in train_set.columns[1:]:
    train_set[col] = pd.to_numeric(train_set[col], downcast='integer')
The first column of the dataframe is a string; the rest are ints. read_csv gives floats, which I don't need. The downcast results in an almost 50% reduction in RAM used, but slows the process down significantly. Can I do the whole thing in one step? Or does anybody know how to multithread this?
thx

I would suggest you try these two approaches and compare the performance:
Convert when you read the file
# or uint8/int16/int64, depending on your data
pd.read_csv('input.txt', sep=' ', dtype=np.int32)
# or you can use converters with a lambda function
pd.read_csv('test.csv', sep=' ', converters={'1': lambda x: int(x)})
Convert your dataframe after reading the file
df['MyColumnName'] = df['MyColumnName'].astype(int)
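For the original example specifically (string first column, integer rest), another option worth benchmarking is to read the header first and build a dtype dict, so the cast happens during parsing instead of in a second pass. This is only a sketch, assuming the first column is the only non-numeric one; the nullable 'Int32' dtype is used so missing values don't break the parse:
import pandas as pd

# Read just the header to get the column names.
cols = pd.read_csv('./input/train_1.csv', nrows=0).columns

# Nullable Int32 tolerates empty cells during parsing; a plain numpy int32 dtype
# would raise if the file contains missing values.
dtypes = {c: 'Int32' for c in cols[1:]}

train_set = pd.read_csv('./input/train_1.csv', dtype=dtypes).fillna(0)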

Related

how to make a sparse pandas DataFrame from a csv file

I have a rather large (1.3 GB, unzipped) csv file, with 2 dense columns and 1.4 K sparse columns, about 1 M rows.
I need to make a pandas.DataFrame from it.
For small files I can simply do:
df = pd.read_csv('file.csv')
For the large file I have now, I get a memory error, clearly due to the DataFrame size (tested by sys.getsizeof(df)).
Based on this document:
https://pandas.pydata.org/pandas-docs/stable/user_guide/sparse.html#migrating
it looks like I can make a DataFrame with mixed dense and sparse columns.
However, I can only see instructions to add individual sparse columns, not a chunk of them all together, from the csv file.
Reading the csv sparse columns one by one and adding them to df using:
for colname_i in names_of_sparse_columns:
    data = pd.read_csv('file.csv', usecols=[colname_i])
    df[colname_i] = pd.arrays.SparseArray(data.values.transpose()[0])
works, and df stays very small, as desired, but the execution time is absurdly long.
I tried of course:
pd.read_csv(path_to_input_csv, usecols = names_of_sparse_columns, dtype = "Sparse[float]")
but that generates this error:
NotImplementedError: Extension Array: <class 'pandas.core.arrays.sparse.array.SparseArray'> must implement _from_sequence_of_strings in order to be used in parser methods
Any idea how I can do this more efficiently?
I checked several posts, but they all seem to be after something slightly different from this.
EDIT adding a small example, to clarify
import numpy as np
import pandas as pd
import sys

# Create an unpivoted sparse dataset
lengths = list(np.random.randint(low=1, high=5, size=10000))
cols = []
for l in lengths:
    cols.extend(list(np.random.choice(100, size=l, replace=False)))
rows = np.repeat(np.arange(10000), lengths)
vals = np.repeat(1, sum(lengths))
df_unpivoted = pd.DataFrame({"row": rows, "col": cols, "val": vals})

# Pivot and save to a csv file
df = df_unpivoted.pivot(index="row", columns="col", values="val")
df.to_csv("sparse.csv", index=False)
This file occupies 1 MB on my PC.
Instead:
sys.getsizeof(df)
# 8080016
This looks like 8 MB to me.
So there is clearly a large increase in size when making a pd.DataFrame from a sparse csv file (in this case I made the file from the data frame, but it's the same as reading in the csv file using pd.read_csv()).
And this is my point: I cannot use pd.read_csv() to load the whole csv file into memory.
Here it's only 8 MB, that's no problem at all; with the actual 1.3 GB csv I referred to, it goes to such a huge size that it crashes our machine's memory.
I guess it's easy to try that, by replacing 10000 with 1000000 and 100 with 1500 in the above simulation.
If I do instead:
names_of_sparse_columns = df.columns.values
df_sparse = pd.DataFrame()
for colname_i in names_of_sparse_columns:
    data = pd.read_csv('sparse.csv', usecols=[colname_i])
    df_sparse[colname_i] = pd.arrays.SparseArray(data.values.transpose()[0])
The resulting object is much smaller:
sys.getsizeof(df_sparse)
# 416700
In fact even smaller than the file.
And this is my second point: doing this column-by-column addition of sparse columns is very slow.
I was looking for advice on how to make df_sparse from a file like "sparse.csv" faster / more efficiently.
In fact, while I was writing this example, I noticed that:
sys.getsizeof(df_unpivoted)
# 399504
So maybe the solution could be to read the csv file line by line and unpivot it. The rest of the handling I need to do however would still require that I write out a pivoted csv, so back to square one.
EDIT 2 more information
I might as well describe the rest of the handling I need to do, too.
When I can use a non-sparse data frame, there is an ID column in the file:
df["ID"] = list(np.random.choice(20, df.shape[0]))
I need to make a summary of how many data points exist, per ID, per data column:
df.groupby("ID").count()
The unfortunate bit is that the sparse data frame does not support this.
I found a workaround, but it's very inefficient and slow.
If anyone can advise on that aspect, too, it would be useful.
I would have guessed there would be a way to load the sparse part of the csv into some form of sparse array, and make a summary by ID.
Maybe I'm approaching this completely the wrong way, and that's why I am asking this large competent audience for advice.
I don't have the faintest idea why someone would have made a CSV in that format. I would just read it in as chunks and fix the chunks.
# Read in chunks of data and melt each one into a dataframe that makes sense
data = [c.melt(id_vars=dense_columns, var_name="Column_label", value_name="Thing").dropna()
        for c in pd.read_csv('file.csv', iterator=True, chunksize=100000)]

# Concat the data together
data = pd.concat(data, axis=0)
Change the chunksize and the name of the value column as needed. You could also read in chunks and turn the chunks into a sparse dataframe if needed, but it seems that you'd be better off with a melted dataframe for what you want to do, IMO.
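Since the end goal is a per-ID count per column, the melted frame makes that summary straightforward. A minimal sketch, assuming "ID" is among the dense_columns used above and the value column is named "Thing" as in the snippet:
# Count non-missing values per ID and per original column label,
# then unstack to a wide summary table.
summary = (data.groupby(["ID", "Column_label"])["Thing"]
               .count()
               .unstack(fill_value=0))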
You can always chunk it again going the other way as well. Change the number of chunks as needed for your data.
with open('out_file.csv', mode='w') as out:
    for i, chunk in enumerate(np.array_split(df, 100)):
        chunk.iloc[:, 2:] = chunk.iloc[:, 2:].sparse.to_dense()
        chunk.to_csv(out, header=i == 0)
The same file.csv should not be read on every iteration; this line of code:
data = pd.read_csv('file.csv', ...)
should be moved ahead of the for-loop.
To iterate through names_of_sparse_columns:
df = pd.read_csv('file.csv', header = 0).copy()
data = pd.read_csv('file.csv', header = 0).copy()
for colname_i in names_of_sparse_columns:
    dataFromThisSparseColumn = data[colname_i]
    df[colname_i] = np.reshape(dataFromThisSparseColumn, -1)

Read only last column from large txt data file with a lot of columns

I'm a novice in Python.
I'm writing up a jupyter notebook for data analysis which is supposed to work on already provided datafiles.
These datafiles (.txt) each contain a large table of floats, with delimiter ' '. They are ugly in the sense that they have relatively few rows (~2k) and a lot of columns (~100k).
The "single-file" detailed analysis works fine (I have more than enough RAM to load one of these files entirely in memory, e.g. via np.loadtxt(), and work on it); but I then wanted to attempt a multi-file cross analysis in which I would be only interested in the last column of each file. I cannot find a fast/efficient/nice way of doing this.
What I can do is to np.loadtxt() these files one at a time, then each time copy the last column of the resulting array and delete the rest; and repeat. This is painfully slow but it's working. I was wondering if I could do better!
I also tried this, inspired by something I saw searching the web:
data = []
for i in range(N_istar):
    for j in range(N_col_pos):
        with open(filename(i, j), 'r') as f:
            lastcol = []
            line = f.readline()
            while line:
                sp = line.split()
                lastcol.append(sp[-1])
            data.append(lastcol)
but this either goes on forever or takes a ridiculous amount of time.
Any suggestions?
You can use pandas read_csv(usecols=). You must know the index or name of the column. The code is clean and short; see the example below.
In case you do not know the index of the last column, you can read the first row and count the number of separators.
Example
test.csv
a b c d
0 1 2 3
2 4 6 8
python code
import pandas as pd

separator = r"\s+"  # the default is ","; using a regex separator does make parsing slower

# column names
pd.read_csv('test.csv', sep=separator, usecols=['d'])

# column index
pd.read_csv('test.csv', sep=separator, header=None, usecols=[3])

# Unknown number of columns
with open('test.csv') as current_file:
    last_column_index = len(current_file.readline().split()) - 1
pd.read_csv('test.csv', sep=separator, header=None, usecols=[last_column_index])
You were on the right track with np.loadtxt. Files are usually read and processed quickly in Python with either np or pd.
The best trick would be to read each file using pandas/numpy and concatenate the last columns (if you have enough memory). If you run out of memory, drop the columns you don't use and keep only the last one.
Just to give you the right direction:
df1 = pd.DataFrame({'A': np.random.choice([1, 2, 3, 4, 5, 6], size=5),
                    'B': np.random.choice([1, 2, 3, 4, 5, 6, 7, 8], size=5)})
print(df1)

df2 = pd.DataFrame({'A': np.random.choice([1, 2, 3, 4, 5, 6], size=5),
                    'B': np.random.choice([1, 2, 3, 4, 5, 6, 7, 8], size=5)})
print(df2)

conc_df = pd.concat([df1['B'], df2['B']])
print(conc_df.head(10))
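To apply this idea to the actual files, one could read only the last column of each file with usecols and concatenate the results. A sketch under assumptions: the filenames are placeholders, and the files are whitespace-delimited as described in the question.
import pandas as pd

# Hypothetical list of the data files described above.
files = ['run_001.txt', 'run_002.txt', 'run_003.txt']

last_cols = []
for path in files:
    # Peek at the first line to find the index of the last column.
    with open(path) as fh:
        last_idx = len(fh.readline().split()) - 1
    # Read only that column.
    col = pd.read_csv(path, sep=r"\s+", header=None, usecols=[last_idx])
    last_cols.append(col.iloc[:, 0])

# One column per input file.
result = pd.concat(last_cols, axis=1, keys=files)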
NumPy also provides a reshape() method for you.
ex: f.csv
1,2,3
4,5,6
7,8,9
code
import numpy as np
data = np.loadtxt('f.csv', delimiter=',')
last_col = data[:, -1:].reshape(1,-1)[0]
result
>>>last_col.tolist()
[3.0, 6.0, 9.0]

KeyError when reading CSV in chunks with pandas [duplicate]

I am trying to read a large csv file (approx. 6 GB) in pandas and I am getting a memory error:
MemoryError Traceback (most recent call last)
<ipython-input-58-67a72687871b> in <module>()
----> 1 data=pd.read_csv('aphro.csv',sep=';')
...
MemoryError:
Any help on this?
The error shows that the machine does not have enough memory to read the entire
CSV into a DataFrame at one time. Assuming you do not need the entire dataset in
memory all at one time, one way to avoid the problem would be to process the CSV in
chunks (by specifying the chunksize parameter):
chunksize = 10 ** 6
for chunk in pd.read_csv(filename, chunksize=chunksize):
    process(chunk)
The chunksize parameter specifies the number of rows per chunk.
(The last chunk may contain fewer than chunksize rows, of course.)
pandas >= 1.2
read_csv with chunksize returns a context manager, to be used like so:
chunksize = 10 ** 6
with pd.read_csv(filename, chunksize=chunksize) as reader:
    for chunk in reader:
        process(chunk)
See GH38225
Chunking shouldn't always be the first port of call for this problem.
Is the file large due to repeated non-numeric data or unwanted columns?
If so, you can sometimes see massive memory savings by reading columns in as categories and selecting only the required columns via the pd.read_csv usecols parameter (a small sketch follows at the end of this answer).
Does your workflow require slicing, manipulating, exporting?
If so, you can use dask.dataframe to slice, perform your calculations and export iteratively. Chunking is performed silently by dask, which also supports a subset of pandas API.
If all else fails, read line by line via chunks.
Chunk via pandas or via csv library as a last resort.
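As a concrete illustration of the categories/usecols point above, here is a minimal sketch; the filename and column names are placeholders, not from the original question:
import pandas as pd

# Keep only the columns that are needed, and read the repetitive text
# column as a category to shrink memory use.
df = pd.read_csv('large_file.csv',
                 usecols=['id', 'group', 'value'],
                 dtype={'group': 'category', 'value': 'float32'})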
For large data I recommend you use the library "dask", e.g.:
# Dataframes implement the Pandas API
import dask.dataframe as dd
df = dd.read_csv('s3://.../2018-*-*.csv')
You can read more from the documentation here.
Another great alternative would be to use modin, because all the functionality is identical to pandas yet it leverages distributed dataframe libraries such as dask.
From my projects another superior library is datatable.
# Datatable python library
import datatable as dt
df = dt.fread("s3://.../2018-*-*.csv")
I proceeded like this:
chunks = pd.read_table('aphro.csv', chunksize=1000000, sep=';',
                       names=['lat', 'long', 'rf', 'date', 'slno'], index_col='slno',
                       header=None, parse_dates=['date'])
df = pd.DataFrame()
%time df = pd.concat(chunk.groupby(['lat', 'long', chunk['date'].map(lambda x: x.year)])['rf'].agg(['sum']) for chunk in chunks)
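Note that this concat stacks the per-chunk group sums, so the same (lat, long, year) key can appear once per chunk; if the goal is a single total per key, a final aggregation over the concatenated result is still needed, for example:
# Collapse duplicate (lat, long, year) keys coming from different chunks
df = df.groupby(level=[0, 1, 2]).sum()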
You can read in the data as chunks and save each chunk as pickle.
import pandas as pd
import pickle

in_path = ""         # path where the large file is
out_path = ""        # path to save the pickle files to
chunk_size = 400000  # size of chunks relies on your available memory
separator = "~"

reader = pd.read_csv(in_path, sep=separator, chunksize=chunk_size,
                     low_memory=False)

for i, chunk in enumerate(reader):
    out_file = out_path + "/data_{}.pkl".format(i + 1)
    with open(out_file, "wb") as f:
        pickle.dump(chunk, f, pickle.HIGHEST_PROTOCOL)
In the next step you read in the pickles and append each pickle to your desired dataframe.
import glob

pickle_path = ""  # same path as out_path, i.e. where the pickle files are

data_p_files = []
for name in glob.glob(pickle_path + "/data_*.pkl"):
    data_p_files.append(name)

# DataFrame.append was removed in pandas 2.0, so concatenate the pickles instead
df = pd.concat((pd.read_pickle(p) for p in data_p_files), ignore_index=True)
I want to give a more comprehensive answer based on most of the potential solutions already provided. I also want to point out one more potential aid that may help the reading process.
Option 1: dtypes
dtype is a pretty powerful parameter that you can use to reduce the memory pressure of read methods. See this and this answer. By default, pandas tries to infer the dtypes of the data.
Every piece of data that is stored requires a memory allocation. At a basic level, refer to the values below (the table illustrates values for the C programming language):
The maximum value of UNSIGNED CHAR = 255
The minimum value of SHORT INT = -32768
The maximum value of SHORT INT = 32767
The minimum value of INT = -2147483648
The maximum value of INT = 2147483647
The minimum value of CHAR = -128
The maximum value of CHAR = 127
The minimum value of LONG = -9223372036854775808
The maximum value of LONG = 9223372036854775807
Refer to this page to see the matching between NumPy and C types.
Let's say you have an array of single-digit integers. You could, both theoretically and practically, store it as, say, a 16-bit integer array, but you would then allocate more memory than you actually need. To prevent this, you can set the dtype option on read_csv: you do not want to store the array items as long integers when they actually fit in an 8-bit integer (np.int8 or np.uint8).
The dtype reference table at https://pbpython.com/pandas_dtypes.html shows the mapping.
You can pass the dtype parameter to pandas read methods as a dict of the form {column: type}:
import numpy as np
import pandas as pd

df_dtype = {
    "column_1": int,
    "column_2": str,
    "column_3": np.int16,
    "column_4": np.uint8,
    ...
    "column_n": np.float32
}

df = pd.read_csv('path/to/file', dtype=df_dtype)
Option 2: Read by Chunks
Reading the data in chunks lets you work on one part of the data in memory at a time, so you can preprocess each chunk and keep the processed data rather than the raw data. It is even better if you combine this option with the first one, dtypes (a sketch combining both follows after the list below).
I want to point out the pandas cookbook sections on that process, which you can find here. Note these two sections:
Reading a csv chunk-by-chunk
Reading only certain rows of a csv chunk-by-chunk
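As a minimal sketch of combining chunked reading with explicit dtypes (the filename, column names, dtypes and filter are placeholders, not from the question; the context-manager form needs pandas >= 1.2, as noted above):
import pandas as pd

chunksize = 10 ** 6
dtypes = {"value": "float32", "group": "category"}  # hypothetical columns

pieces = []
with pd.read_csv("large_file.csv", chunksize=chunksize, dtype=dtypes) as reader:
    for chunk in reader:
        # Keep only what the downstream analysis needs from each chunk.
        pieces.append(chunk[chunk["value"] > 0])

result = pd.concat(pieces, ignore_index=True)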
Option 3: Dask
Dask is a framework that is defined in Dask's website as:
Dask provides advanced parallelism for analytics, enabling performance at scale for the tools you love
It was born to cover the parts that pandas cannot reach; Dask is a powerful framework that gives you access to much more data by processing it in a distributed way.
You can use Dask to preprocess your data as a whole; Dask takes care of the chunking part, so unlike pandas you can just define your processing steps and let Dask do the work. Dask does not run the computations until they are explicitly triggered by compute and/or persist (see the answer here for the difference).
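A minimal sketch of that lazy-execution flow (the file pattern and column names are placeholders):
import dask.dataframe as dd

# Nothing is read yet; this only builds a lazy task graph.
ddf = dd.read_csv("large_file_*.csv")

# Define the processing steps; still lazy.
summary = ddf.groupby("group")["value"].mean()

# Only compute() actually reads the chunks and runs the computation.
result = summary.compute()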
Other Aids (Ideas)
Design an ETL flow for the data, keeping only what is needed from the raw data.
First, apply the ETL to the whole dataset with frameworks like Dask or PySpark and export the processed data.
Then see if the processed data fits in memory as a whole.
Consider increasing your RAM.
Consider working with that data on a cloud platform.
Before using the chunksize option, if you want to be sure about the process function that you intend to write inside the chunking for-loop (as mentioned by @unutbu), you can simply use the nrows option.
small_df = pd.read_csv(filename, nrows=100)
Once you are sure that the process block is ready, you can put that in the chunking for loop for the entire dataframe.
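In other words, prototype the processing on a small slice first and then reuse the same function in the chunked loop. A minimal sketch with a placeholder filename and process function:
import pandas as pd

filename = "large_file.csv"  # placeholder path

def process(df):
    # Placeholder for whatever per-chunk processing is needed.
    return df.describe()

# 1) Prototype on a small slice.
small_df = pd.read_csv(filename, nrows=100)
print(process(small_df))

# 2) Once the process block works, run it over the whole file in chunks.
for chunk in pd.read_csv(filename, chunksize=10 ** 6):
    process(chunk)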
The functions read_csv and read_table are almost the same, but you must specify the delimiter "," when you use read_table in your program.
def get_from_action_data(fname, chunk_size=100000):
    reader = pd.read_csv(fname, header=0, iterator=True)
    chunks = []
    loop = True
    while loop:
        try:
            chunk = reader.get_chunk(chunk_size)[["user_id", "type"]]
            chunks.append(chunk)
        except StopIteration:
            loop = False
            print("Iteration is stopped")
    df_ac = pd.concat(chunks, ignore_index=True)
    return df_ac
Solution 1:
Using pandas with large data
Solution 2:
TextFileReader = pd.read_csv(path, chunksize=1000)  # the number of rows per chunk

dfList = []
for df in TextFileReader:
    dfList.append(df)

df = pd.concat(dfList, sort=False)
Here follows an example:
chunkTemp = []
queryTemp = []
query = pd.DataFrame()

for chunk in pd.read_csv(file, header=0, chunksize=<your_chunksize>, iterator=True, low_memory=False):

    # REPLACING BLANK SPACES IN COLUMN NAMES FOR SQL OPTIMIZATION
    chunk = chunk.rename(columns={c: c.replace(' ', '') for c in chunk.columns})

    # YOU CAN EITHER:
    # 1) BUFFER THE CHUNKS IN ORDER TO LOAD YOUR WHOLE DATASET
    chunkTemp.append(chunk)

    # 2) DO YOUR PROCESSING OVER A CHUNK AND STORE THE RESULT OF IT
    query = chunk[chunk[<column_name>].str.startswith(<some_pattern>)]
    # BUFFERING PROCESSED DATA
    queryTemp.append(query)

# ! NEVER DO pd.concat OR pd.DataFrame() INSIDE A LOOP
print("Database: CONCATENATING CHUNKS INTO A SINGLE DATAFRAME")
chunk = pd.concat(chunkTemp)
print("Database: LOADED")

# CONCATENATING PROCESSED DATA
query = pd.concat(queryTemp)
print(query)
You can try sframe, which has the same syntax as pandas but lets you manipulate files that are bigger than your RAM.
If you use pandas to read a large file in chunks and then yield it chunk by chunk, here is what I have done:
import pandas as pd

def chunk_generator(filename, header=False, chunk_size=10 ** 5):
    for chunk in pd.read_csv(filename, delimiter=',', iterator=True,
                             chunksize=chunk_size, parse_dates=[1]):
        yield chunk

def _generator(filename, header=False, chunk_size=10 ** 5):
    for chunk in chunk_generator(filename, header=header, chunk_size=chunk_size):
        yield chunk

if __name__ == "__main__":
    filename = r'file.csv'
    generator = _generator(filename=filename)
    for chunk in generator:
        print(chunk)
In case someone is still looking for something like this, I found that this new library called modin can help. It uses distributed computing that can help with the read. Here's a nice article comparing its functionality with pandas. It essentially uses the same functions as pandas.
import modin.pandas as pd
pd.read_csv(CSV_FILE_NAME)
If you have a csv file with millions of data entries and you want to load the full dataset, you should use dask_cudf (part of RAPIDS, so it requires an NVIDIA GPU):
import dask_cudf as dc
df = dc.read_csv("large_data.csv")
In addition to the answers above, for those who want to process the CSV and then export to csv, parquet or SQL, d6tstack is another good option. You can load multiple files and it deals with data schema changes (added/removed columns). Chunked out-of-core support is already built in.
def apply(dfg):
    # do stuff
    return dfg

c = d6tstack.combine_csv.CombinerCSV(['bigfile.csv'], apply_after_read=apply, sep=',', chunksize=1e6)
# or
c = d6tstack.combine_csv.CombinerCSV(glob.glob('*.csv'), apply_after_read=apply, chunksize=1e6)

# output to various formats, automatically chunked to reduce memory consumption
c.to_csv_combine(filename='out.csv')
c.to_parquet_combine(filename='out.pq')
c.to_psql_combine('postgresql+psycopg2://usr:pwd@localhost/db', 'tablename')  # fast for postgres
c.to_mysql_combine('mysql+mysqlconnector://usr:pwd@localhost/db', 'tablename')  # fast for mysql
c.to_sql_combine('postgresql+psycopg2://usr:pwd@localhost/db', 'tablename')  # slow but flexible

Questions about read_csv and str dtype

I have a large text file where the columns are of the following form:
1255 32627 some random stuff which might have numbers 1245
1. I would like to use read_csv to give me a data frame with three columns. The first two columns should be dtype uint32 and the third should just contain everything afterwards as a string. That is, the line above should be split into 1255, 32627 and "some random stuff which might have numbers 1245". This, for example, does not do it, but at least shows the dtypes:
pd.read_csv("foo.txt", sep=' ', header=None, dtype={0:np.uint32, 1:np.uint32, 2:np.str})
2. My second question is about the str dtype. How much RAM does it use, and if I know the max length of a string, can I reduce that?
Is there a reason you need to use pd.read_csv()? The code below is straightforward and makes it easy to shape the column values to your requirements.
from numpy import uint32
from csv import reader
from pandas import DataFrame

file = 'path/to/file.csv'

data = []
with open(file, 'r') as f:
    r = reader(f, delimiter=' ')
    for row in r:
        column_1 = uint32(row[0])
        column_2 = uint32(row[1])
        column_3 = ' '.join(str(col) for col in row[2:])
        data.append([column_1, column_2, column_3])

frame = DataFrame(data)
I don't understand the question. Do you expect your strings to be extremely long? A 32-bit Python installation is limited to a string 2-3GB long. A 64-bit installation is much much larger, limited only by the amount of RAM you can stuff into your system.
You can use the Series.str.cat method, documentation for which is available here:
df = pd.read_csv("foo.txt", sep=' ', header=None)
# Create a new column which concatenates all columns
df['new'] = df.apply(lambda row: row.iloc[2:].apply(str).str.cat(sep = ' '),axis=1)
df = df[[0,1,'new']]
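To also get the first two columns as uint32 (the other part of the question), they can be cast after the read; a small follow-up on the frame built above:
import numpy as np

# Downcast the two numeric columns after concatenating the text columns.
df[[0, 1]] = df[[0, 1]].astype(np.uint32)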
Not sure exactly what you mean by your second question, but if you want to check the size of a string in memory you can use:
import sys
print (sys.getsizeof('some string'))
Sorry, I have no idea how knowing the maximum length would help you save memory, or whether that is even possible.

