I am trying to select random rows from an HDFStore table of about 1 GB. RAM usage explodes when I ask for about 50 random rows.
I am using pandas 0.11-dev, Python 2.7, linux64.
In this first case, RAM usage matches the size of a chunk:
with pd.get_store("train.h5",'r') as train:
    for chunk in train.select('train',chunksize=50):
        pass
In this second case, it seems like the whole table is loaded into RAM
r=random.choice(400000,size=40,replace=False)
train.select('train',pd.Term("index",r))
In this last case, RAM usage fits the equivalent chunk size
r=random.choice(400000,size=30,replace=False)
train.select('train',pd.Term("index",r))
I am puzzled as to why moving from 30 to 40 random rows induces such a dramatic increase in RAM usage.
Note the table has been indexed when created such that index=range(nrows(table)) using the following code:
def txtfile2hdfstore(infile, storefile, table_name, sep="\t", header=0, chunksize=50000 ):
    max_len, dtypes0 = txtfile2dtypes(infile, sep, header, chunksize)
    with pd.get_store( storefile,'w') as store:
        for i, chunk in enumerate(pd.read_table(infile,header=header,sep=sep,chunksize=chunksize, dtype=dict(dtypes0))):
            chunk.index= range( chunksize*(i), chunksize*(i+1))[:chunk.shape[0]]
            store.append(table_name,chunk, min_itemsize={'values':max_len})
Thanks for insight
EDIT TO ANSWER Zelazny7
Here's the file I used to write Train.csv to train.h5. I wrote this using elements of Zelazny7's code from "How to trouble-shoot HDFStore Exception: cannot find the correct atom type".
import pandas as pd
import numpy as np
from sklearn.feature_extraction import DictVectorizer
def object_max_len(x):
    if x.dtype != 'object':
        return
    else:
        return len(max(x.fillna(''), key=lambda x: len(str(x))))
def txtfile2dtypes(infile, sep="\t", header=0, chunksize=50000 ):
    max_len = pd.read_table(infile,header=header, sep=sep,nrows=5).apply( object_max_len).max()
    dtypes0 = pd.read_table(infile,header=header, sep=sep,nrows=5).dtypes
    for chunk in pd.read_table(infile,header=header, sep=sep, chunksize=chunksize):
        max_len = max((pd.DataFrame(chunk.apply( object_max_len)).max(),max_len))
        for i,k in enumerate(zip( dtypes0[:], chunk.dtypes)):
            if (k[0] != k[1]) and (k[1] == 'object'):
                dtypes0[i] = k[1]
    #as of pandas-0.11 nan requires a float64 dtype
    dtypes0.values[dtypes0 == np.int64] = np.dtype('float64')
    return max_len, dtypes0
def txtfile2hdfstore(infile, storefile, table_name, sep="\t", header=0, chunksize=50000 ):
    max_len, dtypes0 = txtfile2dtypes(infile, sep, header, chunksize)
    with pd.get_store( storefile,'w') as store:
        for i, chunk in enumerate(pd.read_table(infile,header=header,sep=sep,chunksize=chunksize, dtype=dict(dtypes0))):
            chunk.index= range( chunksize*(i), chunksize*(i+1))[:chunk.shape[0]]
            store.append(table_name,chunk, min_itemsize={'values':max_len})
Applied as
txtfile2hdfstore('Train.csv','train.h5','train',sep=',')
This is a known issue; see the reference here: https://github.com/pydata/pandas/pull/2755
Essentially, the query is turned into a numexpr expression for evaluation. There is an issue where I can't pass a lot of or conditions to numexpr (it's dependent on the total length of the generated expression).
So I just limit the expression that we pass to numexpr. If it exceeds a certain number of or conditions, then the query is done as a filter rather than an in-kernel selection. Basically, this means the whole table is read and then reindexed.
This is on my enhancements list: https://github.com/pydata/pandas/issues/2391 (17).
As a workaround, just split your queries up into multiple ones and concat the results. This should be much faster and use a constant amount of memory.
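A minimal sketch of that workaround, assuming batches of 30 ids stay under the numexpr limit as they did in the question. The helper names chunked and select_rows are hypothetical, and select_one_batch stands in for whatever performs one small query:

```python
def chunked(seq, size):
    """Yield consecutive slices of seq, each holding at most `size` items."""
    for start in range(0, len(seq), size):
        yield seq[start:start + size]


def select_rows(select_one_batch, row_ids, batch_size=30):
    """Run one small query per batch of ids and collect the partial results.

    select_one_batch is a hypothetical callable that performs a single query;
    batch_size=30 is an assumption taken from the case that behaved well.
    """
    ordered = sorted(row_ids)
    return [select_one_batch(batch) for batch in chunked(ordered, batch_size)]


# Stand-in query so the sketch runs without an HDF5 file.
parts = select_rows(lambda ids: list(ids), [5, 1, 3, 2, 4], batch_size=2)
print(parts)
```

With the store from the question, select_one_batch could be something like lambda ids: train.select('train', pd.Term("index", ids)), and the returned list of frames would then be passed to pd.concat.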
I have stumbled across similar questions but all of them used either .csv or .txt files and a solution was to read in the data line by line or in chunks. I am not aware of that being possible with parquet files as they were designed to be read by columns and not rows.
My first attempt below works well with a smaller subset/test dataset created from the original full dataset.
import os
from multiprocessing import Pool
import numpy as np
import pandas as pd
def process_single_group(group_df):
    # Super simple version of my function
    group_df['E'] = group_df['C'] + group_df['D']
    group_df['F'] = group_df['C'] - group_df['D']
    group_df['G'] = group_df['C'] * group_df['D']
    return group_df
def group_of_groups_loop(group_of_group_df):
    df_agg = pd.DataFrame()
    for i, group_df in enumerate(group_of_group_df):
        t = process_single_group(group_df)
        df_agg = pd.concat([df_agg, t])
    return df_agg
num_processes = os.cpu_count()
pool = Pool(processes=num_processes)
df = pd.read_parquet('path/to/dataset.parquet')
# With small dataset, there are about 300 groups
# With full dataset, there are about ~800k groups
groupby = df.groupby(by=['A', 'B'])
# Split the groups into chunks of groups
groupby_split = np.array_split(groupby, num_processes)
# Create a list where each element is a chunk of groups
# The starmap expects this sort of input
input_list = [[gb] for gb in groupby_split]
x = pd.concat(pool.starmap(group_of_groups_loop, input_list))
# close() must be called before join()
pool.close()
pool.join()
x.to_parquet('path/to/save/file.parquet')
but when I switch to the full parquet file, I get the error:
struct.error: 'i' format requires -2147483648 <= number <= 2147483647
which I expected.
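For context, the message itself comes from the standard struct module. The sketch below assumes the failure arises because multiprocessing packs the length of each pickled payload into a 4-byte signed integer, which was the case in Python versions before 3.8; the helper name fits_in_int32_header is hypothetical:

```python
import struct

INT32_MAX = 2**31 - 1  # 2147483647, the upper bound quoted in the error


def fits_in_int32_header(payload_size):
    """Return True if payload_size can be packed as a 4-byte signed integer."""
    try:
        struct.pack("!i", payload_size)
    except struct.error:
        return False
    return True


print(fits_in_int32_header(INT32_MAX))      # just under 2 GiB
print(fits_in_int32_header(INT32_MAX + 1))  # one byte too many
```

Under that assumption, any single argument or return value whose pickled form exceeds about 2 GiB triggers the error, no matter how many groups it contains.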
My solution to this was to break the very large number of groups into smaller chunks of groups (size similar to the subset) and loop over each one like with the subset earlier.
EDIT I forgot to add the second np.array_split within the loop.
def process_single_group(group_df):
    # Super simple version of my function
    group_df['E'] = group_df['C'] + group_df['D']
    group_df['F'] = group_df['C'] - group_df['D']
    group_df['G'] = group_df['C'] * group_df['D']
    return group_df
def group_of_groups_loop(group_of_group_df):
    df_agg = pd.DataFrame()
    for i, group_df in enumerate(group_of_group_df):
        t = process_single_group(group_df)
        df_agg = pd.concat([df_agg, t])
    return df_agg
num_processes = os.cpu_count()
pool = Pool(processes=num_processes)
df = pd.read_parquet('path/to/dataset.parquet')
# With small dataset, there are about 300 groups
# With full dataset, there are about ~800k groups
groupby = df.groupby(by=['A', 'B'])
# Split the large number of groups into smaller chunks
N = 10
groupby_split = np.array_split(groupby, N)
final_agg_df = pd.DataFrame()
# iterate over each of the smaller chunks
for groupby_group in groupby_split:
    groupby_split_split = np.array_split(groupby_group, num_processes)
    # Create a list to use as argument for starmap
    input_list = [[gb] for gb in groupby_split_split]
    x = pd.concat(pool.starmap(group_of_groups_loop, input_list))
    final_agg_df = pd.concat([final_agg_df, x])
# Shut the pool down once, after the last chunk; close() must precede join()
pool.close()
pool.join()
final_agg_df.to_parquet('path/to/save/file.parquet')
But this is still giving me the same error...
I thought that since the pool was created prior to reading in the large parquet file (I read a solution earlier that mentioned doing this) that each process would only be given the small chunk of groups.
I am wondering if there is something I missed? And also if there is a better way of doing this in general (queue? dask? function logic?)
Thanks in advance!
I want to read all the data from a table holding 10+ GB into a DataFrame. When I try to read it with read_sql, I get a memory overload error. I want to do some processing on that data and update the table with the new data. How can I do this efficiently? My PC has 26 GB of RAM and the data is at most 11 GB, but I still get the memory overload error.
With Dask it takes a very long time. My code is below.
import dateparser
import dask.dataframe as dd
import numpy as np
df = dd.read_sql_table('fbo_xml_json_raw_data', index_col='id', uri='postgresql://postgres:passwordk#address:5432/database')
def make_year(data):
    if data and data.isdigit() and int(data) >= 0:
        data = '20' + data
    elif data and data.isdigit() and int(data) < 0:
        data = '19' + data
    return data
def response_date(data):
    if data and data.isdigit() and int(data[-2:]) >= 0:
        data = data[:-2] + '20' + data[-2:]
    elif data and data.isdigit() and int(data[-2:]) < 0:
        data = data[:-2] + '19' + data[-2:]
    if data and dateparser.parse(data):
        return dateparser.parse(data).date().strftime('%Y-%m-%d')
def parse_date(data):
    if data and dateparser.parse(data):
        return dateparser.parse(data).date().strftime('%Y-%m-%d')
df.ARCHDATE = df.ARCHDATE.apply(parse_date)
df.YEAR = df.YEAR.apply(make_year)
df.DATE = df.DATE + df.YEAR
df.DATE = df.DATE.apply(parse_date)
df.RESPDATE = df.RESPDATE.apply(response_date)
See here: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql.html
See that chunksize arg? You can chunk your data so it fits into memory.
It will return a chunk-reading object, so you can apply operations iteratively over the chunks.
You can probably incorporate multiprocessing as well.
This will add a layer of complexity, since you're no longer working on the DataFrame itself but on an object containing chunks.
Since you're using Dask this "should" apply. I'm not sure how Dask handles chunking. It's been a while since I touched Pandas/Dask compatibility.
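As a rough sketch of the chunksize approach in plain Pandas: the example below uses an in-memory SQLite table as a stand-in for the PostgreSQL table, so the table name raw, its columns, and the chunk size of 4 are all assumptions made for illustration.

```python
import sqlite3

import pandas as pd

# Stand-in database: a tiny in-memory table instead of the real 10+ GB one.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw (id INTEGER PRIMARY KEY, year TEXT)")
conn.executemany(
    "INSERT INTO raw (id, year) VALUES (?, ?)",
    [(i, "%02d" % i) for i in range(10)],
)
conn.commit()

processed = []
# With chunksize set, read_sql returns an iterator of DataFrames,
# so only one chunk of rows is materialised at a time.
for chunk in pd.read_sql("SELECT id, year FROM raw ORDER BY id", conn, chunksize=4):
    chunk["year"] = chunk["year"].astype(int) + 2000  # column-wise processing
    # In real use each chunk would be written back here instead of being kept.
    processed.append(chunk)

chunk_sizes = [len(c) for c in processed]
result = pd.concat(processed, ignore_index=True)
conn.close()
print(chunk_sizes)
```

Keeping every chunk in a list, as done here for demonstration, defeats the memory saving on a large table; writing each processed chunk back before reading the next keeps memory use bounded.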
The main issue seems to be the exclusive use of pd.Series.apply. But apply is just a row-wise Python-level loop, so it will be slow in both Pandas and Dask. For performance-critical code, you should favour column-wise operations.
In fact, dask.dataframe supports a useful subset of the Pandas API. Here are a couple of examples:
Avoid string operations
Convert data to numeric types first; then perform vectorisable operations. For example:
df['YEAR'] = df['YEAR'].astype(int)
df['YEAR'] = df['YEAR'].mask(df['YEAR'] >= 0, 20)
df['YEAR'] = df['YEAR'].mask(df['YEAR'] < 0, 19)
Convert to datetime
If you have datetime strings in an appropriate format:
df['ARCHDATE'] = df['ARCHDATE'].astype('M8[us]')
See also "dask dataframe how to convert column to to_datetime".
I want to use Dask to read in a large file of atom coordinates at multiple time steps. The format is called XYZ, and a file looks like this:
3
timestep 1
C 9.5464696279 5.2523477968 4.4976072664
C 10.6455075132 6.0351186102 4.0196547961
C 10.2970471574 7.3880736108 3.6390228968
3
timestep 2
C 9.5464696279 5.2523477968 4.4976072664
C 10.6455075132 6.0351186102 4.0196547961
C 10.2970471574 7.3880736108 3.6390228968
The first line contains the number of atoms; the second line is just a comment.
After that, the atoms are listed with their names and positions.
After all atoms are listed, the same is repeated for the next time step.
I would now like to load such a trajectory via dask.dataframe.read_csv.
However, I could not figure out how to skip the periodically occurring lines containing the number of atoms and the comment. Is this actually possible?
Edit:
Reading this format into a Pandas Dataframe is possible via:
atom_nr = 3
def skip(line_nr):
    return line_nr % (atom_nr + 2) < 2
pd.read_csv(xyz_filename, skiprows=skip, delim_whitespace=True,
            header=None)
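To make the predicate concrete, here is a small standalone check of which zero-based line numbers it drops for the two three-atom frames shown earlier. Only the standard library is used; the variable names skipped and kept are illustrative.

```python
atom_nr = 3  # atoms per frame, as in the sample file


def skip(line_nr):
    # Each frame spans atom_nr + 2 lines; the first two are the
    # atom count and the comment, which must be dropped.
    return line_nr % (atom_nr + 2) < 2


total_lines = 2 * (atom_nr + 2)  # two frames
skipped = [n for n in range(total_lines) if skip(n)]
kept = [n for n in range(total_lines) if not skip(n)]
print(skipped)  # header lines of both frames
print(kept)     # coordinate lines of both frames
```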
But it looks like the Dask dataframe does not support passing a function to skiprows.
Edit 2:
MRocklin's answer works! Just for completeness, I write down the full code I used.
from io import BytesIO
import pandas as pd
import dask.bytes
import dask.dataframe
import dask.delayed
atom_nr = ...
filename = ...
def skip(line_nr):
    return line_nr % (atom_nr + 2) < 2
def pandaread(data_in_bytes):
    pseudo_file = BytesIO(data_in_bytes[0])
    return pd.read_csv(pseudo_file, skiprows=skip, delim_whitespace=True,
                       header=None)
bts = dask.bytes.read_bytes(filename, delimiter=f"{atom_nr}\ntimestep".encode())
dfs = dask.delayed(pandaread)(bts)
sol = dask.dataframe.from_delayed(dfs)
sol.compute()
The only remaining question is: How do I tell dask to only compute the first n frames? At the moment it seems the full trajectory is read.
Short answer
No, neither pandas.read_csv nor dask.dataframe.read_csv offers this kind of functionality (to my knowledge).
Long Answer
If you can write code to convert some of this data into a pandas dataframe, then you can probably do this on your own with moderate effort using
dask.bytes.read_bytes
dask.dataframe.from_delayed
In general this might look something like the following:
values = read_bytes('filenames.*.txt', delimiter='...', blocksize=2**27)
dfs = [dask.delayed(load_pandas_from_bytes)(v) for v in values]
df = dd.from_delayed(dfs)
Each of the dfs corresponds to roughly blocksize bytes of your data (and then up until the next delimiter). You can control how fine you want your partitions to be using this blocksize. If you want, you can also select only a few of these dfs objects to get a smaller portion of your data:
dfs = dfs[:5] # only the first five blocks of `blocksize` data
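To illustrate what "roughly blocksize bytes, and then up until the next delimiter" means, here is a pure-Python sketch of the idea. It is not Dask's implementation; split_blocks is a hypothetical helper, and the tiny byte string is made up for the demonstration.

```python
def split_blocks(data, delimiter, blocksize):
    """Cut data into blocks of about blocksize bytes that each end on a delimiter."""
    blocks = []
    start = 0
    while start < len(data):
        # Find the first delimiter at or after the nominal end of the block.
        cut = data.find(delimiter, start + blocksize)
        if cut == -1:
            end = len(data)  # no delimiter left: take the remainder
        else:
            end = cut + len(delimiter)
        blocks.append(data[start:end])
        start = end
    return blocks


data = b"a\nbb\nccc\ndddd\n"
blocks = split_blocks(data, b"\n", blocksize=4)
first_blocks = blocks[:1]  # analogous to keeping only the first few dfs
print(blocks)
```

Because every block ends on a delimiter, each one can be parsed independently, which is what makes it safe to keep only a prefix of the blocks.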
I need to efficiently insert about 500k (give or take 100k) rows of data into my PostgreSQL database. After a generous amount of googling, I've arrived at this solution, which averages about 150 (wall-clock) seconds.
def db_insert_spectrum(curs, visual_data, recording_id):
    sql = """
        INSERT INTO spectrums (row, col, value, recording_id)
        VALUES %s
        """
    # Mass-insertion technique
    # visual_data is a 2D array (a nx63 matrix)
    values_list = []
    for rowIndex, rowData in enumerate(visual_data):
        for colIndex, colData in enumerate(rowData): # colData is the value
            value = [(rowIndex, colIndex, colData, recording_id)]
            values_list.append(value)
    psycopg2.extras.execute_batch(curs, sql, values_list, page_size=1000)
Is there a faster way?
Based on the answers given here, COPY is the fastest method. COPY reads from a file or file-like object.
Since memory I/O is many orders of magnitude faster than disk I/O, it is faster to write the data to a StringIO file-like object than to write to an actual file.
The psycopg docs show an example of calling copy_from with a StringIO as input.
Therefore, you could use something like:
try:
    # Python2
    from cStringIO import StringIO
except ImportError:
    # Python3
    from io import StringIO
def db_insert_spectrum(curs, visual_data, recording_id):
    f = StringIO()
    # visual_data is a 2D array (a nx63 matrix)
    for rowIndex, rowData in enumerate(visual_data):
        items = []
        for colIndex, colData in enumerate(rowData):
            value = (rowIndex, colIndex, colData, recording_id)
            # One tab-separated line per cell, the default text format for COPY
            items.append('\t'.join(map(str, value))+'\n')
        f.writelines(items)
    # Rewind so copy_from reads the buffer from the start
    f.seek(0)
    curs.copy_from(f, 'spectrums', columns=('row', 'col', 'value', 'recording_id'))
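The buffer-building step can be checked without a database. The sketch below isolates it in a hypothetical helper, rows_to_tsv_buffer, and uses a made-up 2x2 matrix in place of the real n x 63 data:

```python
from io import StringIO


def rows_to_tsv_buffer(visual_data, recording_id):
    """Serialise a 2D array into tab-separated lines held in memory."""
    buf = StringIO()
    for row_index, row_data in enumerate(visual_data):
        for col_index, value in enumerate(row_data):
            fields = (row_index, col_index, value, recording_id)
            buf.write("\t".join(map(str, fields)) + "\n")
    buf.seek(0)  # rewind so a reader starts at the first line
    return buf


# Illustrative data only.
buf = rows_to_tsv_buffer([[0.5, 1.5], [2.5, 3.5]], recording_id=7)
text = buf.getvalue()
print(text)
```

The returned buffer is what would be handed to copy_from, with the columns argument listing the fields in the same order as they are written.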
I don't know whether execute_batch can accept a generator, but could you try something like:
def db_insert_spectrum(curs, visual_data, recording_id):
    sql = """
        INSERT INTO spectrums (row, col, value, recording_id)
        VALUES %s
        """
    # Each row is wrapped in a 1-tuple so the single %s receives the whole row, as in the question
    data_gen = (((rIdx, cIdx, value, recording_id),)
                for rIdx, cData in enumerate(visual_data)
                for cIdx, value in enumerate(cData))
    psycopg2.extras.execute_batch(curs, sql, data_gen, page_size=1000)
It might be faster.
I am trying to work with around 100 csv files to do a time series analysis.
To keep the algorithm efficient, I've structured my read_csv-based data function so that it reads all the files once and doesn't have to repeat the same process again and again. The following is my code:
from itertools import chain, combinations
import pandas as pd
start_date = '2016-06-01'
end_date = '2017-09-02'
allocation = 170000
#contains 100 symbols
usesymbols = ['']
cost_matrix = []
def data():
    dates=pd.date_range(start_date,end_date)
    df=pd.DataFrame(index=dates)
    for symbol in usesymbols:
        df_temp=pd.read_csv('/home/furqan/Desktop/python_data/{}.csv'.format(str(symbol)),usecols=['Date','Close'],
                            parse_dates=True,index_col='Date',na_values=['nan'])
        df_temp = df_temp.rename(columns={'Close': symbol})
        df=df.join(df_temp)
    df=df.fillna(method='ffill')
    df=df.fillna(method='bfill')
    return df
def powerset(iterable):
    s = list(iterable)
    return chain.from_iterable(combinations(s, r) for r in range(1, len(s)+1))
power_set = list(powerset(usesymbols))
dataframe = data()
Problem is that if I run the above code with 15 symbols it works perfectly.
But that's not sufficient, I want to use 100 symbols.
If I run the code with 100 items in usesymbols, my RAM is used up completely and the machine freezes.
Is there anything that can be done to avoid this situation?
Edited Part:
1) I have 16 GB of RAM.
2) The issue is with the variable power_set: if I don't call the powerset function, the data is retrieved easily.
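A quick arithmetic check is consistent with that observation: the powerset function above yields every non-empty subset, so n symbols give 2**n - 1 tuples, and list(...) tries to hold all of them at once. The sketch below counts them and shows that the generator can still be consumed lazily; powerset_size is a hypothetical helper name.

```python
from itertools import chain, combinations, islice


def powerset(iterable):
    """All non-empty subsets, produced lazily (same logic as in the question)."""
    s = list(iterable)
    return chain.from_iterable(combinations(s, r) for r in range(1, len(s) + 1))


def powerset_size(n):
    """Number of non-empty subsets of an n-element collection."""
    return 2 ** n - 1


small = powerset_size(15)   # manageable
large = powerset_size(100)  # a 31-digit number, far beyond any RAM

# Iterating lazily avoids building the list; here only five subsets are taken.
first_five = list(islice(powerset(range(100)), 5))
print(small, large, first_five)
```

Even consumed lazily, visiting every subset of 100 symbols is not feasible, so a realistic approach would need to restrict which subsets are examined.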
DataFrame.memory_usage(index=False)
Return:
sizes : Series
A series with column names as index and memory usage of columns with units of bytes.