Read large data from a database table in pandas or dask - python

I want to read all data from a table with 10+ GB of data into a dataframe. When I try to read it with read_sql I get a memory overload error. I want to do some processing on that data and then update the table with the new data. How can I do this efficiently? My PC has 26 GB of RAM and the data is at most 11 GB, yet I still get a memory overload error.
With Dask it takes a very long time. Below is the code.
import dateparser
import dask.dataframe as dd
import numpy as np

df = dd.read_sql_table('fbo_xml_json_raw_data', index_col='id',
                       uri='postgresql://postgres:passwordk#address:5432/database')

def make_year(data):
    if data and data.isdigit() and int(data) >= 0:
        data = '20' + data
    elif data and data.isdigit() and int(data) < 0:
        data = '19' + data
    return data

def response_date(data):
    if data and data.isdigit() and int(data[-2:]) >= 0:
        data = data[:-2] + '20' + data[-2:]
    elif data and data.isdigit() and int(data[-2:]) < 0:
        data = data[:-2] + '19' + data[-2:]
    if data and dateparser.parse(data):
        return dateparser.parse(data).date().strftime('%Y-%m-%d')

def parse_date(data):
    if data and dateparser.parse(data):
        return dateparser.parse(data).date().strftime('%Y-%m-%d')

df.ARCHDATE = df.ARCHDATE.apply(parse_date)
df.YEAR = df.YEAR.apply(make_year)
df.DATE = df.DATE + df.YEAR
df.DATE = df.DATE.apply(parse_date)
df.RESPDATE = df.RESPDATE.apply(response_date)

See here: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql.html
See that chunksize arg? You can chunk your data so it fits into memory.
It will return an iterator of chunks, so you can apply your operations iteratively over the chunks.
You can probably also incorporate multiprocessing as well.
This will add a layer of complexity, since you're no longer working on the DataFrame itself but on an object containing chunks.
Since you're using Dask, this "should" apply. I'm not sure how Dask handles chunking; it's been a while since I touched Pandas/Dask compatibility.
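For reference, a minimal sketch of the chunked pattern, reusing the table name from the question (the connection-string placeholders, the staging table, the example transform, and the chunk size are all assumptions):
import pandas as pd
from sqlalchemy import create_engine

# Placeholder credentials/host, mirroring the question's connection string.
engine = create_engine('postgresql://postgres:password@address:5432/database')

# With chunksize, read_sql returns an iterator of DataFrames instead of one big
# frame, so only one chunk is held in memory at a time.
for chunk in pd.read_sql('SELECT * FROM fbo_xml_json_raw_data', engine, chunksize=50_000):
    # Example transform only; ARCHDATE is one of the columns from the question.
    chunk['ARCHDATE'] = pd.to_datetime(chunk['ARCHDATE'], errors='coerce')
    # Write each processed chunk to a (hypothetical) staging table.
    chunk.to_sql('fbo_xml_json_raw_data_processed', engine,
                 if_exists='append', index=False)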

The main issue seems to be the exclusive use of pd.Series.apply. apply is just a row-wise Python-level loop, so it will be slow in both Pandas and Dask. For performance-critical code, you should favour column-wise operations.
In fact, dask.dataframe supports a useful subset of the Pandas API. Here are a couple of examples:
Avoid string operations
Convert the data to numeric types first, then perform vectorisable operations. For example:
df['YEAR'] = df['YEAR'].astype(int)
df['YEAR'] = df['YEAR'].mask(df['YEAR'] >= 0, df['YEAR'] + 2000)
df['YEAR'] = df['YEAR'].mask(df['YEAR'] < 0, df['YEAR'] + 1900)
Convert to datetime
If you have datetime strings in an appropriate format:
df['ARCHDATE'] = df['ARCHDATE'].astype('M8[us]')
See also dask dataframe how to convert column to to_datetime.
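On the same theme, dask.dataframe exposes a vectorised to_datetime that avoids calling dateparser row by row; a small sketch (whether the default parser handles your ARCHDATE strings depends on their actual format, hence errors='coerce'):
import dask.dataframe as dd

# Parses the whole column per partition; unparseable values become NaT
# instead of raising.
df['ARCHDATE'] = dd.to_datetime(df['ARCHDATE'], errors='coerce')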

Related

How to convert a single column containing JSON with 250 variables into a 250-column dataset using arrays?

I have an issue converting a JSON column (which contains around 250 variables) into 250 separate columns. I'm able to use a Pandas dataframe, but for just 46k rows it takes 30 minutes, and sometimes the kernel crashes due to low memory (for 0.5 million rows in the database).
Can somebody help me with code using NumPy arrays (which should decrease the conversion time and reduce the file size)?
The JSON column has data in the below format:
My code:
for x in records:
    list_ = list(x)
    json_acceptable_string = list_[4].read()
    list_features.append(json.loads(json_acceptable_string))
Once I get list_features I do preprocessing and run a machine learning pipeline. This isn't working for large data.
I think this could help with building your np array:
variable_name_list = ['var1', 'var2', ..., 'var250']
list_features = np.empty(shape=(len(records), len(variable_name_list)), dtype=str)
for index in range(len(records)):
    list_ = list(records[index])
    json_acceptable_string = list_[4].read()
    tmp_feature_list = []
    tmp_feature_dict = json.loads(json_acceptable_string)
    for var_name in variable_name_list:
        if var_name not in tmp_feature_dict.keys():
            tmp_feature_list.append("missing_val")
        else:
            tmp_feature_list.append(tmp_feature_dict[var_name])
    tmp_feature_list = np.asarray(tmp_feature_list, dtype=str).reshape(1, len(variable_name_list))
    list_features[index] = tmp_feature_list
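If each batch of rows fits in memory, pandas' json_normalize (pandas >= 1.0) may be a simpler alternative to filling the array by hand; a sketch under that assumption, reusing records and variable_name_list from above:
import json
import pandas as pd

# Parse the JSON blob from each DB row (the 5th field, as in the question's code).
parsed = [json.loads(row[4].read()) for row in records]

# One column per JSON key; keys absent from a row become NaN, then the sentinel value.
wide = (pd.json_normalize(parsed)
          .reindex(columns=variable_name_list)
          .fillna("missing_val"))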

Python PANDAS: Converting from pandas/numpy to dask dataframe/array

I am trying to convert a program to be parallelizable/multithreaded with the excellent dask library. Here is the program I am working on converting:
Python PANDAS: Stack by Enumerated Date to Create Records Vectorized
import pandas as pd
import numpy as np
import dask.dataframe as dd
import dask.array as da
from io import StringIO
test_data = '''id,transaction_dt,units,measures
1,2018-01-01,4,30.5
1,2018-01-03,4,26.3
2,2018-01-01,3,12.7
2,2018-01-03,3,8.8'''
df_test = pd.read_csv(StringIO(test_data), sep=',')
df_test['transaction_dt'] = pd.to_datetime(df_test['transaction_dt'])
df_test = df_test.loc[np.repeat(df_test.index, df_test['units'])]
df_test['transaction_dt'] += pd.to_timedelta(df_test.groupby(level=0).cumcount(), unit='d')
df_test = df_test.reset_index(drop=True)
expected results:
id,transaction_dt,measures
1,2018-01-01,30.5
1,2018-01-02,30.5
1,2018-01-03,30.5
1,2018-01-04,30.5
1,2018-01-03,26.3
1,2018-01-04,26.3
1,2018-01-05,26.3
1,2018-01-06,26.3
2,2018-01-01,12.7
2,2018-01-02,12.7
2,2018-01-03,12.7
2,2018-01-03,8.8
2,2018-01-04,8.8
2,2018-01-05,8.8
It occurred to me that this might be a good candidate to try to parallelize because the separate dask partitions should not need to know anything about each other to accomplish the required operations. Here is a naive representation of how I thought it might work:
dd_test = dd.from_pandas(df_test, npartitions=3)
dd_test = dd_test.loc[da.repeat(dd_test.index, dd_test['units'])]
dd_test['transaction_dt'] += dd_test.to_timedelta(dd.groupby(level=0).cumcount(), unit='d')
dd_test = dd_test.reset_index(drop=True)
So far I have been trying to work through the following errors or idiomatic differences:
1. "NotImplementedError: Only integer valued repeats supported." I have tried converting the index to an int column/array as well, but I still run into the issue.
2. Dask does not support the mutating operator "+=".
3. Dask has no .to_timedelta() equivalent.
4. Dask has no .cumcount() (but I think .cumsum() is interchangeable?!)
If any dask experts out there could let me know whether there are fundamental impediments precluding this, or offer any tips on implementation, that would be a great help!
Edit:
I think I have made a bit of progress on this since posting the question:
dd_test = dd.from_pandas(df_test, npartitions=3)
dd_test['helper'] = 1
dd_test = dd_test.loc[da.repeat(dd_test.index, dd_test['units'])]
dd_test['transaction_dt'] = dd_test['transaction_dt'] + (dd_test.groupby('id')['helper'].cumsum()).astype('timedelta64[D]')
dd_test = dd_test.reset_index(drop=True)
However, I am still stuck on the dask array repeats error. Any tips still welcome.
Not sure if this is exactly what you are looking for, but I replaced da.repeat with np.repeat, explicitly cast dd_test.index and dd_test['units'] to numpy arrays, and finally added dd_test['transaction_dt'].astype('M8[us]') to your timedelta calculation.
df_test = pd.read_csv(StringIO(test_data), sep=',')
dd_test = dd.from_pandas(df_test, npartitions=3)
dd_test['helper'] = 1
dd_test = dd_test.loc[np.repeat(np.array(dd_test.index),
                                np.array(dd_test['units']))]
dd_test['transaction_dt'] = dd_test['transaction_dt'].astype('M8[us]') + (dd_test.groupby('id')['helper'].cumsum()).astype('timedelta64[D]')
dd_test = dd_test.reset_index(drop=True)
df_expected = dd_test.compute()
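Since, as noted in the question, the partitions should not need to know anything about each other, another option is to push the original pandas logic into each partition with map_partitions; a sketch under that assumption, starting from the parsed df_test (transaction_dt already converted to datetime) before any repeating:
import numpy as np
import pandas as pd
import dask.dataframe as dd

def expand(pdf):
    # The original pandas solution, applied independently to each partition:
    # repeat each row `units` times, then enumerate the dates per original row.
    out = pdf.loc[np.repeat(pdf.index, pdf['units'])].copy()
    out['transaction_dt'] = out['transaction_dt'] + pd.to_timedelta(
        out.groupby(level=0).cumcount(), unit='d')
    return out.reset_index(drop=True)

dd_test = dd.from_pandas(df_test, npartitions=3)
result = dd_test.map_partitions(expand).compute()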

Dask read_csv: skip periodically occurring lines

I want to use Dask to read in a large file of atom coordinates at multiple time steps. The format is called XYZ file, and it looks like this:
3
timestep 1
C 9.5464696279 5.2523477968 4.4976072664
C 10.6455075132 6.0351186102 4.0196547961
C 10.2970471574 7.3880736108 3.6390228968
3
timestep 2
C 9.5464696279 5.2523477968 4.4976072664
C 10.6455075132 6.0351186102 4.0196547961
C 10.2970471574 7.3880736108 3.6390228968
The first line contains the atom number, the second line is just a comment.
After that, the atoms are listed with their names and positions.
After all atoms are listed, the same is repeated for the next time step.
I would now like to load such a trajectory via dask.dataframe.read_csv.
However, I could not figure out how to skip the periodically occurring lines containing the atom number and the comment. Is this actually possible?
Edit:
Reading this format into a Pandas Dataframe is possible via:
atom_nr = 3

def skip(line_nr):
    return line_nr % (atom_nr + 2) < 2

pd.read_csv(xyz_filename, skiprows=skip, delim_whitespace=True,
            header=None)
But it looks like the Dask dataframe does not support passing a function to skiprows.
Edit 2:
MRocklin's answer works! Just for completeness, I write down the full code I used.
from io import BytesIO

import pandas as pd
import dask.bytes
import dask.dataframe
import dask.delayed

atom_nr = ...
filename = ...

def skip(line_nr):
    return line_nr % (atom_nr + 2) < 2

def pandaread(data_in_bytes):
    pseudo_file = BytesIO(data_in_bytes[0])
    return pd.read_csv(pseudo_file, skiprows=skip, delim_whitespace=True,
                       header=None)

bts = dask.bytes.read_bytes(filename, delimiter=f"{atom_nr}\ntimestep".encode())
dfs = dask.delayed(pandaread)(bts)
sol = dask.dataframe.from_delayed(dfs)
sol.compute()
The only remaining question is: How do I tell dask to only compute the first n frames? At the moment it seems the full trajectory is read.
Short answer
No, dask.dataframe.read_csv does not offer this kind of functionality; pandas.read_csv accepts a callable for skiprows (as in your edit), but Dask does not support that (to my knowledge).
Long Answer
If you can write code to convert some of this data into a pandas dataframe, then you can probably do this on your own with moderate effort using
dask.bytes.read_bytes
dask.dataframe.from_delayed
In general this might look something like the following:
values = read_bytes('filenames.*.txt', delimiter='...', blocksize=2**27)
dfs = [dask.delayed(load_pandas_from_bytes)(v) for v in values]
df = dd.from_delayed(dfs)
Each of the dfs corresponds to roughly blocksize bytes of your data (up until the next delimiter). You can control how fine your partitions are with this blocksize. If you want, you can also select only a few of these dfs objects to get a smaller portion of your data:
dfs = dfs[:5] # only the first five blocks of `blocksize` data

RAM consumption by pandas DataFrame

I am trying to work with around 100 csv files to do a time series analysis.
To build an efficient algorithm, I've structured my data/read_csv function so that it reads all the files only once and doesn't have to repeat the same process again and again. To explain further, the following is my code:
import pandas as pd
from itertools import chain, combinations

start_date = '2016-06-01'
end_date = '2017-09-02'
allocation = 170000

# contains 100 symbols
usesymbols = ['']
cost_matrix = []

def data():
    dates = pd.date_range(start_date, end_date)
    df = pd.DataFrame(index=dates)
    for symbol in usesymbols:
        df_temp = pd.read_csv('/home/furqan/Desktop/python_data/{}.csv'.format(str(symbol)),
                              usecols=['Date', 'Close'],
                              parse_dates=True, index_col='Date', na_values=['nan'])
        df_temp = df_temp.rename(columns={'Close': symbol})
        df = df.join(df_temp)
    df = df.fillna(method='ffill')
    df = df.fillna(method='bfill')
    return df

def powerset(iterable):
    s = list(iterable)
    return chain.from_iterable(combinations(s, r) for r in range(1, len(s) + 1))

power_set = list(powerset(usesymbols))
dataframe = data()
The problem is that if I run the above code with 15 symbols it works perfectly.
But that's not sufficient; I want to use 100 symbols.
If I run the code with 100 items in usesymbols, my RAM is used up completely and the machine freezes.
Is there anything that can be done to avoid this situation?
Edited part:
1) I have 16 GB of RAM.
2) The issue is with the variable power_set; if I don't call the powerset function, the data gets retrieved easily.
DataFrame.memory_usage(index=False)
Returns:
sizes : Series
    A Series with the column names as index and the memory usage of each column in bytes.
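As a quick way to see where the memory goes, call it on the joined frame; a minimal sketch, assuming dataframe is the result of the question's data() call:
# Per-column memory usage of the joined price frame, in bytes.
usage = dataframe.memory_usage(index=False)
print(usage)
print("total: {:.1f} MiB".format(usage.sum() / 1024 ** 2))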

HDFStore: table.select and RAM usage

I am trying to select random rows from an HDFStore table of about 1 GB. RAM usage explodes when I ask for about 50 random rows.
I am using pandas 0.11-dev, python 2.7, linux64.
In this first case the RAM usage fits the size of chunk
with pd.get_store("train.h5",'r') as train:
for chunk in train.select('train',chunksize=50):
pass
In this second case, it seems like the whole table is loaded into RAM
r=random.choice(400000,size=40,replace=False)
train.select('train',pd.Term("index",r))
In this last case, RAM usage fits the equivalent chunk size
r=random.choice(400000,size=30,replace=False)
train.select('train',pd.Term("index",r))
I am puzzled, why moving from 30 to 40 random rows induces such a dramatic increase in RAM usage.
Note the table has been indexed when created such that index=range(nrows(table)) using the following code:
def txtfile2hdfstore(infile, storefile, table_name, sep="\t", header=0, chunksize=50000):
    max_len, dtypes0 = txtfile2dtypes(infile, sep, header, chunksize)
    with pd.get_store(storefile, 'w') as store:
        for i, chunk in enumerate(pd.read_table(infile, header=header, sep=sep,
                                                chunksize=chunksize, dtype=dict(dtypes0))):
            chunk.index = range(chunksize * i, chunksize * (i + 1))[:chunk.shape[0]]
            store.append(table_name, chunk, min_itemsize={'values': max_len})
Thanks for insight
EDIT TO ANSWER Zelazny7
Here's the file I used to write Train.csv to train.h5. I wrote this using elements of Zelazny7's code from How to trouble-shoot HDFStore Exception: cannot find the correct atom type
import pandas as pd
import numpy as np
from sklearn.feature_extraction import DictVectorizer

def object_max_len(x):
    if x.dtype != 'object':
        return
    else:
        return len(max(x.fillna(''), key=lambda x: len(str(x))))

def txtfile2dtypes(infile, sep="\t", header=0, chunksize=50000):
    max_len = pd.read_table(infile, header=header, sep=sep, nrows=5).apply(object_max_len).max()
    dtypes0 = pd.read_table(infile, header=header, sep=sep, nrows=5).dtypes
    for chunk in pd.read_table(infile, header=header, sep=sep, chunksize=chunksize):
        max_len = max((pd.DataFrame(chunk.apply(object_max_len)).max(), max_len))
        for i, k in enumerate(zip(dtypes0[:], chunk.dtypes)):
            if (k[0] != k[1]) and (k[1] == 'object'):
                dtypes0[i] = k[1]
    # as of pandas-0.11 nan requires a float64 dtype
    dtypes0.values[dtypes0 == np.int64] = np.dtype('float64')
    return max_len, dtypes0

def txtfile2hdfstore(infile, storefile, table_name, sep="\t", header=0, chunksize=50000):
    max_len, dtypes0 = txtfile2dtypes(infile, sep, header, chunksize)
    with pd.get_store(storefile, 'w') as store:
        for i, chunk in enumerate(pd.read_table(infile, header=header, sep=sep,
                                                chunksize=chunksize, dtype=dict(dtypes0))):
            chunk.index = range(chunksize * i, chunksize * (i + 1))[:chunk.shape[0]]
            store.append(table_name, chunk, min_itemsize={'values': max_len})
Applied as
txtfile2hdfstore('Train.csv','train.h5','train',sep=',')
This is a known issue; see the reference here: https://github.com/pydata/pandas/pull/2755
Essentially the query is turned into a numexpr expression for evaluation. There is an issue where I can't pass a lot of or conditions to numexpr (it depends on the total length of the generated expression).
So I just limit the expression that we pass to numexpr. If it exceeds a certain number of or conditions, then the query is done as a filter rather than an in-kernel selection. Basically this means the table is read and then reindexed.
This is on my enhancements list: https://github.com/pydata/pandas/issues/2391 (17).
As a workaround, just split your queries up into multiple ones and concat the results. That should be much faster and use a constant amount of memory.
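A minimal sketch of that workaround, keeping the pandas 0.11-era select/Term syntax from the question (the open store handle train and the number of batches are assumptions):
import numpy as np
import pandas as pd

rows = np.random.choice(400000, size=40, replace=False)

# Query in small batches so each generated numexpr expression stays short,
# then stitch the pieces back together.
pieces = [train.select('train', pd.Term('index', batch))
          for batch in np.array_split(rows, 4)]
result = pd.concat(pieces)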
