dask DataFrame.assign blows up dask graph - python

So I have an issue with dask DataFrame.assign. I generate a lot of derivative features from the main data and assign them to the main dataframe. After that, the dask graph for any single column is blown up. Here is a small example:
%pylab inline
import numpy as np
import pandas as pd
import dask.dataframe as dd
from dask.dot import dot_graph
df=pd.DataFrame({'x%s'%i:np.random.rand(20) for i in range(5)})
ddf = dd.from_pandas(df, npartitions=2)
dot_graph(ddf['x0'].dask)
Here is the dask graph, as expected:
g=ddf.assign(y=ddf['x0']+ddf['x1'])
dot_graph(g['x0'].dask)
Here the graph for the same column is blown up with irrelevant computation:
Imagine I have lots and lots of spawned columns. The computation graph for any particular column then includes irrelevant computations for all the other columns; in my case len(ddf['someColumn'].dask) > 100000, so this becomes unusable quickly.
So my question is: can this issue be resolved? Are there any existing means to do this? If not, what direction should I look in to implement it?
Thanks!

Rather than continuously assigning new columns to the dask dataframe, you might want to build several dask series and then concat them all together at the end
So instead of doing this:
df['x'] = df.w + 1
df['y'] = df.x * 10
df['z'] = df.y ** 2
Do this
x = df.w + 1
y = x * 10
z = y ** 2
df = df.assign(x=x, y=y, z=z)
Or this:
dd.concat([df, x, y, z], axis=1)
This may still result in the same number of tasks in your graph, but it will probably involve fewer memory copies.
Alternatively, if all of your transformations are row-wise, then you can construct a pandas function and map it across all partitions:
def f(part):
    part = part.copy()
    part['x'] = part.w + 1
    part['y'] = part.x * 10
    part['z'] = part.y ** 2
    return part

df = df.map_partitions(f)
Also, while a million-node task graph is less than ideal, it should also be OK. I've seen larger graphs run comfortably.
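As a small diagnostic sketch (reusing the example frame from the question, and purely illustrative): you can compare how many tasks selecting a single untouched column drags along under each construction style, which is a quick way to see whether a restructuring actually helped in your case.
import numpy as np
import pandas as pd
import dask.dataframe as dd

df = pd.DataFrame({'x%s' % i: np.random.rand(20) for i in range(5)})
ddf = dd.from_pandas(df, npartitions=2)

y = ddf['x0'] + ddf['x1']
assigned = ddf.assign(y=y)                  # attach the derived column via assign
concatenated = dd.concat([ddf, y], axis=1)  # attach it via concat instead

# Task counts that selecting one untouched column drags along in each case
print(len(ddf['x0'].dask))            # baseline
print(len(assigned['x0'].dask))       # after assign
print(len(concatenated['x0'].dask))   # after concat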

Related

Question about Yfinance, IndexError, and Numpy arrays

I am trying to use linear regression on data pulled from yfinance to predict future stock prices, but I am having trouble after transposing my data's shape.
Here I create a normalization function
def normalize_data(df):
    # df on input should contain only one column with the price data (plus dataframe index)
    min = df.min()
    max = df.max()
    x = df
    # time series normalization part
    # y will be a column in a dataframe
    y = (x - min) / (max - min)
    return y
And another function to pull stock prices from Yfinance that calls the normalization function
def closing_price(ticker):
    # Asset = pd.DataFrame(yf.download(ticker, start=Start, end=End)['Adj Close'])
    Asset = pd.DataFrame(yf.download(ticker, start='2022-07-13', end='2022-09-16')['Adj Close'])
    Asset = normalize_data(Asset)
    return Asset.to_numpy()
I then pull 11 different stocks using the function
MRO= closing_price('MRO')
HES= closing_price('HES')
FANG= closing_price('FANG')
DVN= closing_price('DVN')
PXD= closing_price('PXD')
COP= closing_price('COP')
CVX= closing_price('CVX')
APA= closing_price('APA')
EOG= closing_price('EOG')
HAL= closing_price('HAL')
BLK = closing_price('BLK')
Which works so far
But when I try to merge the first 10 numpy arrays together,
X = np.array([MRO, HES, FANG, DVN, PXD, COP, CVX, APA, EOG, HAL])[:, :, 0]
X = np.transpose(X)
it gives me the following warning on the first line, where I merge the numpy arrays:
<ipython-input-53-a30faf3e4390>:1: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray
Have you tried specifying the dtype explicitly, as the warning message suggests?
X = np.array([MRO, HES, FANG, DVN, PXD, COP, CVX, APA, EOG, HAL], dtype=object)[:, :, 0]
Alternatively, what are you trying to do with your data afterwards, run a linear regression? Does the data have to be an np array? Often working with data is a lot easier using pandas.DataFrame, and basically all machine learning libraries such as sklearn or statsmodels or any other you might want to use will have pandas support.
To create one big dataset out of these you could try the following:
data = pd.DataFrame()  # creating an empty dataframe to collect everything
# assumes closing_price returns the DataFrame itself (i.e. drop the .to_numpy() call)
tickers = {'MRO': MRO, 'HES': HES, 'FANG': FANG, 'DVN': DVN, 'PXD': PXD, 'COP': COP,
           'CVX': CVX, 'APA': APA, 'EOG': EOG, 'HAL': HAL, 'BLK': BLK}
for name, ticker in tickers.items():
    for column in ticker:  # each column is just labelled "Adj Close" and you can't name multiple columns the same way
        new_name = name + "_" + str(column)  # columns in data will then be named "MRO_Adj Close", "HES_Adj Close", etc.
        ticker[new_name] = ticker[column]
        ticker = ticker.drop(column, axis=1)
    data = pd.concat([data, ticker], axis=1)
Additionally, this neatly avoids the problems that can arise when different stock tickers have or lack different dates in their datasets, as Kevin Choon Liang Yew correctly pointed out in the comments above.
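If the goal is simply one aligned frame of adjusted closes, another option is to let yfinance fetch all the symbols in a single call. This is only a sketch, and it assumes a yfinance version where 'Adj Close' is returned by default (i.e. auto_adjust is not enabled); the batch download aligns all tickers on a shared date index, so missing dates become NaN instead of ragged arrays.
import pandas as pd
import yfinance as yf

symbols = ['MRO', 'HES', 'FANG', 'DVN', 'PXD', 'COP', 'CVX', 'APA', 'EOG', 'HAL', 'BLK']

# One request for all tickers; selecting 'Adj Close' gives a DataFrame with one column per symbol
prices = yf.download(symbols, start='2022-07-13', end='2022-09-16')['Adj Close']

# Min-max normalize each column, mirroring normalize_data above
normalized = (prices - prices.min()) / (prices.max() - prices.min())
print(normalized.head())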

Xarray: out-of-memory error (when calling the stack method of a Dask-backed Xarray the task is being killed by the system)

I was trying to answer another SO post (Writing xarray multiindex data in chunks) and I ended up with the code provided below.
The problem is that in this code the call to the stack method of the DataArray results in an out of memory error.
My question is:
I'm wondering why the stack method of the concatenated object (which is an xarray.DataArray) can't complete successfully, even though this DataArray is backed by Dask (so it should not run out of memory). Why does the test machine run out of memory and the task get killed by the system?
A little summary of what is happening when running this code:
The code runs smoothly until the line stacked = concatenated.stack(sample=('y','x','time')).
At that moment, the memory usage keeps increasing until it reaches almost 100% and the task is killed by the system.
The code was executed on a machine with 8GB of RAM.
I thought it would not run out of memory because concatenated is a Dask-backed xarray.DataArray. But it does.
I made various changes to this code, like using delayed operations, changing chunk sizes, using Dask methods instead of xarray's, etc., but without success.
I thought about two possibilities of what is happening:
The stack operation is NOT Dask-backed
The stack operation is Dask-backed, but even Dask requires a minimum amount of memory (for each chunk) and this amount can't fit in memory
Does someone know what is happening here and how to solve this problem ?
END NOTES:
You can vary nrows and ncols in order to change the size of concatenated.
For example setting nrows = 10000 instead of nrows = 20000 will reduce its size by half.
To be sure that the DataArray is Dask-backed I also tried to save concatenated to netcdf and load it with the chunks parameter. I tried different values for the chunks, but again without success:
concatenated.to_netcdf("concatenated.nc")
concatenated = xr.open_dataarray("concatenated.nc", chunks=10)
Using smaller values for the chunks parameter only results in taking more time to run out of memory.
This is the code:
import numpy as np
import dask.array as da
import xarray as xr
from numpy.random import RandomState
nrows = 20000
ncols = 20000
row_chunks = 500
col_chunks = 500
# Create a reproducible random numpy array
prng = RandomState(1234567890)
numpy_array = prng.rand(1, nrows, ncols)
data = da.from_array(numpy_array, chunks=(1, row_chunks, col_chunks))
def create_band(data, x, y, band_name):
    return xr.DataArray(data,
                        dims=('band', 'y', 'x'),
                        coords={'band': [band_name],
                                'y': y,
                                'x': x})

def create_coords(data, left, top, celly, cellx):
    nrows = data.shape[-2]
    ncols = data.shape[-1]
    right = left + cellx * ncols
    bottom = top - celly * nrows
    x = np.linspace(left, right, ncols) + cellx / 2.0
    y = np.linspace(top, bottom, nrows) - celly / 2.0
    return x, y
x, y = create_coords(data, 1000, 2000, 30, 30)
bands = ['blue', 'green', 'red', 'nir']
times = ['t1', 't2', 't3']
bands_list = [create_band(data, x, y, band) for band in bands]
src = []
for time in times:
    src_t = xr.concat(bands_list, dim='band')\
        .expand_dims(dim='time')\
        .assign_coords({'time': [time]})
    src.append(src_t)
concatenated = xr.concat(src, dim='time')
print(concatenated)
# computed = concatenated.compute() # "computed" is ~35.8GB
# All is fine until here
stacked = concatenated.stack(sample=('y','x','time'))
# After stack we'd like to transpose
transposed = stacked.T
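One way to probe the two hypotheses above without risking the full-size array is to rerun the setup with a much smaller grid (e.g. nrows = ncols = 1000) and inspect what stack returns. The following is only a diagnostic sketch under that assumption: it checks whether the data itself stays a lazy dask array, and looks at the stacked coordinate, which xarray represents as an in-memory pandas MultiIndex.
# Rerun the setup above with nrows = ncols = 1000 first, then:
stacked_small = concatenated.stack(sample=('y', 'x', 'time'))

# Is the underlying data still a lazy dask array after stack?
print(type(stacked_small.data))      # dask.array.core.Array if the reshape stayed lazy
print(stacked_small.data.chunks)     # chunk layout of the stacked array, no computation triggered

# The stacked 'sample' coordinate is an in-memory pandas MultiIndex whose length
# is y * x * time, independent of the dask chunking of the data
print(type(stacked_small.indexes['sample']))
print(len(stacked_small.indexes['sample']))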

Memory Error when performing operation on Pandas Dataframe slice

The goal is to calculate RMSE between two groups of columns in a pandas dataframe. The problem is that the amount of memory actually used is almost 10x the size of the dataframe. Here is the code I used to calculate RMSE:
import pandas as pd
import numpy as np
from random import shuffle
# set up test df (actual data is a pre-computed DF stored in HDF5)
dim_x, dim_y = 50, 1000000 # actual dataset dim_y = 56410949
cols = ["a_"+str(i) for i in range(1,(dim_x//2)+1)]
cols_b = ["b_"+str(i) for i in range(1,(dim_x//2)+1)]
cols.extend(cols_b)
df = pd.DataFrame(np.random.uniform(0,10,[dim_y, dim_x]), columns=cols)
# calculate rmse : https://stackoverflow.com/a/46349518
a = df.values
diffs = a[:,1:26] - a[:,26:27]
rmse_out = np.sqrt(np.einsum('ij,ij->i',diffs,diffs)/3.0)
df['rmse_out'] = rmse_out
df['rmse_out'].to_pickle('results_rmse.p')
When I get the values from the df with a = df.values, the memory usage for that step approaches 100GB according to top. The line that calculates the difference between the columns, diffs = a[:,1:26] - a[:,26:27], approaches 120GB and then produces a MemoryError. How can I modify my code to make it more memory-efficient, avoid the error, and actually calculate my RMSE values?
The solution I used was to split the dataframe into chunks:
df = pd.read_hdf('madre_merge_sort32.h5')
for i, d in enumerate(np.array_split(df, 10)):
    d.to_pickle(str(i) + ".p")
Then I ran through those pickled mini-dfs and calculated rmse in each:
import glob

for fn in glob.glob("*.p"):
    # process df values
    df = pd.read_pickle(fn)
    df.replace([np.inf, -np.inf], np.nan, inplace=True)
    df.dropna(inplace=True)
    a = df[df.columns[2:]].to_numpy()  # first two cols are non-numeric, so skip them
    # calculate rmse
    diffs = a[:, :25] - a[:, 25:]
    rmse_out = np.sqrt(np.einsum('ij,ij->i', diffs, diffs) / 3.0)
    df['rmse_out'] = rmse_out
    df.to_pickle("out" + fn)
Then I concatenated them:
dfls = []
for fn in glob.glob("out*.p"):
    df = pd.read_pickle(fn)
    dfls.append(df)
dfcat = pd.concat(dfls)
Chunking seemed to work for me.
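An alternative sketch that avoids the pickle round-trip entirely: process the rows in fixed-size chunks directly, so only one chunk's worth of differences is ever in memory. This assumes the a_*/b_* column layout of the test frame at the top of the question, and it divides by the number of compared columns for a standard RMSE (the code above divides by 3.0, presumably specific to the real data).
import numpy as np
import pandas as pd

def chunked_rmse(df, a_cols, b_cols, chunk_rows=1_000_000):
    # Row-wise RMSE between two column groups, computed chunk by chunk
    out = np.empty(len(df), dtype=np.float64)
    for start in range(0, len(df), chunk_rows):
        stop = min(start + chunk_rows, len(df))
        a = df[a_cols].iloc[start:stop].to_numpy()
        b = df[b_cols].iloc[start:stop].to_numpy()
        diffs = a - b
        out[start:stop] = np.sqrt(np.einsum('ij,ij->i', diffs, diffs) / diffs.shape[1])
    return out

# usage with the test frame built above:
a_cols = [c for c in df.columns if c.startswith('a_')]
b_cols = [c for c in df.columns if c.startswith('b_')]
df['rmse_out'] = chunked_rmse(df, a_cols, b_cols)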

How to directly create a categorical series in dask?

I would like to create a categorical dask Series based on a filter on another series. With pandas, I would do the following:
import numpy as np
import pandas as pd
x = pd.Series(np.random.random(10))
test = (x < 0.5).astype(int)
label = pd.Series(pd.Categorical.from_codes(test, categories=['a', 'b']))
If x is a dask Series, is there a way to create an equivalent label dask series without having to explicitly create the pandas series first (e.g., avoiding .compute() and from_pandas)?
Yes, everything you need is available, as follows:
import pandas as pd
import dask.array as da
import dask.dataframe as dd

r = da.random.random(1000000, chunks=(10000,))  # dask array
s = dd.from_array(r)                            # dask series
label = s.map_partitions(
    lambda d: pd.Series(pd.Categorical.from_codes(
        (d < 0.5).astype(int), categories=['a', 'b'])),
    meta='category')
(of course, replace your s with real data if you didn't really want random numbers)
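As a quick sanity check (a usage sketch; head() triggers a small real computation on the first partition):
print(label.dtype)    # reports a categorical dtype without computing anything
print(label.head())   # materializes a few rows; each value is 'a' or 'b'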

Rolling PCA on pandas dataframe

I'm wondering if anyone knows of how to implement a rolling/moving window PCA on a pandas dataframe. I've looked around and found implementations in R and MATLAB but not Python. Any help would be appreciated!
This is not a duplicate - moving window PCA is not the same as PCA on the entire dataframe. Please see pandas.DataFrame.rolling() if you do not understand the difference
Unfortunately, pandas.DataFrame.rolling() seems to flatten the df before rolling, so it cannot be used as one might expect to roll over the rows of the df and pass windows of rows to the PCA.
The following is a work-around for this based on rolling over indices instead of rows. It may not be very elegant but it works:
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# Generate some data (1000 time points, 10 features)
data = np.random.random(size=(1000, 10))
df = pd.DataFrame(data)

# Set the window size
window = 100

# Initialize an empty df of appropriate size for the output
df_pca = pd.DataFrame(np.zeros((data.shape[0] - window + 1, data.shape[1])))

# Define PCA fit-transform function
# Note: Instead of attempting to return the result,
# it is written into the previously created output array.
def rolling_pca(window_data):
    # window_data is the array of row indices for the current window
    pca = PCA()
    transf = pca.fit_transform(df.iloc[window_data.astype(int)])
    df_pca.iloc[int(window_data[0])] = transf[0, :]
    return True

# Create a df containing row indices for the workaround
df_idx = pd.DataFrame(np.arange(df.shape[0]))

# Use `rolling` to apply the PCA function (raw=True passes each window as a numpy array)
_ = df_idx.rolling(window).apply(rolling_pca, raw=True)

# The results are now contained here:
print(df_pca)
A quick check reveals that the values produced by this are identical to control values computed by slicing appropriate windows manually and running PCA on them.
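A sketch of that check (assuming the code above has been run): slice one window by hand, fit a fresh PCA on it, and compare its first transformed row against the corresponding row of df_pca. With inputs of this size sklearn uses the deterministic full SVD solver, so two fits on the identical window should match exactly; if signs ever differ, compare absolute values instead.
# Manually verify one window, e.g. the window starting at row 123
start = 123
manual = PCA().fit_transform(df.iloc[start:start + window])
print(np.allclose(manual[0, :], df_pca.iloc[start].to_numpy()))  # expect True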
