pd.get_dummies() slow on large levels - python

I'm unsure whether this is already the fastest possible method, or whether I'm doing something inefficient.
I want to one-hot encode a particular categorical column which has 27k+ possible levels. The column has different values in the two datasets, so I combined the levels first before calling get_dummies():
def hot_encode_column_in_both_datasets(column_name, df, df2, sparse=True):
    col1b = set(df2[column_name].unique())
    col1a = set(df[column_name].unique())
    combined_cats = list(col1a.union(col1b))
    df[column_name] = df[column_name].astype('category', categories=combined_cats)
    df2[column_name] = df2[column_name].astype('category', categories=combined_cats)

    df = pd.get_dummies(df, columns=[column_name], sparse=sparse)
    df2 = pd.get_dummies(df2, columns=[column_name], sparse=sparse)
    try:
        del df[column_name]
        del df2[column_name]
    except:
        pass
    return df, df2
However, it's been running for more than 2 hours and it's still stuck hot encoding.
Could I be doing something wrong here? Or is it just the nature of running it on large datasets?
df has 6.8M rows and 27 columns, and df2 has 19,990 rows and 27 columns, before one-hot encoding the column I want.
Advice appreciated, thank you! :)

I reviewed the get_dummies source code briefly, and I think it may not be taking full advantage of the sparsity for your use case. The following approach may be faster, but I did not attempt to scale it all the way up to the 6.8M records you have:
import numpy as np
import pandas as pd
import scipy.sparse as ssp

np.random.seed(1)
N = 10000

dfa = pd.DataFrame.from_dict({
    'col1': np.random.randint(0, 27000, N),
    'col2b': np.random.choice([1, 2, 3], N),
    'target': np.random.choice([1, 2, 3], N)
})

# construct an array of the unique values of the column to be encoded
vals = np.array(dfa.col1.unique())

# extract an array of values to be encoded from the dataframe
col1 = dfa.col1.values

# construct a sparse matrix of the appropriate size and an appropriate,
# memory-efficient dtype
spmtx = ssp.dok_matrix((N, len(vals)), dtype=np.uint8)

# do the encoding. NB: This is only vectorized in one of the two dimensions.
# Finding a way to vectorize the second dimension may yield a large speed up
for idx, val in enumerate(vals):
    spmtx[np.argwhere(col1 == val), idx] = 1

# Construct a SparseDataFrame from the sparse matrix and apply the index
# from the original dataframe and column names.
dfnew = pd.SparseDataFrame(spmtx, index=dfa.index,
                           columns=['col1_' + str(el) for el in vals])
dfnew.fillna(0, inplace=True)
UPDATE
Borrowing insights from other answers here and here, I was able to vectorize the solution in both dimensions. In my limited testing, I noted that constructing the SparseDataFrame seems to increase the execution time several fold, so if you don't need to return a DataFrame-like object, you can save a lot of time. This solution also handles the case where you need to encode two or more DataFrames into 2-D arrays with equal numbers of columns.
import numpy as np
import pandas as pd
import scipy.sparse as ssp

np.random.seed(1)
N1 = 10000
N2 = 100000

dfa = pd.DataFrame.from_dict({
    'col1': np.random.randint(0, 27000, N1),
    'col2a': np.random.choice([1, 2, 3], N1),
    'target': np.random.choice([1, 2, 3], N1)
})
dfb = pd.DataFrame.from_dict({
    'col1': np.random.randint(0, 27000, N2),
    'col2b': np.random.choice(['foo', 'bar', 'baz'], N2),
    'target': np.random.choice([1, 2, 3], N2)
})

# construct an array of the unique values of the column to be encoded,
# taking the union of the values from both dataframes
valsa = set(dfa.col1.unique())
valsb = set(dfb.col1.unique())
vals = np.array(list(valsa.union(valsb)), dtype=np.uint16)

def sparse_ohe(df, col, vals):
    """One-hot encoder using a sparse ndarray."""
    colaray = df[col].values
    # construct a sparse matrix of the appropriate size and an appropriate,
    # memory-efficient dtype
    spmtx = ssp.dok_matrix((df.shape[0], vals.shape[0]), dtype=np.uint8)
    # do the encoding
    spmtx[np.where(colaray.reshape(-1, 1) == vals.reshape(1, -1))] = 1
    # Construct a SparseDataFrame from the sparse matrix
    dfnew = pd.SparseDataFrame(spmtx, dtype=np.uint8, index=df.index,
                               columns=[col + '_' + str(el) for el in vals])
    dfnew.fillna(0, inplace=True)
    return dfnew

dfanew = sparse_ohe(dfa, 'col1', vals)
dfbnew = sparse_ohe(dfb, 'col1', vals)
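If all you ultimately need is the sparse matrices themselves (for example to feed into scikit-learn), here is a sketch of a fully vectorized variant that avoids materializing the dense rows-by-levels comparison matrix entirely, by mapping each value to its column through pandas category codes and building a scipy CSR matrix directly. The helper name sparse_ohe_csr and the result names are mine, not part of the code above; the frames dfa/dfb and the vals array are the ones defined above.
import numpy as np
import pandas as pd
import scipy.sparse as ssp

def sparse_ohe_csr(df, col, vals):
    """One-hot encode df[col] against the fixed category list `vals`,
    returning a scipy.sparse CSR matrix with one column per category."""
    # map each value to its column position in `vals`; unseen values get code -1
    codes = pd.Categorical(df[col], categories=vals).codes
    rows = np.arange(df.shape[0])
    mask = codes >= 0                      # guard against values missing from `vals`
    data = np.ones(mask.sum(), dtype=np.uint8)
    return ssp.csr_matrix((data, (rows[mask], codes[mask])),
                          shape=(df.shape[0], len(vals)))

# usage with the frames and `vals` defined above
mtx_a = sparse_ohe_csr(dfa, 'col1', vals)
mtx_b = sparse_ohe_csr(dfb, 'col1', vals)
Because no dense boolean matrix of shape (rows, 27k) is ever created, memory use stays proportional to the number of rows rather than rows times levels.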

Related

Generating and Storing Samples of an Exponential Distribution with a name for each sample using a loop

I've got a weird question for a class project. Assuming X ~ Exp(lambda) with lambda = 1.6, I have to generate 100 samples of X, with the indices corresponding to the sample size of each generated sample (S1, S2, ..., S100). I've worked out a simple loop which generates the required samples into arrays, but I am not able to name the arrays.
First attempt:
import numpy as np
import matplotlib.pyplot as plt
samples = []
for i in range(1, 101, 1):
    samples.append(np.random.exponential(scale=1/1.6, size=i))
Second attempt:
import numpy as np
import matplotlib.pyplot as plt
for i in range(1, 101, 1):
    samples = np.random.exponential(scale=1/1.2, size=i)
    col = f'samples {i}'
    df_samples[col] = exponential_sample
df_samples = pd.DataFrame(samples)
An example of how I would like to visualize the data:
# drawing 50 random samples of size 2 from the exponentially distributed population
sample_size = 2
df2 = pd.DataFrame(index=['x1', 'x2'])

for i in range(1, 51):
    exponential_sample = np.random.exponential(1/rate, sample_size)
    col = f'sample {i}'
    df2[col] = exponential_sample

# Taking a peek at the samples
df2
But instead of a fixed sample_size = 2, I would like to have sample size = i. That way I would generate 1 row for the first column (S1), 2 rows for the second column (S2), and so on up to 100 rows for the 100th column (S100).
You cannot easily stick vectors of different lengths into a DataFrame, so your mock-up code would not work, but you can concat one vector at a time:
df = pd.DataFrame()
for i in range(100, 10100, 100):
    tmp = pd.DataFrame({f'S{i}': np.random.exponential(scale=1/1.2, size=i)})
    df = pd.concat([df, tmp], axis=1)
Use a dict instead maybe?
samples = {}
for i in range(100, 10100, 100):
    samples[i] = np.random.exponential(scale=1/1.2, size=i)
Then you can convert it into a pandas Dataframe if you like.
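As a sketch of that last step: because the samples have different lengths, wrapping each one in a Series lets pandas pad the shorter columns with NaN. The names samples and df_samples below are mine, and I'm using the lambda = 1.6 from the question rather than the 1.2 above.
import numpy as np
import pandas as pd

samples = {i: np.random.exponential(scale=1/1.6, size=i) for i in range(1, 101)}
# Each array becomes a column S1..S100; shorter columns are padded with NaN
df_samples = pd.DataFrame({f'S{i}': pd.Series(v) for i, v in samples.items()})
print(df_samples.shape)   # (100, 100)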

Need to use apply or broadcasting and masking to iterate over a DataFrame

I have a data frame that I need to iterate over. I want to use either apply or broadcasting and masking. This is the pseudocode I am trying to improve upon.
Algorithm 1: The algorithm
    initialize the population (of size n) uniformly at random, obeying the bounds;
    while a pre-determined number of iterations is not complete do
        set the random parameters (two independent parameters for each of the d variables);
        find the best and the worst vectors in the population;
        for each vector in the population do
            create a new vector using the current vector, the best vector,
            the worst vector, and the random parameters;
            if the new vector is at least as good as the current vector then
                current vector = new vector;
This is the code I have so far.
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.uniform(-5.0, 10.0, size=(20, 5)), columns=list('ABCDE'))
pd.set_option('display.max_columns', 500)
df

# while portion of pseudocode
f_func = np.square(df).sum(axis=1)
final_func = np.square(f_func)
xti_best = final_func.idxmin()
xti_worst = final_func.idxmax()
print(final_func)
print(df.head())
print(df.tail())

# for loop of pseudocode
# for row in df.iterrows():
#     implement equation from assignment
#     define in array math
#     xi_new = row.to_numpy() + np.random.uniform(0, 1, size=(1, 5)) * (df.iloc[xti_best].values - np.absolute(row.to_numpy())) - np.random.uniform(0, 1, size=(1, 5)) * (df.iloc[xti_worst].values - np.absolute(row.to_numpy()))
#     print(xi_new)

df2 = df.apply(lambda row: 0 if row == 0 else row + np.random.uniform(0, 1, size=(1, 5)) * (df.iloc[xti_best].values - np.absolute(axis=1)))
print(df2)
The formula I am trying to use for xi_new is:
xi_new = xi_current + r1 * (xti_best - abs(xi_current)) - r2 * (xti_worst - abs(xi_current)), where r1 and r2 are random values between 0 and 1.
I'm not sure I'm implementing your formula correctly, but hopefully this helps
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.uniform(-5.0, 10.0, size = (20, 5)), columns = list('ABCDE'))
#while portion of pseudocode
f_func = np.square(df).sum(axis=1)
final_func = np.square(f_func)
xti_best_idx = final_func.idxmin()
xti_worst_idx = final_func.idxmax()
xti_best = df.loc[xti_best_idx]
xti_worst = df.loc[xti_worst_idx]
#Calculate random values for the whole df for the two different areas where you need randomness
nrows,ncols = df.shape
r1 = np.random.uniform(0, 1, size = (nrows, ncols))
r2 = np.random.uniform(0, 1, size = (nrows, ncols))
# xi_new = xi_current + r1 * (xti_best - abs(xi_current)) - r2 * (xti_worst - abs(xi_current))
df = df + r1 * xti_best.sub(df.abs()) - r2 * xti_worst.sub(df.abs())
df
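If you also want the acceptance step from the pseudocode ("keep the new vector only if it is at least as good"), here is a sketch that replaces the unconditional overwrite above: compute the candidates into a separate frame, then keep each row's candidate only where it is at least as good under the same fitness function as above (lower is better). The names df_new, fit_old, fit_new and keep are ones I've introduced.
# candidate vectors, kept separate so they can be compared with the current ones
df_new = df + r1 * xti_best.sub(df.abs()) - r2 * xti_worst.sub(df.abs())

# fitness of the current and candidate rows (same objective as above)
fit_old = np.square(np.square(df).sum(axis=1))
fit_new = np.square(np.square(df_new).sum(axis=1))

# accept a candidate row only where it is at least as good as the current row
keep = (fit_new <= fit_old).to_numpy()
df = pd.DataFrame(np.where(keep[:, None], df_new.to_numpy(), df.to_numpy()),
                  index=df.index, columns=df.columns)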

Scatter plot with logical indexing

I have a 100x2 array D and a 100x1 array c (with entries +/- 1). I'm trying to make a scatter plot of the rows of D for which c == 1.
I tried something like this: plt.scatter(D[0][c==1], D[1][c==1]), but it throws IndexError: too many indices for array.
I'm aware that I could use a list comprehension or something of that sort, but I'm fairly new to Python and am struggling with the syntax.
Thanks a lot.
Concept
You can use np.where to select only the rows of D whose corresponding entry in C is 1:
D = np.array([[0.25, 0.25], [0.75, 0.75]])
C = np.array([1, 0])
Using np.where, we can select only rows that are 1 in C:
>>> D[np.where(C==1)]
array([[0.25, 0.25]])
Example on your actual data:
D = np.random.randn(100, 2)
C = np.random.randint(0, 2, (100, 1))
valid = D[np.where(C.ravel()==1)]
import matplotlib.pyplot as plt
plt.scatter(valid[:, 0], valid[:, 1])
You can use numpy for this (assuming you have two numpy arrays, otherwise you can convert them into numpy arrays):
import numpy as np
c_ones = np.where(c.ravel() == 1)[0]   # finds all row indices where c == 1
d_0 = D[c_ones, 0]
d_1 = D[c_ones, 1]
Then you can plot d_0, d_1 as normal.
For converting your lists if needed,
C_np = np.asarray(c)
D_np = np.asarray(D)
And then perform np.where on C_np as shown above.
Would this solve your issue?
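For completeness, a minimal end-to-end sketch of the selection plus plot, assuming D is a 100x2 NumPy array and c has shape (100,) or (100, 1); the data here is made up:
import numpy as np
import matplotlib.pyplot as plt

D = np.random.randn(100, 2)
c = np.random.choice([-1, 1], size=100)

mask = (c.ravel() == 1)              # boolean mask over the 100 rows
plt.scatter(D[mask, 0], D[mask, 1])  # column 0 vs column 1 for the selected rows
plt.show()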

Pandas dot product with Multiindex

My problem is quite common in finance.
Given an array w (1xN) of weights and a covariance matrix Q (NxN) of assets, one can calculate the variance of the portfolio using the quadratic form w' * Q * w, where * is the dot product.
I want to understand the best way to perform this operation when I have a history of weights W (T x N) and a 3-D structure (T, N, N) of covariance matrices.
import numpy as np
import pandas as pd
returns = pd.DataFrame(0.1 * np.random.randn(100, 4), columns=['A', 'B', 'C', 'D'])
covariance = returns.rolling(20).cov()
weights = pd.DataFrame(np.random.randn(100, 4), columns=['A', 'B', 'C', 'D'])
My solution so far has been to convert the pandas DataFrames to numpy arrays, perform the calculation in a loop, and then convert back to pandas.
Note that I need to explicitly check the alignment of labels, since in reality the covariance and weights could be calculated by different processes.
cov_dict = {key: covariance.xs(key, axis=0, level=0)
            for key in covariance.index.get_level_values(0)}

def naive_numpy(weights, cov_dict):
    expected_risk = {}
    # Extract columns and index before passing to numpy arrays
    # Columns
    cov_assets = cov_dict[next(iter(cov_dict))].columns
    avail_assets = [el for el in cov_assets if el in weights]
    # Indexes
    cov_dates = list(cov_dict.keys())
    avail_dates = weights.index.intersection(cov_dates)
    sel_weights = weights.loc[avail_dates, avail_assets]
    # Main loop and calculation
    for t, value in zip(sel_weights.index, sel_weights.values):
        expected_risk[t] = np.sqrt(np.dot(value, np.dot(cov_dict[t].values, value)))
    # Back to a pandas Series
    expected_risk = pd.Series(expected_risk).reindex(weights.index).sort_index()
    return expected_risk
Is there a pure-pandas way to achieve the same result? Or is there any improvement to the code to make it more efficient? (Despite using numpy, it is still quite slow.)
I think numpy is definitely the best option, though you lose that efficiency if you loop over values/dates.
My suggestion for calculating the rolling volatility of a portfolio (with no looping):
returns = pd.DataFrame(0.1 * np.random.randn(100, 4), columns=['A', 'B', 'C', 'D'])
covariance = returns.rolling(20).cov()
weights = pd.DataFrame(np.random.randn(100, 4), columns=['A', 'B', 'C', 'D'])
rows, columns = weights.shape
# Go to numpy:
w = weights.values
cov = covariance.values.reshape(rows, columns, columns)
A = np.matmul(w.reshape(rows, 1, columns), cov)
var = np.matmul(A, w.reshape(rows, columns, 1)).reshape(rows)
std_dev = np.sqrt(var)
# Back to pandas (in case you want that):
pd.Series(std_dev, index = weights.index)
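An equivalent way to write the same per-date quadratic form is a single np.einsum call over the w and cov arrays defined above; treat it as a sketch, but it should produce the same result as the two matmul calls:
# variance per date: w_t' * Q_t * w_t for every t in one call
var = np.einsum('ti,tij,tj->t', w, cov, w)
std_dev = np.sqrt(var)
pd.Series(std_dev, index=weights.index)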

Traversing multiple dataframes simultaneously

I have three dataframes for three users with the same column names: time, compass data, accelerometer data, gyroscope data and camera panning information. I want to traverse all the dataframes simultaneously to check, for a particular time, which user has performed camera panning, and return that user (i.e. the data frame in which panning was detected at that time). I have tried using dask to achieve parallelism, but in vain. Below is my code:
import pandas as pd
import glob
import numpy as np
import math
from scipy.signal import butter, lfilter

order = 3
fs = 30
cutoff = 4.0

data = []
gx = []
gy = []
g_x2 = []
g_y2 = []

dataList = glob.glob(r'C:\Users\chaitanya\Desktop\Thesis\*.csv')
for csv in dataList:
    data.append(pd.read_csv(csv))

for i in range(0, len(data)):
    data[i] = data[i].groupby("Time").agg(lambda x: x.value_counts().index[0])
    data[i].reset_index(level=0, inplace=True)

def butter_lowpass(cutoff, fs, order=5):
    nyq = 0.5 * fs
    nor = cutoff / nyq
    b, a = butter(order, nor, btype='low', analog=False)
    return b, a

def lowpass_filter(data, cutoff, fs, order=5):
    b, a = butter_lowpass(cutoff, fs, order=order)
    y = lfilter(b, a, data)
    return y

for i in range(0, len(data)):
    gx.append(lowpass_filter(data[i]["Gyro_X"], cutoff, fs, order))
    gy.append(lowpass_filter(data[i]["Gyro_Y"], cutoff, fs, order))
    g_x2.append(gx[i] * gx[i])
    g_y2.append(gy[i] * gy[i])

g_rad = [[] for _ in range(len(data))]
g_ang = [[] for _ in range(len(data))]

for i in range(0, len(data)):
    for j in range(0, len(data[i])):
        g_ang[i].append(math.degrees(math.atan(gy[i][j] / gx[i][j])))
    data[i]["Ang"] = g_ang[i]

panning = [[] for _ in range(len(data))]
for i in range(0, len(data)):
    for j in data[i]["Ang"]:
        if 0 - 30 <= j <= 0 + 30:
            panning[i].append("Panning")
        elif 180 - 30 <= j <= 180 + 30:
            panning[i].append("left")
        else:
            panning[i].append("None")
    data[i]["Panning"] = panning[i]

result = [[] for _ in range(len(data))]
for i in range(0, len(data)):
    result[i].append(data[i].loc[data[i]['Panning'] == 'Panning', 'Ang'])
I'm going to make the assumption that you want to traverse simultaneously in time. In any case, you want your three dataframes to have an index in the dimension you want to traverse.
I'll generate 3 dataframes with rows representing random seconds in a 9 second period.
Then, I'll align these with a pd.concat and ffill to be able to reference the last known data for any gaps.
seconds = pd.date_range('2016-08-31', periods=10, freq='S')
n = 6
ssec = seconds.to_series()
sidx = ssec.sample(n).index

df1 = pd.DataFrame(np.random.randint(1, 10, (n, 3)),
                   ssec.sample(n).index.sort_values(),
                   ['compass', 'accel', 'gyro'])
df2 = pd.DataFrame(np.random.randint(1, 10, (n, 3)),
                   ssec.sample(n).index.sort_values(),
                   ['compass', 'accel', 'gyro'])
df3 = pd.DataFrame(np.random.randint(1, 10, (n, 3)),
                   ssec.sample(n).index.sort_values(),
                   ['compass', 'accel', 'gyro'])

df4 = pd.concat([df1, df2, df3], axis=1, keys=['df1', 'df2', 'df3']).ffill()
df4
You can then proceed to walk through the aligned frame via iterrows():
for tstamp, row in df4.iterrows():
    print(tstamp)
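If the per-user Panning column computed in the question is kept on each frame, the same concat-with-keys idea lets you answer "who is panning at time t" without an explicit Python loop. A sketch, where data is the list of per-user frames from the question and the user labels and t are placeholders of mine:
# align the three users' Panning columns on the Time column
panning = pd.concat([d.set_index('Time')['Panning'] for d in data],
                    axis=1, keys=['user1', 'user2', 'user3']).ffill()

# users panning at one particular timestamp t
t = panning.index[0]   # placeholder timestamp
users_panning = [u for u in panning.columns if panning.loc[t, u] == 'Panning']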
