Ensuring performance of sketching/streaming algorithm (countSketch)

Ensuring performance of sketching/streaming algorithm (countSketch) - python

I have implemented what is know as a countSketch in python (page 17: https://arxiv.org/pdf/1411.4357.pdf) but my implementation is currently lacking in performance. The algorithm is to compute the product SA where A is an n x d matrix, S is m x n matrix defined as follows: for every column of S uniformly at random select a row (hash bucket) from the m rows and for that given row, uniformly at random select +1 or -1. So S is a matrix with exactly one nonzero in every column and otherwise all zero.
My intention is to compute SA in a streaming fashion by reading the entries of A. The idea for my implementation is as follows: observe a sequence of triples (i,j,A_ij) and return a sequence (h(i), j, s(i)A_ij) where:
- h(i) is a hash bucket (row of matrix chosen uniformly at random from the m possible rows of S
- s(i) is the random sign function as described above.
I have assumed that the matrix is in row order so that the first row of A arrives in its entirety before the next row of A arrives because this limits the number of calls I need to select a random bucket or the need to use a hash library. I have also assumed that the number of nonzero entries (or the length of the input stream) is known so that I can efficiently encode the iteration.
My problem is that the matrix should compute (1+error)*||Ax||^2 <= ||SAx||^2 <= (1+error)*||Ax||^2 and also have the difference in frobenius norms between A^T S^T S A and A^T A being small. However, while my implementation for the first condition seems to be true, the latter is consistently too small. I was wondering if there is an obvious reason for this that I am missing because it seems to be underestimating the latter quantity.
I am open to feedback on changing the code if there are obvious improvements to be made. The single call to np.choice is made to remove the need to look through a (potentially large) array or hash table containing the row hashes for each row and because the matrix is in row order we can just keep the hash for that row until a new row is seen.
nb. If you don't want to run using numba then just comment out the import and the function decorator and it will run in standard numpy/scipy.
import numpy as np
import numpy.random as npr
import scipy.sparse as sparse
from scipy.sparse import coo_matrix
import numba
from numba import jit
#jit(nopython=True) # comment this if want just numpy
def countSketch(input_rows, input_data,
input_nnz,
sketch_size, seed=None):
'''
input_rows: row indices for data (can be repeats)
input_data: values seen in row location,
input_nnz : number of nonzers in the data (can replace with
len(input_data) but avoided here for speed)
sketch_size: int
seed=None : random seed
'''
hashed_rows = np.empty(input_rows.shape,dtype=np.int32)
current_row = 0
hash_val = npr.choice(sketch_size)
sign_val = npr.choice(np.array([-1.0,1.0]))
#print(hash_val)
hashed_rows[0] = hash_val
#print(hash_val)
for idx in np.arange(input_nnz):
print(idx)
row_id = input_rows[idx]
data_val = input_data[idx]
if row_id == current_row:
hashed_rows[idx] = hash_val
input_data[idx] = sign_val*data_val
else:
# make new hashes
hash_val = npr.choice(sketch_size)
sign_val = npr.choice(np.array([-1.0,1.0]))
hashed_rows[idx] = hash_val
input_data[idx] = sign_val*data_val
return hashed_rows, input_data
def sort_row_order(input_data):
sorted_row_column = np.array((input_data.row,
input_data.col,
input_data.data))
idx = np.argsort(sorted_row_column[0])
sorted_rows = np.array(sorted_row_column[0,idx], dtype=np.int32)
sorted_cols = np.array(sorted_row_column[1,idx], dtype=np.int32)
sorted_data = np.array(sorted_row_column[2,idx], dtype=np.float64)
return sorted_rows, sorted_cols, sorted_data
if __name__=="__main__":
import time
from tabulate import tabulate
matrix = sparse.random(1000, 50, 0.1)
x = np.random.randn(matrix.shape[1])
true_norm = np.linalg.norm(matrix#x,ord=2)**2
tidy_data = sort_row_order(matrix)
sketch_size = 300
start = time.time()
hashed_rows, sketched_data = countSketch(tidy_data[0],\
tidy_data[2], matrix.nnz,sketch_size)
duration_slow = time.time() - start
S_A = sparse.coo_matrix((sketched_data, (hashed_rows,matrix.col)))
approx_norm_slow = np.linalg.norm(S_A#x,ord=2)**2
rel_error_slow = approx_norm_slow/true_norm
#print("Sketch time: {}".format(duration_slow))
start = time.time()
hashed_rows, sketched_data = countSketch(tidy_data[0],\
tidy_data[2], matrix.nnz,sketch_size)
duration = time.time() - start
#print("Sketch time: {}".format(duration))
S_A = sparse.coo_matrix((sketched_data, (hashed_rows,matrix.col)))
approx_norm = np.linalg.norm(S_A#x,ord=2)**2
rel_error = approx_norm/true_norm
#print("Relative norms: {}".format(approx_norm/true_norm))
print(tabulate([[duration_slow, rel_error_slow, 'Yes'],
[duration, rel_error, 'No']],
headers=['Sketch Time', 'Relative Error', 'Dry Run'],
tablefmt='orgtbl'))

Related

Going from code that compares one value to all other values to all values of all other values

I've written the code that will find take a number of n grams, a specific index, and a threshold, and return the values that fall within that threshold. However, currently, it only compares a set of tokens (a given index) to each set of tokens at all other indices. I want to compare each set tokens at all indices to every other set of tokens at all indices. I don't think this is a difficult question, but python is my main language and I struggle with for loops a bit.
So essentially, the variable token in the function should actually iterate over each string in the column, and be compared with comp_token and the index call would be removed, since it would be iterating over all indices.
Let me know if that isn't clear enough and I will think more about how to say this: it is just difficult because the thing I am asking is the thing I am struggling with.
data = ['Time', "NY Times", 'Atlantic']
ph = pd.DataFrame(data, columns=['companies'])
ph.reset_index(inplace=True)
import py_stringmatching as sm
import pandas as pd
import numpy as np
jac = sm.Jaccard()
def predict_label(num, index, thresh):
qg_num_tok = sm.QgramTokenizer(qval = num)
companies = ph.companies.to_list()
ids = ph['index']
companies_qg_num_token = {}
companies_id2index = {}
for i in range(len(companies)):
companies_id2index[i] = companies[i]
companies_qg_num_token[i] = qg_num_tok.tokenize(companies[i])
predicted_label = [1]
token = companies_qg_num_token[index] #index you want: to get all the tokens
for comp_name in ids[1:]:
comp_token = companies_qg_num_token[comp_name]
sim = jac.get_sim_score(token, comp_token)
if sim > thresh:
predicted_label.append(1)
else:
predicted_label.append(0)
#companies_id2index must be equal to token numbner
ph.loc[ph['index'] != companies_id2index[index], 'label'] = 0 #if not equal to index
ph['prediction'] = predicted_label
print(token)
print(companies_id2index[index])
return ph.query('prediction==1')
predict_label(ph, 1, .5)

Avoid for loop in Python DataFrame

Problem 1.
Suppose I have n years of annual returns r and my initial wealth is 100. Every year I have fixed expense of 6. I want to create yearly wealth. I can do it in for loop. But for my purpose it's time consuming. How do I do it in DataFrame?
wealth = pd.Series(index = range(n+1))
wealth[0] = 100
for i in range(n):
wealth.iloc[i+1] = wealth.iloc[i]*(1+r.iloc[i]) - 6
Initially I thought
wealth = ((1 + r - 0.06).cumprod()).multiply(other = 100)
to be the solution. But it is not. Expenses are not 6%. They are fixed. It is 6.
Problem 2.
I want to do the above N times. In each case I generate r by sampling n returns with replacement.
r = returnY.sample(n,replace=True).reset_index(drop=True)
Then for that return, create the wealth path I described above and create a n*N dateframe of wealth paths. I can do this in for loop, but for big N and n, it takes long time to run. Is there an efficient and elegant way to do this?
Problem 3.
Suppose allWealth is the DF with all wealth paths. Want to check %columns in each row less than 0. This is how I resolved it.
yy = allWealth.copy()
yy[yy>0] = 1
yy[yy<=0] = 0
yy.sum(axis = 1)/N
Any better, more elegant solution?

Problem 1: It looks like you want to apply the "reduce" pattern. You can use reduce function from functools.
import numpy as np
from functools import reduce
rs = np.random.random(50)*0.3 #sequence of annual returns
result = reduce(lambda w,r: w*(1+r)-6, rs, 100)
If you want to keep all the intermediate values, use itertools.accumulate() instead. For example, replace the last line with the following:
ts_iter= itertools.accumulate(rs, lambda w,r: w*(1+r)-6, initial=100)
ts = list(ts_iter) #itertools.accumulate returns an iterable
Problem 2: You can first generate a random matrix of nxN by sampling with replacement. Then you can use "apply_along_axis" method for each column.
import numpy as np
rm = np.random.random((n,N))
def sim(rs):
return reduce(lambda w,r: w * (1+r) - 6, rs, 100)
result = np.apply_along_axis(sim, 0, rm)
Problem 3: you don't need to assign ones and zeros to your original dataframe. A mask dataframe of True and False implicitly acts as a dataframe of ones and zeros in this case.
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.random((50,30)))
mask = df < 0.5
mask.sum(axis=1)/30

I used #chi's solution with some small edit.
import numpy as np
import itertools
rm = np.random.random((n,N)) #sequence of annual returns
rm0 = np.insert(rm, 0, 100, axis=1)
def wealth(rs):
return list(itertools.accumulate(rs, lambda w,r: w*(1+r)-6))
result = np.apply_along_axis(wealth, 1, rm0)
itertools.accumulate does not recognize initial. Hence inserted initial wealth at the front of return array.

How do I add a matrix constraint `Ax=b` to a Pyomo model efficiently?

I want to add the constraints Ax=b to a Pyomo model with my numpy arrays A and b as efficient as possible. Unfortunately, the performance is very bad currently. For the following example
import time
import numpy as np
import pyomo.environ as pyo
start = time.time()
rows = 287
cols = 2765
A = np.random.rand(rows, cols)
b = np.random.rand(rows)
mdl = pyo.ConcreteModel()
mdl.rows = range(rows)
mdl.cols = range(cols)
mdl.A = A
mdl.b = b
mdl.x_var = pyo.Var(mdl.cols, bounds=(0.0, None))
mdl.constraints = pyo.ConstraintList()
[mdl.constraints.add(sum(mdl.A[row, col] * mdl.x_var[col] for col in mdl.cols) <= mdl.b[row]) for row in mdl.rows]
mdl.obj = pyo.Objective(expr=sum(mdl.x_var[col] for col in mdl.cols), sense=pyo.minimize)
end = time.time()
print(end - start)
is takes almost 30 seconds because of the add statement and the huge amount of columns. Is it possible to pass A, x, and b directly and fast instead of adding it row by row?

The main thing that is slowing down your construction above is the fact that you are building the constraint list elements within a list comprehension, which is unnecessary and causes a lot of bloat.
This line:
[mdl.constraints.add(sum(mdl.A[row, col] * mdl.x_var[col] for col in mdl.cols) <= mdl.b[row]) for row in mdl.rows]
Constructs a list of the captured results of each ConstraintList.add() expression, which is a "rich" return. That list is an unnecessary byproduct of the loop you desire to do over the add() function. Just change your generation scheme to either a loop or a generator (by using parens) to avoid that capture, as such:
(mdl.constraints.add(sum(mdl.A[row, col] * mdl.x_var[col] for col in mdl.cols) <= mdl.b[row]) for row in mdl.rows)
And the model construction time drops to about 0.02 seconds.

How to find which input value in a loop yielded the output min?

I am trying to solve a min value problem, I could obtain the min values from two loops but, what I really need is also the exact values that correspended to output min.
from __future__ import division
from numpy import*
b1=0.9917949
b2=0.01911
b3=0.000840
b4=0.10175
b5=0.000763
mu=1.66057*10**(-24) #gram
c=3.0*10**8
Mler=open("olasiM.txt","w+")
data=zeros(0,'float')
for A in range(1,25):
M2=zeros(0,'float')
print 'A=',A
for Z in range(1,A+1):
SEMF=mu*c**2*(b1*A+b2*A**(2./3.)-b3*Z+b4*A*((1./2.)-(Z/A))**2+(b5*Z**2)/(A**(1./3.)))
SEMF=array(SEMF)
M2=hstack((M2,SEMF))
minm2=min(M2)
data=hstack((data,minm2))
data=hstack((data,A))
datalist = data.tolist()
for i in range (len(datalist)):
Mler.write(str(datalist[i])+'\n')
Mler.close()
Here, what I want is to see the min value of the SEMF and, corresponding A,Z values, For example, it has to be A=1, Z=1 and SEMF= some#
I also don't know how to write these, A and Z values to the document

The big advantage of numpy over using python lists is vectorized operations. Unfortunately your code fails completely in using them. For example the whole inner loop that has Z as index can easily be vectorized. You instead are computing the single elements using python floats and then stacking them one by one in the numpy array M2.
So I'd refactor that part of the code by:
import numpy as np
# ...
Zs = np.arange(1, A+1, dtype=float)
SEMF = mu*c**2 * (b1*A + b2*A**(2./3.) - b3*Zs + b4*A*((1./2.) - (Zs/A))**2 + (b5*Zs**2)/(A**(1./3.)))
Here the SEMF array should be exactly what you'd obtain as the final M2 array. Now you can find the minimum and stack that value into your data array:
min_val = SEMF.min()
data = hstack((data,minm2))
data = hstack((data,A))
If you also what to keep track for which value of Z you got the minimum you can use the argmin method:
min_val, min_pos = SEMF.min(), SEMF.argmin()
data = hstack((data,np.array([min_val, min_pos, A])))
The final code should look like:
from __future__ import division
import numpy as np
b1 = 0.9917949
b2 = 0.01911
b3 = 0.000840
b4 = 0.10175
b5 = 0.000763
mu = 1.66057*10**(-24) #gram
c = 3.0*10**8
data=zeros(0,'float')
for A in range(1,25):
Zs = np.arange(1, A+1, dtype=float)
SEMF = mu*c**2 * (b1*A + b2*A**(2./3.) - b3*Zs + b4*A*((1./2.) - (Zs/A))**2 + (b5*Zs**2)/(A**(1./3.)))
min_val, min_pos = SEMF.min(), SEMF.argmin()
data = hstack((data,np.array([min_val, min_pos, A])))
datalist = data.tolist()
with open("olasiM.txt","w+") as mler:
for i in range (len(datalist)):
mler.write(str(datalist[i])+'\n')
Note that numpy provides some functions to save/load array to/from files, like savetxt so I suggest that instead of manually saving the values there to use these functions.
Probably some numpy expert could vectorize also the operations for the As. Unfortunately my numpy knowledge isn't that advanced and I don't know how the handle the fact that we'd have a variable number of dimensions due to the range(1, A+1) thing...

Spatial temporal query in python with many records

I have a dataframe of 600 000 x/y points with date-time information, along another field 'status', with extra descriptive information
My objective is, for each record:
sum column 'status' by records that are within a certain spatial temporal buffer
the specific buffer is within t - 8 hours and < 100 meters
Currently I have the data in a pandas data frame.
I could, loop through the rows, and for each record, subset the dates of interest, then calculate a distances and restrict the selection further. However that would still be quite slow with so many records.
THIS TAKES 4.4 hours to run.
I can see that I could create a 3 dimensional kdtree with x, y, date as epoch time. However, I am not certain how to restrict the distances properly when incorporating dates and geographic distances.
Here is some reproducible code for you guys to test on:
Import
import numpy.random as npr
import numpy
import pandas as pd
from pandas import DataFrame, date_range
from datetime import datetime, timedelta
Create data
np.random.seed(111)
Function to generate test data
def CreateDataSet(Number=1):
Output = []
for i in range(Number):
# Create a date range with hour frequency
date = date_range(start='10/1/2012', end='10/31/2012', freq='H')
# Create long lat data
laty = npr.normal(4815862, 5000,size=len(date))
longx = npr.normal(687993, 5000,size=len(date))
# status of interest
status = [0,1]
# Make a random list of statuses
random_status = [status[npr.randint(low=0,high=len(status))] for i in range(len(date))]
# user pool
user = ['sally','derik','james','bob','ryan','chris']
# Make a random list of users
random_user = [user[npr.randint(low=0,high=len(user))] for i in range(len(date))]
Output.extend(zip(random_user, random_status, date, longx, laty))
return pd.DataFrame(Output, columns = ['user', 'status', 'date', 'long', 'lat'])
#Create data
data = CreateDataSet(3)
len(data)
#some time deltas
before = timedelta(hours = 8)
after = timedelta(minutes = 1)
Function to speed up
def work(df):
output = []
#loop through data index's
for i in range(0, len(df)):
l = []
#first we will filter out the data by date to have a smaller list to compute distances for
#create a mask to query all dates between range for date i
date_mask = (df['date'] >= df['date'].iloc[i]-before) & (df['date'] <= df['date'].iloc[i]+after)
#create a mask to query all users who are not user i (themselves)
user_mask = df['user']!=df['user'].iloc[i]
#apply masks
dists_to_check = df[date_mask & user_mask]
#for point i, create coordinate to calculate distances from
a = np.array((df['long'].iloc[i], df['lat'].iloc[i]))
#create array of distances to check on the masked data
b = np.array((dists_to_check['long'].values, dists_to_check['lat'].values))
#for j in the date queried data
for j in range(1, len(dists_to_check)):
#compute the ueclidean distance between point a and each point of b (the date masked data)
x = np.linalg.norm(a-np.array((b[0][j], b[1][j])))
#if the distance is within our range of interest append the index to a list
if x <=100:
l.append(j)
else:
pass
try:
#use the list of desired index's 'l' to query a final subset of the data
data = dists_to_check.iloc[l]
#summarize the column of interest then append to output list
output.append(data['status'].sum())
except IndexError, e:
output.append(0)
#print "There were no data to add"
return pd.DataFrame(output)
Run code and time it
start = datetime.now()
out = work(data)
print datetime.now() - start
Is there a way to do this query in a vectorized way? Or should I be chasing another technique.
<3

Here is what at least somewhat solves my problem. Since the loop can operate on different parts of the data independently, parallelization makes sense here.
using Ipython...
from IPython.parallel import Client
cli = Client()
cli.ids
cli = Client()
dview=cli[:]
with dview.sync_imports():
import numpy as np
import os
from datetime import timedelta
import pandas as pd
#We also need to add the time deltas and output list into the function as
#local variables as well as add the Ipython.parallel decorator
#dview.parallel(block=True)
def work(df):
before = timedelta(hours = 8)
after = timedelta(minutes = 1)
output = []
final time 1:17:54.910206, about 1/4 original time
I would still be very interested for anyone to suggest small speed improvements within the body of the function.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Ensuring performance of sketching/streaming algorithm (countSketch) - python

Related

Going from code that compares one value to all other values to all values of all other values

Avoid for loop in Python DataFrame

How do I add a matrix constraint `Ax=b` to a Pyomo model efficiently?

How to find which input value in a loop yielded the output min?

Spatial temporal query in python with many records

Categories

Resources