I am trying to do target encoding of the categorical columns of a feature array X based on a 0-1 target array y, i.e. substitute each level of feature x_i with the mean value of the target (the fraction of 1's) for that level.
The following code is likely to be inefficient because of the two nested loops that mimic the group-by. Is there any room for improvement in this implementation (avoiding the slow pandas group-by)? Thank you.
import numpy as np

np.random.seed(9)
rows, cols = 10_000, 500
X = np.random.choice(['a', 'b', 'c', 'd', 'e', 'f', 'g'], size=(rows, cols))
y = np.random.choice([0, 1], size=(rows, 1))

# learn encoding: map each level of each categorical column to its target mean
maps = {}
for column in range(X.shape[1]):
    c = X[:, column]
    if c.dtype.kind == "U":
        tmap = {}
        for level in np.unique(c):
            tmap[level] = y[c == level].mean()
        maps[str(column)] = tmap

# apply encoding: replace each level with its learned target mean
X = X.astype('<U32')
for col, tmap in maps.items():
    vals = np.full(X.shape[0], np.nan)
    for val, mean_target in tmap.items():
        vals[X[:, int(col)] == val] = mean_target
    X[:, int(col)] = vals
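As a point of comparison (a sketch I'm adding, not part of the original post), the inner loop over the levels of a single column can be vectorized with np.unique and np.bincount; applied to each original string column before the in-place overwrite above, it returns the encoded column directly:

def target_encode_column(col, y):
    # col: 1-D array of strings, y: 1-D 0/1 target array (e.g. y.ravel())
    levels, codes = np.unique(col, return_inverse=True)  # integer code per row
    sums = np.bincount(codes, weights=y)                  # target sum per level
    counts = np.bincount(codes)                           # row count per level
    return (sums / counts)[codes]                         # target mean per row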
I have a 2D numpy array whose rows are time series of a feature, on which I'm training a neural network. For generalisation purposes, I would like to subset these time series at random points. I'd like them to have a minimum subset length as well. However, the network requires fixed-length time series, so I need to pre-pad the resulting subsets with zeroes.
Currently, I'm doing it using the code below, which includes a nasty for loop, because I don't know how to use fancy indexing for this particular problem. As this piece of code is part of the network's data generator, it needs to be fast to keep pace with the data-hungry GPU. Does anyone know a numpy way of doing this without the for loop?
import numpy as np
import matplotlib.pyplot as plt
# Amount of time series to consider
batchsize = 25
# Original length of the time series
timesteps = 150
# As an example, fill the 2D array with sine function time series
sinefunction = np.expand_dims(np.sin(np.arange(timesteps)), axis=0)
originalarray = np.repeat(sinefunction, batchsize, axis=0)
# Now the real thing, we want:
# - to start the time series at a random moment (between 0 and maxstart)
# - to end the time series at a random moment
# - however with a minimum length of the resulting subset time series (minlength)
maxstart = 50
minlength = 75
# get random starts
randomstarts = np.random.choice(np.arange(0, maxstart), size=batchsize)
# get random stops
randomstops = np.random.choice(np.arange(maxstart + minlength, timesteps), size=batchsize)
# determine the resulting random sizes of the subset time series
randomsizes = randomstops - randomstarts
# finally create a new 2D array with all the randomly subset time series, however pre-padded with zeros
# THIS IS THE FOR LOOP WE SHOULD TRY TO AVOID
cutarray = np.zeros_like(originalarray)
for i in range(batchsize):
    cutarray[i, -randomsizes[i]:] = originalarray[i, randomstarts[i]:randomstops[i]]
To show what goes in and out of the function:
# Show that it worked
f, ax = plt.subplots(2, 1)
ax[0].imshow(originalarray)
ax[0].set_title('original array')
ax[1].imshow(cutarray)
ax[1].set_title('zero-padded subset array')
Approach #1 : Views-based
We can leverage scikit-image's view_as_windows (which is based on np.lib.stride_tricks.as_strided) to get sliding windowed views into a zero-padded version of the input and assign into a zero-padded version of the output. All of that padding is needed for a vectorized solution on account of the ragged nature of the subsets. The upside is that working on views is efficient on memory and performance.
The implementation would look something like this -
from skimage.util.shape import view_as_windows

n = randomsizes.max()
max_extent = randomstarts.max() + n
padlen = max_extent - originalarray.shape[1]
p = np.zeros((originalarray.shape[0], padlen), dtype=originalarray.dtype)
a = np.hstack((originalarray, p))
w = view_as_windows(a, (1, n))[..., 0, :]
out_vals = w[np.arange(len(randomstarts)), randomstarts]

out_starts = originalarray.shape[1] - randomsizes
out_extensions_max = out_starts.max() + n
out = np.zeros((originalarray.shape[0], out_extensions_max), dtype=originalarray.dtype)
w2 = view_as_windows(out, (1, n))[..., 0, :]
w2[np.arange(len(out_starts)), out_starts] = out_vals

cutarray_out = out[:, :originalarray.shape[1]]
Approach #2 : With masking
cutarray_out = np.zeros_like(originalarray)
r = np.arange(originalarray.shape[1])
m = (randomstarts[:, None] <= r) & (randomstops[:, None] > r)
s = originalarray.shape[1] - randomsizes
m2 = s[:, None] <= r
cutarray_out[m2] = originalarray[m]
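A quick sanity check (my addition, assuming the loop-based cutarray from the question is still in scope) is to compare either result against the original loop output:

# both vectorized approaches should reproduce the loop-based result exactly
assert np.allclose(cutarray_out, cutarray)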
I want to take the first N rows of my dataset and arrange them as a matrix (such that N >= the number of columns, i.e. 6 in this case), then find the determinant of |Matrix.T * Matrix|, so my final matrix product will be a 6x6 matrix.
I set the first column 'Serial_no' as the index.
EDITED QUESTION:
I want to find the 7-row matrix from my complete dataset that gives the maximum determinant |Matrix.T * Matrix| of the product. I also want the index values of the best set.
Dataset:
Serial_no,A,B,C,D,E,F
1,0.379,-0.588,-1.69,-0.0135,0.083,-0.0297
2,-0.144,0.278,0.354,-0.000672,-0.0228,0.014
3,0.295,-0.157,-1.63,-0.00451,0.0778,-0.00969
4,0.371,-0.623,-4.98,-0.000253,0.0872,-0.0109
5,0.369,-3.11,-8.3,-0.0000105,0.0871,-0.0327
6,0.369,-0.899,-7.19,-0.0000177,0.0872,-0.0109
7,0.383,-1.04,-2.76,-0.00418,0.089,-0.033
8,0.369,-1.04,-8.3,-0.00000263,0.0871,-0.0109
9,-0.124,0.421,0.679,0.00246,-0.0216,0.0133
10,0.37,2.15,-17.1,0.000244,0.0871,0.0109
11,0.369,5.61,-14.9,0.0000352,0.0872,0.0327
12,0.369,1.45,-11.6,-0.000000963,0.0872,0.0109
13,0.369,3.53,-9.41,-0.00000186,0.0872,0.0327
14,0.369,6.44,-17.2,0.000513,0.0872,0.0327
15,-0.11,-2.57,4.11,-0.000127,-0.0209,-0.0131
16,-0.11,-2.76,4.43,-0.000606,-0.0211,-0.0132
17,0.37,0.761,-6.09,0.0000571,0.0871,0.0109
18,0.3678,1.45,-3.88,0.00209,0.0865,0.0325
19,0.381,-2.46,-19.4,-0.00274,0.0874,-0.0111
20,0.369,4.36,-11.6,-0.000003,0.0872,0.0327
21,-0.111,-1.74,2.79,0.000000903,-0.0209,-0.0131
22,-0.111,-1.91,3.05,-0.000000953,-0.0209,-0.0131
23,0.368,2.28,-6.09,0.000164,0.0871,0.0327
24,-0.11,-0.913,1.46,-0.0000412,-0.0209,-0.0131
25,-0.111,-1.08,1.73,-0.0000101,-0.0209,-0.0131
26,-0.144,-0.278,0.354,0.000672,-0.0228,-0.014
27,0.344,-0.344,-2.76,-0.00202,0.0877,-0.0107
28,0.369,3.11,-8.3,0.0000105,0.0871,0.0327
29,0.383,1.04,-2.76,0.00418,0.089,0.033
30,-0.124,-0.421,0.679,-0.00246,-0.0216,-0.0133
import pandas as pd
import numpy as np

# import the dataset with pandas
dataset = pd.read_csv('Dataset.csv')
dataset = dataset.set_index('Serial_no')
X = dataset.iloc[:, :]
len_of_col = len(dataset.columns)
N = int(input("Enter total no. of rows: "))
Here you go:
N = 7
# first N rows (df here is the DataFrame indexed by Serial_no, i.e. dataset above)
mat = df.iloc[:N]
np.linalg.det(mat.T @ mat)
# 3.91198281101018e-11
Update: if your data is not too long, a for loop lets you compute the determinant for every window of N consecutive rows:
N = 7

def my_det(df, i):
    # determinant of Matrix.T @ Matrix for the N consecutive rows starting at i
    mat = df.iloc[i:i + N]
    return np.linalg.det(mat.T @ mat)

all_det = [my_det(df, i) for i in range(len(df) - N)]
print(np.argmax(all_det))
# 7
print(np.max(all_det))
# 6.453644515027227e-11
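The window above only looks at consecutive rows. If the best arbitrary set of 7 rows is wanted (as in the edited question), a brute-force sketch with itertools.combinations is feasible for a dataset this small (C(30, 7) is roughly 2 million candidates), though it will not scale to large datasets:

from itertools import combinations
import numpy as np

A = df.to_numpy()                        # 30 x 6 data matrix
best_det, best_rows = -np.inf, None
for rows in combinations(range(len(A)), 7):
    mat = A[list(rows)]
    d = np.linalg.det(mat.T @ mat)
    if d > best_det:
        best_det, best_rows = d, rows

best_index = df.index[list(best_rows)]   # Serial_no values of the best set
print(best_index.tolist(), best_det)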
You can do the following:
import pandas as pd
import numpy as np

# import the dataset with pandas
dataset = pd.read_csv('Dataset.csv')
dataset = dataset.set_index('Serial_no')
X = dataset.iloc[:, :]
len_of_col = len(dataset.columns)
N = int(input("Enter total no. of strain gauges >= No. of Loads : "))
# Note: N has to be equal to the number of cols, not greater
datamatrix = X[:N]
det = np.linalg.det(datamatrix)
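If N is allowed to be larger than the number of columns, one option (a sketch I'm adding, not part of the answer above) is to take the determinant of the Gram matrix Matrix.T * Matrix, which is 6x6 regardless of N, as asked in the question:

M = datamatrix.to_numpy()        # N x 6
gram = M.T @ M                   # 6 x 6 whenever N >= 6
det = np.linalg.det(gram)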
I am new to Python. I have a numpy.array whose size is 66049x1 (66049 rows and 1 column). The values are sorted from smallest to largest, are of float type, and some of them are repeated.
I need to determine the frequency of occurrence of each value (the number of times a given value is equalled but not surpassed, i.e. X <= x in statistical terms), in order to later plot the sample cumulative distribution function.
The code I am currently using is below, but it is extremely slow, as it has to loop 66049x66049 = 4362470401 times. Is there any way to speed this code up? Would using dictionaries perhaps help? Unfortunately I cannot change the size of the arrays I am working with.
+++Function header+++
...
...
directoryPath = input('Directory path for native csv file: ')
csvfile = numpy.genfromtxt(directoryPath, delimiter=",")
x = csvfile[:, 2]
x1 = numpy.delete(x, 0, 0)
x2 = sorted(x1)
x3 = numpy.around(x2, decimals=3)
count = numpy.zeros(len(x3))
# Iterates over the x3 array to find the number of occurrences of each value
for i in range(len(x3)):
    temp = x3[i]
    for j in range(len(x3)):
        if temp <= x3[j]:
            count[j] = count[j] + 1
# Creates a 2D array with (value, occurrences)
x4 = numpy.zeros((len(x3), 2))
for i in range(len(x3)):
    x4[i, 0] = x3[i]
    x4[i, 1] = numpy.around(count[i] / x1.shape[0], decimals=3)
...
...
+++Function continues+++
import numpy as np
import pandas as pd
from collections import Counter
import matplotlib.pyplot as plt
arr = np.random.randint(0, 100, (100000,1))
df = pd.DataFrame(arr)
cnt = Counter(df[0])
df_p = pd.DataFrame(cnt, index=['data'])
df_p.T.plot(kind='hist')
plt.show()
That whole script took a very short period to execute (~2s) for a (100,000 x 1) array. I didn't time it rigorously, but if you provide the time yours took, we can compare.
I used Counter from collections to count the number of occurrences; my experiences with it have always been great (time-wise). I converted it into a DataFrame to plot, and used T to transpose.
Your data does repeat a bit, but you can try to refine it some more. As it is, it's pretty fast.
Edit
Create CDF using cumsum()
import numpy as np
import pandas as pd
from collections import Counter
import matplotlib.pyplot as plt
arr = np.random.randint(0, 100, (100000,1))
df = pd.DataFrame(arr)
cnt = Counter(df[0])
df_p = pd.DataFrame(cnt, index=['data']).T.sort_index()  # sort by value so the cumulative sum is a proper CDF
df_p['cumu'] = df_p['data'].cumsum()
df_p['cumu'].plot(kind='line')
plt.show()
Edit 2
For a scatter() plot you must specify the (x, y) explicitly. Also, calling df_p['cumu'] will result in a Series, not a DataFrame.
To properly display a scatter plot you'll need the following:
import numpy as np
import pandas as pd
from collections import Counter
import matplotlib.pyplot as plt
arr = np.random.randint(0, 100, (100000,1))
df = pd.DataFrame(arr)
cnt = Counter(df[0])
df_p = pd.DataFrame(cnt, index=['data']).T.sort_index()  # sort by value before the cumsum
df_p['cumu'] = df_p['data'].cumsum()
df_p.plot(kind='scatter', x='data', y='cumu')
plt.show()
You should use np.where and then count the length of the obtained vector of indices:
indices = np.where(x3 <= value)
count = len(indices[0])
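Since the array is already sorted, a fully vectorized variant (a sketch of my own, reusing the sorted x3 array from the question) gets every cumulative count at once with np.searchsorted:

import numpy as np

# for a sorted array, the right-side insertion point of each value
# equals the number of elements <= that value
counts = np.searchsorted(x3, x3, side='right')
ecdf = counts / len(x3)   # empirical CDF value for each element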
If efficiency counts, you can use the numpy function bincount, which needs integers:
import numpy as np
a=np.random.rand(66049).reshape((66049,1)).round(3)
z=np.bincount(np.int32(1000*a[:,0]))
it takes about 1ms.
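A cumulative sum of those counts then gives the CDF directly (my addition, reusing z from above):

cdf = np.cumsum(z) / z.sum()   # cdf[k] = fraction of values <= k/1000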
# for counting a single value
mask = (my_np_array == value_to_count).astype('uint8')
# or a condition
mask = (my_np_array <= max_value).astype('uint8')
count = np.sum(mask)
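Equivalently (a small addition of mine), the intermediate mask can be skipped entirely:

count = np.count_nonzero(my_np_array <= max_value)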
Is there a more efficient way to take an average of an array in pre-specified bins? For example, I have an array of numbers and an array corresponding to bin start and end positions in that array, and I want to just take the mean in those bins. I have code that does it below, but I am wondering how it can be cut down and improved. Thanks.
import numpy as np

def get_bin_mean(a, b_start, b_end):
    # mean of the values in a that fall within [b_start, b_end)
    ind_upper = np.nonzero(a >= b_start)[0]
    a_upper = a[ind_upper]
    a_range = a_upper[np.nonzero(a_upper < b_end)[0]]
    return np.mean(a_range)

data = np.random.rand(100)
bins = np.linspace(0, 1, 10)
binned_data = []

for n in range(len(bins) - 1):
    b_start = bins[n]
    b_end = bins[n + 1]
    binned_data.append(get_bin_mean(data, b_start, b_end))

print(binned_data)
It's probably faster and easier to use numpy.digitize():
import numpy
data = numpy.random.random(100)
bins = numpy.linspace(0, 1, 10)
digitized = numpy.digitize(data, bins)
bin_means = [data[digitized == i].mean() for i in range(1, len(bins))]
An alternative to this is to use numpy.histogram():
bin_means = (numpy.histogram(data, bins, weights=data)[0] /
numpy.histogram(data, bins)[0])
Try for yourself which one is faster... :)
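As another point of comparison (a sketch of my own, reusing data, bins and digitized from above), the two histogram calls can be collapsed into bincount calls on the digitized indices:

sums = numpy.bincount(digitized, weights=data, minlength=len(bins))
counts = numpy.bincount(digitized, minlength=len(bins))
bin_means = sums[1:] / counts[1:]   # same bins as the list comprehension above (nan for empty bins)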
The Scipy (>=0.11) function scipy.stats.binned_statistic specifically addresses the above question.
For the same example as in the previous answers, the Scipy solution would be
import numpy as np
from scipy.stats import binned_statistic
data = np.random.rand(100)
bin_means = binned_statistic(data, data, bins=10, range=(0, 1))[0]
Not sure why this thread got necroed, but here is a 2014-approved answer, which should be far faster:
import numpy as np
data = np.random.rand(100)
bins = 10
slices = np.linspace(0, 100, bins + 1, True).astype(int)
counts = np.diff(slices)
mean = np.add.reduceat(data, slices[:-1]) / counts
print(mean)
The numpy_indexed package (disclaimer: I am its author) contains functionality to efficiently perform operations of this type:
import numpy_indexed as npi
print(npi.group_by(np.digitize(data, bins)).mean(data))
This is essentially the same solution as the one I posted earlier; but now wrapped in a nice interface, with tests and all :)
I would add, also to answer the question "find mean bin values using histogram2d python", that SciPy also has a function specifically designed to compute a two-dimensional binned statistic for one or more sets of data
import numpy as np
from scipy.stats import binned_statistic_2d
x = np.random.rand(100)
y = np.random.rand(100)
values = np.random.rand(100)
bin_means = binned_statistic_2d(x, y, values, bins=10).statistic
The function scipy.stats.binned_statistic_dd is a generalization of this function for higher-dimensional datasets.
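A minimal usage sketch (mine, analogous to the 2D example above) for the D-dimensional case:

import numpy as np
from scipy.stats import binned_statistic_dd

sample = np.random.rand(100, 3)      # N points in 3 dimensions
values = np.random.rand(100)
result = binned_statistic_dd(sample, values, statistic='mean', bins=5)
bin_means = result.statistic         # shape (5, 5, 5)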
Another alternative is to use ufunc.at. This method applies the desired operation in place at the specified indices.
We can get the bin position of each data point using the searchsorted method.
Then we can use at to increment the histogram by 1 at each index given by bin_indexes, once for every occurrence of that index.
np.random.seed(1)
data = np.random.random(100) * 100
bins = np.linspace(0, 100, 10)
histogram = np.zeros_like(bins)
bin_indexes = np.searchsorted(bins, data)
np.add.at(histogram, bin_indexes, 1)
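The same trick extends to the bin means asked about in the question (my addition, reusing the arrays above): accumulate per-bin sums with a second np.add.at and divide by the counts.

sums = np.zeros_like(bins)
np.add.at(sums, bin_indexes, data)   # accumulate per-bin sums of the data values
bin_means = sums / histogram         # per-bin mean (nan where a bin is empty)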