How can I calculate scikit-learn rbf_kernel() with a very large array?

While using the rbf_kernel() function, the array is too large and I run into a memory error, so I have to split the data and compute the kernel in parts.
from sklearn.metrics.pairwise import rbf_kernel
result = rbf_kernel([[1,1],[2,2],[3,3]], gamma=60) # A data:[1,1] , B data:[2,2], C data:[3,3]
And the result looks like this (rows and columns are in the order A, B, C; with gamma=60 the off-diagonal entries are vanishingly small, about exp(-120)):
    A     B     C
A   1.0   ≈0    ≈0
B   ≈0    1.0   ≈0
C   ≈0    ≈0    1.0
However, if I pass in larger data, I get a memory error.
result = rbf_kernel([[1,1],[2,2],[3,3],[4,4],[5,5],.... ], gamma=60)
How can I get the result without passing in all the data at once?

Try using:
l = [[1,1],[2,2],[3,3],[4,4],[5,5], ...]
newl = []
for i in range(0, len(l), 10):
    newl.append(rbf_kernel(l[i:i + 10], gamma=60))
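Note that each of those calls only computes the kernel within one chunk, i.e. the diagonal blocks of the full matrix. If you need the complete Gram matrix, you can compute it in row blocks against the whole dataset and stack them, since rbf_kernel(X, Y) accepts two different inputs. A minimal sketch, assuming the full n x n result itself still fits in memory:
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

X = [[1, 1], [2, 2], [3, 3], [4, 4], [5, 5]]  # your full dataset
chunk = 2  # rows per block; tune this to your memory budget

# Each call computes `chunk` complete rows of the kernel matrix.
blocks = [rbf_kernel(X[i:i + chunk], X, gamma=60)
          for i in range(0, len(X), chunk)]
K = np.vstack(blocks)  # shape (len(X), len(X))
If even the assembled result is too large to hold, process each row block inside the loop (write it to disk, reduce it, etc.) instead of collecting the blocks in a list.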

Related

How to form a matrix from one column in a file?

I have a file that contains this column of info
1.0000000000000002
0.6593496737729044
1.0000000000000002
I can read this data from the file, and I want to form a 2*2 matrix from it. I have tried a lot, but I keep getting the wrong output.
My code:
import numpy as np

with open("final_overlap.txt", "r") as final_over:
    for i in range(2):
        for j in range(2):
            i = final_over.readline()
            j = final_over.readline()
    S = np.array([i, j])
    print(S)
The output I want looks like this:
[[1.0000000000000002 0.6593496737729044]
[0.6593496737729044 1.0000000000000002]]
How can I form this matrix? Keep in mind that I have another input with more info, so I want a method that can form a matrix of any size, not only 2*2.
For example, another input looks like this:
1 1 1.0000000000000002
2 1 0.6593496737729044
2 2 1.0000000000000002
3 1 0.1192165290691592
3 2 0.0954901018165798
3 3 1.0000000000000002
4 1 0.0954901018165798
4 2 0.1192165290691592
4 3 0.6593496737729044
4 4 1.0000000000000002
and the matrix will be 4*4
One more question about the matrix: I got the right answer, but now I have input like this:
1 1 1 1 0.7746059439198979
2 1 1 1 0.4441350695399573
2 1 2 1 0.2970603935859659
2 2 1 1 0.5696940113278337
2 2 2 1 0.4441350695399575
2 2 2 2 0.7746059439198979
I tried with this code, but I got the error "list index out of range":
import numpy as np

data = []
for line in open('Two_Electron.txt'):
    r, c, d, e, v = line.split()
    r = int(r) - 1
    c = int(c) - 1
    d = int(d) - 1
    e = int(e) - 1
    v = float(v)
    if c == 0:
        data.append([v])
    else:
        data[-1].append(v)
print(data)
# Fill in the upper triangle.
for i in range(len(data) - 1):
    for j in range(i + 1, len(data)):
        data[i].append(data[j][i])
for k in range(len(data) - 1):
    for l in range(k + 1, len(data)):
        data[k].append(data[l][k])
V_ee = np.array(data)
The output I should get:
[[[[0.77460594 0.4441351 ]
[0.4441351 0.56969403]]
[[0.4441351 0.29706043]
[0.29706043 0.4441351 ]]]
[[[0.4441351 0.29706043]
[0.29706043 0.4441351 ]]
[[0.56969403 0.4441351 ]
[0.4441351 0.77460594]]]]
Load the data into a simple list, then build the rows from the list.
import numpy as np

with open("final_overlap.txt", "r") as final_over:
    data = [float(line) for line in final_over]
S = np.array([data[0:2], data[1:]])
print(S)
Output:
[[1. 0.65934967]
[0.65934967 1. ]]
Followup
OK, assuming your data has row and column numbers like your second example, this will read the data, fill in the upper triangle, and convert to np.array.
import numpy as np
# Read in the data to find out the size.
data = []
for line in open('x.txt'):
    r, c, v = line.split()
    r = int(r) - 1
    c = int(c) - 1
    v = float(v)
    if c == 0:
        data.append([v])
    else:
        data[-1].append(v)
# Fill in the upper triangle.
for i in range(len(data) - 1):
    for j in range(i + 1, len(data)):
        data[i].append(data[j][i])
array = np.array(data)
print(array)
Output:
[[1. 0.65934967 0.11921653 0.0954901 ]
[0.65934967 1. 0.0954901 0.11921653]
[0.11921653 0.0954901 1. 0.65934967]
[0.0954901 0.11921653 0.65934967 1. ]]
It would still be possible to do this, even if you don't have the row and column numbers, just by keeping an internal counter.
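For instance, a minimal sketch of that counter idea, assuming the file holds just the lower-triangle values in row order (1 value for the first row, 2 for the second, and so on):
import numpy as np

data = []
row = []
width = 1  # the k-th triangle row holds k values
for line in open('final_overlap.txt'):
    row.append(float(line))
    if len(row) == width:  # this triangle row is complete
        data.append(row)
        row = []
        width += 1
# Fill in the upper triangle by symmetry.
for i in range(len(data)):
    data[i].extend(data[j][i] for j in range(i + 1, len(data)))
array = np.array(data)
print(array)
As for the four-index follow-up: the code above only covers the two-index case, but the expected 4-D output shown in the question obeys the usual two-electron-integral symmetry (the value is unchanged if you swap the indices within either pair, or swap the two pairs). Under that assumption, each input line fills up to eight cells of a 4-D array; a sketch:
import numpy as np

entries = []
n = 0
for line in open('Two_Electron.txt'):
    r, c, d, e, v = line.split()
    idx = tuple(int(s) - 1 for s in (r, c, d, e))
    entries.append((idx, float(v)))
    n = max(n, idx[0] + 1)

V_ee = np.zeros((n, n, n, n))
for (i, j, k, l), v in entries:
    # Write the value into every index permutation the symmetry allows.
    for p, q in ((i, j), (j, i)):
        for s, t in ((k, l), (l, k)):
            V_ee[p, q, s, t] = v
            V_ee[s, t, p, q] = v
print(V_ee)
With the sample input above, this reproduces the 2x2x2x2 array shown as the expected output.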

How to iterate the code over the whole dataset in the shortest way

I have a very big df:
df.shape = (106, 3364)
I want to calculate the so-called Fréchet distance by using this Frechet Distance between 2 curves, and it works well. Example:
x = df['1']
x1 = df['1.1']
p = np.array([x, x1])
y = df['2']
y1 = df['2.1']
q = np.array([y, y1])
P_final = list(zip(p[0], p[1]))
Q_final = list(zip(q[0], q[1]))
from frechetdist import frdist
frdist(P_final,Q_final)
But I cannot do it pair by pair, like:
`1 and 1.1` to `1 and 1.1` which is equal to 0
`1 and 1.1` to `2 and 2.1` which is equal to some number
...
`1 and 1.1` to `1682 and 1682.1` which is equal to some number
I want to create something (my first idea is a for loop, but maybe you have a better solution) to calculate this frdist(P_final, Q_final) between:
the first pair and all pairs (including itself)
the second pair and all pairs (including itself)
...
Finally, I am supposed to get a matrix of size (106, 106) with 0 on the diagonal (because the distance between a curve and itself is 0):
matrix =
      0    1    2    3    4    5   ...  105
0     0
1          0
2               0
3                    0
4                         0
5                              0
...                                 0
105                                      0
Not including my trial code because it is confusing everyone!
EDITED:
Sample data:
1 1.1 2 2.1 3 3.1 4 4.1 5 5.1
0 43.1024 6.7498 45.1027 5.7500 45.1072 3.7568 45.1076 8.7563 42.1076 8.7563
1 46.0595 1.6829 45.0595 9.6829 45.0564 4.6820 45.0533 8.6796 42.0501 3.6775
2 25.0695 5.5454 44.9727 8.6660 41.9726 2.6666 84.9566 3.8484 44.9566 1.8484
3 35.0281 7.7525 45.0322 3.7465 14.0369 3.7463 62.0386 7.7549 65.0422 7.7599
4 35.0292 7.5616 45.0292 4.5616 23.0292 3.5616 45.0292 7.5616 25.0293 7.5613
I just used my own sample data in your format (I hope):
import pandas as pd
from frechetdist import frdist
import numpy as np
# create sample data
df = pd.DataFrame([[1,2,3,4,5,6], [3,4,5,6,8,9], [2,3,4,5,2,2], [3,4,5,6,7,3]], columns=['1','1.1','2', '2.1', '3', '3.1'])
# this matrix will hold the result
res = np.ndarray(shape=(df.shape[1] // 2, df.shape[1] // 2), dtype=np.float32)
for row in range(res.shape[0]):
    for col in range(row, res.shape[1]):
        # extract the two curves as lists of (x, y) points
        P = list(zip(df.loc[:, f'{row+1}'], df.loc[:, f'{row+1}.1']))
        Q = list(zip(df.loc[:, f'{col+1}'], df.loc[:, f'{col+1}.1']))
        # calculate distance
        dist = frdist(P, Q)
        # put the result in both cells (the matrix is symmetric)
        res[row, col] = dist
        res[col, row] = dist
# output
print(res)
Output:
[[0. 4. 7.5498343]
[4. 0. 5.5677643]
[7.5498343 5.5677643 0. ]]
Hope that helps
EDIT: Some general tips:
If speed matters: check whether frdist also handles a numpy array of shape (n_values, 2); then you could save the rather expensive zip-and-unpack operation and use the arrays directly, or build the data from the start in the format your library needs (see the sketch below).
Generally, use better column names (3 and 3.1 is not too obvious). Why not call them x3, y3, or x3 and f_x3?
I would actually put the data into two different matrices. If you look at the code, I had to do some not-so-obvious things, like iterating over the shape divided by two and building indices from string operations, because of the given table layout.
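For the first tip, a sketch of what that could look like, assuming frdist accepts any array-like of shape (n_points, 2) (worth verifying against your installed version):
import numpy as np
import pandas as pd
from frechetdist import frdist

df = pd.DataFrame([[1, 2, 3, 4, 5, 6], [3, 4, 5, 6, 8, 9],
                   [2, 3, 4, 5, 2, 2], [3, 4, 5, 6, 7, 3]],
                  columns=['1', '1.1', '2', '2.1', '3', '3.1'])

n_curves = df.shape[1] // 2
# Build each curve once as an (n_points, 2) array instead of zipping
# inside the double loop.
curves = [np.column_stack((df[f'{k + 1}'], df[f'{k + 1}.1']))
          for k in range(n_curves)]

res = np.zeros((n_curves, n_curves))
for i in range(n_curves):
    for j in range(i + 1, n_curves):  # the diagonal stays 0
        res[i, j] = res[j, i] = frdist(curves[i], curves[j])
print(res)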

How to make this for loop faster?

I know that Python loops themselves are relatively slow compared to other languages, but when the right functions are used they become much faster.
I have a pandas dataframe called "acoustics" which contains over 10 million rows:
print(acoustics)
timestamp c0 rowIndex
0 2016-01-01T00:00:12.000Z 13931.500000 8158791
1 2016-01-01T00:00:30.000Z 14084.099609 8158792
2 2016-01-01T00:00:48.000Z 13603.400391 8158793
3 2016-01-01T00:01:06.000Z 13977.299805 8158794
4 2016-01-01T00:01:24.000Z 13611.000000 8158795
5 2016-01-01T00:02:18.000Z 13695.000000 8158796
6 2016-01-01T00:02:36.000Z 13809.400391 8158797
7 2016-01-01T00:02:54.000Z 13756.000000 8158798
and here is the code I wrote:
import numpy as np
import pandas as pd

acoustics = pd.read_csv("AccousticSandDetector.csv", skiprows=[1])
weights = [1/9, 1/18, 1/27, 1/36, 1/54]
sumWeights = np.sum(weights)
deltaAc = []
for i in range(5, len(acoustics)):
    time = acoustics.iloc[i]['timestamp']
    sum = 0
    for c in range(5):
        sum += (weights[c]/sumWeights)*(acoustics.iloc[i]['c0']-acoustics.iloc[i-c]['c0'])
    print("Row " + str(i) + " of " + str(len(acoustics)) + " is iterated")
    deltaAc.append([time, sum])
deltaAc = pd.DataFrame(deltaAc)
It takes a huge amount of time, how can I make it faster?
You can use diff from pandas to create all the differences for each row in an array, then multiply by your weights and finally sum over axis 1:
deltaAc = pd.DataFrame({'timestamp': acoustics.loc[5:, 'timestamp'],
                        'summation': (np.array([acoustics.c0.diff(i) for i in range(5)]).T[5:]
                                      * np.array(weights)).sum(1) / sumWeights})
and you get the same values as your code produces:
print (deltaAc)
timestamp summation
5 2016-01-01T00:02:18.000Z -41.799986
6 2016-01-01T00:02:36.000Z 51.418728
7 2016-01-01T00:02:54.000Z -3.111184
First optimization: weights[c]/sumWeights could be computed outside the loop.
weights_array = np.array([1/9, 1/18, 1/27, 1/36, 1/54])
sumWeights = np.sum(weights_array)
tmp = weights_array / sumWeights
...
sum += tmp[c]*...
I'm not familiar with pandas, but if you can extract your columns as 1D numpy arrays, it will be great for you. It might look something like:
# next lines to be tested; adjust to the correct way of extracting the columns
c0_column = acoustics['c0'].values
time_column = acoustics['timestamp'].values
...
sum = np.zeros(shape=(len(acoustics) - 5,))
deltaAc = []
for c in range(5):
    sum += tmp[c] * (c0_column[5:] - c0_column[5 - c:len(acoustics) - c])
for i in range(len(acoustics) - 5):
    deltaAc.append([time_column[5 + i], sum[i]])
Dataframes have a great method, rolling, for constructing and applying windowing transformations, so you don't need loops at all:
import numpy as np
import pandas as pd

# df is your data frame
window_size = 5
weights = np.array([1/9, 1/18, 1/27, 1/36, 1/54])
weights /= weights.sum()
# The window arrives oldest-to-newest, so the weights are reversed to
# match the (current - lagged) terms of the original loop.
df.loc[:, 'deltaAc'] = df.loc[:, 'c0'].rolling(window_size).apply(
    lambda x: ((x[-1] - x) * weights[::-1]).sum(), raw=True)
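A quick sanity check of this rolling version against the original double loop, on a small random series (hypothetical data, just to confirm the reversed weights reproduce the loop):
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
sample = pd.DataFrame({'c0': rng.random(20)})

w = np.array([1/9, 1/18, 1/27, 1/36, 1/54])
w /= w.sum()

rolled = sample['c0'].rolling(5).apply(
    lambda x: ((x[-1] - x) * w[::-1]).sum(), raw=True)

looped = [sum(w[c] * (sample['c0'].iloc[i] - sample['c0'].iloc[i - c])
              for c in range(5))
          for i in range(5, len(sample))]

print(np.allclose(rolled.iloc[5:].to_numpy(), looped))  # True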

Update values in numpy array without tedious for loops

I have a few arrays with data like this:
a = np.random.rand(3,3)
b = np.random.rand(3,3)
Using for loops, I construct a larger matrix:
L = np.zeros((9,9))
for i in range(9):
    for j in range(9):
        L[i,j] = f(a,b,i,j)  # values of L depend on the values of a and b
Later in my program I will change a and b, and I want my L array to change too. So the logic of my program looks like this (in pseudocode):
Create a
Create b
while True:
    Create L using a and b
    Do the stuff
    Change a
    Change b
In my program the size of L is large (10^6 x 10^6 and larger).
Constructing this L matrix again and again is a tedious and slow process.
Instead of running the for loops again and again, I would like to just update the values of the L matrix according to the changed values of a and b. The structure of L is the same each time; only the cell values differ. Something like this:
a[0,0] = 2
b[0,0] = 2
L[3,5] = 2*a[0,0]*b[0,0]
L[3,5]
# >>> 8
a[0,0] = 3
b[0,0] = 1
# do some magic here
L[3,5]
# >>> 6
Can something like this solve your problem?
>>> a = 10
>>> b = 20
>>> def func():
...     # fetch the values of a and b
...     return a + b
...
>>> lis = [func]
>>> lis[0]
<function func at 0x109f70730>
>>> lis[0]()
30
>>> a = 20
>>> lis[0]()
40
Basically, every time you fetch the value, you call a function that computes the latest value from the current a and b.
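If f has a closed form in terms of a and b, another option is to vectorize the construction itself, so that rebuilding L after a and b change is a single array expression rather than a double Python loop. A sketch with a purely hypothetical f(a, b, i, j) = 2 * a.flat[i] * b.flat[j] (substitute your real formula):
import numpy as np

a = np.random.rand(3, 3)
b = np.random.rand(3, 3)

def build_L(a, b):
    # Hypothetical rule: L[i, j] = 2 * a.flat[i] * b.flat[j].
    return 2.0 * np.outer(a.ravel(), b.ravel())

L = build_L(a, b)  # shape (9, 9)
a[0, 0] = 3
b[0, 0] = 1
L = build_L(a, b)  # rebuilding is now one vectorized expression
Also note that at 10^6 x 10^6 a dense float64 L is about 8 terabytes, so at that size you will want scipy.sparse (if most entries are zero) or to compute entries on demand rather than storing L at all.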

Cumulative integration of elements of numpy arrays

I would like to do the following type of integration:
Say I have 2 arrays:
a = np.array([1,2,3,4])
b = np.array([2,4,6,8])
I know how to integrate these using something like:
c = scipy.integrate.simps(b, a)
where c = 15 for the above data set.
What I would like to do is multiply the first elements of each array and add the product to a new array called d, i.e. a[0]*b[0]; then integrate the first 2 elements of the arrays, then the first 3 elements, and so on. So eventually, for this data set, I would get
d = [2 3 8 15]
I have tried a few things but had no luck; I am pretty new to writing code.
If I have understood correctly what you need, you could do the following:
import numpy as np
from scipy import integrate
a = np.array([2,4,6,8])  # used as the x-values here
b = np.array([1,2,3,4])  # used as the y-values here
d = np.empty_like(b)
d[0] = a[0] * b[0]
for i in range(2, len(a) + 1):
    d[i-1] = integrate.simps(b[0:i], a[0:i])
print(d)
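If you want to avoid the Python loop entirely, scipy also has a cumulative trapezoidal rule. Be aware it is the trapezoid rule, not Simpson's, so on curved data it will differ slightly from the simps loop above (on this linear sample the two happen to agree exactly); a sketch:
import numpy as np
from scipy.integrate import cumulative_trapezoid  # called cumtrapz in older scipy

a = np.array([2, 4, 6, 8])  # x-values, as in the answer above
b = np.array([1, 2, 3, 4])  # y-values

d = np.concatenate(([a[0] * b[0]], cumulative_trapezoid(b, a)))
print(d)  # [ 2.  3.  8. 15.]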
