How do I "telescope" the columns of a numpy array?

I have a numpy array and want to "telescope" the values based on the top row. An example is the best way to describe it.
Start array:
9 9 8 7 7 7 6
1 2 3 4 5 6 3
3 4 5 6 7 6 3
5 6 7 8 9 6 4
Desired output array:
 9  8  7  6
 3  3 15  3
 7  5 19  3
11  7 23  4
The idea is to unique-ify the top row and, for each subsequent row, sum the values grouped by the value in the top row. The top row will be sorted, and the array will be about 2,000 cells wide and 200,000 cells long. There could be any number of consecutive identical numbers in the top row. My current hack is below (it uses slightly different top-row labels than the example, and I am printing to screen rather than building the final array, to check the output; the plan is to stack the printed outputs to generate the output array):
import numpy as N

kk = N.array([[90, 90, 85, 80, 80, 80, 70],
              [ 1,  2,  3,  4,  5,  6,  3],
              [ 3,  4,  5,  6,  7,  6,  3],
              [ 5,  6,  7,  8,  9,  6,  4]])
ll = kk[:, 0]
for i in range(1, len(kk[0])):
    if kk[0][i] == kk[0][i-1]:
        ll = ll + kk[:, i]
    elif kk[0][i] != kk[0][i-1]:
        print "sum=", ll, i, kk[0][i], kk[0][i-1]
        ll = kk[:, i]
There are two defects. The major one is that it isn't dealing with the final column, and I don't see why. The minor one is that it is summing the top row too; it's obvious why that happens, and I suspect I can kludge my way around it. But the failure to deal with the final column has been frustrating me for a while, and I'd really appreciate any suggestions for dealing with it.
Thanks for any help.

If you have 200,000 rows, a Python loop is likely going to be very slow. With NumPy you can vectorize that operation using np.add.reduceat, but you first need to create an array with the indices of the first item of each group of repeated entries in the first row:
mask = np.concatenate(([True], kk[0, 1:] != kk[0, :-1]))
indices, = np.nonzero(mask)
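For the sample kk above, each True in mask marks the first column of a group, so indices comes out as:
>>> indices
array([0, 2, 3, 6])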
You can then get your first row by indexing with the boolean mask array:
>>> kk[0, mask]
array([90, 85, 80, 70])
and the rest of the array using reduceat with indices:
>>> np.add.reduceat(kk[1:], indices, axis=1)
array([[ 3,  3, 15,  3],
       [ 7,  5, 19,  3],
       [11,  7, 23,  4]])
Assuming that your original array is of the default integer type, you could assemble your array by doing something like:
out = np.empty((kk.shape[0], len(indices)), dtype=kk.dtype)
out[0] = kk[0, mask]
np.add.reduceat(kk[1:], indices, axis=1, out=out[1:])
>>> out
array([[90, 85, 80, 70],
       [ 3,  3, 15,  3],
       [ 7,  5, 19,  3],
       [11,  7, 23,  4]])

You should use the unique function from numpy:
import numpy as np

a = np.array([[90, 90, 85, 80, 80, 80, 70],
              [ 1,  2,  3,  4,  5,  6,  3],
              [ 3,  4,  5,  6,  7,  6,  3],
              [ 5,  6,  7,  8,  9,  6,  4]])
u, v = np.unique(a[0], return_inverse=True)
output = np.zeros((a.shape[0], u.shape[0]))
output[0] = u.copy()
for i in xrange(u.shape[0]):
    pos = np.where(v == i)[0]
    output[1:, i] = np.sum(a[1:, pos], axis=1)
Note that u is going to be sorted from lowest to highest. If you want it from highest to lowest, you have to do
output = output[:,::-1]
at the end.
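As a quick check with the sample array above, reversing the columns reproduces the desired output with the top row descending. The values print as floats because np.zeros defaults to float64; pass dtype=a.dtype to np.zeros if you want integers:
>>> output[:, ::-1]
array([[ 90.,  85.,  80.,  70.],
       [  3.,   3.,  15.,   3.],
       [  7.,   5.,  19.,   3.],
       [ 11.,   7.,  23.,   4.]])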

You can make use of groupby:
from itertools import groupby
import numpy as N

kk = N.array([[90, 90, 85, 80, 80, 80, 70],
              [ 1,  2,  3,  4,  5,  6,  3],
              [ 3,  4,  5,  6,  7,  6,  3],
              [ 5,  6,  7,  8,  9,  6,  4]])
keys = kk[0]
vals = kk[1:]
uniq = map(lambda x: x[0], groupby(keys))
new = [uniq]
for row in vals:
    new.append([sum(map(lambda x: x[1], group))
                for _, group in groupby(zip(keys, row), lambda x: x[0])])
print N.array(new)
Provides the output:
[[90 85 80 70]
 [ 3  3 15  3]
 [ 7  5 19  3]
 [11  7 23  4]]
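A caveat if you are on Python 3 (the snippet above is Python 2): map returns a lazy iterator there, so materialize uniq as a list first. A minimal adaptation might look like:
uniq = [k for k, _ in groupby(keys)]
new = [uniq]
for row in vals:
    new.append([sum(v for _, v in group)
                for _, group in groupby(zip(keys, row), lambda x: x[0])])
print(N.array(new))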

Related

Fill a 2D array with values from a list / 1D array

How do I fill a 2D array with values from an updated 1D list?
For example, I have a list that I build with this code:
a = []
for k, v in data.items():
    b = v / sumcount
    a.append(b)
What I want to do is produce several such 'a' lists and put their values into a 2D array, one list per column, or put the b values directly into the 2D array, with one column per pass of an outer loop.
My difficulty here is that k is not an integer: the dict keys are strings, and there are 9 of them.
I have tried this, but it does not work:
row = len(data.items())
matrix = np.zeros((9, 2))
for i in range(1, 3):
    a = []
    for k, v in data.items():
        b = v / sumcount
        matrix[x][i].fill(b), for x in range(1, 10)
The a list is [1, 2, 3, 4, 5, 6, 7, 8, 9]. If, for example, the outer loop runs twice, I expect 2 columns and 9 rows:
1  6
2  7
3  8
4  9
5 14
6 15
7 16
8 17
9 18
I want to fill the matrix values with b.
import numpy as np
import pandas as pd

matrix = np.zeros((9, 2))
df = pd.DataFrame({'aaa': [1, 2, 3, 4, 5, 6, 7, 8, 9]})
sumcount = [1, 2]
for i in range(len(sumcount)):
    matrix[:, i] = df['aaa'] / sumcount[i]
print(matrix)
As far as I understand, you need to take the result column from the dataframe and place it in a numpy array. There is no need to iterate over each row when sumcount is a single number per column; that runs slowly. In general, loops are a last resort, used only when there is no other possibility. In numpy, values are set with slicing.
bbb = np.array([df['aaa']/sumcount[i] for i in range(len(sumcount))]).transpose()
print(bbb)
Or do without a loop at all: build the rows with a list comprehension, wrap the result in np.array, and apply transpose.
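As a quick sanity check (assuming the matrix and bbb computed above), the loop and the comprehension agree:
>>> np.allclose(matrix, bbb)
True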

Find the index of the lowest value in a numpy array per row, plus the value

This is quite easy:
import numpy as np
np.random.seed(2341)
data = (np.random.rand(3,4) * 100).astype(int)
so I have
[[35 20 47 39]
 [ 6 17 77 85]
 [ 8 25  2  3]]
Great, now let's get the indices of the smallest values per row:
kmin = np.argmin(data, axis=1)
this outputs
[1 0 2]
So in the first row, the second element is the smallest. In the second row it's the first element, and in the third row it's the third. But how do I access those values and get them as one column?
I tried this syntax:
min_vals = data[:, kmin]
but the result is a 3x3 array. I need an output like this:
[[20]
 [ 6]
 [ 2]]
I know I can get the values a different way too, but later on I have to implement Matlab code like this
data(1:n1,kmin,1);
where I need to select the lowest values again.
You can use the np.choose function for it:
min_vals = np.choose(kmin, data.T)
I got this:
[20 6 2]
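If you specifically want a column vector, an equivalent route (a sketch of mine, not part of the original answer) is fancy indexing with a row-index array, then adding a trailing axis:
min_vals = data[np.arange(data.shape[0]), kmin][:, np.newaxis]
which yields the 3x1 column:
[[20]
 [ 6]
 [ 2]]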

Using numpy.where to calculate new pandas column, with multiple conditions

I have a problem with how to code this condition appropriately. I'm creating a new pandas column in my dataframe, new_column, which performs a subtraction on the values in column test based on the index of the row. I'm currently using this code to subtract a different value every 4 rows:
subtraction_value = 3
subtraction_value_2 = 6
data = pd.DataFrame({"test": [12, 4, 5, 4, 1, 3, 2, 5, 10, 9]})
data['new_column'] = np.where(data.index % 4,
                              data['test'] - subtraction_value,
                              data['test'] - subtraction_value_2)
print(data['new_column'])
[6, 1, 2, 1, -5, 0, -1, 3, 4, 6]
However, I now wish to perform the higher subtraction on the first two positions in the column, then 3 subtractions with the original value, then another two with the higher subtraction value, then 3 small subtractions, and so forth. I thought I could do it with an | condition in my np.where statement:
data['new_column'] = np.where((data.index % 4) | (data.index % 5),
                              data['test'] - subtraction_value,
                              data['test'] - subtraction_value_2)
However, this didn't work, and I feel my maths may be slightly off. My desired output would look like this:
print(data['new_column'])
[6, -2, 2, 1, -2, -3, -4, 3, 7, 6]
As you can see, this slightly shifts the pattern. Can I still use numpy.where() here, or do I have to take a new approach? Any help would be greatly appreciated!
As mentioned in the comment section, according to your logic the output should equal [6, -2, 2, 1, -2, -3, -4, 2, 7, 6] rather than [6, -2, 2, 1, -2, -3, -4, 3, 7, 6]. Given that, you can do the following:
import pandas as pd
import numpy as np
from itertools import chain

subtraction_value = 3
subtraction_value_2 = 6
data = pd.DataFrame({"test": [12, 4, 5, 4, 1, 3, 2, 5, 10, 9]})
index_pos_large_subtraction = list(chain.from_iterable(
    (data.index[i], data.index[i+1]) for i in range(0, len(data)-1, 5)))
data['new_column'] = np.where(~data.index.isin(index_pos_large_subtraction),
                              data['test'] - subtraction_value,
                              data['test'] - subtraction_value_2)
# The next line is equivalent to the previous one
# data['new_column'] = np.where(data.index.isin(index_pos_large_subtraction),
#                               data['test'] - subtraction_value_2,
#                               data['test'] - subtraction_value)
---------------------------------------------
   test  new_column
0    12           6
1     4          -2
2     5           2
3     4           1
4     1          -2
5     3          -3
6     2          -4
7     5           2
8    10           7
9     9           6
---------------------------------------------
As you can see, np.where works fine. Your masking condition is the problem and needs to be adjusted: you are not selecting rows according to your logic.
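Since the pattern repeats with period 5 (two large subtractions followed by three small ones), a simpler equivalent mask (my suggestion, not part of the original answer) is modular arithmetic on the index:
data['new_column'] = np.where(data.index % 5 >= 2,
                              data['test'] - subtraction_value,
                              data['test'] - subtraction_value_2)
This produces the same new_column as the isin approach above.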

Efficient calculation across a dictionary of thousands of correlation matrices

Based on a large dataset of daily observations from 20 assets, I created a dictionary of (rolling) correlation matrices, using the date index as the dictionary key.
What I want to do now, in an efficient manner, is to compare all correlation matrices within the dictionary and save the results in a new matrix. The idea is to compare correlation structures over time.
import pandas as pd
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.cluster.hierarchy import cophenet

key_list = dict_corr.keys()

# Create empty matrix
X = np.empty(shape=[len(key_list), len(key_list)])

key1_index = 0
for key1 in key_list:
    # Extract correlation matrix from dictionary
    corr1_temp = dict_corr[key1]
    # Transform correlation matrix into distance matrix
    dist1_temp = ((1 - corr1_temp) / 2.)**.5
    # Extract hierarchical structure from distance matrix
    link1_temp = linkage(dist1_temp, 'single')
    key2_index = 0
    for key2 in key_list:
        corr2_temp = dict_corr[key2]
        dist2_temp = ((1 - corr2_temp) / 2.)**.5
        link2_temp = linkage(dist2_temp, 'single')
        # Compare the hierarchical structures of the two correlation
        # matrices -> results in a 2x2 correlation matrix
        temp = np.corrcoef(cophenet(link1_temp), cophenet(link2_temp))
        # Extract the off-diagonal correlation from the 2x2 result
        X[key1_index, key2_index] = temp[1, 0]
        key2_index += 1
    key1_index += 1
I'm well aware that using two for loops is probably the least efficient way to do it, so I'm grateful for any helpful comments on how to speed up the calculations!
Best
You can look at itertools, and then move your correlation computation into a function (compute_corr) called in a single for loop:
import itertools

for key_1, key_2 in itertools.combinations(dict_corr, 2):
    correlation = compute_corr(key_1, key_2, dict_corr)
    # now store correlation in a list
If you care about the order, use itertools.permutations(dict_corr, 2) instead of combinations.
EDIT
Since you want all possible combinations of keys (including a key paired with itself), you should use itertools.product:
l_corr = []  # list to store all the outputs from the function
for key_1, key_2 in itertools.product(key_list, repeat=2):
    l_corr.append(compute_corr(key_1, key_2, dict_corr))
Now l_corr will have length len(key_list)*len(key_list).
You can convert this list to a matrix in this way:
np.array(l_corr).reshape(len(key_list),len(key_list))
Dummy example:
def compute_corr(key_1, key_2, dict_corr):
    return key_1 * key_2  # dummy result from the function

dict_corr = {1: "a", 2: "b", 3: "c", 4: "d", 5: "f"}
key_list = dict_corr.keys()
l_corr = []
for key_1, key_2 in itertools.product(key_list, repeat=2):
    print(key_1, key_2)
    l_corr.append(compute_corr(key_1, key_2, dict_corr))
Combinations:
1 1
1 2
1 3
1 4
1 5
2 1
2 2
2 3
2 4
2 5
3 1
3 2
3 3
3 4
3 5
4 1
4 2
4 3
4 4
4 5
5 1
5 2
5 3
5 4
5 5
Create the final matrix:
np.array(l_corr).reshape(len(key_list), len(key_list))
array([[ 1,  2,  3,  4,  5],
       [ 2,  4,  6,  8, 10],
       [ 3,  6,  9, 12, 15],
       [ 4,  8, 12, 16, 20],
       [ 5, 10, 15, 20, 25]])
Let me know in case I missed something. Hope this helps!
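For this particular computation you can go further than reorganizing the loop: the cophenetic distances for each date never change between comparisons, so compute them once per key and let np.corrcoef build the whole matrix in a single call. A sketch, assuming dict_corr maps each date to a square correlation matrix as in the question:
import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet

keys = list(dict_corr.keys())

# One cophenetic-distance vector per date, computed exactly once
coph = [cophenet(linkage(((1 - dict_corr[k]) / 2.)**.5, 'single'))
        for k in keys]

# np.corrcoef treats each row as a variable, so this returns the full
# len(keys) x len(keys) matrix of pairwise correlations in one call
X = np.corrcoef(np.vstack(coph))
This replaces the O(n^2) linkage/cophenet calls with O(n), leaving the pairwise work to a single vectorized np.corrcoef call.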

Combine/replace rows at selected indices along a specific axis of an N-D array with their sum

I have an N-D numpy array, e.g.:
a = np.arange(21).reshape(7,3)
[[ 0  1  2]
 [ 3  4  5]
 [ 6  7  8]
 [ 9 10 11]
 [12 13 14]
 [15 16 17]
 [18 19 20]]
and an array of index arrays along a specific axis, e.g.:
rows = np.array([np.array([1, 2]), np.array([0, 4, 6])], dtype=object)  # dtype=object needed for ragged arrays on recent numpy
What I want to achieve: for each array in rows (e.g. [1,2]), sum the corresponding rows of a along a specific axis (e.g. axis=0), put the result in place of the first index in that selection (e.g. row 1), and delete the remaining rows of that selection (e.g. row 2). The final output for the above example looks like this:
[[30 33 36]   <- sum of rows (0, 4, 6)
 [ 9 11 13]   <- sum of rows (1, 2)
 [ 9 10 11]   <- row 3
 [15 16 17]]  <- row 5
I have this solution for it:
for i in rows:
    a[i[0]] = a[i].sum(axis=0)
a = np.delete(a, np.hstack([i[1:] for i in rows]), axis=0)
And I know I can do it with ravel and pandas, but I feel there should be a more elegant way of doing it. Also, the solution has to work for any N-D array and any selected axis. Thank you.
In case there is no better answer, here is the solution mentioned in the question for a dynamic axis, inspired by @hpaulj's answer to this post:
axs = 0
for i in rows:
    I = [slice(None)] * a.ndim
    I_0 = [slice(None)] * a.ndim
    I[axs] = i
    I_0[axs] = i[0]
    a[tuple(I_0)] = a[tuple(I)].sum(axis=axs)
a = np.delete(a, np.hstack([i[1:] for i in rows]), axis=axs)
and, per @hpaulj's comment, indexing the rows to keep instead of deleting. While the code above selects the axis dynamically, the version below does not, but it is easy to convert to the dynamic style (see the sketch after it):
axs = 0
for i in rows:
    a[i[0]] = a[i].sum(axis=axs)
a = a[np.setdiff1d(np.arange(a.shape[axs]), np.hstack([i[1:] for i in rows])), :]
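For completeness, the keep-instead-of-delete step can also be made axis-agnostic (a sketch of mine, not from the original post): np.take selects indices along an arbitrary axis, which avoids hand-building the slice tuple:
keep = np.setdiff1d(np.arange(a.shape[axs]),
                    np.hstack([i[1:] for i in rows]))
a = np.take(a, keep, axis=axs)  # works for any axs, not just axis 0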
You can use map and __getitem__ to select only certain rows:
import numpy as np

a = np.arange(21).reshape(7, 3)
rows = np.array([0, 4, 6])

# select only the rows whose indices appear in "rows"
b = list(map(a.__getitem__, rows))
# np.sum works fine on a list; stored under a new name so the
# original array a survives for the next step
summed = np.sum(b, axis=0)
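which leaves summed as array([30, 33, 36]), the element-wise total of rows 0, 4 and 6.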
OK, now do this across multiple index lists:
rows = np.array([np.array([1, 2]), np.array([0, 4, 6])], dtype=object)
b = [np.sum(list(map(a.__getitem__, el)), axis=0) for el in rows]
print(np.array(b))
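With a still the original 7x3 array, this prints the two group sums:
[[ 9 11 13]
 [30 33 36]]
Note that this produces only the grouped sums; the ungrouped rows (3 and 5) would still need to be merged back in to match the desired output from the question.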
