Matrix Manipulation: Subtract 2D Matrix and 3D Matrix in numpy - python

If I have a 3-D matrix like:
cor =: 3 3 3 $ i.5
cor
0 1 2
3 4 0
1 2 3
4 0 1
2 3 4
0 1 2
3 4 0
1 2 3
4 0 1
and a 2-D matrix like:
d =: 3 3 $ i.5
d
0 1 2
3 4 0
1 2 3
It is really simple to calculate in J: just put "2 (apply at rank 2, i.e. to each 2-D cell) after the - sign.
d -"2 cor
0 0 0
0 0 0
0 0 0
_4 1 1
1 1 _4
1 1 1
_3 _3 2
2 2 _3
_3 2 2
But I am still a numpy novice....
cor - d
ValueError: Unable to coerce to Series/DataFrame, dim must be <= 2: (59, 59, 59)
Is there any way to do this kind of matrix manipulation in Python with numpy?
Thanks in advance.
This is the Python for-loop code that I wanted to convert to numpy:
import numpy as np

def pcor(df):
    cor = df.corr()
    n = df.shape[1]  # number of indices
    pcor = np.empty((n, n, n))
    d = np.empty((n, n, n))
    for x in range(n):
        for y in range(n):
            for m in range(n):
                if x == y:
                    pcor[x, y, m] = float('nan')
                else:
                    pcor[x, y, m] = (cor.iloc[x, y] - cor.iloc[x, m]*cor.iloc[y, m])/((1 - cor.iloc[x, m]**2)*(1 - cor.iloc[y, m]**2))**(1/2)
                d[x, y, m] = cor.iloc[x, y] - pcor[x, y, m]  # <-- this part!

You need to make the shapes of d (currently (3, 3)) and cor (currently (3, 3, 3)) broadcastable before subtraction. Since d's shape already matches cor's trailing two axes, plain d - cor broadcasts d across the first axis of cor, which is exactly what the J expression does. The ValueError in the question is raised by pandas, not numpy: df.corr() returns a DataFrame, so convert it first with cor.to_numpy() (or .values on older pandas). If you instead need the 2-D matrix broadcast over the last axis, as the loop above needs for cor.iloc[x, y] - pcor[x, y, m], add a trailing axis with d[:, :, None]: the colons keep d's existing axes and None (np.newaxis) creates the new last axis.
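For the full loop above, the same broadcasting idea vectorizes everything in one shot. A minimal sketch, assuming cor comes from df.corr() as in the question (the function name pcor_vectorized and the .to_numpy() conversion are mine):

import numpy as np

def pcor_vectorized(df):
    c = df.corr().to_numpy()   # plain ndarray; avoids the pandas ValueError
    cxm = c[:, None, :]        # c[x, m], broadcast over y
    cym = c[None, :, :]        # c[y, m], broadcast over x
    # pcor[x, y, m] = (c[x, y] - c[x, m]*c[y, m]) / sqrt((1 - c[x, m]**2)*(1 - c[y, m]**2))
    pcor = (c[:, :, None] - cxm * cym) / np.sqrt((1 - cxm**2) * (1 - cym**2))
    n = len(c)
    pcor[np.arange(n), np.arange(n), :] = np.nan   # the x == y case
    # d[x, y, m] = c[x, y] - pcor[x, y, m]  <-- the subtraction from the question
    d = c[:, :, None] - pcor
    return pcor, d

Like the original loop, this divides by zero whenever m equals x or y (a column's correlation with itself is 1), so expect the same warnings and NaN/inf entries there.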

Related

Numpy: recode numeric array to which quintile each element belongs

I have a numeric vector a:
import numpy as np
a = np.random.rand(100)
I wish to get the vector (or any other vector) recoded so that each element is either 0, 1, 2, 3 or 4, according to which quintile it is in (this could be more general for any quantile, like quartiles, deciles etc.).
This is what I'm doing. There has to be something more elegant, no?
from scipy.stats import percentileofscore

n_quantiles = 5

def get_quantile(i, a, n_quantiles):
    if a[i] >= max(a):
        return n_quantiles - 1
    return int(percentileofscore(a, a[i]) / (100 / n_quantiles))

a_recoded = np.array([get_quantile(i, a, n_quantiles) for i in range(len(a))])
print(a)
print(a_recoded)
[0.04708996 0.86267278 0.23873192 0.02967989 0.42828385 0.58003015
0.8996666 0.15359369 0.83094778 0.44272398 0.60211289 0.90286434
0.40681163 0.91338397 0.3273745 0.00347029 0.37471307 0.72735901
0.93974808 0.55937197 0.39297097 0.91470761 0.76796271 0.50404401
0.1817242 0.78244809 0.9548256 0.78097562 0.90934337 0.89914752
0.82899983 0.44116683 0.50885813 0.2691431 0.11676798 0.84971927
0.38505195 0.7411976 0.51377242 0.50243197 0.89677377 0.69741088
0.47880953 0.71116534 0.01717348 0.77641096 0.88127268 0.17925502
0.53053573 0.16935597 0.65521692 0.19042794 0.21981197 0.01377195
0.61553814 0.8544525 0.53521604 0.88391848 0.36010949 0.35964882
0.29721931 0.71257335 0.26350287 0.22821314 0.8951419 0.38416004
0.19277649 0.67774468 0.27084229 0.46862229 0.3107887 0.28511048
0.32682302 0.14682896 0.10794566 0.58668243 0.16394183 0.88296862
0.55442047 0.25508233 0.86670299 0.90549872 0.04897676 0.33042884
0.4348465 0.62636481 0.48201213 0.49895892 0.36444648 0.01410316
0.46770595 0.09498391 0.96793139 0.03931124 0.64286295 0.50934846
0.59088907 0.56368594 0.7820928 0.77172038]
[0 4 1 0 2 3 4 0 4 2 3 4 2 4 1 0 1 3 4 2 1 4 3 2 0 3 4 3 4 4 4 2 2 1 0 4 1
3 2 2 4 3 2 3 0 3 4 0 2 0 3 0 1 0 3 4 2 4 1 1 1 3 1 1 4 1 0 3 1 2 1 1 1 0
0 3 0 4 2 1 4 4 0 1 2 3 2 2 1 0 2 0 4 0 3 2 3 2 3 3]
Update: just wanted to say this is so easy in R:
How to get the x which belongs to a quintile?
You could use argpartition: partitioning a at the quintile boundary indices leaves o listing the element indices in coarse rank order, so assigning np.arange(N) * nq // N through o hands out the bin labels in blocks of N // nq. Example:
>>> a = np.random.random(20)
>>> N = len(a)
>>> nq = 5
>>> o = a.argpartition(np.arange(1, nq) * N // nq)
>>> out = np.empty(N, int)
>>> out[o] = np.arange(N) * nq // N
>>> a
array([0.61238649, 0.37168998, 0.4624829 , 0.28554766, 0.00098016,
0.41979328, 0.62275886, 0.4254548 , 0.20380679, 0.762435 ,
0.54054873, 0.68419986, 0.3424479 , 0.54971072, 0.06929464,
0.51059431, 0.68448674, 0.97009023, 0.16780152, 0.17887862])
>>> out
array([3, 1, 2, 1, 0, 2, 3, 2, 1, 4, 3, 4, 1, 3, 0, 2, 4, 4, 0, 0])
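A quick sanity check (continuing the same session): when nq divides N, each label appears exactly N // nq times:
>>> np.bincount(out)
array([4, 4, 4, 4, 4])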
Here's one way to do it using pd.cut()
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.rand(100))
df.columns = ['values']
# Apply the quantiles
gdf = df.groupby(pd.cut(df.loc[:, 'values'], np.arange(0, 1.2, 0.2)))['values'].apply(lambda x: list(x)).to_frame()
# Make use of the automatic indexing to assign quantile numbers
gdf.reset_index(drop=True, inplace=True)
# Re-expand the grouped list of values. Method provided by @Zero at https://stackoverflow.com/questions/32468402/how-to-explode-a-list-inside-a-dataframe-cell-into-separate-rows
gdf['values'].apply(pd.Series).stack().reset_index(level=1, drop=True).to_frame('values').reset_index()
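For what it's worth, pandas also has a one-call analogue of the R approach the update mentions; a sketch assuming integer labels 0-4 are wanted:

import numpy as np
import pandas as pd

a = np.random.rand(100)
a_recoded = pd.qcut(a, 5, labels=False)   # labels=False returns the bin index 0..4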

Coercing numpy arrays to same dimensions

I have two matrices, and I want to perform a 'cell-wise' addition; however, the matrices aren't the same size. I want to preserve each cell's relative position during the calculation (i.e. its 'co-ordinates' from the top left), so a simple (if maybe not the best) solution seems to be to pad the smaller matrix's x and y with zeros.
This thread has a perfectly satisfactory answer for concatenating vertically, and this does work with my data. Following the suggestion in the answer, I also threw in an hstack, but at the moment it's complaining that the dimensions (excluding the concatenation axis) need to match exactly. Perhaps hstack doesn't work as I anticipate, or exactly equivalently to vstack, but I'm at a bit of a loss now.
This is what hstack throws at me, while vstack seems to have no problem:
ValueError: all the input array dimensions except for the concatenation axis must match exactly
Essentially the code checks which of a pair of matrices is the shorter and/or wider, and then pads the smaller matrix with zeros to match.
Here's the code I have:
import numpy as np
A = np.random.randint(2, size = (3, 7))
B = np.random.randint(2, size = (5, 10))
# If the arrays have different row numbers:
if A.shape[0] < B.shape[0]: # Is A shorter than B?
    A = np.vstack((A, np.zeros((B.shape[0] - A.shape[0], A.shape[1]))))
elif A.shape[0] > B.shape[0]: # or is A longer than B?
    B = np.vstack((B, np.zeros((A.shape[0] - B.shape[0], B.shape[1]))))
# If they have different column numbers:
if A.shape[1] < B.shape[1]: # Is A narrower than B?
    A = np.hstack((A, np.zeros((B.shape[1] - A.shape[1], A.shape[0]))))
elif A.shape[1] > B.shape[1]: # or is A wider than B?
    B = np.hstack((B, np.zeros((A.shape[1] - B.shape[1], B.shape[0]))))
It's getting late, so it's possible I've just missed something obvious with hstack, but I can't see my logic error at the moment.
Just use np.pad:
np.pad(A,((0,2),(0,3)),'constant') # 2 is 5-3, 3 is 10-7
[[0 1 1 0 1 0 0 0 0 0]
[1 0 0 1 0 1 0 0 0 0]
[1 0 1 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0]]
But the four pad widths must be computed; so another simple method to pad the two arrays in any case is:
A = np.ones((3, 7),int)
B = np.ones((5, 2),int)
ma,na = A.shape
mb,nb = B.shape
m,n = max(ma,mb) , max(na,nb)
newA = np.zeros((m,n),A.dtype)
newA[:ma,:na]=A
newB = np.zeros((m,n),B.dtype)
newB[:mb,:nb]=B
Which gives:
[[1 1 1 1 1 1 1]
[1 1 1 1 1 1 1]
[1 1 1 1 1 1 1]
[0 0 0 0 0 0 0]
[0 0 0 0 0 0 0]]
[[1 1 0 0 0 0 0]
[1 1 0 0 0 0 0]
[1 1 0 0 0 0 0]
[1 1 0 0 0 0 0]
[1 1 0 0 0 0 0]]
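Putting the pad-width computation and np.pad together, a minimal sketch (the helper name pad_to_match is mine, and it assumes zero-padding on the bottom/right is what's wanted):

import numpy as np

def pad_to_match(A, B):
    # Zero-pad both arrays on the bottom/right to a common shape.
    m = max(A.shape[0], B.shape[0])
    n = max(A.shape[1], B.shape[1])
    A = np.pad(A, ((0, m - A.shape[0]), (0, n - A.shape[1])), 'constant')
    B = np.pad(B, ((0, m - B.shape[0]), (0, n - B.shape[1])), 'constant')
    return A, B

A, B = pad_to_match(np.random.randint(2, size=(3, 7)),
                    np.random.randint(2, size=(5, 10)))
print(A.shape, B.shape)   # (5, 10) (5, 10)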
I think your hstack lines should be of the form
np.hstack((A, np.zeros((A.shape[0], B.shape[1] - A.shape[1]))))
You seem to have the rows and columns swapped.
Yes, indeed. You should swap (B.shape[1] - A.shape[1], A.shape[0]) to (A.shape[0], B.shape[1] - A.shape[1]) and so on, because you need to have the same numbers of rows to stack them horizontally.
Try b[:a.shape[0], :a.shape[1]] = b[:a.shape[0], :a.shape[1]] + a, where b is the larger array.
Example below
import numpy as np
a = np.arange(12).reshape(3, 4)
print("a\n", a)
b = np.arange(16).reshape(4, 4)
print("b original\n", b)
b[:a.shape[0], :a.shape[1]] = b[:a.shape[0], :a.shape[1]]+a
print("b new\n",b)
output
a
[[ 0 1 2 3]
[ 4 5 6 7]
[ 8 9 10 11]]
b original
[[ 0 1 2 3]
[ 4 5 6 7]
[ 8 9 10 11]
[12 13 14 15]]
b new
[[ 0 2 4 6]
[ 8 10 12 14]
[16 18 20 22]
[12 13 14 15]]
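One caveat with this approach: it modifies b in place. If the original b is still needed afterwards, work on a copy, e.g.:

b_new = b.copy()
b_new[:a.shape[0], :a.shape[1]] += a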

Euclidean distance in Python

I have two 3000x3 vectors and I'd like to compute 1-to-1 Euclidean distance between them. For example, vec1 is
1 1 1
2 2 2
3 3 3
4 4 4
...
The vec2 is
2 2 2
3 3 3
4 4 4
5 5 5
...
I'd like to get the results as
1.73205081
1.73205081
1.73205081
1.73205081
...
I tried scipy.spatial.distance.cdist(vec1, vec2), and it returns a 3000x3000 matrix, whereas I only need the main diagonal. I also tried np.sqrt(np.sum((vec1-vec2)**2 for vec1,vec2 in zip(vec1,vec2))), and it didn't work for my purpose. Is there any way to compute the distances, please? I'd appreciate any comments.
cdist gives you back a 3000 x 3000 array because it computes the distance between every pair of row vectors in your two input arrays.
To compute only the distances between corresponding row indices, you could use np.linalg.norm:
a = np.repeat((np.arange(3000) + 1)[:, None], 3, 1)
b = a + 1
dist = np.linalg.norm(a - b, axis=1)
Or using standard vectorized array operations:
dist = np.sqrt(((a - b) ** 2).sum(1))
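For large inputs, an einsum variant of the same computation avoids one temporary array; a small sketch using the same a and b:

diff = a - b
dist = np.sqrt(np.einsum('ij,ij->i', diff, diff))   # row-wise dot product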
Here's another way that works. It still uses np.linalg.norm, but it also parses the raw text input, if that's something you need.
import numpy as np

vec1 = '''1 1 1
2 2 2
3 3 3
4 4 4'''
vec2 = '''2 2 2
3 3 3
4 4 4
5 5 5'''
process_vec1 = np.array([])
process_vec2 = np.array([])
for line in vec1.splitlines():  # iterate over lines, not characters
    process_vec1 = np.append(process_vec1, list(map(float, line.split())))
for line in vec2.splitlines():
    process_vec2 = np.append(process_vec2, list(map(float, line.split())))
process_vec1 = process_vec1.reshape((len(process_vec1) // 3, 3))
process_vec2 = process_vec2.reshape((len(process_vec2) // 3, 3))
dist = np.linalg.norm(process_vec1 - process_vec2, axis=1)
print(dist)
[1.73205081 1.73205081 1.73205081 1.73205081]
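Incidentally, if the data really does arrive as text, np.loadtxt can do the parsing in one shot; a sketch using io.StringIO on the same strings:

import io
import numpy as np

v1 = np.loadtxt(io.StringIO(vec1))   # shape (4, 3)
v2 = np.loadtxt(io.StringIO(vec2))
dist = np.linalg.norm(v1 - v2, axis=1)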

Python Ignoring What is in a list?

Working on a project for CS1 that prints out a grid made of 0s and adds shapes of certain numbered sizes to it. Before it adds a shape, it needs to check A) whether it will fit on the grid and B) whether something else is already there. The issue I am having is that, when run, the function that checks that placement is valid always handles the first and second shapes correctly, but any shape added after that only "sees" the first shape added when looking for a collision. I checked whether it wasn't taking in the right list after the first time, but that doesn't seem to be it. Example of the issue:
Shape Sizes = 4, 3, 2, 1
Python Outputs:
4 4 4 4 1 2 3 0
4 4 4 4 2 2 3 0
4 4 4 4 3 3 3 0
4 4 4 4 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
It Should Output:
4 4 4 4 3 3 3 1
4 4 4 4 3 3 3 0
4 4 4 4 3 3 3 0
4 4 4 4 2 2 0 0
0 0 0 0 2 2 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
What's going on here? The full code is below.
def binCreate(size):
    binlist = [[0 for col in range(size)] for row in range(size)]
    return binlist

def binPrint(lst):
    for row in range(len(lst)):
        for col in range(len(lst[row])):
            print(lst[row][col], end = " ")
        print()

def itemCreate(fileName):
    lst = []
    for i in open(fileName):
        i = i.split()
        lst = i
    lst = [int(i) for i in lst]
    return lst

def main():
    size = int(input("Bin Size: "))
    fileName = str(input("Item Size File: "))
    binList = binCreate(size)
    blockList = itemCreate(fileName)
    blockList.sort(reverse = True)
    binList = checker(binList, len(binList), blockList)
    binPrint(binList)

def isSpaceFree(binList, r, c, size):
    if r + size > len(binList[0]):
        return False
    elif c + size > len(binList[0]):
        return False
    for row in range(r, r + size):
        for col in range(c, c + size):
            if binList[r][c] != 0:
                return False
            elif binList[r][c] == size:
                return False
    return True

def checker(binList, gSize, blockList):
    for i in blockList:
        r = 0
        c = 0
        comp = False
        while comp != True:
            check = isSpaceFree(binList, r, c, i)
            if check == True:
                for x in range(c, c + i):
                    for y in range(r, r + i):
                        binList[x][y] = i
                comp = True
            else:
                print(c)
                print(r)
                r += 1
                if r > gSize:
                    r = 0
                    c += 1
                    if c > gSize:
                        print("Imcompadible")
                        comp = True
        print(i)
        binPrint(binList)
        input()
    return binList
Your code to test for open spaces looks in binList[r][c] (where r is a row value and c is a column value). However, the code that sets the values once an open space has been found sets binList[x][y] (where x is a column value and y is a row value).
The latter is wrong. You want to set binList[y][x] instead (indexing by row, then column).
That will get you a working solution, but it will still not be exactly what you say you expect (you'll get a reflection across the diagonal). This is because your code updates r first, then c only when r has exceeded the bin size. If you want to place items to the right first, then below, you need to swap them.
I'd suggest using two for loops for r and c, rather than a while too, but to make it work in an elegant way you'd probably need to factor out the "find one item's place" code so you could return from the inner loop (rather than needing some complicated code to let you break out of both of the nested loops).
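For concreteness, a minimal sketch of the indexing fix described above (the place helper is hypothetical, not from the question's code):

# Place an i-by-i block with its top-left corner at (r, c),
# indexing the grid as grid[row][column].
def place(grid, r, c, i):
    for y in range(r, r + i):        # rows
        for x in range(c, c + i):    # columns
            grid[y][x] = i           # row index first, then column

grid = [[0] * 8 for _ in range(8)]
place(grid, 0, 4, 3)                 # a 3x3 block at row 0, column 4
for row in grid:
    print(*row)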

Numpy, problem with long arrays

I have two arrays (a and b) with n integer elements in the range (0,N).
(Typo correction: arrays with 2^n integers, where the largest integer takes the value N = 3^n.)
I want to calculate the sum of every combination of elements in a and b (sum_ij = a_i + b_j for all i, j), then take each sum modulo N (sum_ij = sum_ij % N), and finally calculate the frequency of the different sums.
In order to do this fast with numpy, without any loops, I tried to use the meshgrid and the bincount function.
A,B = numpy.meshgrid(a,b)
A = A + B
A = A % N
A = numpy.reshape(A,A.size)
result = numpy.bincount(A)
Now, the problem is that my input arrays are long. And meshgrid gives me MemoryError when I use inputs with 2^13 elements. I would like to calculate this for arrays with 2^15-2^20 elements.
that is n in the range 15 to 20
Are there any clever tricks to do this with numpy?
Any help will be highly appreciated.
--
jon
Try chunking it. Your meshgrid is an NxN matrix; block that up into a 10x10 grid of (N/10)x(N/10) submatrices, compute the bins for each of the 100 blocks, and add them up at the end. This uses only ~1% as much memory as doing the whole thing at once.
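A rough sketch of that chunked approach (the chunk size and helper name are my own choices, not from the thread):

import numpy as np

def chunked_bincount(a, b, N, chunk=1024):
    # Accumulate bincounts of (a_i + b_j) % N one block of b at a time,
    # so only a chunk x len(a) array is ever in memory.
    counts = np.zeros(N, dtype=np.int64)
    for start in range(0, len(b), chunk):
        block = (a[None, :] + b[start:start + chunk, None]) % N
        counts += np.bincount(block.ravel(), minlength=N)
    return counts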
Edit in response to jonalm's comment:
jonalm: N~3^n not n~3^N. N is max element in a and n is number of
elements in a.
n is ~ 2^20. If N is ~ 3^n then N is ~ 3^(2^20) > 10^(500207).
Scientists estimate (http://www.stormloader.com/ajy/reallife.html) that there are only around 10^87 particles in the universe. So there is no (naive) way a computer can handle an int of size 10^(500207).
jonalm: I am however a bit curios about the pv() function you define. (I
do not manage to run it as text.find() is not defined (guess its in another
module)). How does this function work and what is its advantage?
pv is a little helper function I wrote to debug the value of variables. It works like
print() except when you say pv(x) it prints both the literal variable name (or expression string), a colon, and then the variable's value.
If you put
#!/usr/bin/env python
import traceback
def pv(var):
(filename,line_number,function_name,text)=traceback.extract_stack()[-2]
print('%s: %s'%(text[text.find('(')+1:-1],var))
x=1
pv(x)
in a script you should get
x: 1
The modest advantage of using pv over print is that it saves you typing. Instead of having to
write
print('x: %s'%x)
you can just slap down
pv(x)
When there are multiple variables to track, it's helpful to label the variables.
I just got tired of writing it all out.
The pv function works by using the traceback module to peek at the line of code
used to call the pv function itself. (See http://docs.python.org/library/traceback.html#module-traceback) That line of code is stored as a string in the variable text.
text.find() is a call to the usual string method find(). For instance, if
text='pv(x)'
then
text.find('(') == 2 # The index of the '(' in string text
text[text.find('(')+1:-1] == 'x' # Everything in between the parentheses
I'm assuming n ~ 3^N, and n ~ 2**20.
The idea is to work modulo N. This cuts down on the size of the arrays.
The second idea (important when n is huge) is to use numpy ndarrays of 'object' dtype, because if you use a fixed-width integer dtype you risk overflowing the maximum integer that dtype can represent.
#!/usr/bin/env python
import traceback
import numpy as np
def pv(var):
(filename,line_number,function_name,text)=traceback.extract_stack()[-2]
print('%s: %s'%(text[text.find('(')+1:-1],var))
You can change n to be 2**20, but below I show what happens with small n
so the output is easier to read.
n=100
N=int(np.exp(1./3*np.log(n)))
pv(N)
# N: 4
a=np.random.randint(N,size=n)
b=np.random.randint(N,size=n)
pv(a)
pv(b)
# a: [1 0 3 0 1 0 1 2 0 2 1 3 1 0 1 2 2 0 2 3 3 3 1 0 1 1 2 0 1 2 3 1 2 1 0 0 3
# 1 3 2 3 2 1 1 2 2 0 3 0 2 0 0 2 2 1 3 0 2 1 0 2 3 1 0 1 1 0 1 3 0 2 2 0 2
# 0 2 3 0 2 0 1 1 3 2 2 3 2 0 3 1 1 1 1 2 3 3 2 2 3 1]
# b: [1 3 2 1 1 2 1 1 1 3 0 3 0 2 2 3 2 0 1 3 1 0 0 3 3 2 1 1 2 0 1 2 0 3 3 1 0
# 3 3 3 1 1 3 3 3 1 1 0 2 1 0 0 3 0 2 1 0 2 2 0 0 0 1 1 3 1 1 1 2 1 1 3 2 3
# 3 1 2 1 0 0 2 3 1 0 2 1 1 1 1 3 3 0 2 2 3 2 0 1 3 1]
wa holds the number of 0s, 1s, 2s, 3s in a
wb holds the number of 0s, 1s, 2s, 3s in b
wa=np.bincount(a)
wb=np.bincount(b)
pv(wa)
pv(wb)
# wa: [24 28 28 20]
# wb: [21 34 20 25]
result=np.zeros(N,dtype='object')
Think of a 0 as a token or chip. Similarly for 1,2,3.
Think of wa=[24 28 28 20] as meaning there is a bag with 24 0-chips, 28 1-chips, 28 2-chips, 20 3-chips.
You have a wa-bag and a wb-bag. When you draw a chip from each bag, you "add" them together and form a new chip. You "mod" the answer (modulo N).
Imagine taking a 1-chip from the wb-bag and adding it with each chip in the wa-bag.
1-chip + 0-chip = 1-chip
1-chip + 1-chip = 2-chip
1-chip + 2-chip = 3-chip
1-chip + 3-chip = 4-chip = 0-chip (we are mod'ing by N=4)
Since there are 34 1-chips in the wb bag, when you add them against all the chips in the wa=[24 28 28 20] bag, you get
34*24 1-chips
34*28 2-chips
34*28 3-chips
34*20 0-chips
This is just the partial count due to the 34 1-chips. You also have to handle the other
types of chips in the wb-bag, but this shows you the method used below:
for i, count in enumerate(wb):
    partial_count = count * wa
    pv(partial_count)
    shifted_partial_count = np.roll(partial_count, i)
    pv(shifted_partial_count)
    result += shifted_partial_count
# partial_count: [504 588 588 420]
# shifted_partial_count: [504 588 588 420]
# partial_count: [816 952 952 680]
# shifted_partial_count: [680 816 952 952]
# partial_count: [480 560 560 400]
# shifted_partial_count: [560 400 480 560]
# partial_count: [600 700 700 500]
# shifted_partial_count: [700 700 500 600]
pv(result)
# result: [2444 2504 2520 2532]
This is the final result: 2444 0s, 2504 1s, 2520 2s, 2532 3s (as a sanity check, they sum to n² = 10,000, one for each (i, j) pair).
# This is a test to make sure the result is correct.
# This uses a very memory intensive method.
# c is too huge when n is large.
if n > 1000:
    print('n is too large to run the check')
else:
    c = a[:] + b[:, np.newaxis]
    c = c.ravel()
    c = c % N
    result2 = np.bincount(c)
    pv(result2)
    assert all(r1 == r2 for r1, r2 in zip(result, result2))
# result2: [2444 2504 2520 2532]
Check your math, that's a lot of space you're asking for:
2^20*2^20 = 2^40 = 1 099 511 627 776
If each of your elements was just one byte, that's already one terabyte of memory.
Add a loop or two. This problem is not suited to maxing out your memory and minimizing your computation.
