Understand numpy's transpose - python

I have the following python code
import numpy as np
import itertools as it
ref_list = [0, 1, 2]
p = it.permutations(ref_list)
transpose_list = tuple(p)
#print('transpose_list', transpose_list)
na = nb = nc = 2
A = np.zeros((na,nb,nc))
n = 1
for la in range(na):
    for lb in range(nb):
        for lc in range(nc):
            A[la,lb,lc] = n
            n = n + 1
factor_list = [(i+1)*0.0 for i in range(6)]
factor_list[0] = 0.1
factor_list[1] = 0.2
factor_list[2] = 0.3
factor_list[3] = 0.4
sum_A = np.zeros((na,nb,nc))
for m, t in enumerate(transpose_list):
    if abs(factor_list[m]) < 1.e-3:
        continue
    factor_list[m] * np.transpose(A, transpose_list[m])
    print('inter', m, t, factor_list[m], np.transpose(A, transpose_list[m])[0,0,1] )
B = np.transpose(A, (0, 2, 1))
C = np.transpose(A, (1, 2, 0))
for la in range(na):
    for lb in range(nb):
        for lc in range(nc):
            print(la,lb,lc,'A',A[la,lb,lc],'B',B[la,lb,lc],'C',C[la,lb,lc])
The result is
inter 0 (0, 1, 2) 0.1 2.0
inter 1 (0, 2, 1) 0.2 3.0
inter 2 (1, 0, 2) 0.3 2.0
inter 3 (1, 2, 0) 0.4 5.0
0 0 0 A 1.0 B 1.0 C 1.0
0 0 1 A 2.0 B 3.0 C 5.0
0 1 0 A 3.0 B 2.0 C 2.0
0 1 1 A 4.0 B 4.0 C 6.0
1 0 0 A 5.0 B 5.0 C 3.0
1 0 1 A 6.0 B 7.0 C 7.0
1 1 0 A 7.0 B 6.0 C 4.0
1 1 1 A 8.0 B 8.0 C 8.0
My question is: why do inter 1 and inter 3 give 3.0 and 5.0? The objective is to obtain
P(A)[0,0,1].
For inter 1, the permutation is (0, 2, 1); I thought applying (0,2,1) to [0,0,1] gives [0,1,0].
For inter 3, the permutation is (1, 2, 0); I thought applying (1,2,0) to [0,0,1] gives [0,1,0].
So the values should be the same, but the outputs differ (3.0 and 5.0). Apparently I misunderstood np.transpose. What is the correct way to understand what happens inside np.transpose?
More specifically, following Anand S Kumar's answer to "How does numpy.transpose work for this example?",
I tried to reason through both (0, 2, 1) and (1, 2, 0), and both lead to
(0,0,0) -> (0,0,0)
(0,0,1) -> (0,1,0)
I guess it is related to the inverse of the permutation, but I am not sure why.

A more direct way of making your A:
In [29]: A = np.arange(1,9).reshape(2,2,2)
In [30]: A
Out[30]:
array([[[1, 2],
        [3, 4]],

       [[5, 6],
        [7, 8]]])
The transposes:
In [31]: B = np.transpose(A, (0, 2, 1))
    ...: C = np.transpose(A, (1, 2, 0))
In [32]: B
Out[32]:
array([[[1, 3],
        [2, 4]],

       [[5, 7],
        [6, 8]]])
In [33]: C
Out[33]:
array([[[1, 5],
        [2, 6]],

       [[3, 7],
        [4, 8]]])
Two of the cases:
In [35]: A[0,0,1], B[0,1,0],C[0,1,0]
Out[35]: (2, 2, 2)
In [36]: A[1,0,0], B[1,0,0], C[0,0,1]
Out[36]: (5, 5, 5)
It's easy to match A and B by just swapping the last two indices. It's tempting to just swap the 1st and 3rd for C, but that's wrong. When the 1st axis is moved to the end, the others shift over without changing their order:
In [38]: for la in range(na):
    ...:     for lb in range(nb):
    ...:         for lc in range(nc):
    ...:             print(la,lb,lc,'A',A[la,lb,lc],'B',B[la,lc,lb],'C',C[lb,lc,la])
0 0 0 A 1 B 1 C 1
0 0 1 A 2 B 2 C 2
0 1 0 A 3 B 3 C 3
0 1 1 A 4 B 4 C 4
1 0 0 A 5 B 5 C 5
1 0 1 A 6 B 6 C 6
1 1 0 A 7 B 7 C 7
1 1 1 A 8 B 8 C 8
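One way to state the rule, as a minimal sketch: axis k of np.transpose(A, axes) is axis axes[k] of A, so the element at index idx of the result is read from the index of A whose position axes[k] holds idx[k]. The helper source_index below is a hypothetical name introduced only for this illustration; it is not a NumPy function.

```python
import itertools
import numpy as np

A = np.arange(1, 9).reshape(2, 2, 2)

def source_index(axes, idx):
    """Index of A that np.transpose(A, axes)[idx] reads from (illustrative helper)."""
    src = [0] * len(axes)
    for k, axis in enumerate(axes):
        src[axis] = idx[k]   # result axis k is A's axis axes[k]
    return tuple(src)

# Check the rule against NumPy for every permutation and every index.
for axes in itertools.permutations(range(3)):
    T = np.transpose(A, axes)
    for idx in np.ndindex(T.shape):
        assert T[idx] == A[source_index(axes, idx)]

print(source_index((0, 2, 1), (0, 0, 1)))  # (0, 1, 0), where A holds 3
print(source_index((1, 2, 0), (0, 0, 1)))  # (1, 0, 0), where A holds 5
```

The mapping tried in the question, src[k] = idx[axes[k]], applies the permutation in the opposite direction. The two agree only when the permutation is its own inverse, as (0, 2, 1) is and (1, 2, 0) is not, which is why only inter 3 looked surprising.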

Related

What will be the best approach for a digit like pattern in python?

I was trying to print a pattern in Python. For n == 6 it should be:
1 2 3 4 5
2 3 4 5 1
3 4 5 1 2
4 5 1 2 3
5 1 2 3 4
After thinking about it a lot,
I did it like this:
n = 6
for i in range(1,n):
    x = 1
    countj = 0
    for j in range(i,n):
        countj +=1
        print(j,end=" ")
        if j == n-1 and countj < n-1 :
            while countj < n-1:
                print(x , end =" ")
                countj +=1
                x +=1
    print()
But I don't think this is the best approach. I searched for a better one but could not find it, so I came here. Is there a better approach to this problem?
I would do it like this, using a rotating deque instance:
>>> from collections import deque
>>> n = 6
>>> d = deque(range(1, n))
>>> for _ in range(1, n):
...     print(*d)
...     d.rotate(-1)
...
1 2 3 4 5
2 3 4 5 1
3 4 5 1 2
4 5 1 2 3
5 1 2 3 4
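The sign passed to rotate is what drives the pattern, so here is a small sketch of the two directions on the same five values. The variable names left and restored are only for illustration.

```python
from collections import deque

d = deque([1, 2, 3, 4, 5])

d.rotate(-1)        # negative: shift left, the first item moves to the end
left = list(d)      # [2, 3, 4, 5, 1]

d.rotate(1)         # positive: shift right, undoing the previous step
restored = list(d)  # [1, 2, 3, 4, 5]

print(left, restored)
```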
There is a similar/shorter code possible just using range slicing, but maybe it's a bit harder to understand how it works:
>>> ns = range(1, 6)
>>> for i in ns:
...     print(*ns[i-1:], *ns[:i-1])
...
1 2 3 4 5
2 3 4 5 1
3 4 5 1 2
4 5 1 2 3
5 1 2 3 4
You could also create a mathematical function of the coordinates, which might look something like this:
>>> for row in range(5):
...     for col in range(5):
...         print((row + col) % 5 + 1, end=" ")
...     print()
...
1 2 3 4 5
2 3 4 5 1
3 4 5 1 2
4 5 1 2 3
5 1 2 3 4
A too-clever way using list comprehension:
>>> r = range(5)
>>> [[1 + r[i - j - 1] for i in r] for j in reversed(r)]
[[1, 2, 3, 4, 5],
[2, 3, 4, 5, 1],
[3, 4, 5, 1, 2],
[4, 5, 1, 2, 3],
[5, 1, 2, 3, 4]]
more-itertools has this function:
>>> from more_itertools import circular_shifts
>>> circular_shifts(range(1, 6))
[(1, 2, 3, 4, 5),
(2, 3, 4, 5, 1),
(3, 4, 5, 1, 2),
(4, 5, 1, 2, 3),
(5, 1, 2, 3, 4)]
You can use itertools.cycle to make the sequence generated from range repeat itself, and then use itertools.islice to slice the sequence according to the iteration count:
from itertools import cycle, islice
n = 6
for i in range(n - 1):
    print(*islice(cycle(range(1, n)), i, i + n - 1))
This outputs:
1 2 3 4 5
2 3 4 5 1
3 4 5 1 2
4 5 1 2 3
5 1 2 3 4
Your 'pattern' is actually known as a Hankel matrix, commonly used in linear algebra.
So there's a scipy function for creating them.
from scipy.linalg import hankel
hankel([1, 2, 3, 4, 5], [5, 1, 2, 3, 4])
or
from scipy.linalg import hankel
import numpy as np
def my_hankel(n):
    x = np.arange(1, n)
    return hankel(x, np.roll(x, 1))
print(my_hankel(6))
Output:
[[1 2 3 4 5]
[2 3 4 5 1]
[3 4 5 1 2]
[4 5 1 2 3]
[5 1 2 3 4]]
Seeing lots of answers involving Python libraries. If you want a simple way to do it, here it is.
n = 5
arr = [[1 + (start + i) % n for i in range(n)] for start in range(n)]
arr_str = "\n".join(" ".join(str(cell) for cell in row) for row in arr)
print(arr_str)

How to format tabular data WITHOUT row labels?

For instance, right now the output is:
                Temp [C]   v [m3/kg]   u [kJ/kg]  s [kJ/kgK]
    Temp [C]           1           2           1           3
   v [m3/kg]           0           1           0           3
   u [kJ/kg]           2           4           2           3
  s [kJ/kgK]           2           4           2           3
and I want it to look like:
    Temp [C]   v [m3/kg]   u [kJ/kg]  s [kJ/kgK]
           1           2           1           3
           0           1           0           3
           2           4           2           3
           2           4           2           3
Here is the code I am currently using:
units = ['Temp [C]', 'v [m3/kg]', 'u [kJ/kg]', 's [kJ/kgK]']
data = ([1, 2, 1, 3], [0, 1, 0, 3], [2, 4, 2,3], [2, 4, 2,3])
format_row = "{:>12}" * (len(units) + 1)
print(format_row.format("", *units))
for unit, row in zip(units, data):
    print(format_row.format(unit, *row))
units = ['Temp [C]', 'v [m3/kg]', 'u [kJ/kg]', 's [kJ/kgK]']
data = ([1, 2, 1, 3], [0, 1, 0, 3], [2, 4, 2,3], [2, 4, 2,3])
fmt_string = '{:<15}' * len(units)
print(fmt_string.format(*units))
for row in data:
    print(fmt_string.format(*row))
Prints:
Temp [C]       v [m3/kg]      u [kJ/kg]      s [kJ/kgK]
1              2              1              3
0              1              0              3
2              4              2              3
2              4              2              3
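The fixed width of 15 works here because every header is shorter than that. As a hedged variation, the width can be derived from the headers themselves. The two extra spaces of padding and the rstrip of trailing blanks are arbitrary choices for this sketch.

```python
units = ['Temp [C]', 'v [m3/kg]', 'u [kJ/kg]', 's [kJ/kgK]']
data = ([1, 2, 1, 3], [0, 1, 0, 3], [2, 4, 2, 3], [2, 4, 2, 3])

# Widest header plus two spaces of padding (the padding amount is an assumption).
width = max(len(u) for u in units) + 2
fmt_string = '{{:<{}}}'.format(width) * len(units)   # e.g. '{:<12}' repeated 4 times

lines = [fmt_string.format(*units)]
lines += [fmt_string.format(*row) for row in data]
print('\n'.join(line.rstrip() for line in lines))
```

Leaving the row label out of both the format string and the format call is what drops the first column; the width calculation only keeps the columns from colliding if a longer header is added later.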

How to understand the Python code I wrote for a problem

If I give 2 output is
2 2 2
2 1 2
2 2 2
and for 3 is
3 3 3 3 3
3 2 2 2 3
3 2 1 2 3
3 2 2 2 3
3 3 3 3 3
The code is as follows:
def square(arr, val):
    if val == 1:
        return [[1]]  # self-explanatory
    n = val + (val - 1)  # array size; for 2 it is 3
    sideAdd = [val] * n
    arr.insert(0, sideAdd)  # for 2, adds [2,2,2] at the front
    arr.append(sideAdd)  # appends [2,2,2]
    for i in range(1, n-1):  # for inner rows, add val to both sides
        arr[i].insert(0, val)  # e.g. for 2: [val,1,val]
        arr[i].append(val)
    return(arr)
array = square([[2, 2, 2], [2, 1, 2], [2, 2, 2]], 3)
# print(array)
for i in array:
    print(*i)
The output goes like:
3 3 3 3 3
3 2 2 2 3
3 2 1 2 3
3 2 2 2 3
3 3 3 3 3
which is a correct answer.
But when I tried to complete the whole solution by supplying the values through a for loop and passing the returned array back into the same function, like
n = 3
arr = []
for i in range(1, n+1):
    arr = square(arr, i)
the whole code is
def square(arrx, val):
    if val == 1:
        return [[1]]
    n1 = val + (val - 1)
    sideAdd = [val] * n1
    arrx.insert(0, sideAdd)
    arrx.append(sideAdd)
    for i in range(1, n1-1):
        arrx[i].insert(0, val)
        arrx[i].append(val)
    return arrx
n = 3
arr = []
for i in range(1, n+1):
    arr = square(arr, i)
for i in arr:
    print(*i)
This returns the answer as:
3 3 3 3 3
3 3 2 2 2 3 3
3 2 1 2 3
3 3 2 2 2 3 3
3 3 3 3 3
which is wrong.
I already tried running it in debug mode in PyCharm, and there I noticed something unusual (see the screenshot below). When j = 1 (index), the code inserts 2 into the row underlined in blue but also into the row underlined in red, which it should not, because the red-underlined row has index 3 (the insertion of 2 there should occur at j = 3). When j = 3, an insertion at index 1 (j = 1) happens again, which makes the output 3 3 2 2 2 3 3 instead of 3 2 2 2 3.
I can't understand how that's happening. The screenshot is as follows:
https://imgur.com/yce47Pi
I think you are making this more complicated than it needs to be. You can determine the correct value of any cell from their indexes alone. Given a size n and row/column [i,j], the value is:
max(abs(n - 1 - j) + 1, abs(n - 1 - i) + 1)
For example:
def square(n):
    arr = []
    for i in range(n + n-1):
        cur = []
        arr.append(cur)
        for j in range(n + n -1):
            cur.append(max(abs(n - 1 - j) + 1, abs(n - 1 - i) + 1))
    return arr
Then
> square(3)
[[3, 3, 3, 3, 3],
[3, 2, 2, 2, 3],
[3, 2, 1, 2, 3],
[3, 2, 2, 2, 3],
[3, 3, 3, 3, 3]]
> square(5)
[[5, 5, 5, 5, 5, 5, 5, 5, 5],
[5, 4, 4, 4, 4, 4, 4, 4, 5],
[5, 4, 3, 3, 3, 3, 3, 4, 5],
[5, 4, 3, 2, 2, 2, 3, 4, 5],
[5, 4, 3, 2, 1, 2, 3, 4, 5],
[5, 4, 3, 2, 2, 2, 3, 4, 5],
[5, 4, 3, 3, 3, 3, 3, 4, 5],
[5, 4, 4, 4, 4, 4, 4, 4, 5],
[5, 5, 5, 5, 5, 5, 5, 5, 5]]
Edit
The problem with the current code is this:
sideAdd = [val] * n
arr.insert(0, sideAdd)
arr.append(sideAdd)
You are adding a reference to the same list (sideAdd) twice. So later, when you add columns with:
arrx[i].insert(0, val)
arrx[i].append(val)
these two rows are the same list object if they were added in an earlier loop iteration. Adding to the first row also adds to the second, and adding to the second adds to the first, so you end up doing it twice. There are a couple of ways to fix that, but the easiest is adding a copy the second time:
sideAdd = [val] * n
arr.insert(0, sideAdd)
arr.append(sideAdd[:]) # Make a copy — don't add the same reference
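To make the aliasing visible, here is a minimal sketch in two parts: first the shared-reference effect in isolation, then the question's loop with independent border rows. The name square_fixed is hypothetical, and building two separate lists is just one of the possible fixes, equivalent in effect to the sideAdd[:] copy above.

```python
# Part 1: the same list inserted twice is one object reachable from two rows.
side = [2, 2, 2]
grid = [[1]]
grid.insert(0, side)
grid.append(side)
print(grid[0] is grid[-1])   # True: both rows are the same list
grid[0].append(9)
print(grid[-1])              # [2, 2, 2, 9]: the "other" row changed too

# Part 2: the question's function with a separate list for each border row.
def square_fixed(arr, val):
    if val == 1:
        return [[1]]
    n = 2 * val - 1
    arr.insert(0, [val] * n)   # fresh list for the top row
    arr.append([val] * n)      # a second, separate list for the bottom row
    for i in range(1, n - 1):
        arr[i].insert(0, val)
        arr[i].append(val)
    return arr

arr = []
for i in range(1, 4):
    arr = square_fixed(arr, i)
for row in arr:
    print(*row)
```

With independent rows, the step for val = 3 wraps each of the three inner rows exactly once, so the second and fourth lines come out as 3 2 2 2 3.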
We can make use of NumPy's broadcasting rules here:
>>> np.maximum(np.abs(np.arange(-2, 3)[:, None]), np.abs(np.arange(-2, 3))) + 1
array([[3, 3, 3, 3, 3],
[3, 2, 2, 2, 3],
[3, 2, 1, 2, 3],
[3, 2, 2, 2, 3],
[3, 3, 3, 3, 3]])
So we can render the cube with:
import numpy as np
def cube(n):
    ran = np.abs(np.arange(-n+1, n))
    cub = np.maximum(ran[:, None], ran) + 1
    return '\n'.join(' '.join(map(str, row)) for row in cub)
For example:
>>> print(cube(1))
1
>>> print(cube(2))
2 2 2
2 1 2
2 2 2
>>> print(cube(3))
3 3 3 3 3
3 2 2 2 3
3 2 1 2 3
3 2 2 2 3
3 3 3 3 3
>>> print(cube(4))
4 4 4 4 4 4 4
4 3 3 3 3 3 4
4 3 2 2 2 3 4
4 3 2 1 2 3 4
4 3 2 2 2 3 4
4 3 3 3 3 3 4
4 4 4 4 4 4 4
>>> print(cube(5))
5 5 5 5 5 5 5 5 5
5 4 4 4 4 4 4 4 5
5 4 3 3 3 3 3 4 5
5 4 3 2 2 2 3 4 5
5 4 3 2 1 2 3 4 5
5 4 3 2 2 2 3 4 5
5 4 3 3 3 3 3 4 5
5 4 4 4 4 4 4 4 5
5 5 5 5 5 5 5 5 5
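For readers unfamiliar with broadcasting, the following sketch spells out the intermediate shapes for n = 3. The names col and grid are introduced only for this illustration.

```python
import numpy as np

n = 3
ran = np.abs(np.arange(-n + 1, n))   # distance from the centre: [2 1 0 1 2]
col = ran[:, None]                   # shape (5, 1): one distance per row
grid = np.maximum(col, ran) + 1      # (5, 1) against (5,) broadcasts to (5, 5)

print(col.shape, ran.shape, grid.shape)
print(grid[0, 2], grid[2, 2], grid[3, 1])
```

Each cell is the larger of its row distance and its column distance from the centre, plus one, which is the same max(abs(...), abs(...)) formula given in the earlier answer.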

Numpy: recode numeric array to which quintile each element belongs

I have a numeric vector a:
import numpy as np
a = np.random.rand(100)
I wish to get this vector (or any other vector) recoded so that each element is 0, 1, 2, 3 or 4, according to which quintile it is in (ideally this would generalize to any quantile, such as quartile or decile).
This is what I'm doing. There has to be something more elegant, no?
from scipy.stats import percentileofscore
n_quantiles = 5
def get_quantile(i, a, n_quantiles):
    if a[i] >= max(a):
        return n_quantiles - 1
    return int(percentileofscore(a, a[i])/(100/n_quantiles))
a_recoded = np.array([get_quantile(i, a, n_quantiles) for i in range(len(a))])
print(a)
print(a_recoded)
[0.04708996 0.86267278 0.23873192 0.02967989 0.42828385 0.58003015
0.8996666 0.15359369 0.83094778 0.44272398 0.60211289 0.90286434
0.40681163 0.91338397 0.3273745 0.00347029 0.37471307 0.72735901
0.93974808 0.55937197 0.39297097 0.91470761 0.76796271 0.50404401
0.1817242 0.78244809 0.9548256 0.78097562 0.90934337 0.89914752
0.82899983 0.44116683 0.50885813 0.2691431 0.11676798 0.84971927
0.38505195 0.7411976 0.51377242 0.50243197 0.89677377 0.69741088
0.47880953 0.71116534 0.01717348 0.77641096 0.88127268 0.17925502
0.53053573 0.16935597 0.65521692 0.19042794 0.21981197 0.01377195
0.61553814 0.8544525 0.53521604 0.88391848 0.36010949 0.35964882
0.29721931 0.71257335 0.26350287 0.22821314 0.8951419 0.38416004
0.19277649 0.67774468 0.27084229 0.46862229 0.3107887 0.28511048
0.32682302 0.14682896 0.10794566 0.58668243 0.16394183 0.88296862
0.55442047 0.25508233 0.86670299 0.90549872 0.04897676 0.33042884
0.4348465 0.62636481 0.48201213 0.49895892 0.36444648 0.01410316
0.46770595 0.09498391 0.96793139 0.03931124 0.64286295 0.50934846
0.59088907 0.56368594 0.7820928 0.77172038]
[0 4 1 0 2 3 4 0 4 2 3 4 2 4 1 0 1 3 4 2 1 4 3 2 0 3 4 3 4 4 4 2 2 1 0 4 1
3 2 2 4 3 2 3 0 3 4 0 2 0 3 0 1 0 3 4 2 4 1 1 1 3 1 1 4 1 0 3 1 2 1 1 1 0
0 3 0 4 2 1 4 4 0 1 2 3 2 2 1 0 2 0 4 0 3 2 3 2 3 3]
Update: just wanted to say this is so easy in R:
How to get the x which belongs to a quintile?
You could use argpartition. Example:
>>> a = np.random.random(20)
>>> N = len(a)
>>> nq = 5
>>> o = a.argpartition(np.arange(1, nq) * N // nq)
>>> out = np.empty(N, int)
>>> out[o] = np.arange(N) * nq // N
>>> a
array([0.61238649, 0.37168998, 0.4624829 , 0.28554766, 0.00098016,
0.41979328, 0.62275886, 0.4254548 , 0.20380679, 0.762435 ,
0.54054873, 0.68419986, 0.3424479 , 0.54971072, 0.06929464,
0.51059431, 0.68448674, 0.97009023, 0.16780152, 0.17887862])
>>> out
array([3, 1, 2, 1, 0, 2, 3, 2, 1, 4, 3, 4, 1, 3, 0, 2, 4, 4, 0, 0])
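If the partial sort is not needed, the same idea can be sketched with a full sort: compute each element's rank and scale it to the number of quantiles. This is a simplified variant, assuming ties are absent or may be broken arbitrarily. The helper name quantile_codes is hypothetical, and a full argsort generally does more work than argpartition.

```python
import numpy as np

def quantile_codes(a, n_quantiles):
    """Map each element to 0..n_quantiles-1 by its rank (illustrative helper)."""
    a = np.asarray(a)
    ranks = a.argsort().argsort()        # rank of each element, 0..N-1
    return ranks * n_quantiles // len(a)

rng = np.random.default_rng(0)           # seeded only for reproducibility
a = rng.random(20)
codes = quantile_codes(a, 5)
print(np.bincount(codes))                # [4 4 4 4 4]
```

Because the ranks are a permutation of 0..N-1, every code gets the same number of elements whenever N is divisible by the number of quantiles.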
Here's one way to do it using pd.cut()
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.rand(100))
df.columns = ['values']
# Apply the quantiles
gdf = df.groupby(pd.cut(df.loc[:, 'values'], np.arange(0, 1.2, 0.2)))['values'].apply(lambda x: list(x)).to_frame()
# Make use of the automatic indexing to assign quantile numbers
gdf.reset_index(drop=True, inplace=True)
# Re-expand the grouped list of values. Method provided by @Zero at https://stackoverflow.com/questions/32468402/how-to-explode-a-list-inside-a-dataframe-cell-into-separate-rows
gdf['values'].apply(pd.Series).stack().reset_index(level=1, drop=True).to_frame('values').reset_index()
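Note that pd.cut with np.arange(0, 1.2, 0.2) uses fixed, equal-width bins, which only approximates quintiles when the data is uniformly distributed on [0, 1]. As a hedged alternative, pd.qcut bins by sample quantiles, and labels=False returns the integer bin codes directly. The sketch below assumes distinct values (repeated bin edges would need the duplicates argument) and uses a seeded generator only to make it reproducible.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.random(100)

# Integer code 0..4 per element, in the original order of `a`.
codes = pd.qcut(a, 5, labels=False)
counts = np.bincount(np.asarray(codes, dtype=int))
print(counts)
```

Changing the second argument to 4 or 10 gives quartile or decile codes in the same way.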

Pandas aggregating average while excluding current row

How can I aggregate to get the average of b within each group a while excluding the current row (the target result is in c)?
a b c
1 1 0.5 # (avg of 0 & 1, excluding 1)
1 1 0.5 # (avg of 0 & 1, excluding 1)
1 0 1 # (avg of 1 & 1, excluding 0)
2 1 0.5 # (avg of 0 & 1, excluding 1)
2 0 1 # (avg of 1 & 1, excluding 0)
2 1 0.5 # (avg of 0 & 1, excluding 1)
3 1 0.5 # (avg of 0 & 1, excluding 1)
3 0 1 # (avg of 1 & 1, excluding 0)
3 1 0.5 # (avg of 0 & 1, excluding 1)
Data dump:
import pandas as pd
data = pd.DataFrame([[1, 1, 0.5], [1, 1, 0.5], [1, 0, 1], [2, 1, 0.5], [2, 0, 1],
                     [2, 1, 0.5], [3, 1, 0.5], [3, 0, 1], [3, 1, 0.5]],
                    columns=['a', 'b', 'c'])
Suppose a group has values x_1, ..., x_n.
The average of the entire group would be
m = (x_1 + ... + x_n)/n
The sum of the group without x_i would be
(m*n - x_i)
The average of the group without x_i would be
(m*n - x_i)/(n-1)
Therefore, you could compute the desired column of values with
import pandas as pd
df = pd.DataFrame([[1, 1, 0.5], [1, 1, 0.5], [1, 0, 1], [2, 1, 0.5], [2, 0, 1],
                   [2, 1, 0.5], [3, 1, 0.5], [3, 0, 1], [3, 1, 0.5]],
                  columns=['a', 'b', 'c'])
grouped = df.groupby(['a'])
n = grouped['b'].transform('count')
mean = grouped['b'].transform('mean')
df['result'] = (mean*n - df['b'])/(n-1)
which yields
In [32]: df
Out[32]:
a b c result
0 1 1 0.5 0.5
1 1 1 0.5 0.5
2 1 0 1.0 1.0
3 2 1 0.5 0.5
4 2 0 1.0 1.0
5 2 1 0.5 0.5
6 3 1 0.5 0.5
7 3 0 1.0 1.0
8 3 1 0.5 0.5
In [33]: assert df['result'].equals(df['c'])
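Since m*n is just the group sum, the same formula can be sketched with a 'sum' transform instead of multiplying the mean back by the count. This is a small variation on the calculation above, not a different method; the column values are those from the question.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 1, 1, 2, 2, 2, 3, 3, 3],
                   'b': [1, 1, 0, 1, 0, 1, 1, 0, 1]})

grouped = df.groupby('a')['b']
total = grouped.transform('sum')     # m*n from the derivation, computed directly
n = grouped.transform('count')

# Leave-one-out mean: remove the current row from the sum and from the count.
df['result'] = (total - df['b']) / (n - 1)
print(df['result'].tolist())
# [0.5, 0.5, 1.0, 0.5, 1.0, 0.5, 0.5, 1.0, 0.5]
```

A group with a single row has n - 1 = 0, so the leave-one-out mean is undefined there; that case needs separate handling if it can occur in the data.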
Per the comments below, in the OP's actual use case, the DataFrame's a column
contains strings:
import numpy as np
def make_random_str_array(letters, strlen, size):
    return (np.random.choice(list(letters), size*strlen)
            .view('|S{}'.format(strlen)))
N = 3*10**6
df = pd.DataFrame({'a':make_random_str_array(letters='ABCD', strlen=10, size=N),
                   'b':np.random.randint(10, size=N)})
so that there are about a million unique values in df['a'] out of 3 million
total:
In [87]: uniq, key = np.unique(df['a'], return_inverse=True)
In [88]: len(uniq)
Out[88]: 988337
In [89]: len(df)
Out[89]: 3000000
In this case the calculation above requires (on my machine) about 11 seconds:
In [86]: %%timeit
   ....: grouped = df.groupby(['a'])
   ....: n = grouped['b'].transform('count')
   ....: mean = grouped['b'].transform('mean')
   ....: df['result'] = (mean*n - df['b'])/(n-1)
   ....:
1 loops, best of 3: 10.5 s per loop
Pandas converts all string-valued columns to object dtype. But we could convert the
DataFrame column to a NumPy array with a fixed-width dtype, and then group
according to those values.
Here is a benchmark showing that if we convert the Series with object dtype to a NumPy array with fixed-width string dtype, the calculation requires less than 2 seconds:
In [97]: %%timeit
   ....: grouped = df.groupby(df['a'].values.astype('|S4'))
   ....: n = grouped['b'].transform('count')
   ....: mean = grouped['b'].transform('mean')
   ....: df['result'] = (mean*n - df['b'])/(n-1)
   ....:
1 loops, best of 3: 1.39 s per loop
Beware that you need to know the maximum length of the strings in df['a'] to choose the appropriate fixed-width dtype. In the example above, all the strings have length 4, so |S4 works. If you use |Sn for some integer n and n is smaller than the longest string, then those strings will be silently truncated, with no error or warning. This could lead to values being grouped together that should not be. Thus, the onus is on you to choose the correct fixed-width dtype.
You could use
dtype = '|S{}'.format(df['a'].str.len().max())
grouped = df.groupby(df['a'].values.astype(dtype))
to ensure the conversion uses the correct dtype.
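The truncation risk is easy to reproduce on a tiny array. The sketch below uses two made-up labels that differ only after the fourth character; the variable names are illustrative.

```python
import numpy as np

labels = np.array(['ABCDEF', 'ABCDXY'])

too_short = labels.astype('|S4')   # both entries are cut down to b'ABCD'
wide_enough = labels.astype('|S{}'.format(max(len(s) for s in labels)))

print(too_short[0] == too_short[1])       # True: distinct labels now collide
print(wide_enough[0] == wide_enough[1])   # False: labels stay distinct
```

Grouping on too_short would merge the two labels into one group, which is exactly the silent error the warning above describes.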
You can calculate the statistics manually by iterating group by group:
# Set up input
import pandas as pd
df = pd.DataFrame([
    [1, 1, 0.5], [1, 1, 0.5], [1, 0, 1],
    [2, 1, 0.5], [2, 0, 1], [2, 1, 0.5],
    [3, 1, 0.5], [3, 0, 1], [3, 1, 0.5]
], columns=['a', 'b', 'c'])
df
a b c
0 1 1 0.5
1 1 1 0.5
2 1 0 1.0
3 2 1 0.5
4 2 0 1.0
5 2 1 0.5
6 3 1 0.5
7 3 0 1.0
8 3 1 0.5
# Perform grouping, excluding the current row
results = []
grouped = df.groupby(['a'])
for key, group in grouped:
    for idx, row in group.iterrows():
        # The group excluding current row
        group_other = group.drop(idx)
        avg = group_other['b'].mean()
        results.append(row.tolist() + [avg])
# Compare our results with what is expected
results_df = pd.DataFrame(
    results, columns=['a', 'b', 'c', 'c_new']
)
results_df
a b c c_new
0 1 1 0.5 0.5
1 1 1 0.5 0.5
2 1 0 1.0 1.0
3 2 1 0.5 0.5
4 2 0 1.0 1.0
5 2 1 0.5 0.5
6 3 1 0.5 0.5
7 3 0 1.0 1.0
8 3 1 0.5 0.5
This way you can use any statistic you want.
