Why doesn't assignment work on this DataFrame - python

I want to create a copy df with different values based on the previous one. I have used this technique before and it worked just fine, however it doesn't work here.
Does anyone know if I am missing something?
Code:
df2 = df1.copy()
for index, row in df2.iterrows():
    flowers_num = int(row["flowers_num"])
    if flowers_num >= 100:
        flowers_num = 10
    elif flowers_num >= 10:
        flowers_num = 8
    else:
        flowers_num = 6
    row["flowers_num"] = flowers_num
Unique values on df2 before loop:
[ 0. 1. 10. 15. 6. 2. 4. 3. 44. 8. 9. 7. 22. 5.
11. 19. 12. 13. 21. 20. 14. 23. 16. 18. 24. 17. 35. 32.
25. 30. 28. 57. 45. 27. 42. 38. 43. 37. 34. 26. 29. 41.
52. 31. 39. 46. 51. 131. 36. 61. 53. 33. 48. 40. 58. 49.
76. 50. 119. 55. 91. 59. 106. 56. 65. 54. 47. 63. 64. 67.
75. 102. 74. 70. 60.]
Unique values on df2 after loop (should be just 6, 8 or 10):
[ 0. 1. 10. 15. 6. 2. 4. 3. 44. 8. 9. 7. 22. 5.
11. 19. 12. 13. 21. 20. 14. 23. 16. 18. 24. 17. 35. 32.
25. 30. 28. 57. 45. 27. 42. 38. 43. 37. 34. 26. 29. 41.
52. 31. 39. 46. 51. 131. 36. 61. 53. 33. 48. 40. 58. 49.
76. 50. 119. 55. 91. 59. 106. 56. 65. 54. 47. 63. 64. 67.
75. 102. 74. 70. 60.]
Thanks in advance!

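Your code worked for me. However, the "pandas" way to do this is to use pd.cut. Note that the bins need right=False and a lower edge below zero to reproduce your thresholds (>= 100 gives 10, >= 10 gives 8, everything else gives 6):
pd.cut(df1['flowers_num'], [-np.inf, 10, 100, np.inf], right=False, labels=[6, 8, 10])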

It would be much faster if you use apply on the column rather than iterrows.
Create a function to change the values:
def change_num(x):
    if x >= 100:
        return 10
    elif x >= 10:
        return 8
    else:
        return 6
Dummy DataFrame:
df_ex = pd.DataFrame({'flowers_num': np.random.randint(1,1000,20)})
Using apply:
df_ex["flowers_num"]=df_ex["flowers_num"].apply(change_num)
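For reference, the pandas documentation for iterrows warns that, depending on the dtypes, each row may be a copy rather than a view, so writing to row inside the loop is not guaranteed to change df2. A minimal sketch that avoids the loop entirely, assuming NumPy is available and using a small made-up df1 (the sample values are illustrative, not the asker's data):

```python
import numpy as np
import pandas as pd

# Hypothetical sample data standing in for the asker's df1.
df1 = pd.DataFrame({"flowers_num": [0.0, 9.0, 10.0, 99.0, 100.0, 131.0]})

df2 = df1.copy()
n = df2["flowers_num"]

# np.select takes the first condition that is True for each element,
# so the stricter test (>= 100) has to be listed first.
df2["flowers_num"] = np.select([n >= 100, n >= 10], [10, 8], default=6)

print(df2["flowers_num"].tolist())  # [6, 6, 8, 8, 10, 10]
```

Because the whole column is assigned at once on df2, df1 keeps its original values.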

Related

Brute force to generate possible permutations

I have 4 point groups, each of which contains 5 different 3D positions. My goal is to brute-force all possible four permutations for each group without repeating the order and print them out as a (5x3) array. E.g. for input data:
1,2,3
4,5,6
7,8,9
10,11,12
13,14,15
16,17,18
19,20,21
22,23,24
25,26,27
28,29,30
31,32,33
34,35,36
37,38,39
40,41,42
43,44,45
46,47,48
49,50,51
52,53,54
55,56, 57
58,59,60
I read the file:
def read_file(name):
    with open(name, 'r') as f:
        data = []
        for line in f:
            l = line.strip()
            cols = [float(i) for i in line.split(',')]
            data.append(cols)
    return np.array(data)
and reshape it to have 4x(5x3) arrays to be brute-forced:
def main():
    filePath = 'C:/Users/retw/input.txt'
    data = read_file(filePath)
    print('data:', data, type(data), data.shape)
    reshapedData = data.reshape(4, 5, 3)
    print('reshapedData :', reshapedData, type(reshapedData), reshapedData.shape)
The current output looks like:
reshapedData : [[[ 1. 2. 3.]
[ 4. 5. 6.]
[ 7. 8. 9.]
[10. 11. 12.]
[13. 14. 15.]]
[[16. 17. 18.]
[19. 20. 21.]
[22. 23. 24.]
[25. 26. 27.]
[28. 29. 30.]]
[[31. 32. 33.]
[34. 35. 36.]
[37. 38. 39.]
[40. 41. 42.]
[43. 44. 45.]]
[[46. 47. 48.]
[49. 50. 51.]
[52. 53. 54.]
[55. 56. 57.]
[58. 59. 60.]]] <class 'numpy.ndarray'> (4, 5, 3)
After brute-forcing, the permutations, as an array or list, should look like:
[[1,2,3]
[16,17,18]
[31,32,33]
[46,47,48]]
[[1,2,3]
[19,20,21]
[31,32,33]
[46,47,48]]
[[1,2,3]
[22,23,24]
[31,32,33]
[46,47,48]]
etc,
until
[[13,14,15]
[28,29,30]
[43,44,45]
[58,59,60]]
Edit
For example, given two 2x3 arrays as input:
[[[1,2,3]
[4,5,6]]
[[7,8,9]
[10,11,12]]]
The output after brute force should be:
[[1,2,3]
[7,8,9]]
[[1,2,3]
[10,11,12]]
[[4,5,6]
[7,8,9]]
[[4,5,6]
[10,11,12]]
Here is a solution using numpy and a generator that appears to work, generates the correct number of combos (625), and sequences them as you are looking for...
import numpy as np

f_in = 'data.csv'
data = []
with open(f_in, 'r') as f:
    for line in f:
        l = line.strip()
        cols = [float(i) for i in line.split(',')]
        data.append(cols)
data = np.array(data).reshape((4,5,3))
#print(data)

def result_gen(data):
    odometer = [0, 0, 0, 0]
    roll_seq = [1, 2, 3, 0]  # the sequence of positions by which to roll the odometer
    expired = False
    while not expired:
        res = data[[0, 1, 2, 3], [odometer]]
        for i in roll_seq:
            if odometer[i] < 4:
                odometer[i] += 1
                break
            else:
                if i == 0:  # we have exhausted all combos
                    expired = True
                odometer[i] = 0
        yield res

my_gen = result_gen(data)
a = list(my_gen)
print(len(a))
for t in a[:6]:
    print(t)
Yields:
625
[[[ 1. 2. 3.]
[16. 17. 18.]
[31. 32. 33.]
[46. 47. 48.]]]
[[[ 1. 2. 3.]
[19. 20. 21.]
[31. 32. 33.]
[46. 47. 48.]]]
[[[ 1. 2. 3.]
[22. 23. 24.]
[31. 32. 33.]
[46. 47. 48.]]]
[[[ 1. 2. 3.]
[25. 26. 27.]
[31. 32. 33.]
[46. 47. 48.]]]
[[[ 1. 2. 3.]
[28. 29. 30.]
[31. 32. 33.]
[46. 47. 48.]]]
[[[ 1. 2. 3.]
[16. 17. 18.]
[34. 35. 36.]
[46. 47. 48.]]]
[Finished in 0.2s]
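The same 625 combinations can also be sketched with itertools.product instead of a hand-rolled odometer. This is an alternative, not the code from the answer above; the name all_combos is made up for illustration, and the data is built with np.arange rather than read from a file. The loop variables are unpacked in the order i0, i3, i2, i1 so that the second group varies fastest, matching the sequence shown above.

```python
import itertools
import numpy as np

# Stand-in for the file contents: 4 groups of 5 points with 3 coordinates each.
data = np.arange(1, 61, dtype=float).reshape(4, 5, 3)

def all_combos(groups):
    """Yield one (4, 3) array per way of picking one point from each of 4 groups."""
    n_points = groups.shape[1]
    # product() varies its last variable fastest, so i1 (group 1) rolls first,
    # then i2, then i3, and finally i0.
    for i0, i3, i2, i1 in itertools.product(range(n_points), repeat=4):
        yield groups[[0, 1, 2, 3], [i0, i1, i2, i3]]

combos = np.array(list(all_combos(data)))
print(combos.shape)  # (625, 4, 3)
```

Each element of combos is a plain (4, 3) array, without the extra leading axis that appears in the output above.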
Looks like you want to create something like this.
import numpy as np
a = np.arange(1,61).reshape(4,5,3)
print (a)
b = np.zeros((20,4,3))
for k in range(4):
    for i in range(4):
        for j in range(5):
            if i == k:
                b[5*k + j][i] = a[i][j]
            else:
                b[5*k + j][i] = a[i][0]
print (b)
The output of this will be:
[[[ 1. 2. 3.]
[16. 17. 18.]
[31. 32. 33.]
[46. 47. 48.]]
[[ 4. 5. 6.]
[16. 17. 18.]
[31. 32. 33.]
[46. 47. 48.]]
[[ 7. 8. 9.]
[16. 17. 18.]
[31. 32. 33.]
[46. 47. 48.]]
[[10. 11. 12.]
[16. 17. 18.]
[31. 32. 33.]
[46. 47. 48.]]
[[13. 14. 15.]
[16. 17. 18.]
[31. 32. 33.]
[46. 47. 48.]]
[[ 1. 2. 3.]
[16. 17. 18.]
[31. 32. 33.]
[46. 47. 48.]]
[[ 1. 2. 3.]
[19. 20. 21.]
[31. 32. 33.]
[46. 47. 48.]]
[[ 1. 2. 3.]
[22. 23. 24.]
[31. 32. 33.]
[46. 47. 48.]]
[[ 1. 2. 3.]
[25. 26. 27.]
[31. 32. 33.]
[46. 47. 48.]]
[[ 1. 2. 3.]
[28. 29. 30.]
[31. 32. 33.]
[46. 47. 48.]]
[[ 1. 2. 3.]
[16. 17. 18.]
[31. 32. 33.]
[46. 47. 48.]]
[[ 1. 2. 3.]
[16. 17. 18.]
[34. 35. 36.]
[46. 47. 48.]]
[[ 1. 2. 3.]
[16. 17. 18.]
[37. 38. 39.]
[46. 47. 48.]]
[[ 1. 2. 3.]
[16. 17. 18.]
[40. 41. 42.]
[46. 47. 48.]]
[[ 1. 2. 3.]
[16. 17. 18.]
[43. 44. 45.]
[46. 47. 48.]]
[[ 1. 2. 3.]
[16. 17. 18.]
[31. 32. 33.]
[46. 47. 48.]]
[[ 1. 2. 3.]
[16. 17. 18.]
[31. 32. 33.]
[49. 50. 51.]]
[[ 1. 2. 3.]
[16. 17. 18.]
[31. 32. 33.]
[52. 53. 54.]]
[[ 1. 2. 3.]
[16. 17. 18.]
[31. 32. 33.]
[55. 56. 57.]]
[[ 1. 2. 3.]
[16. 17. 18.]
[31. 32. 33.]
[58. 59. 60.]]]
Looping this way, I get a total of 20 arrays of shape 4 x 3.

Numpy/Python: Efficient matrix as multiplication of cartesian product of input matrix

Problem:
The input is an (i,j)-matrix M. The desired output is an (i^n,j^n) matrix K, where n is the number of factors in each product. The verbose way to get the desired output is as follows:
- Generate all length-n tuples of row indices I (i**n tuples in total)
- Generate all length-n tuples of column indices J (j**n tuples in total)
- K[a,b] = M[I_a[0],J_b[0]] * ... * M[I_a[n-1],J_b[n-1]], where I_a is the a-th row tuple and J_b is the b-th column tuple
The straightforward way I've done this is by generating a list of labels of all n-tuples (with repetition) of numbers in range(np.shape(m)[0]) and range(np.shape(m)[1]) for rows and columns, respectively. Afterwards you can multiply them as in the last bullet point above. This, however, is not practical for large input matrices, so I'm looking for ways to optimize the above. Thank you in advance.
Example:
Input
np.array([[1,2,3],[4,5,6]])
Output for n = 3
[[ 1. 2. 3. 2. 4. 6. 3. 6. 9. 2. 4. 6.
4. 8. 12. 6. 12. 18. 3. 6. 9. 6. 12. 18.
9. 18. 27.]
[ 4. 5. 6. 8. 10. 12. 12. 15. 18. 8. 10. 12.
16. 20. 24. 24. 30. 36. 12. 15. 18. 24. 30. 36.
36. 45. 54.]
[ 4. 8. 12. 5. 10. 15. 6. 12. 18. 8. 16. 24.
10. 20. 30. 12. 24. 36. 12. 24. 36. 15. 30. 45.
18. 36. 54.]
[ 16. 20. 24. 20. 25. 30. 24. 30. 36. 32. 40. 48.
40. 50. 60. 48. 60. 72. 48. 60. 72. 60. 75. 90.
72. 90. 108.]
[ 4. 8. 12. 8. 16. 24. 12. 24. 36. 5. 10. 15.
10. 20. 30. 15. 30. 45. 6. 12. 18. 12. 24. 36.
18. 36. 54.]
[ 16. 20. 24. 32. 40. 48. 48. 60. 72. 20. 25. 30.
40. 50. 60. 60. 75. 90. 24. 30. 36. 48. 60. 72.
72. 90. 108.]
[ 16. 32. 48. 20. 40. 60. 24. 48. 72. 20. 40. 60.
25. 50. 75. 30. 60. 90. 24. 48. 72. 30. 60. 90.
36. 72. 108.]
[ 64. 80. 96. 80. 100. 120. 96. 120. 144. 80. 100. 120.
100. 125. 150. 120. 150. 180. 96. 120. 144. 120. 150. 180.
144. 180. 216.]]
Partial solution:
The best I've found is a function to create the cartesian product of matrices proposed here: https://stackoverflow.com/a/1235363/4003747
The problem is that the output is not a matrix but an array of arrays. Multiplying the element of each array gives the values I'm after, but in an unordered fashion. I've tried for a while but I have no idea how to sensibly reorder them.
Inefficient solution for n =3:
import numpy as np
import itertools
m=np.array([[1,2,3],[4,5,6]])
def f(m):
    labels_i = [list(p) for p in itertools.product(range(np.shape(m)[0]),repeat=3)]
    labels_j = [list(p) for p in itertools.product(range(np.shape(m)[1]),repeat=3)]
    out = np.zeros([len(labels_i),len(labels_j)])
    for i in range(len(labels_i)):
        for j in range(len(labels_j)):
            out[i,j] = m[labels_i[i][0],labels_j[j][0]] * m[labels_i[i][1],labels_j[j][1]] * m[labels_i[i][2],labels_j[j][2]]
    return out
Here's a vectorized approach using a combination of broadcasting and linear indexing -
from itertools import product
# Get input array's shape
r,c = A.shape
# Setup arrays corresponding to labels i and j
arr_i = np.array(list(product(range(r), repeat=n)))
arr_j = np.array(list(product(range(c), repeat=n)))
# Use linear indexing with ".ravel()" to extract elements.
# Perform elementwise product along the rows for the final output
out = A.ravel()[(arr_i*c)[:,None,:] + arr_j].prod(2)
Runtime test and output verification -
In [167]: # Inputs
     ...: n = 4
     ...: A = np.array([[1,2,3],[4,5,6]])
     ...:
     ...: def f(m):
     ...:     labels_i = [list(p) for p in product(range(np.shape(m)[0]),repeat=n)]
     ...:     labels_j = [list(p) for p in product(range(np.shape(m)[1]),repeat=n)]
     ...:
     ...:     out = np.zeros([len(labels_i),len(labels_j)])
     ...:     for i in range(len(labels_i)):
     ...:         for j in range(len(labels_j)):
     ...:             out[i,j] = m[labels_i[i][0],labels_j[j][0]] \
     ...:                      * m[labels_i[i][1],labels_j[j][1]] \
     ...:                      * m[labels_i[i][2],labels_j[j][2]] \
     ...:                      * m[labels_i[i][3],labels_j[j][3]]
     ...:     return out
     ...:
     ...: def f_vectorized(A,n):
     ...:     r,c = A.shape
     ...:     arr_i = np.array(list(product(range(r), repeat=n)))
     ...:     arr_j = np.array(list(product(range(c), repeat=n)))
     ...:     return A.ravel()[(arr_i*c)[:,None,:] + arr_j].prod(2)
     ...:
In [168]: np.allclose(f_vectorized(A,n),f(A))
Out[168]: True
In [169]: %timeit f(A)
100 loops, best of 3: 2.37 ms per loop
In [170]: %timeit f_vectorized(A,n)
1000 loops, best of 3: 202 µs per loop
This should work for any n:
import numpy as np
import itertools
m=np.array([[1,2,3],[4,5,6]])
n=3 # change your n here
def f(m):
    labels_i = [list(p) for p in itertools.product(range(np.shape(m)[0]),repeat=n)]
    labels_j = [list(p) for p in itertools.product(range(np.shape(m)[1]),repeat=n)]
    out = np.zeros([len(labels_i),len(labels_j)])
    for i in range(len(labels_i)):
        for j in range(len(labels_j)):
            # multiply the n selected entries of m together
            out[i,j] = np.prod([m[labels_i[i][k],labels_j[j][k]] for k in range(n)])
    return out
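One observation that does not appear in the answers above: as far as I can tell, the matrix K defined in the question is the n-fold Kronecker product of M with itself, because the entry for row tuple (i0, ..., i_{n-1}) and column tuple (j0, ..., j_{n-1}) is the product of the entries M[i_k, j_k]. Under that assumption, a short sketch with np.kron (the helper name kron_power is hypothetical):

```python
from functools import reduce
import numpy as np

def kron_power(m, n):
    """Kronecker product of m with itself, n factors in total (n >= 1)."""
    return reduce(np.kron, [m] * n)

M = np.array([[1, 2, 3], [4, 5, 6]])
K = kron_power(M, 3)

print(K.shape)    # (8, 27)
print(K[0, :9])   # [1 2 3 2 4 6 3 6 9]
```

The first nine entries printed here agree with the start of the expected output in the question, and the row and column ordering follows the same "last index varies fastest" convention as itertools.product.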

Numpy - easier way to change the value of one column of an array only?

I'd like to make a 2-D array, with one column staying the same, and the other varying with linspace.
This works, but seems a little bulky:
np.hstack((np.tile(45,(21,1)), np.array([np.linspace(55,65,21)]).T))
[[ 45. 55. ]
[ 45. 55.5]
[ 45. 56. ]
[ 45. 56.5]
[ 45. 57. ]
[ 45. 57.5]
[ 45. 58. ]
[ 45. 58.5]
[ 45. 59. ]
[ 45. 59.5]
[ 45. 60. ]
[ 45. 60.5]
[ 45. 61. ]
[ 45. 61.5]
[ 45. 62. ]
[ 45. 62.5]
[ 45. 63. ]
[ 45. 63.5]
[ 45. 64. ]
[ 45. 64.5]
[ 45. 65. ]]
Is there a better way to do this?
This seems cleaner, but otherwise I don't see much advantage:
x = np.empty((21, 2))
x[:, 0] = 45
x[:, 1] = np.linspace(55, 65, x.shape[0])
Not a great deal better, but I would do
>>> a = np.full((21, 2), 45.0)
>>> a[..., 1] = np.linspace(55, 65, a.shape[0])
>>> a
array([[ 45. , 55. ],
[ 45. , 55.5],
[ 45. , 56. ],
[ 45. , 56.5],
[ 45. , 57. ],
[ 45. , 57.5],
[ 45. , 58. ],
[ 45. , 58.5],
[ 45. , 59. ],
[ 45. , 59.5],
[ 45. , 60. ],
[ 45. , 60.5],
[ 45. , 61. ],
[ 45. , 61.5],
[ 45. , 62. ],
[ 45. , 62.5],
[ 45. , 63. ],
[ 45. , 63.5],
[ 45. , 64. ],
[ 45. , 64.5],
[ 45. , 65. ]])
>>>
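Another option, offered only as a sketch alongside the answers above, is np.column_stack, which builds the 2-D array directly from two 1-D columns without the tile and transpose steps. The variable names are illustrative:

```python
import numpy as np

n_rows = 21
fixed = np.full(n_rows, 45.0)            # constant first column
varying = np.linspace(55, 65, n_rows)    # second column, 55 to 65 in steps of 0.5

# column_stack treats each 1-D input as one column of the result.
a = np.column_stack((fixed, varying))

print(a.shape)  # (21, 2)
```

Whether this reads better than filling a preallocated array is mostly a matter of taste.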

Extract non-main diagonal from scipy sparse matrix?

Say that I have a sparse matrix in scipy.sparse format. How can I extract a diagonal other than the main diagonal? For a numpy array, you can use numpy.diag. Is there a scipy sparse equivalent?
For example:
import numpy as np
from scipy import sparse
A = sparse.diags(np.ones(5), 1)
How would I get back the vector of ones without converting to a numpy array?
When the sparse array is in dia format, the data along the diagonals is recorded in the offsets and data attributes:
import scipy.sparse as sparse
import numpy as np

def make_sparse_array():
    # Build a fully populated nrow x ncol matrix and convert it to DIA format
    A = np.arange(ncol*nrow).reshape(nrow, ncol)
    row, col = zip(*np.ndindex(nrow, ncol))
    val = A.ravel()
    A = sparse.coo_matrix(
        (val, (row, col)), shape=(nrow, ncol), dtype='float')
    A = A.todia()
    # A = sparse.diags(np.ones(5), 1)
    # A = sparse.diags([np.ones(4),np.ones(3)*2,], [2,3])
    print(A.toarray())
    return A

nrow, ncol = 10, 5
A = make_sparse_array()
# Trim the padding that DIA storage adds to each stored diagonal
diags = {offset:(diag[offset:nrow+offset] if 0<=offset<=ncol else
                 diag if offset+nrow-ncol>=0 else
                 diag[:offset+nrow-ncol])
         for offset, diag in zip(A.offsets, A.data)}
# Python 3: dict.items() replaces the Python 2 dict.iteritems()
for offset, diag in sorted(diags.items()):
    print('{o}: {d}'.format(o=offset, d=diag))
Thus for the array
[[ 0. 1. 2. 3. 4.]
[ 5. 6. 7. 8. 9.]
[ 10. 11. 12. 13. 14.]
[ 15. 16. 17. 18. 19.]
[ 20. 21. 22. 23. 24.]
[ 25. 26. 27. 28. 29.]
[ 30. 31. 32. 33. 34.]
[ 35. 36. 37. 38. 39.]
[ 40. 41. 42. 43. 44.]
[ 45. 46. 47. 48. 49.]]
the code above yields
-9: [ 45.]
-8: [ 40. 46.]
-7: [ 35. 41. 47.]
-6: [ 30. 36. 42. 48.]
-5: [ 25. 31. 37. 43. 49.]
-4: [ 20. 26. 32. 38. 44.]
-3: [ 15. 21. 27. 33. 39.]
-2: [ 10. 16. 22. 28. 34.]
-1: [ 5. 11. 17. 23. 29.]
0: [ 0. 6. 12. 18. 24.]
1: [ 1. 7. 13. 19.]
2: [ 2. 8. 14.]
3: [ 3. 9.]
4: [ 4.]
The output above is printing the offset followed by the diagonal at that offset.
The code above should work for any sparse array. I used a fully populated sparse array only to make it easier to check that the output is correct.
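Newer SciPy releases also seem to expose this directly: sparse matrices have a diagonal method that takes an offset k. I believe the k argument was added around SciPy 1.0, so treat the exact version as an assumption. A minimal sketch using the matrix from the question:

```python
import numpy as np
from scipy import sparse

# Ones on the first superdiagonal; the resulting matrix is 6x6.
A = sparse.diags(np.ones(5), 1)

# k > 0 selects diagonals above the main one, k < 0 those below it.
d = A.diagonal(k=1)

print(d)  # [1. 1. 1. 1. 1.]
```

The result is a dense 1-D NumPy array holding just that diagonal; the matrix itself is never converted to a dense array.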

Applying a function to windows in an array (like a filter)

Suppose I have an image loaded into Python as a Numpy array.
I would like to run a function over say a 5x5 window, like a filter kernel but it's not really a standard convolution. What is the most efficient/pythonic way to do this?
A specific example - I have an image of points with associated 3D coordinates. I'd like to calculate the average normal vector for a 5x5 window across the image. I imagine something like:
for each pixel in image:
    form an nxn window and extract a list of points
    fit a plane to the points
    calculate the normal
    associate this value with pixel (2,2) in the window
Iterating over arrays in Numpy is usually a Smell so I was hoping there's a better way to do it.
If you have scipy, you could use scipy.ndimage.generic_filter.
For example,
import numpy as np
import scipy.ndimage as ndimage
img = np.arange(100, dtype='float').reshape(10,10)
print(img)
# [[ 0. 1. 2. 3. 4. 5. 6. 7. 8. 9.]
# [ 10. 11. 12. 13. 14. 15. 16. 17. 18. 19.]
# [ 20. 21. 22. 23. 24. 25. 26. 27. 28. 29.]
# [ 30. 31. 32. 33. 34. 35. 36. 37. 38. 39.]
# [ 40. 41. 42. 43. 44. 45. 46. 47. 48. 49.]
# [ 50. 51. 52. 53. 54. 55. 56. 57. 58. 59.]
# [ 60. 61. 62. 63. 64. 65. 66. 67. 68. 69.]
# [ 70. 71. 72. 73. 74. 75. 76. 77. 78. 79.]
# [ 80. 81. 82. 83. 84. 85. 86. 87. 88. 89.]
# [ 90. 91. 92. 93. 94. 95. 96. 97. 98. 99.]]
def test(x):
    return x.mean()
result = ndimage.generic_filter(img, test, size=(5,5))
print(result)
prints
[[ 8.8 9.2 10. 11. 12. 13. 14. 15. 15.8 16.2]
[ 12.8 13.2 14. 15. 16. 17. 18. 19. 19.8 20.2]
[ 20.8 21.2 22. 23. 24. 25. 26. 27. 27.8 28.2]
[ 30.8 31.2 32. 33. 34. 35. 36. 37. 37.8 38.2]
[ 40.8 41.2 42. 43. 44. 45. 46. 47. 47.8 48.2]
[ 50.8 51.2 52. 53. 54. 55. 56. 57. 57.8 58.2]
[ 60.8 61.2 62. 63. 64. 65. 66. 67. 67.8 68.2]
[ 70.8 71.2 72. 73. 74. 75. 76. 77. 77.8 78.2]
[ 78.8 79.2 80. 81. 82. 83. 84. 85. 85.8 86.2]
[ 82.8 83.2 84. 85. 86. 87. 88. 89. 89.8 90.2]]
Be sure to check out the mode parameter to control what values should be passed to the function when the window falls off the edge of the boundary.
Note that this is mainly a convenience function, for organizing the computation. You are still calling a Python function once for each window. That could be inherently slow.
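If the per-window function can be written with NumPy reductions, one alternative (not from the answer above) is numpy.lib.stride_tricks.sliding_window_view, available in NumPy 1.20 and later. The sketch below assumes a mean over 5x5 windows and computes only the windows that fit entirely inside the image, so the result is 6x6 rather than 10x10:

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

img = np.arange(100, dtype=float).reshape(10, 10)

# A read-only view with one 5x5 window per valid top-left corner:
# shape (6, 6, 5, 5), no data is copied.
windows = sliding_window_view(img, (5, 5))

# Reduce over the two window axes to get one value per window.
means = windows.mean(axis=(-2, -1))

print(means.shape)   # (6, 6)
print(means[0, 0])   # 22.0
```

Here means[0, 0] corresponds to the window centred on pixel (2, 2), which matches the value 22 at that position in the generic_filter output. For something like plane fitting, the same windows array could be reshaped and passed to a batched routine, but that part depends on the fitting method used.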
