Numpy Dynamic Indexing With Both Slicing and Advanced Indexing

Looking to index an (N, M) array by taking certain rows (indicated by an array of row index values) and taking a slice from the columns with a fixed start and varying stop indices.
population = np.full((N, M), 0)  # population of N guys with genome length M
# Choose some guys and change parts of their genome (cols)
rows_indices = [0, 1, 5, 6]  # four guys 0, 1, 5, 6 will be changed
# all selected guys will have a start of 10
# the ends will be different for each guy
slice_lengths = np.random.geometric(p=0.8, size=4)  # vector of 4
What I imagine is something like:
population[0, 10: 10 + slice_lengths[0]] = 100
population[1, 10: 10 + slice_lengths[1]] = 100
population[5, 10: 10 + slice_lengths[2]] = 100
population[6, 10: 10 + slice_lengths[3]] = 100
Except vectorized, without hardcoding each value:
# pseudo-code
population[rows_indices, start: start + slice_lengths]
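One way to vectorize this (a sketch, not part of the original post; the sizes N and M are made up for illustration) is to build a boolean column mask by broadcasting the varying stops against a column-index vector, then scatter the assignment with np.nonzero:

import numpy as np

N, M = 8, 20
population = np.full((N, M), 0)
rows_indices = np.array([0, 1, 5, 6])
start = 10
slice_lengths = np.random.geometric(p=0.8, size=4)

cols = np.arange(M)                      # shape (M,)
stops = start + slice_lengths[:, None]   # shape (4, 1), one stop per selected row
mask = (cols >= start) & (cols < stops)  # shape (4, M) by broadcasting
r, c = np.nonzero(mask)                  # r indexes into rows_indices, c is the column
population[rows_indices[r], c] = 100     # one vectorized fancy-indexing assignment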

Related

Moving average in Python array

I have an array 'aN' with a shape equal to (1000, 151). I need to calculate the average of every 10 rows, so I implemented this:
arr = aN[:]
window_size = 10
i = 0
moving_averages = []
while i < len(arr) - window_size + 1:
    window_average = round(np.sum(arr[i:i+window_size]) / window_size, 2)
    moving_averages.append(window_average)
    i += 10
The point is that my output is a list of 100 values, but I need an array with the same number of columns as the original array (151).
Any idea on how to get this outcome?
TIA!!
If you convert it to a pandas dataframe, you can use the rolling() function of pandas together with the mean() function. It should be able to accomplish what you need.
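For the non-overlapping windows the question describes (a step of 10, not a sliding window), here is a minimal sketch of both routes; it is not from the original answer, and the random aN stands in for the real data:

import numpy as np
import pandas as pd

aN = np.random.rand(1000, 151)  # stand-in for the real array

# pure numpy: average each block of 10 rows, keeping all 151 columns
block_means = aN.reshape(-1, 10, aN.shape[1]).mean(axis=1)  # shape (100, 151)

# pandas, as the answer suggests: rolling mean, then keep every 10th row
df = pd.DataFrame(aN)
rolled = df.rolling(10).mean().iloc[9::10]  # also shape (100, 151)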

Recursive python function to make two arrays equal?

I'm attempting to write python code to solve a transportation problem using the Least Cost method. I have a 2D numpy array that I am iterating through to find the minimum, perform calculations with that minimum, and then replace it with a 0, so that the loop stops when values matches constantarray, an array of the same shape containing only 0s. The values array contains distances from points in supply to points in demand. I'm currently using a while loop to do so, but the loop isn't running because values.all() != constantarray.all() evaluates to False.
I also need the process to repeat once the arrays have been edited to move onto the next lowest number in values.
constantarray = np.zeros((len(supply), len(demand)))  # create array of 0s
sandmoved = np.zeros((len(supply), len(demand)))  # used to store information needed for later
totalcost = 0
while values.all() != constantarray.all():  # iterate until `values` only contains 0s
    m = np.argmin(values, axis=0)[0]  # find coordinates of minimum value
    n = np.argmin(values, axis=1)[0]
    if supply[m] > abs(demand[n]):  # all demand numbers are negative
        supply[m] += demand[n]  # subtract demand from supply
        totalcost += abs(demand[n]) * values[m, n]
        sandmoved[m, n] = demand[n]  # add amount of 'sand' moved to an empty array
        values[m, 0:-1] = 0  # replace entire m row with 0s since demand has been filled
        demand[n] = 0  # replace demand value with 0
    elif supply[m] < abs(demand[n]):
        demand[n] += supply[m]  # combine positive supply with negative demand
        sandmoved[m, n] = supply[m]
        totalcost += supply[m] * values[m, n]
        values[:-1, n] = 0  # replace entire column with 0s since supply has been depleted
        supply[m] = 0
There is an additional if statement for when supply[m] == demand[n], but I feel that isn't necessary. I've already tried using nested for loops, and so many different syntax combinations for a while loop, but I just can't get it to work the way I want it to. Even when running the code block over and over by itself, m and n stay the same, and the function removes one value from values but doesn't add it to sandmoved. Any ideas are greatly appreciated!!
Well, here is an example from an old implementation of mine:
import numpy as np

values = np.array([[3, 1, 7, 4],
                   [2, 6, 5, 9],
                   [8, 3, 3, 2]])
demand = np.array([250, 350, 400, 200])
supply = np.array([300, 400, 500])
totCost = 0
MAX_VAL = 2 * np.max(values)  # choose MAX_VAL higher than all values
while np.any(values.ravel() < MAX_VAL):
    # find row and col indices of min
    m, n = np.unravel_index(np.argmin(values), values.shape)
    if supply[m] < demand[n]:
        totCost += supply[m] * values[m, n]
        demand[n] -= supply[m]
        values[m, :] = MAX_VAL  # set all row to MAX_VAL
    else:
        totCost += demand[n] * values[m, n]
        supply[m] -= demand[n]
        values[:, n] = MAX_VAL  # set all col to MAX_VAL
Solution:
print(totCost)
# 2850
Basically, start by choosing a MAX_VAL higher than all given values and a totCost = 0. Then follow the standard steps of the algorithm. Find the row and column indices of the smallest cell, say m, n. Select the m-th supply or the n-th demand, whichever is smaller, then add what you selected multiplied by values[m,n] to totCost, and set all entries of the selected row or column to MAX_VAL to avoid it in the next iterations. Update the greater value by subtracting the selected one, and repeat until all values are equal to MAX_VAL.
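As a side note on why the original while condition misbehaves (an illustration of my own, not from the answer): .all() reduces each array to a single boolean, so the comparison checks one bool against another rather than comparing the arrays element-wise:

import numpy as np

values = np.array([[1, 0], [0, 0]])
constantarray = np.zeros((2, 2))
print(values.all())                               # False: not every entry is nonzero
print(constantarray.all())                        # False
print(values.all() != constantarray.all())        # False, so the loop body never runs
print(not np.array_equal(values, constantarray))  # True: the intended condition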

How to group and sum certain columns of an array based on their classification (eg to group cities by country)

The issue
I have arrays which track certain items over time. The items belong to certain categories. I want to calculate the sum by time and category, e.g. to go from a table by time and city to one by time and country.
I have found a couple of ways, but they seem clunky - there must be a better way! Surely I'm not the first one with this issue? Maybe using np.where?
More specifically:
I have a number of numpy arrays of shape (p x i), where p is the period and i is the item I am tracking over time.
I then have a separate array of shape i which classifies the items into categories (red, green, yellow, etc.).
What I want to do is calculate an array of shape (p x number of unique categories) which sums the values of the big array by time and category.
I'd need the code to be as efficient as possible as I need to do this multiple times on arrays which can be up to 400 x 1,000,000
What I have tried:
This question covers a number of ways to groupby without resorting to pandas. I like the scipy.ndimage approach, but AFAIK it works on one dimension only.
I have tried a solution with pandas:
I create a dataframe of shape periods x items
I unpivot it with pd.melt(), join the categories and do a crosstab period/categories
I have also tried a set of loops, optimised with numba:
A first loop creates an array which converts the categories into integers, i.e. the first category in alphabetical order becomes 0, the 2nd becomes 1, etc.
A second loop iterates through all the items, then for each item it iterates through all the periods and sums by category
My findings
for small arrays, pandas is faster
for large arrays, numba is better, but it's better to set parallel = False in the numba decorator
for very large arrays, numba with parallel = True shines
parallel = True makes use of numba's parallelisation by using numba.prange on the outer loops.
PS I am aware of the pitfalls of premature optimisation etc etc - I am only looking into this because a significant amount of time is spent doing precisely this
The code
import numpy as np
import pandas as pd
import time
import numba

periods = 300
n = int(2000)
categories = np.tile(['red', 'green', 'yellow', 'brown'], n)
my_array = np.random.randint(low=0, high=10, size=(periods, len(categories)))
# my_array will have shape (periods x (n * number of categories))

#---- pandas
start = time.time()
df_categories = pd.DataFrame(data=categories).reset_index().rename(columns={'index': 'item', 0: 'category'})
df = pd.DataFrame(data=my_array)
unpiv = pd.melt(df.reset_index(), id_vars='index', var_name='item', value_name='value').rename(columns={'index': 'time'})
unpiv = pd.merge(unpiv, df_categories, on='item')
crosstab = pd.crosstab(unpiv['time'], unpiv['category'], values=unpiv['value'], aggfunc='sum')
print("pandas crosstab in:")
print(time.time() - start)
# yep, I know that timeit.timer would have been better, but I was in a hurry :)
print("")

#---- numba
@numba.jit(nopython=True, parallel=True, nogil=True)
def numba_classify(x, categories):
    cat_uniq = np.unique(categories)
    num_categories = len(cat_uniq)
    num_items = x.shape[1]
    periods = x.shape[0]
    categories_converted = np.zeros(len(categories), dtype=np.int32)
    out = np.zeros((periods, num_categories))
    # before running the actual classification, I must convert the categories, which can be
    # strings, to the corresponding number in cat_uniq, e.g. if brown is the first category
    # by alphabetical sorting, then brown --> 0, etc
    for i in numba.prange(num_items):
        for c in range(num_categories):
            if categories[i] == cat_uniq[c]:
                categories_converted[i] = c
    for i in numba.prange(num_items):
        for p in range(periods):
            out[p, categories_converted[i]] += x[p, i]
    return out

start = time.time()
numba_out = numba_classify(my_array, categories)
print("numba done in:")
print(time.time() - start)
You can use df.groupby(categories, axis=1).sum() for a substantial speedup.
import numpy as np
import pandas as pd
import time

def make_data(periods, n):
    categories = np.tile(['red', 'green', 'yellow', 'brown'], n)
    my_array = np.random.randint(low=0, high=10, size=(periods, len(categories)))
    return categories, pd.DataFrame(my_array)

for n in (200, 2000, 20000):
    categories, df = make_data(300, n)
    true_n = n * 4
    start = time.time()
    tabulation = df.groupby(categories, axis=1).sum()
    elapsed = time.time() - start
    print(f"300 x {true_n:5}: {elapsed:.3f} seconds")

# prints:
# 300 x   800: 0.005 seconds
# 300 x  8000: 0.021 seconds
# 300 x 80000: 0.673 seconds
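Since the question asks about avoiding pandas entirely, here is a pure-numpy sketch (my own, not from the original answers), using the same setup as the question: encode the string categories as integer codes, then scatter-add columns by code.

import numpy as np

categories = np.tile(['red', 'green', 'yellow', 'brown'], 2000)
my_array = np.random.randint(0, 10, size=(300, len(categories)))

uniq, codes = np.unique(categories, return_inverse=True)  # string -> integer code
out = np.zeros((len(uniq), my_array.shape[0]))            # (num_categories, periods)
np.add.at(out, codes, my_array.T)  # scatter-add each item row into its category row
result = out.T                     # (periods, num_categories); columns ordered as uniq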

How do you create multiple of the same item and add them to an array?

I'm trying to create an array of shape (1, inter) (i.e. 1 row, inter columns), where inter is a user input;
If you look at the code below,
l_o_s, Inter, n_o_s, L, d_o_s are all from user inputs
The n_o_s represents the number of sections across the total length of the shaft that have lengths corresponding to the values in l_o_s and diameters corresponding to the values in d_o_s.
So
Section 1 has a length of 1.5 and diameter 3.75
Section 2 = length of 4.5-1.5 = 3 and diameter 3.5
Section 3 = length of 7.5-4.5 = 3 and diameter 3.75
and so forth...
Here's an image of the shaft arrangement (omitted): a shaft of length 36 with 13 sections that have different size diameters.
Inter is the number of intervals I require in the analysis; in this case Inter is 3600, so I require a (1, 3600) array.
si is an array that is a function (mathematical) of the length of the individual section in l_o_s, the total length (L) of the system and the interval (Inter).
Here's the question
So if you take every value in
si = [ 150. 450. 750. 1050. 1350. 1650. 1950. 2250. 2550. 2850. 3150. 3450. 3600.]
I require an array of shape (1, 3600) whose first 150 elements are all equal to the diameter of section 1 (3.75), and the elements between 150 and 450 need to equal the diameter of the second section (3.5), and so forth...
So I need the first 150 elements corresponding to index 0 in d_o_s, the next 300 elements corresponding to index 1 in d_o_s, etc...
Here's a code I began with, but I don't think it's worth talking about. I was creating an array of zeros with inner shapes corresponding to each of the 150, 300, 300, 300 elements.
import numpy as np
import math

L = 36
Inter = 3600
n_o_s = 13
l_o_s = np.asarray([1.5, 4.5, 7.5, 10.5, 13.5, 16.5, 19.5, 22.5, 25.5, 28.5, 31.5, 34.5, 36])
d_o_s = np.asarray([3.75, 3.5, 3.75, 3.5, 3.75, 3.5, 3.75, 3.5, 3.75, 3.5, 3.75, 3.5, 3.75])
si = np.asarray((l_o_s / L) * Inter)
print(si)
z = si.size

def f(x):
    for i in si:
        zz = np.zeros((x, 1, int(i)))
        for j in range(int(z)):
            for p in range(int(d_o_s[j])):
                zz[j][0][p] = np.full((1, int(i)), (math.pi * d_o_s**4) / 64)
    return zz

print(f(z))
Any ideas,
Dallan
This is what I ended up with, but I'm only receiving 3599 values instead of the required 3600. Any ideas? I used the diameter to output another variable (basically swapped the diameters in d_o_s for values in i_o_s):
import math
import numpy as np

L = 36
Inter = 3600
n_o_s = 13
l_o_s = np.asarray([0, 1.5, 4.5, 7.5, 10.5, 13.5, 16.5, 19.5, 22.5, 25.5, 28.5, 31.5, 34.5, 36])
d_o_s = np.asarray([3.75, 3.5, 3.75, 3.5, 3.75, 3.5, 3.75, 3.5, 3.75, 3.5, 3.75, 3.5, 3.75])
i_o_s = (math.pi * d_o_s**4) / 64
si = np.asarray((l_o_s / L) * Inter)
lengths = si[1:] - si[:-1]
Iu = np.asarray(sum([[value] * int(length) for value, length in zip(i_o_s, lengths)], []))
print(Iu, Iu.shape)
In Python, an operation like 4 * [1] produces [1, 1, 1, 1]. So you need to calculate the lengths of the subarrays, create them, and concatenate them using sum():
lengths = si[1:] - si[:-1]
result = sum([
    [value] * length for value, length in zip(d_o_s, lengths)
], [])
Also, your si array is of type float, so you get a rounding error when it is used as an index. Convert it to integer by changing
si = np.asarray((l_o_s/L)*Inter)
to
si = np.asarray((l_o_s/L)*Inter).astype(int)
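For completeness, here is a numpy-only sketch (my own, not from the original answer) that avoids both the Python-list concatenation and the off-by-one, by rounding before truncating and using np.repeat:

import numpy as np

L, Inter = 36, 3600
l_o_s = np.asarray([0, 1.5, 4.5, 7.5, 10.5, 13.5, 16.5, 19.5, 22.5, 25.5, 28.5, 31.5, 34.5, 36])
d_o_s = np.asarray([3.75, 3.5, 3.75, 3.5, 3.75, 3.5, 3.75, 3.5, 3.75, 3.5, 3.75, 3.5, 3.75])

si = np.rint((l_o_s / L) * Inter).astype(int)  # np.rint avoids truncating 149.999... to 149
lengths = np.diff(si)                          # length of each section in intervals
result = np.repeat(d_o_s, lengths)             # each diameter repeated for its section
print(result.shape)                            # (3600,)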

How to equally partition an array into predefined sizes and loop over folds in Python

I'm using Python and trying to do 10-fold looping. To explain this problem, I have an array of any size > 10 with any content, for example:
myArray = [12,14,15,22,16,20,30,25,21,5,3,8,11,19,40,33,23,45,65]
smallArray = []
bigArray = []
I want to do two things:
divide "myArray" into 10 equal parts [e.g. part1, part2, ..., part10]
I need to loop 10 times and each time do the following:
smallArray = one distinct part at a time
the remaining parts are assigned to "bigArray"
and keep doing this for the remaining folds.
The output, for example:
Loop1: smallArray = [part1], bigArray[the remaining parts except part1]
Loop2: smallArray = [part2], bigArray[the remaining parts except part2]
...
Loop10: smallArray = [part10], bigArray[the remaining parts except part10]
How to do so in Python?
l = len(myArray)
# create start and end indices for each slice
slices = ((i * l // 10, (i + 1) * l // 10) for i in range(10))
# build (small, big) pairs
pairs = [(myArray[a:b], myArray[:a] + myArray[b:]) for a, b in slices]
for small, big in pairs:
    pass
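A numpy variant (a sketch of my own, not from the original answer): np.array_split produces parts that differ in size by at most one element when the length is not a multiple of 10.

import numpy as np

myArray = [12, 14, 15, 22, 16, 20, 30, 25, 21, 5, 3, 8, 11, 19, 40, 33, 23, 45, 65]
parts = np.array_split(myArray, 10)  # 10 nearly-equal parts
for k, small in enumerate(parts):
    big = np.concatenate(parts[:k] + parts[k + 1:])  # everything except fold k
    # use `small` as the held-out fold and `big` as the rest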
