Counting combinations over pairs of columns in a numpy array

I have a matrix with a certain number of columns that contain only the numbers 0 and 1. I want to count the number of [0, 0], [0, 1], [1, 0], and [1, 1] rows in each PAIR of columns.
So for example, if I have a matrix with four columns, I want to count the number of 00s, 01s, 10s, and 11s in the first and second columns, append the result to a list, then do the same for the 3rd and 4th columns and append that answer to the list.
Example input:
array([[0, 1, 1, 0],
       [1, 0, 1, 0],
       [0, 1, 0, 1],
       [0, 0, 1, 1],
       [1, 1, 0, 0]])
My expected output is:
array([[1, 1],
       [2, 1],
       [1, 2],
       [1, 1]])
Explanation:
The first two columns have [0, 0] once. The second two columns also have [0, 0] once. The first two columns have [0, 1] twice, and the second two columns have [0, 1] once... and so on.
This is my latest attempt and it seems to work. Would like feedback.
# for each pair of columns calculate haplotype frequencies
# haplotypes:
# h1 = 11
# h2 = 10
# h3 = 01
# h4 = 00
# takes as input a pair of columns
def calc_haplotype_freq(matrix):
    h1_frequencies = []
    h2_frequencies = []
    h3_frequencies = []
    h4_frequencies = []
    colIndex1 = 0
    colIndex2 = 1
    for i in range(0, 2): # number of columns divided by 2
        h1 = 0
        h2 = 0
        h3 = 0
        h4 = 0
        column_1 = matrix[:, colIndex1]
        column_2 = matrix[:, colIndex2]
        for row in range(0, matrix.shape[0]):
            if (column_1[row, 0] == 1).any() & (column_2[row, 0] == 1).any():
                h1 += 1
            elif (column_1[row, 0] == 1).any() & (column_2[row, 0] == 0).any():
                h2 += 1
            elif (column_1[row, 0] == 0).any() & (column_2[row, 0] == 1).any():
                h3 += 1
            elif (column_1[row, 0] == 0).any() & (column_2[row, 0] == 0).any():
                h4 += 1
        colIndex1 += 2
        colIndex2 += 2
        h1_frequencies.append(h1)
        h2_frequencies.append(h2)
        h3_frequencies.append(h3)
        h4_frequencies.append(h4)
    print("H1 Frequencies (11): ", h1_frequencies)
    print("H2 Frequencies (10): ", h2_frequencies)
    print("H3 Frequencies (01): ", h3_frequencies)
    print("H4 Frequencies (00): ", h4_frequencies)
For the sample input above, this gives:
----------
H1 Frequencies (11): [1, 1]
H2 Frequencies (10): [1, 2]
H3 Frequencies (01): [2, 1]
H4 Frequencies (00): [1, 1]
----------
Which is correct, but is there a better way to do this? How can I return these results from the function for further processing?

Starting with this -
x
array([[0, 1, 1, 0],
       [1, 0, 1, 0],
       [0, 1, 0, 1],
       [0, 0, 1, 1],
       [1, 1, 0, 0]])
Split your array into groups of 2 columns and concatenate them:
y = x.T
z = np.concatenate([y[i:i + 2] for i in range(0, y.shape[0], 2)], 1).T
Now, perform a broadcasted comparison and sum:
(z[:, None] == [[0, 0], [0, 1], [1, 0], [1, 1]]).all(2).sum(0)
array([2, 3, 3, 2])
If you want a per-column pair count, then you could do something like this:
def calc_haplotype_freq(x):
    counts = []
    for i in range(0, x.shape[1], 2):
        counts.append(
            (x[:, None, i:i + 2] == [[0, 0], [0, 1], [1, 0], [1, 1]]).all(2).sum(0)
        )
    return np.column_stack(counts)
calc_haplotype_freq(x)
array([[1, 1],
       [2, 1],
       [1, 2],
       [1, 1]])
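If the Python-level loop over column pairs ever becomes a bottleneck, one loop-free sketch is to encode each pair as a single integer (2 * first + second) and compare against the codes 0 to 3. This assumes an even number of columns and only 0/1 entries; `pair_counts` is a hypothetical name used for illustration, not part of the answer above.

```python
import numpy as np

def pair_counts(x):
    """Count 00, 01, 10, 11 (rows of the result) for each consecutive column pair."""
    x = np.asarray(x)
    n_rows, n_cols = x.shape
    # Assumption: n_cols is even, so the columns group cleanly into pairs.
    pairs = x.reshape(n_rows, n_cols // 2, 2)
    # Encode each pair as an integer: 00 -> 0, 01 -> 1, 10 -> 2, 11 -> 3.
    codes = 2 * pairs[..., 0] + pairs[..., 1]
    # Compare against every code and count the matches down the rows.
    return (codes[None, :, :] == np.arange(4)[:, None, None]).sum(axis=1)

x = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 1],
              [1, 1, 0, 0]])
print(pair_counts(x))
# [[1 1]
#  [2 1]
#  [1 2]
#  [1 1]]
```

The result has one row per combination (00, 01, 10, 11) and one column per column pair, matching the layout of the expected output in the question.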

Related

Searching within Nested lists in Python || How do I match a 1-index nested list, with a 2-index nested list

here is the problem I am trying to solve:
coord = [[0, 0], [1, 0], [1, 1], [2, 0], [2, 1], [2, 1], [2, 2] ..]
new_arr = [[[0, 0], 1], [[1, 0], 1], [[1, 1], 1], [[2, 0], 1], [[2, 1], 2], [[2, 2], 1] ..]
This is the target I am trying to map to
[0, 0][0, 1][0, 2]
[1, 0][1, 1][1, 2]
[2, 0][2, 1][2, 2]
the ultimate output would be the counts against each of the coordinates
1 0 0
1 1 0
1 2 1
------ clarifications --------
the goal is to generate this square of numbers (counts) which is the second element in new_arr. E.g. [[0, 0], 1], [[1, 0], 1], can be interpreted as the value 1 for the coordinate [0,0] and value 1 for coordinate [1,0] 
the first list (coord) is simply a map of the coordinates. The goal is to get the corresponding value (from new_arr) and display it in the form of a square. Hope this clarified. The output will be a grid of the format
1 0 0
1 1 0
1 2 1
As for the question of N: I just took a sample value of 3. In the actual use case the user enters an integer, say 6, and the result is a 6 x 6 square. The counts are chess-move computations of the number of ways to reach a specific cell, starting from (0, 0) and using only two moves: (i+1, j) and (i+1, j+1).
The logic is not fully clear, but it looks like you want to map the values of new_arr onto the Cartesian product of coordinates:
N = 3 # how this is determined is unclear
d = {tuple(l):x for l, x in new_arr}
# {(0, 0): 1, (1, 0): 1, (1, 1): 1, (2, 0): 1, (2, 1): 2, (2, 2): 1}
out = [d.get((i,j), 0) for i in range(N) for j in range(N)]
# [1, 0, 0, 1, 1, 0, 1, 2, 1]
# 2D variant
out2 = [[d.get((i,j), 0) for j in range(N)] for i in range(N)]
# [[1, 0, 0],
# [1, 1, 0],
# [1, 2, 1]]
alternative with numpy
import numpy as np
N = 3
a = np.zeros((N,N), dtype=int)
# get indices and values
idx, val = zip(*new_arr)
# assign values (option 1)
a[tuple(zip(*idx))] = val
# assign values (option 2)
a[tuple(np.array(idx).T.tolist())] = val
print(a)
Output:
[[1 0 0]
 [1 1 0]
 [1 2 1]]
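In the sample data, each value in new_arr happens to equal the number of times its coordinate appears in coord ([2, 1] appears twice). If that holds in general, which is an assumption and not something the question states, the grid could be built straight from coord with np.add.at, which accumulates correctly over repeated indices. A minimal sketch:

```python
import numpy as np

coord = [[0, 0], [1, 0], [1, 1], [2, 0], [2, 1], [2, 1], [2, 2]]
N = 3  # grid size, taken from the example

rows, cols = np.array(coord).T
grid = np.zeros((N, N), dtype=int)
# Unlike grid[rows, cols] += 1, np.add.at counts duplicate coordinates
# once per occurrence.
np.add.at(grid, (rows, cols), 1)
print(grid)
# [[1 0 0]
#  [1 1 0]
#  [1 2 1]]
```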
Use numpy:
import numpy as np
coord = [[0, 0], [1, 0], [1, 1], [2, 0], [2, 1], [2, 1], [2, 2]]
new_arr = [[[0, 0], 1], [[1, 0], 1], [[1, 1], 1], [[2, 0], 1], [[2, 1], 2], [[2, 2], 1]]
# the grid size is taken from the last coordinate; np.zeros gives a float array
result = np.zeros([coord[-1][0] + 1, coord[-1][1] + 1])
for i in new_arr:
    for j in coord:
        if i[0] == j:
            result[j[0], j[1]] = i[1]
print(result)
Output:
[[1. 0. 0.]
 [1. 1. 0.]
 [1. 2. 1.]]

Generate a list based on the index of another list

I have a list:
hash_table = [1, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1]
I want to change this to:
result = [[0, 0], [1, 2], [4, 5]]
How to generate:
array: [1, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1]
map: 0 0.5 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0
# start to end, generate the result like `[int(start), int(end)]`
combine them:[[0, 0], [1, 2], [4, 5]]
0 and 1 wouldn't appear in pairs. So the numbers in result must be an integer.
What I have tried:
hash_table = [1, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1]
output = [[]]
for pre, next_ in zip(hash_table, hash_table[1:]):
    output[-1].append(pre)
    if {next_, pre} == {0, 1}:
        output.append([])
output[-1].append(hash_table[-1])
# the output is [[1], [0], [1, 1, 1], [0, 0, 0], [1, 1, 1]]
start = index = 0
result = []
while index < len(output):
    # output[index]
    if output[0] != 0:
        res.append([start, math.ceil(len(output[index]))])
    # I don't know how to handle the list "output".
    # I couldn't know it. My mind has gone blank
    start += len(output[index])/2
Any good ideas? I thought I made it too complicated.
You can use itertools.groupby to group the 0s and 1s:
import itertools
hash_table = [1, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1]
result = []
cur_ind = 0
for (val, vals) in itertools.groupby(hash_table):
    vals = list(vals) # itertools doesn't make it a list by default
    old_ind = cur_ind
    cur_ind += len(vals)
    if val == 0:
        continue
    result.append([old_ind // 2, (cur_ind - 1) // 2])
print(result)
Essentially, itertools.groupby yields pairs roughly like (1, [1]), (0, [0]), (1, [1, 1, 1]), (0, [0, 0, 0]), (1, [1, 1, 1]), where each group is itself an iterator rather than a list. We iterate through these pairs and keep track of the index we're on by adding the length of each group to the current index. If the value is 1, we have a run of ones, so we append its halved start and end indices to the results. old_ind // 2 is integer (floor) division, which is equivalent to int(old_ind / 2) for the non-negative indices used here.
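To make the grouping step concrete, here is a small sketch that prints the (value, run length) pairs groupby produces and wraps the same index bookkeeping in a function. The name `one_runs` is hypothetical and only used for illustration.

```python
import itertools

def one_runs(bits):
    """Return [start // 2, end // 2] for every run of 1s in bits."""
    result = []
    position = 0  # index of the first element of the current run
    for value, group in itertools.groupby(bits):
        length = sum(1 for _ in group)  # consume the group iterator to get its size
        if value == 1:
            result.append([position // 2, (position + length - 1) // 2])
        position += length
    return result

hash_table = [1, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1]
print([(v, len(list(g))) for v, g in itertools.groupby(hash_table)])
# [(1, 1), (0, 1), (1, 3), (0, 3), (1, 3)]
print(one_runs(hash_table))
# [[0, 0], [1, 2], [4, 5]]
```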
You could use groupby from the itertools library:
import itertools
hash_table = [1, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1]
s = "".join(map(str, hash_table)) # s = "10111000111"
gs = [(i, list(g)) for i, g in itertools.groupby(s)]
idx, result = 0, []
for i, g in gs: # i is '1' or '0' (i.e., whether the group consists of 1s or 0s)
    if i == '1':
        # integer division keeps the results as ints
        result.append([idx // 2, (idx + len(g) - 1) // 2])
    idx += len(g)
print(result)
# [[0, 0], [1, 2], [4, 5]]

How can I find the value with the minimum MSE with a numpy array?

My possible values are:
0: [0 0 0 0]
1: [1 0 0 0]
2: [1 1 0 0]
3: [1 1 1 0]
4: [1 1 1 1]
I have some values:
[[0.9539342 0.84090066 0.46451256 0.09715253],
[0.9923432 0.01231235 0.19491441 0.09715253]
....
I want to figure out which of my possible values is closest to each of my new values. Ideally I want to avoid a for loop, and I wonder if there's some sort of vectorized way to search for the minimum mean squared error?
I want it to return an array that looks like: [2, 1 ....
You can use np.argmin to get the index of the candidate with the smallest error. np.linalg.norm gives the Euclidean distance, which is the RMSE scaled by a constant factor (the square root of the vector length), so it has the same argmin:
import numpy as np
a = np.array([[0, 0, 0, 0], [1, 0, 0, 0], [1, 1, 0, 0],[1, 1, 1, 0], [1, 1, 1, 1]])
b = np.array([0.9539342, 0.84090066, 0.46451256, 0.09715253])
np.argmin(np.linalg.norm(a-b, axis=1))
#outputs 2 which corresponds to the value [1, 1, 0, 0]
As mentioned in the edit, b can have multiple rows. The OP wants to avoid a for loop; a list comprehension is the simplest way I found, though there may be a better one:
[np.argmin(np.linalg.norm(a-i, axis=1)) for i in b]
#Outputs [2, 1]
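For a multi-row b, broadcasting can replace the Python-level loop: insert a new axis so that every row of b is compared against every candidate at once. This is a sketch using the two sample rows from the question; note that it builds an intermediate array of shape (rows of b, rows of a, 4), which may matter for very large inputs.

```python
import numpy as np

a = np.array([[0, 0, 0, 0],
              [1, 0, 0, 0],
              [1, 1, 0, 0],
              [1, 1, 1, 0],
              [1, 1, 1, 1]])
b = np.array([[0.9539342, 0.84090066, 0.46451256, 0.09715253],
              [0.9923432, 0.01231235, 0.19491441, 0.09715253]])

# b[:, None, :] has shape (2, 1, 4) and a[None, :, :] has shape (1, 5, 4),
# so the difference broadcasts to (2, 5, 4): every row of b vs. every candidate.
mse = ((b[:, None, :] - a[None, :, :]) ** 2).mean(axis=2)
closest = mse.argmin(axis=1)
print(closest)
# [2 1]
```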
Let's assume your input data is a dictionary. You can then use NumPy for a vectorized solution. You first convert your input lists to a NumPy array and then use the axis=1 argument to get the RMSE for each candidate.
# Input data
dicts = {0: [0, 0, 0, 0], 1: [1, 0, 0, 0], 2: [1, 1, 0, 0], 3: [1, 1, 1, 0],4: [1, 1, 1, 1]}
new_value = np.array([0.9539342, 0.84090066, 0.46451256, 0.09715253])
# Convert values to array
values = np.array(list(dicts.values()))
# Compute the RMSE and get the index for the least RMSE
rmse = np.mean((values-new_value)**2, axis=1)**0.5
index = np.argmin(rmse)
print ("The closest value is %s" %(values[index]))
# The closest value is [1 1 0 0]
Pure numpy, by rounding (this only finds a match when the rounded vector is itself one of the possible values):
import numpy as np
val1 = np.array([
    [0, 0, 0, 0],
    [1, 0, 0, 0],
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [1, 1, 1, 1]
])
print(val1)
val2 = np.array([0.9539342, 0.84090066, 0.46451256, 0.09715253], float)
val3 = np.round(val2, 0)
print(val3)
print(np.where((val1 == val3).all(axis=1))) # shows a match on row 2: (array([2]),)

Determine which combinations of vectors will sum to another vector

I am using Python 3 to try to find what linear combinations of a set of vectors will sum to another vector. I am using numpy arrays as vectors.
For example, I would have a target vector and matrix "choices" containing all the possible choices of vectors:
targetvector0 = numpy.array([0, 1, 2])
choices = numpy.array([[0, 1, 0], [0, 0, 1], [0, 0, 2], [1, 1, 0]])
I need something that would return all possible combinations and their integer multiples (need them to be integer multiples) that sum to the target and ignore those that don't:
option1 = [[1], [2], [0], [0]]
option2 = [[1], [0], [1], [0]]
I found some info on numpy.linalg.solve(x, y), but it doesn't quite do what I'm looking for or I don't know how to use it effectively.
I suppose the multiples you are searching for are all non-negative.
You can carefully increment the multiples, exploring only the combinations whose result does not exceed the target vector in any component. (This pruning assumes the choice vectors have no negative entries.)
import numpy as np
def solve(target_vector, choices):
    nb_choices, n = choices.shape
    # np.int was removed from NumPy; the builtin int works as a dtype
    factors = np.zeros((1, nb_choices), dtype=int)
    i = 0
    while True:
        # stop once the carry has moved past the last choice
        if i == nb_choices:
            return
        factors[0, i] += 1
        difference_to_target = factors.dot(choices) - target_vector
        found_solution = np.all(difference_to_target == 0)
        factors_too_high = np.any(difference_to_target > 0)
        if found_solution:
            yield factors.copy()
        if found_solution or factors_too_high:
            # reset the lower multiples and carry over to the next choice
            factors[0, :i + 1] = 0
            i += 1
            continue
        i = 0
targetvector = np.array([0, 1, 2])
choices = np.array([[0, 1, 0], [0, 0, 1], [0, 0, 2], [1, 1, 0]])
print(list(solve(targetvector, choices)))
# [array([[1, 2, 0, 0]]), array([[1, 0, 1, 0]])]
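As a cross-check on the incremental search, a plain brute-force sketch can bound each multiple and then test every combination with itertools.product. It assumes non-negative integer entries in both the target and the choices; `integer_combinations` is a hypothetical name, and the search space grows quickly with the number of choices, so this is only practical for small inputs.

```python
import itertools
import numpy as np

def integer_combinations(target, choices):
    """Return every list of non-negative integer multiples whose combination equals target."""
    target = np.asarray(target)
    choices = np.asarray(choices)
    bounds = []
    for c in choices:
        positive = c > 0
        # Largest k such that k * c does not exceed target in any component.
        # An all-zero choice contributes nothing, so its multiple is fixed at 0.
        bounds.append(int((target[positive] // c[positive]).min()) if positive.any() else 0)
    solutions = []
    for factors in itertools.product(*(range(b + 1) for b in bounds)):
        if np.array_equal(np.dot(factors, choices), target):
            solutions.append(list(factors))
    return solutions

targetvector = np.array([0, 1, 2])
choices = np.array([[0, 1, 0], [0, 0, 1], [0, 0, 2], [1, 1, 0]])
print(integer_combinations(targetvector, choices))
# [[1, 0, 1, 0], [1, 2, 0, 0]]
```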

How to apply multiple masks to an array and count occurrences per row

Let's say I have a 2D array with positive integers:
a = numpy.array([[1, 1, 2],
[1, 2, 5],
[1, 3, 6],
[3, 3, 3],
[3, 4, 6],
[4, 5, 6],
])
and a threshold (positive integer). I want to count, for each row, how many occurrences are < threshold, how many are >= threshold and < threshold+2, and how many are >= threshold+2. The results are to be stored in an n x 3 array, where n = a.shape[0] and each of the 3 columns corresponds to one threshold partition.
For the example above and threshold = 3, it would be:
b = numpy.array([[3, 0, 0],
[2, 0, 1],
[1, 1, 1],
[0, 3, 0],
[0, 2, 1],
[0, 1, 2],
])
My solution was to use a for loop combined with masks, so that I could apply the masks individually for each row. But using for loops on arrays feels wrong. Is there a more optimized way to accomplish that?
My solution so far:
b = []
for row in a:
    b.append((numpy.sum(row < threshold),
              numpy.sum((row >= threshold) * (row < threshold + 2)),
              numpy.sum(row >= threshold + 2)))
b = numpy.array(b)
Approach #1
Making use of elementwise comparison against the thresholds and summing each row -
t = 3 # threshold
mask0 = (a<t)
mask2 = a>=t+2
mask1 = (a>=t) & ~mask2
out = np.c_[mask0.sum(1), mask1.sum(1), mask2.sum(1)]
Approach #2
If you think about it closely, we are creating three bins there. So, we could get the bin ID for each element and finally get the count for each row based on those IDs. We would use np.searchsorted to get the bin IDs, then compare elementwise against each ID and sum along each row.
Thus, we would have a solution, like so -
t = 3 # threshold
bins = [t, t+2] # Create intervals
N = len(bins)+1 # Number of cols in output
idx = np.searchsorted(bins,a,'right') # Get bin IDs
out = np.column_stack([(idx==i).sum(1) for i in range(N)])
We can vectorize the last step with broadcasting -
out = (idx == np.arange(N)[:,None,None]).sum(2).T
And one more vectorized alternative, which would also be memory efficient with np.bincount -
M = a.shape[0]
r = N*np.arange(M)[:,None]
out = np.bincount((idx + r).ravel(),minlength=M*N).reshape(M,N)
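The bincount variant works by shifting the bin IDs of row m by m * N, so every (row, bin) combination gets its own slot in one flat count. A self-contained sketch on the question's data, with the intermediate names `offsets` and `flat_ids` introduced here only for readability:

```python
import numpy as np

a = np.array([[1, 1, 2],
              [1, 2, 5],
              [1, 3, 6],
              [3, 3, 3],
              [3, 4, 6],
              [4, 5, 6]])
t = 3                                   # threshold
bins = [t, t + 2]                       # bin edges
N = len(bins) + 1                       # number of bins per row
M = a.shape[0]                          # number of rows

idx = np.searchsorted(bins, a, side='right')   # bin ID (0..N-1) of each element
offsets = N * np.arange(M)[:, None]            # row m owns slots m*N .. m*N + N-1
flat_ids = (idx + offsets).ravel()
out = np.bincount(flat_ids, minlength=M * N).reshape(M, N)
print(out)
# [[3 0 0]
#  [2 0 1]
#  [1 1 1]
#  [0 3 0]
#  [0 2 1]
#  [0 1 2]]
```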
You have two break points, 3 and 5. We can use np.searchsorted to find where each element of a falls with respect to these break points.
np.searchsorted([3, 5], 1, side='right') will return 0 because 1 should be inserted at position 0 to maintain sorted-ness.
np.searchsorted([3, 5], 3, side='right') will return 1. A 3 could be inserted at position 0 or position 1 and the array would stay sorted. The default behavior is to insert to the left of elements that are equal; side='right' changes this to insert to the right of all equal elements. This accounts for the condition < threshold.
np.searchsorted([3, 5], 5) will return 1 with the default side='left'; with side='right' it returns 2
np.searchsorted([3, 5], 7) will return 2
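The difference between the two side settings is easiest to see by running both on values that hit the break points exactly. A short sketch:

```python
import numpy as np

edges = [3, 5]
values = [1, 3, 5, 7]

left = np.searchsorted(edges, values)                 # default side='left'
right = np.searchsorted(edges, values, side='right')
print(left)
# [0 0 1 2]
print(right)
# [0 1 2 2]
```

Only the values equal to an edge (3 and 5) change bins, which is why side='right' is needed for the >= threshold partitions.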
I use np.eye to build sub arrays to sum over in order to count how many fall within each bin.
np.eye(3, dtype=int)[np.searchsorted([3, 5], a, side='right')].sum(1)
array([[3, 0, 0],
[2, 0, 1],
[1, 1, 1],
[0, 3, 0],
[0, 2, 1],
[0, 1, 2]])
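Indexing np.eye with the bin IDs turns every ID into a one-hot row, and summing over the element axis then counts how often each bin occurs per row. A small sketch on hand-written bin IDs (the array `idx` below is made up for illustration):

```python
import numpy as np

idx = np.array([[0, 0, 2],
                [1, 1, 1]])             # bin IDs for two rows of three elements

one_hot = np.eye(3, dtype=int)[idx]     # shape (2, 3, 3): one one-hot row per element
print(one_hot.shape)
# (2, 3, 3)
print(one_hot.sum(axis=1))
# [[2 0 1]
#  [0 3 0]]
```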
We can generalize this with a function
def count_bins(a, threshold, interval_sizes):
    edges = np.append(threshold, interval_sizes).cumsum()
    eye = np.eye(edges.size + 1, dtype=int)
    return eye[edges.searchsorted(a, side='right')].sum(1)
count_bins(a, 3, [2])
array([[3, 0, 0],
[2, 0, 1],
[1, 1, 1],
[0, 3, 0],
[0, 2, 1],
[0, 1, 2]])
Or
count_bins(a, 3, [1, 1])
array([[3, 0, 0, 0],
[2, 0, 0, 1],
[1, 1, 0, 1],
[0, 3, 0, 0],
[0, 1, 1, 1],
[0, 0, 1, 2]])
But I'd rather return a pandas dataframe to see things more clearly
import pandas as pd
def count_bins(a, threshold, interval_sizes):
    edges = np.append(threshold, interval_sizes).cumsum()
    eye = np.eye(edges.size + 1, dtype=int)
    # label each bin with its lower and upper edge
    labels = ['{:0.0f} to {:0.0f}'.format(i, j)
              for i, j in zip(np.append(-np.inf, edges), np.append(edges, np.inf))]
    return pd.DataFrame(
        eye[edges.searchsorted(a, side='right')].sum(1),
        columns=labels
    )
count_bins(a, 3, [2])
   -inf to 3  3 to 5  5 to inf
0          3       0         0
1          2       0         1
2          1       1         1
3          0       3         0
4          0       2         1
5          0       1         2
