Related
GOAL: Filter a list of lists using a dictionary as reference in Python 3.8+
USE CASE: Reviewing a nested list -- a series of survey responses -- and filtering out responses based on control questions. Per the dictionary, the answers to question 3 (index 2 in each list) and question 7 (index 6) should both be 5. If a response does not have 5 for both of those questions, it should not be added to the filtered_responses list.
I'm open to interpretation on how to solve this. I have reviewed several resources on filtering dictionaries using lists, but this structure is preferred because some survey responses may contain the same array of values, so each list element needs to be retained.
no_of_survey_questions = 10
no_of_participants = 5
min_score = 1
max_score = 10
control_questions = {3: 5,
7: 5, }
unfiltered_responses = [[4, 5, 4, 5, 4, 5, 4, 5, 4, 5], # omit
[9, 8, 7, 6, 5, 4, 3, 2, 1, 1], # omit
[5, 5, 5, 5, 5, 5, 5, 5, 5, 5], # include
[5, 2, 5, 2, 5, 2, 5, 9, 1, 1], # include
[1, 2, 5, 1, 2, 1, 2, 1, 2, 1]] # omit
for response in unfiltered_responses:
    print(response)
print()

filtered_responses = []  # should contain only unfiltered_responses values marked 'include'

for response in filtered_responses:
    # INSERT CODE HERE
    print(response)
Thanks in advance!
You can use list comprehension + all():
control_questions = {3: 5,
7: 5}
unfiltered_responses = [[4, 5, 4, 5, 4, 5, 4, 5, 4, 5], # omit
[9, 8, 7, 6, 5, 4, 3, 2, 1, 1], # omit
[5, 5, 5, 5, 5, 5, 5, 5, 5, 5], # include
[5, 2, 5, 2, 5, 2, 5, 9, 1, 1], # include
[1, 2, 5, 1, 2, 1, 2, 1, 2, 1]] # omit
filtered_questions = [subl for subl in unfiltered_responses if all(subl[k - 1] == v for k, v in control_questions.items())]
print(filtered_questions)
Prints:
[
[5, 5, 5, 5, 5, 5, 5, 5, 5, 5],
[5, 2, 5, 2, 5, 2, 5, 9, 1, 1]
]
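If you would rather keep the loop scaffold from the question, a minimal sketch of the same check written as an explicit loop (assuming, as above, that the keys of control_questions are 1-based question numbers) would be:
filtered_responses = []
for response in unfiltered_responses:
    # keep a response only if every control question has the expected answer
    if all(response[q - 1] == expected for q, expected in control_questions.items()):
        filtered_responses.append(response)
for response in filtered_responses:
    print(response)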
I have a tensor from which I want to copy only some of the values (column-wise). The same values are in another tensor, but in a random order. What I want are the column indices in tensor2 of the values taken from tensor1. Here is an example:
copy_ind = torch.tensor([0, 1, 3], dtype=torch.long)
tensor1 = torch.tensor([[4, 6, 5, 1, 8],[10, 0, 8, 2, 1]])
temp = torch.index_select(tensor1, 1, copy_ind) # values to copy
tensor2 = torch.tensor([[1, 4, 5, 6, 8],[2, 10, 8, 0, 1]], dtype=torch.long)
_, t_ind = torch.sort(temp[0], dim=0)
t2_ind = copy_ind[t_ind] # indices of tensor2
The output should be:
t2_ind = [1, 3, 0]
Here is another example where I want to get the values of the tensor according to c1_new:
c1 = torch.tensor([[6, 7, 7, 8, 6, 8, 9, 4, 7, 6, 1, 3],[5, 11, 5, 7, 2, 9, 5, 5, 7, 11, 10, 7]], dtype=torch.long)
copy_ind = torch.tensor([1, 2, 3, 5, 7, 8], dtype=torch.long)
c1_new = torch.index_select(c1, 1, copy_ind)
indices = torch.as_tensor([[1, 3, 4, 6, 6, 6, 7, 7, 7, 8, 8, 9], [10, 7, 5, 2, 5, 11, 5, 7, 11, 7, 9, 5]])
values = torch.randn(12)
tensor = torch.sparse.FloatTensor(indices, values, (12, 12))
_, t_ind = torch.sort(c1[0], dim=0)
ind = t_ind[copy_ind] # should be [8, 6, 9, 10, 2, 7]
Unfortunately, the indices ind are not correct. Can someone please help me?
If you're OK with using a for loop, you can use something like this, checking each column of your temp tensor against the columns of tensor2:
Edit: using torch.prod across dimension 1 to make sure both rows match
[torch.prod((temp.T[i] == tensor2.T), dim=1).nonzero()[0] for i in range(temp.size(1))]
My output for your first example is [tensor(1), tensor(3), tensor(0)]
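For completeness, here is a self-contained sketch of that idea applied to your first example; it uses .all(dim=1) in place of torch.prod, which is an equivalent way of requiring that both rows of a column match:
import torch

copy_ind = torch.tensor([0, 1, 3], dtype=torch.long)
tensor1 = torch.tensor([[4, 6, 5, 1, 8], [10, 0, 8, 2, 1]])
tensor2 = torch.tensor([[1, 4, 5, 6, 8], [2, 10, 8, 0, 1]], dtype=torch.long)
temp = torch.index_select(tensor1, 1, copy_ind)  # the columns of tensor1 to look up

# for each column of temp, find the column of tensor2 whose two rows both match
t2_ind = [(temp.T[i] == tensor2.T).all(dim=1).nonzero()[0] for i in range(temp.size(1))]
print(t2_ind)  # column indices 1, 3, 0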
Given some numpy array a
array([2,2,3,3,2,0,0,0,2,2,3,2,0,1,1,0])
what is the best way to get all groups of n indices with each of them having a different value in a?
Obviously there is no group larger than the number of unique elements in a, here 4.
So for example, one group of size 4 is
array([0,2,5,13])
Consider that a might be quite long, let's say up to 250k.
If the result gets too large, it might also be desirable not to compute all such groups, but only the first k requested.
For non-negative integer inputs, we can have a solution based on this post -
In [41]: sidx = a.argsort() # use kind='mergesort' for first occurrences
In [42]: c = np.bincount(a)
In [43]: np.sort(sidx[np.r_[0,(c[c!=0])[:-1].cumsum()]])
Out[43]: array([ 0, 2, 5, 13])
Another one, closely related to the previous method but working for generic (not necessarily integer) inputs - sort the array, mark where each new value starts, and map those positions back through the sort indices -
In [44]: b = a[sidx]
In [45]: np.sort(sidx[np.r_[True,b[:-1]!=b[1:]]])
Out[45]: array([ 0, 2, 5, 13])
Another option uses numba for memory efficiency, and hence performance too, to select the first index of each unique value; it also supports the additional k argument to stop early -
from numba import njit

@njit
def _numba1(a, notfound, out, k):
    iterID = 0
    for i, e in enumerate(a):
        if notfound[e]:
            notfound[e] = False
            out[iterID] = i
            iterID += 1
            if iterID >= k:
                break
    return out

def unique_elems(a, k, maxnum=None):
    # feed in the max of the input array as the maxnum value if known
    if maxnum is None:
        L = a.max() + 1
    else:
        L = maxnum + 1
    notfound = np.ones(L, dtype=bool)
    out = np.ones(k, dtype=a.dtype)
    return _numba1(a, notfound, out, k)
Sample run -
In [16]: np.random.seed(0)
...: a = np.random.randint(0,10,200)
In [17]: a
Out[17]:
array([5, 0, 3, 3, 7, 9, 3, 5, 2, 4, 7, 6, 8, 8, 1, 6, 7, 7, 8, 1, 5, 9,
8, 9, 4, 3, 0, 3, 5, 0, 2, 3, 8, 1, 3, 3, 3, 7, 0, 1, 9, 9, 0, 4,
7, 3, 2, 7, 2, 0, 0, 4, 5, 5, 6, 8, 4, 1, 4, 9, 8, 1, 1, 7, 9, 9,
3, 6, 7, 2, 0, 3, 5, 9, 4, 4, 6, 4, 4, 3, 4, 4, 8, 4, 3, 7, 5, 5,
0, 1, 5, 9, 3, 0, 5, 0, 1, 2, 4, 2, 0, 3, 2, 0, 7, 5, 9, 0, 2, 7,
2, 9, 2, 3, 3, 2, 3, 4, 1, 2, 9, 1, 4, 6, 8, 2, 3, 0, 0, 6, 0, 6,
3, 3, 8, 8, 8, 2, 3, 2, 0, 8, 8, 3, 8, 2, 8, 4, 3, 0, 4, 3, 6, 9,
8, 0, 8, 5, 9, 0, 9, 6, 5, 3, 1, 8, 0, 4, 9, 6, 5, 7, 8, 8, 9, 2,
8, 6, 6, 9, 1, 6, 8, 8, 3, 2, 3, 6, 3, 6, 5, 7, 0, 8, 4, 6, 5, 8,
2, 3])
In [19]: unique_elems(a, k=6)
Out[19]: array([0, 1, 2, 4, 5, 8])
Use numpy.unique for this job. It has several other options; for instance, it can also return the number of times each unique item appears in a.
import numpy as np
# Sample data
a = np.array([2,2,3,3,2,0,0,0,2,2,3,2,0,1,1,0])
# The unique values are in 'u'
# The indices of the first occurrence of the unique values are in 'indices'
u, indices = np.unique(a, return_index=True)
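Sorting those first-occurrence indices then gives one group in which each unique value of the sample array appears exactly once:
print(u)                 # [0 1 2 3]
print(indices)           # [ 5 13  0  2]
print(np.sort(indices))  # [ 0  2  5 13]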
I have a large numpy array of size 100x100. Among these 10000 values, there are only about 50 unique values. So I want to create a second array of length 50, containing these unique values, and then somehow map the large array to the smaller array. Effectively, I want to store just 50 values in my system instead of redundant 10000 values.
Slices of arrays seem to share memory, but as soon as I use specific indexing, memory sharing is lost.
a = np.array([1,2,3,4,5])
b = a[:3]
indices = [0,1,2]
c = a[indices]
print(b,c)
print(np.shares_memory(a,b),np.shares_memory(a,c))
This gives the output:
[1 2 3] [1 2 3]
True False
Even though b and c refer to the same values of a, b (the slice) shares memory with a while c doesn't. If I execute b[0] = 100, a[0] also becomes 100 since they share memory. That is not the case with c.
I want to make c, which is a collection of values which are all from a, share memory with a.
In general it is not possible to save memory in this way. The reason is that your data consists of 64-bit integers, and pointers are also 64-bit integers, so if you try to store each value exactly once in some auxiliary array and then point at those values, you will end up using basically the same amount of space.
The answer would be different if, for example, some of your arrays were subsets of other ones, or if you were storing large items like long strings.
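A quick sketch illustrating the point (assuming the default 64-bit integer dtype; the inverse-index array ends up as large as the original data):
import numpy as np

a = np.random.randint(0, 50, (100, 100))    # ~10000 values, at most 50 unique
u, inv = np.unique(a, return_inverse=True)

print(a.nbytes)               # 80000 bytes with 64-bit ints
print(u.nbytes + inv.nbytes)  # roughly the same: inv stores one 64-bit index per element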
So make a random array with a small set of unique values:
In [45]: x = np.random.randint(0,10,(10,10))
In [46]: x
Out[46]:
array([[4, 3, 8, 5, 4, 8, 8, 1, 8, 1],
[9, 2, 7, 2, 9, 5, 3, 9, 3, 3],
[6, 2, 6, 9, 4, 2, 3, 4, 6, 7],
[1, 0, 2, 1, 0, 9, 4, 2, 6, 2],
[8, 1, 6, 8, 3, 9, 5, 0, 8, 5],
[4, 9, 1, 4, 1, 2, 8, 4, 7, 2],
[4, 5, 2, 4, 8, 0, 1, 4, 4, 7],
[2, 2, 0, 5, 3, 0, 3, 3, 3, 9],
[3, 1, 0, 6, 4, 8, 8, 3, 5, 2],
[7, 5, 9, 2, 8, 0, 8, 1, 7, 8]])
Find the unique ones:
In [48]: np.unique(x)
Out[48]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
Better yet, get the unique values plus an array that lets us map those values back onto the original:
In [49]: np.unique(x, return_inverse=True)
Out[49]:
(array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]),
array([4, 3, 8, 5, 4, 8, 8, 1, 8, 1, 9, 2, 7, 2, 9, 5, 3, 9, 3, 3, 6, 2,
6, 9, 4, 2, 3, 4, 6, 7, 1, 0, 2, 1, 0, 9, 4, 2, 6, 2, 8, 1, 6, 8,
3, 9, 5, 0, 8, 5, 4, 9, 1, 4, 1, 2, 8, 4, 7, 2, 4, 5, 2, 4, 8, 0,
1, 4, 4, 7, 2, 2, 0, 5, 3, 0, 3, 3, 3, 9, 3, 1, 0, 6, 4, 8, 8, 3,
5, 2, 7, 5, 9, 2, 8, 0, 8, 1, 7, 8]))
There's a value in the reverse mapping for each element in the original.
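In other words, indexing the unique values with the reverse mapping rebuilds the original array; a small sketch of that round trip:
u, inv = np.unique(x, return_inverse=True)
# u holds each distinct value once; inv says which entry of u each element of x is
x_back = u[inv].reshape(x.shape)  # reshape in case inv comes back flattened (older NumPy)
print(np.array_equal(x, x_back))  # True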
I don't know how to generate a list of numbers with duplicates based on a certain seed.
I have tried using the code below, but it cannot generate numbers that have duplicates:
random.seed(3340)
test = random.sample(range(100), 100000)
I thought this would work, but I got an error saying "ValueError: Sample larger than population or is negative".
I could implement a function to do this myself, but it would be better if I could use an existing library.
random.sample samples without replacement. random.choices samples with replacement, which is what you want:
In [1]: import random
In [2]: random.choices([1, 2], k=10)
Out[2]: [2, 1, 1, 2, 1, 1, 1, 2, 2, 1]
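Applied to the seed and sizes from the question, a seeded version might look like this:
import random

random.seed(3340)
test = random.choices(range(100), k=100000)  # 100000 values in 0-99, duplicates allowed, reproducible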
You can also do this with numpy:
In [3]: import numpy
In [4]: numpy.random.randint(0, 10, 100)
Out[4]:
array([7, 6, 3, 3, 8, 5, 9, 5, 4, 5, 1, 5, 8, 2, 4, 3, 9, 3, 5, 7, 9, 6,
2, 3, 5, 8, 4, 9, 3, 3, 0, 8, 4, 4, 7, 2, 8, 4, 4, 9, 1, 1, 7, 1,
3, 1, 1, 5, 1, 7, 5, 1, 9, 6, 0, 4, 8, 9, 9, 4, 7, 6, 0, 5, 1, 8,
4, 8, 9, 8, 5, 4, 3, 0, 2, 6, 4, 4, 2, 3, 0, 6, 7, 3, 5, 9, 3, 7,
4, 1, 7, 6, 7, 8, 7, 6, 0, 5, 1, 0])
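If the numpy version also needs to be seeded, one option (a sketch using the newer Generator API) is:
import numpy as np

rng = np.random.default_rng(3340)
test = rng.integers(0, 100, size=100000)  # 100000 ints in [0, 100), reproducible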
I don't know if you're looking for a simpler solution, but you could use indexing inside a list comprehension:
population = list(range(100))
sample = [population[random.randint(0, 99)] for _ in range(100000)]
You could use this comprehension as well:
random.seed(3340)
test = [random.randrange(100) for _ in range(100000)]