Python factor level combinations

I'm trying to create a python version of the attentional network task. See this as a reference (page 3): http://www.researchgate.net/publication/7834908_The_Activation_of_Attentional_Networks
I have a total of 216 trials: half will be "congruent" and half "incongruent". Furthermore, a third of the 216 will be "nocue", another third "center", and the final third "spatial".
Each of the 216 trials will be some combination of the above (e.g. congruent-spatial, incongruent-none)
This is how I'm creating those trials right now:
import pandas as pd
import numpy as np
import random
#set number of trials
numTrials = 216
numCongruent = numTrials // 2
numCue = numTrials // 3
#create shuffled congruency conditions
congruent = ["congruent"] * numCongruent
incongruent = ["incongruent"] * numCongruent
congruentConditions = congruent + incongruent
random.shuffle(congruentConditions)
#create shuffled cue conditions
noCue = ["none"] * numCue
centerCue = ["center"] * numCue
spatialCue = ["spatial"] * numCue
cueConditions = noCue + centerCue + spatialCue
random.shuffle(cueConditions)
#put everything into a dataframe
df = pd.DataFrame()
congruentArray = np.asarray(congruentConditions)
cueArray = np.asarray(cueConditions)
df["congruent"] = congruentArray
df["cue"] = cueArray
print(df)
2 questions...
Now, this works, but one important point is ensuring even distribution of the levels.
For example, I need to ensure that all of the "congruent" trials have an equal number of "nocue", "center", and "spatial" trials. Conversely, all of the "nocue" trials, for example, need to have an equal number of "congruent" and "incongruent" trials.
The way I'm randomly shuffling the conditions does not ensure this. It would even out over an infinite number of trials, but that is not the case here.
How would I ensure an equal distribution?
I've taken a look at the cartesian product (https://docs.python.org/2/library/itertools.html#itertools.product), but I'm not entirely sure it will help me solve the equality problem.
Once the above has been solved, I then need to ensure that in the final shuffled list each trial type (e.g. congruent-spatial) follows each other trial type an equal number of times.

One easy option is to generate a list of the 216 trials and shuffle it:
In [15]: import numpy as np
In [16]: opt1 = ["congruent", "incongruent"]
In [17]: opt2 = ["nocue", "center", "spatial"]
In [18]: from itertools import product
In [19]: trials = list(product(opt1, opt2))*36
In [20]: np.random.shuffle(trials)
trials will then be a randomly ordered list with 36 of each of the pairs.
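To double-check the balance (a quick addition of mine, not part of the original answer), count the pairs after shuffling:
In [21]: from collections import Counter
In [22]: Counter(trials)  # each of the six (congruency, cue) pairs appears exactly 36 times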
EDIT: Your edit is a harder problem, and honestly, I'd need to think more about it to figure out if there is a solution or to prove that you can't have that desired property.
If "close enough" to even works, the best I could come up with is a bogosort approach: shuffle the list, check whether all of the a->b counts are between 4-8, and start over if they're not. Generally runs in 1-5 seconds on my machine:
def checkvals(v):
    return all(x in (4, 5, 6, 7, 8) for x in v[1].value_counts().values)

def checkall(trials):
    return all(checkvals(v) for k, v in pd.DataFrame(list(zip(trials, trials[1:]))).groupby(0))

while not checkall(trials):
    np.random.shuffle(trials)
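Once checkall passes, you can inspect the transition counts directly; for example (my addition, not part of the original answer):
pairs = pd.DataFrame({'prev': ['-'.join(t) for t in trials[:-1]],
                      'next': ['-'.join(t) for t in trials[1:]]})
print(pd.crosstab(pairs['prev'], pairs['next']))  # every count should land in the 4-8 range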

Related

Fast conditional overlapping windowing (framing) of numpy array

I have a huge list of numpy arrays (1-dimensional), which are time series for different events. Each point has a label, and I want to window the numpy arrays based on those labels. The labels I have are 0, 1, and 2. Each window has a fixed size M.
The label of each window will be the largest label present in the window, so if a window contains both 0- and 1-labeled data points, the label of the whole window will be 1.
But the problem is that the windowing is not label-agnostic: because of class imbalance, I want overlapping windows only for labels 1 and 2.
So far I have written this code:
# conditional framing
data = []
start_cursor = 0
while start_cursor < arr.size:
    end_cursor = start_cursor + window_size
    data.append(
        {
            "frame": arr[start_cursor:end_cursor],
            "label": y[start_cursor:end_cursor].max(),
        }
    )
    start_cursor = end_cursor
    if np.any(y[start_cursor, end_cursor] != 0):
        start_cursor = start_cursor - overlap_size
But this is clearly too verbose and just plain inefficient, especially because I will call this while loop on my huge list of separate arrays.
EDIT: to explain the problem further, imagine you are windowing a signal with fixed window length M. If the window contains only 0-labeled points, there is no overlap between adjacent windows. But if labels 1 or 2 are present, adjacent windows overlap by some percentage p%.
I think this does what you are asking to do. The visualization for checking isn't great, but it helps you see how the windowing works. Hopefully I understood your question right and this is what you are trying to do. Anytime there is a 1 or 2 in the time series (rather than a 0) the window steps forward some fraction of the full window length (here 50%).
To examine how to do this, start with a sample time series:
import matplotlib.pylab as plt
import numpy as np
N = 5000 # time series length
# create some sort of data set to work with
x = np.zeros(N)
# add a few 1s and 2s to the list (though really they are the same for the windowing)
y = np.random.random(N)
x[y < 0.01] = 1
x[y < 0.005] = 2
# assign a window length
M = 50 # window length
overlap = 0.5 # assume 50% overlap
M_overlap = int(M * (1-overlap))
My approach is to sum the window of interest over your time series. If the sum ==0, there is no overlap between windows and if it is >0 then there is overlap. The question, then, becomes how should we calculate these sums efficiently? I compare two approaches. The first is simply to walk through the time series and the second is to use convolution (which is much faster). For the first one, I also explore different ways of assessing window size after summation.
Summation (slow version)
def window_sum1():
    # start of windows in list windows
    windows = [0,]
    while windows[-1] + M < N:
        check = sum(x[windows[-1]:windows[-1]+M]) == 0
        windows.append(windows[-1] + M_overlap + (M - M_overlap) * check)
        if windows[-1] + M > N:
            windows.pop()
            break
    # plotting stuff for checking
    return(windows)
import timeit
Niter = 10**4
print(timeit.timeit(window_sum1, number = Niter))
# 29.201083058
So this approach went through 10,000 time series of length 5000 in about 30 seconds. But the line windows.append(windows[-1] + M_overlap + (M - M_overlap) * check) can be streamlined with an if statement.
Summation (fast version, 33% faster than slow version)
def window_sum2():
    # start of windows in list windows
    windows = [0,]
    while windows[-1] + M < N:
        check = sum(x[windows[-1]:windows[-1]+M]) == 0
        if check:
            windows.append(windows[-1] + M)
        else:
            windows.append(windows[-1] + M_overlap)
        if windows[-1] + M > N:
            windows.pop()
            break
    # plotting stuff for checking
    return(windows)
print(timeit.timeit(window_sum2, number = Niter))
# 20.456240447000003
We see a 1/3 reduction in time with the if statement.
Convolution (85% faster than fast summation)
We can use signal processing to get a lot faster, by convolving the time series with the window of interest using numpy.convolve. (Disclaimer: I got the idea from the accepted answer to this question.) Of course, it also makes sense to adopt the faster window size assessment from above.
def window_conv():
    a = np.convolve(x, np.ones(M, dtype=int), 'valid')
    windows = [0,]
    while windows[-1] + M < N:
        if a[windows[-1]]:
            windows.append(windows[-1] + M_overlap)
        else:
            windows.append(windows[-1] + M)
        if windows[-1] + M > N:
            windows.pop()
            break
    return(windows)
print(timeit.timeit(window_conv, number = Niter))
#3.3695770570000008
Sliding window
The last thing I will add is that, as noted in one of the comments on this question, numpy 1.20 introduced a function called sliding_window_view. I still have numpy 1.19 running and was not able to test whether it is faster than convolution.
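For reference, a minimal sketch of that approach (my addition, untested, assuming numpy >= 1.20 and reusing x, M, M_overlap and N from the setup above):
from numpy.lib.stride_tricks import sliding_window_view

def window_slide():
    # per-window sums for every start position; equivalent to the 'valid' convolution above
    a = sliding_window_view(x, M).sum(axis=1)
    windows = [0,]
    while windows[-1] + M < N:
        if a[windows[-1]]:
            windows.append(windows[-1] + M_overlap)
        else:
            windows.append(windows[-1] + M)
        if windows[-1] + M > N:
            windows.pop()
            break
    return(windows)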
To begin with, I think you should change the line if np.any(y[start_cursor, end_cursor] != 0): to if np.any(y[start_cursor:end_cursor] != 0):
Anyway, I think your code can be improved at a few points.
Firstly, you can revise this part:
if np.any(y[start_cursor: end_cursor] != 0):
    start_cursor = start_cursor - overlap_size
Before these lines you have already calculated y[start_cursor:end_cursor].max(), so you already know whether any label is greater than 0. So this is better:
if data[-1]['label'] != 0:
    start_cursor -= overlap_size
A better way, though, is to store y[start_cursor:end_cursor].max() in a variable and use it both for setting 'label' and in the if check.
Secondly, you used "append" for data, which is inefficient. The better way is to preallocate the frames with zeros: the frame size is fixed and the maximum number of frames is maxNumFrame = np.ceil((arr.size - overlap_size) / (window_size - overlap_size)). So initialize frames = np.zeros((maxNumFrame, window_size)) as a first step and then fill frames inside the while loop, or, if you want to keep your customized structure, initialize your list with zero values and then change them in the while loop.
Thirdly, it is best to compute only "start_cursor" and the window label inside the while loop and store them in an array of tuples or in two arrays ("end_cursor" is redundant, since it is always start_cursor + window_size).
After that, build the frames using "map" in one of the ways described above (into one array or into your customized structure); a sketch follows below.
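Putting these suggestions together, a rough sketch might look like this (my own illustration with hypothetical names such as make_frames, not the asker's exact code; it drops any short tail window):
import numpy as np

def make_frames(arr, y, window_size, overlap_size):
    # upper bound on the number of frames (every window overlapping)
    max_num_frames = int(np.ceil((arr.size - overlap_size) / (window_size - overlap_size)))
    frames = np.zeros((max_num_frames, window_size), dtype=arr.dtype)
    labels = np.zeros(max_num_frames, dtype=y.dtype)
    n = 0
    start_cursor = 0
    while start_cursor + window_size <= arr.size:   # drops any short tail window
        end_cursor = start_cursor + window_size     # always start_cursor + window_size
        label = y[start_cursor:end_cursor].max()    # computed once, reused for the step below
        frames[n] = arr[start_cursor:end_cursor]
        labels[n] = label
        n += 1
        # overlap only when the window contains a label other than 0
        step = window_size - overlap_size if label != 0 else window_size
        start_cursor += step
    return frames[:n], labels[:n]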

Constraining frequencies when building a list of lists of integers

I am trying to write a function which will return a list of lists of integers corresponding to the pools that I will pool chemicals in. I want to keep the number of chemicals in each pool as uniform as possible. Each chemical is replicated some number of times across pools (in this example, 3 times across 9 pools). For the example below, I have 31 chemicals, so each pool should have 93/9 = 10.333 chemicals in it (or, more specifically, each pool should have floor(93/9) = 10 chemicals, with 93 % 9 = 3 pools having 11). My function for doing so is below. Currently, I'm trying to get the function to loop until there is one set of integers left (i.e. 3 pools with 9 chemicals) so that I can code the function to recognize which pools are allowed one more chemical and finalize the list of lists that tells me which pools to put each chemical in.
However, as written right now, the function does not always give my desired distribution of 11,11,11,10,10,10,9,9,9 for the frequencies of integers appearing in the list of lists. I've written the following to attempt to constrain the distribution:
1) Randomly select, without replacement, a list of bits (pool numbers). If any of the bits in the selected list have frequency >= 10 in the output list and I already have 3 pools with frequency 11, discard this list of bits.
2) If any of the bits in the selected list have frequency >= 9 in the output list, and there are 6 pools with frequency >= 10, discard this list of bits.
3) If any of the bits in the selected list have frequency >= 11 in the output list, discard this list of bits.
It seems that this bit of code isn't working properly. I'm thinking either that I've coded these three conditions improperly (it appears that some lists of bits are accidentally discarded while others are improperly added to the output list), or that there is a scenario in which two pools go from 9 to 10 chemicals in the same step, resulting in 4 pools of 10 instead of 3 pools of 10. Am I thinking about this problem wrong? Is there an obvious place where my code isn't working?
The function for generating normalized pools:
(overlapping_kbits returns a list of lists of bits, each of length replicates, with each bit being an integer in the range [1,pools], filtered such that no two lists share more than overlaps bits.)
import numpy as np
import pandas as pd
import itertools
import re
import math
from collections import Counter
def normalized_pool(pools, replicates, overlaps, ligands):
    solvent_bits = [list(bits) for bits in itertools.combinations(range(pools), replicates)]
    print(len(solvent_bits))
    total_items = ligands*replicates
    norm_freq = math.floor(total_items/pools)
    num_extra = total_items%pools
    num_norm = pools-3
    normed_bits = []
    count_extra = 0
    count_norm = 0
    while len(normed_bits) < ligands-1 and len(solvent_bits) > 0:
        rand = np.random.randint(0, len(solvent_bits))
        bits = solvent_bits.pop(rand)  # Sample without replacement
        print(bits)
        bin_freqs = Counter(itertools.chain.from_iterable(normed_bits))
        print(bin_freqs)
        previous = len(normed_bits)
        # Constrain the frequency distribution
        count_extra = len([bin_freqs[bit] for bit in bin_freqs.keys() if bin_freqs[bit] >= norm_freq+1])
        count_norm = len([bin_freqs[bit] for bit in bin_freqs.keys() if bin_freqs[bit] >= norm_freq])
        if any(bin_freqs[bit] >= norm_freq for bit in bits) and count_extra == num_extra:
            print('rejected')
            continue  # i.e. only allow num_extra number of bits to have a frequency higher than norm_freq
        elif any(bin_freqs[bit] >= norm_freq+1 for bit in bits):
            print('rejected')
            continue  # i.e. never allow any bit to be greater than norm_freq+1
        elif (any(bin_freqs[bit] >= norm_freq-1 for bit in bits) and count_norm >= num_norm):
            if count_extra == num_extra:
                print('rejected')
                continue  # only num_norm bins can have norm_freq
        normed_bits.append(bits)
    bin_freqs = Counter(itertools.chain.from_iterable(normed_bits))
    return normed_bits
test_bits = normalized_pool(9,3,2,31)
test_freqs = Counter(itertools.chain.from_iterable(test_bits))
print(test_freqs)
print(len(test_bits))
I can get anything from 11,11,11,10,10,10,9,9,9 (my desired output) to 11,11,11,10,10,10,10,10,7. For a minimal example, try:
test_bits = normalized_pool(7,3,2,10)
test_freqs = Counter(itertools.chain.from_iterable(test_bits))
print(test_freqs)
Which should return 5,5,4,4,3,3,3 as the elements of the test_freqs Counter.
EDIT: Modified the function so it can run from being copied and pasted. Merged the function call into the larger block of code since it was being overlooked.

Is there an efficient way to create a binomial experiment of N bernoulli trials in a numpy array?

Suppose I have a coin that lands on heads with probability P. The experiment to be performed is to flip the coin x times, and this experiment is to be repeated 1000 times.
Question Is there an efficient/vectorized approach to generate an array of random 1's(with probability P) and 0's (with probability 1-P)?
If I try something like this:
np.full(10,rng().choice((0,1),p= [.3,.7]))
The entire array is filled with the same selection. I have seen solutions that involve a fixed ratio of zeros to ones.
a = np.ones(n+m)
a[:m] = 0
np.random.shuffle(a)
However I'm not sure how to preserve the stochastic nature of the experiments with this set up.
Presently I am just looping through each iteration as follows, but it is very slow once the number of experiments gets large.
(The actual experiment involves terminating each trial when two consecutive heads are flipped, which is why there is a while loop in the code. For purposes of making the question specific I didn't want to address that here. )
Set = [0, 1]
T = np.ones(Episodes)
for i in range(Episodes):
    a = rng().choice(Set, p=[(1 - p), p])
    counter = 1
    while True:
        b = rng().choice(Set, p=[(1 - p), p])
        counter += 1
        if (a == 1) & (b == 1):
            break
        a = b
    T[i] = counter
Any insights would be appreciated, thanks!
Answers were provided by @Quang Hong and @Kevin in the comments above; I am just reposting them with default_rng() so they are easier to reference later. They are the true heroes here.
from numpy.random import default_rng as rng
rng().binomial(1, p = .7, size=(10,10))
rng().choice((0,1),p = [.3,.7], size=(10,10))
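As a quick sanity check (my addition, not from the referenced answers), the sample mean of the generated array should be close to p:
from numpy.random import default_rng
flips = default_rng(0).binomial(1, p=0.7, size=(1000, 10))  # 1000 experiments of 10 flips each
print(flips.mean())  # should be close to 0.7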

Generate random size-k subset from size-N (probability-weighted) set

This problem stems from a musical training game where I must choose a random 3-note chord from the 12 available pitch-classes, but certain notes are more likely than others (so that the user can train more for weaker notes).
I thought this problem would be quite simple: consider each weight as a line segment, place all segments one after the other to make a long segment, pick a random point on this long segment, record which weight it lies on, rinse and repeat until we have k items.
The following Python code demonstrates that this technique doesn't produce the correct results:
# Choose k items from a set of weights
# return set of winning indices
def Choose(W, k):
    import random
    cumulative = [sum(W[:i+1]) for i in range(len(W))]
    totalWeight = cumulative[-1]
    winners = set()
    while len(winners) < k:
        rnd = random.uniform(0.0, totalWeight)
        # Returns first element of cumulative that is >= rnd
        w = next(i for i in range(len(cumulative)) if cumulative[i] >= rnd)
        winners.add(w)
    return winners
def Test(N):
    x = [list(Choose([5, 3, 2], 2)) for i in range(int(N / 2))]
    y = sum(x, [])
    z = [y.count(i) for i in (0, 1, 2)]
    print(z)

for i in range(10):
    Test(10000)
I generate 5000 random pairs from 3 weights [5,3,2]
The output logs the number of times each weight comes up
It should be 5000,3000,2000
For good measure I run the experiment 10 times:
python test.py
[4173, 3331, 2496]
[4180, 3367, 2453]
[4193, 3393, 2414]
[4228, 3375, 2397]
[4207, 3388, 2405]
[4217, 3377, 2406]
[4173, 3438, 2389]
[4172, 3378, 2450]
[4174, 3371, 2455]
[4208, 3322, 2470]
So ~ 4200 vs 3300 vs 2400
Not 5000 vs 3000 vs 2000
Is there a simple way to understand why this doesn't work?
Is there some way of transforming the weights, maybe 'weight[i] -> ln(weight[i])' or something like this, that would give correct results?
How to achieve the correct result? (I'm more concerned about clarity of code than optimal efficiency)
Use numpy.random.choice with the p parameter:
np.random.choice(3, size=1000, p=[0.5, 0.3, 0.2])
Now try again and see what you get.
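For instance (a quick check of my own, not part of the original answer), tallying the draws shows the requested proportions:
import numpy as np
counts = np.bincount(np.random.choice(3, size=1000, p=[0.5, 0.3, 0.2]))
print(counts)  # roughly [500, 300, 200]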
Sampling without replacement with weights is a tricky problem.
First, consider your intuitive solution. You generate 5000 pairs and expect the first item (weight 5) to appear 5000 times, which means every pair must contain it. I suspect that this is not what you desired or expected. To get the distribution that you expected, you could first choose the first item, and then choose the second or third with probability .6 or .4 respectively.
To do what I suspect you are asking for, you should do something like conditional Poisson sampling. I do not know of a Python module that does this, though there almost certainly is one; the 'sampling' package in R will do it. I know of no gentle introduction on the web.
From a practical point of view, just do what you are doing and adjust the weights so that the probabilities come close to what you want. For what you are trying to do, precise probabilities do not seem necessary.
If you want a simple (though decidedly inefficient) method to achieve what you want:
1) Normalize the weights so that they sum to the desired sample size. With your example, the weights .5 + .3 + .2 sum to 1, and scaling them to sum to the sample size 2 gives the normalized weights [1., .6, .4].
2) Let p_i be the ith normalized weight, treated as a probability (each must be less than or equal to 1, or the problem is impossible). Choose a sample by selecting the ith element with probability p_i.
3) If the size of the drawn sample is correct, output it; otherwise draw again.
Here is a quick code example
import random

def sample(weights, sample_size):
    w = float(sum(weights))
    normweights = [x * sample_size / w for x in weights]
    samp = [random.random() < pi for pi in normweights]
    while sum(samp) != sample_size:
        samp = [random.random() < pi for pi in normweights]
    return [i for i, b in enumerate(samp) if b]

print(sample([.5, .3, .2], 2))
EDIT:
Ok, the above algorithm is hooey. I'll try to remember how to do it correctly.

Divide set into subsets with equal number of elements

For the purpose of conducting a psychological experiment I have to divide a set of 240 pictures, each described by 4 features (real numbers), into 3 subsets with an equal number of elements in each (240/3 = 80), in such a way that all subsets are approximately balanced with respect to these features (in terms of mean and standard deviation).
Can anybody suggest an algorithm to automate that? Are there any packages/modules in Python or R that I could use to do that? Where should I start?
If I understand your problem correctly, you might use random.sample() in Python:
import random
pool = set(["foo", "bar", "baz", "123", "456", "789"]) # your 240 elements here
slen = len(pool) // 3 # we need 3 subsets
set1 = set(random.sample(list(pool), slen)) # 1st random subset
pool -= set1
set2 = set(random.sample(list(pool), slen)) # 2nd random subset
pool -= set2
set3 = pool # 3rd random subset
I would tackle this as follows:
Divide into 3 equal subsets.
Figure out the mean and variance of each subset. From them construct an "unevenness" measure.
Compare each pair of elements from different subsets; if swapping them would reduce the "unevenness", swap them. Continue until either there are no more pairs to compare or the total unevenness is below some arbitrary "good enough" threshold (a sketch of this is below).
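A rough sketch of that idea (my own illustration, not the answerer's code; here "unevenness" is simply the spread of the per-feature means and standard deviations across the subsets):
import numpy as np

def unevenness(subsets):
    # spread of the per-feature means and standard deviations across subsets
    means = np.array([s.mean(axis=0) for s in subsets])
    stds = np.array([s.std(axis=0) for s in subsets])
    return means.std(axis=0).sum() + stds.std(axis=0).sum()

def balance(items, n_subsets=3, n_passes=5):
    items = np.asarray(items, dtype=float)
    idx = np.random.permutation(len(items))
    groups = np.array_split(idx, n_subsets)  # step 1: equal-sized subsets
    for _ in range(n_passes):                # each pass is O(n^2) in comparisons
        improved = False
        for a in range(n_subsets):
            for b in range(a + 1, n_subsets):
                for i in range(len(groups[a])):
                    for j in range(len(groups[b])):
                        before = unevenness([items[g] for g in groups])
                        groups[a][i], groups[b][j] = groups[b][j], groups[a][i]
                        if unevenness([items[g] for g in groups]) >= before:
                            # the swap did not help, so revert it
                            groups[a][i], groups[b][j] = groups[b][j], groups[a][i]
                        else:
                            improved = True
        if not improved:  # no swap improved the measure in a full pass
            break
    return [items[g] for g in groups]

# e.g. sub1, sub2, sub3 = balance(np.random.rand(240, 4))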
You can easily do this using the plyr library in R. Here is the code.
require(plyr)
# CREATE DUMMY DATA
mydf = data.frame(feature = sample(LETTERS[1:4], 240, replace = TRUE))
# SPLIT BY FEATURE AND DIVIDE INTO THREE SUBSETS EQUALLY
ddply(mydf, .(feature), summarize, sub = sample(1:3, 60, replace = TRUE))
In case you are still interested in the exhaustive-search approach: you have 240 choose 80 possibilities for the first set and then another 160 choose 80 for the second set, at which point the third set is fixed. In total, this gives you:
120554865392512357302183080835497490140793598233424724482217950647 * 92045125813734238026462263037378063990076729140
Clearly, this is not an option :)
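If you want to reproduce that count, math.comb (Python 3.8+) computes it directly:
import math
print(math.comb(240, 80) * math.comb(160, 80))  # ways to split 240 items into three labeled groups of 80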
Order your items by their decreasing Mahalanobis distance from the mean; they will be ordered from most extraordinary to most boring, including the effects of whatever correlations exist amongst the measures.
Assign X[3*i], X[3*i+1], X[3*i+2] to the subsets A, B, C, choosing for each i the ordering of A/B/C that minimizes your mismatch measure (a sketch follows below).
Why decreasing order? The statistically heavy items will be assigned first, and the choice of permutation in the larger number of subsequent rounds will have a better chance of evening out initial imbalances.
The point of this procedure is to maximize the chance that whatever outliers exist in the data set will be assigned to separate subsets.
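A minimal sketch of this procedure (my own illustration; the mismatch measure here is just the summed absolute deviation of the subsets' per-feature means and standard deviations):
import numpy as np
from itertools import permutations

def mismatch(subsets):
    # how far apart the subsets' per-feature means and standard deviations are
    means = np.array([np.mean(s, axis=0) for s in subsets])
    stds = np.array([np.std(s, axis=0) for s in subsets])
    return np.abs(means - means.mean(axis=0)).sum() + np.abs(stds - stds.mean(axis=0)).sum()

def mahalanobis_split(X, n_subsets=3):
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)
    cov_inv = np.linalg.inv(np.cov(X, rowvar=False))
    d = np.einsum('ij,jk,ik->i', centered, cov_inv, centered)  # squared Mahalanobis distances
    order = np.argsort(-d)                                     # most extraordinary items first
    subsets = [[] for _ in range(n_subsets)]
    for start in range(0, len(order), n_subsets):
        chunk = order[start:start + n_subsets]
        best_score, best_subsets = None, None
        # try every assignment of this chunk's items to the subsets, keep the best
        for perm in permutations(range(n_subsets), len(chunk)):
            trial = [list(s) for s in subsets]
            for item, sub in zip(chunk, perm):
                trial[sub].append(item)
            score = mismatch([X[t] for t in trial if t])
            if best_score is None or score < best_score:
                best_score, best_subsets = score, trial
        subsets = best_subsets
    return [np.array(s) for s in subsets]  # three arrays of row indices

# e.g. idxA, idxB, idxC = mahalanobis_split(np.random.rand(240, 4))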
