I am sure this has been done before, but I am unsure how to even phrase the question for Google and have been racking my brain for a few hours now. I can explain it with an example. Imagine you have the data below.
observation #   m1  m2  m3  m4  m5  m6
1               T   L   T   L   T   L
2               A   R   A   R   A   A
3               B   C   B   C   B   C
4               K   K   K   A   L   K
5               P   P   P   R   L   P
I want to generate some sort of similarity metric between observations that relates to the variation across the m1-6 variables. The actual values in the cells shouldn't matter at all.
Considering the table above, for example observations 1 and 3 are exactly the same as they vary the same across the m's (TLTLTL & BCBCBC). 1 & 3 are very similar to 2, and observations 4 and 5 are the same but not similar to 1-3.
I would like an output that captures all these relationships, for example:
observation #     1     2     3     4     5
1               1.0   0.8   1.0   0.1   0.1
2               0.8   1.0   0.8   0.2   0.2
3               1.0   0.8   1.0   0.1   0.1
4               0.1   0.2   0.1   1.0   1.0
5               0.1   0.2   0.1   1.0   1.0
A few notes - each cell can have more than just one letter, but again the actual contents of each cell don't matter - just the variation across the m's within each observation compared to other observations. Is there a name for what I am trying to do here? Also, I only know Python & R, so if you provide any code please have it in those (Python preferred).
It is driving me crazy that I can't figure this out. Thanks in advance for any help :)
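One way to read this is as comparing, for each observation, the partition of the m-columns induced by "which columns share the same value", and then scoring pairs of observations by how much those partitions agree - which is what clustering-comparison measures such as the Rand index or adjusted Rand index do. Below is a minimal sketch along those lines using pandas and scikit-learn; the exact numbers will differ from the illustrative 0.8/0.1 values above (the adjusted index can even go slightly negative for patterns that agree less than chance).
import pandas as pd
from itertools import combinations
from sklearn.metrics import adjusted_rand_score

df = pd.DataFrame(
    [list("TLTLTL"), list("ARARAA"), list("BCBCBC"), list("KKKALK"), list("PPPRLP")],
    index=[1, 2, 3, 4, 5],
    columns=[f"m{i}" for i in range(1, 7)],
)

# Relabel each row's values as 0, 1, 2, ... in order of appearance; this throws
# away the actual letters and keeps only the pattern of repetition.
patterns = {obs: pd.factorize(df.loc[obs])[0] for obs in df.index}

sim = pd.DataFrame(1.0, index=df.index, columns=df.index)
for a, b in combinations(df.index, 2):
    score = adjusted_rand_score(patterns[a], patterns[b])
    sim.loc[a, b] = sim.loc[b, a] = score
print(sim)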
I'm working on a consensus clustering project where I run multiple versions of a clustering algorithm on a random subset of my data, and keep track of which items are assigned to which clusters. This article is very similar to what I'm doing. Imagine this process results in the data below.
iter1 iter2 iter3 iter4
Alice 2 0 2 1
Brian 1 1 1 1
Sally 1 2 0 2
James 0 2 1 0
The values in this table are the cluster numbers the item has been assigned to in that particular clustering iteration, and 0 when it's excluded from that iteration's clustering (chance for inclusion is 80%). From this DataFrame I would like to calculate the consensus matrix that states how many times two items are in the same cluster, of the iterations in which they were both included. So e.g. Brian and Sally were subsampled together 3 times (iter1, iter2, iter4) but were clustered together twice. Thus, entries for Brian ~ Sally are 0.67 which is approximately 2/3. See the table below for the full consensus matrix.
      Alice  Brian  Sally  James
Alice  1.00   0.33   0.00   0.00
Brian  0.33   1.00   0.67   0.50
Sally  0.00   0.67   1.00   1.00
James  0.00   0.50   1.00   1.00
My question is: how do I go from the first DataFrame to the second? I imagine one could make the item pairs first by getting all unique items and then making combinations of length 2 (Alice~Brian, Alice~Sally, Alice~James etc) and initialize the empty dataframe so each name is both in the rows and in the columns. Then fill in each cell based on a function that calculates the pair's consensus like we did with Brian ~ Sally (0.67). This, however, already feels kind of cumbersome and I'm fairly sure there is a way better way of doing this. Any help is appreciated!
Edit: I solved this with the following code. I'm not sure if there is a better way (there probably is), but here goes for future reference:
import numpy as np
from itertools import combinations
from tqdm import tqdm

# Make the square N x N matrix
c_matrix = np.zeros(shape=(len(i_table), len(i_table)))
c_matrix[:] = np.nan  # fill with NaN so the diagonal stays NaN
iteration_table = i_table.to_numpy()

# Find all i,j combinations of patients that need a consensus index value
comb = list(combinations(range(iteration_table.shape[0]), 2))

for c in tqdm(comb):
    both_clustered = 0
    same_cluster = 0
    # 0 means the item was excluded from that iteration
    for i, j in zip(iteration_table[c[0]], iteration_table[c[1]]):
        if i > 0 and j > 0:
            both_clustered += 1
            if i == j:
                same_cluster += 1
    res = same_cluster / both_clustered if both_clustered != 0 else 0
    c_matrix[c[0]][c[1]] = res
    c_matrix[c[1]][c[0]] = res
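Also for future reference, a vectorized sketch of the same computation with NumPy broadcasting, assuming as above that i_table holds the iteration table with 0 marking an excluded item (the diagonal comes out as 1.0 here rather than NaN):
import numpy as np

iteration_table = i_table.to_numpy()        # rows = items, columns = iterations
included = iteration_table > 0              # True where the item took part in that iteration

# number of iterations in which both items of a pair were included
both = included.astype(int) @ included.astype(int).T

# number of those iterations in which the pair also received the same cluster label
same = ((iteration_table[:, None, :] == iteration_table[None, :, :])
        & included[:, None, :] & included[None, :, :]).sum(axis=2)

# consensus matrix; NaN where a pair was never subsampled together
c_matrix = np.divide(same, both, out=np.full(both.shape, np.nan), where=both > 0)
The broadcasting step builds an items x items x iterations boolean array, so it trades memory for speed; for very large tables the pairwise loop above may still be preferable.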
I am trying to find the point at which a kernel density estimate becomes almost 0 at the end of its tail. My approach is to evaluate the kernel function over a timeline from -120 to 120 and compute the percentage change of those values, so that I can apply an arbitrary rule: after 10 consecutive negative changes, when the kernel value is almost 0, I declare that point as the start of the end of the curve.
Illustration of the point on the kernel function that I want to obtain: in this case, the final value I would like to obtain is around 300.
My dataframe looks like this (these are not the same example values as above):
df
id event_time
1 2
1 3
1 3
1 5
1 9
1 10
2 1
2 1
2 2
2 2
2 5
2 5
# my try
import numpy as np
from scipy import stats

def find_value(df):
    if df.shape[0] == 1:
        return df.iloc[0].event_time
    kernel = stats.gaussian_kde(df['event_time'])
    time = list(range(-120, 120))
    a = kernel(time)               # density values (the Y axis of the plot)
    b = np.diff(a) / a[:-1] * 100  # percentage change between consecutive density values
So far I have a, which represents the Y axis of the graph, and b, which represents the change in Y. I did this to implement the logic described at the beginning, but I don't know how to code it. After writing the function I was thinking of using a groupby and an apply.
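Here is a minimal sketch of one way to finish find_value under the rule described at the beginning: walk along the evaluated densities and return the first time point reached after 10 consecutive negative percentage changes at which the density itself is close to zero. The window of 10 comes from the question; the near-zero threshold eps is an arbitrary assumption, not a fixed value.
import numpy as np
from scipy import stats

def find_value(df, n_consecutive=10, eps=1e-4):
    if df.shape[0] == 1:
        return df.iloc[0].event_time
    kernel = stats.gaussian_kde(df['event_time'])
    time = np.arange(-120, 120)
    a = kernel(time)               # density values
    b = np.diff(a) / a[:-1] * 100  # percentage change
    negative_run = 0
    for idx, change in enumerate(b, start=1):
        negative_run = negative_run + 1 if change < 0 else 0
        if negative_run >= n_consecutive and a[idx] < eps:
            return time[idx]
    return time[-1]                # fall back to the last evaluated point

# applied per id, as suggested:
# result = df.groupby('id').apply(find_value)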
I have a problem calculating variance with "hidden" NULL (zero) values. Usually that shouldn't be a problem, because a NULL value is not a value, but in my case it is essential to include those NULLs as zeros in the variance calculation. So I have a DataFrame that looks like this:
TableA:
A X Y
1 1 30
1 2 20
2 1 15
2 2 20
2 3 20
3 1 30
3 2 35
Then I need to get the variance for each different X value, and I do this:
TableA.groupby(['X']).agg({'Y':'var'})
But the answer is not what I need, since the variance calculation should also include a zero value of Y for X=3 when A=1 and when A=3.
What my dataset should look like to get the needed variance results:
A X Y
1 1 30
1 2 20
1 3 0
2 1 15
2 2 20
2 3 20
3 1 30
3 2 35
3 3 0
So I need the variance to take into account that every X should have an entry for A = 1, 2 and 3, and when there is no value of Y for a certain X it should be 0. Could you help me with this? How should I change my TableA dataframe to be able to do this, or is there another way?
Desired output for TableA should be like this:
X Y
1 75.000000
2 75.000000
3 133.333333
Compute the variance directly, but divide by the number of different possibilities for A:
import numpy as np

# three in your example. adjust as needed
a_choices = len(TableA['A'].unique())

def variance_with_missing(vals):
    mean_with_missing = np.sum(vals) / a_choices
    ss_present = np.sum((vals - mean_with_missing)**2)
    ss_missing = (a_choices - len(vals)) * mean_with_missing**2
    return (ss_present + ss_missing) / (a_choices - 1)

TableA.groupby(['X']).agg({'Y': variance_with_missing})
The approach of the solution below is to append the missing combinations with Y=0. It's a little messy, but I hope this will help.
import numpy as np
import pandas as pd

TableA = pd.DataFrame({'A': [1, 1, 2, 2, 2, 3, 3],
                       'X': [1, 2, 1, 2, 3, 1, 2],
                       'Y': [30, 20, 15, 20, 20, 30, 35]})
TableA['A'] = TableA['A'].astype(int)

#### Create rows for the non-existing (X, A) combinations and fill Y with 0 ####
for i in range(1, TableA.X.max() + 1):
    for j in TableA.A.unique():
        if TableA[(TableA.X == i) & (TableA.A == j)].empty:
            TableA = pd.concat([TableA, pd.DataFrame({'A': [j], 'X': [i], 'Y': [0]})],
                               ignore_index=True)

TableA.groupby('X').agg({'Y': 'var'})
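As an alternative sketch of the same idea without the explicit loops, one could start from the original TableA, complete the (A, X) grid with pivot_table(..., fill_value=0), and then use the built-in sample variance:
full = (TableA.pivot_table(index='A', columns='X', values='Y', fill_value=0)
              .stack()            # back to long format with one row per (A, X)
              .rename('Y')
              .reset_index())
full.groupby('X')['Y'].var()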
I can't understand the output of this code.
print(1//.2)
print(10//2)
The output of the 1st is 4.0, and the output of the 2nd is 5.
To clarify this: the // operator performs floor division, i.e. it gives you the result rounded down to the nearest whole number. So:
10 // 2 = 5
11 // 2 = 5
12 // 2 = 6
To get the rest you can use the modulo operator %:
10 % 2 = 0
11 % 2 = 1
12 % 2 = 0
Now, when you specify 0.2, this is taken as a floating-point value. Floating-point numbers cannot store 0.2 exactly; the value actually stored is slightly larger, approximately 0.2000000000000000111.
This means that when you do:
1 // 0.2
what you are really doing is:
1 // 0.2000000000000000111
which of course evaluates to 4. If you then try the modulo operator, you will see you get a value less than 0.2. When I ran this I saw:
>>> 1 // 0.2
4.0
>>> 1 % 0.2
0.19999999999999996
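For reference, the exact value stored for 0.2 can be inspected with the standard library's decimal module:
from decimal import Decimal

print(Decimal(0.2))   # 0.200000000000000011102230246251565404236316680908203125
print(1 // 0.2)       # 4.0
print(1 % 0.2)        # 0.19999999999999996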
I am researching how Python implements dictionaries. The Python dictionary implementation does its pseudo-random probing for an empty dictionary slot using the equation
j = ((j*5) + 1) % 2**i
which is explained here.
I have read this question, How are Python's Built In Dictionaries Implemented?, and basically understand how dictionaries are implemented.
What I don't understand is why/how the equation:
j = ((j*5) + 1) % 2**i
cycles through all the remainders of 2**i. For instance, if i = 3, for a total starting size of 8, j goes through the cycle:
0 1 6 7 4 5 2 3 0
if the starting size is 16, it would go through the cycle:
0 1 6 15 12 13 2 11 8 9 14 7 4 5 10 3 0
This is very useful for probing all the slots in the dictionary. But why does it work? Why does j = ((j*5)+1) work, but not j = ((j*6)+1) or j = ((j*3)+1), both of which get stuck in smaller cycles?
I am hoping to get a more intuitive understanding of this than "the equation just works, and that's why they used it".
This is the same principle that pseudo-random number generators use, as Jasper hinted at, namely linear congruential generators. A linear congruential generator is a sequence that follows the relationship X_(n+1) = (a * X_n + c) mod m. From the wiki page,
The period of a general LCG is at most m, and for some choices of factor a much less than that. The LCG will have a full period for all seed values if and only if:
m and c are relatively prime.
a - 1 is divisible by all prime factors of m.
a - 1 is divisible by 4 if m is divisible by 4.
It's easy to see that 5 is the smallest a greater than 1 to satisfy these requirements, namely:
2^i and 1 are relatively prime (these are m and c).
a - 1 = 4 is divisible by 2, the only prime factor of 2^i.
a - 1 = 4 is divisible by 4, and m = 2^i is divisible by 4 (for i >= 2).
Also interestingly, 5 is not the only number that satisfies these conditions. 9 will also work. Taking m to be 16, using j=(9*j+1)%16 yields
0 1 10 11 4 5 14 15 8 9 2 3 12 13 6 7
The proof for these three conditions can be found in the original Hull-Dobell paper on page 5, along with a bunch of other PRNG-related theorems that also may be of interest.
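A small sketch to check this empirically: it follows the recurrence from 0 until a value repeats, so a full-period multiplier visits every slot, while the other multipliers from the question stall in a shorter cycle.
def probe_cycle(multiplier, size):
    """Follow j = (j*multiplier + 1) % size from j = 0 until a value repeats."""
    j, seen, seq = 0, set(), []
    while j not in seen:
        seen.add(j)
        seq.append(j)
        j = (j * multiplier + 1) % size
    return seq

print(probe_cycle(5, 8))    # [0, 1, 6, 7, 4, 5, 2, 3]  -> full period, every slot visited
print(probe_cycle(9, 16))   # [0, 1, 10, 11, 4, 5, 14, 15, 8, 9, 2, 3, 12, 13, 6, 7]
print(probe_cycle(6, 8))    # [0, 1, 7, 3]  -> stalls, since 3*6 + 1 maps back to 3 mod 8
print(probe_cycle(3, 8))    # [0, 1, 4, 5]  -> short cycle back to 0, most slots never probed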