I'm working on a consensus clustering project where I run multiple versions of a clustering algorithm on a random subset of my data, and keep track of which items are assigned to which clusters. This article is very similar to what I'm doing. Imagine this process results in the data below.
       iter1  iter2  iter3  iter4
Alice      2      0      2      1
Brian      1      1      1      1
Sally      1      2      0      2
James      0      2      1      0
The values in this table are the cluster numbers the item was assigned to in that particular clustering iteration, with 0 meaning it was excluded from that iteration's clustering (the chance of inclusion is 80%). From this DataFrame I would like to calculate the consensus matrix: the fraction of iterations in which two items were assigned to the same cluster, out of the iterations in which they were both included. For example, Brian and Sally were subsampled together 3 times (iter1, iter2, iter4) but were clustered together twice, so the entries for Brian ~ Sally are 0.67, i.e. approximately 2/3. See the table below for the full consensus matrix.
       Alice  Brian  Sally  James
Alice    1.0   0.00   0.00    0.0
Brian    0.0   1.00   0.67    0.5
Sally    0.0   0.67   1.00    1.0
James    0.0   0.50   1.00    1.0
My question is: how do I go from the first DataFrame to the second? I imagine one could first make the item pairs by getting all unique items and taking combinations of length 2 (Alice~Brian, Alice~Sally, Alice~James, etc.), initialize an empty DataFrame with each name in both the rows and the columns, and then fill in each cell with a function that calculates the pair's consensus like we did for Brian ~ Sally (0.67). This already feels kind of cumbersome, though, and I'm fairly sure there is a much better way of doing this. Any help is appreciated!
Edit: I solved this with the following code. I'm not sure if there is a better way (there probably is), but here goes for future reference:
from itertools import combinations

import numpy as np
from tqdm import tqdm

# Make the square N x N matrix
c_matrix = np.zeros(shape=(len(i_table), len(i_table)))
c_matrix[:] = np.nan  # fill with NaN so the diagonal stays NaN
iteration_table = i_table.to_numpy()

# Find all i,j combinations of patients that need a consensus index value
comb = list(combinations(range(iteration_table.shape[0]), 2))

for c in tqdm(comb):
    both_clustered = 0
    same_cluster = 0
    for i, j in zip(iteration_table[c[0]], iteration_table[c[1]]):
        if i >= 0 and j >= 0:  # both items were included in this iteration
            both_clustered += 1
            if i == j:
                same_cluster += 1
    res = same_cluster / both_clustered if both_clustered != 0 else 0
    c_matrix[c[0], c[1]] = res
    c_matrix[c[1], c[0]] = res
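A possible follow-up, not part of the solution above: the pair loop can be vectorized with NumPy broadcasting, one pass per iteration column. The sketch below mirrors the `i >= 0` test, so exclusions encoded as NaN or a negative sentinel are skipped; note that the diagonal comes out as 1.0 rather than NaN, and pairs that were never included together come out as NaN.

import numpy as np

X = i_table.to_numpy(dtype=float)

# Mirror the `i >= 0 and j >= 0` test above: excluded entries
# (NaN or a negative sentinel) count as "not included".
with np.errstate(invalid='ignore'):
    included = X >= 0

inc = included.astype(int)
both = inc @ inc.T                      # iterations in which both items were included

same = np.zeros_like(both)
for col in range(X.shape[1]):
    labels = X[:, col]
    eq = (labels[:, None] == labels[None, :]) & included[:, col][:, None] & included[:, col][None, :]
    same += eq.astype(int)              # iterations in which both were included and got the same label

with np.errstate(divide='ignore', invalid='ignore'):
    consensus = same / both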
I am sure this has been done before, but I am unsure how to even phrase the question for Google and have been racking my brain for a few hours now. I can explain it with an example. Imagine you have the data below.
observation #  m1  m2  m3  m4  m5  m6
1              T   L   T   L   T   L
2              A   R   A   R   A   A
3              B   C   B   C   B   C
4              K   K   K   A   L   K
5              P   P   P   R   L   P
I want to generate some sort of similarity metric between observations that relates to how they vary across the m1-m6 variables. The actual values in the cells shouldn't matter at all.
Considering the table above, for example, observations 1 and 3 are exactly the same, since they vary in the same way across the m's (TLTLTL & BCBCBC). 1 & 3 are very similar to 2, and observations 4 and 5 are the same as each other but not similar to 1-3.
I would like an output that captures all these relationships, for example:
observation #    1    2    3    4    5
1                1  0.8    1  0.1  0.1
2              0.8    1  0.8  0.2  0.2
3                1  0.8    1  0.1  0.1
4              0.1  0.2  0.1    1    1
5              0.1  0.2  0.1    1    1
A few notes: each cell can contain more than just one letter, but again the actual contents of each cell don't matter, only the variation across the m's within each observation compared to other observations. Is there a name for what I am trying to do here? Also, I only know Python & R, so if you provide any code please have it in those (Python preferred).
It is driving me crazy that I can't figure this out. Thanks in advance for any help :)
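Since each row can be read as a labelling (a partition) of the columns m1-m6, a pairwise partition-comparison score such as the Rand index captures "same variation, ignoring the actual letters". Below is a minimal sketch, assuming scikit-learn is available and rebuilding the example table by hand; the exact numbers will differ from the hand-estimated matrix above.

import pandas as pd
from sklearn.metrics import adjusted_rand_score

# Hypothetical reconstruction of the example table
df = pd.DataFrame(
    [list("TLTLTL"), list("ARARAA"), list("BCBCBC"), list("KKKALK"), list("PPPRLP")],
    index=[1, 2, 3, 4, 5],
    columns=[f"m{i}" for i in range(1, 7)],
)

# Pairwise similarity: treat each row's values as cluster labels of m1..m6
sim = pd.DataFrame(
    [[adjusted_rand_score(df.loc[a], df.loc[b]) for b in df.index] for a in df.index],
    index=df.index,
    columns=df.index,
)
print(sim.round(2))

Rows 1 and 3 (and rows 4 and 5) score 1.0 because their partitions are identical, which matches the behaviour described in the question.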
I have a list of places and I need to find the distance between each pair of them. Can anyone suggest a faster method? There are about 10k unique places; the method I'm using creates a 10k x 10k matrix and I'm running out of memory. I have 15 GB of RAM.
test_df
   Latitude  Longitude  site
0      32.3      -94.1     1
1      35.2      -93.1     2
2      33.1      -83.4     3
3      33.2      -94.5     4
import pandas as pd
from haversine import haversine

test_df = test_df[['site', 'Longitude', 'Latitude']]
test_df['coord'] = list(zip(test_df['Longitude'], test_df['Latitude']))

# One distance column per site -- this is what builds the full N x N matrix in memory
for _, row in test_df.iterrows():
    test_df[row.coord] = round(test_df['coord'].apply(lambda x: haversine(row.coord, x, unit='mi')), 2)

df = test_df.rename(columns=dict(zip(test_df['coord'], test_df['site'])))
df.drop(['coord', 'Longitude', 'Latitude'], axis=1, inplace=True)

new_df = pd.melt(df, id_vars='site', value_vars=df.columns[1:])
new_df.rename(columns={'variable': 'Place', 'value': 'dist_in_mi'}, inplace=True)
new_df
site Place dist_in_mi
0 1 1 0.00
1 2 1 70.21
2 3 1 739.28
3 4 1 28.03
4 1 2 70.21
5 2 2 0.00
6 3 2 670.11
7 4 2 97.15
8 1 3 739.28
9 2 3 670.11
10 3 3 0.00
11 4 3 766.94
12 1 4 28.03
13 2 4 97.15
14 3 4 766.94
15 4 4 0.00
If you want to resolve your memory problem, you need to use datatypes that use less memory.
In this case, since the maximum distance between two points on Earth is about 20,000 km (well within the uint16 range), you can use uint16 to store the value, if a 1 km resolution is enough for you.
Since I didn't have any data to work with, I generated some with the following code:
import random
import numpy as np
from haversine import haversine

def getNFacilities(n):
    """Yields n random pairs of coordinates in the range [-90, +90]."""
    for i in range(n):
        yield random.random()*180 - 90, random.random()*180 - 90

facilities = list(getNFacilities(10000))
Then I resolved the memory problem in two different ways:
1- By storing the distance data in uint16 numbers
def calculateDistance(start, end):
    mirror = start is end  # if the matrix is mirrored, the values are calculated just once instead of twice
    out = np.zeros((len(start), len(end)), dtype=np.uint16)  # might be better to use empty?
    for i, coords1 in enumerate(start[mirror:], mirror):
        for j, coords2 in enumerate(end[:mirror and i or None]):
            out[i, j] = int(haversine(coords1, coords2))
    return out
After calculating the distances, the memory used by the array was about 200 MB:
In [133]: l = calculateDistance(facilities, facilities)
In [134]: sys.getsizeof(l)
Out[134]: 200000112
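That size is what you would expect: 10,000 x 10,000 entries x 2 bytes per uint16 = 200,000,000 bytes, plus roughly 112 bytes of array overhead.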
2- Alternatively, you can just use a generator:
def calculateDistance(start, end):
    mirror = start is end  # if the matrix is mirrored, the values are calculated just once
    for i, coords1 in enumerate(start[mirror:], mirror):
        for j, coords2 in enumerate(end[:mirror and i or None]):
            yield [i, j, haversine(coords1, coords2)]
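A usage sketch of my own (not from the original answer): consume the generator lazily, for example by streaming the pairs straight to a CSV file, so the full 10k x 10k matrix never has to sit in memory.

import csv

# Stream pairwise distances to disk instead of holding them all in RAM
with open("distances.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["i", "j", "distance_km"])
    for i, j, dist in calculateDistance(facilities, facilities):
        writer.writerow([i, j, round(dist, 2)])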
For this question, I am using a FIFA dataset. I used a slicer/filter on df to view only players with 4+ skill moves and assigned it to a variable. I then took a quick snapshot using value_counts() to see which teams hold the most players with 4+ skill moves. Ultimately, I would like to preserve this view if possible because the ranking is easy to understand.
My question is: what if I wanted to add a new column that gives me the count of 4-skillers for each row/club_name, and similarly another column giving me the count of 5-skillers? For example, let's say Real Madrid had three 5-skillers and nine 4-skillers. The new columns would each show those counts accordingly. What would be the best way to do this?
*edit: df.skill_moves is an int column ranging 1-5.
You can have multiple named aggregates like so:
fourfive_skillers.groupby('club_name')['skill_moves'].agg(
    total='count',
    four_skills=lambda x: sum(x == 4),
    five_plus_skills=lambda x: sum(x >= 5))
I have a different dataset than you, but the output would be similar to:
Out[52]:
total four_skills five_plus_skills
club_name
1. FC Kaiserslautern 1 1 0
1. FC Köln 1 1 0
1. FC Nürnberg 4 4 0
1. FC Union Berlin 1 1 0
1. FSV Mainz 05 2 1 1
... ... ... ...
Wolverhampton Wanderers 5 5 0
Yeni Malatyaspor 1 1 0
Yokohama F. Marinos 1 1 0
Çaykur Rizespor 1 1 0
Śląsk Wrocław 1 1 0
Another commonly done thing is to have percentages of the total for each additional column. You can do that like this:
fourfive_skillers.groupby('club_name')['skill_moves'].agg(
    total='count',
    four_skills=lambda x: sum(x == 4),
    four_skills_pct=lambda x: sum(x == 4) / len(x),
    five_plus_skills=lambda x: sum(x >= 5),
    five_plus_skills_pct=lambda x: sum(x >= 5) / len(x))
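The question also asks about attaching the counts to every player row rather than collapsing to one row per club. A minimal sketch using groupby().transform() (my addition, assuming the filtered frame is still called fourfive_skillers):

# Add club-level counts as new columns without collapsing the rows
fourfive_skillers['four_skills'] = (
    fourfive_skillers.groupby('club_name')['skill_moves']
    .transform(lambda x: (x == 4).sum())
)
fourfive_skillers['five_plus_skills'] = (
    fourfive_skillers.groupby('club_name')['skill_moves']
    .transform(lambda x: (x >= 5).sum())
)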
I am new to Python and am trying to write code that creates a new DataFrame based on conditions from an old DataFrame combined with the result in the cell above it in the new DataFrame.
Here is an example of what I am trying to do:
[image: the raw data]
I need to create a new DataFrame where, if the corresponding position in the raw data is 0, the result is 0; if it is greater than 0, the result is 1 plus the value in the row above.
Then I need to remove any instances where the number of consecutive intervals doesn't reach at least 3.
The way I think about the code is as follows, but being new to Python I am struggling.
From the raw data to DataFrame 2:
if (1,1) = 0 then (1a,1a) = 0;  # line 1
else (1a,1a) = 1;
if (2,1) = 0 then (2a,1a) = 0;  # line 2
else (2a,1a) = (1a,1a) + 1 = 2;
if (3,1) = 0 then (3a,1a) = 0;  # line 3
From DataFrame 2 to 3:
If any of the last 3 rows is greater than 3, return that cell's value; else return 0.
I am not sure how to make any of these work. If there is an easier way to do or think about this than what I am doing, please let me know. Any help is appreciated!
Based on your question, the output I was able to generate was:
Earlier, the DataFrame looked like so:
A B C
0.05 5 0 0
0.10 7 0 1
0.15 0 0 12
0.20 0 4 3
0.25 1 0 5
0.30 21 5 0
0.35 6 0 9
0.40 15 0 0
Now, the DataFrame looks like so:
A B C
0.05 0 0 0
0.10 0 0 1
0.15 0 0 2
0.20 0 0 3
0.25 1 0 4
0.30 2 0 0
0.35 3 0 0
0.40 4 0 0
The code I used for this is given below. Just copy it into a new file, say code.py, and run it:
import re
import pandas as pd

def get_continous_runs(ext_list, threshold):
    """Return (start, end) spans of runs of at least `threshold` consecutive non-zero values."""
    mylist = list(ext_list)
    for i in range(len(mylist)):
        if mylist[i] != 0:
            mylist[i] = 1
    samp = "".join(map(str, mylist))
    finder = re.finditer(r"1{%s,}" % threshold, samp)
    ranges = [x.span() for x in finder]
    return ranges

def build_column(ranges, max_len):
    """Turn run spans into a column of running counts (1, 2, 3, ...), with 0 everywhere else."""
    answer = [0] * max_len
    for r in ranges:
        start = r[0]
        run_len = r[1] - start
        for i in range(run_len):
            answer[start + i] = i + 1
    return answer

def main(df):
    print("Earlier, the DataFrame looked like so:")
    print(df)
    ndf = df.copy()
    for col_name, col_data in df.items():  # .items() replaces .iteritems(), which was removed in pandas 2.0
        ranges = get_continous_runs(col_data.values, 4)
        column_len = len(col_data.values)
        new_column = build_column(ranges, column_len)
        ndf[col_name] = new_column
    print("\nNow, the DataFrame looks like so:")
    print(ndf)
    return

if __name__ == '__main__':
    raw_data = [
        (5, 0, 0), (7, 0, 1), (0, 0, 12), (0, 4, 3),
        (1, 0, 5), (21, 5, 0), (6, 0, 9), (15, 0, 0),
    ]
    df = pd.DataFrame(
        raw_data,
        columns=list("ABC"),
        index=[0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40],
    )
    main(df)
You can adjust the threshold passed to get_continous_runs() inside main() to require a consecutive number of intervals other than 4 (i.e. more than 3).
As always, start by reading the main() function to understand how everything works. I have tried to use good variable names to aid understanding. My method might seem a little contrived because I am using regex, but I didn't want to overwhelm a complete beginner with a custom run-length counter, so...
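If you later want to drop the regex, the same per-column logic can be written with pandas grouping. This is just an alternative sketch, not the method above; threshold plays the same role (4 means runs of at least 4 consecutive non-zero intervals):

def runs_column(s, threshold=4):
    """Running count of consecutive non-zero values, zeroed out for runs shorter than threshold."""
    nonzero = s.ne(0).astype(int)
    groups = s.eq(0).cumsum()                        # a new group id starts at every zero
    runs = nonzero.groupby(groups).cumsum()          # 1, 2, 3, ... within each run
    run_len = nonzero.groupby(groups).transform('sum')
    return runs.where(run_len >= threshold, 0)

ndf = df.apply(runs_column)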
I have a (very large) Series that contains keywords; each row contains multiple keywords separated by a '-', for example:
In[5]: word_series
Out[5]:
0 the-cat-is-pink
1 blue-sea
2 best-job-ever
dtype: object
I have another Series that contains a score attributed to each word (the words are the index, the scores are the values), for example:
In[7]: all_scores
Out[7]:
the 0.34
cat 0.56
best 0.01
ever 0.77
is 0.12
pink 0.34
job 0.01
sea 0.87
blue 0.65
dtype: float64
All the words in my word_series appear in all_scores. I am trying to find the fastest way to attribute a score to each row of word_series, based on the average of the scores of its words in all_scores. If a row comes out as n/a, its score should be the overall average of all_scores.
I have tried using apply this way, but it was too slow.
scores = word_series.apply(
    lambda x: all_scores[x.split('-')].mean()
).fillna(all_scores.mean())
I then thought I could split all_words into columns using str.split and maybe perform a matrix-multiplication-type operation using this new matrix M and my scores, like M.mul(all_scores), where each row in M is matched to values based on the index of all_scores. That would be a first step; to get the mean I could then just divide by the number of non-NA entries in each row.
In[9]: all_words.str.split('-', expand=True)
Out[9]:
0 1 2 3
0 the cat is pink
1 blue sea None None
2 best job ever None
Is such an operation possible? Or is there another fast way to achieve this?
Working with string data is slow in pandas, so use a list comprehension with per-word lookups into the Series and mean:
from statistics import mean
L = [mean(all_scores.get(y) for y in x.split('-')) for x in word_series]
a = pd.Series(L, index=word_series.index)
print (a)
0 0.340000
1 0.760000
2 0.263333
dtype: float64
Or:
def mean(a):
    return sum(a) / len(a)
L = [mean([all_scores.get(y) for y in x.split('-')]) for x in word_series]
a = pd.Series(L, index=word_series.index)
If it is possible that some values are not matched, pass np.nan as the default to get and use numpy.nanmean:
L = [np.nanmean([all_scores.get(y, np.nan) for y in x.split('-')]) for x in word_series]
a = pd.Series(L, index=word_series.index)
Or:
def mean(a):
    return sum(a) / len(a)

L = [mean([all_scores.get(y, np.nan) for y in x.split('-') if y in all_scores.index])
     for x in word_series]
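The split/expand idea from the question also works and stays vectorized. A sketch (my addition), including the global-mean fallback for rows that come out NaN, as the question requested:

words = word_series.str.split('-', expand=True)          # one word per column, None as padding
scores = words.apply(lambda col: col.map(all_scores))    # look up each word's score, NaN if absent
a = scores.mean(axis=1).fillna(all_scores.mean())        # row-wise mean, ignoring NaN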
Here's a way:
print(a)
words
0 the-cat-is-pink
1 blue-sea
2 best-job-ever
print(b)
all_scores
the 0.34
cat 0.56
best 0.01
ever 0.77
is 0.12
pink 0.34
job 0.01
sea 0.87
blue 0.65
b = b.reset_index()
print(b)
index all_scores
0 the 0.34
1 cat 0.56
2 best 0.01
3 ever 0.77
4 is 0.12
5 pink 0.34
6 job 0.01
7 sea 0.87
8 blue 0.65
a['score'] = a['words'].str.split('-').apply(
    lambda x: sum([b[b['index'] == w].reset_index()['all_scores'][0] for w in x]) / len(x)
)
Output:
words score
0 the-cat-is-pink 0.340000
1 blue-sea 0.760000
2 best-job-ever 0.263333