How to to calculate unnested watersheds in GRASS GIS? - python

I am running into a few issues using the GRASS GIS module r.accumulate while running it in Python. I use the module to calculate sub watersheds for over 7000 measurement points. Unfortunately, the output of the algorithm is nested. So all sub watersheds are overlapping each other. Running the r.accumulate sub watershed module takes roughly 2 minutes for either one or multiple points, I assume the bottleneck is loading the direction raster.
I was wondering if there is an unnested variant in GRASS GIS available and if not, how to overcome the bottleneck of loading the direction raster every time you call the module accumulate. Below is a code snippet of what I have tried so far (resulting in a nested variant):
locations = VectorTopo('locations',mapset='PERMANENT')
locations.open('r')
points=[]
for i in range(len(locations)):
points.append(locations.read(i+1).coords())
for j in range(0,len(points),255):
output = "watershed_batch_{}#Watersheds".format(j)
gs.run_command("r.accumulate", direction='direction#PERMANENT', subwatershed=output,overwrite=True, flags = "r", coordinates = points[j:j+255])
gs.run_command('r.stats', flags="ac", input=output, output="stat_batch_{}.csv".format(j),overwrite=True)
Any thoughts or ideas are very welcome.

I already replied to your email, but now I see your Python code and better understand your "overlapping" issue. In this case, you don't want to feed individual outlet points one at a time. You can just run
r.accumulate direction=direction#PERMANENT subwatershed=output outlet=locations
r.accumulate's outlet option can handle multiple outlets and will generate non-overlapping subwatersheds.

The answer provided via email was very usefull. To share the answer I have provided the code below to do an unnested basin subwatershed calculation. A small remark: I had to feed the coordinates in batches as the list of coordinates exceeded the max length of characters windows could handle.
Thanks to #Huidae Cho, the call to R.accumulate to calculate subwatersheds and longest flow path can now be done in one call instead of two seperate calls.
The output are unnested basins. Where the largers subwatersheds are seperated from the smaller subbasins instead of being clipped up into the smaller basins. This had to with the fact that the output is the raster format, where each cell can only represent one basin.
gs.run_command('g.mapset',mapset='Watersheds')
gs.run_command('g.region', rast='direction#PERMANENT')
StationIds = list(gs.vector.vector_db_select('locations_snapped_new', columns = 'StationId')["values"].values())
XY = list(gs.vector.vector_db_select('locations_snapped_new', columns = 'x_real,y_real')["values"].values())
for j in range(0,len(XY),255):
output_ws = "watershed_batch_{}#Watersheds".format(j)
output_lfp = "lfp_batch_{}#Watersheds".format(j)
output_lfp_unique = "lfp_unique_batch_{}#Watersheds".format(j)
gs.run_command("r.accumulate", direction='direction#PERMANENT', subwatershed=output_ws, flags = "ar", coordinates = XY[j:j+255],lfp=output_lfp, id=StationIds[j:j+255], id_column="id",overwrite=True)
gs.run_command("r.to.vect", input=output_ws, output=output_ws, type="area", overwrite=True)
gs.run_command("v.extract", input=output_lfp, where="1 order by id", output=output_lfp_unique,overwrite=True)
To export the unique watersheds I used the following code. I had to transform the longest_flow_path to point as some of the longest_flow_paths intersected with the corner boundary of the watershed next to it. Some longest flow paths were thus not fully within the subwatershed. See image below where the red line (longest flow path) touches the subwatershed boundary:
enter image description here
gs.run_command('g.mapset',mapset='Watersheds')
lfps= gs.list_grouped('vect', pattern='lfp_unique_*')['Watersheds']
ws= gs.list_grouped('vect', pattern='watershed_batch*')['Watersheds']
files=np.stack((lfps,ws)).T
#print(files)
for file in files:
print(file)
ids = list(gs.vector.vector_db_select(file[0],columns="id")["values"].values())
for idx in ids:
idx=int(idx[0])
expr = f'id="{idx}"'
gs.run_command('v.extract',input=file[0], where=expr, output="tmp_lfp",overwrite=True)
gs.run_command("v.to.points", input="tmp_lfp", output="tmp_lfp_points", use="vertex", overwrite=True)
gs.run_command('v.select', ainput= file[1], binput = "tmp_lfp_points", output="tmp_subwatersheds", overwrite=True)
gs.run_command('v.db.update', map = "tmp_subwatersheds",col= "value", value=idx)
gs.run_command('g.mapset',mapset='vector_out')
gs.run_command('v.dissolve',input= "tmp_subwatersheds#Watersheds", output="subwatersheds_{}".format(idx),col="value",overwrite=True)
gs.run_command('g.mapset',mapset='Watersheds')
gs.run_command("g.remove", flags="f", type="vector",name="tmp_lfp,tmp_subwatersheds")
I ended up with a vector for each subwatershed

Related

Saving basic and extended stats of street network for multiple polygons using OSMNX

Based on a given place name I use OSMNX to get street network graphs. I layed a grid over the street network dividing the network into multiple cells, for which each cell's polygon is stored in a geo-series object. With the following function I would like to store the basic_stats/extended_stats in a list and later convert it to a dataframe:
def stats_per_cell(geoseries):
network_stats = []
for i in range(0, len(geoseries)):
try:
# Get graph for grid cell -> row
row = ox.graph_from_polygon(geoseries[i], truncate_by_edge=True)
# Save basic stats for graph in list -> grid_stats
grid_stats = ox.basic_stats(row)
# Add grid cell id
grid_stats['index'] = i
print(grid_stats['index'])
# Append in list -> network_stats
network_stats.append(grid_stats)
print(grid_stats)
except:
print('Error' + str(i))
traceback.print_exc()
continue
# Save entire list in df -> street features
street_features = pd.DataFrame.from_dict(network_stats, orient='columns')
return street_features
The code works fine, however after lets say 1000 iterations, it becomes very slow in comparison to the lists it had appended before. I tried to split the number of iterations for each run, reduced the dictionary from basic_stats to only a selection of keys, and assigned more memory to the process.
Still it takes quite a long time. I used Line_Profiler to get the time for each line:
Does anyone have a suggestion what might cause the slow run time, how to increase the speed or what I could do to improve the code?
Cheers

Is there any plot method in Python that can give me the plot in the image below? If not, could someone help in implementing it in R?

So in my project that is on football(soccer), what I want to find is how a title winning team goes on a winning run. For eg. 18 wins in a row that helped them to the title. So I want to show the trend/pattern of how they're winnning the consecutive games. So I have a csv file in which i have columns of W/D/L ( win/draw/loss) which consist of the data for this pattern. I'm doing my project using Python but the person who obtained the image using R of which I have no idea about. So if anyone could help me in obtaining this image in Python or R, it would be appreaciated.
The image has been attached below. Thanks for any help :).
WDL pattern of teams
Here's one way to do it in R using some made up data:
library(ggplot2)
#Some test data
set.seed(0)
testdata <- expand.grid(Team=c("Liverpool","Man U","Man City","Leicester", "Wolves"), Game=1:27)
testdata$Result <- sample(factor(c("Win","Draw","Loss"), levels=c("Win","Draw","Loss")), length(testdata[[1]]),
replace=TRUE, prob=c(0.4,0.2,0.4))
#plot
ggplot(testdata, aes(x=Game, y=as.numeric(Result), fill=Result)) + facet_grid(Team~., switch="y") +
geom_tile(colour="grey80", width=1,height=1) + scale_y_reverse(breaks=NULL) +
ylab("") + scale_fill_manual(values=c(Win="green3",Draw="Orange",Loss="Red"))
This results in the following plot:
If your data is not ordered with the teams in descending order, you'd need to convert testdata$Team to a factor ordered by the position in the league, see this question for example.

How can I match an audio clip inside an audio clip with Python? [duplicate]

I have a load of 3 hour MP3 files, and every ~15 minutes a distinct 1 second sound effect is played, which signals the beginning of a new chapter.
Is it possible to identify each time this sound effect is played, so I can note the time offsets?
The sound effect is similar every time, but because it's been encoded in a lossy file format, there will be a small amount of variation.
The time offsets will be stored in the ID3 Chapter Frame MetaData.
Example Source, where the sound effect plays twice.
ffmpeg -ss 0.9 -i source.mp3 -t 0.95 sample1.mp3 -acodec copy -y
Sample 1 (Spectrogram)
ffmpeg -ss 4.5 -i source.mp3 -t 0.95 sample2.mp3 -acodec copy -y
Sample 2 (Spectrogram)
I'm very new to audio processing, but my initial thought was to extract a sample of the 1 second sound effect, then use librosa in python to extract a floating point time series for both files, round the floating point numbers, and try to get a match.
import numpy
import librosa
print("Load files")
source_series, source_rate = librosa.load('source.mp3') # 3 hour file
sample_series, sample_rate = librosa.load('sample.mp3') # 1 second file
print("Round series")
source_series = numpy.around(source_series, decimals=5);
sample_series = numpy.around(sample_series, decimals=5);
print("Process series")
source_start = 0
sample_matching = 0
sample_length = len(sample_series)
for source_id, source_sample in enumerate(source_series):
if source_sample == sample_series[sample_matching]:
sample_matching += 1
if sample_matching >= sample_length:
print(float(source_start) / source_rate)
sample_matching = 0
elif sample_matching == 1:
source_start = source_id;
else:
sample_matching = 0
This does not work with the MP3 files above, but did with an MP4 version - where it was able to find the sample I extracted, but it was only that one sample (not all 12).
I should also note this script takes just over 1 minute to process the 3 hour file (which includes 237,426,624 samples). So I can imagine that some kind of averaging on every loop would cause this to take considerably longer.
Trying to directly match waveforms samples in the time domain is not a good idea. The mp3 signal will preserve the perceptual properties but it is quite likely the phases of the frequency components will be shifted so the sample values will not match.
You could try trying to match the volume envelopes of your effect and your sample.
This is less likely to be affected by the mp3 process.
First, normalise your sample so the embedded effects are the same level as your reference effect. Constructing new waveforms from the effect and the sample by using the average of the peak values over time frames that are just short enough to capture the relevant features. Better still use overlapping frames. Then use cross-correlation in the time domain.
If this does not work then you could analyze each frame using an FFT this gives you a feature vector for each frame. You then try to find matches of the sequence of features in your effect with the sample. Similar to https://stackoverflow.com/users/1967571/jonnor suggestion. MFCC is used in speech recognition but since you are not detecting speech FFT is probably OK.
I am assuming the effect playing by itself (no background noise) and it is added to the recording electronically (as opposed to being recorded via a microphone). If this is not the case the problem becomes more difficult.
This is an Audio Event Detection problem. If the sound is always the same and there are no other sounds at the same time, it can probably be solved with a Template Matching approach. At least if there is no other sounds with other meanings that sound similar.
The simplest kind of template matching is to compute the cross-correlation between your input signal and the template.
Cut out an example of the sound to detect (using Audacity). Take as much as possible, but avoid the start and end. Store this as .wav file
Load the .wav template using librosa.load()
Chop up the input file into a series of overlapping frames. Length should be same as your template. Can be done with librosa.util.frame
Iterate over the frames, and compute cross-correlation between frame and template using numpy.correlate.
High values of cross-correlation indicate a good match. A threshold can be applied in order to decide what is an event or not. And the frame number can be used to calculate the time of the event.
You should probably prepare some shorter test files which have both some examples of the sound to detect as well as other typical sounds.
If the volume of the recordings is inconsistent you'll want to normalize that before running detection.
If cross-correlation in the time-domain does not work, you can compute the melspectrogram or MFCC features and cross-correlate that. If this does not yield OK results either, a machine learning model can be trained using supervised learning, but this requires labeling a bunch of data as event/not-event.
To follow up on the answers by #jonnor and #paul-john-leonard, they are both correct, by using frames (FFT) I was able to do Audio Event Detection.
I've written up the full source code at:
https://github.com/craigfrancis/audio-detect
Some notes though:
To create the templates, I used ffmpeg:
ffmpeg -ss 13.15 -i source.mp4 -t 0.8 -acodec copy -y templates/01.mp4;
I decided to use librosa.core.stft, but I needed to make my own implementation of this stft function for the 3 hour file I'm analysing, as it's far too big to keep in memory.
When using stft I tried using a hop_length of 64 at first, rather than the default (512), as I assumed that would give me more data to work with... the theory might be true, but 64 was far too detailed, and caused it to fail most of the time.
I still have no idea how to get cross-correlation between frame and template to work (via numpy.correlate)... instead I took the results per frame (the 1025 buckets, not 1024, which I believe relate to the Hz frequencies found) and did a very simple average difference check, then ensured that average was above a certain value (my test case worked at 0.15, the main files I'm using this on required 0.55 - presumably because the main files had been compressed quite a bit more):
hz_score = abs(source[0:1025,x] - template[2][0:1025,y])
hz_score = sum(hz_score)/float(len(hz_score))
When checking these scores, it's really useful to show them on a graph. I often used something like the following:
import matplotlib.pyplot as plt
plt.figure(figsize=(30, 5))
plt.axhline(y=hz_match_required_start, color='y')
while x < source_length:
debug.append(hz_score)
if x == mark_frame:
plt.axvline(x=len(debug), ymin=0.1, ymax=1, color='r')
plt.plot(debug)
plt.show()
When you create the template, you need to trim off any leading silence (to avoid bad matching), and an extra ~5 frames (it seems that the compression / re-encoding process alters this)... likewise, remove the last 2 frames (I think the frames include a bit of data from their surroundings, where the last one in particular can be a bit off).
When you start finding a match, you might find it's ok for the first few frames, then it fails... you will probably need to try again a frame or two later. I found it easier having a process that supported multiple templates (slight variations on the sound), and would check their first testable (e.g. 6th) frame and if that matched, put them in a list of potential matches. Then, as it progressed on to the next frames of the source, it could compare it to the next frames of the template, until all frames in the template had been matched (or failed).
This might not be an answer, it's just where I got to before I start researching the answers by #jonnor and #paul-john-leonard.
I was looking at the Spectrograms you can get by using librosa stft and amplitude_to_db, and thinking that if I take the data that goes in to the graphs, with a bit of rounding, I could potentially find the 1 sound effect being played:
https://librosa.github.io/librosa/generated/librosa.display.specshow.html
The code I've written below kind of works; although it:
Does return quite a few false positives, which might be fixed by tweaking the parameters of what is considered a match.
I would need to replace the librosa functions with something that can parse, round, and do the match checks in one pass; as a 3 hour audio file causes python to run out of memory on a computer with 16GB of RAM after ~30 minutes before it even got to the rounding bit.
import sys
import numpy
import librosa
#--------------------------------------------------
if len(sys.argv) == 3:
source_path = sys.argv[1]
sample_path = sys.argv[2]
else:
print('Missing source and sample files as arguments');
sys.exit()
#--------------------------------------------------
print('Load files')
source_series, source_rate = librosa.load(source_path) # The 3 hour file
sample_series, sample_rate = librosa.load(sample_path) # The 1 second file
source_time_total = float(len(source_series) / source_rate);
#--------------------------------------------------
print('Parse Data')
source_data_raw = librosa.amplitude_to_db(abs(librosa.stft(source_series, hop_length=64)))
sample_data_raw = librosa.amplitude_to_db(abs(librosa.stft(sample_series, hop_length=64)))
sample_height = sample_data_raw.shape[0]
#--------------------------------------------------
print('Round Data') # Also switches X and Y indexes, so X becomes time.
def round_data(raw, height):
length = raw.shape[1]
data = [];
range_length = range(1, (length - 1))
range_height = range(1, (height - 1))
for x in range_length:
x_data = []
for y in range_height:
# neighbours = []
# for a in [(x - 1), x, (x + 1)]:
# for b in [(y - 1), y, (y + 1)]:
# neighbours.append(raw[b][a])
#
# neighbours = (sum(neighbours) / len(neighbours));
#
# x_data.append(round(((raw[y][x] + raw[y][x] + neighbours) / 3), 2))
x_data.append(round(raw[y][x], 2))
data.append(x_data)
return data
source_data = round_data(source_data_raw, sample_height)
sample_data = round_data(sample_data_raw, sample_height)
#--------------------------------------------------
sample_data = sample_data[50:268] # Temp: Crop the sample_data (318 to 218)
#--------------------------------------------------
source_length = len(source_data)
sample_length = len(sample_data)
sample_height -= 2;
source_timing = float(source_time_total / source_length);
#--------------------------------------------------
print('Process series')
hz_diff_match = 18 # For every comparison, how much of a difference is still considered a match - With the Source, using Sample 2, the maximum diff was 66.06, with an average of ~9.9
hz_match_required_switch = 30 # After matching "start" for X, drop to the lower "end" requirement
hz_match_required_start = 850 # Out of a maximum match value of 1023
hz_match_required_end = 650
hz_match_required = hz_match_required_start
source_start = 0
sample_matched = 0
x = 0;
while x < source_length:
hz_matched = 0
for y in range(0, sample_height):
diff = source_data[x][y] - sample_data[sample_matched][y];
if diff < 0:
diff = 0 - diff
if diff < hz_diff_match:
hz_matched += 1
# print(' {} Matches - {} # {}'.format(sample_matched, hz_matched, (x * source_timing)))
if hz_matched >= hz_match_required:
sample_matched += 1
if sample_matched >= sample_length:
print(' Found # {}'.format(source_start * source_timing))
sample_matched = 0 # Prep for next match
hz_match_required = hz_match_required_start
elif sample_matched == 1: # First match, record where we started
source_start = x;
if sample_matched > hz_match_required_switch:
hz_match_required = hz_match_required_end # Go to a weaker match requirement
elif sample_matched > 0:
# print(' Reset {} / {} # {}'.format(sample_matched, hz_matched, (source_start * source_timing)))
x = source_start # Matched something, so try again with x+1
sample_matched = 0 # Prep for next match
hz_match_required = hz_match_required_start
x += 1
#--------------------------------------------------

(Blender) (Python)How can I animate the factor value in the mix node with Python code?

What I want is a way to handle the 'factor' value in the mixRGB node like a normal object, like for example a cube, so with fcurves, fmodifiers and so on.
All this via Python code made in the Text Editor
The first step is to find the mix node you want. Within a material you can access each node by name, while the first mixRGB node is named 'Mix', following mix nodes will have a numerical extension added to the name. The name may also be changed manually by the user (or python script). By showing the properties region (press N) you can see the name of the active node in the node properties.
To adjust the fac value you alter the default_value of the fac input. To keyframe the mix factor you tell the fac input to insert a keyframe with a data_path of default_value
import bpy
cur_frame = bpy.context.scene.frame_current
mat_nodes = bpy.data.materials['Material'].node_tree.nodes
mix_factor = mat_nodes['Mix.002'].inputs['Fac']
mix_factor.default_value = 0.5
mix_factor.keyframe_insert('default_value', frame=cur_frame)
Of course you may specify any frame number for the keyframe not just the current frame.
If you have many mix nodes, you can loop over the nodes and add each mix shader to a list
mix_nodes = [n for n in mat_nodes if n.type == 'MIX_RGB']
You can then loop over them and keyframe as desired.
for m in mix_nodes:
m.inputs['Fac'].default_value = 0.5
m.inputs['Fac'].keyframe_insert('default_value', frame=cur_frame)
Finding the fcurves after adding them is awkward for nodes. While you tell the input socket to insert a keyframe, the fcurve is stored in the node_tree so after keyframe_insert() you would use
bpy.data.materials['Material'].node_tree.animation_data.action.fcurves.find()
Knowing the data path you want to search for can be tricky, as the data path for the Fac input of node Mix.002 will be nodes["Mix.002"].inputs[0].default_value
If you want to find an fcurve after adding it to adjust values or add modifiers you will most likely find it easier to keep a list of them as you add the keyframes. After keyframe_insert() the new fcurve should be at
material.node_tree.animation_data.action.fcurves[-1]

How to compare clusters?

Hopefully this can be done with python! I used two clustering programs on the same data and now have a cluster file from both. I reformatted the files so that they look like this:
Cluster 0:
Brucellaceae(10)
Brucella(10)
abortus(1)
canis(1)
ceti(1)
inopinata(1)
melitensis(1)
microti(1)
neotomae(1)
ovis(1)
pinnipedialis(1)
suis(1)
Cluster 1:
Streptomycetaceae(28)
Streptomyces(28)
achromogenes(1)
albaduncus(1)
anthocyanicus(1)
etc.
These files contain bacterial species info. So I have the cluster number (Cluster 0), then right below it 'family' (Brucellaceae) and the number of bacteria in that family (10). Under that is the genera found in that family (name followed by number, Brucella(10)) and finally the species in each genera (abortus(1), etc.).
My question: I have 2 files formatted in this way and want to write a program that will look for differences between the two. The only problem is that the two programs cluster in different ways, so two cluster may be the same, even if the actual "Cluster Number" is different (so the contents of Cluster 1 in one file might match Cluster 43 in the other file, the only different being the actual cluster number). So I need something to ignore the cluster number and focus on the cluster contents.
Is there any way I could compare these 2 files to examine the differences? Is it even possible? Any ideas would be greatly appreciated!
Given:
file1 = '''Cluster 0:
giant(2)
red(2)
brick(1)
apple(1)
Cluster 1:
tiny(3)
green(1)
dot(1)
blue(2)
flower(1)
candy(1)'''.split('\n')
file2 = '''Cluster 18:
giant(2)
red(2)
brick(1)
tomato(1)
Cluster 19:
tiny(2)
blue(2)
flower(1)
candy(1)'''.split('\n')
Is this what you need?
def parse_file(open_file):
result = []
for line in open_file:
indent_level = len(line) - len(line.lstrip())
if indent_level == 0:
levels = ['','','']
item = line.lstrip().split('(', 1)[0]
levels[indent_level - 1] = item
if indent_level == 3:
result.append('.'.join(levels))
return result
data1 = set(parse_file(file1))
data2 = set(parse_file(file2))
differences = [
('common elements', data1 & data2),
('missing from file2', data1 - data2),
('missing from file1', data2 - data1) ]
To see the differences:
for desc, items in differences:
print desc
print
for item in items:
print '\t' + item
print
prints
common elements
giant.red.brick
tiny.blue.candy
tiny.blue.flower
missing from file2
tiny.green.dot
giant.red.apple
missing from file1
giant.red.tomato
So just for help, as I see lots of different answers in the comment, I'll give you a very, very simple implementation of a script that you can start from.
Note that this does not answer your full question but points you in one of the directions in the comments.
Normally if you have no experience I'd argue to go a head and read up on Python (which i'll do anyways, and i'll throw in a few links in the bottom of the answer)
On to the fun stuffs! :)
class Cluster(object):
'''
This is a class that will contain your information about the Clusters.
'''
def __init__(self, number):
'''
This is what some languages call a constructor, but it's not.
This method initializes the properties with values from the method call.
'''
self.cluster_number = number
self.family_name = None
self.bacteria_name = None
self.bacteria = []
#This part below isn't a part of the class, this is the actual script.
with open('bacteria.txt', 'r') as file:
cluster = None
clusters = []
for index, line in enumerate(file):
if line.startswith('Cluster'):
cluster = Cluster(index)
clusters.append(cluster)
else:
if not cluster.family_name:
cluster.family_name = line
elif not cluster.bacteria_name:
cluster.bacteria_name = line
else:
cluster.bacteria.append(line)
I wrote this as dumb and overly simple as I could without any fancy stuff and for Python 2.7.2
You could copy this file into a .py file and run it directly from command line python bacteria.py for example.
Hope this helps a bit and don't hesitate to come by our Python chat room if you have any questions! :)
http://learnpythonthehardway.org/
http://www.diveintopython.net/
http://docs.python.org/2/tutorial/inputoutput.html
check if all elements in a list are identical
Retaining order while using Python's set difference
You have to write some code to parse the file. If you ignore the cluster, you should be able to distinguish between family, genera and species based on indentation.
The easiest way it to define a named tuple:
import collections
Bacterium = collections.namedtuple('Bacterium', ['family', 'genera', 'species'])
You can make in instance of this object like this:
b = Bacterium('Brucellaceae', 'Brucella', 'canis')
Your parser should read a file line by line, and set the family and genera. If it then finds a species, it should add a Bacterium to a list;
with open('cluster0.txt', 'r') as infile:
lines = infile.readlines()
family = None
genera = None
bacteria = []
for line in lines:
# set family and genera.
# if you detect a bacterium:
bacteria.append(Bacterium(family, genera, species))
Once you have a list of all bacteria in each file or cluster, you can select from all the bacteria like this:
s = [b for b in bacteria if b.genera == 'Streptomycetaceae']
Comparing two clusterings is not trivial task and reinventing the wheel is unlikely to be successful. Check out this package which has lots of different cluster similarity metrics and can compare dendrograms (the data structure you have).
The library is called CluSim and can be found here:
https://github.com/Hoosier-Clusters/clusim/
After learning so much from Stackoverflow, finally I have an opportunity to give back! A different approach from those offered so far is to relabel clusters to maximize alignment, and then comparison becomes easy. For example, if one algorithm assigns labels to a set of six items as L1=[0,0,1,1,2,2] and another assigns L2=[2,2,0,0,1,1], you want these two labelings to be equivalent since L1 and L2 are essentially segmenting items into clusters identically. This approach relabels L2 to maximize alignment, and in the example above, will result in L2==L1.
I found a soution to this problem in "Menéndez, Héctor D. A genetic approach to the graph and spectral clustering problem. MS thesis. 2012." and below is an implementation in Python using numpy. I'm relatively new to Python, so there may be better implementations, but I think this gets the job done:
def alignClusters(clstr1,clstr2):
"""Given 2 cluster assignments, this funciton will rename the second to
maximize alignment of elements within each cluster. This method is
described in in Menéndez, Héctor D. A genetic approach to the graph and
spectral clustering problem. MS thesis. 2012. (Assumes cluster labels
are consecutive integers starting with zero)
INPUTS:
clstr1 - The first clustering assignment
clstr2 - The second clustering assignment
OUTPUTS:
clstr2_temp - The second clustering assignment with clusters renumbered to
maximize alignment with the first clustering assignment """
K = np.max(clstr1)+1
simdist = np.zeros((K,K))
for i in range(K):
for j in range(K):
dcix = clstr1==i
dcjx = clstr2==j
dd = np.dot(dcix.astype(int),dcjx.astype(int))
simdist[i,j] = (dd/np.sum(dcix!=0) + dd/np.sum(dcjx!=0))/2
mask = np.zeros((K,K))
for i in range(K):
simdist_vec = np.reshape(simdist.T,(K**2,1))
I = np.argmax(simdist_vec)
xy = np.unravel_index(I,simdist.shape,order='F')
x = xy[0]
y = xy[1]
mask[x,y] = 1
simdist[x,:] = 0
simdist[:,y] = 0
swapIJ = np.unravel_index(np.where(mask.T),simdist.shape,order='F')
swapI = swapIJ[0][1,:]
swapJ = swapIJ[0][0,:]
clstr2_temp = np.copy(clstr2)
for k in range(swapI.shape[0]):
swapj = [swapJ[k]==i for i in clstr2]
clstr2_temp[swapj] = swapI[k]
return clstr2_temp

Categories