How to Collapse a List of Labels in Python

So I realize this is both a theoretical question and a coding question, but say I have a list of 10 labels (x1, x2, ..., x10) and their corresponding "location" vectors (v1, v2, ..., v10).
I want to collapse them based on their L2-norm distance from each other. For example, if v1 is close to v10, then relabel all x10's as x1's, and so on.
So the end result could hypothetically look like the new labels (x1, x3, x7, x8). Is there a way to smartly turn this into (x1', x2', x3', x4'), so that people don't get confused and assume the new labels are the same as the old ones?
Given:
labels = an Nx1 vector that holds all the labels (1, 2, 3, ..., 10)
Example Code:
epsilon = 0.2  # defines the distance threshold
change = []    # initialize list of label pairs to change
# distancematrix is an NxN matrix of the pairwise distances between all our vectors (v1, ..., v10)
for i in range(distancematrix.shape[0]):
    for j in range(distancematrix.shape[1]):
        # collect all pairs of labels that are "close", so that we may relabel
        if i != j and distancematrix[i, j] < epsilon:
            change.append((i, j))
This will produce a list of pairs that I want to relabel. Is there a smart way of rewriting 'labels' so that it merges all the pairs I want to merge AND keeps the labels that were not part of any merge, then renumbers everything to (1, 2, 3, 4) if merging 6 pairs leaves 10 - 6 = 4 labels?
Thank you. I realize this is somewhat of a weird problem, so if you have questions please let me know!

This actually does the job for me.
import numpy as np

# create a list of target labels from 0 to the number of unique labels
changeto = list(range(len(np.unique(newlabels))))
# get the unique values of newlabels (e.g. 0, 3, 4, 5, 10)
currentlabels = np.unique(newlabels)
# map every label onto the new numbering (e.g. 0 -> 0, 3 -> 1, 4 -> 2, etc.)
for i in range(len(changeto)):
    if currentlabels[i] != changeto[i]:
        # change the 'states' in newlabels to the new label
        newlabels = [changeto[i] if x == currentlabels[i] else x for x in newlabels]
Maybe it's not pretty, but it maps your new labels onto the consecutive values 0, 1, 2, ..., with one value per distinct label in your condensed label vector.
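For the merging step itself, here is a minimal union-find sketch (assumptions: the pairs in change use the same 0-based label values that appear in labels, and merge_labels is a hypothetical helper, not part of any library). It merges every pair transitively and then compacts the survivors to 0, 1, 2, ...:
import numpy as np

def merge_labels(labels, change):
    labels = np.asarray(labels)
    # one union-find parent entry per label seen anywhere
    nodes = set(np.unique(labels)) | {l for pair in change for l in pair}
    parent = {l: l for l in nodes}

    def find(l):
        # walk up to the root label, compressing the path as we go
        while parent[l] != l:
            parent[l] = parent[parent[l]]
            l = parent[l]
        return l

    for i, j in change:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[rj] = ri  # merge the two label groups

    merged = np.array([find(l) for l in labels])
    # renumber the surviving labels as 0, 1, 2, ...
    _, compact = np.unique(merged, return_inverse=True)
    return compact

For example, merge_labels([0, 1, 2, 9, 9], [(0, 9)]) gives array([0, 1, 2, 0, 0]).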

What if a label is not involved in any merge? Do you want to keep the original label? If so, what if that label is outside the new range?
Overall, I think that this is simply generating new labels given only the quantity of labels:
new_label_list = ["x"+str(n+1)+"'" for n in range(len(change))]
For change of length 4, this gives you
["x1'", "x2'", "x3'", "x4'"]
Do you see how the new label is built?
leading "x"
string version of the index, 1 .. length
trailing prime character


How to remove overlapping blocks from numpy array?

I'm using the cv2.goodFeaturesToTrack function to find feature points in an image. The end goal is to extract square blocks of a certain size, with the feature points being the centers of those blocks.
However, lots of the feature points are close to each other, so the blocks are overlapping, which is not what I want.
This is an example of all feature points (centers):
array([[3536., 1419.],
       [2976., 1024.],
       [3504., 1400.],
       [3574., 1505.],
       [3672., 1453.],
       [3671., 1442.],
       [3489., 1429.],
       [3108.,  737.]])
Let's say I want to find the first n blocks with a blockRadius = 400 which are not overlapping. Any ideas on how to achieve this?
You could get there with scipy.spatial.KDTree, though it doesn't support querying square blocks directly. It can be used in conjunction with another library, python-igraph, which can find connected components of close points quickly:
import numpy as np
from scipy.spatial import KDTree
import igraph as ig

data = np.array([[3536., 1419.],
                 [2976., 1024.],
                 [3504., 1400.],
                 [3574., 1505.],
                 [3672., 1453.],
                 [3671., 1442.],
                 [3489., 1429.],
                 [3108.,  737.]])
# pairs closer than 400 along x, and along y; their intersection is
# "closer than 400 in both coordinates", i.e. overlapping blocks
edges1 = KDTree(data[:, :1]).query_pairs(r=400)
edges2 = KDTree(data[:, 1:]).query_pairs(r=400)
g = ig.Graph(n=len(data), edges=edges1 & edges2)
i = g.clusters()
So i groups the indices of the block points into clusters (an igraph VertexClustering object). Here's a quick preview:
>>> print(i)
Clustering with 8 elements and 2 clusters
[0] 0, 2, 3, 4, 5, 6
[1] 1, 7
>>> pal = ig.drawing.colors.ClusterColoringPalette(len(i))  # number of colors used
>>> color = pal.get_many(i.membership)  # list of color tags
>>> ig.plot(g, bbox=(200, 100), layout=g.layout('circle'), vertex_label=g.vs.indices,
...         vertex_color=color, vertex_size=12, vertex_label_size=8)
Example of usage:
>>> [data[n] for n in i]  # or list(i)
[array([[3536., 1419.],
        [3504., 1400.],
        [3574., 1505.],
        [3672., 1453.],
        [3671., 1442.],
        [3489., 1429.]]),
 array([[2976., 1024.],
        [3108.,  737.]])]
Remark: this method works with pairs of close points instead of an n*n distance matrix, which is more memory-efficient in some cases.
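As a side note, KDTree.query_pairs also accepts the Minkowski parameter p, so the Chebyshev ("square block") pairs can likely be obtained from a single tree with p=np.inf, which would make the intersection of two trees unnecessary; a sketch:
import numpy as np
from scipy.spatial import KDTree

# one tree over both coordinates; p=np.inf selects the Chebyshev metric,
# so a pair is returned iff the points are within 400 along both axes
edges = KDTree(data).query_pairs(r=400, p=np.inf)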
You'll need something iterative to do that, as recurrent dropouts like this aren't vectorizable. Something like this will work, I think:
from scipy.spatial.distance import pdist, squareform
import numpy as np

c = np.array([[3536., 1419.],
              [2976., 1024.],
              [3504., 1400.],
              [3574., 1505.],
              [3672., 1453.],
              [3671., 1442.],
              [3489., 1429.],
              [3108.,  737.]])
dists = squareform(pdist(c, metric='chebyshev'))  # distance matrix; Chebyshev since you seem to want square blocks
indices = np.arange(c.shape[0])  # indices that haven't been dropped (all to start)
out = [0]  # always want the first index
while True:
    try:
        indices = indices[dists[indices[0], indices] > 400]  # drop indices inside the threshold
        out.append(indices[0])  # add the next index that hasn't been dropped
    except IndexError:
        break  # once you run out of indices, you're done
print(out)
[0, 1]
Let's try with a whole bunch of points:
np.random.seed(42)
c = np.random.rand(10000, 2) * 800
dists = squareform(pdist(c, metric='chebyshev'))  # distance matrix; Chebyshev since you seem to want squares
indices = np.arange(c.shape[0])  # indices that haven't been dropped (all to start)
out = [0]  # always want the first index
while True:
    try:
        indices = indices[dists[indices[0], indices] > 400]  # drop indices inside the threshold
        out.append(indices[0])  # add the next index that hasn't been dropped
    except IndexError:
        break  # once you run out of indices, you're done
print(out, pdist(c[out], metric='chebyshev'))
[0, 2, 6, 17] [635.77582886 590.70015659 472.87353138 541.13920029 647.69071411
476.84658995]
So, 4 points (which makes sense, since four 400x400 blocks tile an 800x800 space), mostly low indices (17 << 10000), and the pairwise distance between kept points is always > 400.

Is there a way to cut only the first gap from a histogram and take all the remaining values in Python?

I have a data frame with the fields 'unique_years' and 'counts'. I plotted this data frame and I am getting the following histogram: histogram - example. I need to define a start-year variable, but if there are empty gaps at the starting point of the histogram I need to skip them and shift the starting year. I was wondering if there is a pythonic way to do this. In the histogram - example plot, I have a non-empty bin at the starting point, but then there is a big gap of empty bins. So I need to find the point where the non-empty bins become continuous and define that point as the starting year (for the above sample the starting year should be 1935). The n numpy.ndarray tells me which bins are empty, but I need an efficient way to resolve this. Thank you :)
Sample of my data frame:
import pandas as pd

data = {'unique_years': [1907, 1935, 1938, 1939, 1940],
        'counts': [11, 14, 438, 85, 8]}
df = pd.DataFrame(data, columns=['unique_years', 'counts'])
Code for the histogram plot:
import matplotlib.pyplot as plt

(n, bins, patches) = plt.hist(df.unique_years, bins=25, label='hst')
plt.show()
The issue with your question is that 'continuous' is not really well defined here. Do you mean that every year should have a non-empty count (that is fairly easy: you can filter your data for that before building your histogram), or that every consecutive bucket should be non-empty? If the latter, you must:
Build your histogram
Filter your data on the resulting bins
Either use the filtered histogram or re-bin the remaining data, with bin sizes not guaranteed to stay the same (so it is possible that you have the same issue with the new bins!)
As it is difficult to know exactly what is relevant in your exact case, I think the best answer would be to give you a set of tools that you can use as you see fit for the exact problem that you are encountering:
I want to filter my data starting from a certain date
filtered = df.unique_years[df.unique_years > 1930]
I want to find the second non-empty bin
(n, x) = np.histogram(df.unique_years, bins=25)
second_nonempty = np.where(n > 0)[0][1]
From there you can:
rebin your filtered data:
(n, x) = np.histogram(df.unique_years, bins=25)
second_nonempty = np.where(n > 0)[0][1]
# Re-binning on the filtered data; the cutoff is the bin edge
# x[second_nonempty], not the count n[second_nonempty]
plt.hist(df.unique_years[df.unique_years >= x[second_nonempty]], bins=25)
Plot your histogram directly on the filtered bins:
(n, x) = np.histogram(df.unique_years, bins=25)
second_nonempty = np.where(n > 0)[0][1]
# Forcing the bins to take the provided values
plt.hist(df.unique_years, bins=x[second_nonempty:])
Now the 'second_nonempty' above can of course be replaced by any estimator of where you want to start, e.g.:
# Last empty bin + 1
all_bins_full_after = np.where(n == 0)[0][-1] + 1
Or anything else really
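Putting the pieces together on the sample frame from the question, a minimal end-to-end sketch (using only the snippets above) might look like this:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

data = {'unique_years': [1907, 1935, 1938, 1939, 1940],
        'counts': [11, 14, 438, 85, 8]}
df = pd.DataFrame(data)

n, x = np.histogram(df.unique_years, bins=25)
second_nonempty = np.where(n > 0)[0][1]  # index of the second non-empty bin
start_year = x[second_nonempty]          # left bin edge, ~1935 for this sample
plt.hist(df.unique_years, bins=x[second_nonempty:])  # data before the first kept edge is dropped
plt.show()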
This should work to eliminate all the bins that are not consecutive; it operates mainly on the df itself, and you can use the result to plot your histogram:
df = pd.DataFrame(data, columns=['unique_years', 'counts'])
yd = df.unique_years.diff().eq(1)
df[yd | yd.shift(-1)]
This is the result you would get (rows whose year is part of a consecutive run):
   unique_years  counts
2          1938     438
3          1939      85
4          1940       8

How to plot a histogram of counts per range where each bin is labeled with its interval range?

I have a Python dictionary whose keys store the percentage ranges: 0_5, 5_10, 10_15, ..., 80_85, 85_90, 90_95, 95_100. The value for each key is the count within that range in the whole data. I want to use matplotlib to plot a histogram to see the distribution; there should be 20 bins, each bin should be labeled with its percentage range, and there should be a little spacing between the bins so that they are separated.
I've tried this code and it gives me a histogram that has 20 bins, but it's not what I need.
commutes = pd.Series(counts)
commutes.plot.hist(grid = False, bins = 20, rwidth = 0.8, color = 'tomato', edgecolor='gray', label = 'Type 1')
Also, I tried this and it shows the error: ValueError: weights should have the same shape as x
pylab.hist(ratio.keys(), weights = ratio.values(), bins=range(20))
This is how I created the dictionary. The variable counts stores a list of 700 percentage values.
ratio = {}
for i in range(0, 100, 5):
    start = i
    end = i + 5
    key = str(start) + "_" + str(end)
    number = 0
    for count in counts:
        if (count >= start) and (count < end):
            number = number + 1
    ratio[key] = number
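(As an aside, np.histogram can build the same counts in one call; a sketch, assuming counts is the list of 700 percentage values mentioned above:)
import numpy as np

edges = np.arange(0, 105, 5)  # 0, 5, ..., 100
numbers, _ = np.histogram(counts, bins=edges)
keys = ["{}_{}".format(a, b) for a, b in zip(edges[:-1], edges[1:])]
# note: unlike the loop above, np.histogram's last bin also includes 100
ratio = dict(zip(keys, numbers.tolist()))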
ratio.keys() and ratio.values() are of type dict_keys and dict_values. I'm guessing that matplotlib is trying to apply np.array() or np.asarray() to them, which does not work as intended, i.e. it gets converted to array(dict_keys([...]), dtype=object) rather than an array of numbers.
A simple fix is to convert the dictionary keys and values to lists first.
pylab.hist(list(ratio.keys()), weights=list(ratio.values()), bins=range(len(ratio)))
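Since the counts are already aggregated per range, a bar chart is arguably a more direct fit than plt.hist here: one bar per range, labeled with its key, with a little spacing between the bars. A minimal sketch, assuming ratio is the dictionary built in the question:
import matplotlib.pyplot as plt

labels = list(ratio.keys())
values = list(ratio.values())
# width < 1 leaves a gap between neighbouring bars
plt.bar(range(len(ratio)), values, width=0.8, color='tomato', edgecolor='gray')
plt.xticks(range(len(ratio)), labels, rotation=90)
plt.tight_layout()
plt.show()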

Best way to generate all possible bidirectional graphs in Python

I'm trying to generate all possible bidirectional graphs given a set of nodes. I'm storing my graphs as numpy vectors, and I need them that way because some downstream code consumes the graphs in this format.
Let's say I have two sets of nodes in which nodes in the same set do not connect. It is however possible to have graphs in which members of the two sets do not meet at all.
posArgs = [0, 1]  # (0->1) / (1->1) is not allowed.. neither is (1->0)
negArgs = [2]     # (0->2) is possible, and so is (0 2) - meaning no connection.
To illustrate what I mean, the example graph [figure omitted] translates as:
import numpy as np

nargs = 3  # total number of nodes
graph = np.array([0, 0, 1, 0, 0, 1, 0, 0, 0])
singleGraph = np.vstack(np.array_split(np.array(graph), nargs))
>>> print(singleGraph)
[[0 0 1]
 [0 0 1]
 [0 0 0]]
# where the first row represents node 0's relationships with nodes 0, 1, 2,
# the second row represents node 1's relationships with nodes 0, 1, 2, etc.
What I want is to generate all the possible ways in which these two sets can be connected (i.e. all the possible node combinations as vectors). At the moment I am using itertools.product to generate the power set of all the nodes, then I create a bunch of vectors in which there are circular connections or same-set connections, and delete those from the power set. Hence, using the sets above, I have the following code:
import numpy as np
import itertools

posArgs = [0, 1]  # 0->1 / 1->1 is not allowed.. neither is 1->0
negArgs = [2]
nargs = len(posArgs + negArgs)
allPermutations = np.array(list(itertools.product([0, 1], repeat=nargs*nargs)))
# Create a list of attacks that we will never need: circular attacks, and
# attacks between arguments of the same polarity
circularAttacks = (np.arange(0, nargs*nargs, nargs + 1)).tolist()
samePolarityAttacks = []
posList = list(itertools.permutations(posArgs, 2))
negList = list(itertools.permutations(negArgs, 2))
totList = posList + negList
for l in totList:
    # all the odd +1 are to account for the shift in 0 index
    ptn = ((l[0] + 1) * nargs) - ((nargs + 1) - l[1]) + 1
    samePolarityAttacks.append(ptn)
graphsToDelete = np.unique([circularAttacks + samePolarityAttacks])
subGraphs = allPermutations[:, graphsToDelete]
cutDownGraphs = np.delete(allPermutations, (np.where(subGraphs > 0)[0]).tolist(), axis=0)
for graph in cutDownGraphs:
    singleGraph = np.vstack(np.array_split(np.array(graph), nargs))
    print(singleGraph)
My problem is that when I have 5 or more nodes across both of my sets, itertools.product tries to produce 2^25 (or more) vectors at once. This of course is really expensive and exhausts my memory.
Are you aware of a smart way in which I can reshape this code while ensuring my graphs stay in this numpy array format?
--Additional info:
For two sets, one node per set, all possible combinations look as follows: [figure omitted]
Thanks
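One way to avoid materialising the full 2^(n*n) product is to enumerate only the entries that are allowed to vary and leave the forbidden ones at zero. A minimal sketch of that idea, reusing the posArgs/negArgs sets from the question (all_graphs is a hypothetical helper; as a generator it yields one numpy matrix at a time, so memory use stays flat):
import numpy as np
import itertools

def all_graphs(posArgs, negArgs):
    nargs = len(posArgs + negArgs)
    # flat indices of forbidden entries: self-loops plus same-set connections
    forbidden = {i * nargs + i for i in range(nargs)}
    for a, b in itertools.permutations(posArgs, 2):
        forbidden.add(a * nargs + b)
    for a, b in itertools.permutations(negArgs, 2):
        forbidden.add(a * nargs + b)
    free = [i for i in range(nargs * nargs) if i not in forbidden]

    # 2**len(free) graphs instead of 2**(nargs*nargs)
    for bits in itertools.product([0, 1], repeat=len(free)):
        graph = np.zeros(nargs * nargs, dtype=int)
        graph[free] = bits
        yield graph.reshape(nargs, nargs)

for singleGraph in all_graphs([0, 1], [2]):
    print(singleGraph)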

Personalised colourmap plot using set numbers in matplotlib

I have data which looks like this (example):
x  y  d
0  0  -2
1  0   0
0  1   1
1  1   3
And I want to turn this into a colormap plot which looks like one of these: [example plots omitted]
where x and y are as in the table and the color is given by 'd'. However, I want a predetermined color for each number, for example:
-2 - orange
0 - blue
1 - red
3 - yellow
Not necessarily these colours, but I need to assign a colour to each number, and the numbers are not in order or sequence; they are just a set of five or six random numbers which repeat themselves across the entire array.
Any ideas? I haven't got any code for this, as I don't know where to start. I have, however, looked at examples such as:
Matplotlib python change single color in colormap
However, they only show how to define colours, not how to link those colours to a specific value.
It turns out this is harder than I thought, so maybe someone has an easier way of doing this.
Since we need to create an image of the data, we will store them in a 2D array. We can then map the data to the integers 0 .. number of different data values and assign a color to each of them. The reason is that we want the final colormap to be equally spaced. So
value -2 --> integer 0 --> color orange
value 0 --> integer 1 --> color blue
and so on.
Having nicely spaced integers, we can use a ListedColormap on the image of newly created integer values.
import matplotlib.pyplot as plt
import numpy as np
import matplotlib.colors
# define the image as a 2D array
d = np.array([[-2,0],[1,3]])
# create a sorted list of all unique values from d
ticks = np.unique(d.flatten()).tolist()
# create a new array of same shape as d
# we will later use this to store values from 0 to number of unique values
dc = np.zeros(d.shape)
#fill the array dc
for i in range(d.shape[0]):
    for j in range(d.shape[1]):
        dc[i, j] = ticks.index(d[i, j])
# now we need n (= number of unique values) different colors
colors= ["orange", "blue", "red", "yellow"]
# and put them to a listed colormap
colormap = matplotlib.colors.ListedColormap(colors)
plt.figure(figsize=(5,3))
#plot the newly created array, shift the colorlimits,
# such that later the ticks are in the middle
im = plt.imshow(dc, cmap=colormap, interpolation="none", vmin=-0.5, vmax=len(colors)-0.5)
# create a colorbar with n different ticks
cbar = plt.colorbar(im, ticks=range(len(colors)) )
#set the ticklabels to the unique values from d
cbar.ax.set_yticklabels(ticks)
#set nice tickmarks on image
plt.gca().set_xticks(range(d.shape[1]))
plt.gca().set_yticks(range(d.shape[0]))
plt.show()
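As a side note, the double loop that fills dc can likely be replaced by a single np.unique call, since return_inverse yields exactly the position of each value within the sorted unique values; a sketch:
# equivalent to the ticks/dc construction above
# (here ticks is an ndarray rather than a list)
ticks, inverse = np.unique(d, return_inverse=True)
dc = inverse.reshape(d.shape)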
As it may not be intuitively clear how to get the array d into the shape needed for plotting with imshow, i.e. as a 2D array, here are two ways of converting the input data columns:
import numpy as np

x = np.array([0, 1, 0, 1])
y = np.array([0, 0, 1, 1])
d_original = np.array([-2, 0, 1, 3])

#### Method 1 ####
# Intuitive method.
# Assumptions:
# * indexing in x and y starts at 0
# * every index pair occurs exactly once
# Create an empty array of shape (n+1, m+1),
# where n is the maximum index in y and
# m is the maximum index in x
d = np.zeros((y.max() + 1, x.max() + 1), dtype=int)
for k in range(len(d_original)):
    d[y[k], x[k]] = d_original[k]
print(d)

#### Method 2 ####
# Fast method.
# Additional assumption:
# indices in x and y are ordered exactly such
# that y is sorted ascendingly first,
# and for each index in y, x is sorted.
# In this case the original d array can simply be reshaped.
d2 = d_original.reshape((y.max() + 1, x.max() + 1))
print(d2)
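Method 1 can also be written without the loop, using NumPy fancy indexing to scatter all values at once; a small sketch:
# scatter each d_original[k] to row y[k], column x[k] in one assignment
d_fast = np.zeros((y.max() + 1, x.max() + 1), dtype=int)
d_fast[y, x] = d_original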
