I'm studying the effect of node removals on model training accuracy, using several heuristics, the CORA dataset, and the DGL graph library. The issue appears when I try to remove nodes in reverse order of degree, i.e. nodes with higher degree are removed first. I extract the graph's degree array, which is indexed by node ID, and reverse-argsort it; this should give the node IDs in decreasing order of degree.
Finally, I remove the desired number of nodes from the graph, returning the modified graph.
After a few iterations, I noticed that the largest degree present in the graph tends to increase, which should not happen with my algorithm, since I reverse-argsort the degree indexes, slice, and remove.
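To make the selection step concrete, here it is in isolation (a toy sketch with made-up degrees, not CORA data):
import numpy as np

degrees = np.array([3, 1, 4, 1, 5])   # degrees[i] = in-degree of node i
order = np.argsort(degrees)[::-1]     # node ids sorted by degree, largest first -> [4 2 0 3 1]
to_remove = order[:2]                 # ids of the two highest-degree nodes -> [4 2]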
I've inserted some prints into the code to show progress and how the degrees change over time. To avoid having to clone the code, I saved the output inside the repository.
Here is the minimal reproducible example: github repository
import math
import random
import secrets
import time
import numpy as np
import torch
import dgl
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import pdb
from dgl.data import CoraGraphDataset
import gnn
def remove_nodes(g, total):
    degreeArray = g.in_degrees().numpy()
    print('Mean of degrees: ', degreeArray.sum() / len(degreeArray))
    print("Size of degree array: ", len(degreeArray))
    print("__________")
    # sort indexes and reverse, to get greater degrees first
    sortIndexes = np.argsort(degreeArray)[::-1].copy()
    #print("Sorted indexes: ", sortIndexes.tolist())
    # 2nd step: get degree value info
    debug_sorted_degrees = degreeArray[sortIndexes]
    # indexes and degrees of the 10 first nodes to be removed (kept for debugging)
    degreeDict = list(zip(sortIndexes, debug_sorted_degrees))[0:10]
    #print("DegreeDict: ", degreeDict)
    # put all degrees in the graph into a dataframe and group by degree
    hist = pd.DataFrame(debug_sorted_degrees)
    hist.columns = ['degrees in graph, grouped']
    y = hist.groupby("degrees in graph, grouped").size()
    print("number of nodes to be removed in round: ", total)
    print(y)
    # slice the desired number of nodes from the sorted indexes
    nodes = sortIndexes[0:total].copy()
    #print(nodes.tolist())
    removedNodesSearchedInGraph = g.in_degrees(torch.tensor(nodes)).numpy().tolist()
    maiorGrau = max(removedNodesSearchedInGraph)  # largest removed degree
    menorGrau = min(removedNodesSearchedInGraph)  # smallest removed degree
    print("\nSorted degree removals: ")
    print(*removedNodesSearchedInGraph[0:total], sep='\t')
    print(f"Largest degree removed: {maiorGrau}")
    print(f"Smallest degree removed: {menorGrau}")
    g.remove_nodes(torch.tensor(nodes, dtype=torch.int64), store_ids=True)
    return g, nodes
dataset = CoraGraphDataset()[0]
precision = []
trainingEpochs = 60
nodeRemovalsPerRound = 50
for i in range(7):
    print(f"\n______________ITERATION #{i}______________________")
    g, removedNodes = remove_nodes(dataset, nodeRemovalsPerRound)
    currentPrecision = gnn.train(dataset, trainingEpochs)
    precision.append(currentPrecision)

for i in range(len(precision)):
    print(f"Precision of iteration {i+1}: {precision[i]}")
And this is the output I get from running the code.
After the first iteration, it starts removing the lowest-degree nodes, not the highest-degree ones.
What am I missing?
I am trying to replicate the MATLAB function findpeaks() in Python using find_peaks() from scipy.signal.
Basically I'm trying to translate the MATLAB example for Finding Periodicity Using Autocorrelation into Python.
I've written the following Python code for the same.
Everything seems to be working fine, except for the last part, where the indices of the 'long period', i.e. those of the highest peaks, aren't determined correctly.
#Loading Libraries
import numpy as np
import pandas as pd
import pickle
import scipy
from scipy.signal import find_peaks, square
import scipy.signal as signal
import matplotlib.pyplot as plt
import math
#Loading Dataset from a local copy of the dataset (from the MATLAB link I've shared)
dataset = pd.read_csv('officetemp_matlab_dataset.csv')
#Preprocessing
temp = dataset.to_numpy()
tempC = (temp-32)*5/9
tempnorm = tempC-np.mean(tempC)
fs = 2*24  # sampling rate: two readings per hour -> 48 samples per day
t = [(i-1)/fs for i in range(len(tempnorm))]  # time in days (mirrors MATLAB's 1-based indexing)
#Plotting the waveform
plt.plot(t, tempnorm)
#Determining Autocorrelation & Lags
autocorr = signal.correlate(tempnorm, tempnorm, mode='same')
lags = signal.correlation_lags(len(tempnorm), len(tempnorm), mode="same")
#Plotting the Autocorrelation & Lags
f = plt.figure()
f.set_figwidth(40)
f.set_figheight(10)
plt.plot(lags/fs, autocorr)
#A) FINDING ALL PEAKS
#1) Finding peak indices
indices = find_peaks(autocorr.flatten())[0]
#2) Finding peak values
peak_values_short = [autocorr.flatten()[j] for j in indices]
#3) Finding corresponding lags of the peak values
peak_values_lags_short = [lags.flatten()[j] for j in indices]
#4) Determining Period (short)
diff = [(indices[i - 1] - x) for i, x in enumerate(indices)][1:]
short_period = abs(np.mean(diff))/fs
short_period
#B) FINDING THE HIGHEST PEAKS (of 2nd period)
#1) Finding peak indices
indices = find_peaks(autocorr.flatten(), height = 0.3, distance = math.ceil(short_period)*fs)[0]
#2) Finding peak values
peak_values_long = [autocorr.flatten()[j] for j in indices]
#3) Finding corresponding lags of the peak values
peak_values_lags_long = [lags.flatten()[j] for j in indices]
#4) Determining Period (long)
diff = [(indices[i - 1] - x) for i, x in enumerate(indices)][1:]
long_period = abs(np.mean(diff))/fs
long_period
###DOING A SCATTER PLOT OF THE PEAK POINTS OVERLAPPING ON THE PREVIOUS PLOT OF AUTOCORR VS LAGS
f = plt.figure()
f.set_figwidth(40)
f.set_figheight(10)
plt.plot(lags/fs, autocorr)
shrt = [i/fs for i in peak_values_lags_short]
lng = [i/fs for i in peak_values_lags_long]
plt.scatter(shrt, peak_values_short, marker='o')
plt.scatter(lng, peak_values_long, marker='*')
As you can see, there are two things going wrong in my Python output compared to the MATLAB example:
The 'long time period' value (and the corresponding indices) is different.
The autocorr and lag values at the 'long time period' peak locations are different (as seen in the last plot).
I can't figure out why find_peaks() works fine the first time (when all peaks are determined) but fails to give the correct results the second time, when more arguments are provided to find the highest peaks.
How can I detect the highest peaks of the 2nd period correctly?
I'm answering my own question.
I realized that the only mistake I was making in my Python code was not normalizing the autocorr values, as was done in the MATLAB example. I simply added the following to my code:
autocorr = (autocorr-min(autocorr))/(max(autocorr)-min(autocorr))
When I do so, I eventually get the desired results, the same as in the example.
Hence, to conclude, find_peaks() does in fact do the intended job.
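For context, here is where the fix sits in the pipeline above (a sketch; variable names as in my code):
# Normalize right after computing the autocorrelation, before any find_peaks call,
# so that height=0.3 is meaningful on a 0-to-1 scale.
autocorr = signal.correlate(tempnorm, tempnorm, mode='same')
autocorr = (autocorr - autocorr.min()) / (autocorr.max() - autocorr.min())
indices = find_peaks(autocorr.flatten(), height=0.3, distance=math.ceil(short_period)*fs)[0]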
I am trying to create a directed network with more than 5000 nodes. The edges between the nodes are based on the difference in a certain value assigned to each node: if the difference in values between a pair of nodes is less than a threshold, there is an edge. I generate an adjacency matrix and want to check whether the directed graph is weakly connected, and also to compute PageRank. Currently, I use the code below to generate the graph; it takes 78 s and occupies nearly 7 GB of memory. I want to know if there is a more efficient (in time and memory) way of constructing and evaluating large networks in Python.
%reset -f
!pip install faiss-gpu
import faiss
import numpy as np
import torch
import random
import networkx as nx
import time
device='cuda'
res = faiss.StandardGpuResources()
start=time.time()
# Total Nodes
N = 5000
# Mean
mu = 0.5*np.pi
# Variance
var = np.pi/18
# Maximum degree of each node
max_degree = 1000
# Threshold
value_thres = np.pi/6
# Placeholders
Values = torch.zeros((N,1),dtype=torch.double,device='cuda')
Matrixs = torch.zeros((2,N,max_degree),dtype=torch.double,device='cuda')
Adj_Matrix = torch.zeros((N,N),dtype=torch.long,device='cuda')
#Generate a directed network with N nodes whose connectivity is based on values
start_network=time.time()
Values[:,0] = torch.normal(mu,var,(N,))
# Find neighbors upto max_degree
# Pytorch to numpy
Current = np.float32(Values[:,0].cpu().detach().numpy())
index_flat = faiss.IndexFlatL2(Current[:,None].shape[1])
gpu_index_flat = faiss.index_cpu_to_gpu(res, 0, index_flat)
gpu_index_flat.add(Current[:,None])
m, n = gpu_index_flat.search(Current[:,None], max_degree)
# Indices of possible neighbors
Matrixs[1] = torch.from_numpy(n).long()
# Value Separation
Matrixs[0] = torch.squeeze(torch.cdist(Values[:,0][:,None][:,None],Values[:,0][:,None][Matrixs[1].long()],p=2))<value_thres
# Construct Adjacency Matrix
Adj_Matrix[Matrixs[1].long()] = 1
Adj_Matrix-=torch.eye(N,dtype=torch.long,device='cuda')
G = nx.from_numpy_matrix(Adj_Matrix.cpu().detach().numpy())
end=time.time()
print('Network Creation Time',end-start_network)
print('Total Time', end - start)
From the snippet in the question, it's hard to isolate the time/memory cost of networkx (I don't have 'cuda' on my machine, so I'm unable to replicate it). However, the following code runs in about 36 seconds:
import networkx as nx
import numpy as np
A = np.random.randint(2, size=(5000, 5000))
G = nx.from_numpy_matrix(A) # about 36 seconds
There could be scope for a faster algorithm by writing a custom low-level graph constructor, but it's unlikely to bring memory advantages.
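If memory is the bottleneck, building from a sparse matrix may help when the thresholded graph has far fewer than N^2 edges. A sketch (from_scipy_sparse_array assumes a recent networkx release):
import numpy as np
import networkx as nx
from scipy import sparse

# stand-in for Adj_Matrix.cpu().detach().numpy() from the question
A = np.random.randint(2, size=(5000, 5000))

# networkx then iterates only over the nonzero entries
G = nx.from_scipy_sparse_array(sparse.csr_matrix(A), create_using=nx.DiGraph)

print(nx.is_weakly_connected(G))  # weak connectivity check
pr = nx.pagerank(G)               # PageRank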
I'm currently working on a way to find rectangles/polygons among up to 15 given points (image below).
(Image: the given points)
My goal is to find polygons in that point array, like the one I marked in the image below. The polygons are rectangles in the real world, but they are slightly distorted, which is why they can look like general polygons or other shapes. I need to find the best rectangle/polygon.
My idea was to check all connections between the points, but the total number of combinations is too big to evaluate in a reasonable time.
Does anyone have an idea how to solve this? I searched the web and found the k-nearest-neighbours algorithm in sklearn for Python, but I have no experience with it, so I don't know whether it is the right way to solve the problem or how to apply it. Maybe I'll also need a method to filter out some of the outliers, to make it even easier for the algorithm to find the right corner points of the polygon.
The code snippet below splits the given point string into separate arrays; the array coordinatesOnly contains just the x and y values of the points.
Many thanks for your help.
(Image: a polygon in the given points)
import math
import numpy as np
import matplotlib.pyplot as plt
import time
from sklearn.neighbors import NearestNeighbors
millis = round(int(time.time())) / 1000
####input String
print("2D to 3D convert")
resultString = "0,487.50,399.46,176.84,99.99;1,485.93,423.43,-4.01,95.43;2,380.53,433.28,1.52,94.90;3,454.47,397.68,177.07,90.63;4,490.20,404.10,-6.17,89.90;5,623.56,430.52,-176.09,89.00;6,394.66,385.44,90.22,87.74;7,625.61,416.77,-177.95,87.02;8,597.21,591.66,-91.04,86.49;9,374.03,540.89,-11.20,85.77;10,600.51,552.91,178.29,85.52;11,605.29,530.78,-179.89,85.34;12,583.73,653.92,-82.39,84.42;13,483.56,449.58,-91.12,83.37;14,379.01,451.62,-6.21,81.51"
resultString = resultString.split(";")
resultStringSplitted = list()
coordinatesOnly = list()
for i in range(len(resultString)):
    resultStringSplitted.append(resultString[i].split(","))
    newList = ((float(resultString[i].split(",")[1]), float(resultString[i].split(",")[2])))
    coordinatesOnly.append(newList)
    for j in range(len(resultStringSplitted[i])):
        resultStringSplitted[i][j] = float(resultStringSplitted[i][j])

# Check if the score is valid
validScoreList = list()
for i in range(len(resultStringSplitted)):
    if resultStringSplitted[i][len(resultStringSplitted[i])-1] != 0:
        validScoreList.append(resultStringSplitted[i])
resultStringSplitted = validScoreList

# The result array contains all 2D results:
# [point number, x coordinate, y coordinate, angle, point score]
for i in range(len(resultStringSplitted)):
    plt.scatter(resultStringSplitted[i][1], resultStringSplitted[i][2])
plt.show(block=True)
Since you mentioned that you can have a maximum of 15 points, I suggest checking all possible combinations of 4 points and keeping all candidates that are close enough to perfect rectangles. For 15 points, that is "only" 15*14*13*12 = 32760 potential rectangles.
import math
import itertools
import numpy as np

coordinatesOnly = ((0,0),(0,1),(1,0),(1,1),(2,0),(2,1),(1,3))  # Test data

rectangles = []

# Returns True if l0 and l1 are within 10% deviation
def isValid(l0, l1):
    if l0 == 0 or l1 == 0:
        return False
    return abs(max(l0, l1) / min(l0, l1) - 1) < 0.1

for p in itertools.combinations(np.array(coordinatesOnly), 4):
    for r in itertools.permutations(p, 4):
        l01 = np.linalg.norm(r[1] - r[0])  # Side
        l12 = np.linalg.norm(r[2] - r[1])  # Side
        l23 = np.linalg.norm(r[3] - r[2])  # Side
        l30 = np.linalg.norm(r[0] - r[3])  # Side
        l02 = np.linalg.norm(r[2] - r[0])  # Diagonal
        l13 = np.linalg.norm(r[3] - r[1])  # Diagonal (fixed: was r[2]-r[0], a copy-paste slip)
        areSidesEqual = isValid(l01, l23) and isValid(l12, l30)
        isDiag1Valid = isValid(math.sqrt(l01*l01 + l30*l30), l13)  # Pythagoras
        isDiag2Valid = isValid(math.sqrt(l01*l01 + l12*l12), l02)  # Pythagoras
        if areSidesEqual and isDiag1Valid and isDiag2Valid:
            rectangles.append(r)
            break

print(rectangles)
It takes about 1 second to run on 15 points on my computer. Whether that is fast enough really depends on your requirements for computation time, i.e., real-time, interactive time, or "I just don't want to spend days waiting for the answer" time.
import sys
import numpy as np
import scipy.io as sio
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.svm import SVC
filename = sys.argv[1]
datafile = sio.loadmat(filename)
data = datafile['bow']
sizedata=[len(data), len(data[0])]
gap=[]
SD=[]
for knum in range(10, 20):
    print(knum)
    # Clustering the original data
    kmeanspp = KMeans(n_clusters=knum, init='k-means++', max_iter=100, n_jobs=1)
    kmeanspp.fit(data)
    dispersion = kmeanspp.inertia_
    # Clustering reference data
    nrefs = 10
    refDisp = np.zeros(nrefs)
    for nref in range(nrefs):
        refdata = np.random.random_sample((sizedata[0], sizedata[1]))
        refkmeans = KMeans(n_clusters=knum, init='k-means++', max_iter=100, n_jobs=1)
        refkmeans.fit(refdata)
        refdisp = refkmeans.inertia_
        refDisp[nref] = np.log(refdisp)
    mean_log_refdisp = np.mean(refDisp)
    gap.append(mean_log_refdisp - np.log(dispersion))
    # Calculating standard deviation
    sd = (sum([(r-m)**2 for r,m in zip(refDisp,[mean_log_refdisp]*nrefs)])/nrefs)**0.5
    SD.append(sd)
SD = [sd*((1+(1/nrefs))**0.5) for sd in SD]

# Determining the optimal k
opt_k = None
diff = []
for i in range(len(gap)-1):
    diff = SD[i+1] - (gap[i+1] - gap[i])
    if diff > 0:
        opt_k = i + 10
        break
print(diff)

plt.plot(np.linspace(10, 19, 10, True), gap)
plt.show()
Here I am trying to implement the Gap Statistic method for determining the optimal number of clusters. The problem is that every time I run the code I get a different value for k.
What is the solution to this problem?
How can the optimal value of k differ for the same data?
I have stored the data in a .mat file beforehand and I am passing it as an argument via the terminal.
I am looking for the smallest value of k for which Gap(k) >= Gap(k+1) - s(k+1), where s(k+1) = sd(k+1) * sqrt(1 + 1/B), sd is the standard deviation of the reference distribution, and B is the number of Monte Carlo samples.
Stated otherwise, I am searching for the value of k for which
s(k+1) - Gap(k+1) + Gap(k) >= 0
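A minimal sketch of that stopping rule, assuming gap and SD are the lists computed in the loop above, with index i corresponding to k = i + 10:
def optimal_k(gap, SD, k_start=10):
    # smallest k with s(k+1) - Gap(k+1) + Gap(k) >= 0
    for i in range(len(gap) - 1):
        if SD[i + 1] - gap[i + 1] + gap[i] >= 0:
            return i + k_start
    return None  # criterion never met in the scanned range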
A couple of problems with your simulation:
1- sd = (sum([(r-m)**2 for r,m in zip(refDisp,[mean_log_refdisp]*nrefs)])/nrefs)**0.5
Why did you multiply the second component of zip by nrefs? That is not needed according to the original paper.
2-
if diff > 0:
    opt_k = i + 10
    break
Where you test diff > 0, you want diff >= 0, since equality can happen as well.
As for why you get a different number of clusters each time: as people have said, this is a Monte Carlo simulation, so there is inherent randomness, and the result also depends on what you are clustering and on your dataset. I suggest testing your algorithm against the Silhouette and Elbow methods as well, to get a better idea of the number of clusters.
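For instance, a quick silhouette comparison over the same range of k (a sketch; data as loaded in the question's script):
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

scores = {}
for k in range(10, 20):
    labels = KMeans(n_clusters=k, init='k-means++', max_iter=100).fit_predict(data)
    scores[k] = silhouette_score(data, labels)
print(max(scores, key=scores.get))  # k with the highest mean silhouette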
One option is to run your function several times, average the gap statistics and the s values, and then find the smallest k for which the averaged s(k+1) - Gap(k+1) + Gap(k) is greater than or equal to zero.
This will take longer but gives a more reliable result.
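A sketch of that averaging, assuming the loop from the question is wrapped in a (hypothetical) function gap_statistic(data) that returns the gap and SD lists:
import numpy as np

def averaged_gap_statistic(data, n_runs=5):
    # average the Monte Carlo quantities over several independent runs
    runs = [gap_statistic(data) for _ in range(n_runs)]  # each run returns (gap, SD)
    mean_gap = np.mean([g for g, _ in runs], axis=0)
    mean_sd = np.mean([s for _, s in runs], axis=0)
    return mean_gap, mean_sd
The averaged lists can then be fed to the same stopping rule as above.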
I am using the random number routines in Python in the following code in order to create a noise signal.
import random
import numpy as np
import matplotlib.pyplot as plt
import wp  # custom module that saves X and noise to a .fits file

res = 10
# Add noise to each X bin across the signal
X = np.arange(-600, 600, res)
for i in range(10000):
    noise = [random.uniform(-2, 2) for _ in range(len(X))]
    # custom module to save output of X and noise to .fits file
    wp.save_fits('test10000', X, noise)
plt.plot(X, noise)  # note: the original snippet plotted undefined V, I
plt.show()
In this example I generate 10,000 'noise.fits' files, which I then wish to co-add in order to show the expected 1/sqrt(N) dependence of the stacked noise root-mean-square (rms) on the number of objects co-added.
My problem is that the rms follows this dependence up until ~1000 objects, at which point it deviates upwards, suggesting that the random number generator is repeating values.
Is there a routine, or a way to structure the code, that will avoid or minimise this repetition? (Ideally with the numbers as floats between a minimum and maximum value, with max > 1 and min < -1.)
Here is the output of the co-adding code, with the code itself pasted at the bottom for reference.
If I use the module random.random() the result is worse.
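As a point of comparison, here is a stand-alone sanity check of the expected scaling, using NumPy's generator instead of my .fits pipeline (a sketch, with 120 bins matching my X array):
import numpy as np

rng = np.random.default_rng(0)
nbins = 120  # len(np.arange(-600, 600, 10))
for N in (1, 10, 100, 1000, 10000):
    stack = rng.uniform(-2, 2, size=(N, nbins)).mean(axis=0)
    print(N, np.std(stack))  # rms of the average should fall off roughly as 1/sqrt(N)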
Here is my code which adds the noise signal files together, averaging over the number of objects.
import os
import numpy as np
from astropy.io import fits
import matplotlib.pyplot as plt
import glob

rms_arr = []
filelist = glob.glob('/Users/thbrown/Documents/HI_stacking/mockcat/testing/test10000/M*.fits')
filelist.sort()
for i in filelist:
    print(i)
    # open an existing FITS file
    hdulist = fits.open(str(i))
    # assuming the first extension is the table, assign its data to a record array
    tbdata = hdulist[1].data
    # access the signal column
    noise = tbdata.field(1)
    # access the vel column
    X = tbdata.field(0)
    if i == filelist[0]:
        # initialise the stack on the first file
        stack = np.zeros(len(noise))
        tot_rms = 0
    # sum the signal over the loop
    stack = (stack + noise)
    rms = np.std(stack)
    rms_arr = np.append(rms_arr, rms)
numgal = np.arange(1, np.size(filelist) + 1)
avg_rms = rms_arr / numgal