Python matrix tagging

I have an algorithm I want to implement and I'm trying to figure out the best way to do it.
I have a matrix H of size m×n (m = number of last inputs in a sliding window, n = number of attributes).
I have a set of attributes A, and I want to find correlations between the attributes.
My problem is: how can I tag a matrix column/row with a name?
This is the algorithm I'm trying to implement:
Attributes a_i, a_j are extracted from H and denoted H_i^T, H_j^T (where T denotes transpose).
We then apply the Pearson correlation to them, denoted ρ_{i,j}.
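(For reference, ρ_{i,j} here is the usual sample Pearson coefficient,

ρ_{i,j} = Σ_k (H_{k,i} − H̄_i)(H_{k,j} − H̄_j) / sqrt( Σ_k (H_{k,i} − H̄_i)² · Σ_k (H_{k,j} − H̄_j)² )

which is also what pandas' df.corr() computes by default.)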
For example, if we have:

H (m×n = 4×3):

    IQ  Height  Weight
    30     180      80
    30     170      60
    40     183      85
    10     190      95
ct = 0.7
A = {IQ, Height, Weight}
Then the result we should get is:
CS = {(C,0)}
Where C = {Height, Weight}
I would also love to get any visualization tool recommendations.
Thanks for your help!

Pandas is your best friend when it comes to tabular data. I'm not an expert on linear algebra notation, but it seems like what you're trying to do is append a tuple to a set if the two items in the tuple are correlated above the threshold value, i.e. if Height and Weight have a correlation coefficient > 0.7, then add those two values to the list CS. I would do something like this:
import pandas as pd
import seaborn as sns

df = pd.DataFrame.from_dict({
    "IQ": [30, 30, 40, 10],
    "Height": [180, 170, 183, 190],
    "Weight": [80, 60, 85, 95]
})

lst = []
threshold = 0.7
p_arr = df.corr().to_dict()
for attr in p_arr:
    for sub_attr in p_arr[attr]:
        p = p_arr[attr][sub_attr]
        if attr != sub_attr and p > threshold:
            lst.append(((attr, sub_attr), p))
produces:
[(('Height', 'Weight'), 0.9956654266839726),
(('Weight', 'Height'), 0.9956654266839726)]
and for a correlation heatmap:
sns.heatmap(df.corr())
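Note that the loop above records each correlated pair twice (once in each order). If you want each unordered pair only once, closer to the CS set in the question, a small sketch is to iterate over unique column pairs instead (same df and threshold as above):

from itertools import combinations

corr = df.corr()
pairs = [((a, b), corr.loc[a, b])
         for a, b in combinations(corr.columns, 2)
         if corr.loc[a, b] > threshold]
# [(('Height', 'Weight'), 0.9956654266839726)]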

Related

Filtering the two frequencies with highest amplitudes of a signal in the frequency domain

I have tried to filter the two frequencies which have the highest amplitudes. I am wondering whether the result is correct, because the filtered signal seems less smooth than the original.
Is it correct that the output of the FFT function contains the fundamental frequency A0/C0, and is it correct to include it in the search for the highest amplitude (it is indeed the highest!)?
My code (based on my professor's and colleagues' code, which I have not yet understood in every detail):
import numpy as np
from numpy.fft import fft, ifft

# signal
data = np.loadtxt("profil.txt")
t = data[:, 0]
x = data[:, 1]
x = x - np.mean(x)  # reduce signal to zero mean
n = len(t)
max_ind = int(n/2 - 1)
dt = (t[n-1] - t[0])/(n-1)
T = n*dt
df = 1./T

# Fast Fourier Transform
c = 2.*np.absolute(fft(x))/n  # get the power spectrum c from the array of complex numbers
c[0] = c[0]/2.                # correction for c0 (fundamental frequency)
f = np.fft.fftfreq(n, d=dt)
a = fft(x).real
b = fft(x).imag
n_fft = len(a)

# filter: set the positions of p to 0 using the indices from the argsort
# over the first half of the c array, because the second half contains
# the negative frequencies
p = np.ones(len(c))
p[c[0:int(len(c)/2)].argsort()[int(len(c)/2 - 1)]] = 0
p[c[0:int(len(c)/2)].argsort()[int(len(c)/2 - 2)]] = 0
print(c[0:int(len(c)/2 - 1)].argsort()[int(n_fft/2 - 2)])
ab_filter_2 = fft(x)
ab_filter_2.real = a*p
ab_filter_2.imag = b*p
x_filter2 = ifft(ab_filter_2)*2
I do not quite get the whole deal about the FFT returning negative and positive frequencies. I know they are just mirrored, but then why can I not search over the whole array? And does the iFFT function work with an array of just the positive frequencies?
The resulting plot (blue: original, red: filtered):
This part is very wasteful:
a = fft(x).real
b = fft(x).imag
You're computing the FFT twice for no good reason. You compute it a third time later, and you had already computed it once before. You should compute it only once, not four times; the FFT is the most expensive part of your code.
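For instance (my rewording of the fix, keeping the question's variable names):

fft_x = fft(x)   # compute the FFT once and reuse it
a = fft_x.real
b = fft_x.imag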
Then:
ab_filter_2 = fft(x)
ab_filter_2.real = a*p
ab_filter_2.imag = b*p
x_filter2 = ifft(ab_filter_2)*2
Replace all of that with:
out = ifft(fft(x) * p)
Here you do the same thing twice:
p[c[0:int(len(c)/2)].argsort()[int(len(c)/2-1)]] = 0
p[c[0:int(len(c)/2)].argsort()[int(len(c)/2-2)]] = 0
But you set only the left half of the filter. It is important to make a symmetric filter. There are two locations where abs(f) has the same value (up to rounding errors!), there is a positive and a negative frequency that go together. Those two locations should have the same filter value (actually complex conjugate, but you have a real-valued filter so the difference doesn’t matter in this case).
I’m unsure what that indexing does anyway. I would split out the statement into shorter parts on separate lines for readability.
I would do it this way:
import numpy as np
x = ...
x -= np.mean(x)
fft_x = np.fft.fft(x)
c = np.abs(fft_x) # no point in normalizing, doesn't change the order when sorting
f = c[0:len(c)//2].argsort()
f = f[-2:] # the last two elements are the indices to the largest two frequency components
p = np.zeros(len(c))
p[f] = 1 # preserve the largest two components
p[-f] = 1 # set the same components counting from the end
out = np.fft.ifft(fft_x * p).real
# note that np.fft.ifft(fft_x * p).imag is approximately zero if the filter is created correctly
Is it correct that the output of the FFT-function contains the fundamental frequency A0/ C0 […]?
In principle yes, but you subtracted the mean from the signal, effectively setting the fundamental frequency (DC component) to 0.
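As a quick sanity check of the approach above (a synthetic example of my own, not the profil.txt data), a signal built from 4 Hz and 10 Hz sinusoids plus noise should come back as exactly those two components:

import numpy as np

t = np.linspace(0, 1, 1000, endpoint=False)
x = np.sin(2*np.pi*4*t) + 0.5*np.sin(2*np.pi*10*t) + 0.1*np.random.randn(t.size)
x -= np.mean(x)
fft_x = np.fft.fft(x)
c = np.abs(fft_x)
f = c[0:len(c)//2].argsort()[-2:]   # indices of the two largest components
p = np.zeros(len(c))
p[f] = 1                            # keep them...
p[-f] = 1                           # ...and their mirrored negative frequencies
out = np.fft.ifft(fft_x * p).real
freqs = np.fft.fftfreq(t.size, d=t[1] - t[0])
print(sorted(freqs[f]))             # [4.0, 10.0]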

Add values to an external file in the form of columns

I would like to know how I can write the values resulting from my loops to an external file in the form of columns.
Perhaps a very specific brief explanation of what I would like to obtain could help: I have a group of particles that fall on the surface of the earth around a point (0,0), each with its respective weight, and I would like to know the sum of the weights of the particles that fall within rings of internal radius Ri and external radius Rj (the external radius of each ring becomes the internal radius of the ring that follows it). I have this code:
import pandas as pd

#Insert the radius values
#For example Ri_0=-20
#For example Rj_max=4000
#For example Bin=40
data = pd.read_csv("photons.txt", header=0, delim_whitespace=True)
df = pd.DataFrame(data)
Ri_0 = int(input("Insert the internal minimum radius value: "))
Rj_max = int(input("Insert the external maximum radius value: "))
Bin = int(input("Insert the bin: "))
R_internals = range(Ri_0, Rj_max+1, Bin)
Ri = list(R_internals)
Rj = []
R = []
for m in Ri:
    R_externals = m + Bin
    Rj.append(R_externals)
for d, f in zip(Ri, Rj):
    R_average = (d + f)/2
    R.append(R_average)

#Loops
count = 0
#I think the problem is in this loop
for i, j in zip(Ri, Rj):
    for r in df["radio"]:
        if r >= i and r <= j:
            d = df[df['radio'] == r]['ParWeight'].iloc[0]
            count = count + d
My problem is computing the sum of the weights of the particles that fall within each ring of internal radius Ri and external radius Rj, and then adding it to an external file as a data column, with the value of R (the average of Ri and Rj) next to it. I get a systematic error because the code adds up the weights of all the particles and does not separate them by ring. I attach the file "photons.txt" at the following link: [https://drive.google.com/file/d/1YM0U3UN4p1OGvbiajZakMWtQSmyZoCkN/view?usp=sharing]
I spent several days trying to solve this problem.
Thanks so much.
Bryan, before the function definition I am using your code and your data. My idea is to find where each radius falls in your Ri array. Once I find where each radius falls, I add up all the weights in each group. Please let me know if you have any questions. Cheers.
import pandas as pd
import numpy as np

data_file = r"D:\workspace\projects\misc\data\photons.txt"
df = pd.read_csv(data_file, delimiter="\t")
Ri_0 = -20
Rj_max = 4000
Bin = 40
R_internals = range(Ri_0, Rj_max+1, Bin)
Ri = list(R_internals)

def getRadiusBoundary(x, listBoundaries):
    """ This function will place a specific radius to its boundaries.
    Indexes of external and internal radiuses are used
    """
    if x <= listBoundaries[0]:
        return 0
    elif x > listBoundaries[len(listBoundaries) - 1]:
        return len(listBoundaries)
    idx = [i+1 for i in range(len(listBoundaries)-1)
           if listBoundaries[i] < x and x <= listBoundaries[i+1]]
    return idx[0]

# this is a dictionary that maps radius groups to their boundaries
dictRadiusMapping = {i+1: [Ri[i], Ri[i+1]] for i in range(len(Ri)-1)}
dictRadiusMapping[0] = [-np.inf, Ri[0]]
dictRadiusMapping[len(Ri)] = [Ri[len(Ri) - 1], np.inf]

# here I am placing each radius into its appropriate group (finding its [external, internal] boundaries)
df["radius_group"] = df["radio"].apply(lambda x: getRadiusBoundary(x, Ri))
radius_sums = df.groupby("radius_group")["ParWeight"].sum().reset_index()  # summing for each radius group
radius_sums.columns = ["radius_group", "weight_sum"]
radius_sums["radiuses"] = radius_sums["radius_group"].map(dictRadiusMapping)
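As a side note, pandas can do the binning directly with pd.cut. A sketch, assuming the same radio/ParWeight columns as above (the output filename is arbitrary); this also writes the requested columns to an external file:

bins = [-np.inf] + Ri + [np.inf]
df["radius_group"] = pd.cut(df["radio"], bins=bins, labels=False)
ring_sums = df.groupby("radius_group")["ParWeight"].sum().reset_index()
ring_sums.columns = ["radius_group", "weight_sum"]
ring_sums.to_csv("ring_sums.txt", sep="\t", index=False)  # columns in a text file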

Remove outliers from Pandas Dataframe (Circular Data)

I would like to remove outliers from a Pandas dataframe using a user-defined function. There are answers to similar questions on Stack Overflow, but the difference is that my dataset is circular data. Therefore, using the Pandas built-in functions mean() and std() would not be appropriate. For example, in circular data the values 355 and 5 differ by only 10, but the linear difference is 350.
I have thousands of dataframes like the one below. We clearly see that Geophone 6 is an outlier.
Geophone azimuth incidence
0 1 194.765326 29.703151
1 2 193.143982 23.380681
2 3 199.327911 34.752212
3 4 195.641010 49.186893
4 5 193.479015 21.192982
5 6 0.745142 3.410046
6 7 192.380435 29.778807
7 8 196.700814 19.750237
It can also be confirmed when plotting the data in a polar diagram.
I have written two functions, mean_angle and variance_angle, which calculate the circular mean and variance to be applied to the data. The variance gives a value between 0 and 1: when the data are close to each other the variance gets closer to 0, and vice versa.
import numpy as np

def mean_angle(deg):
    deg = np.deg2rad(deg)
    S = np.array(deg)
    C = np.array(deg)
    S = S[np.isfinite(S)]  # remove np.nan
    C = C[np.isfinite(C)]
    S = np.sum(np.sin(S))
    C = np.sum(np.cos(C))
    mu = np.arctan(S/C)
    mu = np.rad2deg(mu)
    if S > 0 and C > 0:
        mu = mu
    elif S > 0 and C < 0:
        mu = mu + 180
    elif S < 0 and C < 0:
        mu = mu + 180
    elif S < 0 and C > 0:
        mu = mu + 360
    return mu

def variance_angle(deg):
    """
    deg: angles in degrees
    """
    deg = np.deg2rad(deg)
    S = np.array(deg)
    C = np.array(deg)
    S = S[np.isfinite(S)]  # remove np.nan
    C = C[np.isfinite(C)]
    length = C.size
    S = np.sum(np.sin(S))
    C = np.sum(np.cos(C))
    R = np.sqrt(S**2 + C**2)
    R_avg = R/length
    V = 1 - R_avg
    return V
mean_azimuth = mean_angle(df.azimuth)
variance = variance_angle(df.azimuth)
print(mean_azimuth)
197.4122778774279
print(variance)
0.24614383460498535
However, when excluding row 5 from the calculation, the mean and variance become 195.06226604362286 and 0.0007544067627361928 respectively. The variance changes from 0.25 to almost 0.
Therefore, I would like to find a way to remove any circular outlier value/s (azimuth) that make the circular variance high, using the defined functions shown above.
In this example incidence is also an outlier for the same Geophone, but it actually does not have any relation to azimuth. There are other data where incidence is within the range but azimuth is an outlier.
Any help is really appreciated.
One way to do outlier detection is to compute the mean and std of the data, then remove points that lie outside A*std of the mean (where you tune A to whatever is reasonable for your data).
So you could use your functions to compute mean and variance of your dataframe, then pass over the dataframe again to remove data points outside A*std of the mean.
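A minimal sketch of that idea, using mean_angle from the question (circular_std and the cutoff A = 2.0 are my own choices; sqrt(-2 ln R̄) is the usual circular standard deviation):

import numpy as np

def circular_std(deg):
    # circular standard deviation in degrees: sqrt(-2 * ln(R_avg))
    rad = np.deg2rad(np.asarray(deg, dtype=float))
    rad = rad[np.isfinite(rad)]
    S, C = np.sum(np.sin(rad)), np.sum(np.cos(rad))
    R_avg = np.sqrt(S**2 + C**2) / rad.size
    return np.rad2deg(np.sqrt(-2 * np.log(R_avg)))

def remove_azimuth_outliers(df, A=2.0):
    mu = mean_angle(df.azimuth)               # circular mean from the question
    # angular distance to the mean, wrapped into [-180, 180)
    diff = (df.azimuth - mu + 180) % 360 - 180
    return df[np.abs(diff) <= A * circular_std(df.azimuth)]

On the example dataframe this drops Geophone 6 (azimuth 0.745, about 163° away from the circular mean of 197.4°) and keeps the rest.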

How to reshape a networkx graph in Python?

So I created a really naive (probably inefficient) way of generating Hasse diagrams.
Question:
I have 4 dimensions: p, q, r, s.
I want to display it uniformly (as a tesseract) but I have no idea how to reshape it. How can one reshape a networkx graph in Python?
I've seen some examples of people using spring_layout() and draw_circular(), but they don't shape it in the way I'm looking for because they aren't uniform.
Is there a way to reshape my graph and make it uniform? (i.e. reshape my Hasse diagram into a tesseract shape, preferably using nx.draw())
Here's what mine currently looks like:
Here's my code to generate the Hasse diagram for N dimensions:
#!/usr/bin/python
import networkx as nx
import matplotlib.pyplot as plt
import itertools

H = nx.DiGraph()
axis_labels = ['p','q','r','s']
D_len_node = {}
#Iterate through axis labels
for i in range(0, len(axis_labels)+1):
    #Create edge from empty set
    if i == 0:
        for ax in axis_labels:
            H.add_edge('O', ax)
    else:
        #Create all non-overlapping combinations
        combinations = [c for c in itertools.combinations(axis_labels, i)]
        D_len_node[i] = combinations
        #Create edge from len(i-1) to len(i) #eg. pq >>> pqr, pq >>> pqs
        if i > 1:
            for node in D_len_node[i]:
                for p_node in D_len_node[i-1]:
                    #if set.intersection(set(p_node),set(node)): Oops
                    if all(p in node for p in p_node):  #should be this!
                        H.add_edge(''.join(p_node), ''.join(node))
#Show Plot
nx.draw(H, with_labels=True, node_shape='o')
plt.show()
I want to reshape it like this:
If anyone knows of an easier way to make Hasse Diagrams, please share some wisdom but that's not the main aim of this post.
This is a pragmatic, rather than purely mathematical answer.
I think you have two issues - one with layout, the other with your network.
1. Network
You have too many edges in your network for it to represent the unit tesseract. Caveat: I'm not an expert on the maths here - I just came to this from the plotting angle (matplotlib tag). Please explain if I'm wrong.
Your desired projection and, for instance, the Wolfram MathWorld page for a Hasse diagram with n=4 have only 4 edges connected to every node, whereas you have 6 edges at the 2-bit nodes and 7 at the 3-bit nodes. Your graph fully connects each "level", i.e. 4-D vectors with zero 1-values connect to all vectors with one 1-value, which then connect to all vectors with two 1-values, and so on. This is most obvious in the projection based on the Wikipedia example (2nd image below).
2. Projection
I couldn't find a pre-written algorithm or library to automatically project the 4D tesseract onto a 2D plane, but I did find a couple of examples, e.g. Wikipedia. From this, you can work out a co-ordinate set that would suit you and pass that into the nx.draw() call.
Here is an example - I've included two co-ordinate sets: one that looks like the projection you show above, and one that matches this one from Wikipedia.
import networkx as nx
import matplotlib.pyplot as plt
import itertools

H = nx.DiGraph()
axis_labels = ['p','q','r','s']
D_len_node = {}
#Iterate through axis labels
for i in range(0, len(axis_labels)+1):
    #Create edge from empty set
    if i == 0:
        for ax in axis_labels:
            H.add_edge('O', ax)
    else:
        #Create all non-overlapping combinations
        combinations = [c for c in itertools.combinations(axis_labels, i)]
        D_len_node[i] = combinations
        #Create edge from len(i-1) to len(i) #eg. pq >>> pqr, pq >>> pqs
        if i > 1:
            for node in D_len_node[i]:
                for p_node in D_len_node[i-1]:
                    if set.intersection(set(p_node), set(node)):
                        H.add_edge(''.join(p_node), ''.join(node))

#This is manual - two options to project the tesseract onto a 2D plane
# - many projections are available!!
wikipedia_projection_coords = [(0.5,0),(0.85,0.25),(0.625,0.25),(0.375,0.25),
                               (0.15,0.25),(1,0.5),(0.8,0.5),(0.6,0.5),
                               (0.4,0.5),(0.2,0.5),(0,0.5),(0.85,0.75),
                               (0.625,0.75),(0.375,0.75),(0.15,0.75),(0.5,1)]

#Build the "two cubes" type example projection co-ordinates
half_coords = [(0,0.15),(0,0.6),(0.3,0.15),(0.15,0),
               (0.55,0.6),(0.3,0.6),(0.15,0.4),(0.55,1)]
#make the coords symmetric
example_projection_coords = half_coords + [(1-x,1-y) for (x,y) in half_coords][::-1]
print(example_projection_coords)

def powerset(s):
    ch = itertools.chain.from_iterable(itertools.combinations(s, r) for r in range(len(s)+1))
    return [''.join(t) for t in ch]

pos = {}
for i, label in enumerate(powerset(axis_labels)):
    if label == '':
        label = 'O'
    pos[label] = example_projection_coords[i]

#Show Plot
nx.draw(H, pos, with_labels=True, node_shape='o')
plt.show()
Note - unless you change what I've mentioned in 1. above, these still have your edge structure, so they won't look exactly the same as the examples from the web. Here is what it looks like with your existing network generation code - you can see the extra edges if you compare it to your example (e.g. I don't think pr should be connected to pqs):
'Two cube' projection
Wikimedia example projection
Note
If you want to get into the maths of doing your own projections (and building up pos mathematically), you might look at this research paper.
EDIT:
Curiosity got the better of me and I had to search for a mathematical way to do this. I found this blog - the main result of which is the projection matrix:
This led me to develop this function for projecting each label, taking a label containing 'p' to mean the point has value 1 on the 'p' axis, i.e. we are dealing with the unit tesseract. Thus:
import math

def construct_projection(label):
    r1 = r2 = 0.5
    theta = math.pi / 6
    phi = math.pi / 3
    x = int('p' in label) + r1 * math.cos(theta) * int('r' in label) - r2 * math.cos(phi) * int('s' in label)
    y = int('q' in label) + r1 * math.sin(theta) * int('r' in label) + r2 * math.sin(phi) * int('s' in label)
    return (x, y)
This gives a nice projection into a regular 2D octagon with all points distinct.
This will run in the above program; just replace
pos[label] = example_projection_coords[i]
with
pos[label] = construct_projection(label)
This gives the result:
play with r1,r2,theta and phi to your heart's content :)
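As a footnote to point 1. above, one way to generate only the covering edges (each subset connected just to the supersets with exactly one extra element) is to test the subset relation between consecutive levels - a sketch of my own, rebuilding H with the names from the code above:

H = nx.DiGraph()
for i in range(len(axis_labels)):
    for small in itertools.combinations(axis_labels, i):
        for big in itertools.combinations(axis_labels, i + 1):
            if set(small) <= set(big):
                H.add_edge(''.join(small) or 'O', ''.join(big))

This yields 32 edges with exactly 4 edges at every node, matching the unit tesseract.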

Discretization of probability array in Python

I have a numpy array (actually imported from a GIS raster map) which contains
probability values of occurrence of a species, like the following example:

import numpy as np
a = np.random.randint(1, 20, 1200).reshape(40, 30)
b = (a * 1.0) / a.sum()   # normalize so the whole grid sums to 1
Now I want to get a discrete version of that array again: if I have
e.g. 100 individuals which are located on the area of that array (1200 cells), how are they
distributed? Of course they should be distributed according to their probability,
meaning lower values indicate a lower probability of occurrence. However, as everything is statistics, there is still the chance that an individual is located at a low-probability
cell. It should be possible for multiple individuals to occupy one cell...
It is like transforming a continuous distribution curve back into a histogram. Just as many different histograms may result in a certain distribution curve, it should also work the other way round. Accordingly, applying the algorithm I am looking for will produce different discrete values each time.
...is there any algorithm in Python which can do that? As I am not that familiar with discretization, maybe someone can help.
Use np.random.choice with np.bincount:
np.bincount(np.random.choice(b.size, 100, p=b.flat),
minlength=b.size).reshape(b.shape)
If you don't have NumPy 1.7, you can replace random.choice with:
np.searchsorted(np.cumsum(b), np.random.random(100))
giving:
np.bincount(np.searchsorted(np.cumsum(b), np.random.random(100)),
minlength=b.size).reshape(b.shape)
So far I think ecatmur's answer is quite reasonable and simple.
I just want to add maybe a more "applied" example. Consider a die
with 6 faces (6 numbers), where each number/result has a probability of 1/6.
Displaying the die in the form of an array could look like:

b = np.array([[1,1,1],[1,1,1]])/6.0

Thus rolling the die 100 times (n = 100) results in the following simulation:

n = 100
np.bincount(np.searchsorted(np.cumsum(b), np.random.random(n)),
            minlength=b.size).reshape(b.shape)
I think that can be an appropriate approach for such an application.
Thus thank you ecatmur for your help!
/Johannes
This is similar to a question I asked earlier this month.
import random

def RandFloats(Size):
    Scalar = 1.0
    VectorSize = Size
    RandomVector = [random.random() for i in range(VectorSize)]
    RandomVectorSum = sum(RandomVector)
    RandomVector = [Scalar*i/RandomVectorSum for i in RandomVector]
    return RandomVector
from numpy.random import multinomial
import math

def RandIntVec(ListSize, ListSumValue, Distribution='Normal'):
    """
    Inputs:
    ListSize = the size of the list to return
    ListSumValue = the sum of the list values
    Distribution = can be 'uniform' for a uniform distribution,
        'normal' for a normal distribution ~ N(0,1) with +/- 3 sigma (default),
        or a list of size 'ListSize' or 'ListSize - 1' for an empirical
        (arbitrary) distribution: the probabilities of each of the different
        outcomes. These should sum to 1 (however, the last element is always
        assumed to account for the remaining probability, as long as
        sum(pvals[:-1]) <= 1).
    Output:
    A list of random integers of length 'ListSize' whose sum is 'ListSumValue'.
    """
    if type(Distribution) == list:
        DistributionSize = len(Distribution)
        if ListSize == DistributionSize or (ListSize-1) == DistributionSize:
            Values = multinomial(ListSumValue, Distribution, size=1)
            OutputValue = Values[0]
        else:
            raise ValueError('Cannot create desired vector')
    elif Distribution.lower() == 'uniform':  #I do not recommend this!!!! I see that it is not as random (at least on my computer) as I had hoped
        UniformDistro = [1/ListSize for i in range(ListSize)]
        Values = multinomial(ListSumValue, UniformDistro, size=1)
        OutputValue = Values[0]
    elif Distribution.lower() == 'normal':
        # Normal distribution construction.... It's very flexible and hideous.
        # Assume a +-3 sigma range. Warning, this may or may not be a suitable
        # range for your implementation! If one wishes to explore a different
        # range, then change the LowSigma and HighSigma values.
        LowSigma = -3   #-3 sigma
        HighSigma = 3   #+3 sigma
        StepSize = 1/(float(ListSize) - 1)
        ZValues = [(LowSigma * (1 - i*StepSize) + (i*StepSize)*HighSigma) for i in range(int(ListSize))]
        #Construction parameters for N(Mean,Variance) - Default is N(0,1)
        Mean = 0
        Var = 1
        NormalDistro = list()
        for i in range(len(ZValues)):
            if i == 0:
                ERFCVAL = 0.5 * math.erfc(-ZValues[i]/math.sqrt(2))
                NormalDistro.append(ERFCVAL)
            elif i == len(ZValues) - 1:
                ERFCVAL = NormalDistro[0]
                NormalDistro.append(ERFCVAL)
            else:
                ERFCVAL1 = 0.5 * math.erfc(-ZValues[i]/math.sqrt(2))
                ERFCVAL2 = 0.5 * math.erfc(-ZValues[i-1]/math.sqrt(2))
                ERFCVAL = ERFCVAL1 - ERFCVAL2
                NormalDistro.append(ERFCVAL)
        #print("Normal Distribution sum = %f" % sum(NormalDistro))
        Values = multinomial(ListSumValue, NormalDistro, size=1)
        OutputValue = Values[0]
    else:
        raise ValueError('Cannot create desired vector')
    return OutputValue

ProbabilityDistribution = RandFloats(1200)  #This is your probability distribution for your 1200-cell array
SizeDistribution = RandIntVec(1200, 100, Distribution=ProbabilityDistribution)  #for a 1200-cell array whose sum is 100, with the given probability distribution
The two important lines are the last two lines in the code above.
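A quick check (my addition): because multinomial counts always sum to the number of draws, the result has the requested length and total:

print(len(SizeDistribution), sum(SizeDistribution))  # 1200 100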
